# System Architecture
This document provides a detailed overview of the MAS Arena system's architecture. It explains the core components, their interactions, and the overall data flow when running a benchmark.
## High-Level Architecture
The system is designed to be modular and extensible, allowing for easy addition of new agent systems and benchmarks. The core components are the `BenchmarkRunner`, `AgentSystem`, and `Evaluator`. The `BenchmarkRunner` orchestrates the process, while the `AgentSystem` encapsulates the logic for both solving a problem and evaluating its own solution.
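Conceptually, the registry lookup that `create_agent_system` performs (nodes D, E, and F in the diagram below) can be pictured as follows. This is a self-contained sketch: the names mirror the diagram, but the exact signatures in `mas_arena` may differ.

```python
# Illustrative, self-contained sketch of the registry lookup behind
# create_agent_system. Names follow the diagram; signatures are assumptions.

class AgentSystemRegistry:
    """Maps agent-system names (e.g. "swarm") to their implementing classes."""

    _registry: dict = {}

    @classmethod
    def register(cls, name, agent_cls):
        cls._registry[name] = agent_cls

    @classmethod
    def get(cls, name):
        try:
            return cls._registry[name]
        except KeyError:
            raise ValueError(f"Unknown agent system: {name!r}") from None


def create_agent_system(name, config=None):
    """Look up the requested agent system in the registry and instantiate it."""
    agent_cls = AgentSystemRegistry.get(name)
    return agent_cls(name=name, config=config or {})
```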
```mermaid
graph TD
subgraph User Interaction
A[run_benchmark.sh]
end
subgraph Core Orchestration
B[main.py]
C[BenchmarkRunner]
end
subgraph Agent System Abstraction
D[agents.create_agent_system]
E[agents.AgentSystemRegistry]
F[agents.base.AgentSystem]
G[agents.run_agent]
end
subgraph Concrete Agent Systems
direction LR
H[MetaGPT]
I[AgentVerse]
J[Swarm]
K[...]
end
subgraph Tool Abstraction
U[tools.base.BaseTool]
V[tools.ToolManager]
W[tools.ToolSelector]
end
subgraph Concrete Tools
direction LR
X[ShellTool]
Y[BrowserTool]
Z[PythonExecuteTool]
AA[...]
end
subgraph Evaluator Abstraction
L[evaluators.base_evaluator.BaseEvaluator]
M[evaluators.evaluate]
end
subgraph Concrete Evaluators
direction LR
N[HumanEvalEvaluator]
O[MBPPEvaluator]
P[MathEvaluator]
Q[...]
end
subgraph Data
R[Benchmark Datasets]
end
subgraph Results
S[Results]
T[Metrics]
end
A -- "Executes with args (agent, benchmark)" --> B
B -- "Instantiates & calls" --> C
C -- "Calls with agent name" --> D
D -- "Looks up in" --> E
E -- "Instantiates" --> F
F -- "Is subclassed by" --> H
F -- "Is subclassed by" --> I
F -- "Is subclassed by" --> J
F -- "Is subclassed by" --> K
F -- "Uses" --> V
V -- "Uses" --> W
W -- "Selects from" --> U
U -- "Is subclassed by" --> X
U -- "Is subclassed by" --> Y
U -- "Is subclassed by" --> Z
U -- "Is subclassed by" --> AA
F -- "Initializes" --> L
L -- "Is subclassed by" --> N
L -- "Is subclassed by" --> O
L -- "Is subclassed by" --> P
L -- "Is subclassed by" --> Q
C -- "Loads" --> R
C -- "For each problem in dataset, calls" --> F
F -- "evaluate(problem) calls" --> G
G -- "Gets result, then calls" --> M
C -- "Saves" --> S
C -- "Saves" --> T
style F fill:#f9f,stroke:#333,stroke-width:2px
style L fill:#ccf,stroke:#333,stroke-width:2px
style U fill:#ffc,stroke:#333,stroke-width:2px
```
## Execution Workflow
The following sequence diagram illustrates the step-by-step workflow when a benchmark is executed. A key design choice is that the `AgentSystem` is responsible for its own evaluation: it creates an appropriate `Evaluator` during its initialization and uses it to score the solutions it generates.
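In code, that design choice amounts to an `evaluate()` method that first runs the agent and then scores the result with the evaluator built at construction time. The sketch below is illustrative; the problem field names (`answer`, `id`) and the evaluator call are assumptions based on the diagrams, not the exact `mas_arena` interface.

```python
# Hedged sketch of the self-evaluating AgentSystem described above.
class AgentSystem:
    def __init__(self, name, config, evaluator):
        self.name = name
        self.config = config
        self.evaluator = evaluator  # created during initialization (see diagram)

    def run_agent(self, problem):
        """Core agent logic; concrete systems (MetaGPT, Swarm, ...) override this."""
        raise NotImplementedError

    def evaluate(self, problem):
        solution = self.run_agent(problem)              # 1. solve the problem
        metrics = self.evaluator.evaluate(              # 2. score the solution
            prediction=solution, expected=problem.get("answer")
        )
        return {"problem_id": problem.get("id"), "solution": solution, **metrics}
```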
```mermaid
sequenceDiagram
participant main.py
participant BenchmarkRunner
participant AgentSystem
participant ToolManager
participant Evaluator
main.py->>BenchmarkRunner: Instantiate & run(...)
BenchmarkRunner->>AgentSystem: create_agent_system(...)
activate AgentSystem
AgentSystem->>ToolManager: Instantiate()
activate ToolManager
Note over AgentSystem, ToolManager: AgentSystem creates a ToolManager
deactivate ToolManager
AgentSystem->>Evaluator: Instantiate(...)
activate Evaluator
Note over AgentSystem, Evaluator: AgentSystem creates an Evaluator
deactivate Evaluator
AgentSystem-->>BenchmarkRunner: Return agent instance
loop For each problem in dataset
BenchmarkRunner->>AgentSystem: evaluate(problem)
AgentSystem->>AgentSystem: run_agent(problem)
activate AgentSystem
Note right of AgentSystem: Core agent logic starts
AgentSystem->>ToolManager: execute_tool(tool_name, args)
activate ToolManager
ToolManager-->>AgentSystem: Return tool_output
deactivate ToolManager
Note right of AgentSystem: Agent uses tool output
deactivate AgentSystem
AgentSystem->>Evaluator: evaluate(solution, ground_truth)
activate Evaluator
Evaluator-->>AgentSystem: Return score & metrics
deactivate Evaluator
AgentSystem-->>BenchmarkRunner: Return evaluation results
end
deactivate AgentSystem
BenchmarkRunner-->>main.py: Return summary
```
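From the runner's side, the loop in the diagram reduces to iterating over the dataset, delegating each problem to the agent, and persisting per-problem results plus aggregate metrics. The sketch below assumes a JSONL dataset and a `score` field in each result; both are placeholders, not the actual `BenchmarkRunner` implementation.

```python
import json
from pathlib import Path

def run_benchmark(agent, dataset_path, results_dir):
    """Hedged sketch of the runner loop: evaluate every problem, save outputs."""
    problems = [
        json.loads(line)
        for line in Path(dataset_path).read_text().splitlines()
        if line.strip()
    ]
    results = [agent.evaluate(problem) for problem in problems]   # one call per problem

    metrics = {
        "num_problems": len(results),
        "accuracy": sum(r.get("score", 0) for r in results) / max(len(results), 1),
    }

    out = Path(results_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "results.json").write_text(json.dumps(results, indent=2))   # per-problem results
    (out / "metrics.json").write_text(json.dumps(metrics, indent=2))   # summary metrics
    return metrics
```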
## Core Components Decomposition
The framework's modularity comes from its use of abstract base classes and registries for dynamic discovery.
### Agent Systems
All agent systems inherit from the `AgentSystem` abstract base class. This ensures they conform to a common interface, which includes the `run_agent()` and `evaluate()` methods. The `AgentSystemRegistry` is used to discover and list available agents.
```mermaid
classDiagram
direction LR
class AgentSystem {
<<Abstract>>
+name: str
+config: dict
+evaluator: BaseEvaluator
+run_agent(problem) dict
+evaluate(problem) dict
}
class MetaGPT {
}
class AgentVerse {
}
class Swarm {
}
AgentSystem <|-- MetaGPT
AgentSystem <|-- AgentVerse
AgentSystem <|-- Swarm
```
### Evaluators
Similarly, all evaluators inherit from a `BaseEvaluator` class (it is not strictly enforced as an ABC in the current implementation, but it serves this role conceptually). The `AVAILABLE_EVALUATORS` dictionary in `mas_arena/evaluators/__init__.py` acts as a registry.
```mermaid
classDiagram
direction LR
class BaseEvaluator {
<<Interface>>
+name: str
+config: dict
+evaluate(prediction, expected) dict
}
class HumanEvalEvaluator {
}
class MBPPEvaluator {
}
class MathEvaluator {
}
BaseEvaluator <|-- HumanEvalEvaluator
BaseEvaluator <|-- MBPPEvaluator
BaseEvaluator <|-- MathEvaluator
```
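Because `AVAILABLE_EVALUATORS` is an ordinary dictionary, selecting an evaluator is a plain lookup. The sketch below assumes the dictionary maps benchmark names to evaluator classes and that evaluators accept `name` and `config` keyword arguments; both are illustrative conventions, not the confirmed API.

```python
from mas_arena.evaluators import AVAILABLE_EVALUATORS  # registry dict described above

def make_evaluator(benchmark_name, config=None):
    """Hedged sketch: pick the evaluator class for a benchmark and instantiate it."""
    evaluator_cls = AVAILABLE_EVALUATORS[benchmark_name]   # e.g. "math" -> MathEvaluator
    return evaluator_cls(name=benchmark_name, config=config or {})
```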
## Extensibility
Adding a new agent or evaluator to the system is straightforward.
### Adding a New Agent
- Create a new Python file in `mas_arena/agents/`.
- Implement a new class that inherits from `agents.base.AgentSystem`.
- Implement the abstract `run_agent()` method with the agent's unique logic.
- Register the new agent in `mas_arena/agents/__init__.py` by adding it to the `AVAILABLE_AGENT_SYSTEMS` dictionary and the `__all__` list, as sketched below.
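Putting the steps together, a minimal hypothetical agent might look like the following. `EchoAgent`, its file name, the problem field names, and the import path are illustrative; only the base class and the registration points come from the steps above.

```python
# mas_arena/agents/echo_agent.py -- hypothetical example agent
from mas_arena.agents.base import AgentSystem

class EchoAgent(AgentSystem):
    """Toy agent that returns the problem statement as its 'solution'."""

    def run_agent(self, problem):
        # A real system would coordinate LLM calls and tools here.
        return {"solution": problem.get("problem", "")}


# mas_arena/agents/__init__.py -- registration (illustrative)
# from .echo_agent import EchoAgent
# AVAILABLE_AGENT_SYSTEMS["echo"] = EchoAgent
# __all__.append("EchoAgent")
```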
### Adding a New Evaluator
- Create a new Python file in `mas_arena/evaluators/`.
- Implement a new class that provides an `evaluate()` method.
- Register the new evaluator in `mas_arena/evaluators/__init__.py` by adding it to the `AVAILABLE_EVALUATORS` dictionary, as shown in the sketch below.
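For example, a hypothetical exact-match evaluator could be added as follows. The `evaluate(prediction, expected)` signature follows the class diagram above; the class name, file name, and scoring fields are illustrative.

```python
# mas_arena/evaluators/exact_match_evaluator.py -- hypothetical example evaluator

class ExactMatchEvaluator:
    """Scores a prediction 1.0 if it matches the expected answer exactly."""

    def __init__(self, name="exact_match", config=None):
        self.name = name
        self.config = config or {}

    def evaluate(self, prediction, expected):
        score = 1.0 if str(prediction).strip() == str(expected).strip() else 0.0
        return {"score": score, "correct": score == 1.0}


# mas_arena/evaluators/__init__.py -- registration (illustrative)
# from .exact_match_evaluator import ExactMatchEvaluator
# AVAILABLE_EVALUATORS["exact_match"] = ExactMatchEvaluator
```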