
Usage

This guide explains how to run benchmarks and use the automated workflow optimizer with MASArena.

Prerequisites

  1. Install dependencies: If you haven't already, install the required packages. We recommend using uv.

    uv sync
    

  2. Configure Environment Variables: Create a .env file in the project root and set your OpenAI API key, model name, and API base URL.

    OPENAI_API_KEY=your_openai_api_key
    MODEL_NAME=gpt-4o-mini
    OPENAI_API_BASE=https://api.openai.com/v1
    
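If you prefer not to use a .env file, you can export the same variables in your shell session instead (this sketch assumes the scripts also read them from the process environment):

    export OPENAI_API_KEY=your_openai_api_key
    export MODEL_NAME=gpt-4o-mini
    export OPENAI_API_BASE=https://api.openai.com/v1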

Running Benchmarks

You can run benchmarks using the convenience shell script run_benchmark.sh (recommended) or by directly calling main.py.

Using the Shell Script (run_benchmark.sh)

The run_benchmark.sh script is the simplest way to run evaluations.

Syntax:

# Usage: ./run_benchmark.sh [benchmark] [agent_system] [limit] [mcp_config] [concurrency] [optimizer] [train_size] [test_size]
./run_benchmark.sh math supervisor_mas 10

Examples:

# Run the 'math' benchmark on 10 problems with the 'supervisor_mas' agent system
./run_benchmark.sh math supervisor_mas 10

# Run the 'humaneval' benchmark on 20 problems asynchronously with a concurrency of 10
# The "" is a placeholder for the mcp_config argument.
./run_benchmark.sh humaneval single_agent 20 "" 10
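
If your agent system should use tools exposed over MCP, pass the path to an MCP server configuration file as the fourth positional argument. A minimal sketch, assuming the script forwards this argument to --mcp-config-file; the file name mcp_servers.json is a hypothetical example:

# Run the 'math' benchmark on 10 problems with MCP tools configured from a server config file
./run_benchmark.sh math single_agent 10 mcp_servers.json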

Automated Workflow Optimization (AFlow)

MASArena includes an implementation of AFlow, an automated optimizer for agent workflows.

Example: To optimize an agent for the humaneval benchmark with AFlow, pass aflow as the optimizer argument to the shell script:

# The "" arguments are placeholders for mcp_config and concurrency.
./run_benchmark.sh humaneval single_agent 10 "" "" aflow

You can also specify the training and test set sizes for the optimizer. Note that when using the aflow optimizer, the number of problems is determined by train_size and test_size, and the limit argument is ignored for data selection.

Example with custom training and test sizes:

# Run AFlow with a training set of 30 and a test set of 15.
# The "" arguments are placeholders for mcp_config and concurrency.
# The limit argument (10) is ignored in this case.
./run_benchmark.sh humaneval single_agent 10 "" "" aflow 30 15

Command-Line Arguments

Here are the most common arguments for main.py.

Main Arguments

--benchmark: The name of the benchmark to run. Default: math
--agent-system: The agent system to use for the benchmark. Default: single_agent
--limit: The maximum number of problems to evaluate. Default: None (all)
--data: Path to a custom benchmark data file (JSONL format). Default: data/{benchmark}_test.jsonl
--results-dir: Directory to store detailed JSON results. Default: results/
--verbose: Print progress information. Default: True
--async-run: Run the benchmark asynchronously for faster evaluation. Default: False
--concurrency: Concurrency level for asynchronous runs. Default: 10
--use-tools: Enable the agent to use integrated tools (e.g., code interpreter). Default: False
--use-mcp-tools: Enable the agent to use tools via MCP. Default: False
--mcp-config-file: Path to the MCP server configuration file; required for MCP tools. Default: None
--data-id: Data ID to use. Default: None
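
For finer control, you can call main.py directly with these flags. A minimal sketch of an asynchronous run, assuming main.py sits in the repository root and --async-run is a simple boolean switch:

# Evaluate 20 'humaneval' problems with the single_agent system, asynchronously with concurrency 10
python main.py --benchmark humaneval --agent-system single_agent --limit 20 --async-run --concurrency 10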

Optimizer Arguments

These arguments are used when running an optimizer like AFlow via --run-optimizer.

--run-optimizer (str): The optimizer to run; use aflow. Default: None
--graph_path (str): Path to the base AFlow graph configuration. Default: mas_arena/configs/aflow
--optimized_path (str): Path to save the optimized AFlow graph. Default: example/aflow/humaneval/optimization
--validation_rounds (int): Number of validation rounds per optimization cycle. Default: 1
--eval_rounds (int): Number of evaluation rounds per optimization cycle. Default: 1
--max_rounds (int): Maximum number of optimization rounds. Default: 3
--train_size (int): Size of the training set for evaluation. Default: 40
--test_size (int): Size of the test set for evaluation. Default: 20
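
A sketch of launching AFlow directly through main.py, mirroring the shell-script example above; the argument values are illustrative:

# Optimize the single_agent workflow for 'humaneval' with a 30/15 train/test split
python main.py --benchmark humaneval --agent-system single_agent --run-optimizer aflow --train_size 30 --test_size 15 --max_rounds 3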

Example Output

After a run, a summary is printed to the console:

================================================================================
Benchmark Summary
================================================================================
Agent system: swarm
Accuracy: 70.00% (7/10)
Total duration: 335125ms
Results saved to: results/math_swarm_20250616_203434.json
Summary saved to: results/math_swarm_20250616_203434_summary.json

Run visualization:
$ python mas_arena/visualization/visualize_benchmark.py visualize \
  --summary results/math_swarm_20250616_203434_summary.json