# Usage
This guide explains how to run benchmarks and use the automated workflow optimizer with MASArena.
## Prerequisites
- **Install dependencies**: If you haven't already, install the required packages. We recommend using `uv`.
- **Configure Environment Variables**: Create a `.env` file in the project root and set your OpenAI API key and desired model (a minimal example is shown below).
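A minimal `.env` might look like the sketch below. `OPENAI_API_KEY` is the standard variable name read by OpenAI clients; the model variable name is an assumption here, so check the project's example environment file for the exact keys MASArena expects.

```bash
# .env (place in the project root)
# OPENAI_API_KEY is the standard OpenAI client variable.
# MODEL_NAME is illustrative; MASArena may expect a different key.
OPENAI_API_KEY=sk-...
MODEL_NAME=gpt-4o-mini
```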
## Running Benchmarks
You can run benchmarks using the convenience shell script `run_benchmark.sh` (recommended) or by calling `main.py` directly.
### Using the Shell Script (`run_benchmark.sh`)
The `run_benchmark.sh` script is the simplest way to run evaluations.
Syntax:
# Usage: ./run_benchmark.sh [benchmark] [agent_system] [limit] [mcp_config] [concurrency] [optimizer]
./run_benchmark.sh math supervisor_mas 10
Examples:
# Run the 'math' benchmark on 10 problems with the 'supervisor_mas' agent system
./run_benchmark.sh math supervisor_mas 10
# Run the 'humaneval' benchmark asynchronously with a concurrency of 10
# The "" is a placeholder for the mcp_config argument.
./run_benchmark.sh humaneval single_agent 20 "" 10
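If your agent system uses MCP tools, the fourth positional argument takes the path to an MCP server configuration file. The file name below is only an illustration; point it at whatever MCP config your setup uses.

```bash
# Run 'math' on 5 problems, passing an MCP server config as the 4th argument.
# 'mcp_servers.json' is a placeholder path, not a file shipped with MASArena.
./run_benchmark.sh math single_agent 5 mcp_servers.json
```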
## Automated Workflow Optimization (AFlow)
MASArena includes an implementation of AFlow, an automated optimizer for agent workflows.
Example:
To run AFlow to optimize an agent for the `humaneval` benchmark, pass `aflow` as the optimizer argument to the shell script:
# The "" arguments are placeholders for mcp_config and concurrency.
./run_benchmark.sh humaneval single_agent 10 "" "" aflow
You can also specify the training and test set sizes for the optimizer. Note that when using the `aflow` optimizer, the number of problems is determined by `train_size` and `test_size`, and the `limit` argument is ignored for data selection.
Example with custom training and test sizes:
# Run AFlow with a training set of 30 and a test set of 15.
# The "" arguments are placeholders for mcp_config and concurrency.
# The limit argument (10) is ignored in this case.
./run_benchmark.sh humaneval single_agent 10 "" "" aflow 30 15
## Command-Line Arguments
Here are the most common arguments for `main.py`.
### Main Arguments
| Argument | Description | Default |
|---|---|---|
| `--benchmark` | The name of the benchmark to run. | `math` |
| `--agent-system` | The agent system to use for the benchmark. | `single_agent` |
| `--limit` | The maximum number of problems to evaluate. | `None` (all) |
| `--data` | Path to a custom benchmark data file (JSONL format). | `data/{benchmark}_test.jsonl` |
| `--results-dir` | Directory to store detailed JSON results. | `results/` |
| `--verbose` | Print progress information. | `True` |
| `--async-run` | Run the benchmark asynchronously for faster evaluation. | `False` |
| `--concurrency` | Concurrency level for asynchronous runs. | `10` |
| `--use-tools` | Enable the agent to use integrated tools (e.g., code interpreter). | `False` |
| `--use-mcp-tools` | Enable the agent to use tools via MCP. | `False` |
| `--mcp-config-file` | Path to the MCP server configuration file. Required for MCP tools. | `None` |
| `--data-id` | Data ID to use. | `None` |
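Invoking `main.py` directly exposes all of these options. The command below is a sketch, assuming the boolean options are plain on/off flags:

```bash
# Run 'math' on 10 problems with the supervisor_mas system, asynchronously.
# --async-run is assumed to be a simple boolean flag.
python main.py \
  --benchmark math \
  --agent-system supervisor_mas \
  --limit 10 \
  --async-run \
  --concurrency 5 \
  --results-dir results/
```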
### Optimizer Arguments
These arguments are used when running an optimizer like AFlow via `--run-optimizer`.
| Argument | Type | Default | Description |
|---|---|---|---|
| `--run-optimizer` | str | `None` | Specifies the optimizer to run. Use `aflow`. |
| `--graph_path` | str | `mas_arena/configs/aflow` | Path to the base AFlow graph configuration. |
| `--optimized_path` | str | `example/aflow/humaneval/optimization` | Path to save the optimized AFlow graph. |
| `--validation_rounds` | int | `1` | Number of validation rounds per optimization cycle. |
| `--eval_rounds` | int | `1` | Number of evaluation rounds per optimization cycle. |
| `--max_rounds` | int | `3` | Maximum number of optimization rounds. |
| `--train_size` | int | `40` | Size of the training set for evaluation. |
| `--test_size` | int | `20` | Size of the test set for evaluation. |
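For example, the shell-script AFlow run shown earlier corresponds roughly to the direct invocation below (a sketch; exact flag handling may differ):

```bash
# Optimize a single_agent workflow for humaneval with AFlow,
# using a 30-problem training set and a 15-problem test set.
python main.py \
  --benchmark humaneval \
  --agent-system single_agent \
  --run-optimizer aflow \
  --train_size 30 \
  --test_size 15 \
  --max_rounds 3
```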
## Example Output
After a run, a summary is printed to the console:
================================================================================
Benchmark Summary
================================================================================
Agent system: swarm
Accuracy: 70.00% (7/10)
Total duration: 335125ms
Results saved to: results/math_swarm_20250616_203434.json
Summary saved to: results/math_swarm_20250616_203434_summary.json
Run visualization:
$ python mas_arena/visualization/visualize_benchmark.py visualize \
--summary results/math_swarm_20250616_203434_summary.json