Extending MASArena Framework¶
A comprehensive guide to extending MASArena with custom Multi-Agent Systems and Evaluators.
Table of Contents¶
- Multi-Agent System Extension
  - Implementation Requirements
  - Implementation Steps
  - Advanced Features
  - Complete Example
- Evaluator Extension
  - Basic Implementation
  - Advanced Features
  - Code Evaluation
  - Complete Examples
- Best Practices
- Common Issues
Multi-Agent System Extension¶
Implementation Requirements¶
Essential Requirements:
- Extend the AgentSystem base class
- Implement the run_agent() method (abstract method, required)
- Include evaluator in config during initialization
- Return the proper message format with usage metadata
- Register with AgentSystemRegistry
Optional but Recommended:
- Implement _create_agents() for tool integration support
- Use self.format_prompt for benchmark-specific formatting
- Handle async execution properly if needed
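The sketch below shows a minimal system that satisfies the essential requirements above. It is illustrative only: the import path `mas_arena.agents` and the `echo_mas` name are assumptions; the real reference implementation is `mas_arena/agents/chateval.py`.

```python
# Minimal sketch of a custom MAS (import path is an assumption; adjust to your checkout).
from typing import Any, Dict

from mas_arena.agents import AgentSystem, AgentSystemRegistry


class EchoMAS(AgentSystem):
    """Toy single-agent system that returns the problem text as its answer."""

    def __init__(self, name: str = "echo_mas", config: Dict[str, Any] = None):
        super().__init__(name, config)
        self.config = config or {}  # the framework passes the `evaluator` key in here

    async def run_agent(self, problem: Dict[str, Any], **kwargs) -> Dict[str, Any]:
        # Call your LLM(s) here and keep every raw response object so the
        # framework can read usage_metadata for token accounting.
        all_messages = []
        final_answer = problem["problem"]  # placeholder answer
        return {"messages": all_messages, "final_answer": final_answer}


# Register the class (not an instance) with overridable defaults.
AgentSystemRegistry.register("echo_mas", EchoMAS)
```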
Implementation Steps¶
Step 1: Create Agent System Class Structure¶
LangGraph supported; customizable agent and multi-agent interaction.
Implementation Guide:
- Inherit from the AgentSystem base class
- Initialize configuration parameters (num_agents, num_rounds, model_name)
- Set up agent components using the _create_agents() method
- Extract workers and result extractors from the created components
- Validate that required components are available
SupervisorMAS Implementation Example (LangGraph Structure):

```python
# mas_arena/agents/supervisor_mas.py
def _init_graph_if_needed(self, problem_input: Optional[Any] = None, feedback: Optional[Any] = None):
    if self.graph is not None:
        return

    # _create_agents now returns a dict {"researcher": researcher_node, "coder": coder_node}
    # If wrapped by ToolIntegrationWrapper, the nodes will have been modified in-place.
    worker_nodes_map = self._create_agents(problem_input=problem_input, feedback=feedback)

    research_node_obj = worker_nodes_map.get("researcher")
    coder_node_obj = worker_nodes_map.get("coder")
    if not research_node_obj or not coder_node_obj:
        raise RuntimeError("Could not find researcher or coder agent nodes from _create_agents dictionary.")

    builder = StateGraph(State)
    checkpointer = InMemorySaver()

    supervisor_model = self.config.get("supervisor_model_name", self.config.get("model_name", os.getenv("MODEL_NAME", "gpt-4o-mini")))
    builder.add_node("supervisor", create_supervisor(model_name=supervisor_model))
    builder.add_node("researcher", research_node_obj)
    builder.add_node("coder", coder_node_obj)

    builder.add_edge(START, "supervisor")
    builder.add_conditional_edges(
        "supervisor",
        lambda x: x["next"],
        {"researcher": "researcher", "coder": "coder", END: END},
    )
    builder.add_edge("researcher", "supervisor")
    builder.add_edge("coder", "supervisor")

    self.graph = builder.compile(checkpointer=checkpointer)
```
ChatEval Implementation Example (Basic Structure):

```python
# mas_arena/agents/chateval.py
class ChatEval(AgentSystem):
    """Multi-agent evaluation system based on iterative debate"""

    def __init__(self, name: str = "chateval", config: Dict[str, Any] = None):
        super().__init__(name, config)
        self.config = config or {}
        self.num_agents = self.config.get("num_agents", 3)
        self.num_rounds = self.config.get("num_rounds", 2)
        self.model_name = self.config.get("model_name") or os.getenv("MODEL_NAME", "gpt-4o-mini")

        # Initialize agents and extractor via _create_agents
        # self.agents and self.extractor will be set by _create_agents
        agent_components = self._create_agents()
        self.agents = [w for w in agent_components["workers"] if isinstance(w, Agent)]
        extractors = [w for w in agent_components["workers"] if isinstance(w, ResultExtractor)]
        if not extractors:
            raise ValueError("ResultExtractor not found in components created by _create_agents.")
        self.extractor = extractors[0]
```
Step 2: Implement the Core run_agent Method¶
Implementation Guide:
- Extract the problem text from the input dictionary
- Initialize message storage for tracking LLM responses
- Implement the multi-round agent interaction logic
- Collect and process agent responses with proper metadata
- Extract the final answer using the result extractor
- Return a formatted result with messages and the final answer
ChatEval Implementation Example (run_agent Core Method):

```python
# mas_arena/agents/chateval.py
async def run_agent(self, problem: Dict[str, Any], **kwargs) -> Dict[str, Any]:
    """Run iterative debate process"""
    problem_text = problem["problem"]

    # store all LLM response objects
    all_messages = []
    agent_histories = []

    # iterative discussion process
    agent_names = ["Math Expert", "Logic Expert", "Critical Thinking Expert"]
    for t in range(self.num_rounds):
        for n, agent in enumerate(self.agents):
            # generate response for current agent
            context = self._build_context(problem_text, n, t)
            response_data = agent.generate_response(context)

            # save response object
            if "message" in response_data:
                all_messages.append(response_data["message"])

            # add response to context of subsequent agents
            solution_text = response_data.get("solution", "")
            for m in range(n + 1, len(self.agents)):
                self.agents[m].chat_history.append({
                    "role": "human",
                    "human": f"{agent_names[n]}'s response: {solution_text}"
                })

    # extract all agent chat histories
    agent_histories = [agent.chat_history for agent in self.agents]

    # extract final answer
    extractor_result = self.extractor.extract(agent_histories, problem_text)

    # add evaluator message
    if "message" in extractor_result and extractor_result["message"]:
        all_messages.append(extractor_result["message"])

    return {
        "messages": all_messages,  # contains all LLM response objects
        "final_answer": extractor_result["message"].content
    }
```
Step 3: Implement the _create_agents Method (Tool Integration Support)¶
Implementation Guide:
- Create specialized AgentNode instances for each role
- Set agent names, models, and system prompts
- Create a result extractor with format prompt integration
- Return a dictionary with a "workers" key containing all components
- Ensure each worker has .name and .llm attributes for tool binding
ChatEval Implementation Example (_create_agents Tool Integration):

```python
# mas_arena/agents/chateval.py
def _create_agents(self) -> Dict[str, List[Any]]:
    """Create multiple agent instances and result extractor"""
    # This method will be patched by ToolIntegrationWrapper if this system is wrapped.
    # The wrapper expects a dictionary: {"workers": [worker1, worker2, ...]}
    # Each worker should have a .name and .llm attribute.
    debate_agents = []
    agent_names = ["Math Expert", "Logic Expert", "Critical Thinking Expert"]
    for i in range(self.num_agents):
        agent = Agent(
            agent_id=f"agent_{i+1}",
            name=agent_names[i],
            model_name=self.model_name,
            system_prompt=self._get_agent_prompt(i)
        )
        debate_agents.append(agent)

    # Create and assign the extractor here
    extractor = ResultExtractor(self.model_name, self.format_prompt)
    # self.extractor = extractor  # Assign to self if needed elsewhere before run_agent completes,
    # but __init__ already handles setting self.extractor.

    return {
        "workers": debate_agents + [extractor]
    }
```
Step 4: Register System with Framework¶
Implementation Guide:
- Use AgentSystemRegistry.register() to make the system available
- Provide the system name as a string identifier
- Pass the class reference (not an instance)
- Include default configuration parameters
- These defaults can be overridden during initialization
ChatEval Implementation Example (Registration):

```python
# mas_arena/agents/chateval.py
# register agent system
AgentSystemRegistry.register(
    "chateval",
    ChatEval,
    num_agents=3,
    num_rounds=2
)
```
Advanced Features¶
Format Prompt Integration¶
Implementation Guide:
- Accept a format_prompt parameter in initialization
- Store the format prompt for benchmark-specific requirements
- Use the format prompt in result extraction and agent prompts
- Configure timeout and retry settings for robust operation
ChatEval Implementation Example (Format Prompt Integration):

```python
# mas_arena/agents/chateval.py
def __init__(self, model_name: str = None, format_prompt: str = ""):
    self.model_name = model_name or os.getenv("MODEL_NAME", "gpt-4o-mini")
    self.format_prompt = format_prompt
    self.llm = ChatOpenAI(
        model=self.model_name,
        request_timeout=60,  # Set request timeout to 60 seconds
        max_retries=2  # Set maximum retry attempts to 2
    )
    self.name = "result_extractor"
```
Agent Node Pattern¶
Implementation Guide:
- Use the dataclass decorator for a clean agent definition
- Include the required attributes: agent_id, name, model_name, system_prompt
- Initialize the chat history as an empty list
- Set up the LLM instance with timeout and retry configuration
- Ensure compatibility with the tool integration framework
ChatEval Implementation Example (Agent Class Definition):

```python
# mas_arena/agents/chateval.py
@dataclass
class Agent:
    """Represents an LLM agent"""
    agent_id: str
    name: str
    model_name: str
    system_prompt: str
    chat_history: List[Dict[str, str]] = None

    def __post_init__(self):
        self.chat_history = []
        self.llm = ChatOpenAI(
            model=self.model_name,
            request_timeout=60,  # Set request timeout to 60 seconds
            max_retries=2  # Set maximum retry attempts to 2
        )
```
Usage Metadata Handling¶
Implementation Guide:
- For native OpenAI API calls or non-structured output: no manual handling is required
- For structured output: use self.llm.with_structured_output(schema=AgentResponse, include_raw=True)
- Usage metadata is handled automatically by the framework
- Focus on implementing the structured output schema instead
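A minimal sketch of the structured-output pattern described above. The AgentResponse schema is hypothetical; with include_raw=True, LangChain returns a dict with "parsed" and "raw" keys, so the raw message (and its usage metadata) can be passed back in messages.

```python
# Sketch: structured output while preserving the raw message for usage tracking.
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI


class AgentResponse(BaseModel):  # hypothetical schema for illustration
    reasoning: str = Field(description="Step-by-step reasoning")
    answer: str = Field(description="Final answer")


llm = ChatOpenAI(model="gpt-4o-mini")
structured_llm = llm.with_structured_output(schema=AgentResponse, include_raw=True)

result = structured_llm.invoke("What is 2 + 2?")
parsed: AgentResponse = result["parsed"]  # validated schema instance
raw_message = result["raw"]               # AIMessage carrying usage_metadata
# Append raw_message to the messages returned from run_agent so token usage is tracked.
```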
Key Implementation Summary¶
Implementation Points:
- Inherit from the AgentSystem base class
- Implement the required run_agent() method
- Ensure config includes the evaluator key
- Return a dictionary containing messages and final_answer
- Optional: implement _create_agents() for tool integration support
Registration Process:
Use AgentSystemRegistry.register() to register the system and provide default configuration parameters.
Complete Implementation Reference: mas_arena/agents/chateval.py
Evaluator Extension¶
Basic Implementation¶
Step 1: Basic Structure and Registration¶
Implementation Guide:
- Use the @register_benchmark decorator to register the evaluator
- Define a normalization_keys mapping for data field standardization
- Inherit from the BaseEvaluator base class
- Provide a comprehensive docstring explaining the evaluator's purpose
- Set up the evaluator name and supported answer formats
MMLU_pro Implementation Example (Registration and Class Definition):

```python
# mas_arena/evaluators/mmlu_pro_evaluator.py
@register_benchmark(
    name="mmlu_pro",
    normalization_keys={
        "id": "id",
        "problem": "question",
        "solution": "answer",
    }
)
class MMLU_ProEvaluator(BaseEvaluator):
    """
    Evaluator for the MMLU Professional benchmark.

    This evaluator assesses agent performance on the MMLU_pro dataset
    using exact matching of answers (A, B, C, etc.).
    """
```
Step 2: Initialize Configuration¶
Implementation Guide:
- Call the parent class initialization with name and config
- Set up evaluation-specific weights and parameters
- Configure dataset loading and validation
- Set up logging and error handling
- Define evaluation metrics and scoring methods
MMLU_pro Implementation Example (Initialization):

```python
# mas_arena/evaluators/mmlu_pro_evaluator.py
def __init__(self, name="mmlu_pro", config=None):
    """
    Initialize the MMLU Professional evaluator.

    Args:
        name: Name of the evaluator
        config: Configuration dictionary containing:
            - data_path: Path to the MMLU_pro dataset
            - log_path: Path to save evaluation logs
    """
    super().__init__(name, config or {})

    # Weight for exact match score is always 1.0 as it's the only metric
    self.exact_match_weight = 1.0

    # Load the dataset
    self._load_dataset()
```
Step 3: Implement Core Evaluation Method¶
Implementation Guide:
- Extract the final answer and reference solution from the inputs
- Use a specialized answer-extraction method to parse the response
- Apply the scoring logic (exact match, numerical comparison, etc.)
- Calculate evaluation metrics and scores
- Return a standardized evaluation results dictionary
- Include the extracted answer and the original final answer
MMLU_pro Implementation Example (evaluate Method):

```python
# mas_arena/evaluators/mmlu_pro_evaluator.py
def evaluate(self, problem: Dict[str, Any], run_result: Dict[str, Any]) -> Dict[str, Any]:
    """
    Evaluate an agent's solution to a MMLU_pro problem.

    Args:
        problem: Problem dictionary containing:
            - question: Problem text (with options)
            - answer: Correct answer (letter)
            - answer_index: Index of correct answer (optional)
        run_result: Results from agent's execution, containing:
            - final_answer: Agent's final answer text
            - messages: Agent's message history

    Returns:
        Evaluation results
    """
    final_answer = run_result.get("final_answer", "")
    reference_letter = problem.get("solution", "")

    # Extract the final letter from the agent's response
    extracted_answer = self.extract_answer_from_response(final_answer)

    # Calculate exact match score (letter-based)
    score = self.check_exact_match(reference_letter, extracted_answer)

    # Record evaluation results
    return {
        "final_answer": final_answer,
        "extracted_answer": extracted_answer,
        "score": score,
    }
```
Advanced Features¶
Answer Extraction¶
Implementation Guide:
- Use regular expressions to extract formatted answers
- Handle multiple answer formats (tags, patterns, raw text)
- Implement fallback strategies for unformatted responses
- Clean and normalize the extracted text
- Support flexible answer parsing for different benchmarks
MMLU_pro Implementation Example (Answer Extraction):

```python
# mas_arena/evaluators/mmlu_pro_evaluator.py
def extract_answer_from_response(self, response: str) -> str:
    """
    Extract answer from agent response.

    Args:
        response: Complete response text from agent

    Returns:
        Extracted answer letter
    """
    # Try to extract answer from <answer> tags, allowing for whitespace
    match = re.search(r'<answer>\s*(.*?)\s*</answer>', response, re.DOTALL)
    if match:
        return match.group(1).strip()

    # If no tags found, return original response
    return response.strip()
```
Answer Verification¶
Implementation Guide:
- Implement case-insensitive comparison for text answers
- Handle numerical index to letter conversion (1 → A, 2 → B, etc.)
- Apply normalization and cleaning to both reference and candidate
- Return a numerical score (1.0 for a match, 0.0 for no match)
- Include error handling for malformed inputs
MMLU_pro Implementation Example (Exact Match Verification):

```python
# mas_arena/evaluators/mmlu_pro_evaluator.py
def check_exact_match(self, reference: str, candidate: str) -> float:
    """
    Check if the candidate exactly matches the reference (case-insensitive).

    Args:
        reference: Reference answer (e.g., 'A', 'B', 'C', etc.)
        candidate: Candidate answer

    Returns:
        1.0 if exact match, 0.0 otherwise
    """
    # Clean and normalize both answers
    ref_clean = reference.strip().upper()
    cand_clean = candidate.strip().upper()

    # Check for exact match
    if cand_clean == ref_clean:
        return 1.0

    # Check if candidate is an index (e.g., "1", "2", "3") converted to letter
    try:
        if cand_clean.isdigit():
            cand_index = int(cand_clean) - 1
            cand_letter = chr(ord('A') + cand_index)
            if cand_letter == ref_clean:
                return 1.0
    except Exception:
        pass

    return 0.0
```
Batch Evaluation¶
Implementation Guide:
- Iterate through all problems in the batch
- Extract problem IDs and reference answers for each item
- Apply the evaluation logic consistently across all problems
- Collect comprehensive results with metadata
- Log evaluation progress and summary statistics
- Return a standardized results format for the benchmark runner
MMLU_pro Implementation Example (Batch Evaluation):

```python
# mas_arena/evaluators/mmlu_pro_evaluator.py
def batch_evaluate(self, problems: List[Dict[str, Any]], **kwargs) -> List[Dict[str, Any]]:
    """
    Evaluate a batch of problems.

    Args:
        problems: List of problem dictionaries

    Returns:
        List of evaluation results
    """
    results = []

    # Evaluate each problem individually
    for i, problem in enumerate(problems):
        problem_id = problem.get("id", problem.get("question_id", f"unknown_{i}"))
        reference_letter = problem.get("solution", problem.get("answer", ""))
        reference_text = self.get_correct_answer_text(problem)
        response = problem.get("response", "")

        # Calculate exact match score
        exact_match = self.check_exact_match(reference_letter, response)

        # Record results
        result = {
            "problem_id": problem_id,
            "exact_match": exact_match,
            "combined_score": exact_match,  # Combined score is just the exact match
            "extracted_answer": response,
            "reference_answer": reference_letter,
            "reference_text": reference_text,
            "execution_time_ms": 0,  # Will be updated by the benchmark runner
            "math_score": 1.0 if exact_match >= 0.9 else 0.0  # For compatibility with benchmark runner
        }
        results.append(result)

        # Log the results
        self.logger.info(f"Problem {problem_id}: Exact={exact_match:.1f}, Combined={exact_match:.4f}")

    return results
```
Code Evaluation¶
Code Evaluator Key Points:
- Inherit from the BaseCodeEvaluator base class (not BaseEvaluator)
- Implement the check_solution(code, test, entry_point) method
- Implement extract_code(text) to extract code from responses
- Must include timeout protection mechanisms
- Use isolated environments for code execution
Core Process Flow:
1. Code Extraction - Extract Python code from agent responses
2. Environment Isolation - Create a secure execution environment
3. Test Execution - Run test cases to verify code correctness
4. Timeout Control - Prevent infinite loops or long executions
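A compressed sketch of this flow, assuming BaseCodeEvaluator exposes the two hooks listed above and that @register_benchmark also applies to code evaluators; the import paths, return shape, and the HumanEval-style check(entry_point) harness are assumptions. See mas_arena/evaluators/humaneval_evaluator.py for the real implementation.

```python
# Sketch only: import paths, decorator usage, and return shape are assumptions.
import re
import subprocess
import sys
import tempfile
from typing import Tuple

from mas_arena.evaluators import register_benchmark                      # assumed path
from mas_arena.evaluators.base_code_evaluator import BaseCodeEvaluator   # assumed path


@register_benchmark(
    name="my_code_benchmark",
    normalization_keys={"id": "task_id", "problem": "prompt", "solution": "canonical_solution"},
)
class MyCodeEvaluator(BaseCodeEvaluator):
    """Runs extracted code against test cases in an isolated subprocess."""

    def extract_code(self, text: str) -> str:
        # 1. Code Extraction: prefer fenced python blocks, fall back to the raw text.
        fence = "`" * 3
        match = re.search(fence + r"python\s*(.*?)" + fence, text, re.DOTALL)
        return match.group(1).strip() if match else text.strip()

    def check_solution(self, code: str, test: str, entry_point: str) -> Tuple[bool, str]:
        # 2. Environment Isolation: write code + tests to a temp file and
        #    run them in a separate Python process.
        program = f"{code}\n\n{test}\n\ncheck({entry_point})\n"  # HumanEval-style harness (assumption)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program)
            path = f.name
        try:
            # 3. Test Execution + 4. Timeout Control
            proc = subprocess.run(
                [sys.executable, path], capture_output=True, text=True, timeout=10
            )
            return proc.returncode == 0, proc.stderr
        except subprocess.TimeoutExpired:
            return False, "timeout"
```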
Evaluator Implementation Summary¶
Core Components:
- Use the @register_benchmark decorator for registration
- Inherit from the BaseEvaluator base class
- Implement the required evaluate() method
- Configure normalization_keys for data mapping
- Optional: implement answer extraction and verification methods
Evaluation Process:
1. Data Normalization - Map fields using normalization_keys
2. Answer Extraction - Extract the final answer from messages
3. Answer Verification - Compare predicted vs. reference answers
4. Result Return - Return the score, extracted_answer, and final_answer fields
Complete Implementation References:
- Text Evaluator: mas_arena/evaluators/mmlu_pro_evaluator.py
- Code Evaluator: mas_arena/evaluators/humaneval_evaluator.py
Best Practices¶
Performance & Security¶
- Batch Processing: Implement batch_evaluate() for better performance
- Timeout Handling: Always set timeouts for external calls and code execution (see the sketch after this list)
- Input Validation: Validate all inputs before processing
- Error Handling: Implement comprehensive exception handling
- Logging: Add detailed logging for debugging and monitoring
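As referenced above, a standard-library sketch for guarding a blocking external call with a timeout. The call_llm name is a placeholder for your own client function; note that a hung worker thread cannot be force-killed, it is simply abandoned.

```python
# Illustrative timeout guard (standard library only).
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout


def call_with_timeout(fn, *args, timeout_s: float = 60.0, **kwargs):
    """Run fn in a worker thread; give up after timeout_s seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except FuturesTimeout:
        raise TimeoutError(f"call timed out after {timeout_s}s")
    finally:
        # Do not wait for a hung call; the worker thread is left to finish on its own.
        pool.shutdown(wait=False)


# Usage (call_llm is a hypothetical client function):
# answer = call_with_timeout(call_llm, prompt, timeout_s=60)
```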
Testing & Validation¶
- Unit Tests: Test individual components thoroughly
- Integration Tests: Test the full evaluation pipeline
- Edge Cases: Test with malformed inputs and edge cases
- Performance Tests: Benchmark evaluation speed on large datasets
Common Issues¶
Implementation Checklist¶
For MAS Extensions:
- [ ] Config includes the evaluator key
- [ ] Messages carry usage_metadata for token tracking
- [ ] Agents have name and llm attributes (for tool integration)
- [ ] run_agent is async
- [ ] Return format includes messages and final_answer
- [ ] Proper registration with AgentSystemRegistry
For Evaluator Extensions:
- [ ] @register_benchmark decorator used
- [ ] evaluate method implemented
- [ ] Proper normalization_keys mapping
- [ ] Error handling for malformed inputs
- [ ] Timeout handling for long operations