🧪 Agent Evaluation & Testing
Evaluating AI agents is hard because their behavior is non-deterministic: the same input can produce different reasoning paths and tool calls across runs. Traditional test suites check for exact-match outputs, whereas agent evaluation requires scoring reasoning trajectories, tool invocations, and semantic correctness.
1. 📐 The Evaluation Pyramid
Structuring tests into distinct tiers lets fast, cheap checks run often and reserves slower, costlier evaluations for later stages, balancing latency, API spend, and developer confidence.
2. 🛠️ Unit & Integration Testing
Before evaluating the entire agent loop, verify individual component units (tool schemas, parsing logic) and integration interfaces (memory saving, router actions).
```python
import pytest
from unittest.mock import MagicMock

# parse_tool_response and route_next_action are the application functions
# under test; import them from your own agent module.


# 1. Unit Test: Verify tool return parser logic
def test_tool_output_parser():
    raw_payload = {"status": "success", "data": {"order_id": "ORD-123", "amount": 45.50}}
    parsed = parse_tool_response(raw_payload)
    assert parsed.order_id == "ORD-123"
    # Compare floats with a tolerance rather than exact equality
    assert parsed.amount == pytest.approx(45.50)


# 2. Integration Test: Verify routing state transition
def test_agent_routing_logic():
    mock_llm = MagicMock()
    # Mock LLM returning a tool call intent
    mock_llm.generate.return_value = {
        "tool_calls": [{"name": "query_database", "args": {"order_id": "ORD-123"}}]
    }
    state = {"query": "Check order ORD-123", "next_step": None}
    updated_state = route_next_action(state, mock_llm)
    assert updated_state["next_step"] == "query_database"
    assert updated_state["args"]["order_id"] == "ORD-123"
```

3. 🗂️ Golden Datasets (Test Collections)
A Golden Dataset is a static, curated list of representative user inputs paired with target execution paths and expected answers.
```json
[
  {
    "input": "Refund my order ORD-9912 for $25.00",
    "expected_steps": ["get_order_status", "issue_refund"],
    "target_output": "Refund of $25.00 has been successfully issued for order ORD-9912.",
    "eval_criteria": {
      "requires_exact_refund_amount": true,
      "forbidden_words": ["credit", "coupon"]
    }
  }
]
```

⚖️ 4. Outcome vs. Process Evaluation
Effective evaluation scores both what the agent achieved and how it achieved it.
| Evaluation Dimension | Focus | Example Metric | Evaluated By |
|---|---|---|---|
| Outcome Evaluation | Verifies the final response meets the user’s criteria. | Semantic similarity, factual accuracy. | LLM-as-a-judge, Exact match rules. |
| Process Evaluation | Audits the execution trace, tool choice, and loop efficiency. | Tool correctness, step efficiency, infinite loop detection. | Trajectory parsing, JSON schema validation. |
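The two dimensions can be scored separately against a golden dataset record. The following is a minimal sketch, not a definitive implementation: `AgentRun`, `score_outcome`, and `score_process` are hypothetical names, the outcome check uses simple rule-based matching (an LLM judge could replace it), and the process check assumes the tool trajectory must match `expected_steps` exactly and in order.

```python
import re
from dataclasses import dataclass


@dataclass
class AgentRun:
    """Hypothetical record of one agent execution."""
    final_output: str
    tool_calls: list[str]  # tool names in the order they were invoked


def score_outcome(run: AgentRun, case: dict) -> bool:
    """Outcome check: rule-based criteria applied to the final response only."""
    criteria = case.get("eval_criteria", {})
    text = run.final_output.lower()
    # Fail if any forbidden word appears (plain substring match).
    if any(word.lower() in text for word in criteria.get("forbidden_words", [])):
        return False
    # If required, every dollar amount in the input must be echoed verbatim.
    if criteria.get("requires_exact_refund_amount"):
        amounts = re.findall(r"\$\d+(?:\.\d{2})?", case["input"])
        return all(amount in run.final_output for amount in amounts)
    return True


def score_process(run: AgentRun, case: dict) -> bool:
    """Process check: the tool trajectory equals the expected steps, in order."""
    return run.tool_calls == case["expected_steps"]


GOLDEN_CASE = {
    "input": "Refund my order ORD-9912 for $25.00",
    "expected_steps": ["get_order_status", "issue_refund"],
    "eval_criteria": {
        "requires_exact_refund_amount": True,
        "forbidden_words": ["credit", "coupon"],
    },
}

good_run = AgentRun(
    final_output="Refund of $25.00 has been successfully issued for order ORD-9912.",
    tool_calls=["get_order_status", "issue_refund"],
)
bad_run = AgentRun(
    final_output="We added a $25.00 store credit to your account.",
    tool_calls=["issue_refund"],
)
```

Keeping the two scores separate makes it visible when an agent reaches the right answer through the wrong path, or follows the right path and still produces a bad response.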
🤝 5. Human Evaluation Methodologies
Automated evals must be grounded in human grading. Standardize human audits using these approaches:
A. SME (Subject Matter Expert) Reviews
For complex domains (e.g., medical, legal, or code generation), compile outputs and route them to domain experts for verification.
B. Pairwise Comparison (A/B Testing)
Present evaluators with two blind agent outputs (A and B) generated under different prompt setups or model versions, and ask them to select the superior response.
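One way to keep the comparison blind is to randomize which variant appears first, record the mapping separately, and un-blind only when aggregating votes. The sketch below assumes hypothetical helper names (`make_blind_pair`, `win_rates`) and that votes are recorded as `"A"`, `"B"`, or `"tie"` after un-blinding.

```python
import random
from collections import Counter


def make_blind_pair(output_a: str, output_b: str, rng: random.Random) -> tuple[dict, dict]:
    """Shuffle presentation order; return (what the evaluator sees, the hidden key)."""
    variants = [("A", output_a), ("B", output_b)]
    rng.shuffle(variants)
    display = {"first": variants[0][1], "second": variants[1][1]}
    key = {"first": variants[0][0], "second": variants[1][0]}
    return display, key


def win_rates(votes: list[str]) -> dict[str, float]:
    """Share of un-blinded votes for each variant, plus ties."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    return {label: counts.get(label, 0) / len(votes) for label in ("A", "B", "tie")}
```

Storing the key apart from the displayed pair means evaluators never see which prompt setup or model version produced a response.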
C. Standardized Rubric Scoring
Grade qualitative dimensions (e.g., tone, completeness) on a 1-5 scale:
| Score | Rating | Criteria |
|---|---|---|
| 5 | Excellent | The response is accurate, complete, matches tone instructions, and contains no unnecessary steps. |
| 3 | Acceptable | The response solves the user query but contains minor formatting flaws or inefficient tool steps. |
| 1 | Failed | The response is factually incorrect, hallucinated, or failed to execute the requested tools. |
🧠 6. LLM-as-a-Judge & Calibration
Once human rubrics are defined, automate evals using a highly capable LLM (e.g., GPT-4o or Claude 3.5 Sonnet) as an evaluator.
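Calibration Rule: Always measure agreement between your LLM Judge and Human Evaluators on the same set of samples, using the raw agreement rate and, where possible, a chance-corrected statistic such as Cohen’s Kappa. Iterate on the evaluator’s system prompt until the judge’s agreement rate with human graders exceeds 80%.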
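Raw agreement and Cohen’s Kappa can both be computed from paired scores with the standard library. This is a minimal sketch that treats rubric scores as unordered categories; the function names are illustrative, and a weighted kappa may suit ordinal 1-5 scales better.

```python
from collections import Counter


def agreement_rate(human: list[int], judge: list[int]) -> float:
    """Fraction of samples where the judge's score equals the human's score."""
    if not human or len(human) != len(judge):
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    p_observed = agreement_rate(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    p_expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


human_scores = [5, 5, 3, 1, 1, 3, 5, 1, 3, 5]
judge_scores = [5, 5, 3, 1, 3, 3, 5, 1, 1, 5]
print(f"agreement={agreement_rate(human_scores, judge_scores):.2f}")
print(f"kappa={cohens_kappa(human_scores, judge_scores):.2f}")
```

Kappa is typically lower than raw agreement because it discounts matches that would occur by chance, so it is worth reporting both.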
Evaluator Prompt Structure:

```text
You are an expert evaluator. Rate the agent's response on a scale of 1-5 based on the following rubric:
[INSERT RUBRIC TABLE]

User Input: {user_input}
Agent Output: {agent_output}
Expected Steps: {expected_steps}

Provide your rating in JSON format:
{
  "score": <1-5>,
  "reasoning": "<explanation>"
}
```

📊 7. Agent-Specific Metrics
Track these indicators to audit the operational health of your agents:
- Task Success Rate: Percentage of golden runs resolving the user’s query correctly.
- Step Count Efficiency: The average number of loop turns taken to complete tasks (flagging bloated loops).
- Tool Invocation Accuracy: Percentage of tool calls executed with correct schemas and valid variables.
- Token Cost Efficiency: Tokens consumed per successfully completed task, tracked over time.
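The four indicators above can be derived from per-run records. The sketch below assumes a hypothetical `RunRecord` shape; the field names are illustrative and would map onto whatever your tracing layer actually logs.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Hypothetical per-run log entry."""
    success: bool          # did the run resolve the user's query correctly?
    steps: int             # loop turns taken
    tool_calls: int        # total tool invocations
    valid_tool_calls: int  # invocations with correct schema and valid arguments
    tokens: int            # total tokens consumed


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Aggregate agent health metrics over a batch of runs."""
    if not runs:
        raise ValueError("at least one run is required")
    total = len(runs)
    successes = sum(r.success for r in runs)
    total_calls = sum(r.tool_calls for r in runs)
    total_tokens = sum(r.tokens for r in runs)
    return {
        "task_success_rate": successes / total,
        "avg_steps": sum(r.steps for r in runs) / total,
        # With no tool calls at all there is nothing to get wrong.
        "tool_invocation_accuracy": (
            sum(r.valid_tool_calls for r in runs) / total_calls if total_calls else 1.0
        ),
        # Cost of failed runs is charged against the successful ones.
        "tokens_per_success": total_tokens / successes if successes else float("inf"),
    }
```

Dividing total tokens by successes, rather than by all runs, penalizes configurations that burn tokens on runs that ultimately fail.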
🔄 8. Regression Testing Pipeline
Run evaluation scripts on every git commit before code merges.
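A regression gate can compare the current evaluation metrics against a stored baseline and fail the build when any metric drops too far. This is a sketch under stated assumptions: `find_regressions` is a hypothetical helper, all metrics are assumed to be "higher is better", and the 0.02 tolerance is an arbitrary placeholder to absorb run-to-run noise.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """Return metric names that are missing or fell more than `tolerance` below baseline."""
    failures = []
    for name, base_value in baseline.items():
        value = current.get(name)
        if value is None or value < base_value - tolerance:
            failures.append(name)
    return failures


def gate_exit_code(baseline: dict[str, float], current: dict[str, float]) -> int:
    """Exit code for a CI step: 0 when no metric regressed, 1 otherwise."""
    failures = find_regressions(baseline, current)
    for name in failures:
        print(f"REGRESSION: {name} baseline={baseline[name]} current={current.get(name)}")
    return 1 if failures else 0
```

In a CI job, the script would load both metric dictionaries (for example from JSON files) and pass the result of `gate_exit_code` to `sys.exit`, so a non-zero code blocks the merge.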
🔗 9. Tool & Multi-Agent Evaluation
Tool Invocation Confusion Matrix
Plot tool selection failures to identify which tool descriptions are causing LLM routing errors:
| Actual Tool \ Predicted Tool Call | Search | Database | Result |
|---|---|---|---|
| Search | 98 | 2 | 98% Correct |
| Database | 12 | 88 | 12% Misrouted |

Multi-Agent Coordination Checkpoints
In systems with multiple collaborating agents, audit handoffs by tracking:
- Handoff Accuracy: Did the supervisor delegate to the correct specialist?
- State Preservation: Did the payload parameters remain valid across handoff state boundaries?
- Ping-Pong Detection: Flag cycles where Agent A and Agent B continuously pass messages back and forth.
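Ping-pong cycles can be flagged directly from an ordered handoff log. The sketch below assumes handoffs are recorded as `(from_agent, to_agent)` tuples; `detect_ping_pong` and the `max_bounces` threshold are hypothetical and should be tuned to how many back-and-forth exchanges are legitimate in your system.

```python
def detect_ping_pong(handoffs: list[tuple[str, str]], max_bounces: int = 3) -> bool:
    """Return True if two agents hand off to each other more than `max_bounces` times in a row."""
    streak = 1
    for previous, current in zip(handoffs, handoffs[1:]):
        # A bounce is a handoff that exactly reverses the one before it.
        if current == (previous[1], previous[0]):
            streak += 1
            if streak > max_bounces:
                return True
        else:
            streak = 1
    return False


looping_trace = [
    ("supervisor", "billing"),
    ("billing", "refunds"),
    ("refunds", "billing"),
    ("billing", "refunds"),
    ("refunds", "billing"),
]
healthy_trace = [("supervisor", "billing"), ("billing", "supervisor")]
```

The same log can feed the other two checks: comparing each `to_agent` against a labeled expected specialist gives handoff accuracy, and validating the payload schema at each tuple boundary covers state preservation.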
🚀 10. Production Evaluation Workflow
Continuous evaluation requires auditing a subset of actual production traffic.
- Log Sampling: Capture 1-5% of production traces.
- Shadow Execution: Re-run sampled inputs against your staging branch to detect regressions.
- Manual Verification: Route low-confidence traces directly to human annotators for manual grading.
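The workflow above can be sketched as a small routing function. The assumptions here are loud ones: trace IDs are strings, each trace carries a hypothetical `judge_confidence` field in [0, 1], the 0.6 threshold is a placeholder, and sampling is done by hashing the trace ID so the same trace is always either in or out of the sample.

```python
import hashlib


def is_sampled(trace_id: str, sample_rate: float = 0.02) -> bool:
    """Deterministically sample roughly `sample_rate` of traces by hashing the ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


def route_trace(trace: dict, sample_rate: float = 0.02, confidence_threshold: float = 0.6) -> str:
    """Decide what happens to one production trace."""
    if not is_sampled(trace["trace_id"], sample_rate):
        return "skip"
    # Low-confidence traces go to human annotators for manual grading.
    if trace.get("judge_confidence", 0.0) < confidence_threshold:
        return "human_review"
    # Everything else in the sample is replayed against the staging branch.
    return "shadow_replay"
```

Hash-based sampling avoids storing sampling decisions and keeps them stable across services, which makes it easier to join sampled traces with later shadow-execution results.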