🧪 Agent Evaluation & Testing

Evaluating AI agents is hard because their behavior is non-deterministic: the same input can produce different outputs and different tool-call sequences. Traditional test suites check for exact-match outputs, whereas agent evaluation requires scoring reasoning trajectories, tool invocations, and semantic correctness.

1. 📐 The Evaluation Pyramid

Splitting tests into distinct tiers, from cheap and fast checks to slower and more expensive ones, keeps latency and API spend under control while preserving developer confidence.

2. 🛠️ Unit & Integration Testing

Before evaluating the entire agent loop, verify individual components (tool schemas, parsing logic) and the interfaces between them (memory persistence, router actions).

import pytest
from unittest.mock import MagicMock

# parse_tool_response and route_next_action are the application functions
# under test; import them from your own agent module.

# 1. Unit test: verify the tool-response parser
def test_tool_output_parser():
    raw_payload = {"status": "success", "data": {"order_id": "ORD-123", "amount": 45.50}}
    parsed = parse_tool_response(raw_payload)
    assert parsed.order_id == "ORD-123"
    # Compare floats with a tolerance rather than exact equality
    assert parsed.amount == pytest.approx(45.50)

# 2. Integration test: verify the routing state transition
def test_agent_routing_logic():
    mock_llm = MagicMock()
    # Mock the LLM returning a tool-call intent
    mock_llm.generate.return_value = {"tool_calls": [{"name": "query_database", "args": {"order_id": "ORD-123"}}]}

    state = {"query": "Check order ORD-123", "next_step": None}
    updated_state = route_next_action(state, mock_llm)

    assert updated_state["next_step"] == "query_database"
    assert updated_state["args"]["order_id"] == "ORD-123"
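
The tests above call two functions that the page does not define. As a rough illustration only, the following sketch shows one shape they could take so that both tests pass. The `ParsedOrder` dataclass, the `"respond"` fallback step, and the assumption that the LLM client exposes `generate(prompt)` returning a dict are all hypothetical; a real agent would have its own versions.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ParsedOrder:
    """Hypothetical typed view of a successful tool payload."""
    order_id: str
    amount: float


def parse_tool_response(raw_payload: dict[str, Any]) -> ParsedOrder:
    """Validate a raw tool payload and extract the order fields."""
    if raw_payload.get("status") != "success":
        raise ValueError(f"tool call failed: {raw_payload.get('status')!r}")
    data = raw_payload["data"]
    return ParsedOrder(order_id=str(data["order_id"]), amount=float(data["amount"]))


def route_next_action(state: dict[str, Any], llm: Any) -> dict[str, Any]:
    """Ask the LLM for the next step and record its first tool call in the state."""
    response = llm.generate(state["query"])
    tool_calls = response.get("tool_calls") or []
    updated = dict(state)  # shallow copy so the caller's state is not mutated
    if tool_calls:
        updated["next_step"] = tool_calls[0]["name"]
        updated["args"] = tool_calls[0].get("args", {})
    else:
        # Assumption: with no tool call, the agent answers directly.
        updated["next_step"] = "respond"
        updated["args"] = {}
    return updated
```

Returning a copy of the state rather than mutating it makes the routing step easier to assert on, because the test can compare the state before and after the call.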

3. 🗂️ Golden Datasets (Test Collections)

A Golden Dataset is a static, curated list of representative user inputs paired with target execution paths and expected answers.

[
  {
    "input": "Refund my order ORD-9912 for $25.00",
    "expected_steps": ["get_order_status", "issue_refund"],
    "target_output": "Refund of $25.00 has been successfully issued for order ORD-9912.",
    "eval_criteria": {
      "requires_exact_refund_amount": true,
      "forbidden_words": ["credit", "coupon"]
    }
  }
]
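
One way to score an agent run against a golden entry like the one above is sketched below. The function name `check_golden_case` and the interpretation of `requires_exact_refund_amount` (every dollar amount in `target_output` must appear verbatim in the agent's answer) are assumptions made for illustration; the page does not specify how these criteria are enforced.

```python
import re
from typing import Any


def check_golden_case(
    case: dict[str, Any], actual_steps: list[str], actual_output: str
) -> dict[str, bool]:
    """Compare one agent run against one golden-dataset entry."""
    criteria = case.get("eval_criteria", {})

    # Process check: the tools were called in exactly the expected order.
    results = {"steps_match": actual_steps == case["expected_steps"]}

    # Outcome check: none of the forbidden words appear as whole words.
    lowered = actual_output.lower()
    results["no_forbidden_words"] = not any(
        re.search(rf"\b{re.escape(word.lower())}\b", lowered)
        for word in criteria.get("forbidden_words", [])
    )

    # Outcome check (assumed meaning): dollar amounts from the target answer
    # must appear unchanged in the actual answer.
    if criteria.get("requires_exact_refund_amount"):
        expected_amounts = re.findall(r"\$\d+(?:\.\d{2})?", case["target_output"])
        results["refund_amount_present"] = all(
            amount in actual_output for amount in expected_amounts
        )
    return results
```

A run passes the case when every value in the returned dict is `True`; keeping the checks separate makes it clear which criterion failed.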

⚖️ 4. Outcome vs. Process Evaluation

Effective evaluation scores both what the agent achieved and how it achieved it.

Evaluation Dimension	Focus	Example Metric	Evaluated By
Outcome Evaluation	Verifies the final response meets the user’s criteria.	Semantic similarity, factual accuracy.	LLM-as-a-judge, Exact match rules.
Process Evaluation	Audits the execution trace, tool choice, and loop efficiency.	Tool correctness, step efficiency, infinite loop detection.	Trajectory parsing, JSON schema validation.
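
The process-evaluation row can be made concrete with a small trace checker. The sketch below assumes a trace is a list of dicts with `"tool"` and `"args"` keys, which is a hypothetical format; `evaluate_trajectory` and the `max_repeats` threshold are likewise illustrative names and defaults, not something the page prescribes.

```python
import json
from collections import Counter
from typing import Any


def evaluate_trajectory(
    trace: list[dict[str, Any]], expected_steps: list[str], max_repeats: int = 3
) -> dict[str, Any]:
    """Score an execution trace for tool choice, efficiency, and looping."""
    actual_steps = [step["tool"] for step in trace]

    # Tool correctness: expected tools appear in order (extra steps allowed).
    remaining = iter(actual_steps)
    tools_in_order = all(tool in remaining for tool in expected_steps)

    # Step efficiency: 1.0 means no wasted steps, lower means a bloated loop.
    efficiency = min(1.0, len(expected_steps) / len(actual_steps)) if actual_steps else 0.0

    # Loop detection: the same tool called with identical arguments repeatedly.
    call_counts = Counter(
        (step["tool"], json.dumps(step.get("args", {}), sort_keys=True)) for step in trace
    )
    loop_suspected = any(count >= max_repeats for count in call_counts.values())

    return {
        "tools_in_order": tools_in_order,
        "step_efficiency": efficiency,
        "loop_suspected": loop_suspected,
    }
```

Outcome evaluation would then run separately on the final answer, so a run can fail on process (wasteful or looping) even when the outcome is correct.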

🤝 5. Human Evaluation Methodologies

Automated evals must be grounded in human grading. Standardize human audits using these approaches:

A. SME (Subject Matter Expert) Reviews

For complex domains (e.g., medical, legal, or code generation), compile outputs and route them to domain experts for verification.

B. Pairwise Comparison (A/B Testing)

Present evaluators with two blind agent outputs (A and B) generated under different prompt setups or model versions, and ask them to select the superior response.
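
Keeping the comparison blind usually means randomizing which output appears in which position and mapping the rater's choice back afterwards. The helpers below are a minimal sketch of that bookkeeping; the names `make_blind_pair` and `resolve_vote` and the left/right presentation are assumptions for illustration.

```python
import random


def make_blind_pair(output_a: str, output_b: str, rng: random.Random) -> dict:
    """Randomize display order so position does not reveal the source."""
    flipped = rng.random() < 0.5
    left, right = (output_b, output_a) if flipped else (output_a, output_b)
    # "flipped" is stored server-side and never shown to the evaluator.
    return {"left": left, "right": right, "flipped": flipped}


def resolve_vote(pair: dict, vote: str) -> str:
    """Map the evaluator's 'left'/'right' vote back to variant 'A' or 'B'."""
    if vote not in ("left", "right"):
        raise ValueError(f"unexpected vote: {vote!r}")
    chose_left = vote == "left"
    # Left holds B only when the pair was flipped.
    return "B" if chose_left == pair["flipped"] else "A"
```

Aggregating the resolved votes per variant then gives a win rate for each prompt setup or model version.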

C. Standardized Rubric Scoring

Grade qualitative dimensions (e.g., tone, completeness) on a 1-5 scale:

Score	Rating	Criteria
5	Excellent	The response is accurate, complete, matches tone instructions, and contains no unnecessary steps.
3	Acceptable	The response solves the user query but contains minor formatting flaws or inefficient tool steps.
1	Failed	The response is factually incorrect, hallucinated, or failed to execute the requested tools.

🧠 6. LLM-as-a-Judge & Calibration

Once human rubrics are defined, automate evals using a highly capable LLM (e.g., GPT-4o or Claude 3.5 Sonnet) as an evaluator.

Calibration Rule: Always measure the agreement between your LLM judge and your human evaluators, for example with Cohen’s kappa, which corrects raw agreement for chance. Iterate on the evaluator’s system prompt until the judge’s agreement rate with the human labels exceeds 80%.
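
As a worked illustration of the calibration step, the sketch below computes Cohen's kappa from two parallel lists of scores using the standard formula (observed agreement minus chance agreement, divided by one minus chance agreement). The function name and the sample scores are made up for the example.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(human)

    # Observed agreement: fraction of items given the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Expected agreement if both raters labeled at random with their own
    # label frequencies.
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)

    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Illustrative scores on the 1/3/5 rubric for ten responses.
human_scores = [5, 5, 3, 1, 3, 5, 1, 3, 5, 1]
judge_scores = [5, 5, 3, 1, 3, 3, 1, 3, 5, 3]
kappa = cohens_kappa(human_scores, judge_scores)
```

In this made-up sample the raw agreement rate is 80% (8 of 10 items), while kappa comes out near 0.70, which shows why it helps to state which of the two numbers a threshold refers to.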

Evaluator Prompt Structure:

You are an expert evaluator. Rate the agent's response on a scale of 1-5 based on the following rubric:
[INSERT RUBRIC TABLE]

User Input: {user_input}
Agent Output: {agent_output}
Expected Steps: {expected_steps}

Provide your rating in JSON format:
{
  "score": <1-5>,
  "reasoning": "<explanation>"
}
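
Judge models do not always return clean JSON, so it can help to validate the verdict before recording it. The following is a minimal sketch of such a parser for the `score` and `reasoning` fields requested in the prompt above; the function name `parse_judge_verdict` and the choice to tolerate surrounding text are assumptions.

```python
import json


def parse_judge_verdict(raw: str) -> dict:
    """Extract and validate the judge's JSON verdict from its raw reply."""
    # Tolerate prose or code fences around the JSON object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge reply")
    verdict = json.loads(raw[start : end + 1])  # JSONDecodeError is a ValueError

    score = verdict.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")

    reasoning = verdict.get("reasoning")
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise ValueError("reasoning must be a non-empty string")

    return {"score": score, "reasoning": reasoning.strip()}
```

Replies that fail validation can be retried or routed to a human grader instead of being silently scored.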

📊 7. Agent-Specific Metrics

Track these indicators to audit the operational health of your agents:

Task Success Rate: Percentage of golden-dataset runs that resolve the user’s query correctly.
Step Count Efficiency: The average number of loop turns taken to complete a task (a rising average flags bloated loops).
Tool Invocation Accuracy: Percentage of tool calls made with the correct schema and valid arguments.
Token Cost Efficiency: Token consumption per successful task, tracked over time.
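
The four indicators above can be computed from per-run records. The sketch below assumes a hypothetical `RunRecord` structure and interprets token cost efficiency as total tokens divided by the number of successful runs; both are illustrative choices rather than definitions from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical summary of one golden-dataset run."""
    success: bool
    steps: int             # loop turns taken
    tool_calls: int        # total tool calls attempted
    valid_tool_calls: int  # calls that passed schema and argument checks
    tokens: int            # total tokens consumed


def summarize_runs(runs: list[RunRecord]) -> dict[str, Optional[float]]:
    """Aggregate per-run records into the agent-level metrics."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    successes = sum(r.success for r in runs)
    total_calls = sum(r.tool_calls for r in runs)
    valid_calls = sum(r.valid_tool_calls for r in runs)
    total_tokens = sum(r.tokens for r in runs)
    return {
        "task_success_rate": successes / n,
        "avg_steps": sum(r.steps for r in runs) / n,
        # None when the metric is undefined (no tool calls or no successes).
        "tool_invocation_accuracy": valid_calls / total_calls if total_calls else None,
        "tokens_per_success": total_tokens / successes if successes else None,
    }
```

Charging failed runs' tokens to the successful ones means the cost figure rises when the agent wastes tokens on tasks it does not complete.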

🔄 8. Regression Testing Pipeline

Run the evaluation suite on every commit so that regressions are caught before code is merged.
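
A simple way to turn an evaluation run into a merge gate is to compare its metrics with a stored baseline and fail the build when any metric drops beyond a tolerance. The sketch below assumes every metric is "higher is better" and uses a hypothetical `find_regressions` helper with an illustrative tolerance of 0.02.

```python
from typing import Optional


def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, tuple[float, Optional[float]]]:
    """Return metrics that fell more than `tolerance` below the baseline.

    Assumes higher values are better for every metric. A metric missing
    from `current` is also reported, with None as its new value.
    """
    regressions = {}
    for name, base_value in baseline.items():
        new_value = current.get(name)
        if new_value is None or new_value < base_value - tolerance:
            regressions[name] = (base_value, new_value)
    return regressions


# Illustrative values only.
baseline_metrics = {"task_success_rate": 0.90, "tool_invocation_accuracy": 0.95}
current_metrics = {"task_success_rate": 0.91, "tool_invocation_accuracy": 0.90}
failed = find_regressions(baseline_metrics, current_metrics)
```

In a CI job, a non-empty result would typically be printed and followed by a non-zero exit code so that the merge is blocked.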

🔗 9. Tool & Multi-Agent Evaluation

Tool Invocation Confusion Matrix

Plot tool selection failures to identify which tool descriptions are causing LLM routing errors:

                  Predicted Tool Call
                 [Search]   [Database]
Actual  [Search]    98          2      (98% Correct)
Tool    [Database]  12         88      (12% Misrouted)
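
A matrix like the one above can be built from pairs of (actual tool, predicted tool) collected during evaluation. The following is a small sketch using only the standard library; the function names and the lowercase tool labels are illustrative.

```python
from collections import Counter, defaultdict
from typing import Iterable


def tool_confusion_matrix(pairs: Iterable[tuple[str, str]]) -> dict[str, dict[str, int]]:
    """Count predictions per actual tool: matrix[actual][predicted] -> count."""
    matrix: dict[str, Counter] = defaultdict(Counter)
    for actual, predicted in pairs:
        matrix[actual][predicted] += 1
    return {actual: dict(row) for actual, row in matrix.items()}


def per_tool_accuracy(matrix: dict[str, dict[str, int]]) -> dict[str, float]:
    """Fraction of calls routed to the correct tool, per actual tool."""
    return {
        actual: row.get(actual, 0) / sum(row.values())
        for actual, row in matrix.items()
    }


# Reproduce the counts from the matrix above.
pairs = (
    [("search", "search")] * 98 + [("search", "database")] * 2
    + [("database", "search")] * 12 + [("database", "database")] * 88
)
matrix = tool_confusion_matrix(pairs)
accuracy = per_tool_accuracy(matrix)
```

Here the `database` row shows the weaker accuracy (0.88), which points to the database tool's description as the first thing to review.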

Multi-Agent Coordination Checkpoints

In systems with multiple collaborating agents, audit handoffs by tracking:

Handoff Accuracy: Did the supervisor delegate to the correct specialist?
State Preservation: Did the payload parameters remain valid across handoff state boundaries?
Ping-Pong Detection: Flag cycles where Agent A and Agent B continuously pass messages back and forth.
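
Ping-pong detection can be approximated by scanning the ordered handoff log for a run of handoffs that each reverse the previous one. The sketch below assumes handoffs are logged as `(from_agent, to_agent)` pairs; `detect_ping_pong` and the `max_bounces` default are hypothetical.

```python
from typing import Optional, Sequence


def detect_ping_pong(
    handoffs: Sequence[Sequence[str]], max_bounces: int = 3
) -> Optional[tuple[str, str]]:
    """Return the agent pair that bounces more than `max_bounces` times in a row.

    `handoffs` is an ordered log of (from_agent, to_agent) pairs. Returns
    None when no such cycle is found.
    """
    log = [tuple(h) for h in handoffs]
    streak = 1
    for prev, curr in zip(log, log[1:]):
        # A bounce is a handoff that exactly reverses the previous one.
        if curr == (prev[1], prev[0]):
            streak += 1
            if streak > max_bounces:
                return tuple(sorted(curr))
        else:
            streak = 1
    return None
```

A single return handoff (a specialist reporting back to its supervisor) is normal, which is why the check only fires after several consecutive reversals.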

🚀 10. Production Evaluation Workflow

Continuous evaluation requires auditing a subset of actual production traffic.

Log Sampling: Capture 1-5% of production traces.
Shadow Execution: Re-run sampled inputs against your staging branch to detect regressions.
Manual Verification: Route low-confidence traces directly to human annotators for manual grading.
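
For the log-sampling step, one common approach is to hash a stable trace identifier so that the sampling decision is deterministic and reproducible across services. This is a minimal sketch under that assumption; the `should_sample` name and the use of a string `trace_id` are illustrative.

```python
import hashlib


def should_sample(trace_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically select roughly `sample_rate` of traces by ID hash."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer bucket in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(sample_rate * 2**64)
```

Because the decision depends only on the ID, the same trace is either always sampled or never sampled, which keeps sampled traces complete even when several components log parts of them.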
