🔭 Agent Observability & Tracing

Traditional logging (e.g., flat stdout streams or database logs) is insufficient for debugging autonomous agents. Because agentic execution is non-deterministic and multi-turn, you must capture the entire execution trajectory—including prompt iterations, tool routing, sub-agent handoffs, and memory lookups—as a structured, hierarchical call graph.

1. 📐 Trajectory Tracing Architecture

Agentic systems must model executions using hierarchical Traces and Spans (conforming to distributed tracing standards). A single trace represents the complete request lifecycle, while individual spans represent logical units of work.

[!IMPORTANT] Parent-Child Span Relationships: Every tool call and sub-agent step must be registered as a child span of the active reasoning turn. If the context hierarchy is broken, you lose the ability to track exactly why a model chose to execute a specific action.
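
As a toy illustration of why the active-span context matters, the sketch below models spans with a `contextvars` variable. `ToySpan`, `start_span`, and `FINISHED` are hypothetical names, not OpenTelemetry APIs; a real tracer manages this context for you, but the parent-linking idea is the same.

```python
import contextvars
import itertools
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Optional

# Toy stand-ins for a real tracer; none of these names are OpenTelemetry APIs.
_current_span = contextvars.ContextVar("current_span", default=None)
_span_ids = itertools.count(1)
FINISHED: List["ToySpan"] = []


@dataclass
class ToySpan:
    name: str
    span_id: int
    parent_id: Optional[int]


@contextmanager
def start_span(name: str) -> Iterator[ToySpan]:
    """Open a span whose parent is whatever span is active right now."""
    parent = _current_span.get()
    span = ToySpan(name, next(_span_ids), parent.span_id if parent else None)
    token = _current_span.set(span)  # make this span the active one
    try:
        yield span
    finally:
        _current_span.reset(token)  # restore the previously active span
        FINISHED.append(span)


def run_turn() -> None:
    with start_span("agent.request"):                # root span of the trace
        with start_span("reasoning.turn_1"):         # one reasoning turn
            with start_span("tool.search"):          # tool call under the turn
                pass
            with start_span("subagent.summarize"):   # sub-agent handoff, same parent
                pass


run_turn()
by_name = {s.name: s for s in FINISHED}
for s in FINISHED:
    print(s.name, "parent_id =", s.parent_id)
```

If a tool ran outside the `reasoning.turn_1` block, its `parent_id` would point at the root (or at nothing), and the link between the model's decision and the action would be lost.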

2. 🔌 OpenTelemetry GenAI Conventions

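To avoid vendor lock-in, instrument your agents using the official OpenTelemetry GenAI Semantic Conventions. This standardizes how model inputs, parameters, and metadata are named across all monitoring backends. The conventions are still evolving and attribute names have changed between releases, so pin the version you target.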

Standard GenAI Attribute Reference

Attribute Key	Type	Description	Example
`gen_ai.system`	String	The LLM provider / system (recent revisions of the conventions rename this key to `gen_ai.provider.name`).	`openai`, `anthropic`, `ollama`
`gen_ai.request.model`	String	The identifier of the model requested.	`gpt-4o`, `claude-3-5-sonnet`
`gen_ai.response.model`	String	The actual model name used to generate.	`gpt-4o-2024-05-13`
`gen_ai.request.temperature`	Double	The temperature parameter used.	`0.0`
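`gen_ai.usage.input_tokens`	Integer	Tokens consumed by the input prompt (named `gen_ai.usage.prompt_tokens` in earlier revisions).	`1240`
`gen_ai.usage.output_tokens`	Integer	Tokens consumed by the model output (named `gen_ai.usage.completion_tokens` in earlier revisions).	`312`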
`gen_ai.response.finish_reasons`	String[]	Reason for ending generation.	`["stop"]`, `["tool_calls"]`

OTel Manual Instrumentation Implementation

Below is a framework-agnostic Python implementation demonstrating how to manually instrument an LLM reasoning node with standard OTel convention attributes.

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
import time
 
tracer = trace.get_tracer("agent.orchestrator")
 
def call_reasoning_model(messages: list, model_name: str = "gpt-4o") -> dict:
    # Start a span for the LLM execution node
    with tracer.start_as_current_span("llm_reasoning_call") as span:
        # Set standard GenAI semantic conventions
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", model_name)
        span.set_attribute("gen_ai.request.temperature", 0.0)
        
        start_time = time.perf_counter()
        try:
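            # Execute model call. `execute_openai_call` is a placeholder for
            # your own client wrapper; it is assumed to return a dict-like response
            response = execute_openai_call(messages, model=model_name)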
            latency = time.perf_counter() - start_time
            
            # Extract metrics
            prompt_tokens = response["usage"]["prompt_tokens"]
            completion_tokens = response["usage"]["completion_tokens"]
            finish_reason = response["choices"][0]["finish_reason"]
            
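            # Record usage as span attributes (current convention names;
            # earlier revisions used prompt_tokens / completion_tokens)
            span.set_attribute("gen_ai.usage.input_tokens", prompt_tokens)
            span.set_attribute("gen_ai.usage.output_tokens", completion_tokens)
            span.set_attribute("gen_ai.response.finish_reasons", [finish_reason])
            # Custom, non-standard attribute: model-call latency only
            # (the span's own timestamps already cover its total duration)
            span.set_attribute("gen_ai.latency_seconds", latency)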
            
            span.set_status(Status(StatusCode.OK))
            return response
        except Exception as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, description=str(e)))
            raise  # re-raise the original exception unchanged

3. 🛠️ First-Class Tool Call Instrumentation

Tools must not execute in a black box. Tracing tool execution as a nested child span of the agent reasoning turn allows you to catch:

Schema Validation Exceptions: Bad parameters generated by the LLM.
Silent Tool Failures: When a tool returns a 200 OK status code but contains text error details or empty lists, causing the LLM to hallucinate.

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
import json
 
tool_tracer = trace.get_tracer("agent.tools")
 
def instrumented_tool_execution(tool_name: str, arguments: dict, tool_function) -> str:
    # Nest this span directly under the current active parent trace/span
    with tool_tracer.start_as_current_span(f"tool.{tool_name}") as span:
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("tool.arguments", json.dumps(arguments))
        
        try:
            # Execute actual tool logic
            result = tool_function(**arguments)
            
            # Record output details
            span.set_attribute("tool.output_length", len(str(result)))

            # Detect silent failures in tool text responses before setting a
            # status: once a span is marked OK, later status changes are ignored
            result_text = str(result).lower()
            if "error" in result_text or "failed" in result_text:
                span.set_attribute("tool.silent_failure_detected", True)
                span.set_status(Status(StatusCode.ERROR, description="Silent failure inside tool return string"))
            else:
                span.set_status(Status(StatusCode.OK))

            return result
        except Exception as e:
            # Capture complete stack traces for diagnostic parsing
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, description=str(e)))
            # Feed parsing exception details back to the agent loop for self-correction
            return f"Error executing tool '{tool_name}': {str(e)}. Correct arguments and try again."
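
Substring matching on the stringified result misses the empty-list case described above. A slightly more structured check might look like the following sketch; `looks_like_silent_failure`, `ERROR_KEYS`, and `ERROR_MARKERS` are hypothetical names, and the key and marker lists are assumptions to adapt to what your tools actually return.

```python
from typing import Any

# Hypothetical key names and markers; adjust to your tools' real output.
ERROR_KEYS = ("error", "errors", "exception")
ERROR_MARKERS = ("error", "failed", "not found")


def looks_like_silent_failure(result: Any) -> bool:
    """Heuristically flag tool results that signal failure without raising."""
    if result is None:
        return True
    if isinstance(result, (list, tuple, set, dict)) and len(result) == 0:
        return True  # empty collections often mean "nothing found"
    if isinstance(result, dict):
        # A truthy error field is a stronger signal than free-text matching.
        return any(result.get(key) for key in ERROR_KEYS)
    if isinstance(result, str):
        text = result.strip().lower()
        return text == "" or any(marker in text for marker in ERROR_MARKERS)
    return False


print(looks_like_silent_failure([]))
print(looks_like_silent_failure({"status": 200, "error": "rate limited"}))
print(looks_like_silent_failure({"status": 200, "items": [1]}))
```

Text matching remains crude: a healthy result such as "No errors found" would still be flagged, so treat the flag as a signal for review rather than proof of failure.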

4. 🗂️ Prompt Version Correlative Tracing

Because prompts are code, runtime traces must include metadata indicating which prompt version or template generated the input payload. This allows engineers to detect prompt drift and isolate regressions when prompt templates are updated.

Trace Metadata Payload
├── trace_id: "8f9a2b..."
├── prompt_template: "refund_order_v1.2"
├── prompt_hash: "sha256:4a5c9d..."
└── variables: {"order_id": "ORD-109"}

[!TIP] Implementation Strategy: Before passing a rendered string to the LLM, inject the template configuration metadata (name, commit SHA, or template version hash) as custom attributes into the active tracing span. If your privacy compliance rules prohibit it, do not store the raw rendered prompt text on production spans.
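
One way to build that metadata is sketched below. The function name and the `prompt.*` attribute keys are illustrative assumptions, not part of any OpenTelemetry convention. The sketch hashes the template text (not the rendered prompt) and records variable names only, on the assumption that variable values may contain user data.

```python
import hashlib
import json
from typing import Any, Dict


def prompt_trace_attributes(
    template_name: str, template_text: str, variables: Dict[str, Any]
) -> Dict[str, str]:
    """Build span attributes that identify a template without the rendered prompt."""
    digest = hashlib.sha256(template_text.encode("utf-8")).hexdigest()
    return {
        # Attribute keys are illustrative, not an official convention.
        "prompt.template": template_name,
        "prompt.hash": f"sha256:{digest}",
        # Record variable names only; values may contain user data.
        "prompt.variable_names": json.dumps(sorted(variables)),
    }


attrs = prompt_trace_attributes(
    "refund_order_v1.2",
    "Process a refund for order {order_id}.",
    {"order_id": "ORD-109"},
)
print(attrs["prompt.template"], attrs["prompt.variable_names"])
```

The resulting dict can be attached to the active span with `trace.get_current_span().set_attributes(attrs)`. Because the hash changes whenever the template text changes, two traces with the same template name but different hashes point to an edit that was not versioned.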

5. 👥 Human-in-the-Loop Feedback Integration

Continuous evaluation relies on converting live production traffic into validated datasets. Observability pipelines must support linking human evaluations (e.g., user thumbs-up, expert annotations) back to specific Trace IDs.
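
A minimal sketch of that linkage follows, using an in-memory store as a stand-in for a real feedback table. `Feedback`, `FeedbackStore`, the score scale, and the example trace IDs are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Feedback:
    trace_id: str   # hex trace ID copied from the trace being rated
    source: str     # e.g. "user" or "expert"
    score: float    # assumed scale: 1.0 = thumbs-up, 0.0 = thumbs-down
    comment: str = ""


class FeedbackStore:
    """In-memory stand-in for a feedback table keyed by trace ID."""

    def __init__(self) -> None:
        self._by_trace: Dict[str, List[Feedback]] = defaultdict(list)

    def record(self, feedback: Feedback) -> None:
        self._by_trace[feedback.trace_id].append(feedback)

    def for_trace(self, trace_id: str) -> List[Feedback]:
        return list(self._by_trace.get(trace_id, []))

    def approved_trace_ids(self, min_score: float = 1.0) -> List[str]:
        """Trace IDs whose every rating meets the threshold (dataset candidates)."""
        return sorted(
            tid for tid, items in self._by_trace.items()
            if all(f.score >= min_score for f in items)
        )


store = FeedbackStore()
store.record(Feedback("8f9a2b", "user", 1.0))
store.record(Feedback("8f9a2b", "expert", 1.0, "Correct refund amount"))
store.record(Feedback("c41d07", "user", 0.0, "Wrong order"))
print(store.approved_trace_ids())
```

For this to work, the application has to return the trace ID to the client alongside the agent's response, so the feedback widget can send it back with the rating.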

6. 💰 Production Cost & Sampling Analytics

Autonomous agents can execute dozens of LLM calls in a single execution loop. Without guardrails, tracing pipelines can generate massive telemetry logs that dramatically increase storage and egress costs.

A. Trace-Level Sampling Heuristic

Use trace-level sampling rather than individual span sampling. If you sample spans independently, you will end up with fragmented traces that lack context.

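Sampling Rate = min(1.0, Budget Limit / Estimated Traces per Second)

Budget Limit must be expressed in the same unit as the estimate (traces per second) so that the ratio is a dimensionless rate.

For standard success paths, sample 1% to 5% of trace sessions.
For traces containing exceptions, silent tool failures, or loops exceeding 5 iterations, retain 100% of traces. A head-sampling decision is fixed when the root span starts and cannot be flipped later, so this rule needs tail-based sampling (for example, the OpenTelemetry Collector's tail sampling processor), which decides after the trace has completed.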
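
The formula and the two retention rules can be sketched as follows. `TraceSummary` and both function names are hypothetical, and the modulo bucket is a simplification of ratio-based sampling. The keep/drop decision needs the finished trace, so it would run wherever completed traces are buffered.

```python
from dataclasses import dataclass


def baseline_sampling_rate(budget_traces_per_sec: float,
                           estimated_traces_per_sec: float) -> float:
    """Sampling Rate = min(1.0, Budget Limit / Estimated Traces per Second)."""
    if estimated_traces_per_sec <= 0:
        return 1.0  # no traffic estimate: keep everything
    return min(1.0, budget_traces_per_sec / estimated_traces_per_sec)


@dataclass
class TraceSummary:
    trace_id: int
    had_exception: bool = False
    silent_tool_failure: bool = False
    loop_iterations: int = 0


def should_keep(trace: TraceSummary, rate: float, max_iterations: int = 5) -> bool:
    """Decide once per trace, so a trace is kept or dropped as a whole."""
    if trace.had_exception or trace.silent_tool_failure:
        return True
    if trace.loop_iterations > max_iterations:
        return True
    # Deterministic bucket from the trace ID: every span of a trace agrees.
    bucket = (trace.trace_id % 10_000) / 10_000
    return bucket < rate


rate = baseline_sampling_rate(budget_traces_per_sec=5, estimated_traces_per_sec=200)
print(rate)
print(should_keep(TraceSummary(trace_id=9999, had_exception=True), rate))
print(should_keep(TraceSummary(trace_id=9999), rate))
```

With a budget of 5 traces per second against an estimated 200, the baseline rate is 2.5%, which falls inside the 1% to 5% band for success paths.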

B. OTel PII Redaction SpanProcessor

To comply with data privacy policies, implement a custom OpenTelemetry SpanProcessor to intercept and scrub sensitive data (e.g., credit cards, API keys, emails) from prompts and completion parameters before they leave your servers.

from opentelemetry.sdk.trace import SpanProcessor, ReadableSpan
import re
 
class PIIRedactingSpanProcessor(SpanProcessor):
    def __init__(self):
        # Basic patterns for email, credit cards, and API keys
        self.email_regex = re.compile(r"[\w\.-]+@[\w\.-]+\.\w+")
        self.cc_regex = re.compile(r"\b(?:\d[ -]*?){13,16}\b")
        self.key_regex = re.compile(r"(?:api_key|secret|password|token)\s*[:=]\s*['\"][a-zA-Z0-9_\-]{16,}['\"]", re.IGNORECASE)
 
    def redact_text(self, text: str) -> str:
        if not isinstance(text, str):
            return text
        text = self.email_regex.sub("[REDACTED_EMAIL]", text)
        text = self.cc_regex.sub("[REDACTED_CREDIT_CARD]", text)
        text = self.key_regex.sub("[REDACTED_SECRET]", text)
        return text
 
    def on_start(self, span, parent_context=None):
        pass
 
    def on_end(self, span: ReadableSpan):
        # Inspect and sanitize standard GenAI and custom attributes
        if span.attributes:
            mutable_attributes = dict(span.attributes)
            modified = False
            
            for key, val in mutable_attributes.items():
                if isinstance(val, str):
                    redacted_val = self.redact_text(val)
                    if redacted_val != val:
                        mutable_attributes[key] = redacted_val
                        modified = True
            
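            if modified:
                # ReadableSpan has no public setter, so this writes to a private
                # field. It depends on SDK internals and may break across versions;
                # register this processor before the exporting processor so the
                # exporter sees the redacted values.
                span._attributes = mutable_attributes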

7. ⚖️ Platform Integration Comparison

Selecting the right observability backend depends on framework ties and data sovereignty requirements:

Platform	Strengths	Weaknesses	Best For
LangSmith	Native debugging UI; deep integration with LangChain/LangGraph; trace playground.	High SaaS cost; vendor lock-in to the LangChain ecosystem.	LangGraph-heavy applications and enterprise SaaS teams.
Langfuse	Open-source (MIT); framework-agnostic; strong prompt version management & user feedback integration.	Requires self-hosting or managed hosting; fewer automated clustering diagnostics.	Teams requiring complete data sovereignty and custom telemetry.
Arize Phoenix	OpenTelemetry native; strong retrieval/RAG vector search diagnostics; local notebook support.	Lacks native prompt versioning repository.	ML-centric engineering teams focusing on evaluation metrics and OTel pipelines.

