AI Engineering🤖 AI Agents🛡️ Agent Security & Guardrails
🛡️
Running production systems? Exemplar brings SRE, uptime monitoring, and incident management together so your team resolves outages faster and proves reliability to the business. Visit exemplar.dev →

🛡️ Agent Security & Guardrails

Securing autonomous AI agents is uniquely challenging because agents dynamically generate plans, call APIs, and execute code. When an agent is granted access to tools and external data, security must shift from static access policies to real-time runtime containment, zero-trust protocols, and input/output validation.


1. ⚖️ Excessive Agency & Scoped Permissions

Excessive Agency (OWASP LLM06) occurs when an agent is granted permissions beyond what is required to complete its tasks (e.g., a customer support agent possessing access to a tool that deletes customer records).

Implement a least-privilege permission model by assigning specific tool access scopes to defined agent roles:

Agent RoleDB Access ScopeHTTP Action AllowedAllowed Tools
Reader AgentRead-OnlyGETread_document, search_knowledge_base
operator AgentRead-Write (Non-destructive)GET, POST, PUTupdate_order_status, send_email
Admin AgentFull Admin (Destructive)GET, POST, PUT, DELETErefund_order, provision_server

[!IMPORTANT] Least Privilege Scope Enforcement: Enforce authorization at the API Gateway or database connection layer, not inside the LLM prompt. Prompt instructions saying “Do not call refund_order” are easily bypassed via prompt injection.


2. 💻 Sandboxed Code Execution Environments

When agents generate and execute arbitrary code (e.g., running python scripts for math calculation or data analysis), the runtime environment must be isolated to prevent host kernel compromises, local network attacks, or resource exhaustion.

Isolation Runtimes Comparison

Sandbox TechnologyIsolation TypeStartup LatencySyscall RestrictionEgress Network
Docker ContainerOS-level NamespaceLow (~100ms)Weak (Shares host kernel)Allowed by default
gVisor ContainerUser-space Kernel virtualizationLow (~150ms)Strong (Intercepts system calls)Configurable
Firecracker MicroVMHardware-level VirtualizationMedium (~150ms)Full guest kernel isolationIsolated by default
WebAssembly (WASM)Software-level bytecode sandboxExtremely Low (less than 5ms)Limited to WASI importsBlocked by default

Ephemeral MicroVM Sandbox Implementation (E2B SDK)

from e2b_code_interpreter import Sandbox
 
def execute_untrusted_agent_code(python_code: str) -> str:
    # 1. Instantiate a hardware-isolated Firecracker microVM sandbox
    with Sandbox() as sandbox:
        try:
            # 2. Run the code in the guest VM
            execution = sandbox.run_code(python_code)
            
            # 3. Retrieve standard output or error details
            if execution.error:
                return f"Execution Error: {execution.error.message}"
            return execution.results[0].text if execution.results else "Success (No stdout)"
        except Exception as e:
            raise RuntimeError(f"Sandbox virtualization failed: {str(e)}")

3. 🚨 Prompt Injection Defense (Direct & Indirect)

Prompt injection attacks hijack the model’s control flow, overriding system instructions to execute malicious agent actions.

Injection Mitigation Strategies

  • System/User Message Isolation: Always use the API-native messaging format (separating "role": "system" from "role": "user"). Do not concatenate system prompts and inputs into a single text block.
  • Content Delimiters: When injecting untrusted context (e.g., retrieved RAG documents or tool outputs), enclose them in XML tags or distinct brackets (e.g., <document_context> and </document_context>) and instruct the system prompt to treat content within these blocks strictly as data.
  • Prompt Sanitization Middleware: Scan inputs using lightweight classifier models (such as Llama Guard or Microsoft Prompt Shield) before processing them in the agent loop.

4. 🛠️ Tool Argument Sanitization & Code Injection

Agents must validate tool parameters strictly to prevent injection attacks (e.g., executing ; rm -rf / in a bash command execution tool).

from pydantic import BaseModel, Field, field_validator
import shlex
 
class ShellExecuteSchema(BaseModel):
    # Enforce strict variable typing and parameter validation
    script_path: str = Field(description="The absolute path of the target script to execute")
    script_arguments: str = Field(description="Alphanumeric arguments to pass to the script")
 
    @field_validator("script_path")
    @classmethod
    def validate_safe_directory(cls, val: str) -> str:
        # Prevent directory traversal attacks
        if "../" in val or val.startswith("/etc") or val.startswith("/var"):
            raise ValueError("Directory traversal or system file access detected.")
        return val
 
    @field_validator("script_arguments")
    @classmethod
    def sanitize_arguments(cls, val: str) -> str:
        # Escape arguments to prevent command splitting and bash command execution
        return shlex.quote(val)

🔌 5. Model Context Protocol (MCP) Security Boundaries

The Model Context Protocol (MCP) defines trust boundaries between clients (e.g., Cursor, desktop LLM runtimes) and servers (local/remote databases, filesystem APIs).

When connecting agents to MCP servers, enforce these security parameters:

[Agent Client Application]
           │ (OTel Context / Scoped JWT)
   ┌───────▼───────┐
   │  MCP Client   │
   └───────┬───────┘
           │ (JSON-RPC over STDIO / SSE)
   ┌───────▼───────┐
   │  MCP Server   │ (Verifies scopes & runs tools in Sandboxes)
   └───────────────┘
  1. Local Transport Isolation: Enforce communication over stdio transport using local Unix sockets or named pipes. If using Server-Sent Events (SSE) over HTTP, enforce mutual TLS (mTLS) and API token validation.
  2. Server-Side Argument Validation: MCP servers must treat all incoming arguments as untrusted. Never assume the client has validated the variables.
  3. Process Sandboxing: Run local MCP servers in isolated containers or shell sandboxes, restricted from reading the developer’s root filesystem.

6. 🔒 Secrets Resolution & Output Guardrails

A. Secrets Resolution (Prompt Protection)

Never inject database passwords, API keys, or SaaS secrets directly into system prompts or thread history. Instead, use metadata placeholders (e.g., SECRET_STRIPE_KEY) and resolve the actual secret inside the tool execution runtime, keeping credentials isolated from the model’s output context.

B. Outbound Output Verification (Llama Guard)

Filter model outputs before delivering them to users or systems. This catches instances where the agent was hijacked and is attempting to extract internal instructions, secrets, or system prompts.

from openai import OpenAI
 
client = OpenAI()
 
def validate_agent_output(agent_response: str) -> bool:
    # Use Llama Guard as a middleware output filter
    moderation_response = client.chat.completions.create(
        model="meta-llama/Llama-Guard-3-8B",
        messages=[{"role": "user", "content": agent_response}]
    )
    result = moderation_response.choices[0].message.content.strip()
    # Returns True if output is safe, False if flagged as unsafe
    return "safe" in result.lower()


🚀 10K+ page views in last 7 days
Developer Handbook 2026 © Exemplar.