🛡️ Agent Security & Guardrails
Securing autonomous AI agents is uniquely challenging because agents dynamically generate plans, call APIs, and execute code. When an agent is granted access to tools and external data, security must shift from static access policies to real-time runtime containment, zero-trust protocols, and input/output validation.
1. ⚖️ Excessive Agency & Scoped Permissions
Excessive Agency (OWASP LLM06) occurs when an agent is granted permissions beyond what is required to complete its tasks (e.g., a customer support agent possessing access to a tool that deletes customer records).
Implement a least-privilege permission model by assigning specific tool access scopes to defined agent roles:
| Agent Role | DB Access Scope | HTTP Action Allowed | Allowed Tools |
|---|---|---|---|
| Reader Agent | Read-Only | GET | read_document, search_knowledge_base |
| operator Agent | Read-Write (Non-destructive) | GET, POST, PUT | update_order_status, send_email |
| Admin Agent | Full Admin (Destructive) | GET, POST, PUT, DELETE | refund_order, provision_server |
[!IMPORTANT] Least Privilege Scope Enforcement: Enforce authorization at the API Gateway or database connection layer, not inside the LLM prompt. Prompt instructions saying “Do not call refund_order” are easily bypassed via prompt injection.
2. 💻 Sandboxed Code Execution Environments
When agents generate and execute arbitrary code (e.g., running python scripts for math calculation or data analysis), the runtime environment must be isolated to prevent host kernel compromises, local network attacks, or resource exhaustion.
Isolation Runtimes Comparison
| Sandbox Technology | Isolation Type | Startup Latency | Syscall Restriction | Egress Network |
|---|---|---|---|---|
| Docker Container | OS-level Namespace | Low (~100ms) | Weak (Shares host kernel) | Allowed by default |
| gVisor Container | User-space Kernel virtualization | Low (~150ms) | Strong (Intercepts system calls) | Configurable |
| Firecracker MicroVM | Hardware-level Virtualization | Medium (~150ms) | Full guest kernel isolation | Isolated by default |
| WebAssembly (WASM) | Software-level bytecode sandbox | Extremely Low (less than 5ms) | Limited to WASI imports | Blocked by default |
Ephemeral MicroVM Sandbox Implementation (E2B SDK)
from e2b_code_interpreter import Sandbox
def execute_untrusted_agent_code(python_code: str) -> str:
# 1. Instantiate a hardware-isolated Firecracker microVM sandbox
with Sandbox() as sandbox:
try:
# 2. Run the code in the guest VM
execution = sandbox.run_code(python_code)
# 3. Retrieve standard output or error details
if execution.error:
return f"Execution Error: {execution.error.message}"
return execution.results[0].text if execution.results else "Success (No stdout)"
except Exception as e:
raise RuntimeError(f"Sandbox virtualization failed: {str(e)}")3. 🚨 Prompt Injection Defense (Direct & Indirect)
Prompt injection attacks hijack the model’s control flow, overriding system instructions to execute malicious agent actions.
Injection Mitigation Strategies
- System/User Message Isolation: Always use the API-native messaging format (separating
"role": "system"from"role": "user"). Do not concatenate system prompts and inputs into a single text block. - Content Delimiters: When injecting untrusted context (e.g., retrieved RAG documents or tool outputs), enclose them in XML tags or distinct brackets (e.g.,
<document_context>and</document_context>) and instruct the system prompt to treat content within these blocks strictly as data. - Prompt Sanitization Middleware: Scan inputs using lightweight classifier models (such as Llama Guard or Microsoft Prompt Shield) before processing them in the agent loop.
4. 🛠️ Tool Argument Sanitization & Code Injection
Agents must validate tool parameters strictly to prevent injection attacks (e.g., executing ; rm -rf / in a bash command execution tool).
from pydantic import BaseModel, Field, field_validator
import shlex
class ShellExecuteSchema(BaseModel):
# Enforce strict variable typing and parameter validation
script_path: str = Field(description="The absolute path of the target script to execute")
script_arguments: str = Field(description="Alphanumeric arguments to pass to the script")
@field_validator("script_path")
@classmethod
def validate_safe_directory(cls, val: str) -> str:
# Prevent directory traversal attacks
if "../" in val or val.startswith("/etc") or val.startswith("/var"):
raise ValueError("Directory traversal or system file access detected.")
return val
@field_validator("script_arguments")
@classmethod
def sanitize_arguments(cls, val: str) -> str:
# Escape arguments to prevent command splitting and bash command execution
return shlex.quote(val)🔌 5. Model Context Protocol (MCP) Security Boundaries
The Model Context Protocol (MCP) defines trust boundaries between clients (e.g., Cursor, desktop LLM runtimes) and servers (local/remote databases, filesystem APIs).
When connecting agents to MCP servers, enforce these security parameters:
[Agent Client Application]
│ (OTel Context / Scoped JWT)
┌───────▼───────┐
│ MCP Client │
└───────┬───────┘
│ (JSON-RPC over STDIO / SSE)
┌───────▼───────┐
│ MCP Server │ (Verifies scopes & runs tools in Sandboxes)
└───────────────┘- Local Transport Isolation: Enforce communication over
stdiotransport using local Unix sockets or named pipes. If using Server-Sent Events (SSE) over HTTP, enforce mutual TLS (mTLS) and API token validation. - Server-Side Argument Validation: MCP servers must treat all incoming arguments as untrusted. Never assume the client has validated the variables.
- Process Sandboxing: Run local MCP servers in isolated containers or shell sandboxes, restricted from reading the developer’s root filesystem.
6. 🔒 Secrets Resolution & Output Guardrails
A. Secrets Resolution (Prompt Protection)
Never inject database passwords, API keys, or SaaS secrets directly into system prompts or thread history. Instead, use metadata placeholders (e.g., SECRET_STRIPE_KEY) and resolve the actual secret inside the tool execution runtime, keeping credentials isolated from the model’s output context.
B. Outbound Output Verification (Llama Guard)
Filter model outputs before delivering them to users or systems. This catches instances where the agent was hijacked and is attempting to extract internal instructions, secrets, or system prompts.
from openai import OpenAI
client = OpenAI()
def validate_agent_output(agent_response: str) -> bool:
# Use Llama Guard as a middleware output filter
moderation_response = client.chat.completions.create(
model="meta-llama/Llama-Guard-3-8B",
messages=[{"role": "user", "content": agent_response}]
)
result = moderation_response.choices[0].message.content.strip()
# Returns True if output is safe, False if flagged as unsafe
return "safe" in result.lower()🔗 Related Sections
- Building AI Agents — Human-in-the-loop (HITL) gate patterns.
- Agent Skills & Capabilities — Strict JSON schema registration for tools.
- Agent State Management — Persistent storage security and locking.
- Agent Observability & Tracing — Scrubbing and redacting PII from trace logs.