📥 Agentic Document Workflows (ADW)
In enterprise AI, documents (PDFs, contracts, invoices, DOCX files, and raw emails) often represent the largest store of unstructured data. Traditional Intelligent Document Processing (IDP) systems rely on static regex rules and simple OCR templates that break when formatting changes. Agentic Document Workflows (ADW) combine Large Language Models (LLMs) with stateful multi-agent execution loops to autonomously parse, chunk, index, retrieve, evaluate, and act on document-based knowledge.
🏗️ 1. Document Ingestion Architecture
Document ingestion requires handling diverse layouts, scanned text, embedded tables, and multi-column formats. The pipeline must accept raw files, route them to specialized parsers, and build a unified text or Markdown representation.
Ingestion & Processing Pipeline
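As a minimal sketch of the routing step described above, the snippet below dispatches a raw file to a parser chosen by file extension and returns a single Markdown string. The parser functions (`parse_pdf`, `parse_docx`, `parse_email`), the `PARSERS` registry, and the `ingest` helper are hypothetical placeholders, not part of any library; in practice each stub would wrap one of the parsing engines compared in this section.

```python
from pathlib import Path
from typing import Callable, Dict


# Hypothetical parser stubs: each would call a real engine and return Markdown.
def parse_pdf(raw: bytes) -> str:
    return f"<!-- pdf, {len(raw)} bytes -->"


def parse_docx(raw: bytes) -> str:
    return f"<!-- docx, {len(raw)} bytes -->"


def parse_email(raw: bytes) -> str:
    return raw.decode("utf-8", errors="replace")


# Registry mapping a lowercase file extension to its parser.
PARSERS: Dict[str, Callable[[bytes], str]] = {
    ".pdf": parse_pdf,
    ".docx": parse_docx,
    ".eml": parse_email,
}


def ingest(filename: str, raw: bytes) -> str:
    """Route a raw file to the parser registered for its extension."""
    suffix = Path(filename).suffix.lower()
    parser = PARSERS.get(suffix)
    if parser is None:
        raise ValueError(f"No parser registered for {suffix!r}")
    return parser(raw)
```

Keeping the registry as plain data makes it straightforward to add a parser for a new format without touching the routing logic.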
Ingestion & Parsing Engines Comparison
| Ingestion Parser | Layout Awareness | Scanned PDF Support | Table Extraction Accuracy | API Overhead |
|---|---|---|---|---|
| AWS Textract | Medium (Grid-based detection) | High (Excellent cloud OCR) | High (Exposes structured cells) | High (Network API latency) |
| Azure Doc Intelligence | High (Semantic block classification) | High (Strong cloud OCR) | Very High (Merges table rows) | High (Network API latency) |
| LlamaParse | Very High (Optimized for LLM ingestion) | Medium (Requires backend OCR) | High (Outputs markdown tables) | Medium (Cloud-based parsing) |
| unstructured.io (Local) | Medium (Rule-based structure parsing) | Low (Requires local Tesseract) | Medium (Can break on complex tables) | Low (Runs locally inside container) |
⚙️ 2. Document Processing & Structure Extraction
Extracting unstructured document text into structured JSON is the cornerstone of downstream retrieval. Layout-aware processing parses headers, strips page numbers, and normalizes unicode characters.
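The cleanup steps mentioned above can be sketched with the standard library alone. The example assumes that page numbers appear on their own line (for example `12` or `Page 12 of 40`); that heuristic, the `PAGE_NUMBER_LINE` pattern, and the `clean_page_text` helper are illustrative assumptions and would need tuning per document family, since a line holding only a number is dropped even when it is real content.

```python
import re
import unicodedata

# Assumption: page numbers stand alone on a line, e.g. "12" or "Page 12 of 40".
PAGE_NUMBER_LINE = re.compile(r"^\s*(page\s+)?\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE)


def clean_page_text(text: str) -> str:
    """Normalize unicode and drop standalone page-number lines."""
    # NFKC folds compatibility characters such as ligatures and non-breaking spaces.
    normalized = unicodedata.normalize("NFKC", text)
    kept = [ln for ln in normalized.splitlines() if not PAGE_NUMBER_LINE.match(ln)]
    return "\n".join(kept)
```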
The Python code below uses Pydantic to define a schema for the structured metadata and table arrays extracted from a legal document.
```python
from pydantic import BaseModel, Field
from typing import List, Optional
from datetime import date


class TableCell(BaseModel):
    row_index: int
    column_index: int
    content: str


class ExtractedTable(BaseModel):
    table_id: str = Field(description="Unique identifier for the table in the document")
    headers: List[str] = Field(description="Headers of the table columns")
    cells: List[TableCell] = Field(description="Individual cells mapping table data")


class DocumentMetadataSchema(BaseModel):
    title: str = Field(description="The formal title of the document")
    document_date: Optional[date] = Field(description="The signing or effective date of the document")
    signatories: List[str] = Field(description="Parties executing the contract or agreement")
    governing_law: str = Field(description="Jurisdiction governing the document terms")
    extracted_tables: List[ExtractedTable] = Field(description="List of all tables detected in the document")


# Example usage with structured output parser (e.g. OpenAI SDK)
# completion = client.beta.chat.completions.parse(
#     model="gpt-4o",
#     messages=[{"role": "user", "content": "Extract schema from raw contract markdown..."}],
#     response_format=DocumentMetadataSchema
# )
```
✂️ 3. Chunking Strategies & Hierarchical Indexing
Naive character-count splitting cuts sentences in half and severs semantic context. Align the chunking strategy with the document structure instead:
Chunking Strategy Matrix
| Strategy | Boundary Type | Semantic Preservation | Context Efficiency | System Overhead |
|---|---|---|---|---|
| Fixed-size Overlapping | Character or Token limits | Low (Splits mid-sentence) | Medium (Includes redundant text) | Extremely Low |
| Semantic / Header-based | Section headers (#, ##) | High (Keeps sections whole) | High (Context remains cohesive) | Medium |
| Hierarchical / Parent-Child | Nested hierarchy mappings | Very High (Keeps details linked) | High (Retrieves target + parent) | High (Requires recursive indexes) |
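To make the header-based row of the matrix concrete, here is a minimal sketch that splits Markdown into one chunk per header-delimited section. The `chunk_by_headers` function and the `(preamble)` label for text before the first header are assumptions made for illustration; the sketch does not handle `#` characters inside fenced code, and it drops sections that contain no body text.

```python
import re
from typing import Dict, List

# Matches ATX-style Markdown headers such as "# Title" or "## Subsection".
HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headers(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into one chunk per header-delimited section."""
    chunks: List[Dict[str, str]] = []
    title, lines = "(preamble)", []
    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            # Close the previous section before starting a new one.
            if any(l.strip() for l in lines):
                chunks.append({"title": title, "text": "\n".join(lines).strip()})
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    # Flush the final section.
    if any(l.strip() for l in lines):
        chunks.append({"title": title, "text": "\n".join(lines).strip()})
    return chunks
```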
Parent-Child Chunk Hierarchical Map
Hierarchical chunking indexes small passage-level snippets (child chunks) for vector matching but returns the larger parent section to the LLM context when a child matches. Small child chunks keep query-to-chunk matches precise, while the parent section supplies the surrounding context.
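The parent-child lookup can be sketched as follows. To stay self-contained, the example replaces embedding similarity with a toy token-overlap score; `overlap_score`, `ChildChunk`, and `retrieve_parent` are hypothetical names, and a real system would query a vector index instead.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ChildChunk:
    text: str
    parent_id: str  # key of the parent section this snippet came from


def overlap_score(query: str, text: str) -> int:
    """Toy similarity: count of shared lowercase tokens (stand-in for vector search)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parent(query: str, children: List[ChildChunk], parents: Dict[str, str]) -> str:
    """Match against the small child chunks, but return the full parent section."""
    best = max(children, key=lambda c: overlap_score(query, c.text))
    return parents[best.parent_id]
```

The only structural requirement is that every child carries a pointer to its parent, so the index can stay small while the context handed to the LLM stays complete.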
🔍 4. Advanced Retrieval Architectures
To build production RAG (Retrieval-Augmented Generation) document agents, standard vector search is insufficient. Implement a multi-stage retrieval architecture:
- Hybrid Search (Sparse + Dense): Combine dense embedding similarity (e.g., cosine similarity) with sparse keyword matching (BM25).
- Metadata Pre-Filtering: Apply strict metadata filters (e.g., matching the tenant ID or execution date range) to prune the search space before executing vector distance calculations.
- Multi-Stage Reranking: Fetch a broad candidate list (e.g., `K=50`) from the hybrid index, then run those candidates through a cross-encoder reranker model (like Cohere or BGE-Reranker) to select the `top-5` most semantically relevant chunks.
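The three stages above can be sketched in a few lines. This example makes several assumptions not stated in the text: dense and sparse scores are precomputed per document, the two rankings are fused with reciprocal rank fusion (one common fusion choice, swapped in here for illustration), and the reranker is passed in as a plain callable standing in for a cross-encoder. The names `rrf_fuse` and `retrieve` are hypothetical.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]  # expected keys: id, tenant_id, dense, sparse, text


def rrf_fuse(docs: List[Doc], k: int = 60) -> List[Doc]:
    """Fuse the dense and sparse rankings with reciprocal rank fusion."""
    fused: Dict[str, float] = {d["id"]: 0.0 for d in docs}
    for key in ("dense", "sparse"):
        ranked = sorted(docs, key=lambda d: d[key], reverse=True)
        for rank, doc in enumerate(ranked, start=1):
            fused[doc["id"]] += 1.0 / (k + rank)
    return sorted(docs, key=lambda d: fused[d["id"]], reverse=True)


def retrieve(
    docs: List[Doc],
    tenant_id: str,
    rerank: Callable[[Doc], float],
    candidates: int = 50,
    top_n: int = 5,
) -> List[Doc]:
    # Stage 1: the metadata pre-filter prunes the search space.
    allowed = [d for d in docs if d["tenant_id"] == tenant_id]
    # Stage 2: hybrid fusion produces a broad candidate list.
    shortlist = rrf_fuse(allowed)[:candidates]
    # Stage 3: the reranker (a stand-in for a cross-encoder) picks the final chunks.
    return sorted(shortlist, key=rerank, reverse=True)[:top_n]
```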
🤖 5. Agentic Document Workflow Patterns
Enterprise workflows require agents to iteratively refine searches, cross-check compliance terms, and draft responses.
The Python code below demonstrates a multi-agent contract review workflow where a coordinator routes a document review payload to a clause extractor agent, then routes the results to a compliance validator agent.
```python
from typing import Any, Dict


class ClauseExtractorAgent:
    def execute(self, doc_text: str) -> Dict[str, Any]:
        # Extract target contract clauses (stubbed with fixed values here;
        # a real agent would derive them from doc_text).
        extracted_clauses = {
            "termination_notice": "30 days written notice",
            "liability_cap": "$10,000",
        }
        return {"extracted_clauses": extracted_clauses}


class ComplianceValidatorAgent:
    def execute(self, clauses: Dict[str, Any]) -> Dict[str, Any]:
        # Validate extracted clauses against company policy
        violations = []
        cap = clauses.get("liability_cap", "")
        if "$10,000" in cap:
            violations.append(
                "Liability cap of $10,000 violates the minimum policy cap of $50,000."
            )
        return {
            "compliant": len(violations) == 0,
            "violations": violations,
        }


class DocumentOrchestrator:
    def __init__(self):
        self.extractor = ClauseExtractorAgent()
        self.validator = ComplianceValidatorAgent()

    def review_contract(self, doc_text: str) -> Dict[str, Any]:
        # 1. Extract contract clauses
        extraction_result = self.extractor.execute(doc_text)
        # 2. Validate extracted clauses
        validation_result = self.validator.execute(extraction_result["extracted_clauses"])
        # 3. Consolidate report
        return {
            "clauses": extraction_result["extracted_clauses"],
            "compliance": validation_result,
        }


# orchestrator = DocumentOrchestrator()
# report = orchestrator.review_contract("This agreement dictates that liability cap is $10,000...")
```
👥 6. Human-in-the-Loop Review Gates
When the compliance agent flags a policy violation (or when metadata extraction confidence falls below a set threshold), the execution loop must pause and request human validation.
- Pause Loop & Persist State: The orchestrator serializes the active state graph and updates the database thread status to `SUSPENDED`.
- Alert Queue Push: The payload containing the violating clause and the validator’s reasoning is pushed to an approval dashboard queue.
- Rehydrate & Resume: Once a compliance manager approves or overrides the violation, the orchestrator rehydrates the state graph from persistent storage, injects the human response, and resumes the loop.
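The pause, queue, and resume steps above can be sketched with JSON serialization. Everything named here is a hypothetical stand-in: `STATE_STORE` plays the role of the database, `APPROVAL_QUEUE` the dashboard queue, and the `RUNNING` status is an assumed counterpart to `SUSPENDED`.

```python
import json
from typing import Any, Dict, List

# Hypothetical in-memory stand-ins for a database table and an approval queue.
STATE_STORE: Dict[str, str] = {}
APPROVAL_QUEUE: List[Dict[str, str]] = []


def suspend(thread_id: str, state: Dict[str, Any], reason: str) -> None:
    """Persist the workflow state and enqueue it for human review."""
    record = {"status": "SUSPENDED", "state": state}
    STATE_STORE[thread_id] = json.dumps(record)
    APPROVAL_QUEUE.append({"thread_id": thread_id, "reason": reason})


def resume(thread_id: str, human_decision: str) -> Dict[str, Any]:
    """Rehydrate the state, inject the human decision, and mark the thread running."""
    record = json.loads(STATE_STORE[thread_id])
    if record["status"] != "SUSPENDED":
        raise RuntimeError(f"Thread {thread_id} is not awaiting review")
    record["state"]["human_decision"] = human_decision
    record["status"] = "RUNNING"  # assumed status name for an active thread
    STATE_STORE[thread_id] = json.dumps(record)
    return record["state"]
```

Checking the status before resuming guards against applying the same approval twice.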
🧪 7. Evaluation & Tracing
RAG Evaluation Metrics
Document agents must be evaluated continuously using golden test datasets:
- Context Precision: Assesses whether the retrieved chunks are relevant to the user query.
- Context Recall: Verifies if the retrieval pipeline fetched all necessary chunks required to formulate the answer.
- Faithfulness (Groundedness): Measures if the generated response is derived only from the retrieved context, preventing hallucinations.
- Answer Relevance: Verifies if the final generated answer directly addresses the user’s initial question.
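The two retrieval metrics can be illustrated with simplified, set-based versions computed against a golden dataset that labels which chunk IDs are relevant to each query. These formulas are an assumption for illustration; evaluation frameworks often use rank-aware or LLM-judged variants, and faithfulness and answer relevance typically need a judge model, so they are not covered here.

```python
from typing import Iterable, Set


def context_precision(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of retrieved chunk IDs that are labeled relevant."""
    retrieved = list(retrieved)
    if not retrieved:
        return 0.0
    return sum(1 for cid in retrieved if cid in relevant) / len(retrieved)


def context_recall(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of relevant chunk IDs that were actually retrieved."""
    if not relevant:
        return 1.0  # nothing was required, so nothing was missed
    return len(set(retrieved) & relevant) / len(relevant)
```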
OpenTelemetry Spans
Trace execution latency across the document pipeline using nested OTel spans:
```text
Parent Trace: Document QA Loop
├── Span 1: Ingest & Parse PDF (LlamaParse API latency)
├── Span 2: Vector DB Hybrid Retrieval (Search query + metadata pre-filters)
├── Span 3: Reranker Execution (Cross-encoder reranking time)
└── Span 4: Agent Reasoning Loop (Token usage metrics & LLM latency)
```
🔒 8. Security & Tenant Isolation
Row-Level Security (RLS) Metadata Filters
To prevent data leaks in multi-tenant enterprise environments, vector index searches must apply strict RLS pre-filters. Users must not search the entire index space; instead, inject active authorization permissions directly into the vector database query payload:
```json
{
  "vector": [0.12, -0.43, 0.89, "..."],
  "filter": {
    "tenant_id": { "$eq": "tenant-908" },
    "authorized_roles": { "$in": ["admin", "compliance-auditor"] }
  },
  "top_k": 5
}
```
PII Redaction at Ingestion Barrier
Before writing chunks to the vector database or passing them to third-party LLMs:
- Run text through a named entity recognition (NER) engine (e.g., Microsoft Presidio).
- Mask sensitive fields: replace credit card numbers, SSNs, and personal addresses with metadata placeholders (e.g., `[REDACTED_SSN]`).
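The masking step can be sketched with regular expressions. This is a deliberately simple stand-in for an NER engine, not a substitute for one: the two patterns cover only US-style SSNs and 16-digit card numbers, and both `redact` and the `[REDACTED_CARD]` placeholder are assumptions introduced for this example.

```python
import re

# Illustrative patterns only; a production pipeline would rely on an NER engine.
PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED_SSN]"),
    (re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"), "[REDACTED_CARD]"),
]


def redact(text: str) -> str:
    """Replace each matched PII span with its placeholder before indexing."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Running redaction before chunks are embedded means the raw values never reach the vector database or a third-party LLM.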
🔗 Related Sections
- Memory Systems — Context window limits and conversation memory storage.
- Agent State Management — HITL gates and graph state serialization patterns.
- Building AI Agents — Orchestrating agent workflows and prompt routing loops.
- Model Context Protocol — Structuring file retrieval tools as MCP server resources.
- Agent Observability & Tracing — OpenTelemetry GenAI span semantic conventions.
- Agent Security & Guardrails — Sandboxing untrusted process runtimes.