Acquisition + Ingestion Architecture
Container Diagram
C4Container
title Container Diagram - Acquisition + Ingestion Pipeline
Person(user, "User", "Sends invoices via email")
System_Boundary(orcha, "Orcha System") {
Container_Boundary(acquisition, "Acquisition Workers") {
Container(acq-orchestrator, "Acquisition Orchestrator", "Clojure", "Polls SQS, dispatches to handlers")
Container(ses-handler, "SES Handler", "Clojure", "Processes SES emails, parses MIME")
Container(gmail-handler, "Gmail Handler", "Clojure", "Syncs via Gmail Watch API")
Container(outlook-handler, "Outlook Handler", "Clojure", "Syncs via Microsoft Graph API")
Container(email-triage, "Email Triage", "Clojure + LLM", "Filters spam, extracts attachments")
}
Container_Boundary(ingestion, "Ingestion Workers") {
Container(ing-orchestrator, "Ingestion Orchestrator", "Clojure", "Polls SQS, runs pipeline")
Container(transcription, "Transcription", "Clojure", "OCR via Document AI, PDFBox, Gemini Vision")
Container(classification, "Classification", "Clojure + Gemini", "LLM classifies document type")
Container(extraction, "Extraction", "Clojure + Claude", "LLM extracts structured data")
Container(validation, "Validation", "Clojure", "Deterministic checks (VAT, IBAN, dates)")
Container(fraud-detection, "Fraud Detection", "Clojure", "Deterministic fraud rules")
Container(post-processing, "Post-Processing", "Clojure + Claude", "GL matching, tax compliance")
}
ContainerDb(db, "PostgreSQL", "PostgreSQL", "Stores organizations, tenants, documents, ingestion state")
}
System_Boundary(aws, "AWS") {
ContainerQueue(acq-queue, "email-acquire", "SQS", "Triggers acquisition")
ContainerQueue(ing-queue, "ingest", "SQS", "Triggers ingestion")
ContainerDb(s3-ses, "ses-emails", "S3", "Raw SES email storage")
ContainerDb(s3-storage, "storage", "S3", "Processed document storage")
}
System_Ext(claude, "Claude Sonnet 4.5", "Anthropic", "Main extraction, post-processing LLM")
System_Ext(gemini, "Gemini 2.5 Flash", "Google", "Fast classification, triage LLM")
System_Ext(docai, "Document AI", "Google", "OCR for scanned PDFs")
Rel(user, ses-handler, "Sends email to @mail.getorcha.com", "SES")
Rel(ses-handler, gmail-handler, "Represents different email sources")
Rel(outlook-handler, gmail-handler, "Represents different email sources")
Rel(ses-handler, acq-queue, "S3 event notification")
Rel(acq-queue, acq-orchestrator, "Polls")
Rel(acq-orchestrator, ses-handler, "Dispatches")
Rel(acq-orchestrator, gmail-handler, "Dispatches")
Rel(acq-orchestrator, outlook-handler, "Dispatches")
Rel(ses-handler, email-triage, "Analyzes")
Rel(gmail-handler, email-triage, "Analyzes")
Rel(outlook-handler, email-triage, "Analyzes")
Rel(email-triage, ing-queue, "Queues relevant emails")
Rel(email-triage, s3-ses, "Stores raw emails")
Rel(ing-queue, ing-orchestrator, "Polls")
Rel(ing-orchestrator, transcription, "Extracts text")
Rel(ing-orchestrator, classification, "Classifies")
Rel(ing-orchestrator, extraction, "Extracts data")
Rel(ing-orchestrator, validation, "Validates")
Rel(ing-orchestrator, fraud-detection, "Checks fraud")
Rel(ing-orchestrator, post-processing, "Post-processes")
Rel(transcription, docai, "OCR calls")
Rel(transcription, s3-storage, "Reads PDF")
Rel(transcription, s3-ses, "Reads raw email")
Rel(classification, gemini, "Classifies")
Rel(email-triage, gemini, "Triage LLM")
Rel(extraction, claude, "Extracts")
Rel(post-processing, claude, "Matches")
Rel(ing-orchestrator, db, "Stores document & ingestion state")
Rel(validation, db, "Reads/writes validation state")
Rel(fraud-detection, db, "Checks duplicates")
Rel(post-processing, s3-storage, "Uploads final PDF")
Data Flow Summary
Acquisition Flow
- Email arrives via SES → S3 ses-emails bucket
- S3 event → message to email-acquire SQS queue
- Acquisition worker polls queue → dispatches to handler (SES/Gmail/Outlook)
- Handler parses email, extracts attachments
- Email triage (LLM) filters spam, determines relevance
- Relevant emails → queued to ingestion, uploaded to S3 storage
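The dispatch step in the flow above can be sketched as a lookup from email source to handler. This is a minimal illustration, not the actual worker code: the message shape (a dict with `source` and `id` keys), the handler function names, and the return values are all hypothetical, since the queue payload is not described here.

```python
from typing import Callable, Dict

# Hypothetical message shape: {"source": "ses" | "gmail" | "outlook", "id": "..."}.
# The real email-acquire payload is not specified in this document.
Handler = Callable[[dict], str]


def handle_ses(message: dict) -> str:
    # Stand-in for the SES handler (fetch the raw email, parse MIME).
    return f"ses:{message['id']}"


def handle_gmail(message: dict) -> str:
    # Stand-in for the Gmail handler.
    return f"gmail:{message['id']}"


def handle_outlook(message: dict) -> str:
    # Stand-in for the Outlook handler.
    return f"outlook:{message['id']}"


# One entry per email source named in the acquisition flow.
HANDLERS: Dict[str, Handler] = {
    "ses": handle_ses,
    "gmail": handle_gmail,
    "outlook": handle_outlook,
}


def dispatch(message: dict) -> str:
    """Route one dequeued message to the handler for its email source."""
    source = message.get("source")
    handler = HANDLERS.get(source)
    if handler is None:
        # Unknown sources fail loudly instead of being dropped silently.
        raise ValueError(f"unknown email source: {source!r}")
    return handler(message)
```

Keeping the routing in a table means a new email source only needs a new handler and one new entry, and the orchestrator loop itself does not change.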
Ingestion Pipeline
- Ingestion worker polls the ingest SQS queue
- Transcription: PDFBox → Document AI OCR → Gemini Vision (fallback)
- Classification: Gemini classifies document type (invoice, PO, etc.)
- Extraction: Claude extracts line items, totals, dates, IBAN
- Validation: Deterministic checks (VAT format, IBAN, date consistency)
- Fraud Detection: Duplicate detection, unusual amounts
- Post-Processing: Claude matches GL accounts, tax compliance
- Final document stored in PostgreSQL + S3 storage
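The transcription step lists three methods in order, with Gemini Vision as the fallback. One way to read that ordering is as a chain in which the first method to return usable text wins. The sketch below illustrates that idea only. The rule for falling through (a step raises or returns empty text) is an assumption, and the three step functions are hypothetical stand-ins that do not call PDFBox, Document AI, or Gemini.

```python
from typing import Callable, Optional, Sequence, Tuple

# A step takes raw document bytes and returns text, or None/"" if it has none.
Step = Tuple[str, Callable[[bytes], Optional[str]]]


def transcribe(document: bytes, steps: Sequence[Step]) -> Tuple[str, str]:
    """Try each step in order; return (step_name, text) from the first that yields text."""
    for name, step in steps:
        try:
            text = step(document)
        except Exception:
            # Assumption: a failing step falls through to the next one.
            continue
        if text and text.strip():
            return name, text
    raise RuntimeError("no transcription step produced text")


def pdfbox_text(document: bytes) -> Optional[str]:
    # Stand-in: only inputs marked as having a text layer succeed.
    return document[5:].decode("utf-8") if document.startswith(b"TEXT:") else None


def document_ai_ocr(document: bytes) -> Optional[str]:
    # Stand-in: OCR succeeds on inputs marked as scans and raises otherwise.
    if document.startswith(b"SCAN:"):
        return document[5:].decode("utf-8")
    raise ValueError("unsupported input")


def gemini_vision(document: bytes) -> Optional[str]:
    # Stand-in for the last-resort step.
    return "vision transcript"


# Order mirrors the pipeline description: PDFBox, then Document AI, then Gemini Vision.
STEPS = [
    ("pdfbox", pdfbox_text),
    ("document-ai", document_ai_ocr),
    ("gemini-vision", gemini_vision),
]
```

Returning the step name alongside the text makes it possible to record which method produced a transcript, which can help when tracing extraction quality back to its source.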
Key Components
| Component | Technology | Purpose |
|---|---|---|
| Acquisition Workers | Clojure | Poll acquisition queue, handle email sources |
| Ingestion Workers | Clojure | Poll ingestion queue, run processing pipeline |
| Email Triage | Clojure + Gemini | Filter spam, extract attachments |
| Transcription | PDFBox + Document AI + Gemini | Convert PDFs/images to text |
| Classification | Gemini 2.5 Flash | Classify document type |
| Extraction | Claude Sonnet 4.5 | Extract structured data |
| Post-Processing | Claude Sonnet 4.5 | GL matching, tax compliance |