Acquisition + Ingestion Architecture

Container Diagram

C4Container
  title Container Diagram - Acquisition + Ingestion Pipeline

  Person(user, "User", "Sends invoices via email")

  System_Boundary(orcha, "Orcha System") {
    
    Container_Boundary(acquisition, "Acquisition Workers") {
      Container(acq-orchestrator, "Acquisition Orchestrator", "Clojure", "Polls SQS, dispatches to handlers")
      Container(ses-handler, "SES Handler", "Clojure", "Processes SES emails, parses MIME")
      Container(gmail-handler, "Gmail Handler", "Clojure", "Syncs via Gmail Watch API")
      Container(outlook-handler, "Outlook Handler", "Clojure", "Syncs via Microsoft Graph API")
      Container(email-triage, "Email Triage", "Clojure + LLM", "Filters spam, extracts attachments")
    }

    Container_Boundary(ingestion, "Ingestion Workers") {
      Container(ing-orchestrator, "Ingestion Orchestrator", "Clojure", "Polls SQS, runs pipeline")
      Container(transcription, "Transcription", "Clojure", "OCR via Document AI, PDFBox, Gemini Vision")
      Container(classification, "Classification", "Clojure + Gemini", "LLM classifies document type")
      Container(extraction, "Extraction", "Clojure + Claude", "LLM extracts structured data")
      Container(validation, "Validation", "Clojure", "Deterministic checks (VAT, IBAN, dates)")
      Container(fraud-detection, "Fraud Detection", "Clojure", "Deterministic fraud rules")
      Container(post-processing, "Post-Processing", "Clojure + Claude", "GL matching, tax compliance")
    }

    ContainerDb(db, "PostgreSQL", "PostgreSQL", "Stores organizations, tenants, documents, ingestion state")
  }

  System_Boundary(aws, "AWS") {
    ContainerQueue(acq-queue, "email-acquire", "SQS", "Triggers acquisition")
    ContainerQueue(ing-queue, "ingest", "SQS", "Triggers ingestion")
    ContainerDb(s3-ses, "ses-emails", "S3", "Raw SES email storage")
    ContainerDb(s3-storage, "storage", "S3", "Processed document storage")
  }

  System_Ext(claude, "Claude Sonnet 4.5", "Anthropic", "Main extraction, post-processing LLM")
  System_Ext(gemini, "Gemini 2.5 Flash", "Google", "Fast classification, triage LLM")
  System_Ext(docai, "Document AI", "Google", "OCR for scanned PDFs")

  Rel(user, ses-handler, "Sends email to @mail.getorcha.com", "SES")
  Rel(ses-handler, gmail-handler, "Represents different email sources")
  Rel(outlook-handler, gmail-handler, "Represents different email sources")

  Rel(s3-ses, acq-queue, "S3 event notification")
  Rel(acq-queue, acq-orchestrator, "Polls")
  Rel(acq-orchestrator, ses-handler, "Dispatches")
  Rel(acq-orchestrator, gmail-handler, "Dispatches")
  Rel(acq-orchestrator, outlook-handler, "Dispatches")

  Rel(ses-handler, email-triage, "Analyzes")
  Rel(gmail-handler, email-triage, "Analyzes")
  Rel(outlook-handler, email-triage, "Analyzes")

  Rel(email-triage, ing-queue, "Queues relevant emails")
  Rel(email-triage, s3-ses, "Stores raw emails")

  Rel(ing-queue, ing-orchestrator, "Polls")
  Rel(ing-orchestrator, transcription, "Extracts text")
  Rel(ing-orchestrator, classification, "Classifies")
  Rel(ing-orchestrator, extraction, "Extracts data")
  Rel(ing-orchestrator, validation, "Validates")
  Rel(ing-orchestrator, fraud-detection, "Checks fraud")
  Rel(ing-orchestrator, post-processing, "Post-processes")

  Rel(transcription, docai, "OCR calls")
  Rel(transcription, s3-storage, "Reads PDF")
  Rel(transcription, s3-ses, "Reads raw email")

  Rel(classification, gemini, "Classifies")
  Rel(email-triage, gemini, "Triage LLM")

  Rel(extraction, claude, "Extracts")
  Rel(post-processing, claude, "Matches")

  Rel(ing-orchestrator, db, "Stores document & ingestion state")
  Rel(validation, db, "Reads/writes validation state")
  Rel(fraud-detection, db, "Checks duplicates")
  Rel(post-processing, s3-storage, "Uploads final PDF")

Data Flow Summary

Acquisition Flow

  1. Email arrives via SES → S3 ses-emails bucket
  2. S3 event → message to email-acquire SQS queue
  3. Acquisition worker polls queue → dispatches to handler (SES/Gmail/Outlook)
  4. Handler parses email, extracts attachments
  5. Email triage (LLM) filters spam, determines relevance
  6. Relevant emails → queued to ingestion, uploaded to S3 storage
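
Steps 2 to 4 above can be sketched as a small dispatch routine. This is an illustrative sketch only, written in Python for brevity even though the workers are Clojure. The `Records[].s3.bucket.name` and `Records[].s3.object.key` fields follow the standard S3 event notification layout, in which object keys are URL-encoded. Everything else is an assumption made for the example: the handler functions, the `source` and `mailbox` fields on Gmail/Outlook messages, and the rule that any message containing `Records` belongs to the SES handler.

```python
import json
from urllib.parse import unquote_plus


def s3_objects(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification."""
    return [
        # S3 URL-encodes object keys in notifications, so decode them here.
        (rec["s3"]["bucket"]["name"], unquote_plus(rec["s3"]["object"]["key"]))
        for rec in event.get("Records", [])
    ]


# Hypothetical handlers: each returns a description of the work it would do.
def handle_ses(message: dict) -> dict:
    return {"source": "ses", "objects": s3_objects(message)}


def handle_gmail(message: dict) -> dict:
    return {"source": "gmail", "mailbox": message["mailbox"]}


def handle_outlook(message: dict) -> dict:
    return {"source": "outlook", "mailbox": message["mailbox"]}


HANDLERS = {"ses": handle_ses, "gmail": handle_gmail, "outlook": handle_outlook}


def dispatch(body: str) -> dict:
    """Route one SQS message body to the handler for its email source."""
    message = json.loads(body)
    # Assumption: S3 notifications carry "Records"; other messages carry "source".
    source = "ses" if "Records" in message else message.get("source")
    handler = HANDLERS.get(source)
    if handler is None:
        raise ValueError(f"unknown email source: {source!r}")
    return handler(message)
```

Keeping the handlers in a lookup table means a new email source only needs a new entry, and an unrecognized source fails loudly instead of being silently dropped.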

Ingestion Pipeline

  1. Ingestion worker polls ingest SQS queue
  2. Transcription: PDFBox text extraction first, then Document AI OCR for scanned PDFs, with Gemini Vision as the fallback
  3. Classification: Gemini classifies document type (invoice, PO, etc.)
  4. Extraction: Claude extracts line items, totals, dates, IBAN
  5. Validation: Deterministic checks (VAT format, IBAN, date consistency)
  6. Fraud Detection: Duplicate detection, unusual amounts
  7. Post-Processing: Claude matches GL accounts, tax compliance
  8. Final document stored in PostgreSQL + S3 storage
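
The stage ordering and the transcription fallback chain above might be modeled roughly as follows. This is a sketch under stated assumptions, not the actual worker code: the real pipeline is Clojure, the stub functions stand in for PDFBox, Document AI, Gemini, and Claude calls, and the rule "move to the next transcriber when the previous one returns no text" is one assumed reading of the fallback. The document map and its keys (`content`, `text`, `type`, `completed`) are hypothetical.

```python
from typing import Callable, Optional

Transcriber = Callable[[bytes], Optional[str]]
Stage = Callable[[dict], dict]


def transcribe(content: bytes, transcribers: list[tuple[str, Transcriber]]) -> tuple[str, str]:
    """Try each transcriber in order; return (method, text) from the first that yields text."""
    for name, fn in transcribers:
        text = fn(content)
        if text and text.strip():
            return name, text
    raise RuntimeError("no transcriber produced text")


def run_pipeline(doc: dict, stages: list[tuple[str, Stage]]) -> dict:
    """Run stages in order, recording each completed stage name on the document."""
    for name, stage in stages:
        doc = stage(doc)
        doc.setdefault("completed", []).append(name)
    return doc


# Stand-in transcribers (hypothetical); the real ones call external tools and services.
def pdfbox_stub(content: bytes) -> Optional[str]:
    return None  # pretend the PDF is a scan with no embedded text layer


def docai_stub(content: bytes) -> Optional[str]:
    return "INVOICE 42 TOTAL 100.00"


def gemini_vision_stub(content: bytes) -> Optional[str]:
    return "fallback text"


TRANSCRIBERS = [
    ("pdfbox", pdfbox_stub),
    ("document-ai", docai_stub),
    ("gemini-vision", gemini_vision_stub),
]


def transcription_stage(doc: dict) -> dict:
    method, text = transcribe(doc["content"], TRANSCRIBERS)
    return {**doc, "text": text, "transcribed_by": method}


def classification_stage(doc: dict) -> dict:
    # Stand-in for the LLM classifier: a trivial keyword rule.
    return {**doc, "type": "invoice" if "INVOICE" in doc["text"] else "other"}


# Extraction, validation, fraud detection, and post-processing would follow the same shape.
STAGES = [("transcription", transcription_stage), ("classification", classification_stage)]

result = run_pipeline({"content": b"%PDF-"}, STAGES)
```

Recording completed stage names on the document is one way to support the "ingestion state" that the orchestrator stores in PostgreSQL, so that a retried message could resume instead of restarting. Whether the real pipeline resumes this way is not stated here.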

Key Components

Component            Technology                     Purpose
-------------------  -----------------------------  --------------------------------------------
Acquisition Workers  Clojure                        Poll acquisition queue, handle email sources
Ingestion Workers    Clojure                        Poll ingestion queue, run processing pipeline
Email Triage         Clojure + Gemini               Filter spam, extract attachments
Transcription        PDFBox + Document AI + Gemini  Convert PDFs/images to text
Classification       Gemini 2.5 Flash               Classify document type
Extraction           Claude Sonnet 4.5              Extract structured data
Post-Processing      Claude Sonnet 4.5              GL matching, tax compliance
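
The Validation stage of the ingestion pipeline is described as deterministic checks on VAT numbers, IBANs, and dates. As an illustration of what "deterministic" means here, one such check could be the standard IBAN mod-97 checksum. The function name and the decision to skip country-specific length rules are assumptions made for this sketch; the actual validation rules may be stricter.

```python
import re

# Two-letter country code, two check digits, then 11 to 30 alphanumerics (15 to 34 total).
_IBAN_SHAPE = re.compile(r"[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}")


def iban_checksum_ok(iban: str) -> bool:
    """Return True if the IBAN has a plausible shape and passes the mod-97 check."""
    compact = iban.replace(" ", "").upper()
    if not _IBAN_SHAPE.fullmatch(compact):
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Map digits to themselves and letters to 10..35 (A=10, ..., Z=35).
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```

Because the check needs no network call or model, it gives the same answer on every run, which makes it a useful guard on values that the LLM extraction step produced.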