Standardized terminology for the document processing pipeline, particularly around email ingestion.
    ACQUISITION                                                          INGESTION
    ----------                                                           ---------
    Email Source --> Relevancy Filter --> Email Triage --> S3 Upload --> Ingestion Queue
                            |                  |                                |
                         rejects            extracts                            v
                     spam/marketing     actionable items                   Transcription --> Extraction --> Complete
Key insight: Acquisition validates and prepares documents. Ingestion assumes all input is pre-validated and processes it.
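The split between the two stages can be sketched in Clojure as two small step runners. This is an illustrative sketch only: the function names, the `:rejected?` and `:reason` keys, and the stand-in steps are assumptions made for the example, not code taken from the actual namespaces.

```clojure
;; Hypothetical sketch: all names and map keys below are assumptions.

(defn run-acquisition
  "Runs acquisition steps in order and stops at the first rejection.
   Each step takes a state map and returns an updated state map."
  [steps state]
  (reduce (fn [acc step]
            (let [result (step acc)]
              (if (:rejected? result)
                (reduced result)   ; short-circuit: later steps never run
                result)))
          state
          steps))

(defn run-ingestion
  "Runs ingestion steps in order. There is no validation here because
   input is assumed to have been validated by acquisition."
  [steps document]
  (reduce (fn [acc step] (step acc)) document steps))

;; Stand-in steps, for demonstration only.
(defn reject-spam [state]
  (if (:spam? state)
    (assoc state :rejected? true :reason "spam")
    state))

(defn mark-triaged [state]
  (assoc state :triaged? true))
```

Under these assumptions, `(run-acquisition [reject-spam mark-triaged] email)` never reaches `mark-triaged` for a spam email, while `run-ingestion` simply threads the document through every step.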
Acquisition is the stage that handles document discovery and retrieval from external sources. All acquisition logic lives in com.getorcha.workers.acquisition.*.

The relevancy filter is a quick heuristic check during email processing that rejects obvious non-invoices.
| Aspect | Description |
|---|---|
| Location | Acquisition stage, runs BEFORE email triage |
| Purpose | Avoid wasting LLM calls on spam, newsletters, marketing |
| Method | Heuristic rules (no LLM) |
| Bias | Errs toward accepting; false positives handled by triage |
Accepts if:
Rejects if:
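The accept-biased behaviour of the relevancy filter can be sketched as a list of reject rules with a default of accepting. The two rules below are purely hypothetical placeholders, as are the function name and the email map keys (`:sender`, `:body`, `:attachments`); the actual accept and reject criteria are not reproduced here.

```clojure
(require '[clojure.string :as str])

;; Hypothetical rules for illustration only; not the real criteria.
(def reject-rules
  [{:reason "newsletter sender without attachments"
    :pred   (fn [{:keys [sender attachments]}]
              (and (str/starts-with? (str/lower-case (str sender)) "newsletter@")
                   (empty? attachments)))}
   {:reason "unsubscribe text without attachments"
    :pred   (fn [{:keys [body attachments]}]
              (and (str/includes? (str/lower-case (str body)) "unsubscribe")
                   (empty? attachments)))}])

(defn relevancy-filter
  "Returns {:accepted? true} unless some reject rule matches.
   Accepting by default keeps the filter biased toward passing
   email on to triage. No LLM call is made."
  [email]
  (if-let [rule (first (filter #((:pred %) email) reject-rules))]
    {:accepted? false :reason (:reason rule)}
    {:accepted? true}))
```

Because the default branch accepts, an email that no rule recognises still reaches triage, which matches the stated bias.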
Email triage is an LLM-based analysis that determines what is actionable in an email.
| Aspect | Description |
|---|---|
| Location | Acquisition stage, runs AFTER relevancy filter |
| Input | Full email: subject, sender, body, all attachments |
| Output | Array of extractable items |
| Previously called | "LLM classification" (deprecated term at this stage) |
Each extractable item in the output specifies its kind: attachment, body, or download-link.

Example output:
    [{:kind :attachment
      :filename "invoice-march-2026.pdf"
      :document-type-hint :invoice}
     {:kind :attachment
      :filename "packing-slip.pdf"
      :document-type-hint nil}]
Emails with no extractable items are marked as rejected with a reason.
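A minimal sketch of how the triage output might be turned into an outcome, assuming the item shape from the example output. The function names, the `:status` key, and the reason text are assumptions for illustration; only the three kinds and the "rejected with a reason" rule come from the description above.

```clojure
;; Hypothetical sketch: names and result keys are assumptions.

;; The three kinds named in this glossary, written as keywords.
(def item-kinds #{:attachment :body :download-link})

(defn valid-item?
  "True when the item's :kind is one of the known kinds."
  [item]
  (contains? item-kinds (:kind item)))

(defn triage-outcome
  "Maps the triage output (a collection of extractable items) to an outcome.
   An empty collection means the email is rejected with a reason."
  [items]
  (if (empty? items)
    {:status :rejected :reason "no extractable items"}
    {:status :accepted :items (vec items)}))
```

For instance, `(triage-outcome [])` yields a rejected outcome carrying the reason string, while any non-empty collection is passed through as accepted.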
Ingestion is the stage that processes acquired documents. It assumes all input is pre-validated and relevant, and lives in com.getorcha.workers.ingestion.*.
Transcription converts a document (PDF or image) into machine-readable text. It lives in com.getorcha.workers.ingestion.transcription and uses either :pdf-lib (embedded text) or :ocr (Google Document AI).

Extraction extracts structured data from transcribed text using an LLM. It lives in com.getorcha.workers.ingestion.extraction.

Document classification determines the document type (invoice, purchase order, GRN, etc.) during the ingestion pipeline.
| Aspect | Description |
|---|---|
| Location | Ingestion stage (planned) |
| Distinct from | Email triage (which operates at acquisition) |
| Input | Single document (already extracted from email) |
| Output | Document type enum |
Why separate from email triage:
| Term | Stage | Meaning |
|---|---|---|
| Relevancy filter | Acquisition | Heuristic spam rejection |
| Email triage | Acquisition | LLM determines extractable items |
| Document classification | Ingestion | Determines document type for processing |
| Transcription | Ingestion | Document to text |
| Extraction | Ingestion | Text to structured data |
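To tie the two implemented ingestion terms in the table together, here is a hedged sketch of transcription followed by extraction. Only the `:pdf-lib` and `:ocr` method keywords come from this glossary; the function names, map keys, stub bodies, and the "prefer embedded text, otherwise OCR" policy are assumptions for illustration.

```clojure
;; Hypothetical sketch: everything except :pdf-lib and :ocr is assumed.

(defmulti transcribe
  "Converts a document into text, dispatching on the chosen :method."
  :method)

(defmethod transcribe :pdf-lib [{:keys [embedded-text]}]
  ;; Stand-in for reading the text embedded in a PDF.
  embedded-text)

(defmethod transcribe :ocr [{:keys [ocr-fn content]}]
  ;; Stand-in for an OCR call; ocr-fn is injected so the sketch
  ;; does not depend on any external service.
  (ocr-fn content))

(defn choose-method
  "Assumed policy: use embedded text when present, otherwise OCR."
  [{:keys [embedded-text]}]
  (if (seq embedded-text) :pdf-lib :ocr))

(defn ingest
  "Transcription (document to text) followed by extraction
   (text to structured data). extract-fn stands in for the LLM call."
  [document extract-fn]
  (let [text (transcribe (assoc document :method (choose-method document)))]
    {:text text :data (extract-fn text)}))
```

A multimethod keeps each transcription method in its own `defmethod`, so adding another method would not require touching `ingest`.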