Orcha Ingestion: Terminology

Standardized terminology for the document processing pipeline, particularly around email ingestion.


Pipeline Flow

                      ACQUISITION                                    INGESTION
                      -----------                                    ---------
Email Source --> Relevancy Filter --> Email Triage --> S3 Upload --> Ingestion Queue
                       |                   |                              |
                     rejects            extracts                          v
                  spam/marketing      actionable items         Transcription --> Extraction --> Complete

Key insight: Acquisition validates and prepares documents. Ingestion assumes all input is pre-validated and processes it.


Acquisition Stage

The stage that handles document discovery and retrieval from external sources. All acquisition logic lives in com.getorcha.workers.acquisition.*.

Relevancy Filter

Quick heuristic check during email processing that rejects obvious non-invoices.

Aspect    Description
------    -----------
Location  Acquisition stage, runs BEFORE email triage
Purpose   Avoid wasting LLM calls on spam, newsletters, marketing
Method    Heuristic rules (no LLM)
Bias      Errs toward accepting; false positives handled by triage
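
To make the "errs toward accepting" bias concrete, here is a minimal sketch of a heuristic filter. The rule patterns, the email map keys, and the name relevant? are hypothetical illustrations, not the rules the acquisition workers actually apply.

```clojure
;; Hypothetical rule set: these patterns are illustrative assumptions,
;; not the rules used in com.getorcha.workers.acquisition.*.
(def reject-subject-patterns
  [#"(?i)\bunsubscribe\b"
   #"(?i)\bnewsletter\b"
   #"(?i)\b\d+% off\b"])

(defn relevant?
  "Returns true unless the email matches an obvious rejection rule.
  Any email carrying an attachment is accepted outright, so borderline
  cases fall through to email triage instead of being dropped here."
  [{:keys [subject attachments]}]
  (boolean
   (or (seq attachments)
       (not-any? #(re-find % (or subject "")) reject-subject-patterns))))

;; Accepted: has an attachment.
(relevant? {:subject "Invoice 1042" :attachments ["invoice.pdf"]})

;; Rejected: no attachment and a marketing-style subject.
(relevant? {:subject "Spring newsletter: 20% off" :attachments []})
```

Because no LLM is involved, a check like this is cheap enough to run on every incoming email, and anything it wrongly lets through is still examined by triage.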

Accepts if:

Rejects if:

Email Triage

LLM-based analysis that determines what is actionable in an email.

Aspect             Description
------             -----------
Location           Acquisition stage, runs AFTER relevancy filter
Input              Full email: subject, sender, body, all attachments
Output             Array of extractable items
Previously called  "LLM classification" (deprecated term at this stage)

Each extractable item in the output specifies:

Example output:

[{:kind :attachment
  :filename "invoice-march-2026.pdf"
  :document-type-hint :invoice}
 {:kind :attachment
  :filename "packing-slip.pdf"
  :document-type-hint nil}]

Emails with no extractable items are marked as rejected with a reason.
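
The rejection rule above can be sketched as a small function over the triage output. The function name triage-outcome, the :status values, and the :no-extractable-items reason are assumptions made for illustration; only the shape of the item maps comes from the example output.

```clojure
;; Sketch only: `triage-outcome`, the :status values, and the reason
;; keyword are hypothetical names, not taken from the Orcha codebase.
(defn triage-outcome
  "Turns the collection of extractable items returned by email triage
  into an outcome map. An empty result becomes a rejection with a reason."
  [items]
  (if (seq items)
    {:status :accepted
     :items  (vec items)}
    {:status :rejected
     :reason :no-extractable-items}))

;; One actionable attachment: the email moves on to S3 upload.
(triage-outcome [{:kind :attachment
                  :filename "invoice-march-2026.pdf"
                  :document-type-hint :invoice}])

;; Nothing actionable: the email is marked as rejected.
(triage-outcome [])
```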


Ingestion Stage

The stage that processes acquired documents. Assumes all input is pre-validated and relevant. Lives in com.getorcha.workers.ingestion.*.

Transcription

Converts a document (PDF or image) into machine-readable text.

Extraction

Extracts structured data from transcribed text using an LLM.
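
As a rough illustration of how the two ingestion steps chain together, the sketch below passes a document through transcription and then extraction. The name ingest, the injected functions, and the map keys are all hypothetical; the real transcriber and LLM calls are replaced by stand-in functions so the example stays self-contained.

```clojure
;; Sketch only: `ingest`, `transcribe-fn`, `extract-fn`, and the map keys
;; are hypothetical. The real steps are injected as plain functions.
(defn ingest
  "Runs transcription, then extraction, and marks the document complete."
  [transcribe-fn extract-fn document]
  (let [text (transcribe-fn document)   ; document -> machine-readable text
        data (extract-fn text)]         ; text -> structured data
    (assoc document
           :text   text
           :data   data
           :status :complete)))

;; Usage with stand-in functions in place of the real transcriber and LLM.
(ingest (fn [_doc] "Invoice 1042 Total 99.00")
        (fn [text] {:raw-length (count text)})
        {:filename "invoice-march-2026.pdf"})
```

Note that nothing in this step re-checks relevance: ingestion relies on acquisition having already validated the input.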


Future Terms

Document Classification

Determines the document type (invoice, purchase order, GRN, etc.) during the ingestion pipeline.

Aspect         Description
------         -----------
Location       Ingestion stage (planned)
Distinct from  Email triage (which operates at acquisition)
Input          Single document (already extracted from email)
Output         Document type enum
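
Since this stage is only planned, the following is just one possible shape for the "document type enum". The set contains only the types named in this document; the keyword spellings and the helper valid-document-type? are assumptions.

```clojure
;; Sketch only: lists just the types named in this document. The keyword
;; spellings and `valid-document-type?` are hypothetical.
(def document-types
  #{:invoice :purchase-order :grn})

(defn valid-document-type?
  "True when t is one of the known document type keywords."
  [t]
  (contains? document-types t))

(valid-document-type? :invoice)
(valid-document-type? :newsletter)
```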

Why separate from email triage:


Term Disambiguation

Term                     Stage        Meaning
----                     -----        -------
Relevancy filter         Acquisition  Heuristic spam rejection
Email triage             Acquisition  LLM determines extractable items
Document classification  Ingestion    Determines document type for processing
Transcription            Ingestion    Document to text
Extraction               Ingestion    Text to structured data