Orcha Ingestion: Terminology

Standardized terminology for the document processing pipeline, particularly around email ingestion.

Pipeline Flow

                      ACQUISITION                                    INGESTION
                      -----------                                    ---------
Email Source --> Relevancy Filter --> Email Triage --> S3 Upload --> Ingestion Queue
                       |                   |                              |
                     rejects            extracts                          v
                  spam/marketing      actionable items         Transcription --> Extraction --> Complete

Key insight: Acquisition validates and prepares documents. Ingestion assumes all input is pre-validated and processes it.

Acquisition Stage

The stage that handles document discovery and retrieval from external sources. All acquisition logic lives in com.getorcha.workers.acquisition.*.

Relevancy Filter

A quick heuristic check during email processing that rejects emails that are obviously not invoices.

Aspect	Description
Location	Acquisition stage, runs BEFORE email triage
Purpose	Avoid wasting LLM calls on spam, newsletters, marketing
Method	Heuristic rules (no LLM)
Bias	Errs toward accepting; emails accepted in error (false positives) are caught by email triage

Accepts if:

Has PDF/image attachments, OR
Body matches invoice keywords

Rejects if:

No attachments AND no invoice signals
Sender matches known spam patterns
Subject indicates newsletter/marketing
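
The accept/reject rules above can be sketched as a pure predicate. This is a minimal sketch, not the actual filter: the email map shape (:sender, :subject, :body, :attachments with :content-type), the keyword and pattern lists, the result shape, and the precedence (explicit spam and marketing signals are checked before the accept rules) are all assumptions made for illustration.

```clojure
(ns example.relevancy-filter
  "Hypothetical sketch of the relevancy filter; not the real Orcha code."
  (:require [clojure.string :as str]))

;; Assumed values: the real keyword and pattern lists are not given in this document.
(def invoice-keywords ["invoice" "payment due" "amount due"])
(def spam-sender-patterns [#"(?i)^no-?reply@.*promo" #"(?i)@mailer\."])
(def marketing-subject-patterns [#"(?i)newsletter" #"(?i)unsubscribe" #"(?i)\d+% off"])

(defn- document-attachment?
  "True for PDF or image attachments."
  [{:keys [content-type]}]
  (boolean (and content-type
                (or (= content-type "application/pdf")
                    (str/starts-with? content-type "image/")))))

(defn- matches-any?
  "True when any of the regex patterns is found in s (nil-safe)."
  [patterns s]
  (boolean (some #(re-find % (or s "")) patterns)))

(defn- invoice-signals?
  "True when the body contains any of the assumed invoice keywords."
  [body]
  (let [body (str/lower-case (or body ""))]
    (boolean (some #(str/includes? body %) invoice-keywords))))

(defn check-relevancy
  "Returns {:relevant? true} or {:relevant? false :reason <keyword>}.
   Heuristics only; no LLM calls."
  [{:keys [sender subject body attachments]}]
  (cond
    (matches-any? spam-sender-patterns sender)
    {:relevant? false :reason :spam-sender}

    (matches-any? marketing-subject-patterns subject)
    {:relevant? false :reason :newsletter-or-marketing}

    (some document-attachment? attachments)
    {:relevant? true}

    (invoice-signals? body)
    {:relevant? true}

    :else
    {:relevant? false :reason :no-attachments-and-no-invoice-signals}))
```

Returning a reason alongside the decision keeps rejections auditable. Whether a PDF attachment should override a spam-sender match is a policy decision this document does not settle; the ordering of the cond clauses is where that choice would live.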

Email Triage

LLM-based analysis that determines what is actionable in an email.

Aspect	Description
Location	Acquisition stage, runs AFTER relevancy filter
Input	Full email: subject, sender, body, all attachments
Output	Array of extractable items
Previously called	"LLM classification" (deprecated term at this stage)

Each extractable item in the output specifies:

kind: attachment, body, or download-link
filename/format: Identifies the specific attachment or content
document-type-hint: Optional hint (e.g., :invoice, :purchase-order); nil when no hint is available

Example output:

[{:kind :attachment
  :filename "invoice-march-2026.pdf"
  :document-type-hint :invoice}
 {:kind :attachment
  :filename "packing-slip.pdf"
  :document-type-hint nil}]

Emails with no extractable items are marked as rejected with a reason.
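
One way to turn the triage output into an accept/reject decision is sketched below. The function names, the :status/:reason result shape, and the reason value are hypothetical; only the item keys and the three kinds come from the description above.

```clojure
(ns example.email-triage
  "Hypothetical sketch; the result shape is an assumption.")

;; The three kinds listed in this document.
(def valid-kinds #{:attachment :body :download-link})

(defn valid-item?
  "An extractable item needs a known :kind; :document-type-hint is optional."
  [item]
  (contains? valid-kinds (:kind item)))

(defn triage-result
  "Wraps the LLM's extractable items in a decision map.
   Emails with no valid extractable items are rejected with a reason."
  [items]
  (let [items (filterv valid-item? items)]
    (if (empty? items)
      {:status :rejected
       :reason :no-extractable-items}
      {:status :accepted
       :items  items})))
```

Dropping items with an unknown :kind, rather than failing the whole email, is a design choice of this sketch; a stricter variant could reject the email or log the malformed item instead.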

Ingestion Stage

The stage that processes acquired documents. Assumes all input is pre-validated and relevant. Lives in com.getorcha.workers.ingestion.*.

Transcription

Converts a document (PDF or image) into machine-readable text.

Namespace: com.getorcha.workers.ingestion.transcription
Methods: :pdf-lib (embedded text) or :ocr (Google Document AI)
Output: Text content with quality metrics
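
How a method might be chosen between :pdf-lib and :ocr is sketched below. This document does not say how the choice is made, so the fallback rule (use embedded text when a PDF has enough of it, otherwise OCR), the threshold, the :file-type and :page-count keys, and the metric names are all assumptions. The two extractor functions are passed in by the caller and stand in for the real implementations.

```clojure
(ns example.transcription
  "Hypothetical sketch of method selection; thresholds and metric names are assumptions."
  (:require [clojure.string :as str]))

(def min-chars-per-page
  "Assumed threshold below which embedded text is treated as unusable."
  50)

(defn quality-metrics
  "Computes simple, illustrative quality metrics for transcribed text."
  [text page-count]
  (let [chars (count (str/trim (or text "")))]
    {:char-count     chars
     :chars-per-page (/ chars (max 1 (or page-count 1)))}))

(defn transcribe
  "Tries embedded text first for PDFs and falls back to OCR.
   `extract-embedded` and `run-ocr` are caller-supplied functions that take the
   document and return text."
  [{:keys [file-type page-count] :as document} extract-embedded run-ocr]
  (let [embedded (when (= file-type :pdf) (extract-embedded document))
        metrics  (quality-metrics embedded page-count)]
    (if (and embedded (>= (:chars-per-page metrics) min-chars-per-page))
      {:method :pdf-lib :text embedded :quality metrics}
      (let [text (run-ocr document)]
        {:method :ocr :text text :quality (quality-metrics text page-count)}))))
```

Passing the extractors as arguments keeps the selection logic testable without a PDF library or an OCR service.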

Extraction

Extracts structured data from transcribed text using an LLM.

Namespace: com.getorcha.workers.ingestion.extraction
Output: Structured data (invoice fields, line items, etc.)

Future Terms

Document Classification

Determines the document type (invoice, purchase order, GRN, etc.) during the ingestion pipeline.

Aspect	Description
Location	Ingestion stage (planned)
Distinct from	Email triage (which operates at acquisition)
Input	Single document (already extracted from email)
Output	Document type enum

Why separate from email triage:

Triage sees the full email context; classification sees only the document
Triage decides WHAT to extract; classification decides HOW to process
The same document-type determination may apply to direct uploads, which have no email context
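
The idea that classification decides how a document is processed could be expressed as a multimethod that dispatches on the document type. Since classification is still planned, everything here is hypothetical: the :document-type key, the enum values, and the handler names are placeholders, not an existing API.

```clojure
(ns example.classification
  "Hypothetical sketch: routing by document type with a multimethod.")

(defmulti process-document
  "Dispatches on the :document-type assigned by classification."
  :document-type)

(defmethod process-document :invoice [doc]
  (assoc doc :handler :invoice-extraction))

(defmethod process-document :purchase-order [doc]
  (assoc doc :handler :purchase-order-extraction))

(defmethod process-document :default [doc]
  ;; Unknown or not-yet-supported types are flagged rather than guessed at.
  (assoc doc :handler :manual-review))
```

Because dispatch depends only on the document map, the same routing would work for a direct upload that never passed through email triage.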

Term Disambiguation

Term	Stage	Meaning
Relevancy filter	Acquisition	Heuristic spam rejection
Email triage	Acquisition	LLM determines extractable items
Document classification	Ingestion	Determines document type for processing
Transcription	Ingestion	Document to text
Extraction	Ingestion	Text to structured data