Orcha Glossary

Standardized terminology for the Orcha codebase. Use these terms consistently in code, documentation, and communication.

Core Entities

A tenant is a company or organization using Orcha. Billing and data isolation happen at the tenant level.

A user is an individual human authenticated via Clerk. A user can belong to multiple tenants.

A doc-source is a configured channel through which documents arrive. Each doc-source has tenant-specific configuration and maintains state. Doc-sources are currently AP-specific.

Table: ap_doc_source (base), with subtypes ap_doc_source_email, ap_doc_source_webhook
Enum: doc_source_type (email, webhook)
Examples: An email inbox watcher, a webhook endpoint

What is NOT a doc-source: Manual file uploads (these are tracked via document.uploaded_by instead).
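
As an illustration of the base/subtype naming above, here is a minimal Python sketch. The table and enum values come from this glossary; the `DocSourceType` class and `subtype_table` helper are hypothetical names introduced only for this example and are not part of the Orcha codebase.

```python
from enum import Enum


class DocSourceType(str, Enum):
    """Mirrors the doc_source_type enum values listed in this glossary."""
    EMAIL = "email"
    WEBHOOK = "webhook"


BASE_TABLE = "ap_doc_source"


def subtype_table(source_type: DocSourceType) -> str:
    """Return the subtype table name for a doc-source type (hypothetical helper)."""
    return f"{BASE_TABLE}_{source_type.value}"


email_table = subtype_table(DocSourceType.EMAIL)
webhook_table = subtype_table(DocSourceType.WEBHOOK)
```

Because manual uploads are not a doc-source, a value such as `"manual_upload"` is deliberately absent from the enum, and `DocSourceType("manual_upload")` raises `ValueError`.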

A document is a file (PDF or image) processed through the ingestion pipeline.

Table: document (base), with subtype invoice (future: purchase_order, goods_received_note, contract)
Enum: document_type (invoice)
Key fields: file_path (S3 location), structured_data (extraction results)

Ingestion is the pipeline that processes documents through transcription and extraction. It is the noun form of "ingest."

A job is a unit of work for processing a single document. Jobs are claimed by workers and processed asynchronously.

A claim is the atomic operation in which a worker acquires exclusive rights to process a document. Claiming prevents duplicate processing.
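
One common way to make such an operation atomic is a conditional update that only succeeds while the row is still `pending`. The sketch below uses Python's built-in `sqlite3` purely for illustration. The `status` and `claimed_by` columns and the `claim_document` function are assumptions made for this example, not the actual Orcha schema or implementation.

```python
import sqlite3


def claim_document(conn: sqlite3.Connection, document_id: int, worker_id: str) -> bool:
    """Try to claim a pending document; return True only for the winning worker."""
    cur = conn.execute(
        # The WHERE clause makes the check and the update a single statement,
        # so a second worker finds no 'pending' row left to update.
        "UPDATE document SET status = 'ingesting', claimed_by = ? "
        "WHERE id = ? AND status = 'pending'",
        (worker_id, document_id),
    )
    conn.commit()
    return cur.rowcount == 1


# Hypothetical, simplified table for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document (id INTEGER PRIMARY KEY, status TEXT NOT NULL, claimed_by TEXT)"
)
conn.execute("INSERT INTO document (id, status) VALUES (1, 'pending')")
conn.commit()

first = claim_document(conn, 1, "worker-a")   # wins the claim
second = claim_document(conn, 1, "worker-b")  # finds nothing to claim
```

The second call leaves the row untouched, which is how duplicate processing is avoided in this sketch.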

Periodic extension of an SQS message's visibility timeout during long-running processing. This keeps the message from timing out and becoming visible to other consumers while work is still in progress.
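
A periodic extension like this can be sketched as a background thread that calls an "extend" function at a fixed interval until the work finishes. The `Heartbeat` class below is a hypothetical helper written for this example. The callback is injected so the sketch runs without AWS access, and the interval and sleep values are arbitrary.

```python
import threading
import time
from typing import Callable


class Heartbeat:
    """Calls `extend` every `interval` seconds until the with-block exits."""

    def __init__(self, extend: Callable[[], None], interval: float) -> None:
        self._extend = extend
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Event.wait returns False on timeout (keep beating)
        # and True once stop has been requested.
        while not self._stop.wait(self._interval):
            self._extend()

    def __enter__(self) -> "Heartbeat":
        self._thread.start()
        return self

    def __exit__(self, *exc_info) -> None:
        self._stop.set()
        self._thread.join()


beats: list[float] = []
with Heartbeat(lambda: beats.append(time.monotonic()), interval=0.01):
    time.sleep(0.1)  # stands in for long-running document processing
```

In a real worker, the callback would typically wrap the SQS `ChangeMessageVisibility` operation for the message being processed. How Orcha wires this up is not specified in this glossary.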

An ingestion that is no longer in progress, having either completed or failed. Used when the outcome doesn't matter, only that processing has finished.

Transcription is the process of converting a document (PDF or image) into machine-readable text.

Namespace: com.getorcha.workers.ap.ingestion.transcription
Methods:
- :pdf-lib — Extract embedded text from digital PDFs (fast, free)
- :ocr — Optical character recognition for scanned documents (e.g., Google Document AI)
Output: Text content with quality metrics
Result key: :transcription-result
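
This glossary does not say how a method is chosen. One plausible arrangement, shown here only as an assumption, is to try the cheap embedded-text path first and fall back to OCR when the quality score is low. `TranscriptionResult`, `transcribe`, the `min_quality` threshold of 0.8, and both stub transcribers are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TranscriptionResult:
    method: str           # "pdf-lib" or "ocr"
    text: str
    quality_score: float  # 0 to 1, higher means more confident


Transcriber = Callable[[bytes], TranscriptionResult]


def transcribe(
    file_bytes: bytes,
    pdf_lib: Transcriber,
    ocr: Transcriber,
    min_quality: float = 0.8,  # assumed threshold, not an Orcha setting
) -> TranscriptionResult:
    """Prefer embedded text; fall back to OCR when its quality is too low."""
    result = pdf_lib(file_bytes)
    if result.quality_score >= min_quality:
        return result
    return ocr(file_bytes)


# Stub transcribers standing in for the real methods.
def good_pdf_lib(_: bytes) -> TranscriptionResult:
    return TranscriptionResult("pdf-lib", "Invoice 123", 0.95)


def empty_pdf_lib(_: bytes) -> TranscriptionResult:
    return TranscriptionResult("pdf-lib", "", 0.0)


def stub_ocr(_: bytes) -> TranscriptionResult:
    return TranscriptionResult("ocr", "Invoice 123", 0.85)


digital = transcribe(b"%PDF-digital", good_pdf_lib, stub_ocr)
scanned = transcribe(b"%PDF-scanned", empty_pdf_lib, stub_ocr)
```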

Extraction is the process of deriving structured data from transcribed text using an LLM.

| Status | Meaning |
|---|---|
| `pending` | Queued, waiting to be claimed by a worker |
| `ingesting` | Worker claimed it, processing in progress |
| `transcribed` | Text extracted from PDF/image, ready for extraction |
| `ingested` | Successfully processed, structured data available |
| `failed` | Max retries exceeded, needs human review |
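
The statuses can be modeled as an enumeration, with the terminal ones grouped in a set. This is a minimal sketch: `IngestionStatus`, `TERMINAL_STATUSES`, and `is_terminal` are hypothetical names, and treating `ingested` as the "completed" outcome is an assumption drawn from the meanings listed above.

```python
from enum import Enum


class IngestionStatus(str, Enum):
    PENDING = "pending"
    INGESTING = "ingesting"
    TRANSCRIBED = "transcribed"
    INGESTED = "ingested"
    FAILED = "failed"


# Assumed terminal set: processing has finished, successfully or not.
TERMINAL_STATUSES = frozenset({IngestionStatus.INGESTED, IngestionStatus.FAILED})


def is_terminal(status: IngestionStatus) -> bool:
    """True when the ingestion is no longer in progress."""
    return status in TERMINAL_STATUSES


in_progress = [s.value for s in IngestionStatus if not is_terminal(s)]
```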

The structured_data JSONB field contains the LLM extraction results. It is the canonical output of the ingestion pipeline.

A metric from 0 to 1, produced by transcription, that indicates confidence in the extracted text. Lower scores may indicate poor scan quality or a complex document layout.

The source_metadata JSONB field contains origin-specific data about how a document arrived.

Field: document.source_metadata
Contents: Email headers (for email sources), upload info (for manual uploads), etc.