Orcha Glossary

Standardized terminology for the Orcha codebase. Use these terms consistently in code, documentation, and communication.


Core Entities

Tenant

A company or organization using Orcha. Billing and data isolation happen at the tenant level.

User

An individual human authenticated via Clerk. Users can belong to multiple tenants.

Doc-Source

A configured channel through which documents arrive. Has tenant-specific configuration and maintains state. Currently AP-specific.

What is NOT a doc-source: manual file uploads. These are tracked via document.uploaded_by instead.

Document

A file (PDF or image) being processed through the ingestion pipeline.


Ingestion Pipeline

Ingestion

The pipeline that processes documents through transcription and extraction. The noun form of "ingest."

Job

A unit of work for processing a single document. Jobs are claimed by workers and processed asynchronously.

Claim

The atomic operation where a worker acquires exclusive rights to process a document. Prevents duplicate processing.
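
One common way to make a claim atomic is a conditional update that succeeds only while the row is still pending. The sketch below illustrates the idea against an in-memory SQLite table. The table layout, the claimed_by column, and the claim_document helper are assumptions made for illustration, not Orcha's actual schema or code.

```python
import sqlite3


def claim_document(conn: sqlite3.Connection, document_id: int, worker_id: str) -> bool:
    """Try to claim a document; return True only for the single winning worker."""
    # The WHERE clause turns the update into a compare-and-set: the row changes
    # only if it is still 'pending', so a later attempt updates zero rows.
    cursor = conn.execute(
        "UPDATE documents SET status = 'ingesting', claimed_by = ? "
        "WHERE id = ? AND status = 'pending'",
        (worker_id, document_id),
    )
    conn.commit()
    return cursor.rowcount == 1


# Hypothetical minimal schema, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, status TEXT NOT NULL, claimed_by TEXT)"
)
conn.execute("INSERT INTO documents (id, status) VALUES (1, 'pending')")
conn.commit()

first = claim_document(conn, 1, "worker-a")   # wins the claim
second = claim_document(conn, 1, "worker-b")  # finds the row already claimed
```

Because the check and the write happen in one statement, there is no window in which two workers can both observe the document as pending and both proceed.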

Heartbeat

Periodic SQS visibility extension during long-running processing. Prevents message timeout while work is in progress.
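
A heartbeat of this kind is often implemented as a background thread that fires at a fixed interval until the work finishes. The sketch below assumes a hypothetical Heartbeat helper that takes the extension step as a plain callable. In a real worker that callable would presumably wrap the SQS ChangeMessageVisibility call (for example boto3's change_message_visibility); here it only records timestamps so the example runs without AWS.

```python
import threading
import time
from typing import Callable, List


class Heartbeat:
    """Calls `extend` every `interval_seconds` until the with-block exits."""

    def __init__(self, extend: Callable[[], None], interval_seconds: float) -> None:
        self._extend = extend
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Event.wait returns False on timeout (time to extend again) and
        # True once the stop event has been set.
        while not self._stop.wait(self._interval):
            self._extend()

    def __enter__(self) -> "Heartbeat":
        self._thread.start()
        return self

    def __exit__(self, *exc_info) -> None:
        self._stop.set()
        self._thread.join()


# Stand-in for the visibility extension: record when each beat happened.
extensions: List[float] = []

with Heartbeat(lambda: extensions.append(time.monotonic()), interval_seconds=0.05):
    time.sleep(0.3)  # stands in for long-running document processing
```

The interval should be comfortably shorter than the queue's visibility timeout, so that a delayed beat does not let the message become visible to another worker.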

Settled

An ingestion that is no longer in progress: it has either completed or failed. Used when the outcome doesn't matter, only that processing has finished.


Pipeline Stages

Transcription

The process of converting a document (PDF or image) into machine-readable text.

Extraction

The process of extracting structured data from transcribed text using an LLM.


Document Statuses

Status       Meaning
pending      Queued, waiting to be claimed by a worker
ingesting    Worker claimed it, processing in progress
transcribed  Text transcribed from PDF/image, ready for extraction
ingested     Successfully processed, structured data available
failed       Max retries exceeded, needs human review
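
In code, the statuses map naturally onto an enumeration, and the Settled term becomes a simple membership check. This is a minimal sketch: the DocumentStatus and is_settled names are hypothetical, and it assumes that "completed" in the Settled definition corresponds to the ingested status.

```python
from enum import Enum


class DocumentStatus(str, Enum):
    """String-valued so members compare equal to the stored status text."""

    PENDING = "pending"
    INGESTING = "ingesting"
    TRANSCRIBED = "transcribed"
    INGESTED = "ingested"
    FAILED = "failed"


# Assumption: "completed" in the Settled definition means `ingested`.
SETTLED_STATUSES = frozenset({DocumentStatus.INGESTED, DocumentStatus.FAILED})


def is_settled(status: DocumentStatus) -> bool:
    """True once processing has finished, regardless of the outcome."""
    return status in SETTLED_STATUSES
```

A status string read from storage can be converted with DocumentStatus("failed"), which raises ValueError for any value outside the list above.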


Data Fields

Structured Data

JSONB field containing LLM extraction results. The canonical output of the ingestion pipeline.

Quality Score

A score from 0 to 1 produced by transcription, indicating confidence in the transcribed text. Lower scores may indicate poor scan quality or a complex document layout.
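
As an illustration of how such a score might be consumed, the sketch below validates the range and flags low-confidence transcriptions. The is_low_quality helper and the 0.5 cutoff are hypothetical; the glossary does not define a threshold.

```python
LOW_QUALITY_THRESHOLD = 0.5  # hypothetical cutoff, not a value defined by Orcha


def is_low_quality(quality_score: float, threshold: float = LOW_QUALITY_THRESHOLD) -> bool:
    """Return True when a transcription's score falls below the threshold."""
    # Reject values outside the documented 0 to 1 range instead of guessing.
    if not 0.0 <= quality_score <= 1.0:
        raise ValueError(f"quality score must be within [0, 1], got {quality_score}")
    return quality_score < threshold
```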

Source Metadata

JSONB field containing origin-specific data about how a document arrived.