Standardized terminology for the Orcha codebase. Use these terms consistently in code, documentation, and communication.
A company or organization using Orcha. Billing and data isolation happen at the tenant level.

Identifier: `tenant`

An individual human authenticated via Clerk. Users can belong to multiple tenants.

Identifiers: `user`; `clerk_user_id` (external auth provider ID)

A configured channel through which documents arrive. Has tenant-specific configuration and maintains state. Currently AP-specific.

Identifiers: `ap_doc_source` (base), with subtypes `ap_doc_source_email`, `ap_doc_source_webhook`; `doc_source_type` (`email`, `webhook`)

What is NOT a doc-source: Manual file uploads (these are tracked via `document.uploaded_by` instead).
A file (PDF or image) being processed through the ingestion pipeline.

Identifiers: `document` (base), with subtype `invoice` (future: `purchase_order`, `goods_received_note`, `contract`); `document_type` (`invoice`); `file_path` (S3 location), `structured_data` (extraction results)

The pipeline that processes documents through transcription and extraction. The noun form of "ingest."

Identifiers: `com.getorcha.workers.ap.ingestion`; `ap_ingestion`; `orcha-global-ingest`

A unit of work for processing a single document. Jobs are claimed by workers and processed asynchronously.
The atomic operation where a worker acquires exclusive rights to process a document. Prevents duplicate processing.
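The claim described above can be illustrated with a conditional update: the status check and the status change happen in one statement, so only one worker can win. This is a minimal sketch, not Orcha's implementation. It uses Python's built-in `sqlite3` for runnability, and the `status` column on `document` and the helper name `claim_document` are assumptions made for illustration.

```python
import sqlite3


def claim_document(conn: sqlite3.Connection, document_id: int) -> bool:
    """Try to claim a document; return True only for the single winning caller."""
    # Check-and-set in one statement: the row is updated only if it is still
    # 'pending', so a second claimant matches zero rows and loses the claim.
    cur = conn.execute(
        "UPDATE document SET status = 'ingesting' "
        "WHERE id = ? AND status = 'pending'",
        (document_id,),
    )
    conn.commit()
    return cur.rowcount == 1


# Hypothetical, simplified schema: only the columns needed for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO document (id, status) VALUES (1, 'pending')")
conn.commit()

first = claim_document(conn, 1)   # wins the claim
second = claim_document(conn, 1)  # finds the row already claimed
print(first, second)  # True False
```

The same pattern carries over to other SQL databases: the caller that sees exactly one affected row owns the document, and every other caller skips it.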
Periodic SQS visibility extension during long-running processing. Prevents message timeout while work is in progress.
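A periodic extension like the one described above can be sketched as a background thread that calls an `extend` callback at a fixed interval for as long as the work runs. The helper name `heartbeat`, the callback shape, and the interval are assumptions made for illustration; the actual Orcha worker may structure this differently.

```python
import threading
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def heartbeat(extend: Callable[[], None], interval_s: float) -> Iterator[None]:
    """Call `extend` every `interval_s` seconds while the with-block is running."""
    stop = threading.Event()

    def beat() -> None:
        # Event.wait returns False on timeout and True once `stop` is set,
        # so the loop ends as soon as the work is done.
        while not stop.wait(interval_s):
            extend()

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    try:
        yield
    finally:
        stop.set()
        thread.join()


# Demo with a fake callback standing in for the SQS call.
calls: list[float] = []
with heartbeat(lambda: calls.append(time.monotonic()), interval_s=0.01):
    time.sleep(0.1)  # simulated long-running processing
```

With boto3, `extend` could wrap the SQS client's `change_message_visibility(QueueUrl=..., ReceiptHandle=..., VisibilityTimeout=...)` call for the message being processed. The interval should be comfortably shorter than the visibility timeout so that one delayed beat does not let the message reappear on the queue.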
An ingestion that is no longer in progress: either completed or failed. Used when the outcome doesn't matter, only that processing has finished.
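Using the status values from this glossary's status table, a "finished, regardless of outcome" check can be sketched as membership in a small set. The names `DocumentStatus` and `is_finished` are hypothetical. Mapping "completed" to `ingested` is an assumption based on that status's description ("Successfully processed").

```python
from enum import Enum


class DocumentStatus(str, Enum):
    PENDING = "pending"
    INGESTING = "ingesting"
    TRANSCRIBED = "transcribed"
    INGESTED = "ingested"
    FAILED = "failed"


# Assumption: these two statuses are the ones where processing has finished.
FINISHED = frozenset({DocumentStatus.INGESTED, DocumentStatus.FAILED})


def is_finished(status: DocumentStatus) -> bool:
    """True when processing is over, whether it succeeded or failed."""
    return status in FINISHED


print([s.value for s in DocumentStatus if is_finished(s)])  # ['ingested', 'failed']
```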
The process of converting a document (PDF or image) into machine-readable text.

Identifiers:

- `com.getorcha.workers.ap.ingestion.transcription`
- `:pdf-lib`: Extract embedded text from digital PDFs (fast, free)
- `:ocr`: Optical character recognition for scanned documents (e.g., Google Document AI)
- `:transcription-result`

The process of extracting structured data from transcribed text using an LLM.

Identifiers: `com.getorcha.workers.ap.ingestion.extraction`; `:extraction-result`

| Status | Meaning |
|---|---|
| `pending` | Queued, waiting to be claimed by a worker |
| `ingesting` | Worker claimed it, processing in progress |
| `transcribed` | Text extracted from PDF/image, ready for extraction |
| `ingested` | Successfully processed, structured data available |
| `failed` | Max retries exceeded, needs human review |
Identifier: `document_status`

JSONB field containing LLM extraction results. The canonical output of the ingestion pipeline.

Identifier: `document.structured_data`

A 0-1 metric from transcription indicating text extraction confidence. Lower scores may indicate poor scan quality or complex document layout.

Identifier: `document.ocr_quality_score`

JSONB field containing origin-specific data about how a document arrived.

Identifier: `document.source_metadata`