This document describes the design decisions and patterns for the Orcha database schema.
---
title: ORCHA
---
erDiagram
%% ============================================
%% LEGEND: futurePlanned = planned for future iteration
%% ============================================
%% ============================================
%% TENANT & USER
%% ============================================
tenant {
uuid id PK
text name
text slug UK "URL-safe identifier"
text company_address
text company_vat_id
text company_tax_id
text company_country "ISO 3166-1 alpha-2"
timestamptz created_at
timestamptz updated_at
}
user {
uuid id PK
text clerk_user_id UK "external auth provider ID"
boolean is_super_admin "platform-wide admin"
timestamptz created_at
timestamptz updated_at
}
tenant_membership {
uuid id PK
uuid tenant_id FK
uuid user_id FK
timestamptz created_at
}
%% ============================================
%% AP_DOC_SOURCE (Class Table Inheritance)
%% ============================================
ap_doc_source {
uuid id PK
uuid tenant_id FK
doc_source_type type "discriminator: email, webhook"
boolean is_active
timestamptz last_document_at
integer documents_received
timestamptz created_at
timestamptz updated_at
}
ap_doc_source_email {
uuid doc_source_id PK,FK "extends ap_doc_source"
text email_address UK
email_provider provider "gmail, outlook"
text folder
bigint last_uid_seen "IMAP UID watermark"
}
ap_doc_source_webhook {
uuid doc_source_id PK,FK "extends ap_doc_source"
text endpoint_path UK "unique URL path"
text secret_hash "HMAC verification"
}
%% ============================================
%% DOCUMENT (Content Entity)
%% ============================================
document {
uuid id PK "UUIDv7 for time-ordering"
uuid tenant_id FK
document_type type "discriminator: invoice"
text domain "ap or ar"
text content_hash UK "SHA-256 for deduplication"
text file_path "S3 path: documents/{id}.{ext}"
jsonb structured_data "final extraction output"
boolean needs_human_review "set by trigger on failed ingestion"
timestamptz created_at
timestamptz updated_at
}
invoice {
uuid document_id PK,FK "extends document"
text number
date date
date due_date
text currency
numeric total
text supplier_name
}
purchase_order_futurePlanned {
uuid document_id PK,FK "extends document"
text number
date date
text supplier_name
numeric total
}
goods_received_note_futurePlanned {
uuid document_id PK,FK "extends document"
text number
date date
text supplier_name
}
contract_futurePlanned {
uuid document_id PK,FK "extends document"
text title
contract_type type "service, license, nda, supplier"
date effective_date
date expiration_date
}
%% ============================================
%% AP_INGESTION (Processing Entity)
%% ============================================
ap_ingestion {
uuid id PK "UUIDv7 for time-ordering"
uuid document_id FK
uuid doc_source_id FK "XOR with uploaded_by"
uuid uploaded_by FK "XOR with doc_source_id"
ingestion_status status "in-progress, completed, failed"
integer attempt_count "processing attempts"
jsonb source_metadata "origin-specific data"
text ocr_file_path "S3: ingestions/{id}/ocr-output.edn"
text preprocessed_file_path "S3: ingestions/{id}/preprocessed.{ext}"
numeric ocr_quality_score "0-1 range"
integer ocr_page_count
text ocr_method "google-document-ai, pdf-lib"
timestamptz ocr_started_at
timestamptz ocr_ended_at
jsonb llm_response "raw LLM response"
integer llm_input_tokens
integer llm_output_tokens
text llm_model
timestamptz llm_started_at
timestamptz llm_ended_at
jsonb structured_data "extraction output"
boolean valid_structured_data
timestamptz started_at "when claimed"
timestamptz completed_at
timestamptz created_at
}
%% ============================================
%% THREE-WAY MATCHING (Future)
%% ============================================
po_invoice_match_futurePlanned {
uuid id PK
uuid purchase_order_id FK
uuid invoice_id FK
numeric confidence "0-100"
match_status status "pending, confirmed, rejected"
timestamptz created_at
timestamptz updated_at
}
po_grn_match_futurePlanned {
uuid id PK
uuid purchase_order_id FK
uuid goods_received_note_id FK
numeric confidence "0-100"
match_status status "pending, confirmed, rejected"
timestamptz created_at
timestamptz updated_at
}
%% ============================================
%% RELATIONSHIPS: Tenant & User
%% ============================================
tenant ||--o{ tenant_membership : "has"
user ||--o{ tenant_membership : "belongs to"
%% ============================================
%% RELATIONSHIPS: Doc source hierarchy
%% ============================================
tenant ||--o{ ap_doc_source : "has"
ap_doc_source ||--o| ap_doc_source_email : "extends"
ap_doc_source ||--o| ap_doc_source_webhook : "extends"
%% ============================================
%% RELATIONSHIPS: Document hierarchy
%% ============================================
tenant ||--o{ document : "has"
document ||--o| invoice : "extends"
document ||--o| purchase_order_futurePlanned : "extends"
document ||--o| goods_received_note_futurePlanned : "extends"
document ||--o| contract_futurePlanned : "extends"
%% ============================================
%% RELATIONSHIPS: Ingestion
%% ============================================
document ||--o{ ap_ingestion : "has"
ap_doc_source |o--o{ ap_ingestion : "triggered"
user |o--o{ ap_ingestion : "triggered"
%% ============================================
%% RELATIONSHIPS: Three-way matching
%% ============================================
purchase_order_futurePlanned ||--o{ po_invoice_match_futurePlanned : "matched to"
invoice ||--o{ po_invoice_match_futurePlanned : "matched to"
purchase_order_futurePlanned ||--o{ po_grn_match_futurePlanned : "matched to"
goods_received_note_futurePlanned ||--o{ po_grn_match_futurePlanned : "matched to"
Legend: Entities marked with
futurePlannedare planned for a future iteration.
The schema separates two distinct concepts:
| Entity | Represents | Examples |
|---|---|---|
tenant |
A company/organization paying for Orcha | "Acme Corp", "TechStartup GmbH" |
user |
An individual human using the system | "alice@acme.com", "bob@acme.com" |
Why the separation:
Design decisions:
tenant.slug is unique for URL namespacing (e.g., app.orcha.com/acme-corp/)user.clerk_user_id links to external auth provider (Clerk)tenant_membership is the join table enabling M:N relationshiptenant (1:1 relationship, no separate table)Tenant settings (on tenant table):
company_address, company_vat_id, company_tax_id, company_countryUser flags:
is_super_admin - platform-wide admin (not tenant-specific, can access all tenants)Documents can arrive through multiple channels. We model configured channels as ap_doc_source entities.
What qualifies as a doc-source:
What is NOT a doc-source:
Design pattern: Class Table Inheritance
The ap_doc_source table holds common fields. Subtype tables extend it:
ap_doc_source (base)
├── ap_doc_source_email (extends)
└── ap_doc_source_webhook (extends)
The subtype's primary key is also a foreign key to the parent, with ON DELETE CASCADE.
A document represents a unique piece of content, deduplicated by its content hash. It is a pure "content entity" that does not track:
Key design decisions:
structured_data IS NOT NULLcontent_hash ensures the same file is stored onceDocument fields:
| Field | Purpose |
|---|---|
content_hash |
SHA-256 for deduplication |
file_path |
S3 location of original file |
type |
Document type discriminator |
structured_data |
Final extraction output (copied from successful ingestion) |
needs_human_review |
Flag set when latest ingestion fails |
Derived state:
| Condition | Meaning |
|---|---|
structured_data IS NOT NULL |
Document is ready |
Has in-progress ingestion |
Document is being processed |
needs_human_review = true |
Latest ingestion failed validation |
An ingestion (table: ap_ingestion) represents a single processing request for a document. This is where:
doc_source_id OR uploaded_by)status, attempt_count)Key design decisions:
Why this separation:
XOR Constraint on source: Every ingestion must have exactly one source - either a doc_source_id OR an uploaded_by, never both, never neither.
doc_source_id |
uploaded_by |
Valid? |
|---|---|---|
| SET | NULL | Yes - from doc-source |
| NULL | SET | Yes - manual upload |
| SET | SET | No - rejected |
| NULL | NULL | No - rejected |
ap_ingestion fields:
| Field | Purpose |
|---|---|
document_id |
Reference to content |
doc_source_id / uploaded_by |
Source of this request (XOR) |
status |
in-progress, completed, failed |
attempt_count |
Number of processing attempts |
source_metadata |
Origin-specific data |
started_at |
When claimed (used as lock for crash recovery) |
ocr_* |
OCR timing, quality, method |
llm_* |
LLM timing, tokens, model |
structured_data |
Extraction output for this run |
Claim mechanism: The started_at field serves as a distributed lock. An ingestion can only be claimed if started_at IS NULL (never claimed) or older than a configurable threshold (previous worker crashed).
All document types share a common ingestion pipeline and base attributes. We model this using Class Table Inheritance:
document (base)
├── invoice (extends)
├── purchase_order (future)
├── goods_received_note (future)
└── contract (future)
Why this pattern:
Subtype tables are minimal - only fields that make sense to query/index are normalized. Everything else lives in structured_data.
When an ingestion completes (status changes to completed or failed), a trigger copies the result to the document:
CREATE OR REPLACE FUNCTION update_document_from_ingestion()
RETURNS TRIGGER AS $$
BEGIN
IF NEW.status = 'completed' AND OLD.status = 'in-progress' THEN
UPDATE document
SET
structured_data = NEW.structured_data,
needs_human_review = NOT COALESCE(NEW.valid_structured_data, true),
updated_at = now()
WHERE id = NEW.document_id;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
This keeps the document's structured_data synchronized with the latest successful AP ingestion while maintaining the complete ingestion history.
This feature is planned for a future iteration.
In procurement, a "three-way match" verifies that:
All three should align before payment is approved.
Planned relationships:
purchase_order ←─── po_invoice_match ───→ invoice
│
└──── po_grn_match ───→ goods_received_note
Match tables will store:
tenant, not tenants)id (except in CTI subtypes: document_id, doc_source_id)ap_ prefix (e.g., ap_doc_source, ap_ingestion){referenced_table}_id (e.g., tenant_id)*_at suffix (created_at, last_document_at)is_* or has_* prefix (is_active, is_super_admin)type (ap_doc_source.type, document.type)UUID with DEFAULT gen_random_uuid() or uuidv7() for time-orderingTIMESTAMPTZ (timezone-aware)NUMERIC (not FLOAT)JSONB (e.g., source_metadata, structured_data)CREATE TYPE ... AS ENUM for fixed value setsPRIMARY KEYON DELETE CASCADE for owned relationshipsON DELETE SET NULL for optional referencesslug, email_address, endpoint_path)ingestion_source_xor)id as PK{base}_id as both PK and FKON DELETE CASCADE so deleting base deletes subtype