Database Schema Design

This document describes the design decisions and patterns for the Orcha database schema.

Entity Relationship Diagram

---
title: ORCHA
---
erDiagram
    %% ============================================
    %% LEGEND: futurePlanned = planned for future iteration
    %% ============================================

    %% ============================================
    %% TENANT & USER
    %% ============================================

    tenant {
        uuid id PK
        text name
        text slug UK "URL-safe identifier"
        text company_address
        text company_vat_id
        text company_tax_id
        text company_country "ISO 3166-1 alpha-2"
        timestamptz created_at
        timestamptz updated_at
    }

    user {
        uuid id PK
        text clerk_user_id UK "external auth provider ID"
        boolean is_super_admin "platform-wide admin"
        timestamptz created_at
        timestamptz updated_at
    }

    tenant_membership {
        uuid id PK
        uuid tenant_id FK
        uuid user_id FK
        timestamptz created_at
    }

    %% ============================================
    %% AP_DOC_SOURCE (Class Table Inheritance)
    %% ============================================

    ap_doc_source {
        uuid id PK
        uuid tenant_id FK
        doc_source_type type "discriminator: email, webhook"
        boolean is_active
        timestamptz last_document_at
        integer documents_received
        timestamptz created_at
        timestamptz updated_at
    }

    ap_doc_source_email {
        uuid doc_source_id PK,FK "extends ap_doc_source"
        text email_address UK
        email_provider provider "gmail, outlook"
        text folder
        bigint last_uid_seen "IMAP UID watermark"
    }

    ap_doc_source_webhook {
        uuid doc_source_id PK,FK "extends ap_doc_source"
        text endpoint_path UK "unique URL path"
        text secret_hash "HMAC verification"
    }

    %% ============================================
    %% DOCUMENT (Content Entity)
    %% ============================================

    document {
        uuid id PK "UUIDv7 for time-ordering"
        uuid tenant_id FK
        document_type type "discriminator: invoice"
        text domain "ap or ar"
        text content_hash UK "SHA-256 for deduplication"
        text file_path "S3 path: documents/{id}.{ext}"
        jsonb structured_data "final extraction output"
        boolean needs_human_review "set by trigger on failed ingestion"
        timestamptz created_at
        timestamptz updated_at
    }

    invoice {
        uuid document_id PK,FK "extends document"
        text number
        date date
        date due_date
        text currency
        numeric total
        text supplier_name
    }

    purchase_order_futurePlanned {
        uuid document_id PK,FK "extends document"
        text number
        date date
        text supplier_name
        numeric total
    }

    goods_received_note_futurePlanned {
        uuid document_id PK,FK "extends document"
        text number
        date date
        text supplier_name
    }

    contract_futurePlanned {
        uuid document_id PK,FK "extends document"
        text title
        contract_type type "service, license, nda, supplier"
        date effective_date
        date expiration_date
    }

    %% ============================================
    %% AP_INGESTION (Processing Entity)
    %% ============================================

    ap_ingestion {
        uuid id PK "UUIDv7 for time-ordering"
        uuid document_id FK
        uuid doc_source_id FK "XOR with uploaded_by"
        uuid uploaded_by FK "XOR with doc_source_id"
        ingestion_status status "in-progress, completed, failed"
        integer attempt_count "processing attempts"
        jsonb source_metadata "origin-specific data"
        text ocr_file_path "S3: ingestions/{id}/ocr-output.edn"
        text preprocessed_file_path "S3: ingestions/{id}/preprocessed.{ext}"
        numeric ocr_quality_score "0-1 range"
        integer ocr_page_count
        text ocr_method "google-document-ai, pdf-lib"
        timestamptz ocr_started_at
        timestamptz ocr_ended_at
        jsonb llm_response "raw LLM response"
        integer llm_input_tokens
        integer llm_output_tokens
        text llm_model
        timestamptz llm_started_at
        timestamptz llm_ended_at
        jsonb structured_data "extraction output"
        boolean valid_structured_data
        timestamptz started_at "when claimed"
        timestamptz completed_at
        timestamptz created_at
    }

    %% ============================================
    %% THREE-WAY MATCHING (Future)
    %% ============================================

    po_invoice_match_futurePlanned {
        uuid id PK
        uuid purchase_order_id FK
        uuid invoice_id FK
        numeric confidence "0-100"
        match_status status "pending, confirmed, rejected"
        timestamptz created_at
        timestamptz updated_at
    }

    po_grn_match_futurePlanned {
        uuid id PK
        uuid purchase_order_id FK
        uuid goods_received_note_id FK
        numeric confidence "0-100"
        match_status status "pending, confirmed, rejected"
        timestamptz created_at
        timestamptz updated_at
    }

    %% ============================================
    %% RELATIONSHIPS: Tenant & User
    %% ============================================

    tenant ||--o{ tenant_membership : "has"
    user ||--o{ tenant_membership : "belongs to"

    %% ============================================
    %% RELATIONSHIPS: Doc source hierarchy
    %% ============================================

    tenant ||--o{ ap_doc_source : "has"
    ap_doc_source ||--o| ap_doc_source_email : "extends"
    ap_doc_source ||--o| ap_doc_source_webhook : "extends"

    %% ============================================
    %% RELATIONSHIPS: Document hierarchy
    %% ============================================

    tenant ||--o{ document : "has"
    document ||--o| invoice : "extends"
    document ||--o| purchase_order_futurePlanned : "extends"
    document ||--o| goods_received_note_futurePlanned : "extends"
    document ||--o| contract_futurePlanned : "extends"

    %% ============================================
    %% RELATIONSHIPS: Ingestion
    %% ============================================

    document ||--o{ ap_ingestion : "has"
    ap_doc_source |o--o{ ap_ingestion : "triggered"
    user |o--o{ ap_ingestion : "triggered"

    %% ============================================
    %% RELATIONSHIPS: Three-way matching
    %% ============================================

    purchase_order_futurePlanned ||--o{ po_invoice_match_futurePlanned : "matched to"
    invoice ||--o{ po_invoice_match_futurePlanned : "matched to"

    purchase_order_futurePlanned ||--o{ po_grn_match_futurePlanned : "matched to"
    goods_received_note_futurePlanned ||--o{ po_grn_match_futurePlanned : "matched to"

Legend: Entities marked with futurePlanned are planned for a future iteration.

Core Concepts

Multi-tenancy: tenant vs user

The schema separates two distinct concepts:

Entity	Represents	Examples
`tenant`	A company/organization paying for Orcha	"Acme Corp", "TechStartup GmbH"
`user`	An individual human using the system	"alice@acme.com", "bob@acme.com"

Why the separation:

Billing is tenant-level, not user-level
Users can belong to multiple tenants (consultants, agencies)
Permissions are per-membership, not per-user
Data isolation is at tenant level

Design decisions:

tenant.slug is unique for URL namespacing (e.g., app.orcha.com/acme-corp/)
user.clerk_user_id links to external auth provider (Clerk)
tenant_membership is the join table enabling M:N relationship
Settings are stored directly on tenant (1:1 relationship, no separate table)

Tenant settings (on tenant table):

Company info: company_address, company_vat_id, company_tax_id, company_country

User flags:

is_super_admin - platform-wide admin (not tenant-specific, can access all tenants)

Document Sources: The Doc-Source Abstraction

Documents can arrive through multiple channels. We model configured channels as ap_doc_source entities.

What qualifies as a doc-source:

Has tenant-specific configuration
Can have multiple instances per tenant
Maintains state (last checked, documents received)
Examples: email inbox watchers, webhook endpoints

What is NOT a doc-source:

Manual file uploads - this is a universal user action, not a configured channel
No per-tenant configuration needed
No state to track

Design pattern: Class Table Inheritance

The ap_doc_source table holds common fields. Subtype tables extend it:

ap_doc_source (base)
├── ap_doc_source_email (extends)
└── ap_doc_source_webhook (extends)

The subtype's primary key is also a foreign key to the parent, with ON DELETE CASCADE.

The Document Entity

A document represents a unique piece of content, deduplicated by its content hash. It is a pure "content entity" that does not track:

Who uploaded or requested processing
Processing state or attempts
Source-specific metadata

Key design decisions:

No status column: Document readiness is derived from structured_data IS NOT NULL
Content deduplication: content_hash ensures the same file is stored once
Minimal attributes: Only content-related fields live here

Document fields:

Field	Purpose
`content_hash`	SHA-256 for deduplication
`file_path`	S3 location of original file
`type`	Document type discriminator
`structured_data`	Final extraction output (copied from successful ingestion)
`needs_human_review`	Flag set when latest ingestion fails

Derived state:

Condition	Meaning
`structured_data IS NOT NULL`	Document is ready
Has `in-progress` ingestion	Document is being processed
`needs_human_review = true`	Latest ingestion failed validation

The Ingestion Entity

An ingestion (table: ap_ingestion) represents a single processing request for a document. This is where:

Source tracking lives (doc_source_id OR uploaded_by)
Processing state is managed (status, attempt_count)
OCR and LLM results are stored

Key design decisions:

Source tracking: Each ingestion knows WHO requested it
Multiple ingestions per document: Same content can be re-processed
Processing state: Status, attempts, timing all live here
Results storage: OCR and LLM outputs stored per-ingestion

Why this separation:

Same document can come from multiple sources (email forwarded, then uploaded)
Re-processing creates a new ingestion, preserving history
Document becomes a simple "content entity" with deduplication
Easier to reason about: "what processing happened" vs "what content exists"

XOR Constraint on source: Every ingestion must have exactly one source - either a doc_source_id OR an uploaded_by, never both, never neither.

`doc_source_id`	`uploaded_by`	Valid?
SET	NULL	Yes - from doc-source
NULL	SET	Yes - manual upload
SET	SET	No - rejected
NULL	NULL	No - rejected

ap_ingestion fields:

Field	Purpose
`document_id`	Reference to content
`doc_source_id` / `uploaded_by`	Source of this request (XOR)
`status`	in-progress, completed, failed
`attempt_count`	Number of processing attempts
`source_metadata`	Origin-specific data
`started_at`	When claimed (used as lock for crash recovery)
`ocr_*`	OCR timing, quality, method
`llm_*`	LLM timing, tokens, model
`structured_data`	Extraction output for this run

Claim mechanism: The started_at field serves as a distributed lock. An ingestion can only be claimed if started_at IS NULL (never claimed) or older than a configurable threshold (previous worker crashed).

Document Type Hierarchy

All document types share a common ingestion pipeline and base attributes. We model this using Class Table Inheritance:

document (base)
├── invoice (extends)
├── purchase_order (future)
├── goods_received_note (future)
└── contract (future)

Why this pattern:

Unified ingestion system - one relationship works for all document types
Common fields in one place (tenant, content hash, file path)
Type-specific fields are properly typed on subtype tables
Polymorphic queries ("show all documents from this tenant")
Relationships between types (PO ↔ Invoice matching)

Subtype tables are minimal - only fields that make sense to query/index are normalized. Everything else lives in structured_data.

Database Trigger: AP Ingestion → Document

When an ingestion completes (status changes to completed or failed), a trigger copies the result to the document:

CREATE OR REPLACE FUNCTION update_document_from_ingestion()
RETURNS TRIGGER AS $$
BEGIN
    IF NEW.status = 'completed' AND OLD.status = 'in-progress' THEN
        UPDATE document
        SET
            structured_data = NEW.structured_data,
            needs_human_review = NOT COALESCE(NEW.valid_structured_data, true),
            updated_at = now()
        WHERE id = NEW.document_id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

This keeps the document's structured_data synchronized with the latest successful AP ingestion while maintaining the complete ingestion history.

Three-Way Matching (Future)

This feature is planned for a future iteration.

In procurement, a "three-way match" verifies that:

Purchase Order - what was ordered
Goods Receipt - what was delivered
Invoice - what is being billed

All three should align before payment is approved.

Planned relationships:

purchase_order ←─── po_invoice_match ───→ invoice
       │
       └──── po_grn_match ───→ goods_received_note

Match tables will store:

Which documents are linked
Match confidence (0-100, auto-generated by LLM)
Match status (pending, confirmed, rejected)

Conventions

Naming

Tables: singular, lowercase (tenant, not tenants)
Primary keys: id (except in CTI subtypes: document_id, doc_source_id)
AP-specific tables: ap_ prefix (e.g., ap_doc_source, ap_ingestion)
Foreign keys: {referenced_table}_id (e.g., tenant_id)
Timestamps: *_at suffix (created_at, last_document_at)
Booleans: is_* or has_* prefix (is_active, is_super_admin)
Discriminators: type (ap_doc_source.type, document.type)

Types

IDs: UUID with DEFAULT gen_random_uuid() or uuidv7() for time-ordering
Timestamps: TIMESTAMPTZ (timezone-aware)
Money: NUMERIC (not FLOAT)
Flexible data: JSONB (e.g., source_metadata, structured_data)
Enums: PostgreSQL CREATE TYPE ... AS ENUM for fixed value sets

Constraints

All tables have a PRIMARY KEY
Foreign keys use ON DELETE CASCADE for owned relationships
Foreign keys use ON DELETE SET NULL for optional references
Unique constraints on natural keys (slug, email_address, endpoint_path)
CHECK constraints for business rules (e.g., ingestion_source_xor)

Class Table Inheritance Pattern

Base table has id as PK
Subtype tables have {base}_id as both PK and FK
Discriminator column on base table indicates subtype
Use ON DELETE CASCADE so deleting base deletes subtype