Database Schema Design

This document describes the design decisions and patterns for the Orcha database schema.


Entity Relationship Diagram

---
title: ORCHA
---
erDiagram
    %% ============================================
    %% LEGEND: futurePlanned = planned for future iteration
    %% ============================================

    %% ============================================
    %% TENANT & USER
    %% ============================================

    tenant {
        uuid id PK
        text name
        text slug UK "URL-safe identifier"
        text company_address
        text company_vat_id
        text company_tax_id
        text company_country "ISO 3166-1 alpha-2"
        timestamptz created_at
        timestamptz updated_at
    }

    user {
        uuid id PK
        text clerk_user_id UK "external auth provider ID"
        boolean is_super_admin "platform-wide admin"
        timestamptz created_at
        timestamptz updated_at
    }

    tenant_membership {
        uuid id PK
        uuid tenant_id FK
        uuid user_id FK
        timestamptz created_at
    }

    %% ============================================
    %% AP_DOC_SOURCE (Class Table Inheritance)
    %% ============================================

    ap_doc_source {
        uuid id PK
        uuid tenant_id FK
        doc_source_type type "discriminator: email, webhook"
        boolean is_active
        timestamptz last_document_at
        integer documents_received
        timestamptz created_at
        timestamptz updated_at
    }

    ap_doc_source_email {
        uuid doc_source_id PK,FK "extends ap_doc_source"
        text email_address UK
        email_provider provider "gmail, outlook"
        text folder
        bigint last_uid_seen "IMAP UID watermark"
    }

    ap_doc_source_webhook {
        uuid doc_source_id PK,FK "extends ap_doc_source"
        text endpoint_path UK "unique URL path"
        text secret_hash "HMAC verification"
    }

    %% ============================================
    %% DOCUMENT (Content Entity)
    %% ============================================

    document {
        uuid id PK "UUIDv7 for time-ordering"
        uuid tenant_id FK
        document_type type "discriminator: invoice"
        text domain "ap or ar"
        text content_hash UK "SHA-256 for deduplication"
        text file_path "S3 path: documents/{id}.{ext}"
        jsonb structured_data "final extraction output"
        boolean needs_human_review "set by trigger on failed ingestion"
        timestamptz created_at
        timestamptz updated_at
    }

    invoice {
        uuid document_id PK,FK "extends document"
        text number
        date date
        date due_date
        text currency
        numeric total
        text supplier_name
    }

    purchase_order_futurePlanned {
        uuid document_id PK,FK "extends document"
        text number
        date date
        text supplier_name
        numeric total
    }

    goods_received_note_futurePlanned {
        uuid document_id PK,FK "extends document"
        text number
        date date
        text supplier_name
    }

    contract_futurePlanned {
        uuid document_id PK,FK "extends document"
        text title
        contract_type type "service, license, nda, supplier"
        date effective_date
        date expiration_date
    }

    %% ============================================
    %% AP_INGESTION (Processing Entity)
    %% ============================================

    ap_ingestion {
        uuid id PK "UUIDv7 for time-ordering"
        uuid document_id FK
        uuid doc_source_id FK "XOR with uploaded_by"
        uuid uploaded_by FK "XOR with doc_source_id"
        ingestion_status status "in-progress, completed, failed"
        integer attempt_count "processing attempts"
        jsonb source_metadata "origin-specific data"
        text ocr_file_path "S3: ingestions/{id}/ocr-output.edn"
        text preprocessed_file_path "S3: ingestions/{id}/preprocessed.{ext}"
        numeric ocr_quality_score "0-1 range"
        integer ocr_page_count
        text ocr_method "google-document-ai, pdf-lib"
        timestamptz ocr_started_at
        timestamptz ocr_ended_at
        jsonb llm_response "raw LLM response"
        integer llm_input_tokens
        integer llm_output_tokens
        text llm_model
        timestamptz llm_started_at
        timestamptz llm_ended_at
        jsonb structured_data "extraction output"
        boolean valid_structured_data
        timestamptz started_at "when claimed"
        timestamptz completed_at
        timestamptz created_at
    }

    %% ============================================
    %% THREE-WAY MATCHING (Future)
    %% ============================================

    po_invoice_match_futurePlanned {
        uuid id PK
        uuid purchase_order_id FK
        uuid invoice_id FK
        numeric confidence "0-100"
        match_status status "pending, confirmed, rejected"
        timestamptz created_at
        timestamptz updated_at
    }

    po_grn_match_futurePlanned {
        uuid id PK
        uuid purchase_order_id FK
        uuid goods_received_note_id FK
        numeric confidence "0-100"
        match_status status "pending, confirmed, rejected"
        timestamptz created_at
        timestamptz updated_at
    }

    %% ============================================
    %% RELATIONSHIPS: Tenant & User
    %% ============================================

    tenant ||--o{ tenant_membership : "has"
    user ||--o{ tenant_membership : "belongs to"

    %% ============================================
    %% RELATIONSHIPS: Doc source hierarchy
    %% ============================================

    tenant ||--o{ ap_doc_source : "has"
    ap_doc_source ||--o| ap_doc_source_email : "extends"
    ap_doc_source ||--o| ap_doc_source_webhook : "extends"

    %% ============================================
    %% RELATIONSHIPS: Document hierarchy
    %% ============================================

    tenant ||--o{ document : "has"
    document ||--o| invoice : "extends"
    document ||--o| purchase_order_futurePlanned : "extends"
    document ||--o| goods_received_note_futurePlanned : "extends"
    document ||--o| contract_futurePlanned : "extends"

    %% ============================================
    %% RELATIONSHIPS: Ingestion
    %% ============================================

    document ||--o{ ap_ingestion : "has"
    ap_doc_source |o--o{ ap_ingestion : "triggered"
    user |o--o{ ap_ingestion : "triggered"

    %% ============================================
    %% RELATIONSHIPS: Three-way matching
    %% ============================================

    purchase_order_futurePlanned ||--o{ po_invoice_match_futurePlanned : "matched to"
    invoice ||--o{ po_invoice_match_futurePlanned : "matched to"

    purchase_order_futurePlanned ||--o{ po_grn_match_futurePlanned : "matched to"
    goods_received_note_futurePlanned ||--o{ po_grn_match_futurePlanned : "matched to"

Legend: Entities marked with futurePlanned are planned for a future iteration.


Core Concepts

Multi-tenancy: tenant vs user

The schema separates two distinct concepts:

Entity Represents Examples
tenant A company/organization paying for Orcha "Acme Corp", "TechStartup GmbH"
user An individual human using the system "alice@acme.com", "bob@acme.com"

Why the separation:

Design decisions:

Tenant settings (on tenant table):

User flags:


Document Sources: The Doc-Source Abstraction

Documents can arrive through multiple channels. We model configured channels as ap_doc_source entities.

What qualifies as a doc-source:

What is NOT a doc-source:

Design pattern: Class Table Inheritance

The ap_doc_source table holds common fields. Subtype tables extend it:

ap_doc_source (base)
├── ap_doc_source_email (extends)
└── ap_doc_source_webhook (extends)

The subtype's primary key is also a foreign key to the parent, with ON DELETE CASCADE.


The Document Entity

A document represents a unique piece of content, deduplicated by its content hash. It is a pure "content entity" that does not track:

Key design decisions:

Document fields:

Field Purpose
content_hash SHA-256 for deduplication
file_path S3 location of original file
type Document type discriminator
structured_data Final extraction output (copied from successful ingestion)
needs_human_review Flag set when latest ingestion fails

Derived state:

Condition Meaning
structured_data IS NOT NULL Document is ready
Has in-progress ingestion Document is being processed
needs_human_review = true Latest ingestion failed validation

The Ingestion Entity

An ingestion (table: ap_ingestion) represents a single processing request for a document. This is where:

Key design decisions:

Why this separation:

  1. Same document can come from multiple sources (email forwarded, then uploaded)
  2. Re-processing creates a new ingestion, preserving history
  3. Document becomes a simple "content entity" with deduplication
  4. Easier to reason about: "what processing happened" vs "what content exists"

XOR Constraint on source: Every ingestion must have exactly one source - either a doc_source_id OR an uploaded_by, never both, never neither.

doc_source_id uploaded_by Valid?
SET NULL Yes - from doc-source
NULL SET Yes - manual upload
SET SET No - rejected
NULL NULL No - rejected

ap_ingestion fields:

Field Purpose
document_id Reference to content
doc_source_id / uploaded_by Source of this request (XOR)
status in-progress, completed, failed
attempt_count Number of processing attempts
source_metadata Origin-specific data
started_at When claimed (used as lock for crash recovery)
ocr_* OCR timing, quality, method
llm_* LLM timing, tokens, model
structured_data Extraction output for this run

Claim mechanism: The started_at field serves as a distributed lock. An ingestion can only be claimed if started_at IS NULL (never claimed) or older than a configurable threshold (previous worker crashed).


Document Type Hierarchy

All document types share a common ingestion pipeline and base attributes. We model this using Class Table Inheritance:

document (base)
├── invoice (extends)
├── purchase_order (future)
├── goods_received_note (future)
└── contract (future)

Why this pattern:

Subtype tables are minimal - only fields that make sense to query/index are normalized. Everything else lives in structured_data.


Database Trigger: AP Ingestion → Document

When an ingestion completes (status changes to completed or failed), a trigger copies the result to the document:

CREATE OR REPLACE FUNCTION update_document_from_ingestion()
RETURNS TRIGGER AS $$
BEGIN
    IF NEW.status = 'completed' AND OLD.status = 'in-progress' THEN
        UPDATE document
        SET
            structured_data = NEW.structured_data,
            needs_human_review = NOT COALESCE(NEW.valid_structured_data, true),
            updated_at = now()
        WHERE id = NEW.document_id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

This keeps the document's structured_data synchronized with the latest successful AP ingestion while maintaining the complete ingestion history.


Three-Way Matching (Future)

This feature is planned for a future iteration.

In procurement, a "three-way match" verifies that:

  1. Purchase Order - what was ordered
  2. Goods Receipt - what was delivered
  3. Invoice - what is being billed

All three should align before payment is approved.

Planned relationships:

purchase_order ←─── po_invoice_match ───→ invoice
       │
       └──── po_grn_match ───→ goods_received_note

Match tables will store:


Conventions

Naming

Types

Constraints

Class Table Inheritance Pattern