Document Management Part 1: Multi-Document Type Ingestion

Context

Orcha's ingestion pipeline currently processes only invoices. Documents classified as contracts, purchase orders, or goods received notes are rejected: they are deleted from S3 and the database, and the user is notified. The business needs to ingest these new document types and extract structured data from them.

This is the initial showcase; full post-processing for the new types is future work.

Goal

Classify all document types, extract type-specific structured data using dedicated schemas and prompts, and store results. No cross-document matching. No post-processing beyond basic validation for new types.

Approach: Shared Common Schema + Type-Specific Folders

Schema Reorganization

Extract shared components (TaxIdType, Issuer, Recipient, Confidence, DocumentType, ClassificationFields) into schema/common.clj.

schema/
├── common.clj                          # Shared party/classification schemas
├── document.clj                        # Add types to Type enum
├── invoice/
│   └── structured_data.clj             # Invoice-specific schema (moved)
├── purchase_order/
│   └── structured_data.clj             # PO schema (new)
├── contract/
│   └── structured_data.clj             # Contract schema (new)
├── grn/
│   └── structured_data.clj             # GRN schema (new)
└── structured_data.clj                 # Dispatch layer (Malli :multi on document-type)

schema/structured_data.clj becomes a thin dispatch layer using Malli :multi. All existing code requiring schema.structured-data/StructuredData continues to work.
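
The dispatch layer described above could look roughly like the following sketch. It assumes the extracted data is a map with keyword keys and a :document-type entry, and it inlines two tiny stand-in schemas instead of requiring the per-type namespaces. The InvoiceData and PurchaseOrderData names and the :invoice-number field are illustrative assumptions, not the real schemas.

```clojure
(ns schema.structured-data-sketch
  (:require [malli.core :as m]))

;; Hypothetical stand-ins for the per-type schemas. In the layout above these
;; would live in schema/invoice/structured_data.clj and
;; schema/purchase_order/structured_data.clj.
(def InvoiceData
  [:map
   [:document-type [:= :invoice]]
   [:invoice-number :string]])

(def PurchaseOrderData
  [:map
   [:document-type [:= :purchase-order]]
   [:po-number :string]])

;; Thin dispatch layer: Malli picks the branch whose key matches the value
;; returned by the :dispatch function, here the :document-type keyword.
(def StructuredData
  [:multi {:dispatch :document-type}
   [:invoice InvoiceData]
   [:purchase-order PurchaseOrderData]])

(def invoice-ok?
  (m/validate StructuredData {:document-type :invoice
                              :invoice-number "INV-1"}))   ; => true

(def po-ok?
  (m/validate StructuredData {:document-type :purchase-order
                              :po-number "PO-7"}))         ; => true

;; A document type with no registered branch fails validation.
(def unknown-ok?
  (m/validate StructuredData {:document-type :contract}))  ; => false
```

Because callers only ever reference StructuredData, adding the contract and GRN branches later would not change any call site.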

New Data Models

Purchase Order: po-number, po-date, buyer (Recipient), supplier (Issuer), currency, total-value, status, line items (description, article-code, quantity, unit, unit-price, amount, delivery-date, tax-rate), logistics (delivery-date, address, incoterms), commercial (payment-terms, discount-terms, validity-date), references (contract-reference, requisition-number), approval (authorized-by, approval-date).

Contract: contract-number, title, contract-type (service, supply, lease, NDA, framework, or other), effective-date, expiration-date, currency, total-value, parties (party-a and party-b, each with name, address, tax-id, signatory, role), terms (payment-schedule, renewal-type, termination), scope (description, deliverables, SLAs), financial (base-fee, variable-components, penalties), references (PO references, predecessor contract), legal (governing-law, jurisdiction, liability-cap, insurance, confidentiality).

Goods Received Note (GRN): grn-number, receipt-date, receiving-location, references (po-reference, delivery-note-number, shipping-reference), parties (supplier, receiver), line items (description, qty-ordered, qty-received, qty-rejected, unit, condition, rejection-reason), logistics (carrier, delivery-date, method), inspection (inspector, date, notes, quality-assessment), sign-off (received-by, approved-by).
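
As an illustration of how one of these models might be expressed, here is a minimal Malli sketch of part of the purchase order. The field names come from the list above. The value types, the choice of required versus optional keys, and the simplified Party schema (standing in for Issuer and Recipient from schema/common.clj) are all assumptions made for the example.

```clojure
(ns schema.purchase-order-sketch
  (:require [malli.core :as m]))

;; Assumed, simplified stand-in for the shared Issuer/Recipient schemas.
(def Party
  [:map
   [:name :string]
   [:tax-id {:optional true} [:maybe :string]]])

;; Subset of the line-item fields; types are assumptions.
(def LineItem
  [:map
   [:description :string]
   [:article-code {:optional true} [:maybe :string]]
   [:quantity number?]
   [:unit {:optional true} [:maybe :string]]
   [:unit-price number?]
   [:amount number?]])

;; Subset of the top-level purchase order fields.
(def PurchaseOrder
  [:map
   [:po-number :string]
   [:po-date :string]                        ; assumed to be an ISO date string
   [:buyer Party]
   [:supplier Party]
   [:currency [:string {:min 3 :max 3}]]     ; assumed 3-letter currency code
   [:total-value number?]
   [:line-items [:vector LineItem]]])

(def sample-po
  {:po-number   "PO-1001"
   :po-date     "2024-05-01"
   :buyer       {:name "Buyer Ltd"}
   :supplier    {:name "Supplier GmbH" :tax-id nil}
   :currency    "EUR"
   :total-value 150.0
   :line-items  [{:description "Widget"
                  :quantity    3
                  :unit-price  50.0
                  :amount      150.0}]})

;; Basic validation only: m/explain returns nil when the data conforms,
;; otherwise a map describing the errors.
(def sample-errors (m/explain PurchaseOrder sample-po))
```

The contract and GRN models could follow the same pattern, with nested maps for groups such as logistics, inspection, and legal.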

Pipeline Changes

Database Migration

Add the values contract, purchase-order, goods-received-note, and other to the document_type ENUM.
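
A minimal sketch of the statements such a migration might run, assuming the database is PostgreSQL and document_type is a native enum type (the source does not name the database). The add-enum-value-sql helper is hypothetical and only builds SQL strings; how they are executed depends on the migration tool in use.

```clojure
(ns migration-sketch
  (:require [clojure.string :as str]))

;; The four values named in this section.
(def new-document-types
  ["contract" "purchase-order" "goods-received-note" "other"])

(defn add-enum-value-sql
  "Builds a PostgreSQL statement that adds one value to an enum type.
   IF NOT EXISTS makes the statement safe to re-run."
  [enum-name value]
  (format "ALTER TYPE %s ADD VALUE IF NOT EXISTS '%s';" enum-name value))

(def migration-statements
  (mapv #(add-enum-value-sql "document_type" %) new-document-types))

;; Print the statements so they can be pasted into a migration file.
(println (str/join "\n" migration-statements))
```

If PostgreSQL is indeed the target, note that versions before 12 do not allow ALTER TYPE ... ADD VALUE inside a transaction block, so the migration may need to run outside one.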

Out of Scope