Orcha's ingestion pipeline currently processes only invoices. Documents classified as contracts, purchase orders, or goods received notes are rejected: they are deleted from S3 and the database, and the user is notified. The business needs to ingest and extract structured data from these document types.
This is the initial showcase; full post-processing for the new types is future work.
Classify all document types, extract type-specific structured data using dedicated schemas and prompts, and store results. No cross-document matching. No post-processing beyond basic validation for new types.
Extract shared components (TaxIdType, Issuer, Recipient, Confidence, DocumentType, ClassificationFields) into schema/common.clj.
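A minimal sketch of what schema/common.clj could look like after the extraction. Only the schema names come from this plan; every field, enum value, and type below is an illustrative assumption, not the existing invoice schema.

```clojure
(ns schema.common
  (:require [malli.core :as m]))

;; ASSUMPTION: the real tax-id categories are not listed in this plan.
(def TaxIdType [:enum :vat :national :other])

;; ASSUMPTION: document types are modelled as keywords, and :invoice is the
;; value that already exists today.
(def DocumentType
  [:enum :invoice :contract :purchase-order :goods-received-note :other])

;; ASSUMPTION: confidence is a score between 0.0 and 1.0.
(def Confidence [:double {:min 0.0 :max 1.0}])

;; ASSUMPTION: a party carries a name plus optional address and tax id.
(def Party
  [:map
   [:name :string]
   [:address {:optional true} [:maybe :string]]
   [:tax-id {:optional true} [:maybe :string]]
   [:tax-id-type {:optional true} [:maybe TaxIdType]]])

;; Issuer and Recipient share one shape here; they may well differ in practice.
(def Issuer Party)
(def Recipient Party)

(def ClassificationFields
  [:map
   [:document-type DocumentType]
   [:confidence Confidence]])

(defn valid-classification?
  "True when m satisfies ClassificationFields."
  [m]
  (m/validate ClassificationFields m))
```

With these definitions, `(valid-classification? {:document-type :contract :confidence 0.92})` passes, while an unknown document type or an out-of-range confidence fails.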
schema/
├── common.clj # Shared party/classification schemas
├── document.clj # Add types to Type enum
├── invoice/
│ └── structured_data.clj # Invoice-specific schema (moved)
├── purchase_order/
│ └── structured_data.clj # PO schema (new)
├── contract/
│ └── structured_data.clj # Contract schema (new)
├── grn/
│ └── structured_data.clj # GRN schema (new)
└── structured_data.clj # Dispatch layer (Malli :multi on document-type)
schema/structured_data.clj becomes a thin dispatch layer using Malli :multi. All existing code requiring schema.structured-data/StructuredData continues to work.
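The dispatch layer described above might be sketched as follows. The per-type schemas are reduced to inline placeholders so the example stands alone; in the planned layout they would be required from the per-type namespaces. The :document-type key and the :invoice-number field are assumptions.

```clojure
(ns schema.structured-data
  (:require [malli.core :as m]))

;; Placeholder per-type schemas (hypothetical, heavily trimmed).
(def InvoiceData       [:map [:document-type [:= :invoice]] [:invoice-number :string]])
(def PurchaseOrderData [:map [:document-type [:= :purchase-order]] [:po-number :string]])
(def ContractData      [:map [:document-type [:= :contract]] [:contract-number :string]])
(def GrnData           [:map [:document-type [:= :goods-received-note]] [:grn-number :string]])

;; Malli :multi picks the branch whose key equals the dispatch result.
;; ASSUMPTION: the structured-data map itself carries :document-type.
(def StructuredData
  [:multi {:dispatch :document-type}
   [:invoice InvoiceData]
   [:purchase-order PurchaseOrderData]
   [:contract ContractData]
   [:goods-received-note GrnData]])

(defn valid?
  "True when data matches the schema selected by its :document-type."
  [data]
  (m/validate StructuredData data))
```

Because the var name StructuredData is unchanged, existing callers that validate against it would not need edits. A map whose :document-type has no branch (for example :other in this sketch) fails validation; whether :other should get its own branch is left open here.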
Purchase Order: po-number, po-date, buyer (Recipient), supplier (Issuer), currency, total-value, status, line items (description, article-code, quantity, unit, unit-price, amount, delivery-date, tax-rate), logistics (delivery-date, address, incoterms), commercial (payment-terms, discount-terms, validity-date), references (contract-reference, requisition-number), approval (authorized-by, approval-date).
Contract: contract-number, title, contract-type (service/supply/lease/NDA/framework/other), effective-date, expiration-date, currency, total-value, parties (party-a/b with name, address, tax-id, signatory, role), terms (payment-schedule, renewal-type, termination), scope (description, deliverables, SLAs), financial (base-fee, variable-components, penalties), references (PO refs, predecessor), legal (governing-law, jurisdiction, liability-cap, insurance, confidentiality).
GRN: grn-number, receipt-date, receiving-location, references (po-reference, delivery-note-number, shipping-reference), parties (supplier, receiver), line items (description, qty-ordered/received/rejected, unit, condition, rejection-reason), logistics (carrier, delivery-date, method), inspection (inspector, date, notes, quality-assessment), sign-off (received-by, approved-by).
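As one illustration of how the field lists above could translate into Malli, here is a trimmed sketch of the purchase order schema. Field names follow the list for Purchase Order; the types, the choice of which fields are required, and the :line-items key name are assumptions.

```clojure
(ns schema.purchase-order.structured-data
  (:require [malli.core :as m]))

;; ASSUMPTION: dates are kept as strings as extracted; amounts are plain numbers.
(def LineItem
  [:map
   [:description :string]
   [:article-code {:optional true} [:maybe :string]]
   [:quantity number?]
   [:unit {:optional true} [:maybe :string]]
   [:unit-price {:optional true} [:maybe number?]]
   [:amount {:optional true} [:maybe number?]]
   [:delivery-date {:optional true} [:maybe :string]]
   [:tax-rate {:optional true} [:maybe number?]]])

;; ASSUMPTION: only :po-number and :line-items are required; extraction may
;; legitimately miss everything else.
(def PurchaseOrderData
  [:map
   [:po-number :string]
   [:po-date {:optional true} [:maybe :string]]
   [:currency {:optional true} [:maybe :string]]
   [:total-value {:optional true} [:maybe number?]]
   [:status {:optional true} [:maybe :string]]
   [:line-items [:vector LineItem]]
   [:references {:optional true}
    [:map
     [:contract-reference {:optional true} [:maybe :string]]
     [:requisition-number {:optional true} [:maybe :string]]]]])

(defn valid-po?
  "True when po satisfies PurchaseOrderData."
  [po]
  (m/validate PurchaseOrderData po))
```

The contract and GRN schemas would follow the same pattern, with buyer and supplier reusing Recipient and Issuer from schema/common.clj.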
Add contract, purchase-order, goods-received-note, other to document_type ENUM.