Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.
The Orcha ingestion pipeline processes invoices, POs, contracts, and GRNs through LLM-based extraction and post-processing. When the AI produces wrong results (wrong account assignment, missed supplier details, incorrect totals), there is no systematic way to verify fixes or prevent regressions on known failure cases.
The existing /ingestion-regression-test skill answers "did my code change break anything?" by diffing before/after. This framework answers "does the pipeline produce the correct output for this document?" by comparing against curated golden files.
dev/
  correctness/
    manifest.edn                 # test case + match group definitions
    pdfs/                        # committed test PDFs (5-10 files, ~10MB)
      inv-001-standard.pdf
      inv-002-credit-note.pdf
      inv-003-mixed-tax-rates.pdf
      inv-004-with-po.pdf
      inv-005-ocr-difficult.pdf
      inv-006-wrong-accounts.pdf
      inv-007-supplier-edge.pdf
      inv-008-multi-page.pdf
      con-001-contract.pdf
      po-001-purchase-order.pdf
    golden/                      # expected structured_data (EDN)
      inv-001-standard.edn
      ...
      match-001-invoice-po.edn   # expected matching edges
dev/getorcha/
  correctness.clj                # runner namespace
  correctness/
    diff.clj                     # semantic diff algorithm
{:defaults
 {:legal-entity {:name    "Test GmbH"
                 :country "DE"
                 :vat-id  "DE123456789"
                 :tax-id  "123/456/78901"
                 :address "Musterstraße 1, 12345 Berlin"}}

 :cases
 [{:id          "inv-001-standard"
   :description "Standard German invoice, happy path"
   :pdf         "pdfs/inv-001-standard.pdf"
   :golden      "golden/inv-001-standard.edn"
   :type        "invoice"
   :tags        #{:extraction :accounts :cost-center :validation}}
  {:id          "inv-002-credit-note"
   :description "Credit note: tests invoice-subtype classification"
   :pdf         "pdfs/inv-002-credit-note.pdf"
   :golden      "golden/inv-002-credit-note.edn"
   :type        "invoice"
   :tags        #{:extraction :subtype}}
  ;; ... more cases
  ]

 :match-groups
 [{:id          "match-001-invoice-po"
   :description "Invoice with PO reference: tests candidate retrieval + evidence scoring"
   :cases       ["inv-004-with-po" "po-001-purchase-order"]
   :expected-edges
   [{:a         "inv-004-with-po"
     :b         "po-001-purchase-order"
     :min-score 0.7}]}]}
Cases can override :legal-entity from defaults. The type field determines which Malli schema to validate against.
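The override rule can be sketched as a plain merge of the defaults into the case. This is only an illustration: `resolve-case` is a hypothetical helper name, and the shallow merge (a case-level :legal-entity replaces the default map wholesale rather than being deep-merged) is an assumption, not something the design specifies.

```clojure
;; Hypothetical helper, not part of the documented API.
(defn resolve-case
  "Finds the case with `case-id` in `manifest` and fills in any keys it
  does not set from :defaults. Returns nil when the case does not exist.
  Assumption: shallow merge, so a case's :legal-entity replaces the default."
  [manifest case-id]
  (when-let [c (some #(when (= case-id (:id %)) %) (:cases manifest))]
    (merge (:defaults manifest) c)))

(def example-manifest
  {:defaults {:legal-entity {:name "Test GmbH" :country "DE"}}
   :cases    [{:id "inv-001-standard" :type "invoice"}
              {:id "inv-007-supplier-edge" :type "invoice"
               :legal-entity {:name "Other GmbH" :country "AT"}}]})

(resolve-case example-manifest "inv-001-standard")
;; the case inherits the default :legal-entity
(resolve-case example-manifest "inv-007-supplier-edge")
;; the case keeps its own :legal-entity
```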
Some post-processors require master data to be present for the legal entity:
- Chart of accounts (gl_account_dataset)
- Cost centers (cost_center_dataset)
- Business partners (business_partner_dataset)

The manifest supports an optional :master-data key per case (or in defaults) pointing to EDN files with seed data:
{:defaults
 {:legal-entity {...}
  :master-data  {:chart-of-accounts "master-data/chart-of-accounts.edn"
                 :cost-centers      "master-data/cost-centers.edn"
                 :business-partners "master-data/business-partners.edn"}}}
These are loaded and inserted into the DB before running the test case. The runner cleans them up afterward. This keeps test cases self-contained — no dependency on what's currently in the dev database.
The diff function classifies every field in structured_data into categories that determine comparison behavior:
| Category | Example paths | Comparison |
|---|---|---|
| Exact numeric | :total, :subtotal, :tax-amount, [:line-items * :amount], [:line-items * :quantity], [:line-items * :unit-price] | Round to 2 decimal places, then exact match |
| Exact string | :invoice-number, :invoice-date, :currency, [:issuer :vat-id], [:issuer :iban] | Trim whitespace, then exact match |
| Fuzzy string | [:issuer :name], [:recipient :name], [:issuer :address] | Case-insensitive, normalize whitespace |
| Enum | :invoice-subtype, :document-type, [:line-items * :category] | Exact match (already normalized) |
| Assignment | [:line-items * :debit-account :number], [:line-items * :cost-center :number] | Exact match on :number only |
| Confidence | Any key ending in :confidence | Tolerance ±0.15 |
| Ignored | Any key ending in :reasoning, :match-reasoning, :suggestion | Skip entirely |
| Validation | [:validation-results * :status] | Exact on :status, ignore :message/:details/:reasoning |
| Fraud | [:fraud-flags * :type], [:fraud-flags * :severity] | Exact on :type + :severity, ignore :message/:details |
| Null-equivalent | All fields | nil and absent key are equivalent |
(def default-config
  {:number-tolerance     0.005 ;; absolute tolerance for numeric comparison
   :confidence-tolerance 0.15  ;; absolute tolerance for confidence scores

   ;; Keys whose values are ignored entirely in comparison
   :ignored-keys #{:reasoning :match-reasoning :suggestion}

   ;; Keys compared case-insensitively with whitespace normalization
   :fuzzy-string-keys #{:name :address}

   ;; For vectors of maps: which key to sort by before comparing
   :sort-keys {:line-items            :description
               :fraud-flags           :type
               :tax-issues            :type
               :compliance-statements :type
               :prepayments           :description
               :tax-rate-breakdowns   :rate}

   ;; Sub-maps where only specific keys are compared
   :partial-compare {:validation-results {:keys [:status]}
                     :fraud-flags        {:keys [:type :severity :rule-id]}}})
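To make the categories concrete, here is a minimal sketch of how a single expected/actual pair might be classified using a config of this shape. `compare-leaf`, `normalize-string`, and `sketch-config` are hypothetical names, and the sketch covers only the numeric, confidence, string, and ignored categories; it uses the absolute tolerances from the config rather than rounding, and leaves sorting and partial comparison out.

```clojure
(require '[clojure.string :as str])

;; Trimmed-down copy of the config keys this sketch needs.
(def sketch-config
  {:number-tolerance     0.005
   :confidence-tolerance 0.15
   :ignored-keys         #{:reasoning :match-reasoning :suggestion}
   :fuzzy-string-keys    #{:name :address}})

(defn- normalize-string
  "Trims, collapses runs of whitespace, and lower-cases."
  [s]
  (-> s str/trim (str/replace #"\s+" " ") str/lower-case))

(defn compare-leaf
  "Classifies one leaf as :ignored, :identical, :trivial, or :material.
  `k` is the last keyword of the path, e.g. :amount for
  [:line-items 0 :amount]."
  [config k expected actual]
  (cond
    (contains? (:ignored-keys config) k) :ignored
    ;; nil and an absent key both arrive here as nil, so they compare equal
    (= expected actual) :identical

    (and (number? expected) (number? actual))
    (let [tol (if (str/ends-with? (name k) "confidence")
                (:confidence-tolerance config)
                (:number-tolerance config))]
      (if (<= (Math/abs (double (- expected actual))) tol)
        :trivial
        :material))

    (and (string? expected) (string? actual))
    (let [norm (if (contains? (:fuzzy-string-keys config) k)
                 normalize-string
                 str/trim)]
      (if (= (norm expected) (norm actual)) :trivial :material))

    :else :material))
```

A full diff would walk both maps and call something like this at every leaf, collecting the path alongside each classification.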
{:verdict  :material-diff ;; one of :identical, :trivial-only, :material-diff
 :trivial  [{:path     [:line-items 0 :amount]
             :expected 100.0
             :actual   100.00
             :reason   :number-rounding}]
 :material [{:path     [:invoice-number]
             :expected "INV-001"
             :actual   "INV-0001"
             :reason   :value-mismatch}
            {:path     [:line-items 0 :debit-account :number]
             :expected "4800"
             :actual   "6300"
             :reason   :value-mismatch}]
 :ignored  [{:path   [:line-items 0 :debit-account :reasoning]
             :reason :ignored-key}]}
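Given a result of this shape, the verdict follows directly from which buckets are non-empty. The sketch below shows one way to derive it; `verdict` is a hypothetical helper name, and it assumes ignored entries never affect the outcome.

```clojure
;; Hypothetical helper: derives the verdict from the diff buckets.
(defn verdict
  "Returns :material-diff if any material difference exists,
  :trivial-only if only trivial ones do, and :identical otherwise.
  Entries under :ignored are assumed not to influence the verdict."
  [{:keys [trivial material]}]
  (cond
    (seq material) :material-diff
    (seq trivial)  :trivial-only
    :else          :identical))

(verdict {:trivial [] :material [] :ignored [{:path [:issuer :reasoning]}]})
(verdict {:trivial [{:path [:issuer :name]}] :material []})
(verdict {:trivial [] :material [{:path [:invoice-number]}]})
```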
Line items are the most complex part to diff. Before comparing element-by-element, the diff function:
1. Sorts both vectors by :description (stable sort)
2. Pairs items by :description similarity (Jaro-Winkler) and reports unmatched items

(ns dev.getorcha.correctness
"Pipeline correctness testing framework.
Runs test cases against the local dev system and compares output
to golden-file snapshots using semantic diff.")
(defn run!
"Run a single test case by ID. Returns result map with :verdict and :diffs."
[case-id])
(defn run-all!
"Run all test cases. Prints summary table. Returns vector of results."
[])
(defn run-tagged!
"Run all test cases matching any of the given tags."
[& tags])
(defn run-match-group!
"Run a match group: ingest all documents, trigger matching, verify edges."
[group-id])
(defn update-golden!
"Accept the current pipeline output as the new golden file for a case.
Runs the pipeline if no recent result is cached."
[case-id])
(defn create-golden!
"First-time setup: run the pipeline and save output as golden file.
Does not compare — just captures."
[case-id])
(defn show-diff!
"Print detailed diff for a case (from most recent run)."
[case-id])
The runner requires a running Integrant system (REPL with (go) or (reset) already called). On first call, it checks for integrant.repl.state/system and throws a clear error if the system isn't running. It accesses db-pool and aws config from the system map.
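The guard described above could look roughly like the following. `require-system!` is a hypothetical name, and the system is passed in as an argument so the sketch stays self-contained; in the runner the value would presumably come from integrant.repl.state/system, which is nil until the system has been started.

```clojure
;; Hypothetical guard; the real runner would pass integrant.repl.state/system.
(defn require-system!
  "Returns `system` unchanged when it looks like a started system map,
  otherwise throws an ex-info that tells the user how to start it."
  [system]
  (when-not (map? system)
    (throw (ex-info "Integrant system is not running. Call (go) or (reset) in the REPL first."
                    {:system system})))
  system)

;; Usage with a stand-in system map (the keys here are illustrative only):
(let [system (require-system! {:db-pool :fake-pool :aws {:region "eu-central-1"}})]
  (:db-pool system))
```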
Steps for run!:
1. Read manifest, find case by ID
2. Create legal entity in DB (if not exists)
3. Insert master data (chart of accounts, cost centers, business partners)
4. Read PDF bytes from dev/correctness/pdfs/
5. Call queue-for-ingestion! (uploads to S3, creates document + ingestion, queues to SQS)
6. Poll ingestion status every 5 seconds, timeout after 5 minutes
7. On completion: read structured_data from ingestion table
8. Read golden file from dev/correctness/golden/
9. Run semantic diff
10. Clean up: delete document + ingestion + master data (transaction rollback or explicit delete)
11. Return result map
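The polling step (check every 5 seconds, give up after 5 minutes) can be sketched as a small generic loop. `poll-until` is a hypothetical helper, and the status keywords :completed and :failed are assumptions; the actual status values stored on the ingestion row may differ.

```clojure
;; Hypothetical polling helper; status values are assumed, not documented.
(defn poll-until
  "Calls `status-fn` repeatedly until it returns a terminal status or the
  timeout elapses. Returns the terminal status, or :timeout."
  [status-fn {:keys [interval-ms timeout-ms terminal?]
              :or   {interval-ms 5000          ; every 5 seconds
                     timeout-ms  300000        ; 5 minutes
                     terminal?   #{:completed :failed}}}]
  (let [deadline (+ (System/currentTimeMillis) timeout-ms)]
    (loop []
      (let [status (status-fn)]
        (cond
          (terminal? status)                       status
          (>= (System/currentTimeMillis) deadline) :timeout
          :else (do (Thread/sleep (long interval-ms))
                    (recur)))))))

;; Usage with a fake status source that completes on the third check:
(let [calls (atom 0)]
  (poll-until #(if (< (swap! calls inc) 3) :processing :completed)
              {:interval-ms 10 :timeout-ms 1000}))
```

Taking the status function as an argument keeps the loop independent of the database, so the same helper could serve the match-group flow by checking that every ingestion in the group has finished.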
Cleanup strategy: Each test case creates its own legal entity with a unique name (e.g. "correctness-test-inv-001-standard"). After the test, delete the legal entity (cascade deletes documents, ingestions, etc.). This avoids polluting the dev database.
Steps for run-match-group!:
1. Read manifest, find match group
2. Create shared legal entity
3. Insert master data
4. For each case in the group:
a. Upload PDF, create document, queue ingestion
b. Record document ID
5. Poll all ingestions until all complete (timeout 5 min)
6. Trigger matching for each document (publish to matching queue)
7. Poll for matching completion (check ap_document_match table)
8. Verify expected edges exist with minimum scores
9. Clean up
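The edge verification in step 8 might be sketched as follows. `failed-edges` and `edge-key` are hypothetical names. The sketch assumes that actual matches have already been translated from document IDs back to case IDs, that each carries a :score, and that edges are undirected (a match from the PO to the invoice satisfies an expected invoice-to-PO edge).

```clojure
;; Hypothetical helpers; the shape of `actual-edges` is an assumption.
(defn edge-key
  "Treats an edge as undirected by keying it on the set of its endpoints."
  [{:keys [a b]}]
  #{a b})

(defn failed-edges
  "Returns the expected edges that are missing or scored below :min-score,
  each annotated with :actual-score (nil when no match was found).
  An empty vector means every expected edge was satisfied."
  [expected-edges actual-edges]
  (let [score-by-pair (into {} (map (juxt edge-key :score)) actual-edges)]
    (vec
     (for [{:keys [min-score] :as edge} expected-edges
           :let  [score (get score-by-pair (edge-key edge))]
           :when (or (nil? score) (< score min-score))]
       (assoc edge :actual-score score)))))

(def expected
  [{:a "inv-004-with-po" :b "po-001-purchase-order" :min-score 0.7}])

;; Reversed direction still counts as the same edge:
(failed-edges expected
              [{:a "po-001-purchase-order" :b "inv-004-with-po" :score 0.82}])
```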
Example output (run-all!):
Pipeline Correctness Results (2026-04-18 14:23:01)
═══════════════════════════════════════════════════
| Case | Type | Verdict | Material | Trivial | Time |
|-----------------------|----------|---------------|----------|---------|-------|
| inv-001-standard | invoice | identical | 0 | 0 | 42s |
| inv-002-credit-note | invoice | material-diff | 2 | 3 | 38s |
| inv-003-mixed-tax | invoice | trivial-only | 0 | 5 | 55s |
3 cases: 1 identical, 1 trivial-only, 1 material-diff
Example output (show-diff!):
Diff for inv-002-credit-note
════════════════════════════
MATERIAL (2):
  [:invoice-subtype]
    expected: "credit-note"
    actual:   "standard-invoice"
  [:line-items 0 :debit-account :number]
    expected: "4800"
    actual:   "6300"
TRIVIAL (3):
  [:line-items 0 :amount] 100.0 → 100.00 (number-rounding)
  [:issuer :name] "Müller GmbH" → "Müller GmbH " (whitespace)
  [:line-items 1 :cost-center :confidence] 0.85 → 0.78 (confidence-drift)
| ID | Description | What it tests |
|---|---|---|
| inv-001-standard | Standard German invoice, happy path | Full pipeline baseline: extraction, accounts, cost center, validation |
| inv-002-credit-note | Credit note (Gutschrift) | Invoice subtype classification, negative amounts |
| inv-003-mixed-tax-rates | Invoice with 7% and 19% VAT lines | Tax rate breakdowns, per-line tax validation, tax math |
| inv-004-with-po | Invoice referencing a PO number | PO reference extraction (used in matching test) |
| inv-005-ocr-difficult | Scan with poor quality / unusual layout | Transcription resilience, extraction from noisy text |
| inv-006-wrong-accounts | Past failure: accounts were assigned incorrectly | Account assignment correctness |
| inv-007-supplier-edge | Supplier with tricky VAT ID or name | Supplier matching, VIES verification |
| inv-008-multi-page | Large multi-page invoice (>10 pages) | Chunking/merging in extraction |
| con-001-contract | Contract document | Different structured_data schema, contract extraction |
| po-001-purchase-order | Purchase order | PO extraction, used in match group with inv-004 |
| ID | Documents | Tests |
|---|---|---|
| match-001-invoice-po | inv-004 + po-001 | Candidate retrieval, evidence scoring, auto-match (score ≥ 0.70) |
After implementation, verify by:
1. Run (correctness/create-golden! "inv-001-standard") with a known-good PDF: should produce a golden file
2. Run (correctness/run! "inv-001-standard"): should return :identical (same system, same input)
3. Change a material field in the golden file and run the case again: should return :material-diff
4. Run (correctness/run-all!): should print the summary table
5. Run (correctness/update-golden! "inv-001-standard"): should overwrite the golden file
6. Run (correctness/run-match-group! "match-001-invoice-po"): should verify matching edges