Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Pipeline Correctness Testing Framework

Context

The Orcha ingestion pipeline processes invoices, POs, contracts, and GRNs through LLM-based extraction and post-processing. When the AI produces wrong results (wrong account assignment, missed supplier details, incorrect totals), there is no systematic way to verify fixes or prevent regressions on known failure cases.

The existing /ingestion-regression-test skill answers "did my code change break anything?" by diffing before/after. This framework answers "does the pipeline produce the correct output for this document?" by comparing against curated golden files.

Architecture

File Structure

dev/
  correctness/
    manifest.edn                          # test case + match group definitions
    pdfs/                                 # committed test PDFs (5-10 files, ~10MB)
      inv-001-standard.pdf
      inv-002-credit-note.pdf
      inv-003-mixed-tax-rates.pdf
      inv-004-with-po.pdf
      inv-005-ocr-difficult.pdf
      inv-006-wrong-accounts.pdf
      inv-007-supplier-edge.pdf
      inv-008-multi-page.pdf
      con-001-contract.pdf
      po-001-purchase-order.pdf
    golden/                               # expected structured_data (EDN)
      inv-001-standard.edn
      ...
      match-001-invoice-po.edn            # expected matching edges
  dev/getorcha/
    correctness.clj                       # runner namespace
    correctness/
      diff.clj                            # semantic diff algorithm

Manifest Format

{:defaults
 {:legal-entity {:name       "Test GmbH"
                 :country    "DE"
                 :vat-id     "DE123456789"
                 :tax-id     "123/456/78901"
                 :address    "Musterstraße 1, 12345 Berlin"}}

 :cases
 [{:id          "inv-001-standard"
   :description "Standard German invoice, happy path"
   :pdf         "pdfs/inv-001-standard.pdf"
   :golden      "golden/inv-001-standard.edn"
   :type        "invoice"
   :tags        #{:extraction :accounts :cost-center :validation}}

  {:id          "inv-002-credit-note"
   :description "Credit note — tests invoice-subtype classification"
   :pdf         "pdfs/inv-002-credit-note.pdf"
   :golden      "golden/inv-002-credit-note.edn"
   :type        "invoice"
   :tags        #{:extraction :subtype}}

  ;; ... more cases
  ]

 :match-groups
 [{:id          "match-001-invoice-po"
   :description "Invoice with PO reference — tests candidate retrieval + evidence scoring"
   :cases       ["inv-004-with-po" "po-001-purchase-order"]
   :expected-edges
   [{:a         "inv-004-with-po"
     :b         "po-001-purchase-order"
     :min-score 0.7}]}]}

Cases can override :legal-entity from defaults. The :type field determines which Malli schema the pipeline output is validated against.
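As a minimal sketch of how a case could be resolved against the defaults: `read-manifest` and `resolve-case` are hypothetical helper names, the case `inv-x` is invented for illustration, and the key-by-key merge of :legal-entity is an assumption (the text above does not say whether an override replaces the whole map or individual keys).

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical helper: read the manifest from disk.
(defn read-manifest [path]
  (edn/read-string (slurp path)))

;; Hypothetical helper: merge :defaults into one case.
;; Assumption: nested maps such as :legal-entity are merged key by key,
;; so a case can override a single field and inherit the rest.
(defn resolve-case [manifest case-id]
  (let [tc (first (filter #(= case-id (:id %)) (:cases manifest)))]
    (when-not tc
      (throw (ex-info "Unknown test case" {:case-id case-id})))
    (merge-with (fn [default override]
                  (if (and (map? default) (map? override))
                    (merge default override)
                    override))
                (:defaults manifest)
                tc)))

;; In-memory manifest for illustration; "inv-x" is not a real case.
(def example-manifest
  {:defaults {:legal-entity {:name "Test GmbH" :country "DE"}}
   :cases    [{:id "inv-001-standard" :type "invoice"}
              {:id "inv-x" :type "invoice" :legal-entity {:country "AT"}}]})

(resolve-case example-manifest "inv-x")
```

Here the resolved `inv-x` case keeps the default :name and takes :country from its own override.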

Master Data Dependencies

Some post-processors require master data to be present for the legal entity:

Accounts matcher: needs a chart of accounts (gl_account_dataset)
Cost center matcher: needs cost center dataset (cost_center_dataset)
Supplier matcher: needs business partner dataset (business_partner_dataset)

The manifest supports an optional :master-data key per case (or in defaults) pointing to EDN files with seed data:

{:defaults
 {:legal-entity {...}
  :master-data  {:chart-of-accounts  "master-data/chart-of-accounts.edn"
                 :cost-centers       "master-data/cost-centers.edn"
                 :business-partners  "master-data/business-partners.edn"}}}

These are loaded and inserted into the DB before running the test case. The runner cleans them up afterward. This keeps test cases self-contained — no dependency on what's currently in the dev database.
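The insert-then-clean-up behaviour could look roughly like the sketch below. `with-master-data`, `insert!` and `delete!` are hypothetical names; the real DB calls are not specified in this document, so they are passed in as functions and faked with an atom here.

```clojure
;; Hypothetical helper: insert seed data, run the test body, always clean up.
;; insert! and delete! stand in for the real DB calls (an assumption).
(defn with-master-data
  [{:keys [insert! delete!]} master-data f]
  (let [inserted (atom [])]
    (try
      (doseq [[dataset rows] master-data]
        (insert! dataset rows)
        (swap! inserted conj dataset))
      (f)
      (finally
        ;; Remove only what was inserted, in reverse order, even if f threw.
        (doseq [dataset (reverse @inserted)]
          (delete! dataset))))))

;; Usage with an in-memory stand-in for the database.
(def fake-db (atom {}))

(def datasets-seen-during-run
  (with-master-data
    {:insert! (fn [k rows] (swap! fake-db assoc k rows))
     :delete! (fn [k] (swap! fake-db dissoc k))}
    {:chart-of-accounts [{:number "4800"}]
     :cost-centers      [{:number "100"}]}
    (fn [] (count @fake-db))))
```

Putting the cleanup in `finally` keeps a failed test case from leaving seed rows behind.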

Semantic Diff Algorithm

Field Classification

The diff function classifies every field in structured_data into categories that determine comparison behavior:

| Category        | Example paths | Comparison |
|-----------------|---------------|------------|
| Exact numeric   | `:total`, `:subtotal`, `:tax-amount`, `[:line-items * :amount]`, `[:line-items * :quantity]`, `[:line-items * :unit-price]` | Round to 2 decimal places, then exact match |
| Exact string    | `:invoice-number`, `:invoice-date`, `:currency`, `[:issuer :vat-id]`, `[:issuer :iban]` | Trim whitespace, then exact match |
| Fuzzy string    | `[:issuer :name]`, `[:recipient :name]`, `[:issuer :address]` | Case-insensitive, normalize whitespace |
| Enum            | `:invoice-subtype`, `:document-type`, `[:line-items * :category]` | Exact match (already normalized) |
| Assignment      | `[:line-items * :debit-account :number]`, `[:line-items * :cost-center :number]` | Exact match on `:number` only |
| Confidence      | Any key ending in `:confidence` | Tolerance ±0.15 |
| Ignored         | Any key ending in `:reasoning`, `:match-reasoning`, `:suggestion` | Skip entirely |
| Validation      | `[:validation-results * :status]` | Exact on `:status`, ignore `:message`/`:details`/`:reasoning` |
| Fraud           | `[:fraud-flags * :type]`, `[:fraud-flags * :severity]` | Exact on `:type` + `:severity`, ignore `:message`/`:details` |
| Null-equivalent | All fields | `nil` and absent key are equivalent |
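One way to express this classification is a function from a path to a category keyword. This is a sketch only: `classify-path` and the key sets are hypothetical, the Validation and Fraud categories are left out for brevity, and anything not recognised falls back to exact string comparison (an assumption).

```clojure
(require '[clojure.string :as str])

(def ignored-keys       #{:reasoning :match-reasoning :suggestion})
(def exact-numeric-keys #{:total :subtotal :tax-amount :amount :quantity :unit-price})
(def fuzzy-keys         #{:name :address})
(def enum-keys          #{:invoice-subtype :document-type :category})

(defn classify-path
  "Return the comparison category for a path such as [:line-items 0 :amount]."
  [path]
  (let [k      (last path)
        ;; nearest enclosing keyword, skipping vector indices
        parent (last (filter keyword? (butlast path)))]
    (cond
      (contains? ignored-keys k)                    :ignored
      (and (keyword? k)
           (str/ends-with? (name k) "confidence"))  :confidence
      (and (= :number k)
           (#{:debit-account :cost-center} parent)) :assignment
      (contains? exact-numeric-keys k)              :exact-numeric
      (contains? fuzzy-keys k)                      :fuzzy-string
      (contains? enum-keys k)                       :enum
      :else                                         :exact-string)))

(classify-path [:line-items 0 :debit-account :number])
(classify-path [:line-items 0 :debit-account :reasoning])
```

The ignored and confidence checks come first so that a `:reasoning` or `:confidence` key nested under an assignment is never compared as an assignment.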

Diff Configuration

(def default-config
  {:number-tolerance  0.005          ;; absolute tolerance for numeric comparison
   :confidence-tolerance 0.15       ;; absolute tolerance for confidence scores

   ;; Keys whose values are ignored entirely in comparison
   :ignored-keys #{:reasoning :match-reasoning :suggestion}

   ;; Keys compared case-insensitively with whitespace normalization
   :fuzzy-string-keys #{:name :address}

   ;; For vectors of maps: which key to sort by before comparing
   :sort-keys {:line-items       :description
               :fraud-flags      :type
               :tax-issues       :type
               :compliance-statements :type
               :prepayments      :description
               :tax-rate-breakdowns :rate}

   ;; Sub-maps where only specific keys are compared
   :partial-compare {:validation-results {:keys [:status]}
                     :fraud-flags        {:keys [:type :severity :rule-id]}}})
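To illustrate how such a configuration might be applied, the sketch below normalizes a vector of maps (partial compare, then sort) and compares two leaf values. `normalize-coll`, `values-match?` and `normalize-str` are hypothetical names, and `cfg` is a trimmed copy of the configuration, not the full map.

```clojure
(require '[clojure.string :as str])

(def cfg ;; trimmed copy of the configuration, for illustration only
  {:number-tolerance  0.005
   :fuzzy-string-keys #{:name :address}
   :sort-keys         {:fraud-flags :type}
   :partial-compare   {:fraud-flags {:keys [:type :severity :rule-id]}}})

(defn normalize-coll
  "Keep only the compared keys of each map, then sort by the configured key."
  [cfg coll-key items]
  (let [sort-key (get-in cfg [:sort-keys coll-key])
        keep-ks  (get-in cfg [:partial-compare coll-key :keys])]
    (cond->> items
      keep-ks  (map #(select-keys % keep-ks))
      sort-key (sort-by sort-key)
      true     vec)))

(defn normalize-str [s]
  (-> s str/trim str/lower-case (str/replace #"\s+" " ")))

(defn values-match?
  "Compare two leaf values stored under key k according to cfg."
  [cfg k expected actual]
  (cond
    (and (number? expected) (number? actual))
    (<= (Math/abs (double (- expected actual))) (:number-tolerance cfg))

    (and (string? expected) (string? actual)
         (contains? (:fuzzy-string-keys cfg) k))
    (= (normalize-str expected) (normalize-str actual))

    :else (= expected actual)))

(normalize-coll cfg :fraud-flags
                [{:type "b" :severity "high" :message "x"}
                 {:type "a" :severity "low"}])
```

Normalizing both the golden and the actual vector before comparison makes element order and ignored sub-keys irrelevant to the verdict.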

Diff Output Structure

{:verdict  :material-diff   ;; one of :identical, :trivial-only, :material-diff
 :trivial  [{:path [:line-items 0 :amount]
             :expected 100.0
             :actual 100.004
             :reason :number-rounding}]
 :material [{:path [:invoice-number]
             :expected "INV-001"
             :actual "INV-0001"
             :reason :value-mismatch}
            {:path [:line-items 0 :debit-account :number]
             :expected "4800"
             :actual "6300"
             :reason :value-mismatch}]
 :ignored  [{:path [:line-items 0 :debit-account :reasoning]
             :reason :ignored-key}]}
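The verdict follows from which buckets are non-empty. A minimal sketch, assuming the diff first produces a flat sequence of entries tagged with a hypothetical :severity key (`build-result` is likewise a hypothetical name):

```clojure
(defn build-result
  "Group flat diff entries into the output structure and derive the verdict."
  [entries]
  (let [{:keys [trivial material ignored]} (group-by :severity entries)
        strip (fn [es] (mapv #(dissoc % :severity) es))]
    {:verdict  (cond
                 (seq material) :material-diff
                 (seq trivial)  :trivial-only
                 :else          :identical)
     :trivial  (strip trivial)
     :material (strip material)
     :ignored  (strip ignored)}))

(build-result
 [{:severity :material :path [:invoice-number]
   :expected "INV-001" :actual "INV-0001" :reason :value-mismatch}
  {:severity :ignored :path [:line-items 0 :debit-account :reasoning]
   :reason :ignored-key}])
```

Ignored entries never influence the verdict; they are kept only so the report can show what was skipped.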

Line Item Matching

Line items are the most complex part to diff. Before comparing element-by-element, the diff function:

1. Sorts both vectors by :description (stable sort)
2. If lengths differ, reports missing/extra items as material diffs
3. If descriptions don't align well (e.g. LLM merged two items into one), falls back to best-effort matching by :description similarity (Jaro-Winkler) and reports unmatched items
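The fallback matching could be sketched as a greedy pairing. Note that this sketch swaps in token-overlap (Jaccard) similarity as a simple stand-in for Jaro-Winkler; the function names and the threshold parameter are hypothetical.

```clojure
(require '[clojure.string :as str]
         '[clojure.set :as set])

(defn tokens [s]
  (set (re-seq #"\p{L}+|\d+" (str/lower-case (or s "")))))

(defn similarity
  "Token-overlap (Jaccard) similarity in [0, 1]; a stand-in for Jaro-Winkler."
  [a b]
  (let [ta (tokens a), tb (tokens b), all (set/union ta tb)]
    (if (empty? all)
      1.0
      (/ (double (count (set/intersection ta tb))) (count all)))))

(defn match-line-items
  "Greedily pair expected and actual items by :description similarity.
   Returns {:pairs [[expected actual] ...] :missing [...] :extra [...]}."
  [expected actual threshold]
  (loop [remaining expected
         pool      (vec actual)
         pairs     []
         missing   []]
    (if-let [e (first remaining)]
      (let [scored (map-indexed
                    (fn [i a] [i (similarity (:description e) (:description a))])
                    pool)
            [best-i best-score] (when (seq scored) (apply max-key second scored))]
        (if (and best-i (>= best-score threshold))
          ;; pair the best candidate and remove it from the pool
          (recur (rest remaining)
                 (into (subvec pool 0 best-i) (subvec pool (inc best-i)))
                 (conj pairs [e (nth pool best-i)])
                 missing)
          (recur (rest remaining) pool pairs (conj missing e))))
      {:pairs pairs :missing missing :extra pool})))

(match-line-items
 [{:description "Consulting services March"} {:description "Travel expenses"}]
 [{:description "Travel expenses"}
  {:description "Consulting services March 2026"}
  {:description "Shipping"}]
 0.5)
```

In this example both expected items find a partner and the unmatched "Shipping" item ends up in :extra, which the diff would report as a material difference.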

Runner Namespace

Public API

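(ns dev.getorcha.correctness
  "Pipeline correctness testing framework.
   Runs test cases against the local dev system and compares output
   to golden-file snapshots using semantic diff."
  ;; run! also exists in clojure.core; exclude it to avoid a shadowing warning.
  (:refer-clojure :exclude [run!]))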

(defn run!
  "Run a single test case by ID. Returns result map with :verdict and :diffs."
  [case-id])

(defn run-all!
  "Run all test cases. Prints summary table. Returns vector of results."
  [])

(defn run-tagged!
  "Run all test cases matching any of the given tags."
  [& tags])

(defn run-match-group!
  "Run a match group: ingest all documents, trigger matching, verify edges."
  [group-id])

(defn update-golden!
  "Accept the current pipeline output as the new golden file for a case.
   Runs the pipeline if no recent result is cached."
  [case-id])

(defn create-golden!
  "First-time setup: run the pipeline and save output as golden file.
   Does not compare — just captures."
  [case-id])

(defn show-diff!
  "Print detailed diff for a case (from most recent run)."
  [case-id])
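The case selection behind `run-tagged!` ("matching any of the given tags") might be as small as the following sketch; `cases-with-any-tag` is a hypothetical helper, and the two cases are copied from the manifest example.

```clojure
(defn cases-with-any-tag
  "Return the cases whose :tags set contains at least one of the given tags."
  [cases tags]
  (let [wanted (set tags)]
    (filterv #(some wanted (:tags %)) cases)))

(def example-cases
  [{:id "inv-001-standard"    :tags #{:extraction :accounts :cost-center :validation}}
   {:id "inv-002-credit-note" :tags #{:extraction :subtype}}])

(mapv :id (cases-with-any-tag example-cases [:subtype]))
```

With `[:subtype]` only the credit note is selected; with `[:accounts :subtype]` both cases are.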

Prerequisites

The runner requires a running Integrant system (REPL with (go) or (reset) already called). On first call, it checks for integrant.repl.state/system and throws a clear error if the system isn't running. It accesses db-pool and aws config from the system map.
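The guard could be sketched as below. `require-system` is a hypothetical name; it takes the system value as an argument so that the snippet runs without Integrant on the classpath. In the runner the argument would be `integrant.repl.state/system`, which is nil until (go) has been called. The keys under which db-pool and aws config live are not specified here, so none are shown.

```clojure
(defn require-system
  "Return the running system map, or throw a descriptive error."
  [system]
  (if (map? system)
    system
    (throw (ex-info
            "Integrant system is not running. Call (go) or (reset) in the REPL first."
            {:system system}))))

;; In the runner (requires integrant.repl on the classpath):
;; (require-system integrant.repl.state/system)

(require-system {:example-component :running})
```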

Execution Flow for `run!`

1. Read manifest, find case by ID
2. Create legal entity in DB (if not exists)
3. Insert master data (chart of accounts, cost centers, business partners)
4. Read PDF bytes from dev/correctness/pdfs/
5. Call queue-for-ingestion! (uploads to S3, creates document + ingestion, queues to SQS)
6. Poll ingestion status every 5 seconds, timeout after 5 minutes
7. On completion: read structured_data from ingestion table
8. Read golden file from dev/correctness/golden/
9. Run semantic diff
10. Clean up: explicitly delete document + ingestion + master data (a transaction rollback would not work, because the ingestion runs asynchronously outside the runner's transaction)
11. Return result map

Cleanup strategy: Each test case creates its own legal entity with a unique name (e.g. "correctness-test-inv-001-standard"). After the test, delete the legal entity (cascade deletes documents, ingestions, etc.). This avoids polluting the dev database.
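Step 6 of the flow (poll every 5 seconds, time out after 5 minutes) can be sketched as a generic polling loop. `poll-until` is a hypothetical helper, and the status strings in the usage example are made up; the real status values of the ingestion table are not given in this document.

```clojure
(defn poll-until
  "Call (fetch) every interval-ms until (done? value) is truthy.
   Returns that value, or throws ex-info once timeout-ms has elapsed."
  [fetch done? {:keys [interval-ms timeout-ms]
                :or   {interval-ms 5000 timeout-ms 300000}}]
  (let [deadline (+ (System/currentTimeMillis) timeout-ms)]
    (loop []
      (let [v (fetch)]
        (cond
          (done? v)
          v

          (>= (System/currentTimeMillis) deadline)
          (throw (ex-info "Timed out waiting for ingestion" {:last-value v}))

          :else
          (do (Thread/sleep (long interval-ms))
              (recur)))))))

;; Usage with a fake status source that completes on the third call.
(def calls (atom 0))

(poll-until #(if (< (swap! calls inc) 3)
               {:status "processing"}
               {:status "completed"})
            #(= "completed" (:status %))
            {:interval-ms 10 :timeout-ms 1000})
```

Passing `fetch` and `done?` as functions keeps the loop reusable for the matching poll in `run-match-group!`.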

Execution Flow for `run-match-group!`

1. Read manifest, find match group
2. Create shared legal entity
3. Insert master data
4. For each case in the group:
   a. Upload PDF, create document, queue ingestion
   b. Record document ID
5. Poll all ingestions until all complete (timeout 5 min)
6. Trigger matching for each document (publish to matching queue)
7. Poll for matching completion (check ap_document_match table)
8. Verify expected edges exist with minimum scores
9. Clean up
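Step 8 could be sketched as follows. The shape of the actual edges (`{:a ... :b ... :score ...}`, already mapped from ap_document_match rows back to case IDs) is an assumption, as is treating edges as undirected; `verify-edges` is a hypothetical name and the score 0.82 is made-up example data.

```clojure
;; Assumption: an edge between a and b is the same as one between b and a.
(defn edge-key [{:keys [a b]}] #{a b})

(defn verify-edges
  "Check each expected edge against the actual edges.
   Returns a vector of {:edge ... :actual-score ... :status ...} where
   :status is :ok, :missing or :score-too-low."
  [expected-edges actual-edges]
  (let [by-key (into {} (map (juxt edge-key identity)) actual-edges)]
    (mapv (fn [{:keys [min-score] :as expected}]
            (let [actual (get by-key (edge-key expected))]
              {:edge         expected
               :actual-score (:score actual)
               :status       (cond
                               (nil? actual)                 :missing
                               (< (:score actual) min-score) :score-too-low
                               :else                         :ok)}))
          expected-edges)))

(verify-edges
 [{:a "inv-004-with-po" :b "po-001-purchase-order" :min-score 0.7}]
 [{:a "po-001-purchase-order" :b "inv-004-with-po" :score 0.82}])
```

A match group would pass when every returned :status is :ok.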

Reporting

Summary Table (printed by `run-all!`)

Pipeline Correctness Results (2026-04-18 14:23:01)
═══════════════════════════════════════════════════
| Case                    | Type     | Verdict       | Material | Trivial | Time  |
|-------------------------|----------|---------------|----------|---------|-------|
| inv-001-standard        | invoice  | identical     | 0        | 0       | 42s   |
| inv-002-credit-note     | invoice  | material-diff | 2        | 3       | 38s   |
| inv-003-mixed-tax-rates | invoice  | trivial-only  | 0        | 5       | 55s   |

3 cases: 1 identical, 1 trivial-only, 1 material-diff
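The closing summary line can be derived from the result vector with `frequencies`. A small sketch, where `summary-line` is a hypothetical helper and each result is assumed to carry a :verdict key:

```clojure
(require '[clojure.string :as str])

(defn summary-line
  "Build a line such as \"3 cases: 1 identical, 1 trivial-only, 1 material-diff\"."
  [results]
  (let [counts (frequencies (map :verdict results))
        parts  (for [v [:identical :trivial-only :material-diff]
                     :let [n (get counts v 0)]
                     :when (pos? n)]
                 (str n " " (name v)))]
    (str (count results) " cases: " (str/join ", " parts))))

(summary-line [{:id "inv-001-standard"        :verdict :identical}
               {:id "inv-002-credit-note"     :verdict :material-diff}
               {:id "inv-003-mixed-tax-rates" :verdict :trivial-only}])
```

Verdicts with a count of zero are omitted from the line, which is a choice of this sketch rather than something the format above prescribes.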

Detailed Diff (printed by `show-diff!`)

Diff for inv-002-credit-note
════════════════════════════
MATERIAL (2):
  [:invoice-subtype]
    expected: "credit-note"
    actual:   "standard-invoice"

  [:line-items 0 :debit-account :number]
    expected: "4800"
    actual:   "6300"

TRIVIAL (3):
  [:line-items 0 :amount] 100.0 → 100.004 (number-rounding)
  [:issuer :name] "Müller GmbH" → "Müller GmbH " (whitespace)
  [:line-items 1 :cost-center :confidence] 0.85 → 0.78 (confidence-drift)

Initial Test Cases

| ID | Description | What it tests |
|----|-------------|---------------|
| `inv-001-standard` | Standard German invoice, happy path | Full pipeline baseline: extraction, accounts, cost center, validation |
| `inv-002-credit-note` | Credit note (Gutschrift) | Invoice subtype classification, negative amounts |
| `inv-003-mixed-tax-rates` | Invoice with 7% and 19% VAT lines | Tax rate breakdowns, per-line tax validation, tax math |
| `inv-004-with-po` | Invoice referencing a PO number | PO reference extraction (used in matching test) |
| `inv-005-ocr-difficult` | Scan with poor quality / unusual layout | Transcription resilience, extraction from noisy text |
| `inv-006-wrong-accounts` | Past failure: accounts were assigned incorrectly | Account assignment correctness |
| `inv-007-supplier-edge` | Supplier with tricky VAT ID or name | Supplier matching, VIES verification |
| `inv-008-multi-page` | Large multi-page invoice (>10 pages) | Chunking/merging in extraction |
| `con-001-contract` | Contract document | Different structured_data schema, contract extraction |
| `po-001-purchase-order` | Purchase order | PO extraction, used in match group with inv-004 |

Match Groups

| ID | Documents | Tests |
|----|-----------|-------|
| `match-001-invoice-po` | inv-004 + po-001 | Candidate retrieval, evidence scoring, auto-match (score ≥ 0.70) |

Verification

After implementation, verify by:

1. Run (correctness/create-golden! "inv-001-standard") with a known-good PDF: should produce a golden file
2. Run (correctness/run! "inv-001-standard"): should return :identical (same system, same input)
3. Temporarily break something (e.g., hardcode a wrong account in the prompt) and re-run: should show :material-diff
4. Run (correctness/run-all!): should print the summary table
5. Run (correctness/update-golden! "inv-001-standard"): should overwrite the golden file
6. Run (correctness/run-match-group! "match-001-invoice-po"): should verify matching edges