Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Pipeline Correctness Testing Framework

Context

The Orcha ingestion pipeline processes invoices, POs, contracts, and GRNs through LLM-based extraction and post-processing. When the AI produces wrong results (wrong account assignment, missed supplier details, incorrect totals), there is no systematic way to verify fixes or prevent regressions on known failure cases.

The existing /ingestion-regression-test skill answers "did my code change break anything?" by diffing before/after. This framework answers "does the pipeline produce the correct output for this document?" by comparing against curated golden files.

Architecture

File Structure

dev/
  correctness/
    manifest.edn                          # test case + match group definitions
    pdfs/                                 # committed test PDFs (5-10 files, ~10MB)
      inv-001-standard.pdf
      inv-002-credit-note.pdf
      inv-003-mixed-tax-rates.pdf
      inv-004-with-po.pdf
      inv-005-ocr-difficult.pdf
      inv-006-wrong-accounts.pdf
      inv-007-supplier-edge.pdf
      inv-008-multi-page.pdf
      con-001-contract.pdf
      po-001-purchase-order.pdf
    golden/                               # expected structured_data (EDN)
      inv-001-standard.edn
      ...
      match-001-invoice-po.edn            # expected matching edges
  dev/getorcha/
    correctness.clj                       # runner namespace
    correctness/
      diff.clj                            # semantic diff algorithm

Manifest Format

{:defaults
 {:legal-entity {:name       "Test GmbH"
                 :country    "DE"
                 :vat-id     "DE123456789"
                 :tax-id     "123/456/78901"
                 :address    "Musterstraße 1, 12345 Berlin"}}

 :cases
 [{:id          "inv-001-standard"
   :description "Standard German invoice, happy path"
   :pdf         "pdfs/inv-001-standard.pdf"
   :golden      "golden/inv-001-standard.edn"
   :type        "invoice"
   :tags        #{:extraction :accounts :cost-center :validation}}

  {:id          "inv-002-credit-note"
   :description "Credit note — tests invoice-subtype classification"
   :pdf         "pdfs/inv-002-credit-note.pdf"
   :golden      "golden/inv-002-credit-note.edn"
   :type        "invoice"
   :tags        #{:extraction :subtype}}

  ;; ... more cases
  ]

 :match-groups
 [{:id          "match-001-invoice-po"
   :description "Invoice with PO reference — tests candidate retrieval + evidence scoring"
   :cases       ["inv-004-with-po" "po-001-purchase-order"]
   :expected-edges
   [{:a         "inv-004-with-po"
     :b         "po-001-purchase-order"
     :min-score 0.7}]}]}

Cases can override :legal-entity from defaults. The :type field determines which Malli schema the pipeline output is validated against.
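
A minimal sketch of how a case could be resolved against :defaults. It assumes a case-level :legal-entity is merged key by key over the default (the document only says cases "can override" it, so a full replacement is equally plausible). The function name resolve-case and the case ID "inv-009-austrian" are hypothetical and used only for illustration.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical inline manifest; the real one would be read from
;; dev/correctness/manifest.edn.
(def manifest
  (edn/read-string
   "{:defaults {:legal-entity {:name \"Test GmbH\" :country \"DE\"}}
     :cases [{:id \"inv-001-standard\" :type \"invoice\"}
             {:id \"inv-009-austrian\" :type \"invoice\"
              :legal-entity {:country \"AT\"}}]}"))

(defn resolve-case
  "Returns the case with :defaults applied, or nil if the ID is unknown.
   Assumption: map-valued keys such as :legal-entity are merged key by key."
  [manifest case-id]
  (when-let [c (first (filter #(= case-id (:id %)) (:cases manifest)))]
    (merge-with (fn [default override]
                  (if (and (map? default) (map? override))
                    (merge default override)
                    override))
                (:defaults manifest)
                c)))
```

With this sketch, (resolve-case manifest "inv-009-austrian") keeps the default :name but takes :country from the case.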

Master Data Dependencies

Some post-processors require master data (chart of accounts, cost centers, business partners) to be present for the legal entity.

The manifest supports an optional :master-data key per case (or in defaults) pointing to EDN files with seed data:

{:defaults
 {:legal-entity {...}
  :master-data  {:chart-of-accounts  "master-data/chart-of-accounts.edn"
                 :cost-centers       "master-data/cost-centers.edn"
                 :business-partners  "master-data/business-partners.edn"}}}

These are loaded and inserted into the DB before running the test case. The runner cleans them up afterward. This keeps test cases self-contained — no dependency on what's currently in the dev database.
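
The insert-then-always-clean-up behaviour described above could be sketched as follows. The helper name with-master-data and the insert!/delete! callbacks are hypothetical stand-ins for whatever DB functions the runner ends up using; the seed data is passed in directly rather than read from the EDN files.

```clojure
(defn with-master-data
  "seed is a sequence of [kind rows] pairs. Calls (insert! kind rows) for each
   pair, runs body-fn, then calls (delete! kind rows) in reverse order for
   everything that was inserted, even if body-fn or a later insert throws.
   insert! and delete! are hypothetical stand-ins for the real DB functions."
  [seed insert! delete! body-fn]
  (let [inserted (atom [])]
    (try
      (doseq [[kind rows] seed]
        (insert! kind rows)
        (swap! inserted conj [kind rows]))
      (body-fn)
      (finally
        (doseq [[kind rows] (reverse @inserted)]
          (delete! kind rows))))))
```

Tracking what was actually inserted means a failure halfway through seeding still removes only the rows that exist.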

Semantic Diff Algorithm

Field Classification

The diff function classifies every field in structured_data into categories that determine comparison behavior:

| Category | Example paths | Comparison |
|----------|---------------|------------|
| Exact numeric | :total, :subtotal, :tax-amount, [:line-items * :amount], [:line-items * :quantity], [:line-items * :unit-price] | Round to 2 decimal places, then exact match |
| Exact string | :invoice-number, :invoice-date, :currency, [:issuer :vat-id], [:issuer :iban] | Trim whitespace, then exact match |
| Fuzzy string | [:issuer :name], [:recipient :name], [:issuer :address] | Case-insensitive, normalize whitespace |
| Enum | :invoice-subtype, :document-type, [:line-items * :category] | Exact match (already normalized) |
| Assignment | [:line-items * :debit-account :number], [:line-items * :cost-center :number] | Exact match on :number only |
| Confidence | Any key ending in :confidence | Tolerance ±0.15 |
| Ignored | Any key ending in :reasoning, :match-reasoning, :suggestion | Skip entirely |
| Validation | [:validation-results * :status] | Exact on :status, ignore :message/:details/:reasoning |
| Fraud | [:fraud-flags * :type], [:fraud-flags * :severity] | Exact on :type + :severity, ignore :message/:details |
| Null-equivalent | All fields | nil and absent key are equivalent |
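
As a rough illustration of the classification, the sketch below maps a path to a category by looking only at its last key. This is a simplification: it covers a subset of the categories, and the function name classify and the category keywords are assumptions, not part of the design.

```clojure
(require '[clojure.string :as str])

(defn- key-ends-with?
  "True if k is a keyword whose name ends with suffix."
  [k suffix]
  (and (keyword? k) (str/ends-with? (name k) suffix)))

(def exact-numeric-keys #{:total :subtotal :tax-amount :amount :quantity :unit-price})
(def fuzzy-string-keys  #{:name :address})

(defn classify
  "Classifies a path (vector of keys and indexes) by its last key.
   Everything not recognised falls back to :exact."
  [path]
  (let [k (last path)]
    (cond
      (or (key-ends-with? k "reasoning") (= :suggestion k)) :ignored
      (key-ends-with? k "confidence")                       :confidence
      (contains? exact-numeric-keys k)                      :exact-numeric
      (contains? fuzzy-string-keys k)                       :fuzzy-string
      :else                                                 :exact)))
```

A real implementation would likely need full path patterns (for example to restrict Assignment fields to :number), which this last-key lookup cannot express.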

Diff Configuration

(def default-config
  {:number-tolerance     0.005   ;; absolute tolerance for numeric comparison
   :confidence-tolerance 0.15    ;; absolute tolerance for confidence scores

   ;; Keys whose values are ignored entirely in comparison
   :ignored-keys #{:reasoning :match-reasoning :suggestion}

   ;; Keys compared case-insensitively with whitespace normalization
   :fuzzy-string-keys #{:name :address}

   ;; For vectors of maps: which key to sort by before comparing
   :sort-keys {:line-items            :description
               :fraud-flags           :type
               :tax-issues            :type
               :compliance-statements :type
               :prepayments           :description
               :tax-rate-breakdowns   :rate}

   ;; Sub-maps where only specific keys are compared
   :partial-compare {:validation-results {:keys [:status]}
                     :fraud-flags        {:keys [:type :severity :rule-id]}}})
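
To show how the tolerances and fuzzy-string keys might be applied to a single leaf value, here is a hedged sketch. The function compare-leaf and its :same/:trivial/:material return values are assumptions; the config map repeats only the keys it needs.

```clojure
(require '[clojure.string :as str])

(def config
  {:number-tolerance     0.005
   :confidence-tolerance 0.15
   :fuzzy-string-keys    #{:name :address}})

(defn- normalize
  "Trims, collapses runs of whitespace, and lower-cases a string."
  [s]
  (-> s str/trim (str/replace #"\s+" " ") str/lower-case))

(defn compare-leaf
  "Returns :same, :trivial, or :material for one leaf value stored under key k."
  [config k expected actual]
  (cond
    (= expected actual) :same

    (and (number? expected) (number? actual))
    (let [tol (if (= :confidence k)
                (:confidence-tolerance config)
                (:number-tolerance config))]
      (if (<= (Math/abs (- (double expected) (double actual))) tol)
        :trivial
        :material))

    (and (string? expected) (string? actual))
    (cond
      (= (str/trim expected) (str/trim actual))          :trivial
      (and (contains? (:fuzzy-string-keys config) k)
           (= (normalize expected) (normalize actual)))  :trivial
      :else                                              :material)

    :else :material))
```

Because nil equals nil, a caller that reads absent keys with get already treats nil and a missing key as equivalent.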

Diff Output Structure

{:verdict  :material-diff   ;; one of :identical, :trivial-only, :material-diff
 :trivial  [{:path [:line-items 0 :amount]
             :expected 100.0
             :actual 100.00
             :reason :number-rounding}]
 :material [{:path [:invoice-number]
             :expected "INV-001"
             :actual "INV-0001"
             :reason :value-mismatch}
            {:path [:line-items 0 :debit-account :number]
             :expected "4800"
             :actual "6300"
             :reason :value-mismatch}]
 :ignored  [{:path [:line-items 0 :debit-account :reasoning]
             :reason :ignored-key}]}
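
The verdict can be derived from the other two collections. A minimal sketch, assuming material diffs take precedence over trivial ones (the function name verdict is hypothetical):

```clojure
(defn verdict
  "Derives the verdict from a diff result's :material and :trivial entries.
   Ignored entries never affect the verdict."
  [{:keys [trivial material]}]
  (cond
    (seq material) :material-diff
    (seq trivial)  :trivial-only
    :else          :identical))
```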

Line Item Matching

Line items are the most complex part to diff. Before comparing element-by-element, the diff function:

  1. Sorts both vectors by :description (stable sort)
  2. If lengths differ, reports missing/extra items as material diffs
  3. If descriptions don't align well (e.g. LLM merged two items into one), falls back to best-effort matching by :description similarity (Jaro-Winkler) and reports unmatched items
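
The steps above could be sketched as a greedy alignment with a pluggable similarity function. Note the assumptions: token-similarity below is a simple word-overlap (Jaccard) stand-in and is NOT Jaro-Winkler; a real Jaro-Winkler function would be passed in its place. The names align-line-items and token-similarity and the threshold value are hypothetical.

```clojure
(require '[clojure.string :as str]
         '[clojure.set :as set])

(defn token-similarity
  "Stand-in similarity: Jaccard overlap of lower-cased words. Not Jaro-Winkler."
  [a b]
  (let [toks  #(set (str/split (str/lower-case (str %)) #"\s+"))
        ta    (toks a)
        tb    (toks b)
        union (set/union ta tb)]
    (if (empty? union)
      0.0
      (double (/ (count (set/intersection ta tb)) (count union))))))

(defn align-line-items
  "Sorts both vectors by :description, then greedily pairs each expected item
   with the most similar unused actual item scoring at least threshold.
   Returns {:pairs [[expected actual] ...] :missing [...] :extra [...]}."
  [expected actual similarity threshold]
  (loop [todo    (sort-by :description expected)
         unused  (vec (sort-by :description actual))
         pairs   []
         missing []]
    (if-let [e (first todo)]
      (let [scored (map-indexed
                    (fn [i a] [i (similarity (:description e) (:description a))])
                    unused)
            [best-i best-score] (when (seq scored) (apply max-key second scored))]
        (if (and best-i (>= best-score threshold))
          (recur (rest todo)
                 ;; drop the matched item so it cannot be paired twice
                 (into (subvec unused 0 best-i) (subvec unused (inc best-i)))
                 (conj pairs [e (nth unused best-i)])
                 missing)
          (recur (rest todo) unused pairs (conj missing e))))
      {:pairs pairs :missing missing :extra unused})))
```

Items left in :missing and :extra would be reported as material diffs; each pair would then be compared field by field.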

Runner Namespace

Public API

(ns dev.getorcha.correctness
  "Pipeline correctness testing framework.
   Runs test cases against the local dev system and compares output
   to golden-file snapshots using semantic diff.")

(defn run!
  "Run a single test case by ID. Returns result map with :verdict and :diffs."
  [case-id])

(defn run-all!
  "Run all test cases. Prints summary table. Returns vector of results."
  [])

(defn run-tagged!
  "Run all test cases matching any of the given tags."
  [& tags])

(defn run-match-group!
  "Run a match group: ingest all documents, trigger matching, verify edges."
  [group-id])

(defn update-golden!
  "Accept the current pipeline output as the new golden file for a case.
   Runs the pipeline if no recent result is cached."
  [case-id])

(defn create-golden!
  "First-time setup: run the pipeline and save output as golden file.
   Does not compare — just captures."
  [case-id])

(defn show-diff!
  "Print detailed diff for a case (from most recent run)."
  [case-id])
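
For run-tagged!, case selection could look like the sketch below, which keeps a case if it carries at least one of the requested tags. The helper name cases-with-any-tag is hypothetical, and the manifest is passed in as data rather than read from disk.

```clojure
(defn cases-with-any-tag
  "Returns the manifest cases whose :tags set contains any of the given tags."
  [manifest tags]
  (let [wanted (set tags)]
    (filterv #(some wanted (:tags %)) (:cases manifest))))

;; Example with a two-case manifest:
(def example-manifest
  {:cases [{:id "inv-001-standard"    :tags #{:extraction :accounts :cost-center :validation}}
           {:id "inv-002-credit-note" :tags #{:extraction :subtype}}]})

(map :id (cases-with-any-tag example-manifest [:subtype :accounts]))
```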

Prerequisites

The runner requires a running Integrant system (REPL with (go) or (reset) already called). On first call, it checks for integrant.repl.state/system and throws a clear error if the system isn't running. It accesses db-pool and aws config from the system map.
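
One way to implement this check without a compile-time dependency on Integrant is sketched below. The function names and the error message wording are assumptions; integrant.repl.state/system is the var named above, and resolve returns nil when that namespace has not been loaded.

```clojure
(defn current-system
  "Looks up integrant.repl.state/system at call time.
   Returns nil if the namespace is not loaded or the system is not started."
  []
  (some-> (resolve 'integrant.repl.state/system) deref))

(defn require-system!
  "Returns the system map, or throws a descriptive error if it is nil or empty."
  [system]
  (if (and (map? system) (seq system))
    system
    (throw (ex-info "Integrant system is not running. Call (go) or (reset) in the REPL first."
                    {:system system}))))

;; Intended usage inside the runner:
;; (let [system (require-system! (current-system))] ...)
```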

Execution Flow for run!

1. Read manifest, find case by ID
2. Create legal entity in DB (if not exists)
3. Insert master data (chart of accounts, cost centers, business partners)
4. Read PDF bytes from dev/correctness/pdfs/
5. Call queue-for-ingestion! (uploads to S3, creates document + ingestion, queues to SQS)
6. Poll ingestion status every 5 seconds, timeout after 5 minutes
7. On completion: read structured_data from ingestion table
8. Read golden file from dev/correctness/golden/
9. Run semantic diff
10. Clean up: explicitly delete document + ingestion + master data (see the cleanup strategy below; a transaction rollback would not cover rows written by the asynchronous ingestion workers)
11. Return result map
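
Step 6 (poll every 5 seconds, time out after 5 minutes) could be sketched like this. The name poll-until, the option keys, and the status keywords :completed and :failed are assumptions; the real status values depend on the ingestion table.

```clojure
(defn poll-until
  "Calls (status-fn) repeatedly until it returns a value in terminal-statuses
   or timeout-ms has elapsed. Returns the terminal status, or :timeout.
   Defaults follow the flow above: 5 s interval, 5 min timeout."
  [status-fn {:keys [interval-ms timeout-ms terminal-statuses]
              :or   {interval-ms       5000
                     timeout-ms        300000
                     terminal-statuses #{:completed :failed}}}]
  (let [deadline (+ (System/currentTimeMillis) timeout-ms)]
    (loop []
      (let [status (status-fn)]
        (cond
          (contains? terminal-statuses status)     status
          (>= (System/currentTimeMillis) deadline) :timeout
          :else (do (Thread/sleep (long interval-ms))
                    (recur)))))))
```

Passing the status lookup in as a function keeps the loop testable without a database; the runner would supply a function that reads the ingestion row.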

Cleanup strategy: Each test case creates its own legal entity with a unique name (e.g. "correctness-test-inv-001-standard"). After the test, delete the legal entity (cascade deletes documents, ingestions, etc.). This avoids polluting the dev database.

Execution Flow for run-match-group!

1. Read manifest, find match group
2. Create shared legal entity
3. Insert master data
4. For each case in the group:
   a. Upload PDF, create document, queue ingestion
   b. Record document ID
5. Poll all ingestions until all complete (timeout 5 min)
6. Trigger matching for each document (publish to matching queue)
7. Poll for matching completion (check ap_document_match table)
8. Verify expected edges exist with minimum scores
9. Clean up
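
Step 8 could be expressed as a pure function over the expected and actual edges. This sketch assumes actual edges arrive as maps with :a, :b, and :score keyed by case ID, and that edge direction does not matter; both are assumptions, as is the name verify-edges.

```clojure
(defn verify-edges
  "Checks each expected edge against the actual edges, ignoring direction.
   Returns a vector of failures; an empty vector means every expected edge
   exists with at least its :min-score."
  [expected-edges actual-edges]
  (let [by-pair (into {}
                      (map (fn [{:keys [a b score]}] [#{a b} score]))
                      actual-edges)]
    (vec
     (for [{:keys [a b min-score]} expected-edges
           :let  [score (get by-pair #{a b})]
           :when (or (nil? score) (< score min-score))]
       {:a a :b b :min-score min-score :actual-score score
        :reason (if (nil? score) :edge-missing :score-too-low)}))))
```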

Reporting

Summary Table (printed by run-all!)

Pipeline Correctness Results (2026-04-18 14:23:01)
═══════════════════════════════════════════════════
| Case                    | Type     | Verdict       | Material | Trivial | Time  |
|-------------------------|----------|---------------|----------|---------|-------|
| inv-001-standard        | invoice  | identical     | 0        | 0       | 42s   |
| inv-002-credit-note     | invoice  | material-diff | 2        | 3       | 38s   |
| inv-003-mixed-tax-rates | invoice  | trivial-only  | 0        | 5       | 55s   |

3 cases: 1 identical, 1 trivial-only, 1 material-diff
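
The closing summary line can be computed from the result maps with frequencies. A small sketch, assuming each result carries a :verdict key (the function name summary-line is hypothetical):

```clojure
(require '[clojure.string :as str])

(defn summary-line
  "Builds the one-line summary from a sequence of result maps with :verdict."
  [results]
  (let [counts (frequencies (map :verdict results))]
    (str (count results) " cases: "
         (str/join ", "
                   (for [v [:identical :trivial-only :material-diff]]
                     (str (get counts v 0) " " (name v)))))))
```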

Detailed Diff (printed by show-diff!)

Diff for inv-002-credit-note
════════════════════════════
MATERIAL (2):
  [:invoice-subtype]
    expected: "credit-note"
    actual:   "standard-invoice"

  [:line-items 0 :debit-account :number]
    expected: "4800"
    actual:   "6300"

TRIVIAL (3):
  [:line-items 0 :amount] 100.0 → 100.00 (number-rounding)
  [:issuer :name] "Müller GmbH" → "Müller GmbH " (whitespace)
  [:line-items 1 :cost-center :confidence] 0.85 → 0.78 (confidence-drift)

Initial Test Cases

| ID | Description | What it tests |
|----|-------------|---------------|
| inv-001-standard | Standard German invoice, happy path | Full pipeline baseline: extraction, accounts, cost center, validation |
| inv-002-credit-note | Credit note (Gutschrift) | Invoice subtype classification, negative amounts |
| inv-003-mixed-tax-rates | Invoice with 7% and 19% VAT lines | Tax rate breakdowns, per-line tax validation, tax math |
| inv-004-with-po | Invoice referencing a PO number | PO reference extraction (used in matching test) |
| inv-005-ocr-difficult | Scan with poor quality / unusual layout | Transcription resilience, extraction from noisy text |
| inv-006-wrong-accounts | Past failure: accounts were assigned incorrectly | Account assignment correctness |
| inv-007-supplier-edge | Supplier with tricky VAT ID or name | Supplier matching, VIES verification |
| inv-008-multi-page | Large multi-page invoice (>10 pages) | Chunking/merging in extraction |
| con-001-contract | Contract document | Different structured_data schema, contract extraction |
| po-001-purchase-order | Purchase order | PO extraction, used in match group with inv-004 |

Match Groups

| ID | Documents | Tests |
|----|-----------|-------|
| match-001-invoice-po | inv-004 + po-001 | Candidate retrieval, evidence scoring, auto-match (score ≥ 0.70) |

Verification

After implementation, verify by:

  1. Run (correctness/create-golden! "inv-001-standard") with a known-good PDF — should produce a golden file
  2. Run (correctness/run! "inv-001-standard") — should return :identical (same system, same input)
  3. Temporarily break something (e.g., hardcode a wrong account in the prompt) and re-run — should show :material-diff
  4. Run (correctness/run-all!) — should print the summary table
  5. Run (correctness/update-golden! "inv-001-standard") — should overwrite the golden file
  6. Run (correctness/run-match-group! "match-001-invoice-po") — should verify matching edges