Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Pipeline Correctness Testing Framework — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Build a REPL-driven testing framework that runs real PDFs through the full ingestion pipeline and compares output against curated golden-file snapshots using semantic diff.

Architecture: Two Clojure namespaces in dev/: a semantic diff library (dev.getorcha.correctness.diff) and a runner (dev.getorcha.correctness). Test cases live in dev/correctness/ with committed PDFs and golden EDN files. Tests run against the local dev system via (repl/db-pool) and (repl/aws).

Tech Stack: Clojure, next.jdbc (DB), AWS S3 (LocalStack), Malli (schema), clojure.data (diff foundation), clojure.test (for diff unit tests)

Design spec: docs/superpowers/specs/2026-04-18-pipeline-correctness-testing-design.md

File Structure

dev/
  correctness/
    manifest.edn                            # test case definitions
    pdfs/                                   # test PDFs (committed, ~10MB)
    golden/                                 # expected structured_data EDN files
    master-data/                            # optional chart of accounts, cost centers, etc.
  dev/getorcha/
    correctness.clj                         # runner: run!, run-all!, update-golden!, etc.
    correctness/
      diff.clj                              # semantic diff algorithm

test/com/getorcha/
  correctness/
    diff_test.clj                           # unit tests for the diff algorithm

Key existing files referenced:

repl/com/getorcha/repl.clj — (repl/db-pool), (repl/aws) accessors
src/com/getorcha/app/ingestion.clj — queue-for-ingestion! (line 57)
src/com/getorcha/schema/invoice/structured_data.clj — Malli schema for invoice structured data
src/com/getorcha/schema/common.clj — Issuer, Recipient, etc.
test/com/getorcha/test/notification_helpers.clj — create-legal-entity! pattern (line 13)
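
The manifest format is not spelled out in this section. Judging only from the keys the runner in Task 2 reads (:defaults, :cases, and per case :id, :type, :tags, :pdf, :golden, :legal-entity, :master-data), a manifest could plausibly look like the sketch below. Every value here (file names, tags, company details) is an invented placeholder, and the design spec remains the authority on the real schema.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical manifest contents; every value is a placeholder.
(def manifest-text
  "{:defaults {:legal-entity {:name \"Correctness Test GmbH\"
                              :country \"DE\"}}
    :cases    [{:id          \"inv-001-standard\"
                :type        :invoice
                :tags        #{:extraction :accounts}
                :pdf         \"pdfs/inv-001-standard.pdf\"
                :golden      \"golden/inv-001-standard.edn\"
                :master-data {:chart-of-accounts \"master-data/skr03.edn\"}}]}")

(def manifest (edn/read-string manifest-text))

;; Lookup by :id, in the style of the runner's find-case.
(def found
  (first (filter #(= "inv-001-standard" (:id %)) (:cases manifest))))

;; Tag filter in the style of run-tagged!: keep cases sharing any requested tag.
(def tagged
  (filter #(some #{:accounts} (:tags %)) (:cases manifest)))
```

Paths are written relative to dev/correctness/, which matches how the runner's pdf-path and golden-path helpers prepend the base directory.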

Task 1: Semantic Diff Algorithm

Files:

Create: dev/dev/getorcha/correctness/diff.clj
Create: test/com/getorcha/correctness/diff_test.clj

This is the core of the framework. It compares two structured_data maps and classifies each difference as :trivial or :material. Fields configured as ignored (such as :reasoning) are skipped and produce no diff entry.

Step 1: Write failing tests for the diff function

Create test/com/getorcha/correctness/diff_test.clj:

(ns com.getorcha.correctness.diff-test
  (:require [clojure.test :refer [deftest is testing]]
            [dev.getorcha.correctness.diff :as diff]))


(deftest identical-maps-test
  (testing "identical maps return :identical verdict"
    (let [data {:invoice-number "INV-001"
                :total          100.0
                :issuer         {:name "Test GmbH" :vat-id "DE123456789"}}
          result (diff/compare-structured-data data data)]
      (is (= :identical (:verdict result)))
      (is (empty? (:material result)))
      (is (empty? (:trivial result))))))


(deftest number-rounding-trivial-test
  (testing "numbers within the absolute tolerance compare as equal"
    (let [expected {:total 100.0 :subtotal 84.03}
          actual   {:total 100.004 :subtotal 84.03}
          result   (diff/compare-structured-data expected actual)]
      (is (= :identical (:verdict result))))))


(deftest material-value-diff-test
  (testing "different invoice numbers are material"
    (let [expected {:invoice-number "INV-001" :total 100.0}
          actual   {:invoice-number "INV-002" :total 100.0}
          result   (diff/compare-structured-data expected actual)]
      (is (= :material-diff (:verdict result)))
      (is (= 1 (count (:material result))))
      (is (= [:invoice-number] (:path (first (:material result))))))))


(deftest reasoning-ignored-test
  (testing "reasoning fields are ignored"
    (let [expected {:line-items [{:description "Widget"
                                  :amount      100.0
                                  :debit-account {:number "4800"
                                                  :confidence 0.9
                                                  :reasoning "Because widgets"}}]}
          actual   {:line-items [{:description "Widget"
                                  :amount      100.0
                                  :debit-account {:number "4800"
                                                  :confidence 0.9
                                                  :reasoning "Different reason"}}]}
          result   (diff/compare-structured-data expected actual)]
      (is (= :identical (:verdict result))))))


(deftest confidence-tolerance-test
  (testing "small confidence changes are trivial"
    (let [expected {:line-items [{:description "Widget"
                                  :amount      100.0
                                  :debit-account {:number "4800"
                                                  :confidence 0.90
                                                  :reasoning nil}}]}
          actual   {:line-items [{:description "Widget"
                                  :amount      100.0
                                  :debit-account {:number "4800"
                                                  :confidence 0.82
                                                  :reasoning nil}}]}
          result   (diff/compare-structured-data expected actual)]
      (is (= :trivial-only (:verdict result)))
      (is (= 1 (count (:trivial result))))))

  (testing "large confidence changes are material"
    (let [expected {:line-items [{:description "Widget"
                                  :amount      100.0
                                  :debit-account {:number "4800"
                                                  :confidence 0.90
                                                  :reasoning nil}}]}
          actual   {:line-items [{:description "Widget"
                                  :amount      100.0
                                  :debit-account {:number "4800"
                                                  :confidence 0.50
                                                  :reasoning nil}}]}
          result   (diff/compare-structured-data expected actual)]
      (is (= :material-diff (:verdict result))))))


(deftest nil-vs-absent-test
  (testing "nil value and absent key are equivalent"
    (let [expected {:invoice-number "INV-001" :discount nil}
          actual   {:invoice-number "INV-001"}
          result   (diff/compare-structured-data expected actual)]
      (is (= :identical (:verdict result))))))


(deftest fuzzy-string-test
  (testing "issuer name is compared case-insensitively with whitespace normalization"
    (let [expected {:issuer {:name "Müller  GmbH" :vat-id "DE123"}}
          actual   {:issuer {:name "müller GmbH" :vat-id "DE123"}}
          result   (diff/compare-structured-data expected actual)]
      (is (= :identical (:verdict result))))))


(deftest line-items-sorted-before-compare-test
  (testing "line items are sorted by description before comparing"
    (let [expected {:line-items [{:description "Alpha" :amount 10.0}
                                 {:description "Beta" :amount 20.0}]}
          actual   {:line-items [{:description "Beta" :amount 20.0}
                                 {:description "Alpha" :amount 10.0}]}
          result   (diff/compare-structured-data expected actual)]
      (is (= :identical (:verdict result))))))


(deftest validation-results-status-only-test
  (testing "validation results compare status only, ignore message"
    (let [expected {:validation-results {:line-item-math {:status "pass" :message "All good"}}}
          actual   {:validation-results {:line-item-math {:status "pass" :message "Checks out"}}}
          result   (diff/compare-structured-data expected actual)]
      (is (= :identical (:verdict result)))))

  (testing "validation status change is material"
    (let [expected {:validation-results {:line-item-math {:status "pass"}}}
          actual   {:validation-results {:line-item-math {:status "error" :message "Bad math"}}}
          result   (diff/compare-structured-data expected actual)]
      (is (= :material-diff (:verdict result))))))


(deftest fraud-flags-type-severity-only-test
  (testing "fraud flags compare rule-id, type, and severity; message is ignored"
    (let [expected {:fraud-flags [{:rule-id :ef1-01 :type :bank-account-mismatch
                                   :severity :warning :message "Old msg"}]}
          actual   {:fraud-flags [{:rule-id :ef1-01 :type :bank-account-mismatch
                                   :severity :warning :message "New msg"}]}
          result   (diff/compare-structured-data expected actual)]
      (is (= :identical (:verdict result))))))


(deftest mixed-diffs-test
  (testing "mix of trivial and material diffs"
    (let [expected {:invoice-number "INV-001"
                    :total          100.0
                    :line-items     [{:description   "Widget"
                                      :amount        50.0
                                      :debit-account {:number     "4800"
                                                      :confidence 0.9
                                                      :reasoning  nil}}]}
          actual   {:invoice-number "INV-002"       ;; material
                    :total          100.00           ;; no diff (parses to the same double as 100.0)
                    :line-items     [{:description   "Widget"
                                      :amount        50.0
                                      :debit-account {:number     "4800"
                                                      :confidence 0.82  ;; trivial (tolerance)
                                                      :reasoning  nil}}]}
          result   (diff/compare-structured-data expected actual)]
      (is (= :material-diff (:verdict result)))
      (is (= 1 (count (:material result))))
      (is (= 1 (count (:trivial result)))))))
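
A test that compares 100.0 with 100.00 never reaches the tolerance logic, because the reader parses both literals to the same double. The sketch below shows values that would exercise it, assuming the 0.005 absolute tolerance from the plan's default config. within-tolerance? is a hypothetical stand-in for the private numbers-equal? helper, not the plan's code.

```clojure
(defn within-tolerance?
  "Hypothetical stand-in for the plan's private numbers-equal? helper."
  [a b tolerance]
  (<= (Math/abs (- (double a) (double b))) tolerance))

;; Both literals parse to the same double, so plain equality already holds.
(def same-literal? (= 100.0 100.00))

;; Differs by about 0.004: inside the 0.005 tolerance.
(def near? (within-tolerance? 100.0 100.004 0.005))

;; Differs by about 0.01: outside the tolerance.
(def far? (within-tolerance? 100.0 100.01 0.005))
```

Picking values well inside or well outside the tolerance avoids flaky assertions caused by binary floating-point representation at the exact boundary.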

Step 2: Run tests to verify they fail

cd orcha && clj -X:test:silent :nses '[com.getorcha.correctness.diff-test]'

Expected: compilation error — dev.getorcha.correctness.diff namespace not found.

Step 3: Implement the diff algorithm

Create dev/dev/getorcha/correctness/diff.clj:

(ns dev.getorcha.correctness.diff
  "Semantic diff for structured_data maps.

   Compares two structured_data maps field-by-field, classifying
   differences as :material or :trivial based on field type and
   configurable tolerances. Ignored keys produce no diff entry."
  (:require [clojure.string :as str]))


(def ^:private default-config
  "Diff configuration. Controls how fields are compared."
  {;; Absolute tolerance for numeric comparison (amounts, quantities)
   :number-tolerance     0.005

   ;; Absolute tolerance for confidence scores
   :confidence-tolerance 0.15

   ;; Keys whose values are ignored entirely
   :ignored-keys         #{:reasoning :match-reasoning :suggestion}

   ;; Keys compared case-insensitively with whitespace normalization
   :fuzzy-string-keys    #{:name :address}

   ;; For vectors of maps: which key to sort by before comparing
   :sort-keys            {:line-items             :description
                          :fraud-flags            :type
                          :tax-issues             :type
                          :compliance-statements  :type
                          :prepayments            :description
                          :tax-rate-breakdowns    :rate
                          :breakdown-items        :description}

   ;; Sub-maps where only specific keys matter
   :partial-compare-keys {:validation-results #{:status}
                          :fraud-flags        #{:rule-id :type :severity}}})


(defn ^:private normalize-string
  "Normalize string for fuzzy comparison: lowercase, collapse whitespace, trim."
  [s]
  (when s
    (-> s str/trim (str/replace #"\s+" " ") str/lower-case)))


(defn ^:private numbers-equal?
  "Compare two numbers with tolerance."
  [a b tolerance]
  (and (number? a) (number? b)
       (<= (abs (- (double a) (double b))) tolerance)))


(defn ^:private nil-equivalent?
  "True if both values are nil-equivalent (nil, absent, empty string)."
  [a b]
  (let [nil-ish? #(or (nil? %) (and (string? %) (str/blank? %)))]
    (and (nil-ish? a) (nil-ish? b))))


(defn ^:private sort-vector-of-maps
  "Sort a vector of maps by the given key for stable comparison."
  [v sort-key]
  (if (and (vector? v) (every? map? v) sort-key)
    (vec (sort-by #(str (get % sort-key "")) v))
    v))


(defn ^:private diff-values
  "Compare two values at a given path. Returns nil (equal), or a diff entry."
  [path expected actual config]
  (let [last-key     (last path)
        parent-key   (last (butlast path))]
    (cond
      ;; Both nil-equivalent
      (nil-equivalent? expected actual)
      nil

      ;; Ignored key
      (contains? (:ignored-keys config) last-key)
      nil

      ;; Partial-compare: only check specific sub-keys
      (and (map? expected) (map? actual)
           (contains? (:partial-compare-keys config) parent-key))
      (let [keys-to-check (get (:partial-compare-keys config) parent-key)]
        (when-not (= (select-keys expected keys-to-check)
                     (select-keys actual keys-to-check))
          {:path     path
           :expected (select-keys expected keys-to-check)
           :actual   (select-keys actual keys-to-check)
           :category :material
           :reason   :value-mismatch}))

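      ;; Confidence score: near-identical values are equal, drift within
      ;; :confidence-tolerance is trivial, anything larger is material.
      ;; (Previously drift within the tolerance returned nil, so the
      ;; "small confidence changes are trivial" tests could not pass.)
      (= :confidence last-key)
      (cond
        (numbers-equal? expected actual (:number-tolerance config)) nil
        (numbers-equal? expected actual (:confidence-tolerance config))
        {:path path :expected expected :actual actual
         :category :trivial :reason :confidence-drift}
        :else
        {:path path :expected expected :actual actual
         :category :material :reason :confidence-drift-large})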

      ;; Numbers
      (and (number? expected) (number? actual))
      (when-not (numbers-equal? expected actual (:number-tolerance config))
        {:path path :expected expected :actual actual
         :category :material :reason :value-mismatch})

      ;; Fuzzy strings
      (and (string? expected) (string? actual)
           (contains? (:fuzzy-string-keys config) last-key))
      (when-not (= (normalize-string expected) (normalize-string actual))
        {:path path :expected expected :actual actual
         :category :material :reason :value-mismatch})

      ;; Exact comparison for everything else
      :else
      (when-not (= expected actual)
        {:path path :expected expected :actual actual
         :category :material :reason :value-mismatch}))))


(defn ^:private diff-maps
  "Recursively diff two maps. Returns vector of diff entries."
  [path expected actual config]
  (let [all-keys (distinct (concat (keys expected) (keys actual)))]
    (reduce
     (fn [diffs k]
       (let [child-path (conj path k)
             ev         (get expected k)
             av         (get actual k)]
         (cond
           ;; Both nil-equivalent
           (nil-equivalent? ev av)
           diffs

           ;; Ignored key
           (contains? (:ignored-keys config) k)
           diffs

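           ;; Map directly under a partial-compare key (e.g. each check in
           ;; :validation-results): compare only the configured sub-keys.
           ;; Without this branch the recursion below descends into the map
           ;; and compares :message exactly, so diff-values never sees it.
           (and (map? ev) (map? av)
                (contains? (:partial-compare-keys config) (last path)))
           (if-let [d (diff-values child-path ev av config)]
             (conj diffs d)
             diffs)

           ;; Both maps: recurse
           (and (map? ev) (map? av))
           (into diffs (diff-maps child-path ev av config))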

           ;; Both vectors — sort and compare element-by-element
           (and (vector? ev) (vector? av))
           (let [sort-key (get (:sort-keys config) k)
                 ;; Check if elements should be partially compared
                 partial-keys (get (:partial-compare-keys config) k)
                 sv       (sort-vector-of-maps ev sort-key)
                 sa       (sort-vector-of-maps av sort-key)
                 max-len  (max (count sv) (count sa))]
             (reduce
              (fn [diffs i]
                (let [ei (get sv i)
                      ai (get sa i)]
                  (cond
                    (nil? ei)
                    (conj diffs {:path (conj child-path i) :expected nil :actual ai
                                 :category :material :reason :extra-item})
                    (nil? ai)
                    (conj diffs {:path (conj child-path i) :expected ei :actual nil
                                 :category :material :reason :missing-item})
                    (and (map? ei) (map? ai))
                    (if partial-keys
                      ;; Partial compare for this vector's elements
                      (if (= (select-keys ei partial-keys)
                             (select-keys ai partial-keys))
                        diffs
                        (conj diffs {:path     (conj child-path i)
                                     :expected (select-keys ei partial-keys)
                                     :actual   (select-keys ai partial-keys)
                                     :category :material
                                     :reason   :value-mismatch}))
                      (into diffs (diff-maps (conj child-path i) ei ai config)))
                    :else
                    (if-let [d (diff-values (conj child-path i) ei ai config)]
                      (conj diffs d)
                      diffs))))
              diffs
              (range max-len)))

           ;; Leaf values
           :else
           (if-let [d (diff-values child-path ev av config)]
             (conj diffs d)
             diffs))))
     []
     all-keys)))


(defn compare-structured-data
  "Compare expected and actual structured_data maps.

   Returns:
   {:verdict  :identical | :trivial-only | :material-diff
    :material [{:path [...] :expected v :actual v :category :material :reason kw}]
    :trivial  [{:path [...] :expected v :actual v :category :trivial :reason kw}]}

   Ignored keys are skipped and are not reported."
  ([expected actual]
   (compare-structured-data expected actual default-config))
  ([expected actual config]
   (let [config    (merge default-config config)
         all-diffs (diff-maps [] expected actual config)
         material  (filterv #(= :material (:category %)) all-diffs)
         trivial   (filterv #(= :trivial (:category %)) all-diffs)]
     {:verdict  (cond
                  (seq material)        :material-diff
                  (seq trivial)         :trivial-only
                  :else                 :identical)
      :material material
      :trivial  trivial})))
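
One consequence of the three-arity form is worth noting: it applies caller config with merge, which is shallow, so passing a nested map such as :sort-keys replaces the default map wholesale rather than extending it. The sketch below illustrates this with trimmed-down defaults. deep-merge is a hypothetical helper shown only for contrast and is not part of the plan.

```clojure
;; Trimmed-down copy of the defaults, for illustration only.
(def defaults
  {:number-tolerance 0.005
   :sort-keys        {:line-items  :description
                      :fraud-flags :type}})

(def overrides {:sort-keys {:line-items :amount}})

;; Shallow merge: the whole :sort-keys map is replaced, :fraud-flags is lost.
(def shallow (merge defaults overrides))

(defn deep-merge
  "Hypothetical helper: merges nested maps recursively; for non-map
   values the value from b wins."
  [a b]
  (merge-with (fn [x y]
                (if (and (map? x) (map? y))
                  (deep-merge x y)
                  y))
              a b))

;; Deep merge: only :line-items changes, :fraud-flags is preserved.
(def deep (deep-merge defaults overrides))
```

With the plan's code as written, a caller who overrides one sort key should pass the full :sort-keys map to keep the others.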

Step 4: Run tests to verify they pass

cd orcha && clj -X:test:silent :nses '[com.getorcha.correctness.diff-test]'

Expected: all tests pass.

Step 5: Commit

cd orcha && git add dev/dev/getorcha/correctness/diff.clj test/com/getorcha/correctness/diff_test.clj && git commit -m "feat: add semantic diff algorithm for pipeline correctness testing"

Task 2: Correctness Runner — Core Functions

Files:

Create: dev/dev/getorcha/correctness.clj

The runner namespace provides the REPL API: run!, run-all!, run-tagged!, update-golden!, create-golden!, show-diff!.
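
Because ingestion is asynchronous, the runner has to wait for a terminal status before it can diff anything. The deadline-based polling pattern can be sketched in isolation as follows. poll-until and fake-status-source are hypothetical names, and the fake source stands in for the database query, so this is an illustration of the pattern rather than the runner's actual code.

```clojure
(defn poll-until
  "Calls (fetch-status) repeatedly until terminal? accepts the result.
   Throws ex-info once timeout-ms has elapsed. Hypothetical helper."
  [fetch-status {:keys [timeout-ms interval-ms terminal?]}]
  (let [deadline (+ (System/currentTimeMillis) timeout-ms)]
    (loop []
      (let [status (fetch-status)]
        (cond
          (terminal? status)
          status

          (> (System/currentTimeMillis) deadline)
          (throw (ex-info "Timed out" {:last-status status}))

          :else
          (do (Thread/sleep (long interval-ms))
              (recur)))))))

(defn fake-status-source
  "Returns a zero-arg fn that yields the given statuses one per call."
  [statuses]
  (let [remaining (atom statuses)]
    (fn []
      (let [s (first @remaining)]
        (swap! remaining rest)
        s))))

;; Two in-progress polls, then a terminal status.
(def result
  (poll-until (fake-status-source ["processing" "processing" "completed"])
              {:timeout-ms  1000
               :interval-ms 10
               :terminal?   #{"completed" "failed" "skipped"}}))
```

Injecting the status function keeps the loop testable without a database, which is why the sketch takes fetch-status as an argument.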

Step 1: Create the runner namespace with manifest loading and system access

Create dev/dev/getorcha/correctness.clj:

(ns dev.getorcha.correctness
  "Pipeline correctness testing framework.

   Runs test cases against the local dev system and compares output
   to golden-file snapshots using semantic diff.

   Usage from REPL:
     (correctness/run! \"inv-001-standard\")
     (correctness/run-all!)
     (correctness/run-tagged! :extraction :accounts)
     (correctness/create-golden! \"inv-001-standard\")
     (correctness/update-golden! \"inv-001-standard\")"
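  ;; This namespace defines its own run!, so exclude clojure.core/run!
  ;; to avoid the "already refers to" replacement warning on load.
  (:refer-clojure :exclude [run!])
  (:require [cheshire.core :as json]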
            [clojure.java.io :as io]
            [clojure.pprint :as pprint]
            [clojure.string :as str]
            [clojure.tools.reader.edn :as edn]
            [com.getorcha.app.ingestion :as app.ingestion]
            [com.getorcha.aws :as aws]
            [com.getorcha.db.sql :as db.sql]
            [com.getorcha.repl :as repl]
            [dev.getorcha.correctness.diff :as diff])
  (:import (java.io PushbackReader)
           (java.nio.file Files Path Paths)
           (java.time Duration Instant)
           (java.util UUID)))


;; ---------------------------------------------------------------------------
;; Paths
;; ---------------------------------------------------------------------------

(def ^:private base-dir
  "Base directory for correctness test data."
  "dev/correctness")


(defn ^:private manifest-path [] (str base-dir "/manifest.edn"))
(defn ^:private pdf-path [relative] (str base-dir "/" relative))
(defn ^:private golden-path [relative] (str base-dir "/" relative))


;; ---------------------------------------------------------------------------
;; Manifest
;; ---------------------------------------------------------------------------

(defn ^:private load-manifest
  "Load and parse the manifest.edn file."
  []
  (let [f (io/file (manifest-path))]
    (when-not (.exists f)
      (throw (ex-info "Manifest not found" {:path (manifest-path)})))
    (with-open [r (PushbackReader. (io/reader f))]
      (edn/read r))))


(defn ^:private find-case
  "Find a test case by ID in the manifest."
  [manifest case-id]
  (let [cases (:cases manifest)]
    (or (first (filter #(= case-id (:id %)) cases))
        (throw (ex-info (str "Test case not found: " case-id)
                        {:case-id case-id
                         :available (mapv :id cases)})))))


;; ---------------------------------------------------------------------------
;; System access
;; ---------------------------------------------------------------------------

(defn ^:private ensure-system!
  "Verify the Integrant system is running. Throws if not."
  []
  (try
    (repl/db-pool)
    (catch Throwable _
      (throw (ex-info "System not running. Call (go) or (reset) first." {})))))


;; ---------------------------------------------------------------------------
;; Test data setup and teardown
;; ---------------------------------------------------------------------------

(defn ^:private setup-legal-entity!
  "Create a tenant + legal entity for the test case. Returns legal-entity-id."
  [db-pool case-config defaults]
  (let [le-config   (merge (:legal-entity defaults) (:legal-entity case-config))
        tenant-id   (random-uuid)
        le-id       (random-uuid)
        slug        (str "correctness-" (:id case-config))]
    (db.sql/execute-one!
     db-pool
     {:insert-into :tenant
      :values      [{:id tenant-id :name "Correctness Test" :slug slug}]})
    (db.sql/execute-one!
     db-pool
     {:insert-into :legal-entity
      :values      [{:id              le-id
                     :name            (or (:name le-config) "Correctness Test GmbH")
                     :company-address (:address le-config)
                     :company-vat-id  (:vat-id le-config)
                     :company-tax-id  (:tax-id le-config)
                     :company-country (:country le-config)
                     :tenant-id       tenant-id}]})
    le-id))


(defn ^:private setup-identity!
  "Create a test identity for uploaded-by. Returns identity-id."
  [db-pool]
  (let [id (random-uuid)]
    (db.sql/execute-one!
     db-pool
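     {:insert-into :identity
      ;; Email is made unique per run: teardown! does not delete identities,
      ;; so a fixed address would collide on a second run if the identity
      ;; table enforces unique emails (assumed here, not verified).
      :values      [{:id    id
                     :email (str "correctness-test+" id "@getorcha.com")}]})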
    id))


(defn ^:private setup-master-data!
  "Insert master data (chart of accounts, cost centers, business partners)
   for the legal entity if configured."
  [db-pool le-id case-config defaults]
  (let [md (merge (:master-data defaults) (:master-data case-config))]
    (doseq [[table-key table-name active-field] [[:chart-of-accounts  :gl-accounts-dataset       :is-active]
                                                 [:cost-centers       :cost-center-dataset       :position]
                                                 [:business-partners  :business-partner-dataset  :is-active]]]
      (when-let [path (get md table-key)]
        (let [f (io/file (str base-dir "/" path))]
          (when (.exists f)
            (let [data (with-open [r (PushbackReader. (io/reader f))]
                         (edn/read r))
                  row  (cond-> {:legal-entity-id le-id
                                :data            [:lift (json/generate-string data)]}
                         ;; gl_accounts_dataset and business_partner_dataset use is_active
                         ;; cost_center_dataset uses position (integer) for active ordering
                         (= active-field :is-active) (assoc :is-active true)
                         (= active-field :position)  (assoc :position 0 :headers [:lift "[]"]))]
              (db.sql/execute-one!
               db-pool
               {:insert-into table-name
                :values      [row]}))))))))


(defn ^:private teardown!
  "Delete all test data by removing the tenant (cascades to legal entity,
   documents, ingestions, master data, etc.)."
  [db-pool tenant-slug]
  (db.sql/execute-one!
   db-pool
   {:delete-from :tenant
    :where       [:= :slug tenant-slug]}))


;; ---------------------------------------------------------------------------
;; Pipeline execution
;; ---------------------------------------------------------------------------

(defn ^:private ingest-pdf!
  "Upload PDF and queue for ingestion. Returns {:ingestion-id UUID :document-id UUID}."
  [db-pool aws-config le-id identity-id case-config]
  (let [pdf-file    (io/file (pdf-path (:pdf case-config)))
        _           (when-not (.exists pdf-file)
                      (throw (ex-info (str "PDF not found: " (:pdf case-config))
                                      {:path (pdf-path (:pdf case-config))})))
        content     (Files/readAllBytes (.toPath pdf-file))
        result      (app.ingestion/queue-for-ingestion!
                     db-pool aws-config
                     {:content          content
                      :content-type     "application/pdf"
                      :legal-entity-id  le-id
                      :uploaded-by      identity-id
                      :file-original-name (.getName pdf-file)})]
    (when (:skipped? result)
      (throw (ex-info "Document already has in-progress ingestion" result)))
    {:ingestion-id (:ap-ingestion/id result)
     :document-id  (:document/id result)}))


(defn ^:private poll-ingestion!
  "Poll for ingestion completion. Returns structured-data or throws on failure/timeout."
  [db-pool ingestion-id & {:keys [timeout-seconds poll-interval-seconds]
                            :or   {timeout-seconds        300
                                   poll-interval-seconds   5}}]
  (let [deadline (Instant/ofEpochMilli (+ (System/currentTimeMillis) (* timeout-seconds 1000)))]
    (loop []
      (let [{:ap-ingestion/keys [status structured-data error-message]}
            (db.sql/execute-one!
             db-pool
             {:select [:status :structured-data :error-message]
              :from   [:ap-ingestion]
              :where  [:= :id ingestion-id]})]
        (case (str status)
          "completed" (or structured-data
                          (throw (ex-info "Ingestion completed but no structured_data"
                                         {:ingestion-id ingestion-id})))
          "failed"    (throw (ex-info (str "Ingestion failed: " error-message)
                                      {:ingestion-id ingestion-id}))
          "skipped"   (throw (ex-info "Ingestion was skipped"
                                      {:ingestion-id ingestion-id}))
          ;; Still in progress
          (if (.isAfter (Instant/now) deadline)
            (throw (ex-info "Ingestion timed out"
                            {:ingestion-id ingestion-id
                             :timeout-seconds timeout-seconds}))
            (do (Thread/sleep (* poll-interval-seconds 1000))
                (recur))))))))


;; ---------------------------------------------------------------------------
;; Golden file I/O
;; ---------------------------------------------------------------------------

(defn ^:private read-golden
  "Read a golden file. Returns the structured_data map."
  [case-config]
  (let [f (io/file (golden-path (:golden case-config)))]
    (when-not (.exists f)
      (throw (ex-info (str "Golden file not found: " (:golden case-config)
                           "\nRun (create-golden! \"" (:id case-config) "\") first.")
                      {:path (golden-path (:golden case-config))})))
    (with-open [r (PushbackReader. (io/reader f))]
      (edn/read r))))


(defn ^:private write-golden!
  "Write structured_data as a golden file."
  [case-config data]
  (let [f (io/file (golden-path (:golden case-config)))]
    (io/make-parents f)
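    ;; Unbind REPL print limits so large structures are never truncated
    ;; with "..." or "#", which would leave an unreadable golden file.
    (binding [*print-length* nil
              *print-level*  nil]
      (spit f (with-out-str (pprint/pprint data))))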
    (println "Golden file written:" (.getPath f))))


;; ---------------------------------------------------------------------------
;; Reporting
;; ---------------------------------------------------------------------------

(defn ^:private format-path
  "Format a diff path for display."
  [path]
  (str/join " > " (map #(if (number? %) (str "[" % "]") (name %)) path)))


(defn show-diff!
  "Print a detailed diff for a result map as returned by run!."
  [result]
  (let [{:keys [material trivial]} result]
    (when (seq material)
      (println "\nMATERIAL" (str "(" (count material) "):"))
      (doseq [{:keys [path expected actual reason]} material]
        (println (str "  " (format-path path)))
        (println (str "    expected: " (pr-str expected)))
        (println (str "    actual:   " (pr-str actual)))
        (when (not= reason :value-mismatch)
          (println (str "    reason:   " (name reason))))))
    (when (seq trivial)
      (println "\nTRIVIAL" (str "(" (count trivial) "):"))
      (doseq [{:keys [path expected actual reason]} trivial]
        (println (str "  " (format-path path) "  "
                      (pr-str expected) " -> " (pr-str actual)
                      " (" (name reason) ")"))))))


(defn ^:private print-summary-table
  "Print a summary table of test results."
  [results]
  (println)
  (println "Pipeline Correctness Results")
  ;; Rule width (85) matches the row width produced by the format strings.
  (println (str/join "" (repeat 85 "=")))
  (printf "| %-25s | %-8s | %-13s | %8s | %7s | %5s |\n"
          "Case" "Type" "Verdict" "Material" "Trivial" "Time")
  (println (str/join "" (repeat 85 "-")))
  (doseq [{:keys [case-id type verdict material trivial elapsed-ms]} results]
    (printf "| %-25s | %-8s | %-13s | %8d | %7d | %4.0fs |\n"
            (subs case-id 0 (min 25 (count case-id)))
            (or type "?")
            (name verdict)
            (count material)
            (count trivial)
            (/ (double (or elapsed-ms 0)) 1000.0)))
  (println (str/join "" (repeat 85 "=")))
  (let [freqs (frequencies (map :verdict results))]
    (printf "%d cases: %s\n"
            (count results)
            (str/join ", " (for [[v c] (sort-by key freqs)]
                             (str c " " (name v)))))))


;; ---------------------------------------------------------------------------
;; Public API
;; ---------------------------------------------------------------------------

(defn run!
  "Run a single test case. Returns result map."
  [case-id]
  (ensure-system!)
  (let [manifest     (load-manifest)
        case-config  (find-case manifest case-id)
        defaults     (:defaults manifest)
        db-pool      (repl/db-pool)
        aws-config   (repl/aws)
        tenant-slug  (str "correctness-" case-id)
        start        (System/currentTimeMillis)]
    (try
      ;; Setup
      (let [le-id       (setup-legal-entity! db-pool case-config defaults)
            identity-id (setup-identity! db-pool)]
        (setup-master-data! db-pool le-id case-config defaults)
        (try
          ;; Execute pipeline
          (let [{:keys [ingestion-id]} (ingest-pdf! db-pool aws-config le-id identity-id case-config)
                _              (println (str "Ingesting " case-id " (ingestion " ingestion-id ")..."))
                actual         (poll-ingestion! db-pool ingestion-id)
                golden         (read-golden case-config)
                diff-result    (diff/compare-structured-data golden actual)
                elapsed        (- (System/currentTimeMillis) start)
                result         (merge diff-result
                                      {:case-id    case-id
                                       :type       (:type case-config)
                                       :elapsed-ms elapsed
                                       :actual     actual})]
            (println (str case-id ": " (name (:verdict result))
                          " (" (count (:material result)) " material, "
                          (count (:trivial result)) " trivial)"))
            (when (= :material-diff (:verdict result))
              (show-diff! result))
            result)
          (finally
            ;; Teardown
            (teardown! db-pool tenant-slug))))
      (catch Throwable t
        ;; Ensure cleanup even on setup failure
        (try (teardown! db-pool tenant-slug) (catch Throwable _))
        (let [elapsed (- (System/currentTimeMillis) start)]
          (println (str case-id ": ERROR - " (ex-message t)))
          {:case-id    case-id
           :type       (:type case-config)
           :verdict    :error
           :error      (ex-message t)
           :material   []
           :trivial    []
           :elapsed-ms elapsed})))))


(defn run-all!
  "Run all test cases. Prints summary table."
  []
  (ensure-system!)
  (let [manifest (load-manifest)
        results  (mapv #(run! (:id %)) (:cases manifest))]
    (print-summary-table results)
    results))


(defn run-tagged!
  "Run all test cases matching any of the given tags."
  [& tags]
  (ensure-system!)
  (let [manifest (load-manifest)
        tag-set  (set tags)
        cases    (filter #(some tag-set (:tags %)) (:cases manifest))
        results  (mapv #(run! (:id %)) cases)]
    (print-summary-table results)
    results))


(defn create-golden!
  "Run the pipeline and save output as golden file (first-time setup)."
  [case-id]
  (ensure-system!)
  (let [manifest     (load-manifest)
        case-config  (find-case manifest case-id)
        defaults     (:defaults manifest)
        db-pool      (repl/db-pool)
        aws-config   (repl/aws)
        tenant-slug  (str "correctness-" case-id)]
    (try
      (let [le-id       (setup-legal-entity! db-pool case-config defaults)
            identity-id (setup-identity! db-pool)]
        (setup-master-data! db-pool le-id case-config defaults)
        (try
          (let [{:keys [ingestion-id]} (ingest-pdf! db-pool aws-config le-id identity-id case-config)
                _       (println (str "Ingesting " case-id " for golden file capture..."))
                actual  (poll-ingestion! db-pool ingestion-id)]
            (write-golden! case-config actual)
            actual)
          (finally
            (teardown! db-pool tenant-slug))))
      (catch Throwable t
        (try (teardown! db-pool tenant-slug) (catch Throwable _))
        (throw t)))))


(defn update-golden!
  "Run the pipeline and overwrite the golden file with current output."
  [case-id]
  (create-golden! case-id))
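
Because run-all! and run-tagged! return plain result maps, a run can be gated from the REPL or a script without parsing the printed table. A minimal sketch, assuming the four verdicts used in this plan (:identical, :trivial-only, :material-diff, :error); passing-verdicts, failing-cases and suite-passed? are hypothetical helpers and not part of the runner above:

```clojure
(def passing-verdicts
  "Verdicts counted as a pass. Treating :trivial-only as a pass is an assumption."
  #{:identical :trivial-only})

(defn failing-cases
  "Returns [case-id verdict] pairs for every result that is not a pass."
  [results]
  (into []
        (comp (remove #(contains? passing-verdicts (:verdict %)))
              (map (juxt :case-id :verdict)))
        results))

(defn suite-passed?
  "True when no result carries a failing verdict."
  [results]
  (empty? (failing-cases results)))
```

For example, (suite-passed? (correctness/run-all!)) would return false as soon as one case reports :material-diff or :error.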

Step 2: Run lint to verify no issues

cd orcha && clj-kondo --lint dev/dev/getorcha/correctness.clj dev/dev/getorcha/correctness/diff.clj

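Expected: no errors or warnings. Note that run! shadows clojure.core/run!. If the ns form of dev.getorcha.correctness does not already contain (:refer-clojure :exclude [run!]), add it; otherwise the compiler prints a replacement warning on load and clj-kondo may flag the redefinition.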

Step 3: Commit

cd orcha && git add dev/dev/getorcha/correctness.clj && git commit -m "feat: add correctness test runner with REPL API"

Task 3: Manifest and Directory Structure

Files:

Create: dev/correctness/manifest.edn
Create: dev/correctness/pdfs/.gitkeep
Create: dev/correctness/golden/.gitkeep
Create: dev/correctness/master-data/.gitkeep

Step 1: Create the manifest with placeholder cases

Create dev/correctness/manifest.edn:

{:defaults
 {:legal-entity {:name    "Correctness Test GmbH"
                 :country "DE"
                 :vat-id  "DE123456789"
                 :tax-id  "123/456/78901"
                 :address "Musterstraße 1, 12345 Berlin"}}

 :cases
 [;; Add test cases here. Example:
  ;; {:id          "inv-001-standard"
  ;;  :description "Standard German invoice, happy path"
  ;;  :pdf         "pdfs/inv-001-standard.pdf"
  ;;  :golden      "golden/inv-001-standard.edn"
  ;;  :type        "invoice"
  ;;  :tags        #{:extraction :accounts :cost-center :validation}}
  ]

 :match-groups
 [;; Add match groups here. Example:
  ;; {:id             "match-001-invoice-po"
  ;;  :description    "Invoice with PO reference"
  ;;  :cases          ["inv-004-with-po" "po-001-purchase-order"]
  ;;  :expected-edges [{:a "inv-004-with-po" :b "po-001-purchase-order" :min-score 0.7}]}
  ]}
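
A typo in the manifest would otherwise only surface in the middle of a pipeline run. The structure above can be checked up front with a small pure function. This is a sketch under assumptions: manifest-problems is a hypothetical helper, and the set of required case keys is inferred from the example entry rather than taken from the runner:

```clojure
(require '[clojure.set :as set])

(defn manifest-problems
  "Returns a vector of human-readable problems; empty when the manifest looks consistent."
  [{:keys [cases match-groups]}]
  (let [required  #{:id :pdf :golden :type} ; assumed minimum, see lead-in
        ids       (map :id cases)
        known-ids (set ids)]
    (vec
     (concat
      ;; Every case needs the keys listed in `required`.
      (for [c cases
            :let [missing (set/difference required (set (keys c)))]
            :when (seq missing)]
        (str "case " (pr-str (:id c)) " is missing " (vec (sort missing))))
      ;; Case ids are used to build the tenant slug, so they must be unique.
      (for [[id n] (frequencies ids)
            :when (> n 1)]
        (str "duplicate case id " (pr-str id)))
      ;; Match groups may only reference cases that exist.
      (for [g match-groups
            id (:cases g)
            :when (not (contains? known-ids id))]
        (str "match group " (pr-str (:id g)) " references unknown case " (pr-str id)))))))
```

Calling it on the parsed manifest map and refusing to run when the result is non-empty keeps configuration errors separate from pipeline errors.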

Step 2: Create directory structure with .gitkeep files

cd orcha && mkdir -p dev/correctness/pdfs dev/correctness/golden dev/correctness/master-data && touch dev/correctness/pdfs/.gitkeep dev/correctness/golden/.gitkeep dev/correctness/master-data/.gitkeep

Step 3: Add a .gitattributes entry to treat PDFs as binary

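Check whether .gitattributes exists in the orcha directory (the one that contains dev/); create it if it does not. Git resolves attribute patterns relative to the directory holding the .gitattributes file, so the file has to sit next to dev/ for the pattern below to match. Add:

dev/correctness/pdfs/*.pdf binary

This prevents git from attempting text diffs, merges, or line-ending conversion on the PDFs.

Step 4: Commit

cd orcha && git add dev/correctness/ .gitattributes && git commit -m "feat: add correctness test directory structure and manifest"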

Task 4: Smoke Test — End-to-End Verification

Files:

No new files — this task uses the REPL to verify the framework works.

Before adding real test PDFs, verify the full flow works with any available PDF.

Step 1: Evaluate the framework in the REPL

Load the namespace and verify it compiles:

(require '[dev.getorcha.correctness :as correctness] :reload-all)

Expected: no errors.

Step 2: Add a temporary test case to the manifest

Pick any PDF available locally (from dev snapshots or Downloads). Add a test case to manifest.edn:

{:id          "smoke-test"
 :description "Temporary smoke test"
 :pdf         "pdfs/smoke-test.pdf"
 :golden      "golden/smoke-test.edn"
 :type        "invoice"
 :tags        #{:smoke}}

Copy the PDF to dev/correctness/pdfs/smoke-test.pdf.

Step 3: Create the golden file

(correctness/create-golden! "smoke-test")

Expected: prints "Ingesting smoke-test for golden file capture...", waits for completion, writes golden file, prints path, and returns the captured structured data.

Step 4: Run the test

(correctness/run! "smoke-test")

Expected: prints verdict (likely :identical or :trivial-only since it was just captured). If :material-diff, the diff is printed. This verifies the full loop: setup -> ingest -> poll -> diff -> teardown.

Step 5: Run all tests

(correctness/run-all!)

Expected: prints summary table with the smoke test result.

Step 6: Clean up smoke test (optional)

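If the smoke test is not needed as a permanent test case, remove smoke-test.pdf and smoke-test.edn from the correctness directories and delete the smoke-test entry from manifest.edn. Leaving the entry behind would make run-all! report an error for the missing PDF.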

Task 5: Add Real Test Cases

Files:

Modify: dev/correctness/manifest.edn (add cases)
Add: PDFs to dev/correctness/pdfs/

This task is about curating the 5-10 test cases. The user needs to identify PDFs for each category and create golden files.

Step 1: Identify test PDFs

The user selects PDFs for these categories (from Downloads, dev snapshots, or production):

ID	Description	Source
`inv-001-standard`	Standard German invoice, happy path	A known-good invoice
`inv-002-credit-note`	Credit note (Gutschrift)	Past failure case
`inv-003-mixed-tax-rates`	Invoice with 7% and 19% VAT	Past failure case
`inv-004-with-po`	Invoice referencing a PO	For matching test
`inv-005-ocr-difficult`	Scan with poor quality	Past failure case
`inv-006-wrong-accounts`	Past account assignment failure	Past failure case

Step 2: Copy PDFs to the correctness directory

For each PDF, copy it:

cp /path/to/invoice.pdf orcha/dev/correctness/pdfs/inv-001-standard.pdf

Step 3: Add cases to manifest

Add each case to dev/correctness/manifest.edn following the format in the existing comments.
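
Before creating golden files, it can help to confirm that every case added to the manifest points at a PDF that was actually copied. A minimal sketch, assuming :pdf paths are relative to dev/correctness as in the manifest examples; missing-pdfs is a hypothetical helper, not part of the runner:

```clojure
(require '[clojure.java.io :as io])

(defn missing-pdfs
  "Returns the ids of cases whose :pdf file does not exist under base-dir."
  [base-dir cases]
  (into []
        (comp (remove #(.exists (io/file base-dir (:pdf %))))
              (map :id))
        cases))
```

Evaluating (missing-pdfs "dev/correctness" (:cases manifest)) from the orcha directory should return an empty vector once Step 2 and Step 3 agree.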

Step 4: Create golden files for each case

For each case:

(correctness/create-golden! "inv-001-standard")

Review the golden file output to verify it looks correct (this becomes the ground truth).

Step 5: Run the full suite

(correctness/run-all!)

Expected: summary table showing all cases. Most should be :identical or :trivial-only since golden files were just captured.

Step 6: Commit all test data

cd orcha && git add dev/correctness/ && git commit -m "feat: add initial correctness test cases and golden files"

Task 6: Match Group Support (Optional / Future)

Files:

Modify: dev/dev/getorcha/correctness.clj (add run-match-group!)

This task adds support for testing four-way matching. It requires at least two related documents (e.g., invoice + PO). Implement after the core framework is working.

Step 1: Add run-match-group! to the runner

Add to dev/dev/getorcha/correctness.clj:

(defn run-match-group!
  "Run a match group: ingest all documents, trigger matching, verify edges."
  [group-id]
  (ensure-system!)
  (let [manifest     (load-manifest)
        group        (or (first (filter #(= group-id (:id %)) (:match-groups manifest)))
                         (throw (ex-info (str "Match group not found: " group-id)
                                         {:group-id group-id})))
        defaults     (:defaults manifest)
        db-pool      (repl/db-pool)
        aws-config   (repl/aws)
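        ;; NOTE: teardown! is keyed on this slug. Confirm it matches the slug that
        ;; setup-legal-entity! creates for (first case-configs); run! assumes that
        ;; slug is "correctness-<case-id>".
        tenant-slug  (str "correctness-match-" group-id)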
        case-configs (mapv #(find-case manifest %) (:cases group))]
    (try
      (let [le-id       (setup-legal-entity! db-pool (first case-configs) defaults)
            identity-id (setup-identity! db-pool)]
        (setup-master-data! db-pool le-id (first case-configs) defaults)
        (try
          ;; Ingest all documents in the group
          (let [ingestion-results
                (mapv (fn [case-config]
                        (let [result (ingest-pdf! db-pool aws-config le-id identity-id case-config)]
                          (println (str "Queued " (:id case-config) " (ingestion " (:ingestion-id result) ")"))
                          (assoc result :case-id (:id case-config))))
                      case-configs)]

            ;; Poll all ingestions until complete
            (doseq [{:keys [ingestion-id case-id]} ingestion-results]
              (println (str "Waiting for " case-id "..."))
              (poll-ingestion! db-pool ingestion-id))

            (println "All ingestions complete. Waiting for matching...")
            ;; Allow time for the matching worker to process
            (Thread/sleep 15000)

            ;; Check expected edges
            (let [doc-id-by-case (into {}
                                       (map (fn [{:keys [case-id document-id]}]
                                              [case-id document-id])
                                            ingestion-results))
                  results
                  (mapv (fn [{:keys [a b min-score]}]
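                          (let [doc-a (get doc-id-by-case a)
                                doc-b (get doc-id-by-case b)
                                ;; Look the pair up in both orientations instead of
                                ;; sorting in Clojure. If the ids are UUIDs,
                                ;; java.util.UUID compares its halves as signed longs,
                                ;; which need not agree with the order the database
                                ;; used when it stored the pair.
                                match (db.sql/execute-one!
                                       db-pool
                                       {:select [:blended-score :match-method]
                                        :from   [:ap-document-match]
                                        :where  [:or
                                                 [:and
                                                  [:= :document-a-id doc-a]
                                                  [:= :document-b-id doc-b]]
                                                 [:and
                                                  [:= :document-a-id doc-b]
                                                  [:= :document-b-id doc-a]]]})]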
                            {:edge       (str a " <-> " b)
                             :found?     (some? match)
                             :score      (:ap-document-match/blended-score match)
                             :method     (:ap-document-match/match-method match)
                             :min-score  min-score
                             :pass?      (and (some? match)
                                              (>= (double (:ap-document-match/blended-score match))
                                                  (double min-score)))}))
                        (:expected-edges group))]
              ;; Build the line with str: println would insert a second space
              ;; between its arguments.
              (println (str "\nMatch Group Results: " group-id))
              (doseq [{:keys [edge found? score method min-score pass?]} results]
                (println (str "  " (if pass? "PASS" "FAIL") "  " edge
                              (if found?
                                (str "  score=" (format "%.3f" (double score))
                                     " method=" method
                                     " (min=" min-score ")")
                                "  NOT FOUND"))))
              results))
          (finally
            (teardown! db-pool tenant-slug))))
      (catch Throwable t
        (try (teardown! db-pool tenant-slug) (catch Throwable _))
        (throw t)))))
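
The fixed 15-second sleep in run-match-group! is the simplest option, but it is either too long or too short depending on the machine. Bounded polling is one alternative. The sketch below is a generic helper under assumptions: wait-until is hypothetical, and the predicate that would query ap-document-match for the expected edges is left to the caller:

```clojure
(defn wait-until
  "Calls (pred) every interval-ms until it returns a truthy value or
  timeout-ms has elapsed. Returns the truthy value, or nil on timeout."
  [pred {:keys [timeout-ms interval-ms]
         :or   {timeout-ms 60000 interval-ms 1000}}]
  (let [deadline (+ (System/currentTimeMillis) timeout-ms)]
    (loop []
      (or (pred)
          ;; Not ready yet: sleep and retry while the deadline has not passed.
          (when (< (System/currentTimeMillis) deadline)
            (Thread/sleep (long interval-ms))
            (recur))))))
```

With such a helper, the edge checks would start as soon as the matching worker has written its rows, and a nil return would distinguish "matching never finished" from "matching finished with a low score".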

Step 2: Add a match group to the manifest

Requires inv-004-with-po and po-001-purchase-order cases to already exist. Add to manifest:

:match-groups
[{:id             "match-001-invoice-po"
  :description    "Invoice with PO reference"
  :cases          ["inv-004-with-po" "po-001-purchase-order"]
  :expected-edges [{:a "inv-004-with-po" :b "po-001-purchase-order" :min-score 0.7}]}]

Step 3: Test the match group

(correctness/run-match-group! "match-001-invoice-po")

Step 4: Commit

cd orcha && git add dev/dev/getorcha/correctness.clj dev/correctness/manifest.edn && git commit -m "feat: add match group support to correctness testing"