Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Candidate Retrieval Redesign Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Replace the naive 50-random-documents candidate retrieval with a three-layer pipeline: counterparty-filtered SQL search → evidence scoring → LLM decision.

Architecture: New normalized_counterparty and normalized_references columns on document enable SQL-level filtering by supplier identity. Candidate retrieval narrows by counterparty first, then ranks using BM25 + deterministic signals. Evidence scoring adds new signals (quantity match, date alignment, description overlap, currency mismatch). LLM trigger simplified to: clear single match = auto, everything else = top 3 to LLM.

Tech Stack: Clojure, PostgreSQL (tsvector, GIN indexes, array operators), HoneySQL, Malli, Apache Commons Text (Jaro-Winkler), Migratus migrations

Design doc: docs/plans/2026-02-25-candidate-retrieval-redesign.md


Task 1: Database migration — add normalized columns

Files:

Step 1: Create the migration

Run: bb migrate create "add-normalized-matching-columns"

Step 2: Write the up migration

ALTER TABLE document ADD COLUMN normalized_counterparty text;
ALTER TABLE document ADD COLUMN normalized_references text[] DEFAULT '{}';

CREATE INDEX idx_document_normalized_counterparty
  ON document(normalized_counterparty)
  WHERE normalized_counterparty IS NOT NULL;

CREATE INDEX idx_document_normalized_references
  ON document USING gin(normalized_references)
  WHERE normalized_references != '{}';

Step 3: Write the down migration

DROP INDEX IF EXISTS idx_document_normalized_references;
DROP INDEX IF EXISTS idx_document_normalized_counterparty;
ALTER TABLE document DROP COLUMN IF EXISTS normalized_references;
ALTER TABLE document DROP COLUMN IF EXISTS normalized_counterparty;

Step 4: Verify migration runs

Apply the migration, then run: psql -h localhost -U postgres -d orcha -c "SELECT normalized_counterparty, normalized_references FROM document LIMIT 1"

Expected: Both columns exist, values are NULL/empty.

Step 5: Commit

git add resources/migrations/*add-normalized-matching-columns*
git commit -m "feat(matching): add normalized_counterparty and normalized_references columns"

Task 2: Counterparty and reference extraction functions

Files:

These functions extract and normalize the counterparty name and reference numbers from a document's structured data. They are used both at ingestion time (to populate the DB columns) and at query time (to build the search parameters).

Step 1: Write the failing tests

(ns com.getorcha.workers.matching.normalize-test
  (:require [clojure.test :refer [deftest is testing]]
            [com.getorcha.workers.matching.normalize :as normalize]))


(deftest extract-counterparty-test
  (testing "invoice uses issuer name"
    (is (= "abo kraft waerme ramstein"
           (normalize/extract-counterparty
            {:type :invoice
             :structured-data {:issuer {:name "ABO Kraft & Wärme Ramstein GmbH & Co.KG"}}}))))

  (testing "purchase-order uses supplier name"
    (is (= "acme"
           (normalize/extract-counterparty
            {:type :purchase-order
             :structured-data {:supplier {:name "ACME Corp"}}}))))

  (testing "contract uses party-b name"
    (is (= "abo kraft waerme ramstein"
           (normalize/extract-counterparty
            {:type :contract
             :structured-data {:party-b {:name "ABO Kraft & Wärme Ramstein GmbH & Co KG"}}}))))

  (testing "goods-received-note uses supplier name"
    (is (= "acme"
           (normalize/extract-counterparty
            {:type :goods-received-note
             :structured-data {:supplier {:name "Acme LLC"}}}))))

  (testing "formatting differences normalize to same value"
    (is (= (normalize/extract-counterparty
            {:type :invoice
             :structured-data {:issuer {:name "ABO Kraft & Wärme Ramstein GmbH & Co.KG"}}})
           (normalize/extract-counterparty
            {:type :contract
             :structured-data {:party-b {:name "ABO Kraft & Wärme Ramstein GmbH & Co KG"}}}))))

  (testing "nil name returns nil"
    (is (nil? (normalize/extract-counterparty
               {:type :invoice
                :structured-data {:issuer {:name nil}}})))))


(deftest extract-references-test
  (testing "invoice extracts invoice-number, po-reference, gr-reference"
    (is (= ["2025029ram" "po2024001"]
           (normalize/extract-references
            {:type :invoice
             :structured-data {:invoice-number "2025-029RAM"
                               :po-reference   "PO-2024-001"
                               :gr-reference   nil}}))))

  (testing "purchase-order extracts po-number, contract-reference, requisition-number"
    (is (= ["po2024001" "cfgabo001"]
           (normalize/extract-references
            {:type :purchase-order
             :structured-data {:po-number           "PO-2024-001"
                               :contract-reference  "CFG-ABO-001"
                               :requisition-number  nil}}))))

  (testing "contract extracts contract-number"
    (is (= ["cfgabo001"]
           (normalize/extract-references
            {:type :contract
             :structured-data {:contract-number "CFG-ABO-001"}}))))

  (testing "goods-received-note extracts grn-number, po-reference, delivery-note-number"
    (is (= ["grn001" "po2024001"]
           (normalize/extract-references
            {:type :goods-received-note
             :structured-data {:grn-number           "GRN-001"
                               :po-reference         "PO-2024-001"
                               :delivery-note-number nil}}))))

  (testing "all nil references returns empty vector"
    (is (= []
           (normalize/extract-references
            {:type :invoice
             :structured-data {:invoice-number nil
                               :po-reference   nil
                               :gr-reference   nil}}))))

  (testing "reference normalization strips all separators"
    (is (= ["po2024001"]
           (normalize/extract-references
            {:type :purchase-order
             :structured-data {:po-number "PO 2024/001"}})))))

Step 2: Run tests to verify they fail

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.normalize-test]'

Expected: FAIL (namespace not found)

Step 3: Write the implementation

(ns com.getorcha.workers.matching.normalize
  "Extraction and normalization of counterparty names and reference numbers
   for SQL-based candidate retrieval. Used both at ingestion time (to populate
   indexed columns) and at query time (to build search parameters)."
  (:require [clojure.string :as str]
            [com.getorcha.util.text :as text]))


(defn ^:private get-counterparty-name
  "Extract the raw counterparty name from structured data based on document type.
   The legal entity is always the buyer; the counterparty is the supplier/vendor."
  [{:keys [type structured-data]}]
  (case type
    :invoice             (get-in structured-data [:issuer :name])
    :purchase-order      (get-in structured-data [:supplier :name])
    :contract            (get-in structured-data [:party-b :name])
    :goods-received-note (get-in structured-data [:supplier :name])
    nil))


(defn extract-counterparty
  "Extract and normalize the counterparty name from a document.
   Returns a normalized string suitable for exact SQL matching, or nil."
  [doc]
  (some-> (get-counterparty-name doc)
          text/normalize-supplier-name
          not-empty))


(defn ^:private normalize-reference
  "Normalize a single reference number: lowercase, strip all non-alphanumeric chars."
  [ref]
  (when (and ref (not (str/blank? ref)))
    (-> ref
        str/lower-case
        (str/replace #"[^a-z0-9]" "")
        not-empty)))


(defn ^:private get-raw-references
  "Extract all raw reference numbers from structured data based on document type."
  [{:keys [type structured-data]}]
  (case type
    :invoice
    [(:invoice-number structured-data)
     (:po-reference structured-data)
     (:gr-reference structured-data)]

    :purchase-order
    [(:po-number structured-data)
     (:contract-reference structured-data)
     (:requisition-number structured-data)]

    :contract
    [(:contract-number structured-data)]

    :goods-received-note
    [(:grn-number structured-data)
     (:po-reference structured-data)
     (:delivery-note-number structured-data)]

    []))


(defn extract-references
  "Extract and normalize all reference numbers from a document.
   Returns a vector of normalized reference strings (no nils, no blanks)."
  [doc]
  (->> (get-raw-references doc)
       (keep normalize-reference)
       vec))
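
extract-counterparty delegates to text/normalize-supplier-name, which this plan uses but does not show. To make the expected values in the tests easier to follow, here is a minimal sketch of what such a normalizer could look like. It is a hypothetical stand-in, not the project's implementation: the suffix list and the umlaut table are assumptions chosen only to reproduce the test expectations.

```clojure
(require '[clojure.string :as str])

;; ASSUMPTION: hypothetical stand-in for com.getorcha.util.text/normalize-supplier-name.
;; The suffix set and umlaut table below are illustrative, not taken from the codebase.
(def legal-suffixes
  #{"gmbh" "co" "kg" "ag" "corp" "llc" "inc" "ltd"})

(def umlauts
  {"ä" "ae" "ö" "oe" "ü" "ue" "ß" "ss"})

(defn normalize-supplier-name
  "Lowercases, transliterates German umlauts, replaces punctuation with
   spaces, and drops legal-form tokens. Returns a possibly empty string."
  [s]
  (let [tokens (-> (str/lower-case s)
                   (str/replace #"[äöüß]" umlauts)   ; the map acts as the replacement fn
                   (str/replace #"[^a-z0-9]+" " ")
                   str/trim
                   (str/split #"\s+"))]
    (->> tokens
         (remove legal-suffixes)
         (str/join " "))))

(normalize-supplier-name "ABO Kraft & Wärme Ramstein GmbH & Co.KG")
;; => "abo kraft waerme ramstein"
```

Both spellings used in the tests ("Co.KG" and "Co KG") collapse to the same value because punctuation becomes whitespace before the suffix tokens are removed. One weakness of this sketch: it drops suffix tokens anywhere in the name, not only at the end.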

Step 4: Run tests to verify they pass

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.normalize-test]'

Expected: PASS

Step 5: Commit

git add src/com/getorcha/workers/matching/normalize.clj test/com/getorcha/workers/matching/normalize_test.clj
git commit -m "feat(matching): add counterparty and reference normalization functions"

Task 3: Populate normalized columns during matching

Files:

The matching worker already fetches the document and has its structured data. Populate the normalized columns there, before running matching. This keeps ingestion untouched and co-locates the logic with matching.

Step 1: Write the failing test

Add to test/com/getorcha/workers/matching/integration_test.clj:

(deftest normalized-columns-populated-on-match-test
  (testing "matching populates normalized_counterparty and normalized_references"
    (let [le-id  (helpers/create-legal-entity!)
          _po-id (create-document-with-type!
                   le-id :purchase-order
                   {:po-number "PO-2024-001"
                    :supplier  {:name   "ACME Corp"
                                :vat-id "DE123456789"}
                    :total-value 10000})
          inv-id (create-document-with-type!
                   le-id :invoice
                   {:invoice-number "INV-2024-001"
                    :po-reference   "PO-2024-001"
                    :issuer         {:name   "ACME Corp"
                                     :vat-id "DE123456789"}
                    :total          10000})
          source {:id              inv-id
                  :type            :invoice
                  :legal-entity-id le-id
                  :structured-data {:invoice-number "INV-2024-001"
                                    :po-reference   "PO-2024-001"
                                    :issuer         {:name   "ACME Corp"
                                                     :vat-id "DE123456789"}
                                    :total          10000}}]

      (matching/match-document! fixtures/*db* {} nil source)

      ;; Verify normalized columns were populated on the source document
      (let [doc (get-document inv-id)]
        (is (= "acme" (:document/normalized-counterparty doc)))
        (is (= ["inv2024001" "po2024001"]
               (vec (:document/normalized-references doc))))))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.integration-test]'

Expected: FAIL (normalized columns are nil)

Step 3: Add DB function for updating normalized columns

In src/com/getorcha/db/document_matching.clj, add:

(defn set-normalized-fields!
  "Update normalized counterparty and references for a document."
  [db document-id counterparty references]
  (db.sql/execute! db
    {:update :document
     :set    {:normalized-counterparty counterparty
              :normalized-references   [:lift (or references [])]}
     :where  [:= :id document-id]}))

Step 4: Update match-document! to populate normalized columns

In src/com/getorcha/workers/matching/core.clj:

Add to requires:

[com.getorcha.workers.matching.normalize :as normalize]

At the beginning of match-document!, before clearing previous state, add:

;; Populate normalized columns for candidate retrieval
(let [counterparty (normalize/extract-counterparty doc)
      references   (normalize/extract-references doc)]
  (db.matching/set-normalized-fields! db (:id doc) counterparty references))

Step 5: Run tests to verify they pass

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.integration-test]'

Expected: PASS (all tests including the new one)

Step 6: Commit

git add src/com/getorcha/workers/matching/core.clj src/com/getorcha/db/document_matching.clj test/com/getorcha/workers/matching/integration_test.clj
git commit -m "feat(matching): populate normalized columns during matching"

Task 4: Rewrite candidate retrieval with counterparty-filtered SQL

Files:

This is the core of Layer 1. The new find-candidates filters by normalized counterparty, then ranks using BM25 + reference overlap + amount proximity.

Step 1: Write the failing test

In test/com/getorcha/workers/matching/candidates_test.clj, add a test that requires counterparty filtering:

(deftest find-candidates-filters-by-counterparty-test
  (testing "candidates are filtered to same normalized counterparty"
    (let [le-id (helpers/create-legal-entity!)
          ;; Same supplier, different formatting
          po-id (create-document-with-type!
                  le-id :purchase-order
                  {:po-number "PO-001"
                   :supplier  {:name "ACME Corp GmbH"}
                   :total-value 10000})
          ;; Different supplier
          other-po-id (create-document-with-type!
                        le-id :purchase-order
                        {:po-number "PO-002"
                         :supplier  {:name "Other Corp"}
                         :total-value 10000})]

      ;; Set normalized columns (normally done by match-document!)
      (db.matching/set-normalized-fields!
       fixtures/*db* po-id "acme" ["po001"])
      (db.matching/set-normalized-fields!
       fixtures/*db* other-po-id "other" ["po002"])

      (let [candidates (candidates/find-candidates
                        fixtures/*db* {}
                        {:id              (random-uuid)
                         :type            :invoice
                         :legal-entity-id le-id
                         :structured-data {:issuer {:name "Acme Corp"}}
                         :normalized-counterparty "acme"
                         :normalized-references ["inv001"]})]
        ;; Only the ACME PO should be returned, not Other Corp
        (is (= 1 (count candidates)))
        (is (= po-id (:document/id (first candidates))))))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.candidates-test]'

Expected: FAIL (current implementation doesn't filter by counterparty)

Step 3: Rewrite find-candidates

Replace the content of src/com/getorcha/workers/matching/candidates.clj:

(ns com.getorcha.workers.matching.candidates
  "Candidate retrieval for document matching.

   Layer 1 of the matching pipeline. Filters by normalized counterparty name
   to narrow the search space, then ranks using BM25 text overlap, reference
   number overlap, and amount proximity. Returns up to 50 candidates."
  (:require [clojure.string]
            [com.getorcha.db.sql :as db.sql]
            [com.getorcha.workers.matching.searchable-text :as searchable-text]))


(def ^:private matchable-pairs
  "Valid document type pairs for matching."
  #{#{:invoice :purchase-order}
    #{:invoice :contract}
    #{:purchase-order :contract}
    #{:goods-received-note :purchase-order}})


(defn get-matchable-types
  "Get document types that can match with the given type."
  [doc-type]
  (->> matchable-pairs
       (filter #(contains? % doc-type))
       (mapcat identity)
       (remove #(= % doc-type))
       set))


(defn find-candidates
  "Find candidate documents for matching using counterparty-filtered ranked search.

   The source document must include `:normalized-counterparty` and
   `:normalized-references` keys (populated by `match-document!` before calling).

   Ranking combines:
   - BM25 text overlap on searchable_text (commodity, quantities, descriptions)
   - Reference number overlap (any normalized reference in common)
   - Amount proximity (within 50% of source total)

   Returns up to 50 candidate document rows, ordered by rank score descending."
  [db _search-config doc]
  (let [matchable-types      (get-matchable-types (:type doc))
        counterparty         (:normalized-counterparty doc)
        source-refs          (:normalized-references doc)
        source-total         (or (get-in doc [:structured-data :total])
                                 (get-in doc [:structured-data :total-value])
                                 0)
        query-text           (searchable-text/build-searchable-text doc)]
    (when counterparty
      (db.sql/execute! db
        {:select   [:*
                    [[:+
                      ;; Text overlap via ts_rank (PostgreSQL's built-in rank,
                      ;; standing in for BM25 here)
                      [:* [:coalesce
                           [:ts_rank
                            [:to_tsvector [:inline "simple"] :searchable-text]
                            [:plainto_tsquery [:inline "simple"] query-text]]
                           0]
                       10]
                      ;; Reference overlap bonus
                      [:case
                       [:and
                        [:is-not :normalized-references nil]
                        [:<> [:cast :normalized-references :text] [:inline "{}"]]
                        [:raw (str "normalized_references && ARRAY["
                                   (->> source-refs
                                        (map #(str "'" (db.sql/escape-string %) "'"))
                                        (clojure.string/join ","))
                                   "]::text[]")]]
                       50
                       :else 0]
                      ;; Amount proximity bonus: |candidate - source| < 50% of source.
                      ;; Written as a multiplication so the SQL never divides by
                      ;; source-total, which may be 0.
                      [:case
                       [:and
                        [:> source-total 0]
                        [:< [:abs [:- [:coalesce
                                       [:cast [:-> :structured-data [:inline "total"]] :numeric]
                                       [:cast [:-> :structured-data [:inline "total-value"]] :numeric]
                                       0]
                                      source-total]]
                         [:* 0.5 source-total]]]
                       10
                       :else 0]]
                     :rank-score]]
         :from     [:document]
         :where    [:and
                    [:= :legal-entity-id (:legal-entity-id doc)]
                    [:in :type (mapv #(db.sql/->cast % :document-type) matchable-types)]
                    [:= :normalized-counterparty counterparty]
                    [:is-not :structured-data nil]
                    [:<> :id (:id doc)]]
         :order-by [[:rank-score :desc]]
         :limit    50}))))

Important: The HoneySQL for the ranking expression is complex. The exact syntax may need adjustment depending on how db.sql handles raw SQL fragments. If HoneySQL can't express the full ranking query cleanly, use db.sql/execute! with a raw SQL string and parameters instead. The key behavior to preserve:

  1. Filter by normalized_counterparty = :counterparty
  2. Rank by BM25 + reference overlap + amount proximity
  3. Limit 50
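
To make the ranking arithmetic concrete, the following sketch mirrors the three terms in plain Clojure. It is an in-memory illustration only, not the query itself: the function name and the :refs and :total keys are hypothetical, and ts-rank is passed in as a number because only PostgreSQL can compute it.

```clojure
(require '[clojure.set :as set])

(defn rank-score
  "In-memory mirror of the ranking arithmetic (a sketch, not the SQL itself).
   `ts-rank` is whatever PostgreSQL's ts_rank would return for the pair."
  [source candidate ts-rank]
  (let [source-total    (or (:total source) 0)
        candidate-total (or (:total candidate) 0)
        shared-refs     (set/intersection (set (:refs source))
                                          (set (:refs candidate)))]
    (+ (* 10 (or ts-rank 0))                  ; text overlap, weighted x10
       (if (seq shared-refs) 50 0)            ; any normalized reference in common
       (if (and (pos? source-total)           ; amount within 50% of source total
                (< (Math/abs (double (- candidate-total source-total)))
                   (* 0.5 source-total)))
         10
         0))))

(rank-score {:refs ["inv001" "po001"] :total 10000}
            {:refs ["po001"] :total 10400}
            0.5)
;; => 65.0
```

A shared reference (50) outweighs the amount bonus (10), so candidates that cite the same PO or contract number sort ahead of candidates that merely have a similar total.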

Step 4: Run all matching tests

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.candidates-test]'

Expected: PASS

Then run the full matching test suite to verify no regressions:

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.integration-test]'

Expected: PASS. Existing tests must still pass. They create documents with matching supplier names, so the counterparty filter itself should not break them. However, the filter only sees rows whose normalized columns are set, so the integration tests must populate those columns on ALL test documents, not just the source. Either update create-document-with-type! in the integration test to also set normalized columns, or update match-document! to normalize the candidates it encounters.

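Design note: match-document! currently only normalizes the source document. Candidates need normalized columns too. One approach for the transition period and existing data is to backfill candidates whose normalized_counterparty is NULL inline. Caveat: find-candidates as written filters on normalized_counterparty = counterparty, so rows that are still NULL never appear in its result. The loop below therefore only has an effect if it runs over rows fetched without that filter; otherwise rely on the Task 8 backfill. Add to match-document! after finding candidates: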

;; Backfill normalized columns on candidates that don't have them yet
(doseq [row candidate-rows
        :when (nil? (:document/normalized-counterparty row))]
  (let [candidate-doc (candidate-row->doc row)]
    (db.matching/set-normalized-fields!
     db (:document/id row)
     (normalize/extract-counterparty candidate-doc)
     (normalize/extract-references candidate-doc))))

Step 5: Commit

git add src/com/getorcha/workers/matching/candidates.clj test/com/getorcha/workers/matching/candidates_test.clj
git commit -m "feat(matching): rewrite candidate retrieval with counterparty filter and ranked search"

Task 5: Add new evidence signals

Files:

This expands Layer 2 scoring with the signals needed for the no-reference matching case.

Step 1: Update the Signal schema

In src/com/getorcha/schema/matching.clj, expand the Signal enum:

(def Signal
  "Evidence signal type."
  [:enum
   :po-number-exact
   :contract-ref-exact
   :po-ref-exact
   :vat-id-match
   :iban-match
   :amount-within-2pct
   :amount-within-5pct
   :supplier-name-fuzzy
   :date-in-validity
   :date-within-period
   :delivery-date-match
   :quantity-exact
   :description-overlap
   :currency-mismatch
   :vat-id-mismatch])

Step 2: Write failing tests for new signals

In test/com/getorcha/workers/matching/evidence_test.clj, add tests. Key tests:

(deftest quantity-exact-signal-test
  (testing "fires when invoice line-item quantity matches contract deliverable"
    (let [{:keys [evidence]} (evidence/compute-score
                              {:type :contract
                               :structured-data {:deliverables ["1.200.000 kWh (HS,N) Biomethan am Gastag 22.10.2025"]
                                                 :party-b {:name "ABO"}}}
                              {:type :invoice
                               :structured-data {:line-items [{:quantity 1200000.0 :unit "kWh Hs"}]
                                                 :issuer {:name "ABO"}}})]
      (is (some #(= :quantity-exact (:signal %)) evidence)))))

(deftest currency-mismatch-signal-test
  (testing "fires negative signal when currencies differ"
    (let [{:keys [evidence]} (evidence/compute-score
                              {:type :invoice
                               :structured-data {:currency "EUR" :issuer {:name "X"}}}
                              {:type :purchase-order
                               :structured-data {:currency "USD" :supplier {:name "X"}}})]
      (is (some #(= :currency-mismatch (:signal %)) evidence)))))

(deftest supplier-name-fuzzy-signal-test
  (testing "fires when normalized supplier names have >0.8 Jaro-Winkler similarity"
    (let [{:keys [evidence]} (evidence/compute-score
                              {:type :invoice
                               :structured-data {:issuer {:name "ABO Kraft & Wärme Ramstein GmbH & Co.KG"}}}
                              {:type :contract
                               :structured-data {:party-b {:name "ABO Kraft & Wärme Ramstein GmbH & Co KG"}}})]
      (is (some #(= :supplier-name-fuzzy (:signal %)) evidence)))))

Step 3: Run tests to verify they fail

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.evidence-test]'

Expected: FAIL (new signals not implemented)

Step 4: Implement new signals in collect-signals

In src/com/getorcha/workers/matching/evidence.clj:

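Add to requires (clojure.set and clojure.string are used fully qualified by the helpers below; skip any that are already present):

[clojure.set]
[clojure.string]
[com.getorcha.util.text :as text]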

Add import for Jaro-Winkler:

(:import (org.apache.commons.text.similarity JaroWinklerSimilarity))

Update evidence-signals map to match the design doc (add :quantity-exact 35, :date-within-period 20, :delivery-date-match 25, :description-overlap 10, :currency-mismatch -30; remove :medium from match-thresholds).

Add helper functions:

(def ^:private jaro-winkler (JaroWinklerSimilarity.))

(defn ^:private get-currency
  [{:keys [structured-data]}]
  (:currency structured-data))

(defn ^:private get-counterparty-name
  "Extract raw counterparty name from document."
  [{:keys [type structured-data]}]
  (case type
    :invoice             (get-in structured-data [:issuer :name])
    :purchase-order      (get-in structured-data [:supplier :name])
    :contract            (get-in structured-data [:party-b :name])
    :goods-received-note (get-in structured-data [:supplier :name])
    nil))

(defn ^:private get-quantities
  "Extract numeric quantities from document line items and deliverables.
   Values are coerced to doubles so that 1200000 and 1200000.0 compare equal
   in the set intersection."
  [{:keys [type structured-data]}]
  (let [line-item-qtys (fn [k]
                         (->> (:line-items structured-data)
                              (keep k)
                              (filter number?)
                              (map double)
                              set))]
    (case type
      :invoice             (line-item-qtys :quantity)
      :purchase-order      (line-item-qtys :quantity)
      :goods-received-note (line-item-qtys :quantity-received)

      :contract
      ;; Parse the first quantity from each deliverables string (e.g., "1.200.000 kWh").
      ;; Assumes German number formatting: "." groups thousands, "," is the decimal mark.
      ;; The pattern starts at a digit so stray punctuation is not picked up first.
      (->> (:deliverables structured-data)
           (keep #(when-let [m (re-find #"\d[\d.,]*" %)]
                    (try
                      (-> m
                          (clojure.string/replace "." "")
                          (clojure.string/replace "," ".")
                          Double/parseDouble)
                      (catch NumberFormatException _ nil))))
           set)

      #{})))
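
The contract branch above depends on German number formatting, which is easy to get wrong. The sketch below isolates that parsing step so its behavior can be checked at the REPL. The function name is hypothetical, and the pattern starts at a digit, which is an assumption of this sketch.

```clojure
(require '[clojure.string :as str])

(defn parse-german-number
  "Parses the first number in s, assuming German formatting: '.' as the
   thousands separator and ',' as the decimal separator.
   Returns a double, or nil when nothing parseable is found."
  [s]
  (when-let [m (and s (re-find #"\d[\d.,]*" s))]
    (try
      (-> m
          (str/replace "." "")    ; drop thousands separators
          (str/replace "," ".")   ; decimal comma -> decimal point
          Double/parseDouble)
      (catch NumberFormatException _ nil))))

(parse-german-number "1.200.000 kWh (HS,N) Biomethan am Gastag 22.10.2025")
;; => 1200000.0
(parse-german-number "Menge: 1.234,5 t")
;; => 1234.5
(parse-german-number "1.5 MWh")
;; => 15.0, because the "." is read as a thousands separator
```

Two caveats follow from this. English-formatted decimals such as "1.5" are misread, so the approach only suits documents known to use German formatting. And because Clojure's = treats 1200000 and 1200000.0 as different values, quantities from line items need to be coerced to doubles before a set intersection with the parsed values can match.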

Add new signal checks in collect-signals:

;; Supplier name fuzzy match (Jaro-Winkler on normalized names)
(let [name-a (some-> (get-counterparty-name doc-a) text/normalize-supplier-name)
      name-b (some-> (get-counterparty-name doc-b) text/normalize-supplier-name)]
  (when (and name-a name-b
             (> (.apply jaro-winkler name-a name-b) 0.8))
    (conj! signals {:signal :supplier-name-fuzzy
                    :value  (str name-a " ~ " name-b)
                    :weight (:supplier-name-fuzzy evidence-signals)})))

;; Quantity exact match
(let [qtys-a (get-quantities doc-a)
      qtys-b (get-quantities doc-b)
      common (clojure.set/intersection qtys-a qtys-b)]
  (when (seq common)
    (conj! signals {:signal :quantity-exact
                    :value  (str (first common))
                    :weight (:quantity-exact evidence-signals)})))

;; Currency mismatch
(let [curr-a (get-currency doc-a)
      curr-b (get-currency doc-b)]
  (when (and curr-a curr-b (not= curr-a curr-b))
    (conj! signals {:signal :currency-mismatch
                    :value  (str curr-a " vs " curr-b)
                    :weight (:currency-mismatch evidence-signals)})))

Note on date signals (:date-within-period, :delivery-date-match): These require parsing contract effective/expiration dates and invoice service periods. Implement the comparison but keep it straightforward — compare date strings directly since they're in ISO format ("2025-10-01" compares correctly as a string for basic range checks). The parsing of contract deliverable dates from free-text strings is more involved; use a simple regex for "Gastag DD.MM.YYYY" patterns.
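
A minimal sketch of the two date helpers this note describes, assuming dates arrive as yyyy-MM-dd strings. The function names are hypothetical, and the plan does not specify which structured-data fields hold the contract's effective and expiration dates, so the helpers take plain strings.

```clojure
(defn iso-date-in-range?
  "True when ISO date string d lies within [start, end], inclusive.
   Relies on yyyy-MM-dd strings sorting lexicographically in date order."
  [d start end]
  (boolean (and d start end
                (<= (compare start d) 0)
                (<= (compare d end) 0))))

(defn parse-gastag
  "Extracts a 'Gastag DD.MM.YYYY' date from free text and returns it as an
   ISO yyyy-MM-dd string, or nil when the pattern is absent."
  [s]
  (when-let [[_ dd mm yyyy] (and s (re-find #"Gastag\s+(\d{2})\.(\d{2})\.(\d{4})" s))]
    (str yyyy "-" mm "-" dd)))

(parse-gastag "1.200.000 kWh (HS,N) Biomethan am Gastag 22.10.2025")
;; => "2025-10-22"
(iso-date-in-range? "2025-10-22" "2025-10-01" "2025-12-31")
;; => true
```

String comparison only works while every date is zero-padded ISO. Converting the German DD.MM.YYYY form first, as parse-gastag does, keeps the range check valid.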

Note on :description-overlap: Compare the document description/title fields. For invoices, use the first line-item description. For contracts, use the title. Check for shared significant words (excluding stop words). Fire the signal if >50% of significant words overlap.
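
One way to sketch the overlap check follows. The note leaves two things open, so both are assumptions here: the stop-word list is a tiny illustrative sample, and ">50%" is measured against the smaller of the two word sets.

```clojure
(require '[clojure.set :as set]
         '[clojure.string :as str])

;; ASSUMPTION: illustrative stop words only; a real list would be longer.
(def stop-words
  #{"the" "and" "of" "for" "der" "die" "das" "und" "von" "am" "im"})

(defn significant-words
  "Lowercased set of words in s, minus stop words."
  [s]
  (->> (str/split (str/lower-case (or s "")) #"[^\p{L}\p{N}]+")
       (remove str/blank?)
       (remove stop-words)
       set))

(defn description-overlap?
  "True when more than half of the smaller word set also appears in the other."
  [a b]
  (let [wa      (significant-words a)
        wb      (significant-words b)
        smaller (min (count wa) (count wb))]
    (and (pos? smaller)
         (> (/ (count (set/intersection wa wb)) smaller) 1/2))))

(description-overlap? "Lieferung von Biomethan Oktober 2025"
                      "Biomethan Liefervertrag 2025")
;; => true (2 of the 3 words in the shorter description are shared)
```

Measuring against the smaller set keeps a short contract title from being penalized by a long line-item description. Measuring against the union instead would be stricter.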

Step 5: Run tests

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.evidence-test]'

Expected: PASS

Step 6: Commit

git add src/com/getorcha/workers/matching/evidence.clj test/com/getorcha/workers/matching/evidence_test.clj src/com/getorcha/schema/matching.clj
git commit -m "feat(matching): add quantity, date, description, and currency evidence signals"

Task 6: Simplify LLM trigger logic

Files:

The new rule: 0 candidates = no match; exactly one candidate at or above the high threshold = auto-match; anything else = top 3 to LLM.
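
The rule can be read as a three-way classification. The sketch below expresses it as a pure function with no LLM call, which makes the boundaries easy to test. The function name, the returned :action keywords, and the 0.8 threshold in the example are all hypothetical; the real threshold lives in evidence/match-thresholds.

```clojure
(defn decision
  "Classifies a candidate list under the simplified trigger rule.
   Returns {:action :no-match}, {:action :auto-match ...} or {:action :llm ...}."
  [high-threshold candidates]
  (let [high (filter #(>= (:score %) high-threshold) candidates)]
    (cond
      (empty? candidates) {:action :no-match}
      (= 1 (count high))  {:action :auto-match :candidates (vec high)}
      ;; Sort explicitly so "top 3" does not depend on the caller's ordering.
      :else               {:action     :llm
                           :candidates (vec (take 3 (sort-by :score > candidates)))})))

(decision 0.8 [{:score 0.9} {:score 0.4}])
;; => {:action :auto-match, :candidates [{:score 0.9}]}
(:action (decision 0.8 [{:score 0.9} {:score 0.85}]))
;; => :llm (two candidates clear the threshold, so there is no single winner)
```

Note that two strong candidates go to the LLM rather than auto-matching the higher one, which is the intended effect of requiring exactly one candidate above the threshold.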

Step 1: Write the failing test

(deftest decide-matches-sends-top-3-to-llm-test
  (testing "sends only top 3 candidates to LLM when multiple exist"
    (let [candidates (mapv #(hash-map :doc {:document/id (random-uuid)}
                                      :score %
                                      :evidence [])
                           [0.65 0.60 0.55 0.50 0.45])
          llm-calls  (atom [])
          llm-config {:matching {:provider :test}}]
      (with-redefs [llm-decision/llm-match-decision
                    (fn [_ _ cands]
                      (reset! llm-calls cands)
                      {:matches [{:candidate 1 :confidence "high" :reasoning "test"}]})]
        (matching/decide-matches llm-config {} candidates)
        ;; LLM should receive only top 3, not all 5
        (is (= 3 (count @llm-calls)))))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.core-test]'

Expected: FAIL (current implementation sends all candidates)

Step 3: Update decide-matches

(defn decide-matches
  "Decide which candidates to match based on scores.

   Rules:
   - 0 candidates → no match
   - 1 candidate ≥ high threshold → auto-match (rule-based)
   - Anything else → top 3 to LLM (if config provided)

   Returns seq of `{:doc :score :evidence :match-method}`."
  [llm-config source-doc candidates]
  (when (seq candidates)
    (let [high-candidates (filter #(>= (:score %) (:high evidence/match-thresholds)) candidates)]
      (cond
        ;; Single clear winner above high threshold — no LLM needed
        (= 1 (count high-candidates))
        [(assoc (first high-candidates) :match-method "rule-based")]

        ;; Any other case — LLM decides on top 3
        llm-config
        (let [top-3 (take 3 candidates)]
          (resolve-llm-matches
           (llm-decision/llm-match-decision llm-config source-doc top-3)
           top-3))

        ;; No LLM config and no single clear winner: no match.
        ;; (The single-winner case is already handled by the first clause.)
        :else nil))))

Step 4: Run all core tests

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.core-test]'

Expected: PASS

Step 5: Commit

git add src/com/getorcha/workers/matching/core.clj test/com/getorcha/workers/matching/core_test.clj
git commit -m "feat(matching): simplify LLM trigger to top-3 candidates rule"

Task 7: Update match-document! to pass normalized data through

Files:

Step 1: Update match-document! to attach normalized fields to the doc before finding candidates

In match-document!, after populating the normalized columns in the DB, also attach them to the doc map so find-candidates can use them:

(defn match-document!
  [db search-config llm-config doc]
  {:pre [(m/validate schema.matching/SourceDocument doc)]}
  ;; Populate normalized columns
  (let [counterparty (normalize/extract-counterparty doc)
        references   (normalize/extract-references doc)
        doc          (assoc doc
                            :normalized-counterparty counterparty
                            :normalized-references   references)]
    (db.matching/set-normalized-fields! db (:id doc) counterparty references)

    ;; Clear previous state
    (db.matching/delete-matches-for-document! db (:id doc))
    (db.matching/set-cluster-id! db [(:id doc)] nil)

    (if (nil? counterparty)
      (log/warn "No counterparty extracted, skipping matching"
                {:document-id (:id doc) :type (:type doc)})
      (let [candidate-rows (candidates/find-candidates db search-config doc)]
        ;; ... rest of existing logic unchanged
        ))))

Step 2: Run full integration test suite

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.integration-test]'

Expected: PASS (all existing tests must still pass)

Step 3: Commit

git add src/com/getorcha/workers/matching/core.clj
git commit -m "feat(matching): wire normalized fields through match-document! pipeline"

Task 8: Backfill migration for existing documents

Files:

For existing documents that already have structured data, backfill the normalized columns. The migration file itself is a no-op; the backfill runs from the REPL using the real normalization functions (Option A below). Any document the backfill misses is corrected the next time it is re-matched, because match-document! repopulates both columns.

Step 1: Create the migration

Run: bb migrate create "backfill-normalized-matching-columns"

Step 2: Choose the backfill approach and add the backfill function

The backfill needs to extract counterparty names from JSONB and normalize them. SQL can handle the basic normalization (lowercase, trim) but not the full normalize-supplier-name logic (umlaut transliteration, suffix stripping). Two approaches:

Option A (recommended): Write a Clojure backfill script that runs after migration, using the actual normalization functions. Create scripts/backfill_normalized_columns.clj as a Babashka-compatible script or a REPL-evaluated snippet.

Option B: SQL-only approximation — lowercase + trim the counterparty name. This gets most matches working but won't handle umlauts or company suffixes.
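To make the gap between the two options concrete, here is a minimal sketch using the two supplier spellings from the Task 9 test data. Both functions are hypothetical stand-ins: approx-normalize mimics what SQL lower(trim(name)) would produce, and full-normalize is a simplified illustration of umlaut transliteration plus suffix stripping. The real normalize-supplier-name may use different rules and a longer suffix list.

```clojure
(require '[clojure.string :as str])

(defn approx-normalize
  "Hypothetical Option B stand-in: lowercase + trim only."
  [s]
  (-> s str/trim str/lower-case))

;; Assumed transliteration table; not taken from the real implementation.
(def umlauts {"ä" "ae" "ö" "oe" "ü" "ue" "ß" "ss"})

;; Assumed, deliberately short list of German legal-form tokens.
(def suffix-pattern #"\b(gmbh|mbh|co|kg|ag|ug)\b")

(defn full-normalize
  "Hypothetical Option A stand-in: transliterate umlauts, turn punctuation
   into spaces, drop legal-form tokens, collapse whitespace."
  [s]
  (-> s
      str/trim
      str/lower-case
      (str/replace #"[äöüß]" umlauts)      ; a map works as the replacement fn
      (str/replace #"[^a-z0-9]+" " ")
      (str/replace suffix-pattern " ")
      (str/replace #"\s+" " ")
      str/trim))

(def contract-name "ABO Kraft & Wärme Ramstein GmbH & Co KG")
(def invoice-name  "ABO Kraft & Wärme Ramstein GmbH & Co.KG")

(= (approx-normalize contract-name) (approx-normalize invoice-name))
;; => false ("co kg" vs "co.kg")

(= (full-normalize contract-name) (full-normalize invoice-name))
;; => true (both become "abo kraft waerme ramstein")
```

Under these assumptions, an equality filter on normalized_counterparty would miss this contract/invoice pair with Option B and find it with Option A.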

Go with Option A, implemented as a REPL-evaluated function rather than a standalone script. The migration itself is a no-op, since the columns already exist from Task 1. Add a REPL-executable backfill function:

In src/com/getorcha/workers/matching/normalize.clj, add:

(defn backfill-document!
  "Backfill normalized columns for a single document row.
   Used for migrating existing data."
  [db {:keys [document/id document/type document/structured-data] :as _row}]
  (let [doc {:type            (keyword type)
             :structured-data structured-data}]
    (db.matching/set-normalized-fields!
     db id
     (extract-counterparty doc)
     (extract-references doc))))
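The destructuring in backfill-document! relies on rows arriving with namespace-qualified, kebab-case keys such as :document/id and :document/structured-data. Whether the project's DB layer returns exactly that shape is an assumption here. The sketch below isolates the row-to-doc conversion in a hypothetical helper, row->doc, to show what the destructuring binds:

```clojure
;; Hypothetical helper mirroring the destructuring used in backfill-document!.
;; {:keys [document/id ...]} binds the local `id` from the key :document/id.
(defn row->doc
  [{:keys [document/id document/type document/structured-data]}]
  {:id              id
   :type            (keyword type)   ; the DB stores the type as text
   :structured-data structured-data})

(row->doc {:document/id              42
           :document/type            "invoice"
           :document/structured-data {:currency "EUR"}})
;; => {:id 42, :type :invoice, :structured-data {:currency "EUR"}}

;; Unqualified keys are silently ignored, so every binding is nil:
(row->doc {:id 42 :type "invoice"})
;; => {:id nil, :type nil, :structured-data nil}
```

If the rows come back with unqualified or snake_case keys, the backfill would write nil for every document without raising an error, so it is worth checking one row at the REPL before running the full loop.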

Document the backfill process in a comment block:

(comment
  ;; Backfill all documents that have structured data but no normalized
  ;; counterparty yet. Evaluate the forms below at the REPL; the final
  ;; expression returns the number of documents processed.
  (require '[com.getorcha.db.sql :as db.sql])
  (let [db   (:com.getorcha.db/pool integrant.repl.state/system)
        docs (db.sql/execute! db {:select [:id :type :structured-data]
                                  :from   [:document]
                                  :where  [:and
                                           [:is-not :structured-data nil]
                                           [:is :normalized-counterparty nil]]})]
    (doseq [doc docs]
      (backfill-document! db doc))
    (count docs))
  )

Step 3: Write the up migration as a no-op (the columns were already added in Task 1). Leave the file empty or add a comment:

-- Backfill is done via REPL using com.getorcha.workers.matching.normalize/backfill-document!
-- See the (comment ...) block in that namespace for instructions.
SELECT 1;

Step 4: Commit

git add src/com/getorcha/workers/matching/normalize.clj resources/migrations/*backfill*
git commit -m "feat(matching): add backfill function for normalized matching columns"

Task 9: Integration test — contract-invoice matching without cross-references

Files:

This validates the entire three-layer pipeline works for the no-reference case.

Step 1: Write the integration test

(deftest contract-invoice-matching-without-references-test
  (testing "invoice matches contract via supplier identity, quantity, and date alignment"
    (let [le-id           (helpers/create-legal-entity!)
          ;; Structured data is defined once and shared by the stored document
          ;; and the source map passed to match-document!.
          contract-data   {:contract-number "CFG-ABO-001"
                           :party-b         {:name "ABO Kraft & Wärme Ramstein GmbH & Co KG"}
                           :total-value     250000
                           :currency        "EUR"
                           :effective-date  "2025-10-20"
                           :deliverables    ["1.200.000 kWh (HS,N) Biomethan am Gastag 22.10.2025"
                                             "800.000 kWh (HS,N) Biomethan am Gastag 08.01.2026"]}
          ;; The issuer name differs from party-b only in punctuation
          ;; ("Co.KG" vs "Co KG"), so normalization has to reconcile the two.
          invoice-data    {:invoice-number "2025-029RAM"
                           :issuer         {:name   "ABO Kraft & Wärme Ramstein GmbH & Co.KG"
                                            :vat-id "DE302232673"}
                           :total          157277.93
                           :currency       "EUR"
                           :service-period {:start "2025-10-01" :end "2025-10-31"}
                           :line-items     [{:description "Lieferung von Biomethan"
                                             :quantity    1200000.0
                                             :unit        "kWh Hs"
                                             :amount      157277.93}]}
          contract-id     (create-document-with-type! le-id :contract contract-data)
          inv-id          (create-document-with-type! le-id :invoice invoice-data)
          ;; First match the contract so its normalized columns get populated
          contract-source {:id              contract-id
                           :type            :contract
                           :legal-entity-id le-id
                           :structured-data contract-data}
          ;; Then match the invoice
          invoice-source  {:id              inv-id
                           :type            :invoice
                           :legal-entity-id le-id
                           :structured-data invoice-data}]

      ;; Run matching on contract first (populates its normalized columns)
      (matching/match-document! fixtures/*db* {} nil contract-source)
      ;; Run matching on invoice
      (matching/match-document! fixtures/*db* {} nil invoice-source)

      ;; Verify match was created
      (let [matches (db.matching/get-matches-for-document fixtures/*db* inv-id)]
        (is (= 1 (count matches)) "Invoice should match exactly one contract")
        (let [match (first matches)]
          (is (>= (:document-match/confidence match) 0.7M)
              "Score should be above high threshold (quantity + date + name + description)")
          (is (= "rule-based" (:document-match/match-method match))
              "Should auto-match without LLM"))))))

Step 2: Run test

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.integration-test]'

Expected: PASS

Step 3: Commit

git add test/com/getorcha/workers/matching/integration_test.clj
git commit -m "test(matching): add integration test for contract-invoice matching without references"

Task 10: Run full test suite and verify no regressions

Step 1: Run all matching tests

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.normalize-test com.getorcha.workers.matching.evidence-test com.getorcha.workers.matching.candidates-test com.getorcha.workers.matching.core-test com.getorcha.workers.matching.integration-test com.getorcha.db.document-matching-test]'

Expected: All PASS

Step 2: Run the full project test suite

Run: clj -X:test:silent 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Execution error|failed because|Ran .* tests)"

Expected: No failures related to matching changes. Pre-existing failures unrelated to this work are acceptable.

Step 3: Final commit if any cleanup needed


Dependency Graph

Task 1 (migration) ──────────────────────────────────┐
                                                      │
Task 2 (normalize functions) ─────────────┐           │
                                          │           │
Task 3 (populate at ingestion) ───────────┤           │
                                          │           │
Task 4 (candidate retrieval rewrite) ─────┤           │
                                          ├── Task 9 (integration test)
Task 5 (new evidence signals) ────────────┤           │
                                          │           │
Task 6 (simplify LLM trigger) ────────────┤           │
                                          │           │
Task 7 (wire through match-document!) ────┤           │
                                          │           │
Task 8 (backfill) ────────────────────────┘           │
                                                      │
                                          Task 10 (full test suite) ──┘

Tasks 1 and 2 can run in parallel. Tasks 3-8 depend on both. Task 9 depends on all of 3-8. Task 10 is final verification.
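The ordering described above can be checked mechanically. The sketch below encodes only the dependencies stated in that sentence, plus the assumption that Task 10 runs after Task 9, and groups tasks into waves where every task in a wave has all of its dependencies in earlier waves. The names deps and waves are illustrative and not part of the codebase. Any ordering among Tasks 3-8 themselves is not captured.

```clojure
;; Task number -> set of tasks it depends on.
;; Assumption: Task 10 (final verification) depends on Task 9.
(def deps
  {1  #{}
   2  #{}
   3  #{1 2}
   4  #{1 2}
   5  #{1 2}
   6  #{1 2}
   7  #{1 2}
   8  #{1 2}
   9  #{3 4 5 6 7 8}
   10 #{9}})

(defn waves
  "Returns a vector of sorted sets; each set holds the tasks that become
   runnable once all earlier waves are done. Throws on a dependency cycle."
  [deps]
  (loop [done #{} remaining deps out []]
    (if (empty? remaining)
      out
      (let [ready (into (sorted-set)
                        (for [[task ds] remaining
                              :when (every? done ds)]
                          task))]
        (when (empty? ready)
          (throw (ex-info "Dependency cycle" {:remaining (keys remaining)})))
        (recur (into done ready)
               (apply dissoc remaining ready)
               (conj out ready))))))

(waves deps)
;; => [#{1 2} #{3 4 5 6 7 8} #{9} #{10}]
```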