Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Replace the naive 50-random-documents candidate retrieval with a three-layer pipeline: counterparty-filtered SQL search → evidence scoring → LLM decision.
Architecture: New normalized_counterparty and normalized_references columns on document enable SQL-level filtering by supplier identity. Candidate retrieval narrows by counterparty first, then ranks using PostgreSQL full-text rank (ts_rank, a term-frequency rank standing in for BM25) plus deterministic signals. Evidence scoring adds new signals (quantity match, date alignment, description overlap, currency mismatch). LLM trigger simplified to: clear single match = auto, everything else = top 3 to LLM.
Tech Stack: Clojure, PostgreSQL (tsvector, GIN indexes, array operators), HoneySQL, Malli, Apache Commons Text (Jaro-Winkler), Migratus migrations
Design doc: docs/plans/2026-02-25-candidate-retrieval-redesign.md
Files:
resources/migrations/YYYYMMDDHHMMSS-add-normalized-matching-columns.up.sql
resources/migrations/YYYYMMDDHHMMSS-add-normalized-matching-columns.down.sql
Step 1: Create the migration
Run: bb migrate create "add-normalized-matching-columns"
Step 2: Write the up migration
ALTER TABLE document ADD COLUMN normalized_counterparty text;
ALTER TABLE document ADD COLUMN normalized_references text[] DEFAULT '{}';
CREATE INDEX idx_document_normalized_counterparty
ON document(normalized_counterparty)
WHERE normalized_counterparty IS NOT NULL;
CREATE INDEX idx_document_normalized_references
ON document USING gin(normalized_references)
WHERE normalized_references != '{}';
Step 3: Write the down migration
DROP INDEX IF EXISTS idx_document_normalized_references;
DROP INDEX IF EXISTS idx_document_normalized_counterparty;
ALTER TABLE document DROP COLUMN IF EXISTS normalized_references;
ALTER TABLE document DROP COLUMN IF EXISTS normalized_counterparty;
Step 4: Verify migration runs
Run: psql -h localhost -U postgres -d orcha -c "SELECT normalized_counterparty, normalized_references FROM document LIMIT 1"
Expected: Both columns exist, values are NULL/empty.
Step 5: Commit
git add resources/migrations/*add-normalized-matching-columns*
git commit -m "feat(matching): add normalized_counterparty and normalized_references columns"
Files:
src/com/getorcha/workers/matching/normalize.clj
test/com/getorcha/workers/matching/normalize_test.clj
These functions extract and normalize the counterparty name and reference numbers from a document's structured data. They are used both at ingestion time (to populate the DB columns) and at query time (to build the search parameters).
Step 1: Write the failing tests
(ns com.getorcha.workers.matching.normalize-test
(:require [clojure.test :refer [deftest is testing]]
[com.getorcha.workers.matching.normalize :as normalize]))
(deftest extract-counterparty-test
(testing "invoice uses issuer name"
(is (= "abo kraft waerme ramstein"
(normalize/extract-counterparty
{:type :invoice
:structured-data {:issuer {:name "ABO Kraft & Wärme Ramstein GmbH & Co.KG"}}}))))
(testing "purchase-order uses supplier name"
(is (= "acme"
(normalize/extract-counterparty
{:type :purchase-order
:structured-data {:supplier {:name "ACME Corp"}}}))))
(testing "contract uses party-b name"
(is (= "abo kraft waerme ramstein"
(normalize/extract-counterparty
{:type :contract
:structured-data {:party-b {:name "ABO Kraft & Wärme Ramstein GmbH & Co KG"}}}))))
(testing "goods-received-note uses supplier name"
(is (= "acme"
(normalize/extract-counterparty
{:type :goods-received-note
:structured-data {:supplier {:name "Acme LLC"}}}))))
(testing "formatting differences normalize to same value"
(is (= (normalize/extract-counterparty
{:type :invoice
:structured-data {:issuer {:name "ABO Kraft & Wärme Ramstein GmbH & Co.KG"}}})
(normalize/extract-counterparty
{:type :contract
:structured-data {:party-b {:name "ABO Kraft & Wärme Ramstein GmbH & Co KG"}}}))))
(testing "nil name returns nil"
(is (nil? (normalize/extract-counterparty
{:type :invoice
:structured-data {:issuer {:name nil}}})))))
(deftest extract-references-test
(testing "invoice extracts invoice-number, po-reference, gr-reference"
(is (= ["2025029ram" "po2024001"]
(normalize/extract-references
{:type :invoice
:structured-data {:invoice-number "2025-029RAM"
:po-reference "PO-2024-001"
:gr-reference nil}}))))
(testing "purchase-order extracts po-number, contract-reference, requisition-number"
(is (= ["po2024001" "cfgabo001"]
(normalize/extract-references
{:type :purchase-order
:structured-data {:po-number "PO-2024-001"
:contract-reference "CFG-ABO-001"
:requisition-number nil}}))))
(testing "contract extracts contract-number"
(is (= ["cfgabo001"]
(normalize/extract-references
{:type :contract
:structured-data {:contract-number "CFG-ABO-001"}}))))
(testing "goods-received-note extracts grn-number, po-reference, delivery-note-number"
(is (= ["grn001" "po2024001"]
(normalize/extract-references
{:type :goods-received-note
:structured-data {:grn-number "GRN-001"
:po-reference "PO-2024-001"
:delivery-note-number nil}}))))
(testing "all nil references returns empty vector"
(is (= []
(normalize/extract-references
{:type :invoice
:structured-data {:invoice-number nil
:po-reference nil
:gr-reference nil}}))))
(testing "reference normalization strips all separators"
(is (= ["po2024001"]
(normalize/extract-references
{:type :purchase-order
:structured-data {:po-number "PO 2024/001"}})))))
Step 2: Run tests to verify they fail
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.normalize-test]'
Expected: FAIL — namespace not found
Step 3: Write the implementation
(ns com.getorcha.workers.matching.normalize
"Extraction and normalization of counterparty names and reference numbers
for SQL-based candidate retrieval. Used both at ingestion time (to populate
indexed columns) and at query time (to build search parameters)."
(:require [clojure.string :as str]
[com.getorcha.util.text :as text]))
(defn ^:private get-counterparty-name
"Extract the raw counterparty name from structured data based on document type.
The legal entity is always the buyer; the counterparty is the supplier/vendor."
[{:keys [type structured-data]}]
(case type
:invoice (get-in structured-data [:issuer :name])
:purchase-order (get-in structured-data [:supplier :name])
:contract (get-in structured-data [:party-b :name])
:goods-received-note (get-in structured-data [:supplier :name])
nil))
(defn extract-counterparty
"Extract and normalize the counterparty name from a document.
Returns a normalized string suitable for exact SQL matching, or nil."
[doc]
(some-> (get-counterparty-name doc)
text/normalize-supplier-name
not-empty))
(defn ^:private normalize-reference
"Normalize a single reference number: lowercase, strip all non-alphanumeric chars."
[ref]
(when (and ref (not (str/blank? ref)))
(-> ref
str/lower-case
(str/replace #"[^a-z0-9]" "")
not-empty)))
(defn ^:private get-raw-references
"Extract all raw reference numbers from structured data based on document type."
[{:keys [type structured-data]}]
(case type
:invoice
[(:invoice-number structured-data)
(:po-reference structured-data)
(:gr-reference structured-data)]
:purchase-order
[(:po-number structured-data)
(:contract-reference structured-data)
(:requisition-number structured-data)]
:contract
[(:contract-number structured-data)]
:goods-received-note
[(:grn-number structured-data)
(:po-reference structured-data)
(:delivery-note-number structured-data)]
[]))
(defn extract-references
"Extract and normalize all reference numbers from a document.
Returns a vector of normalized reference strings (no nils, no blanks)."
[doc]
(->> (get-raw-references doc)
(keep normalize-reference)
vec))
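The implementation above delegates to text/normalize-supplier-name, which this plan does not show. The following is a minimal sketch of what such a function might do, based only on the behavior the tests expect (lowercase, umlaut transliteration, punctuation removal, legal-suffix stripping). The function name, the suffix list, and the choice to drop suffix tokens anywhere in the name are assumptions for illustration, not the actual com.getorcha.util.text code.

```clojure
(require '[clojure.string :as str])

;; Hypothetical suffix list; the real list in com.getorcha.util.text may differ.
(def legal-suffixes
  #{"gmbh" "co" "kg" "ag" "corp" "llc" "inc" "ltd"})

(defn normalize-supplier-name-sketch
  "Illustrative stand-in for text/normalize-supplier-name.
  Lowercases, transliterates German umlauts, replaces punctuation with
  spaces, and drops legal-form tokens."
  [s]
  (let [tokens (-> (str/lower-case s)
                   (str/replace "ä" "ae")
                   (str/replace "ö" "oe")
                   (str/replace "ü" "ue")
                   (str/replace "ß" "ss")
                   ;; Anything that is not a-z or 0-9 becomes a separator
                   (str/replace #"[^a-z0-9]+" " ")
                   str/trim
                   (str/split #" +"))]
    (->> tokens
         (remove str/blank?)
         (remove legal-suffixes)
         (str/join " "))))

(normalize-supplier-name-sketch "ABO Kraft & Wärme Ramstein GmbH & Co.KG")
;; => "abo kraft waerme ramstein"
```

Because "Co.KG" and "Co KG" both split into the tokens "co" and "kg" before suffix removal, the two spellings used in the tests collapse to the same string, which is what makes exact SQL equality on normalized_counterparty workable.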
Step 4: Run tests to verify they pass
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.normalize-test]'
Expected: PASS
Step 5: Commit
git add src/com/getorcha/workers/matching/normalize.clj test/com/getorcha/workers/matching/normalize_test.clj
git commit -m "feat(matching): add counterparty and reference normalization functions"
Files:
src/com/getorcha/workers/matching/worker.clj: populate columns when matching worker picks up a document
src/com/getorcha/db/document_matching.clj: add DB function
The matching worker already fetches the document and has its structured data. Populate the normalized columns there, before running matching. This keeps ingestion untouched and co-locates the logic with matching.
Step 1: Write the failing test
Add to test/com/getorcha/workers/matching/integration_test.clj:
(deftest normalized-columns-populated-on-match-test
(testing "matching populates normalized_counterparty and normalized_references"
(let [le-id (helpers/create-legal-entity!)
_po-id (create-document-with-type!
le-id :purchase-order
{:po-number "PO-2024-001"
:supplier {:name "ACME Corp"
:vat-id "DE123456789"}
:total-value 10000})
inv-id (create-document-with-type!
le-id :invoice
{:invoice-number "INV-2024-001"
:po-reference "PO-2024-001"
:issuer {:name "ACME Corp"
:vat-id "DE123456789"}
:total 10000})
source {:id inv-id
:type :invoice
:legal-entity-id le-id
:structured-data {:invoice-number "INV-2024-001"
:po-reference "PO-2024-001"
:issuer {:name "ACME Corp"
:vat-id "DE123456789"}
:total 10000}}]
(matching/match-document! fixtures/*db* {} nil source)
;; Verify normalized columns were populated on the source document
(let [doc (get-document inv-id)]
(is (= "acme" (:document/normalized-counterparty doc)))
(is (= ["inv2024001" "po2024001"]
(vec (:document/normalized-references doc))))))))
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.integration-test]'
Expected: FAIL — normalized columns are nil
Step 3: Add DB function for updating normalized columns
In src/com/getorcha/db/document_matching.clj, add:
(defn set-normalized-fields!
"Update normalized counterparty and references for a document."
[db document-id counterparty references]
(db.sql/execute! db
{:update :document
:set {:normalized-counterparty counterparty
:normalized-references [:lift (or references [])]}
:where [:= :id document-id]}))
Step 4: Update match-document! to populate normalized columns
In src/com/getorcha/workers/matching/core.clj:
Add to requires:
[com.getorcha.workers.matching.normalize :as normalize]
At the beginning of match-document!, before clearing previous state, add:
;; Populate normalized columns for candidate retrieval
(let [counterparty (normalize/extract-counterparty doc)
references (normalize/extract-references doc)]
(db.matching/set-normalized-fields! db (:id doc) counterparty references))
Step 5: Run tests to verify they pass
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.integration-test]'
Expected: PASS (all tests including the new one)
Step 6: Commit
git add src/com/getorcha/workers/matching/core.clj src/com/getorcha/db/document_matching.clj test/com/getorcha/workers/matching/integration_test.clj
git commit -m "feat(matching): populate normalized columns during matching"
Files:
src/com/getorcha/workers/matching/candidates.clj: replace find-candidates-by-type with counterparty-filtered ranked query
test/com/getorcha/workers/matching/candidates_test.clj: update tests
This is the core of Layer 1. The new find-candidates filters by normalized counterparty, then ranks using PostgreSQL full-text rank (ts_rank) + reference overlap + amount proximity.
Step 1: Write the failing test
In test/com/getorcha/workers/matching/candidates_test.clj, add a test that requires counterparty filtering:
(deftest find-candidates-filters-by-counterparty-test
(testing "candidates are filtered to same normalized counterparty"
(let [le-id (helpers/create-legal-entity!)
;; Same supplier, different formatting
po-id (create-document-with-type!
le-id :purchase-order
{:po-number "PO-001"
:supplier {:name "ACME Corp GmbH"}
:total-value 10000})
;; Different supplier
other-po-id (create-document-with-type!
le-id :purchase-order
{:po-number "PO-002"
:supplier {:name "Other Corp"}
:total-value 10000})]
;; Set normalized columns (normally done by match-document!)
(db.matching/set-normalized-fields!
fixtures/*db* po-id "acme" ["po001"])
(db.matching/set-normalized-fields!
fixtures/*db* other-po-id "other" ["po002"])
(let [candidates (candidates/find-candidates
fixtures/*db* {}
{:id (random-uuid)
:type :invoice
:legal-entity-id le-id
:structured-data {:issuer {:name "Acme Corp"}}
:normalized-counterparty "acme"
:normalized-references ["inv001"]})]
;; Only the ACME PO should be returned, not Other Corp
(is (= 1 (count candidates)))
(is (= po-id (:document/id (first candidates))))))))
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.candidates-test]'
Expected: FAIL — current implementation doesn't filter by counterparty
Step 3: Rewrite find-candidates
Replace the content of src/com/getorcha/workers/matching/candidates.clj:
(ns com.getorcha.workers.matching.candidates
  "Candidate retrieval for document matching.
  Layer 1 of the matching pipeline. Filters by normalized counterparty name
  to narrow the search space, then ranks using full-text rank (ts_rank),
  reference number overlap, and amount proximity. Returns up to 50 candidates."
  (:require [clojure.string]
            [com.getorcha.db.sql :as db.sql]
            [com.getorcha.workers.matching.searchable-text :as searchable-text]))
(def ^:private matchable-pairs
"Valid document type pairs for matching."
#{#{:invoice :purchase-order}
#{:invoice :contract}
#{:purchase-order :contract}
#{:goods-received-note :purchase-order}})
(defn get-matchable-types
"Get document types that can match with the given type."
[doc-type]
(->> matchable-pairs
(filter #(contains? % doc-type))
(mapcat identity)
(remove #(= % doc-type))
set))
(defn find-candidates
"Find candidate documents for matching using counterparty-filtered ranked search.
The source document must include `:normalized-counterparty` and
`:normalized-references` keys (populated by `match-document!` before calling).
Ranking combines:
- Full-text rank (ts_rank) on searchable_text (commodity, quantities, descriptions)
- Reference number overlap (any normalized reference in common)
- Amount proximity (within 50% of source total)
Returns up to 50 candidate document rows, ordered by rank score descending."
[db _search-config doc]
(let [matchable-types (get-matchable-types (:type doc))
counterparty (:normalized-counterparty doc)
source-refs (:normalized-references doc)
source-total (or (get-in doc [:structured-data :total])
(get-in doc [:structured-data :total-value])
0)
query-text (searchable-text/build-searchable-text doc)]
(when counterparty
(db.sql/execute! db
{:select [:*
[[:+
;; Full-text rank (ts_rank) on searchable_text
[:* [:coalesce
[:ts_rank
[:to_tsvector [:inline "simple"] :searchable-text]
[:plainto_tsquery [:inline "simple"] query-text]]
0]
10]
;; Reference overlap bonus
[:case
[:and
[:is-not :normalized-references nil]
[:<> [:cast [:inline "{}"] :text] [:cast :normalized-references :text]]
[:raw (str "normalized_references && ARRAY["
(->> source-refs
(map #(str "'" (db.sql/escape-string %) "'"))
(clojure.string/join ","))
"]::text[]")]]
50
:else 0]
;; Amount proximity bonus
[:case
[:and
[:> source-total 0]
[:< [:/ [:abs [:- [:coalesce
[:cast [:-> :structured-data [:inline "total"]] :numeric]
[:cast [:-> :structured-data [:inline "total-value"]] :numeric]
0]
source-total]]
source-total]
0.5]]
10
:else 0]]
:rank-score]]
:from [:document]
:where [:and
[:= :legal-entity-id (:legal-entity-id doc)]
[:in :type (mapv #(db.sql/->cast % :document-type) matchable-types)]
[:= :normalized-counterparty counterparty]
[:is-not :structured-data nil]
[:<> :id (:id doc)]]
:order-by [[:rank-score :desc]]
:limit 50}))))
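If the HoneySQL form of the ranking expression proves hard to get right, the same query can be written as a parameterized SQL vector. The sketch below only builds the vector; it does not execute anything. The snake_case column names, the document_type enum name, the JSONB keys, and the helper names pg-array-literal and candidate-query are assumptions for illustration. Arrays are passed as PostgreSQL array literals and cast server-side, and NULLIF guards the division when the source total is zero.

```clojure
(require '[clojure.string :as str])

(def candidate-sql
  "Ranking query as plain SQL; weights mirror the HoneySQL version."
  (str/join
   "\n"
   ["SELECT d.*,"
    "       COALESCE(ts_rank(to_tsvector('simple', d.searchable_text),"
    "                        plainto_tsquery('simple', ?)), 0) * 10"
    "       + CASE WHEN d.normalized_references && ?::text[] THEN 50 ELSE 0 END"
    "       + CASE WHEN abs(COALESCE((d.structured_data->>'total')::numeric,"
    "                                (d.structured_data->>'total-value')::numeric,"
    "                                0) - ?::numeric)"
    "                   / NULLIF(?::numeric, 0) < 0.5"
    "              THEN 10 ELSE 0 END AS rank_score"
    "FROM document d"
    "WHERE d.legal_entity_id = ?"
    "  AND d.type = ANY(?::document_type[])"
    "  AND d.normalized_counterparty = ?"
    "  AND d.structured_data IS NOT NULL"
    "  AND d.id <> ?"
    "ORDER BY rank_score DESC"
    "LIMIT 50"]))

(defn pg-array-literal
  "Render strings as a PostgreSQL array literal such as {a,b}.
  Only accepts values made of letters, digits, underscores, and hyphens,
  so no quoting is needed."
  [xs]
  {:pre [(every? #(re-matches #"[A-Za-z0-9_-]+" %) xs)]}
  (str "{" (str/join "," xs) "}"))

(defn candidate-query
  "Build [sql & params] for the candidate search. Parameter order must
  match the order of the ? placeholders in candidate-sql."
  [{:keys [id legal-entity-id normalized-counterparty normalized-references]}
   matchable-types query-text source-total]
  [candidate-sql
   query-text
   (pg-array-literal normalized-references)
   source-total
   source-total
   legal-entity-id
   (pg-array-literal (map name matchable-types))
   normalized-counterparty
   id])
```

Normalized references contain only a-z and 0-9, so they always satisfy the precondition in pg-array-literal; whether the enum labels match the keyword names is something to verify against the schema.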
Important: The HoneySQL for the ranking expression is complex. The exact syntax may need adjustment depending on how db.sql handles raw SQL fragments. If HoneySQL can't express the full ranking query cleanly, use db.sql/execute! with a raw SQL string and parameters instead. The key behavior to preserve:
normalized_counterparty = :counterparty
Step 4: Run all matching tests
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.candidates-test]'
Expected: PASS
Then run the full matching test suite to verify no regressions:
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.integration-test]'
Expected: PASS — existing tests must still pass. They create documents with matching supplier names, so the counterparty filter should not break them. However, the integration tests will also need to populate the normalized columns on ALL test documents (not just the source). Update create-document-with-type! in the integration test to also set normalized columns, or update match-document! to normalize ALL candidates it encounters.
Design note: match-document! currently only normalizes the source document. Candidates need normalized columns too. The simplest approach: when match-document! finds candidates with NULL normalized_counterparty, backfill them inline. This handles the transition period and existing data. Add to match-document! after finding candidates:
;; Backfill normalized columns on candidates that don't have them yet
(doseq [row candidate-rows
:when (nil? (:document/normalized-counterparty row))]
(let [candidate-doc (candidate-row->doc row)]
(db.matching/set-normalized-fields!
db (:document/id row)
(normalize/extract-counterparty candidate-doc)
(normalize/extract-references candidate-doc))))
Step 5: Commit
git add src/com/getorcha/workers/matching/candidates.clj test/com/getorcha/workers/matching/candidates_test.clj
git commit -m "feat(matching): rewrite candidate retrieval with counterparty filter and ranked search"
Files:
src/com/getorcha/workers/matching/evidence.clj: add quantity, date, description, currency signals
test/com/getorcha/workers/matching/evidence_test.clj: add tests for each new signal
src/com/getorcha/schema/matching.clj: add new signal enum values
This expands Layer 2 scoring with the signals needed for the no-reference matching case.
Step 1: Update the Signal schema
In src/com/getorcha/schema/matching.clj, expand the Signal enum:
(def Signal
"Evidence signal type."
[:enum
:po-number-exact
:contract-ref-exact
:po-ref-exact
:vat-id-match
:iban-match
:amount-within-2pct
:amount-within-5pct
:supplier-name-fuzzy
:date-in-validity
:date-within-period
:delivery-date-match
:quantity-exact
:description-overlap
:currency-mismatch
:vat-id-mismatch])
Step 2: Write failing tests for new signals
In test/com/getorcha/workers/matching/evidence_test.clj, add tests. Key tests:
(deftest quantity-exact-signal-test
(testing "fires when invoice line-item quantity matches contract deliverable"
(let [{:keys [evidence]} (evidence/compute-score
{:type :contract
:structured-data {:deliverables ["1.200.000 kWh (HS,N) Biomethan am Gastag 22.10.2025"]
:party-b {:name "ABO"}}}
{:type :invoice
:structured-data {:line-items [{:quantity 1200000.0 :unit "kWh Hs"}]
:issuer {:name "ABO"}}})]
(is (some #(= :quantity-exact (:signal %)) evidence)))))
(deftest currency-mismatch-signal-test
(testing "fires negative signal when currencies differ"
(let [{:keys [evidence]} (evidence/compute-score
{:type :invoice
:structured-data {:currency "EUR" :issuer {:name "X"}}}
{:type :purchase-order
:structured-data {:currency "USD" :supplier {:name "X"}}})]
(is (some #(= :currency-mismatch (:signal %)) evidence)))))
(deftest supplier-name-fuzzy-signal-test
(testing "fires when normalized supplier names have >0.8 Jaro-Winkler similarity"
(let [{:keys [evidence]} (evidence/compute-score
{:type :invoice
:structured-data {:issuer {:name "ABO Kraft & Wärme Ramstein GmbH & Co.KG"}}}
{:type :contract
:structured-data {:party-b {:name "ABO Kraft & Wärme Ramstein GmbH & Co KG"}}})]
(is (some #(= :supplier-name-fuzzy (:signal %)) evidence)))))
Step 3: Run tests to verify they fail
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.evidence-test]'
Expected: FAIL — new signals not implemented
Step 4: Implement new signals in collect-signals
In src/com/getorcha/workers/matching/evidence.clj:
Add to requires:
[clojure.set]
[clojure.string]
[com.getorcha.util.text :as text]
Add import for Jaro-Winkler:
(:import (org.apache.commons.text.similarity JaroWinklerSimilarity))
Update evidence-signals map to match the design doc (add :quantity-exact 35, :date-within-period 20, :delivery-date-match 25, :description-overlap 10, :currency-mismatch -30; remove :medium from match-thresholds).
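As a sketch of that update, the new weights can be merged into whatever the map currently holds. The weights below are the ones listed in this step; the function names and the example input maps in the comments are hypothetical, since the existing contents of evidence-signals and match-thresholds are not shown in this plan.

```clojure
(def new-signal-weights
  "Weights for the signals added in this task, as listed in the plan."
  {:quantity-exact       35
   :date-within-period   20
   :delivery-date-match  25
   :description-overlap  10
   :currency-mismatch   -30})

(defn updated-evidence-signals
  "Merge the new weights into an existing signal-weight map.
  New values win if a key already exists."
  [existing]
  (merge existing new-signal-weights))

(defn updated-match-thresholds
  "Drop the :medium threshold, keeping all others unchanged."
  [existing]
  (dissoc existing :medium))

;; Example with made-up existing values:
;; (updated-evidence-signals {:vat-id-match 40})
;; (updated-match-thresholds {:high 0.7 :medium 0.5})
```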
Add helper functions:
(def ^:private jaro-winkler (JaroWinklerSimilarity.))
(defn ^:private get-currency
[{:keys [structured-data]}]
(:currency structured-data))
(defn ^:private get-counterparty-name
"Extract raw counterparty name from document."
[{:keys [type structured-data]}]
(case type
:invoice (get-in structured-data [:issuer :name])
:purchase-order (get-in structured-data [:supplier :name])
:contract (get-in structured-data [:party-b :name])
:goods-received-note (get-in structured-data [:supplier :name])
nil))
(defn ^:private get-quantities
"Extract numeric quantities from document line items and deliverables."
[{:keys [type structured-data]}]
(case type
:invoice
(->> (:line-items structured-data)
(keep :quantity)
set)
:purchase-order
(->> (:line-items structured-data)
(keep :quantity)
set)
:contract
;; Parse quantities from deliverables strings (e.g., "1.200.000 kWh")
(->> (:deliverables structured-data)
(keep #(when-let [m (re-find #"[\d.,]+" %)]
(try
(-> m
(clojure.string/replace "." "")
(clojure.string/replace "," ".")
Double/parseDouble)
(catch Exception _ nil))))
set)
:goods-received-note
(->> (:line-items structured-data)
(keep :quantity-received)
set)
#{}))
Add new signal checks in collect-signals:
;; Supplier name fuzzy match (Jaro-Winkler on normalized names)
(let [name-a (some-> (get-counterparty-name doc-a) text/normalize-supplier-name)
name-b (some-> (get-counterparty-name doc-b) text/normalize-supplier-name)]
(when (and name-a name-b
(> (.apply jaro-winkler name-a name-b) 0.8))
(conj! signals {:signal :supplier-name-fuzzy
:value (str name-a " ~ " name-b)
:weight (:supplier-name-fuzzy evidence-signals)})))
;; Quantity exact match
(let [qtys-a (get-quantities doc-a)
qtys-b (get-quantities doc-b)
common (clojure.set/intersection qtys-a qtys-b)]
(when (seq common)
(conj! signals {:signal :quantity-exact
:value (str (first common))
:weight (:quantity-exact evidence-signals)})))
;; Currency mismatch
(let [curr-a (get-currency doc-a)
curr-b (get-currency doc-b)]
(when (and curr-a curr-b (not= curr-a curr-b))
(conj! signals {:signal :currency-mismatch
:value (str curr-a " vs " curr-b)
:weight (:currency-mismatch evidence-signals)})))
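One caveat for the quantity check: Clojure sets treat 1200000 (a long) and 1200000.0 (a double) as different elements, so an integer quantity from one document will not intersect with a parsed double from another. The sketch below coerces both sides to doubles first. The helper names are hypothetical, and whether structured data actually yields mixed numeric types depends on the JSON decoding in use, which this plan does not specify.

```clojure
(require '[clojure.set :as set])

(defn normalize-quantities
  "Coerce every numeric quantity to a double so that 1200000 and
  1200000.0 land on the same set element. Non-numbers are dropped."
  [qtys]
  (into #{} (comp (filter number?) (map double)) qtys))

(defn common-quantities
  "Quantities present in both collections after numeric coercion."
  [qtys-a qtys-b]
  (set/intersection (normalize-quantities qtys-a)
                    (normalize-quantities qtys-b)))

;; Without coercion the sets do not overlap:
(set/intersection #{1200000} #{1200000.0})   ;; => #{}
;; With coercion they do:
(common-quantities #{1200000} #{1200000.0})  ;; => #{1200000.0}
```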
Note on date signals (:date-within-period, :delivery-date-match): These require parsing contract effective/expiration dates and invoice service periods. Implement the comparison but keep it straightforward — compare date strings directly since they're in ISO format ("2025-10-01" compares correctly as a string for basic range checks). The parsing of contract deliverable dates from free-text strings is more involved; use a simple regex for "Gastag DD.MM.YYYY" patterns.
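A minimal sketch of both pieces described in this note, assuming ISO "YYYY-MM-DD" strings for the period bounds and the "Gastag DD.MM.YYYY" wording seen in the test data. The function names are hypothetical.

```clojure
(defn gastag-date
  "Extract the first 'Gastag DD.MM.YYYY' date from free text and return
  it as an ISO 'YYYY-MM-DD' string, or nil when no such date is present."
  [s]
  (when-let [[_ day month year]
             (re-find #"Gastag (\d{2})\.(\d{2})\.(\d{4})" (or s ""))]
    (str year "-" month "-" day)))

(defn iso-date-within?
  "True when date lies in [start, end], comparing ISO date strings
  lexicographically. Returns false if any argument is nil."
  [date start end]
  (boolean
   (and date start end
        (<= (compare start date) 0)
        (<= (compare date end) 0))))

(gastag-date "1.200.000 kWh (HS,N) Biomethan am Gastag 22.10.2025")
;; => "2025-10-22"
(iso-date-within? "2025-10-22" "2025-10-01" "2025-10-31")
;; => true
```

String comparison is only valid while every date is zero-padded ISO; anything else should be parsed with java.time first.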
Note on :description-overlap: Compare the document description/title fields. For invoices, use the first line-item description. For contracts, use the title. Check for shared significant words (excluding stop words). Fire the signal if >50% of significant words overlap.
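One way to read the 50% rule is "more than half of the words in the shorter description also appear in the other one". The sketch below implements that reading. The stop-word list, the choice of the smaller set as denominator, and the function names are all assumptions; the plan does not fix any of them.

```clojure
(require '[clojure.set :as set]
         '[clojure.string :as str])

;; Hypothetical stop-word list covering a few German and English words.
(def stop-words
  #{"von" "der" "die" "das" "und" "the" "of" "and" "for"})

(defn significant-words
  "Lowercase, split on anything that is not a letter or digit, and drop
  blanks and stop words. Returns a set of words."
  [s]
  (->> (str/split (str/lower-case (or s "")) #"[^\p{L}\p{N}]+")
       (remove str/blank?)
       (remove stop-words)
       set))

(defn description-overlap?
  "True when more than half of the smaller word set is shared."
  [desc-a desc-b]
  (let [a       (significant-words desc-a)
        b       (significant-words desc-b)
        smaller (min (count a) (count b))]
    (and (pos? smaller)
         (> (/ (count (set/intersection a b)) smaller) 1/2))))

(description-overlap? "Lieferung von Biomethan" "Biomethan Lieferung 2025")
;; => true
```

Using the smaller set as the denominator keeps a short invoice line from being penalized by a long contract title; using the union instead would be stricter.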
Step 5: Run tests
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.evidence-test]'
Expected: PASS
Step 6: Commit
git add src/com/getorcha/workers/matching/evidence.clj test/com/getorcha/workers/matching/evidence_test.clj src/com/getorcha/schema/matching.clj
git commit -m "feat(matching): add quantity, date, description, and currency evidence signals"
Files:
src/com/getorcha/workers/matching/core.clj: simplify decide-matches
test/com/getorcha/workers/matching/core_test.clj: update tests
The new rule: 0 candidates = no match; 1 candidate ≥ high = auto-match; anything else = top 3 to LLM.
Step 1: Write the failing test
(deftest decide-matches-sends-top-3-to-llm-test
(testing "sends only top 3 candidates to LLM when multiple exist"
(let [candidates (mapv #(hash-map :doc {:document/id (random-uuid)}
:score %
:evidence [])
[0.65 0.60 0.55 0.50 0.45])
llm-calls (atom [])
llm-config {:matching {:provider :test}}]
(with-redefs [llm-decision/llm-match-decision
(fn [_ _ cands]
(reset! llm-calls cands)
{:matches [{:candidate 1 :confidence "high" :reasoning "test"}]})]
(matching/decide-matches llm-config {} candidates)
;; LLM should receive only top 3, not all 5
(is (= 3 (count @llm-calls)))))))
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.core-test]'
Expected: FAIL — current implementation sends all candidates
Step 3: Update decide-matches
(defn decide-matches
  "Decide which candidates to match based on scores.
  Rules:
  - 0 candidates → no match
  - exactly 1 candidate ≥ high threshold → auto-match (rule-based)
  - Anything else → top 3 to LLM (if config provided)
  Returns seq of `{:doc :score :evidence :match-method}`, or nil."
  [llm-config source-doc candidates]
  (when (seq candidates)
    (let [high-candidates (filter #(>= (:score %) (:high evidence/match-thresholds)) candidates)]
      (cond
        ;; Single clear winner above high threshold: no LLM needed
        (= 1 (count high-candidates))
        [(assoc (first high-candidates) :match-method "rule-based")]
        ;; Any other case: LLM decides on the top 3.
        ;; Assumes candidates arrive sorted by score, highest first.
        llm-config
        (let [top-3 (take 3 candidates)]
          (resolve-llm-matches
           (llm-decision/llm-match-decision llm-config source-doc top-3)
           top-3))
        ;; No LLM config and no single clear winner: no match.
        ;; (The original third branch repeated the first condition and
        ;; could never be reached.)
        :else nil))))
Step 4: Run all core tests
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.core-test]'
Expected: PASS
Step 5: Commit
git add src/com/getorcha/workers/matching/core.clj test/com/getorcha/workers/matching/core_test.clj
git commit -m "feat(matching): simplify LLM trigger to top-3 candidates rule"
match-document! to pass normalized data through
Files:
src/com/getorcha/workers/matching/core.clj: pass normalized fields to find-candidates
src/com/getorcha/workers/matching/worker.clj: include normalized fields in the doc map passed to match-document!
Step 1: Update match-document! to attach normalized fields to the doc before finding candidates
In match-document!, after populating the normalized columns in the DB, also attach them to the doc map so find-candidates can use them:
(defn match-document!
[db search-config llm-config doc]
{:pre [(m/validate schema.matching/SourceDocument doc)]}
;; Populate normalized columns
(let [counterparty (normalize/extract-counterparty doc)
references (normalize/extract-references doc)
doc (assoc doc
:normalized-counterparty counterparty
:normalized-references references)]
(db.matching/set-normalized-fields! db (:id doc) counterparty references)
;; Clear previous state
(db.matching/delete-matches-for-document! db (:id doc))
(db.matching/set-cluster-id! db [(:id doc)] nil)
(if (nil? counterparty)
(log/warn "No counterparty extracted, skipping matching"
{:document-id (:id doc) :type (:type doc)})
(let [candidate-rows (candidates/find-candidates db search-config doc)]
;; ... rest of existing logic unchanged
))))
Step 2: Run full integration test suite
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.integration-test]'
Expected: PASS — all existing tests must still pass
Step 3: Commit
git add src/com/getorcha/workers/matching/core.clj
git commit -m "feat(matching): wire normalized fields through match-document! pipeline"
Files:
resources/migrations/YYYYMMDDHHMMSS-backfill-normalized-matching-columns.up.sql
resources/migrations/YYYYMMDDHHMMSS-backfill-normalized-matching-columns.down.sql
For existing documents that already have structured data, backfill the normalized columns. This is a SQL-only migration that handles the common cases. Edge cases (complex company suffixes) will be corrected when documents are re-matched.
Step 1: Create the migration
Run: bb migrate create "backfill-normalized-matching-columns"
Step 2: Write the up migration
The backfill needs to extract counterparty names from JSONB and normalize them. SQL can handle the basic normalization (lowercase, trim) but not the full normalize-supplier-name logic (umlaut transliteration, suffix stripping). Two approaches:
Option A (recommended): Write a Clojure backfill script that runs after migration, using the actual normalization functions. Create scripts/backfill_normalized_columns.clj as a Babashka-compatible script or a REPL-evaluated snippet.
Option B: SQL-only approximation — lowercase + trim the counterparty name. This gets most matches working but won't handle umlauts or company suffixes.
Go with Option A. The migration itself just ensures the columns exist (already done in Task 1). Add a REPL-executable backfill function:
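In src/com/getorcha/workers/matching/normalize.clj, add [com.getorcha.db.document-matching :as db.matching] to the requires (the function below calls db.matching/set-normalized-fields!), then add: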
(defn backfill-document!
"Backfill normalized columns for a single document row.
Used for migrating existing data."
[db {:keys [document/id document/type document/structured-data] :as _row}]
(let [doc {:type (keyword type)
:structured-data structured-data}]
(db.matching/set-normalized-fields!
db id
(extract-counterparty doc)
(extract-references doc))))
Document the backfill process in a comment block:
(comment
;; Backfill all documents with structured data:
;; (require '[com.getorcha.db.sql :as db.sql])
;; (let [db (:com.getorcha.db/pool integrant.repl.state/system)
;; docs (db.sql/execute! db {:select [:id :type :structured-data]
;; :from [:document]
;; :where [:and
;; [:is-not :structured-data nil]
;; [:is :normalized-counterparty nil]]})]
;; (doseq [doc docs]
;; (backfill-document! db doc))
;; (count docs))
)
Step 3: The up migration is a no-op (columns already added in Task 1). Create empty file or add a comment:
-- Backfill is done via REPL using com.getorcha.workers.matching.normalize/backfill-document!
-- See the (comment ...) block in that namespace for instructions.
SELECT 1;
Step 4: Commit
git add src/com/getorcha/workers/matching/normalize.clj resources/migrations/*backfill*
git commit -m "feat(matching): add backfill function for normalized matching columns"
Files:
test/com/getorcha/workers/matching/integration_test.clj: add test based on the ABO Kraft/Carbon Farming test case
This validates the entire three-layer pipeline works for the no-reference case.
Step 1: Write the integration test
(deftest contract-invoice-matching-without-references-test
(testing "invoice matches contract via supplier identity, quantity, and date alignment"
(let [le-id (helpers/create-legal-entity!)
contract-id (create-document-with-type!
le-id :contract
{:contract-number "CFG-ABO-001"
:party-b {:name "ABO Kraft & Wärme Ramstein GmbH & Co KG"}
:total-value 250000
:currency "EUR"
:effective-date "2025-10-20"
:deliverables ["1.200.000 kWh (HS,N) Biomethan am Gastag 22.10.2025"
"800.000 kWh (HS,N) Biomethan am Gastag 08.01.2026"]})
inv-id (create-document-with-type!
le-id :invoice
{:invoice-number "2025-029RAM"
:issuer {:name "ABO Kraft & Wärme Ramstein GmbH & Co.KG"
:vat-id "DE302232673"}
:total 157277.93
:currency "EUR"
:service-period {:start "2025-10-01" :end "2025-10-31"}
:line-items [{:description "Lieferung von Biomethan"
:quantity 1200000.0
:unit "kWh Hs"
:amount 157277.93}]})
;; First match the contract so its normalized columns get populated
contract-source {:id contract-id
:type :contract
:legal-entity-id le-id
:structured-data {:contract-number "CFG-ABO-001"
:party-b {:name "ABO Kraft & Wärme Ramstein GmbH & Co KG"}
:total-value 250000
:currency "EUR"
:effective-date "2025-10-20"
:deliverables ["1.200.000 kWh (HS,N) Biomethan am Gastag 22.10.2025"
"800.000 kWh (HS,N) Biomethan am Gastag 08.01.2026"]}}
;; Then match the invoice
invoice-source {:id inv-id
:type :invoice
:legal-entity-id le-id
:structured-data {:invoice-number "2025-029RAM"
:issuer {:name "ABO Kraft & Wärme Ramstein GmbH & Co.KG"
:vat-id "DE302232673"}
:total 157277.93
:currency "EUR"
:service-period {:start "2025-10-01" :end "2025-10-31"}
:line-items [{:description "Lieferung von Biomethan"
:quantity 1200000.0
:unit "kWh Hs"
:amount 157277.93}]}}]
;; Run matching on contract first (populates its normalized columns)
(matching/match-document! fixtures/*db* {} nil contract-source)
;; Run matching on invoice
(matching/match-document! fixtures/*db* {} nil invoice-source)
;; Verify match was created
(let [matches (db.matching/get-matches-for-document fixtures/*db* inv-id)]
(is (= 1 (count matches)) "Invoice should match exactly one contract")
(let [match (first matches)]
(is (>= (:document-match/confidence match) 0.7M)
"Score should be above high threshold (quantity + date + name + description)")
(is (= "rule-based" (:document-match/match-method match))
"Should auto-match without LLM"))))))
Step 2: Run test
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.integration-test]'
Expected: PASS
Step 3: Commit
git add test/com/getorcha/workers/matching/integration_test.clj
git commit -m "test(matching): add integration test for contract-invoice matching without references"
Step 1: Run all matching tests
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.normalize-test com.getorcha.workers.matching.evidence-test com.getorcha.workers.matching.candidates-test com.getorcha.workers.matching.core-test com.getorcha.workers.matching.integration-test com.getorcha.db.document-matching-test]'
Expected: All PASS
Step 2: Run the full project test suite
Run: clj -X:test:silent 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Execution error|failed because|Ran .* tests)"
Expected: No failures related to matching changes. Pre-existing failures unrelated to this work are acceptable.
Step 3: Final commit if any cleanup needed
Task 1 (migration) ──────────────────────────────────┐
│
Task 2 (normalize functions) ─────────────┐ │
│ │
Task 3 (populate during matching) ────────┤ │
│ │
Task 4 (candidate retrieval rewrite) ─────┤ │
├── Task 9 (integration test)
Task 5 (new evidence signals) ────────────┤ │
│ │
Task 6 (simplify LLM trigger) ───────────┤ │
│ │
Task 7 (wire through match-document!) ────┤ │
│ │
Task 8 (backfill) ────────────────────────┘ │
│
Task 10 (full test suite) ──┘
Tasks 1 and 2 can run in parallel. Tasks 3-8 depend on both. Task 9 depends on all of 3-8. Task 10 is final verification.