Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Automatically match related financial documents (Invoice ↔ PO ↔ Contract ↔ GRN) using hybrid search and LLM-assisted decision making.
Architecture: Async SQS worker consumes document-ready events, uses hybrid search (BM25 + semantic) to find candidates, scores with evidence signals, and uses LLM for ambiguous cases. Pairwise match edges stored in document_match table with denormalized cluster_id on documents.
Tech Stack: Clojure, PostgreSQL (pgvector, pg_trgm), HoneySQL, Integrant, Google Vertex AI embeddings, LLM via workers/llm.clj
Design Doc: docs/plans/2026-02-24-document-matching-design.md
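Because match edges are pairwise and cluster_id is denormalized onto documents, a cluster is effectively a connected component of the match graph. The following is a minimal sketch of that idea, not the worker's actual algorithm: connected-components is a hypothetical helper, edges are assumed to arrive as [a b] pairs, and the real implementation may assign cluster ids incrementally instead of recomputing them.

```clojure
(require '[clojure.set :as set])

(defn- neighbors
  "Adjacency map {node #{neighbor ...}} built from undirected [a b] edges."
  [edges]
  (reduce (fn [adj [a b]]
            (-> adj
                (update a (fnil conj #{}) b)
                (update b (fnil conj #{}) a)))
          {}
          edges))

(defn connected-components
  "Hypothetical helper: returns a set of sets, one inner set per cluster."
  [edges]
  (let [adj (neighbors edges)]
    (loop [unvisited (set (keys adj))
           clusters  #{}]
      (if (empty? unvisited)
        clusters
        (let [start     (first unvisited)
              ;; Breadth-first walk from start until no new nodes appear.
              component (loop [frontier #{start}
                               seen     #{start}]
                          (if (empty? frontier)
                            seen
                            (let [next-nodes (set/difference
                                              (into #{} (mapcat adj) frontier)
                                              seen)]
                              (recur next-nodes (into seen next-nodes)))))]
          (recur (set/difference unvisited component)
                 (conj clusters component)))))))

(connected-components [[:inv :po] [:po :contract] [:grn-1 :po-2]])
;; => #{#{:inv :po :contract} #{:grn-1 :po-2}}
```

Every document in one inner set would share a cluster_id; an edge that links two existing clusters merges them into one.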
Files:
resources/migrations/20260224100000-add-document-matching.up.sql
resources/migrations/20260224100000-add-document-matching.down.sql
Step 1: Create up migration
-- resources/migrations/20260224100000-add-document-matching.up.sql
-- Add cluster_id and search columns to document table
ALTER TABLE document ADD COLUMN cluster_id uuid;
ALTER TABLE document ADD COLUMN searchable_text text;
ALTER TABLE document ADD COLUMN embedding vector(768);
CREATE INDEX idx_document_cluster_id ON document(cluster_id);
CREATE INDEX idx_document_embedding ON document USING hnsw (embedding vector_cosine_ops);
CREATE INDEX idx_document_searchable_text ON document USING gin (to_tsvector('simple', searchable_text));
-- Pairwise match edges (source of truth for document relationships)
CREATE TABLE document_match (
document_a_id uuid REFERENCES document(id) ON DELETE CASCADE,
document_b_id uuid REFERENCES document(id) ON DELETE CASCADE,
confidence decimal(5,4),
match_method text NOT NULL,
evidence jsonb,
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now(),
PRIMARY KEY (document_a_id, document_b_id),
CHECK (document_a_id < document_b_id)
);
CREATE INDEX idx_document_match_b ON document_match(document_b_id);
CREATE INDEX idx_document_match_created ON document_match(created_at);
-- Trigger for updated_at
CREATE TRIGGER trigger_document_match_updated_at
BEFORE UPDATE ON document_match
FOR EACH ROW
EXECUTE FUNCTION update_updated_at_column();
Step 2: Create down migration
-- resources/migrations/20260224100000-add-document-matching.down.sql
DROP TRIGGER IF EXISTS trigger_document_match_updated_at ON document_match;
DROP TABLE IF EXISTS document_match;
DROP INDEX IF EXISTS idx_document_searchable_text;
DROP INDEX IF EXISTS idx_document_embedding;
DROP INDEX IF EXISTS idx_document_cluster_id;
ALTER TABLE document DROP COLUMN IF EXISTS embedding;
ALTER TABLE document DROP COLUMN IF EXISTS searchable_text;
ALTER TABLE document DROP COLUMN IF EXISTS cluster_id;
Step 3: Run migration
Run: bb migrate
Expected: Migration applies successfully
Step 4: Verify schema
Run: psql -h localhost -U postgres -d orcha -c "\d document_match"
Expected: Table with columns document_a_id, document_b_id, confidence, match_method, evidence, created_at, updated_at
Step 5: Commit
git add resources/migrations/20260224100000-add-document-matching.up.sql resources/migrations/20260224100000-add-document-matching.down.sql
git commit -m "feat(matching): add document_match table and search columns"
Files:
resources/com/getorcha/config.edn
scripts/init_aws.clj
test/com/getorcha/test/fixtures.clj
Step 1: Add queue to config.edn
Find the :queues map in :com.getorcha/aws and add the matching queue:
:queues {:ingestion "v1-orcha-global-ingest"
:acquisition "v1-orcha-global-email-acquire"
:matching "v1-orcha-global-doc-matching"}
Step 2: Add queue extraction in init_aws.clj
After the existing queue extractions (~line 96), add:
(def sqs-matching-queue (get-in config [:com.getorcha/aws :queues :matching]))
Step 3: Create queue in init_aws.clj
In create-sqs-queues function, add:
(create-queue-with-dlq! sqs-matching-queue)
Step 4: Update print statement in init_aws.clj
Update the queue print line to include matching queue:
(println " SQS Queues:" sqs-ingestion-queue "," sqs-acquisition-queue "," sqs-matching-queue)
Step 5: Add queue creation to test fixtures.clj
In with-running-system, find where queues are created and add the matching queue:
(doseq [queue-name [(get-in config [::aws/state :queues :ingestion])
(get-in config [::aws/state :queues :acquisition])
(get-in config [::aws/state :queues :matching])]]
(.createQueue sqs-client
^CreateQueueRequest
(.build (doto (CreateQueueRequest/builder)
(.queueName queue-name)))))
Step 6: Verify LocalStack setup
Run: bb dev:init-aws
Expected: Output shows matching queue created
Step 7: Commit
git add resources/com/getorcha/config.edn scripts/init_aws.clj test/com/getorcha/test/fixtures.clj
git commit -m "feat(matching): add SQS queue for document matching"
Files:
src/com/getorcha/db/document_matching.clj
test/com/getorcha/db/document_matching_test.clj
Step 1: Write test for get-matches-for-document
;; test/com/getorcha/db/document_matching_test.clj
(ns com.getorcha.db.document-matching-test
(:require [clojure.test :refer [deftest is testing use-fixtures]]
[com.getorcha.db.document-matching :as matching]
[com.getorcha.test.fixtures :as fixtures]))
(use-fixtures :once fixtures/with-running-system)
(use-fixtures :each fixtures/with-db-rollback)
(deftest get-matches-for-document-test
(testing "returns empty when no matches exist"
(let [doc-id (random-uuid)]
(is (empty? (matching/get-matches-for-document fixtures/*db* doc-id)))))
(testing "returns matches for document as either a or b"
(let [le-id (:legal-entity/id (fixtures/create-legal-entity!))
doc-a (fixtures/create-document! {:legal-entity-id le-id :type "invoice"})
doc-b (fixtures/create-document! {:legal-entity-id le-id :type "purchase-order"})
doc-c (fixtures/create-document! {:legal-entity-id le-id :type "contract"})
_ (matching/create-match! fixtures/*db*
{:document-a-id (:document/id doc-a)
:document-b-id (:document/id doc-b)
:confidence 0.85M
:match-method "rule-based"
:evidence []})
_ (matching/create-match! fixtures/*db*
{:document-a-id (:document/id doc-b)
:document-b-id (:document/id doc-c)
:confidence 0.72M
:match-method "llm"
:evidence []})]
(is (= 1 (count (matching/get-matches-for-document fixtures/*db* (:document/id doc-a)))))
(is (= 2 (count (matching/get-matches-for-document fixtures/*db* (:document/id doc-b))))))))
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.db.document-matching-test]'
Expected: FAIL - namespace not found
Step 3: Write minimal implementation
;; src/com/getorcha/db/document_matching.clj
(ns com.getorcha.db.document-matching
"Database operations for document matching."
(:require [com.getorcha.db.sql :as db.sql]
[honey.sql.helpers :as h]))
(defn get-matches-for-document
"Get all matches for a document (as either document_a or document_b)."
[db document-id]
(db.sql/execute! db
(-> (h/select :*)
(h/from :document-match)
(h/where [:or
[:= :document-a-id document-id]
[:= :document-b-id document-id]])
(h/order-by [:confidence :desc]))))
(defn create-match!
"Create a match between two documents. Ensures canonical ordering (a < b)."
[db {:keys [document-a-id document-b-id confidence match-method evidence]}]
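;; Sort by string form so the order matches PostgreSQL's uuid comparison;
;; java.util.UUID's compareTo compares signed longs and can disagree with it.
(let [[id-a id-b] (sort-by str [document-a-id document-b-id])]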
(db.sql/execute-one! db
(-> (h/insert-into :document-match)
(h/values [{:document-a-id id-a
:document-b-id id-b
:confidence confidence
:match-method match-method
:evidence [:lift evidence]}])
(h/on-conflict :document-a-id :document-b-id)
(h/do-update-set :confidence :match-method :evidence :updated-at)))))
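The CHECK (document_a_id < document_b_id) constraint is evaluated with PostgreSQL's uuid ordering, which compares the bytes as unsigned values. java.util.UUID's compareTo compares the two halves as signed longs, so the two orderings can disagree for ids whose first hex digit is 8 or higher. The sketch below illustrates the difference; canonical-pair is a hypothetical helper and the two ids are made up for the example.

```clojure
(import 'java.util.UUID)

(defn canonical-pair
  "Hypothetical helper: orders two UUIDs by their lowercase string form,
  which agrees with a byte-wise unsigned comparison."
  [id-1 id-2]
  (vec (sort-by str [id-1 id-2])))

(def high-id (UUID/fromString "f0000000-0000-0000-0000-000000000000"))
(def low-id  (UUID/fromString "00000000-0000-0000-0000-000000000001"))

;; String order puts the id starting with 0 first.
(canonical-pair high-id low-id)

;; compare delegates to UUID.compareTo, which sees the leading f as a
;; negative signed long and therefore ranks high-id below low-id.
(neg? (compare high-id low-id))
```

Ordering the pair by string form before the insert keeps the (a, b) tuple consistent with what the constraint expects.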
Step 4: Run test to verify it passes
Run: clj -X:test:silent :nses '[com.getorcha.db.document-matching-test]'
Expected: PASS
Step 5: Commit
git add src/com/getorcha/db/document_matching.clj test/com/getorcha/db/document_matching_test.clj
git commit -m "feat(matching): add get-matches-for-document and create-match! db operations"
Files:
src/com/getorcha/db/document_matching.clj
test/com/getorcha/db/document_matching_test.clj
Step 1: Write test for cluster operations
;; Add to test/com/getorcha/db/document_matching_test.clj
(deftest cluster-operations-test
(testing "set-cluster-id! updates documents"
(let [le-id (:legal-entity/id (fixtures/create-legal-entity!))
doc-a (fixtures/create-document! {:legal-entity-id le-id :type "invoice"})
doc-b (fixtures/create-document! {:legal-entity-id le-id :type "purchase-order"})
cluster-id (random-uuid)]
(matching/set-cluster-id! fixtures/*db* [(:document/id doc-a) (:document/id doc-b)] cluster-id)
(let [docs (matching/get-documents-by-ids fixtures/*db* [(:document/id doc-a) (:document/id doc-b)])]
(is (every? #(= cluster-id (:document/cluster-id %)) docs)))))
(testing "get-cluster-documents returns all documents in cluster"
(let [le-id (:legal-entity/id (fixtures/create-legal-entity!))
doc-a (fixtures/create-document! {:legal-entity-id le-id :type "invoice"})
doc-b (fixtures/create-document! {:legal-entity-id le-id :type "purchase-order"})
doc-c (fixtures/create-document! {:legal-entity-id le-id :type "contract"})
cluster-id (random-uuid)]
(matching/set-cluster-id! fixtures/*db* [(:document/id doc-a) (:document/id doc-b)] cluster-id)
(let [cluster-docs (matching/get-cluster-documents fixtures/*db* cluster-id)]
(is (= 2 (count cluster-docs)))
(is (not (some #(= (:document/id doc-c) (:document/id %)) cluster-docs)))))))
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.db.document-matching-test]'
Expected: FAIL - set-cluster-id! not found
Step 3: Add cluster operations to implementation
;; Add to src/com/getorcha/db/document_matching.clj
(defn set-cluster-id!
"Set cluster_id for multiple documents."
[db document-ids cluster-id]
(db.sql/execute! db
(-> (h/update :document)
(h/set {:cluster-id cluster-id})
(h/where [:in :id document-ids]))))
(defn get-cluster-documents
"Get all documents in a cluster."
[db cluster-id]
(db.sql/execute! db
(-> (h/select :*)
(h/from :document)
(h/where [:= :cluster-id cluster-id]))))
(defn get-documents-by-ids
"Get documents by their IDs."
[db document-ids]
(db.sql/execute! db
(-> (h/select :*)
(h/from :document)
(h/where [:in :id document-ids]))))
(defn delete-matches-for-document!
"Delete all matches involving a document."
[db document-id]
(db.sql/execute! db
(-> (h/delete-from :document-match)
(h/where [:or
[:= :document-a-id document-id]
[:= :document-b-id document-id]]))))
(defn get-cluster-edges
"Get all match edges within a cluster."
[db cluster-id]
(db.sql/execute! db
{:select [:dm.*]
:from [[:document-match :dm]]
:join [[:document :da] [:= :dm.document-a-id :da.id]]
:where [:= :da.cluster-id cluster-id]}))
Step 4: Run test to verify it passes
Run: clj -X:test:silent :nses '[com.getorcha.db.document-matching-test]'
Expected: PASS
Step 5: Commit
git add src/com/getorcha/db/document_matching.clj test/com/getorcha/db/document_matching_test.clj
git commit -m "feat(matching): add cluster management db operations"
Files:
src/com/getorcha/workers/matching/searchable_text.clj
test/com/getorcha/workers/matching/searchable_text_test.clj
Step 1: Write test for build-searchable-text
;; test/com/getorcha/workers/matching/searchable_text_test.clj
(ns com.getorcha.workers.matching.searchable-text-test
(:require [clojure.string] ; the assertions call clojure.string/includes?
[clojure.test :refer [deftest is testing]]
[com.getorcha.workers.matching.searchable-text :as searchable]))
(deftest build-searchable-text-test
(testing "invoice extracts supplier, VAT, number, total, currency, PO ref"
(let [doc {:type :invoice
:structured-data {:issuer {:name "ACME Corp"
:vat-id "DE123456789"}
:invoice-number "INV-2024-001"
:total 15000.00
:currency "EUR"
:po-reference "PO-2024-050"}}
result (searchable/build-searchable-text doc)]
(is (clojure.string/includes? result "ACME Corp"))
(is (clojure.string/includes? result "DE123456789"))
(is (clojure.string/includes? result "INV-2024-001"))
(is (clojure.string/includes? result "15000"))
(is (clojure.string/includes? result "EUR"))
(is (clojure.string/includes? result "PO-2024-050"))))
(testing "purchase-order extracts supplier, VAT, PO number, total, contract ref"
(let [doc {:type :purchase-order
:structured-data {:supplier {:name "Widgets Inc"
:vat-id "DE987654321"}
:po-number "PO-2024-050"
:total 20000.00
:currency "EUR"
:contract-reference "C-2024-010"}}
result (searchable/build-searchable-text doc)]
(is (clojure.string/includes? result "Widgets Inc"))
(is (clojure.string/includes? result "PO-2024-050"))
(is (clojure.string/includes? result "C-2024-010"))))
(testing "handles missing fields gracefully"
(let [doc {:type :invoice
:structured-data {:issuer {:name "Some Corp"}}}
result (searchable/build-searchable-text doc)]
(is (clojure.string/includes? result "Some Corp"))
(is (string? result)))))
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.searchable-text-test]'
Expected: FAIL - namespace not found
Step 3: Write implementation
;; src/com/getorcha/workers/matching/searchable_text.clj
(ns com.getorcha.workers.matching.searchable-text
"Build searchable text from document structured data for hybrid search."
(:require [clojure.string :as str]))
(defn- join-non-nil
"Join non-nil values with separator."
[separator & values]
(->> values
(remove nil?)
(map str)
(str/join separator)))
(defn build-searchable-text
"Build searchable text string from document for hybrid search indexing and querying."
[{:keys [type structured-data]}]
(case type
:invoice
(join-non-nil " | "
(get-in structured-data [:issuer :name])
(get-in structured-data [:issuer :vat-id])
(:invoice-number structured-data)
(:total structured-data)
(:currency structured-data)
(:po-reference structured-data))
:purchase-order
(join-non-nil " | "
(get-in structured-data [:supplier :name])
(get-in structured-data [:supplier :vat-id])
(:po-number structured-data)
(:total structured-data)
(:currency structured-data)
(:contract-reference structured-data))
:contract
(join-non-nil " | "
(get-in structured-data [:party-b :name])
(get-in structured-data [:party-b :vat-id])
(:contract-number structured-data))
:goods-received-note
(join-non-nil " | "
(get-in structured-data [:supplier :name])
(get-in structured-data [:supplier :vat-id])
(:grn-number structured-data)
(:po-reference structured-data))
""))
Step 4: Run test to verify it passes
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.searchable-text-test]'
Expected: PASS
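As a worked example of what the builder produces, the sketch below re-creates the joining step in isolation. join-fields is a hypothetical stand-in for the private join-non-nil helper with the separator fixed to " | "; the input values are taken from the invoice test case.

```clojure
(require '[clojure.string :as str])

(defn join-fields
  "Hypothetical stand-in for join-non-nil: drops nil values, stringifies
  the rest and joins them with a pipe separator."
  [& values]
  (->> values
       (remove nil?)
       (map str)
       (str/join " | ")))

(join-fields "ACME Corp" "DE123456789" "INV-2024-001" 15000.00 "EUR" "PO-2024-050")
;; => "ACME Corp | DE123456789 | INV-2024-001 | 15000.0 | EUR | PO-2024-050"

(join-fields "Some Corp" nil nil nil nil nil)
;; => "Some Corp"
```

Note that a double total is rendered by str as 15000.0, so the indexed text contains the trailing ".0"; other numeric types may render differently.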
Step 5: Commit
git add src/com/getorcha/workers/matching/searchable_text.clj test/com/getorcha/workers/matching/searchable_text_test.clj
git commit -m "feat(matching): add searchable text builder for documents"
Files:
src/com/getorcha/workers/matching/evidence.clj
test/com/getorcha/workers/matching/evidence_test.clj
Step 1: Write test for evidence signal collection
;; test/com/getorcha/workers/matching/evidence_test.clj
(ns com.getorcha.workers.matching.evidence-test
(:require [clojure.test :refer [deftest is testing]]
[com.getorcha.workers.matching.evidence :as evidence]))
(deftest collect-signals-test
(testing "exact PO number match"
(let [invoice {:type :invoice
:structured-data {:po-reference "PO-2024-001"
:issuer {:vat-id "DE123"}}}
po {:type :purchase-order
:structured-data {:po-number "PO-2024-001"
:supplier {:vat-id "DE123"}}}
signals (evidence/collect-signals invoice po)]
(is (some #(= :po-number-exact (:signal %)) signals))
(is (some #(= :vat-id-match (:signal %)) signals))))
(testing "VAT ID mismatch is negative signal"
(let [invoice {:type :invoice
:structured-data {:issuer {:vat-id "DE123"}}}
po {:type :purchase-order
:structured-data {:supplier {:vat-id "DE999"}}}
signals (evidence/collect-signals invoice po)]
(is (some #(= :vat-id-mismatch (:signal %)) signals))))
(testing "amount within tolerance"
(let [invoice {:type :invoice
:structured-data {:total 9900}}
po {:type :purchase-order
:structured-data {:total 10000}}
signals (evidence/collect-signals invoice po)]
(is (some #(= :amount-within-2pct (:signal %)) signals)))))
(deftest compute-score-test
(testing "score normalizes to 0-1 range"
(let [invoice {:type :invoice
:structured-data {:po-reference "PO-001"
:issuer {:vat-id "DE123"}
:total 10000}}
po {:type :purchase-order
:structured-data {:po-number "PO-001"
:supplier {:vat-id "DE123"}
:total 10000}}
{:keys [score]} (evidence/compute-score invoice po)]
(is (>= score 0.0))
(is (<= score 1.0))
(is (> score 0.7)))))
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.evidence-test]'
Expected: FAIL - namespace not found
Step 3: Write implementation
;; src/com/getorcha/workers/matching/evidence.clj
(ns com.getorcha.workers.matching.evidence
"Evidence signal collection and scoring for document matching.")
;; Signal weights: positive signals build confidence, negative reduce it
(def evidence-signals
{:po-number-exact 60 ; PO number on invoice matches PO
:contract-ref-exact 55 ; Contract reference matches
:po-ref-exact 55 ; PO reference on GRN matches
:vat-id-match 30 ; Supplier VAT IDs match
:iban-match 25 ; Supplier bank accounts match
:amount-within-2pct 20 ; Amounts within 2% tolerance
:amount-within-5pct 10 ; Amounts within 5% tolerance
:supplier-name-fuzzy 15 ; Supplier names >0.8 similarity
:date-in-validity 10 ; Document date within validity period
:vat-id-mismatch -40}) ; VAT IDs present but don't match
(def match-thresholds
{:high 0.70
:medium 0.50
:low 0.30})
(defn- get-vat-id
"Extract VAT ID from document based on type."
[{:keys [type structured-data]}]
(case type
:invoice (get-in structured-data [:issuer :vat-id])
:purchase-order (get-in structured-data [:supplier :vat-id])
:contract (get-in structured-data [:party-b :vat-id])
:goods-received-note (get-in structured-data [:supplier :vat-id])
nil))
(defn- get-total
"Extract total amount from document."
[{:keys [structured-data]}]
(:total structured-data))
(defn- get-po-number
"Extract PO number/reference from document."
[{:keys [type structured-data]}]
(case type
:invoice (:po-reference structured-data)
:purchase-order (:po-number structured-data)
:goods-received-note (:po-reference structured-data)
nil))
(defn- within-tolerance?
"Check if two numbers are within percentage tolerance."
[a b tolerance-pct]
(when (and a b (pos? b))
(<= (abs (- 1 (/ a b))) (/ tolerance-pct 100.0))))
(defn collect-signals
  "Collect all evidence signals between two documents."
  [doc-a doc-b]
  (let [vat-a        (get-vat-id doc-a)
        vat-b        (get-vat-id doc-b)
        po-num-a     (get-po-number doc-a)
        po-num-b     (get-po-number doc-b)
        total-a      (get-total doc-a)
        total-b      (get-total doc-b)
        within-2pct? (within-tolerance? total-a total-b 2)
        within-5pct? (within-tolerance? total-a total-b 5)]
    ;; cond-> threads the vector through each conj, so every signal is kept.
    (cond-> []
      (and po-num-a po-num-b (= po-num-a po-num-b))
      (conj {:signal :po-number-exact
             :value  po-num-a
             :weight (:po-number-exact evidence-signals)})
      (and vat-a vat-b (= vat-a vat-b))
      (conj {:signal :vat-id-match
             :value  vat-a
             :weight (:vat-id-match evidence-signals)})
      (and vat-a vat-b (not= vat-a vat-b))
      (conj {:signal :vat-id-mismatch
             :value  (str vat-a " vs " vat-b)
             :weight (:vat-id-mismatch evidence-signals)})
      within-2pct?
      (conj {:signal :amount-within-2pct
             :value  (str total-a " ~ " total-b)
             :weight (:amount-within-2pct evidence-signals)})
      ;; The 5% signal only applies when the tighter 2% band did not match.
      (and within-5pct? (not within-2pct?))
      (conj {:signal :amount-within-5pct
             :value  (str total-a " ~ " total-b)
             :weight (:amount-within-5pct evidence-signals)}))))
(defn compute-score
"Compute match score from evidence signals. Returns {:score :evidence}."
[doc-a doc-b]
(let [signals (collect-signals doc-a doc-b)
raw-score (reduce + 0 (map :weight signals))
normalized (-> raw-score (/ 100.0) (max 0.0) (min 1.0))]
{:score normalized
:evidence signals}))
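To make the scoring arithmetic concrete, here is a small self-contained sketch of the normalization step. The weights map copies four entries from evidence-signals; normalize is a hypothetical helper that takes signal keys directly instead of documents.

```clojure
(def weights
  "Subset of the evidence-signals weights defined in the plan."
  {:po-number-exact    60
   :vat-id-match       30
   :amount-within-2pct 20
   :vat-id-mismatch    -40})

(defn normalize
  "Hypothetical helper: sums the weights of the given signals, divides by
  100 and clamps the result to the range [0.0, 1.0]."
  [signal-keys]
  (-> (reduce + 0 (map weights signal-keys))
      (/ 100.0)
      (max 0.0)
      (min 1.0)))

(normalize [:po-number-exact :vat-id-match :amount-within-2pct]) ; raw 110, clamped to 1.0
(normalize [:po-number-exact :vat-id-mismatch])                   ; raw 20, gives 0.2
(normalize [:vat-id-mismatch])                                    ; raw -40, clamped to 0.0
```

With these weights, an exact PO number combined with a VAT ID mismatch lands at 0.2, which is below the :low threshold of 0.30.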
Step 4: Run test to verify it passes
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.evidence-test]'
Expected: PASS
Step 5: Commit
git add src/com/getorcha/workers/matching/evidence.clj test/com/getorcha/workers/matching/evidence_test.clj
git commit -m "feat(matching): add evidence signal collection and scoring"
Files:
src/com/getorcha/workers/matching/candidates.clj
test/com/getorcha/workers/matching/candidates_test.clj
Step 1: Write test for matchable types
;; test/com/getorcha/workers/matching/candidates_test.clj
(ns com.getorcha.workers.matching.candidates-test
(:require [clojure.test :refer [deftest is testing use-fixtures]]
[com.getorcha.workers.matching.candidates :as candidates]
[com.getorcha.test.fixtures :as fixtures]))
(use-fixtures :once fixtures/with-running-system)
(use-fixtures :each fixtures/with-db-rollback)
(deftest get-matchable-types-test
(testing "invoice can match PO and contract"
(is (= #{:purchase-order :contract}
(candidates/get-matchable-types :invoice))))
(testing "purchase-order can match invoice and contract"
(is (= #{:invoice :contract}
(candidates/get-matchable-types :purchase-order))))
(testing "goods-received-note can only match purchase-order"
(is (= #{:purchase-order}
(candidates/get-matchable-types :goods-received-note)))))
(deftest find-candidates-by-type-test
(testing "finds candidates of correct type within same legal entity"
(let [le-id (:legal-entity/id (fixtures/create-legal-entity!))
_invoice (fixtures/create-document! {:legal-entity-id le-id
:type "invoice"
:structured-data {:issuer {:name "ACME"}}})
po (fixtures/create-document! {:legal-entity-id le-id
:type "purchase-order"
:structured-data {:supplier {:name "ACME"}}})
candidates (candidates/find-candidates-by-type
fixtures/*db*
{:legal-entity-id le-id
:types [:purchase-order]
:exclude-id nil})]
(is (= 1 (count candidates)))
(is (= (:document/id po) (:document/id (first candidates)))))))
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.candidates-test]'
Expected: FAIL - namespace not found
Step 3: Write implementation
;; src/com/getorcha/workers/matching/candidates.clj
(ns com.getorcha.workers.matching.candidates
"Candidate retrieval for document matching."
(:require [com.getorcha.db.sql :as db.sql]
[com.getorcha.workers.matching.searchable-text :as searchable]
[honey.sql.helpers :as h]))
(def ^:private matchable-pairs
"Valid document type pairs for matching."
#{#{:invoice :purchase-order}
#{:invoice :contract}
#{:purchase-order :contract}
#{:goods-received-note :purchase-order}})
(defn get-matchable-types
"Get document types that can match with the given type."
[doc-type]
(->> matchable-pairs
(filter #(contains? % doc-type))
(mapcat identity)
(remove #(= % doc-type))
set))
(defn find-candidates-by-type
"Find candidate documents by type within same legal entity."
[db {:keys [legal-entity-id types exclude-id]}]
(db.sql/execute! db
(-> (h/select :*)
(h/from :document)
(h/where [:and
[:= :legal-entity-id legal-entity-id]
[:in :type (map name types)]
[:is-not :structured-data nil]
(when exclude-id
[:<> :id exclude-id])])
(h/limit 50))))
(defn find-candidates
"Find candidate documents for matching.
Uses simple type-based query; hybrid search integration is future work."
[db _search-config doc]
(let [matchable-types (get-matchable-types (:type doc))]
(find-candidates-by-type db
{:legal-entity-id (:legal-entity-id doc)
:types matchable-types
:exclude-id (:id doc)})))
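Storing the allowed pairs as a set of two-element sets makes the relation symmetric by construction. A short sketch of how a pair check could look on top of that structure; matchable? is a hypothetical predicate that the plan itself does not define, and pairs is a copy of the private matchable-pairs set.

```clojure
(def pairs
  "Copy of the plan's matchable-pairs set."
  #{#{:invoice :purchase-order}
    #{:invoice :contract}
    #{:purchase-order :contract}
    #{:goods-received-note :purchase-order}})

(defn matchable?
  "Hypothetical predicate: true when the two types form an allowed pair.
  Uses (set [...]) rather than a set literal so equal types do not throw."
  [type-a type-b]
  (contains? pairs (set [type-a type-b])))

(matchable? :invoice :purchase-order)      ; => true
(matchable? :purchase-order :invoice)      ; => true, argument order is irrelevant
(matchable? :goods-received-note :invoice) ; => false
(matchable? :invoice :invoice)             ; => false
```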
Step 4: Run test to verify it passes
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.candidates-test]'
Expected: PASS
Step 5: Commit
git add src/com/getorcha/workers/matching/candidates.clj test/com/getorcha/workers/matching/candidates_test.clj
git commit -m "feat(matching): add candidate retrieval with matchable type filtering"
Files:
src/com/getorcha/workers/matching/llm_decision.clj
test/com/getorcha/workers/matching/llm_decision_test.clj
Step 1: Write test for prompt building and response parsing
;; test/com/getorcha/workers/matching/llm_decision_test.clj
(ns com.getorcha.workers.matching.llm-decision-test
(:require [clojure.string] ; the assertions call clojure.string/includes?
[clojure.test :refer [deftest is testing]]
[com.getorcha.workers.matching.llm-decision :as llm-decision]))
(deftest format-document-summary-test
(testing "formats invoice summary"
(let [doc {:type :invoice
:structured-data {:invoice-number "INV-001"
:invoice-date "2024-02-20"
:issuer {:name "ACME Corp"
:vat-id "DE123"}
:total 15000
:currency "EUR"}}
summary (llm-decision/format-document-summary doc)]
(is (clojure.string/includes? summary "Invoice"))
(is (clojure.string/includes? summary "INV-001"))
(is (clojure.string/includes? summary "ACME Corp"))
(is (clojure.string/includes? summary "15000"))))
(testing "formats purchase-order summary"
(let [doc {:type :purchase-order
:structured-data {:po-number "PO-001"
:supplier {:name "Widgets Inc"}
:total 20000}}
summary (llm-decision/format-document-summary doc)]
(is (clojure.string/includes? summary "Purchase Order"))
(is (clojure.string/includes? summary "PO-001")))))
(deftest build-match-prompt-test
(testing "builds prompt with source and candidates"
(let [source {:type :invoice :structured-data {:invoice-number "INV-001"}}
candidates [{:doc {:type :purchase-order :structured-data {:po-number "PO-001"}}
:score 0.65
:evidence [{:signal :vat-id-match}]}]
{:keys [system user]} (llm-decision/build-match-prompt source candidates)]
(is (string? system))
(is (clojure.string/includes? user "Source Document"))
(is (clojure.string/includes? user "Candidate")))))
(deftest parse-llm-response-test
(testing "parses valid JSON response"
(let [response "{\"matches\": [{\"candidate\": 1, \"confidence\": \"high\", \"reasoning\": \"PO number matches\"}]}"
parsed (llm-decision/parse-llm-response response)]
(is (= 1 (count (:matches parsed))))
(is (= "high" (-> parsed :matches first :confidence)))))
(testing "handles no matches"
(let [response "{\"matches\": []}"
parsed (llm-decision/parse-llm-response response)]
(is (empty? (:matches parsed)))))
(testing "handles malformed JSON gracefully"
(let [response "not json at all"
parsed (llm-decision/parse-llm-response response)]
(is (empty? (:matches parsed))))))
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.llm-decision-test]'
Expected: FAIL - namespace not found
Step 3: Write implementation
;; src/com/getorcha/workers/matching/llm_decision.clj
(ns com.getorcha.workers.matching.llm-decision
"LLM-based document matching decision making."
(:require [cheshire.core :as json]
[clojure.string :as str]
[com.getorcha.workers.llm :as llm]))
(defn format-document-summary
"Format a document for LLM prompt."
[{:keys [type structured-data]}]
(case type
:invoice
(format "Type: Invoice\nNumber: %s\nDate: %s\nSupplier: %s (VAT: %s)\nTotal: %s %s"
(:invoice-number structured-data "N/A")
(:invoice-date structured-data "N/A")
(get-in structured-data [:issuer :name] "N/A")
(get-in structured-data [:issuer :vat-id] "N/A")
(:total structured-data "N/A")
(:currency structured-data ""))
:purchase-order
(format "Type: Purchase Order\nNumber: %s\nDate: %s\nSupplier: %s (VAT: %s)\nTotal: %s %s\nContract Ref: %s"
(:po-number structured-data "N/A")
(:po-date structured-data "N/A")
(get-in structured-data [:supplier :name] "N/A")
(get-in structured-data [:supplier :vat-id] "N/A")
(:total structured-data "N/A")
(:currency structured-data "")
(:contract-reference structured-data "none"))
:contract
(format "Type: Contract\nNumber: %s\nParty: %s (VAT: %s)\nEffective: %s to %s"
(:contract-number structured-data "N/A")
(get-in structured-data [:party-b :name] "N/A")
(get-in structured-data [:party-b :vat-id] "N/A")
(:effective-date structured-data "N/A")
(:expiration-date structured-data "N/A"))
:goods-received-note
(format "Type: Goods Received Note\nNumber: %s\nDate: %s\nSupplier: %s\nPO Ref: %s"
(:grn-number structured-data "N/A")
(:receipt-date structured-data "N/A")
(get-in structured-data [:supplier :name] "N/A")
(:po-reference structured-data "N/A"))
(str "Type: " (if type (name type) "unknown"))))
(defn- format-candidates
"Format candidates list for LLM prompt."
[candidates]
(->> candidates
(map-indexed
(fn [i {:keys [doc score evidence]}]
(str "### Candidate " (inc i) "\n"
(format-document-summary doc)
"\nPreliminary score: " (format "%.2f" (double score))
"\nEvidence: " (pr-str (map :signal evidence)))))
(str/join "\n\n---\n\n")))
(defn build-match-prompt
"Build LLM prompt for match decision."
[source-doc candidates]
{:system "You are a document matching assistant for financial documents.
Determine which candidate document(s) match the source document.
Consider: supplier identity, amounts, dates, reference numbers.
Be conservative - only confirm matches you're confident about."
:user (str "## Source Document\n"
(format-document-summary source-doc)
"\n\n## Candidates\n\n"
(format-candidates candidates)
"\n\n## Task\n"
"Which candidate(s) match the source document? Return JSON:\n"
"{\"matches\": [{\"candidate\": 1, \"confidence\": \"high|medium|low\", \"reasoning\": \"...\"}]}\n\n"
"If none match confidently, return {\"matches\": []}")})
(defn parse-llm-response
"Parse LLM JSON response."
[response]
(try
(json/parse-string response true)
(catch Exception _e
{:matches []})))
(defn llm-match-decision
"Ask LLM to decide which candidates match the source document.
Returns {:matches [{:candidate :confidence :reasoning}]}"
[llm-config source-doc candidates]
(let [{:keys [system user]} (build-match-prompt source-doc candidates)
response (llm/complete llm-config
{:system system
:messages [{:role "user" :content user}]})]
(parse-llm-response (:content response))))
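Models sometimes wrap the requested JSON in explanatory prose or a code fence, in which case parse-llm-response as written would treat the reply as malformed and return no matches. One possible mitigation, sketched here under that assumption, is to cut the reply down to the outermost braces before parsing. extract-json-text is a hypothetical helper and is not part of workers/llm.clj.

```clojure
(require '[clojure.string :as str])

(defn extract-json-text
  "Hypothetical helper: returns the substring from the first { to the
  last } of s, or nil when s is not a string or has no such span."
  [s]
  (when (string? s)
    (let [start (str/index-of s "{")
          end   (str/last-index-of s "}")]
      (when (and start end (< start end))
        (subs s start (inc end))))))

(extract-json-text "Here is the result: {\"matches\": []} Hope that helps.")
;; => "{\"matches\": []}"

(extract-json-text "not json at all")
;; => nil
```

The extracted text would still go through the existing try/catch, so a reply that contains braces but is not valid JSON falls back to an empty match list as before.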
Step 4: Run test to verify it passes
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.llm-decision-test]'
Expected: PASS
Step 5: Commit
git add src/com/getorcha/workers/matching/llm_decision.clj test/com/getorcha/workers/matching/llm_decision_test.clj
git commit -m "feat(matching): add LLM-based match decision making"
Files:
src/com/getorcha/workers/matching/core.clj
test/com/getorcha/workers/matching/core_test.clj
Step 1: Write test for decision logic and cluster assignment
;; test/com/getorcha/workers/matching/core_test.clj
(ns com.getorcha.workers.matching.core-test
  (:require [clojure.test :refer [deftest is testing use-fixtures]]
            [com.getorcha.workers.matching.core :as matching]
            [com.getorcha.db.document-matching :as db.matching]
            [com.getorcha.test.fixtures :as fixtures]))

(use-fixtures :once fixtures/with-running-system)
(use-fixtures :each fixtures/with-db-rollback)

(deftest decide-matches-test
  (testing "single high-confidence candidate matches without LLM"
    (let [candidates [{:doc {:id (random-uuid) :type :purchase-order}
                       :score 0.85
                       :evidence [{:signal :po-number-exact}]}]
          result (matching/decide-matches nil nil candidates)]
      (is (= 1 (count result)))
      (is (= "rule-based" (:match-method (first result))))))
  (testing "no candidates above threshold returns empty"
    (let [candidates [{:doc {:id (random-uuid)} :score 0.20 :evidence []}]]
      (is (empty? (matching/decide-matches nil nil candidates))))))

(deftest cluster-assignment-test
  (testing "creates new cluster when neither doc has one"
    (let [le-id (:legal-entity/id (fixtures/create-legal-entity!))
          doc-a (fixtures/create-document! {:legal-entity-id le-id :type "invoice"})
          doc-b (fixtures/create-document! {:legal-entity-id le-id :type "purchase-order"})]
      (matching/assign-cluster! fixtures/*db* (:document/id doc-a) (:document/id doc-b))
      (let [docs (db.matching/get-documents-by-ids fixtures/*db*
                                                   [(:document/id doc-a) (:document/id doc-b)])]
        (is (every? :document/cluster-id docs))
        (is (= (:document/cluster-id (first docs))
               (:document/cluster-id (second docs)))))))
  (testing "assigns existing cluster when one doc has it"
    (let [le-id (:legal-entity/id (fixtures/create-legal-entity!))
          cluster-id (random-uuid)
          doc-a (fixtures/create-document! {:legal-entity-id le-id :type "invoice"})
          doc-b (fixtures/create-document! {:legal-entity-id le-id :type "purchase-order"})
          _ (db.matching/set-cluster-id! fixtures/*db* [(:document/id doc-a)] cluster-id)]
      (matching/assign-cluster! fixtures/*db* (:document/id doc-a) (:document/id doc-b))
      (let [docs (db.matching/get-documents-by-ids fixtures/*db*
                                                   [(:document/id doc-a) (:document/id doc-b)])
            doc-b-updated (first (filter #(= (:document/id doc-b) (:document/id %)) docs))]
        (is (= cluster-id (:document/cluster-id doc-b-updated)))))))
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.core-test]'
Expected: FAIL - namespace not found
Step 3: Write implementation
;; src/com/getorcha/workers/matching/core.clj
(ns com.getorcha.workers.matching.core
  "Core document matching logic."
  (:require [com.getorcha.db.document-matching :as db.matching]
            [com.getorcha.workers.matching.candidates :as candidates]
            [com.getorcha.workers.matching.evidence :as evidence]
            [com.getorcha.workers.matching.llm-decision :as llm-decision]))

(defn- score-candidates
  "Score all candidates against source document."
  [source-doc candidate-docs]
  (->> candidate-docs
       (map (fn [candidate]
              (let [candidate-doc {:type (keyword (:document/type candidate))
                                   :structured-data (:document/structured-data candidate)}
                    {:keys [score evidence]} (evidence/compute-score source-doc candidate-doc)]
                {:doc candidate
                 :score score
                 :evidence evidence})))
       (filter #(>= (:score %) (:low evidence/match-thresholds)))
       (sort-by :score >)))
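
(defn decide-matches
  "Decide which candidates to match based on scores.
  Expects `candidates` sorted by descending :score (as produced by score-candidates).
  Returns seq of {:doc :score :evidence :match-method}."
  [llm-config source-doc candidates]
  (let [high-threshold (:high evidence/match-thresholds)
        high-candidates (filter #(>= (:score %) high-threshold) candidates)]
    (cond
      (empty? candidates)
      []

      ;; Exactly one high-confidence candidate: accept it without consulting the LLM.
      (= 1 (count high-candidates))
      [(assoc (first high-candidates) :match-method "rule-based")]

      ;; Ambiguous (zero or several high-confidence candidates): ask the LLM if configured.
      llm-config
      (let [llm-result (llm-decision/llm-match-decision llm-config source-doc candidates)
            by-position (vec candidates)]
        (->> (:matches llm-result)
             (filter #(#{"high" "medium"} (:confidence %)))
             ;; :candidate is a 1-based index; skip missing or out-of-range values
             ;; instead of throwing.
             (keep (fn [{:keys [candidate]}]
                     (when-let [match (and (integer? candidate)
                                           (get by-position (dec candidate)))]
                       (assoc match :match-method "llm"))))))

      ;; No LLM configured: accept the best candidate only if it clears :medium.
      (>= (:score (first candidates)) (:medium evidence/match-thresholds))
      [(assoc (first candidates) :match-method "rule-based")]

      :else
      [])))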
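
(defn assign-cluster!
  "Assign cluster to matched documents."
  [db doc-a-id doc-b-id]
  ;; Look documents up by id rather than by position: get-documents-by-ids
  ;; may not return rows in the order the ids were passed.
  (let [docs-by-id (into {}
                         (map (juxt :document/id identity))
                         (db.matching/get-documents-by-ids db [doc-a-id doc-b-id]))
        cluster-a (:document/cluster-id (get docs-by-id doc-a-id))
        cluster-b (:document/cluster-id (get docs-by-id doc-b-id))]
    (cond
      ;; Already in the same cluster: nothing to do.
      (and cluster-a cluster-b (= cluster-a cluster-b))
      nil

      ;; Both clustered, but differently: merge cluster-b into cluster-a.
      (and cluster-a cluster-b)
      (let [docs-to-update (db.matching/get-cluster-documents db cluster-b)]
        (db.matching/set-cluster-id! db (map :document/id docs-to-update) cluster-a))

      ;; Only one side is clustered: pull the other document into that cluster.
      cluster-a
      (db.matching/set-cluster-id! db [doc-b-id] cluster-a)

      cluster-b
      (db.matching/set-cluster-id! db [doc-a-id] cluster-b)

      ;; Neither is clustered: start a new cluster for the pair.
      :else
      (let [new-cluster (random-uuid)]
        (db.matching/set-cluster-id! db [doc-a-id doc-b-id] new-cluster)))))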

(defn create-match!
  "Create a match between source and candidate, update clusters."
  [db source-doc {:keys [doc score evidence match-method]}]
  (db.matching/create-match! db
                             {:document-a-id (:id source-doc)
                              :document-b-id (:document/id doc)
                              :confidence (bigdec score)
                              :match-method match-method
                              :evidence evidence})
  (assign-cluster! db (:id source-doc) (:document/id doc)))

(defn match-document!
  "Main entry point: match a document against candidates."
  [db search-config llm-config doc]
  (db.matching/delete-matches-for-document! db (:id doc))
  (db.matching/set-cluster-id! db [(:id doc)] nil)
  (let [candidate-docs (candidates/find-candidates db search-config doc)]
    (when (seq candidate-docs)
      (let [scored (score-candidates doc candidate-docs)
            by-type (group-by #(keyword (:document/type (:doc %))) scored)]
        (doseq [[_doc-type type-candidates] by-type]
          (let [matches (decide-matches llm-config doc type-candidates)]
            (doseq [match matches]
              (create-match! db doc match))))))))
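The branching in decide-matches boils down to three outcomes. The sketch below is a simplified, self-contained model of that branching, not the real function: the `classify` name and the 0.7 threshold are assumptions for illustration (the actual value comes from evidence/match-thresholds, defined in an earlier task).

```clojure
;; Hypothetical threshold, for illustration only.
(def high-threshold 0.7)

(defn classify
  "Returns which branch decide-matches would take for `candidates`."
  [candidates]
  (let [high (filter #(>= (:score %) high-threshold) candidates)]
    (cond
      (empty? candidates) :no-match
      (= 1 (count high))  :rule-based
      :else               :ambiguous)))

(classify [])                             ;=> :no-match
(classify [{:score 0.85}])                ;=> :rule-based
(classify [{:score 0.85} {:score 0.80}])  ;=> :ambiguous
(classify [{:score 0.55}])                ;=> :ambiguous
```

In the :ambiguous case the real function asks the LLM when llm-config is set; otherwise it accepts the top candidate only if its score clears the :medium threshold.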
Step 4: Run test to verify it passes
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.core-test]'
Expected: PASS
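If a cluster test fails, it can help to reason about assign-cluster! without a database. The following is a minimal in-memory model of its four cases, assuming clusters are represented as a plain map from document id to cluster id. The `merge-clusters` name, the map representation, and the explicit `new-id` argument are all hypothetical and exist only for this sketch.

```clojure
(defn merge-clusters
  "Pure model of assign-cluster!. `clusters` maps doc id -> cluster id (or nil).
  `new-id` is the cluster id used when neither document has one yet."
  [clusters a b new-id]
  (let [ca (get clusters a)
        cb (get clusters b)]
    (cond
      ;; Same cluster already: unchanged.
      (and ca cb (= ca cb)) clusters
      ;; Two different clusters: move every member of b's cluster into a's.
      (and ca cb) (reduce-kv (fn [m doc c] (assoc m doc (if (= c cb) ca c)))
                             {}
                             clusters)
      ;; Only one side clustered: the other joins it.
      ca (assoc clusters b ca)
      cb (assoc clusters a cb)
      ;; Neither clustered: both get the new id.
      :else (assoc clusters a new-id b new-id))))

(merge-clusters {:inv nil :po nil} :inv :po :c1)
;=> {:inv :c1, :po :c1}
(merge-clusters {:inv :c1 :po nil} :inv :po :c2)
;=> {:inv :c1, :po :c1}
(merge-clusters {:inv :c1 :po :c2 :grn :c2} :inv :po :c3)
;=> {:inv :c1, :po :c1, :grn :c1}
```

The last call shows why the merge case has to update every document in the second cluster, not just the matched one: the GRN follows the PO into the invoice's cluster.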
Step 5: Commit
git add src/com/getorcha/workers/matching/core.clj test/com/getorcha/workers/matching/core_test.clj
git commit -m "feat(matching): add core matching logic with cluster management"
Files:
src/com/getorcha/workers/matching/worker.clj
Step 1: Create worker namespace
;; src/com/getorcha/workers/matching/worker.clj
(ns com.getorcha.workers.matching.worker
  "SQS worker for document matching."
  (:require [com.getorcha.aws :as aws]
            [com.getorcha.db.sql :as db.sql]
            [com.getorcha.workers.matching.core :as matching]
            [honey.sql.helpers :as h]
            [integrant.core :as ig]
            [taoensso.timbre :as log]))

(defn- fetch-document
  "Fetch document with structured data."
  [db document-id]
  (first
   (db.sql/execute! db
                    (-> (h/select :*)
                        (h/from :document)
                        (h/where [:= :id document-id])))))

(defn process-message!
  "Process a single SQS message for document matching."
  [{:keys [db-pool search-config llm-config]} message]
  (let [{:keys [document-id]} (aws/parse-message-body message)]
    (log/info "Processing document for matching" {:document-id document-id})
    (try
      (let [doc (fetch-document db-pool document-id)]
        (if (and doc (:document/structured-data doc))
          (do
            (matching/match-document! db-pool search-config llm-config
                                      {:id (:document/id doc)
                                       :type (keyword (:document/type doc))
                                       :legal-entity-id (:document/legal-entity-id doc)
                                       :structured-data (:document/structured-data doc)
                                       :cluster-id (:document/cluster-id doc)})
            (log/info "Document matching complete" {:document-id document-id}))
          (log/warn "Document not found or missing structured data" {:document-id document-id})))
      (catch Exception e
        (log/error e "Failed to process document for matching" {:document-id document-id})
        (throw e)))))

(defn- poll-loop!
  "Main polling loop for SQS messages."
  [{:keys [aws-state] :as context} running?]
  (let [queue-url (get-in aws-state [:queue-urls :matching])]
    (log/info "Starting matching worker polling loop" {:queue (get-in aws-state [:queues :matching])})
    (while @running?
      (try
        (let [messages (aws/receive-messages aws-state queue-url
                                             {:max-messages 5
                                              :wait-time-seconds 20})]
          (doseq [message messages]
            (try
              (process-message! context message)
              (aws/delete-message! aws-state queue-url message)
              (catch Exception e
                (log/error e "Failed to process message, will retry")))))
        (catch Exception e
          (log/error e "Error in polling loop")
          (Thread/sleep 5000))))))

(defmethod ig/init-key ::orchestrator
  [_ {:keys [aws-state db-pool search-config llm-config] :as config}]
  (log/info "Initializing document matching orchestrator")
  (let [running? (atom true)
        context {:aws-state aws-state
                 :db-pool db-pool
                 :search-config search-config
                 :llm-config llm-config}
        thread (Thread. #(poll-loop! context running?))]
    (.start thread)
    {:thread thread
     :running? running?
     :config config}))

(defmethod ig/halt-key! ::orchestrator
  [_ {:keys [thread running?]}]
  (log/info "Stopping document matching orchestrator")
  (reset! running? false)
  (.join thread 5000))
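One thing to watch in the halt logic above: the loop only checks running? between polls, and a long poll can block for up to 20 seconds, so the 5-second join may time out while the thread is still alive. A possible refinement is to interrupt the thread on halt. The sketch below shows the pattern in isolation, with Thread/sleep standing in for the blocking receive. `start-worker!`, `stop-worker!`, and `poll-fn` are hypothetical names, and whether the real aws/receive-messages call responds to interruption is an assumption that needs checking.

```clojure
(defn start-worker!
  "Starts a polling thread. `poll-fn` stands in for the blocking receive call."
  [poll-fn]
  (let [running? (atom true)
        thread   (Thread.
                  (fn []
                    (while @running?
                      (try
                        (poll-fn)
                        (catch InterruptedException _
                          ;; Interrupted during a blocking call: stop looping.
                          (reset! running? false))))))]
    (.setDaemon thread true)
    (.start thread)
    {:thread thread :running? running?}))

(defn stop-worker!
  "Signals the loop to stop, interrupts any blocking call, and waits up to
  `timeout-ms`. Returns true if the thread has terminated."
  [{:keys [thread running?]} timeout-ms]
  (reset! running? false)
  (.interrupt thread)
  (.join thread (long timeout-ms))
  (not (.isAlive thread)))

;; A 20 s "poll" is cut short by the interrupt, so the stop returns quickly.
(def worker (start-worker! #(Thread/sleep 20000)))
(stop-worker! worker 5000) ;=> true
```

In the Integrant component, the same two extra calls (`.interrupt` before `.join`, and handling InterruptedException in the loop) would be the only changes needed if the receive call turns out to be interruptible.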
Step 2: Commit worker
git add src/com/getorcha/workers/matching/worker.clj
git commit -m "feat(matching): add SQS worker for document matching"
Files:
resources/com/getorcha/config.edn
Step 1: Add matching worker to config.edn
Add the following to config.edn (find appropriate location among other worker configs):
:com.getorcha.workers.matching.worker/orchestrator
{:aws-state     #ig/ref :com.getorcha.aws/state
 :db-pool       #ig/ref :com.getorcha.db/pool
 :search-config nil ; TODO: wire up when hybrid search is integrated
 :llm-config    #profile {:test    nil
                          :default {:model "claude-3-haiku-20240307"}}}
Step 2: Commit config
git add resources/com/getorcha/config.edn
git commit -m "feat(matching): add matching worker to Integrant config"
Files:
test/com/getorcha/workers/matching/integration_test.clj
Step 1: Write integration test
;; test/com/getorcha/workers/matching/integration_test.clj
(ns com.getorcha.workers.matching.integration-test
  (:require [clojure.test :refer [deftest is testing use-fixtures]]
            [com.getorcha.db.document-matching :as db.matching]
            [com.getorcha.workers.matching.core :as matching]
            [com.getorcha.test.fixtures :as fixtures]))

(use-fixtures :once fixtures/with-running-system)
(use-fixtures :each fixtures/with-db-rollback)

(deftest invoice-po-matching-integration-test
  (testing "invoice with PO reference matches PO with same number"
    (let [le-id (:legal-entity/id (fixtures/create-legal-entity!))
          po (fixtures/create-document!
              {:legal-entity-id le-id
               :type "purchase-order"
               :structured-data {:po-number "PO-2024-001"
                                 :supplier {:name "ACME Corp"
                                            :vat-id "DE123456789"}
                                 :total 10000
                                 :currency "EUR"}})
          invoice (fixtures/create-document!
                   {:legal-entity-id le-id
                    :type "invoice"
                    :structured-data {:invoice-number "INV-2024-001"
                                      :po-reference "PO-2024-001"
                                      :issuer {:name "ACME Corp"
                                               :vat-id "DE123456789"}
                                      :total 10000
                                      :currency "EUR"}})]
      (matching/match-document! fixtures/*db* nil nil
                                {:id (:document/id invoice)
                                 :type :invoice
                                 :legal-entity-id le-id
                                 :structured-data (:document/structured-data invoice)
                                 :cluster-id nil})
      (let [matches (db.matching/get-matches-for-document fixtures/*db* (:document/id invoice))]
        (is (= 1 (count matches)))
        (is (>= (:document-match/confidence (first matches)) 0.7M)))
      (let [[inv-doc po-doc] (db.matching/get-documents-by-ids
                              fixtures/*db*
                              [(:document/id invoice) (:document/id po)])]
        (is (some? (:document/cluster-id inv-doc)))
        (is (= (:document/cluster-id inv-doc) (:document/cluster-id po-doc)))))))

(deftest no-match-when-different-suppliers-test
  (testing "invoice does not match PO with different supplier VAT"
    (let [le-id (:legal-entity/id (fixtures/create-legal-entity!))
          _po (fixtures/create-document!
               {:legal-entity-id le-id
                :type "purchase-order"
                :structured-data {:po-number "PO-2024-002"
                                  :supplier {:name "Different Corp"
                                             :vat-id "DE999999999"}
                                  :total 5000}})
          invoice (fixtures/create-document!
                   {:legal-entity-id le-id
                    :type "invoice"
                    :structured-data {:invoice-number "INV-2024-002"
                                      :issuer {:name "ACME Corp"
                                               :vat-id "DE123456789"}
                                      :total 5000}})]
      (matching/match-document! fixtures/*db* nil nil
                                {:id (:document/id invoice)
                                 :type :invoice
                                 :legal-entity-id le-id
                                 :structured-data (:document/structured-data invoice)
                                 :cluster-id nil})
      (let [matches (db.matching/get-matches-for-document fixtures/*db* (:document/id invoice))]
        (is (empty? matches))))))
Step 2: Run integration test
Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.integration-test]'
Expected: PASS
Step 3: Commit
git add test/com/getorcha/workers/matching/integration_test.clj
git commit -m "test(matching): add end-to-end integration tests"
Files:
src/com/getorcha/workers/ingestion/orchestrator.clj (or wherever ingestion completion happens)
Step 1: Find where ingestion completes successfully
Look for where valid_structured_data = true is set or where ingestion status becomes completed.
Step 2: Add SQS message publish
After successful ingestion, publish to matching queue:
(aws/send-message! aws-state
                   (get-in aws-state [:queue-urls :matching])
                   {:document-id document-id})
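Since the worker skips documents that have no structured data, it may be worth guarding the publish call the same way so those documents are never enqueued. The sketch below is one way to do that under stated assumptions: `publish-document-ready!` and the injected `send!` function are hypothetical, with `send!` standing in for a one-argument wrapper around the aws/send-message! call shown above.

```clojure
(defn publish-document-ready!
  "Calls `send!` with the matching message if `doc` has an id and structured data.
  Returns true when a message was sent, nil otherwise.
  `send!` is a hypothetical wrapper around the real publish call."
  [send! {:keys [id structured-data]}]
  (when (and id structured-data)
    (send! {:document-id id})
    true))

;; Usage with a stand-in sender that records messages in an atom:
(def sent (atom []))
(publish-document-ready! #(swap! sent conj %) {:id 42 :structured-data {:total 100}})
(publish-document-ready! #(swap! sent conj %) {:id 43 :structured-data nil})
@sent ;=> [{:document-id 42}]
```

Injecting `send!` keeps the guard testable without a queue; the message keeps the {:document-id ...} shape that process-message! destructures.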
Step 3: Commit
git add src/com/getorcha/workers/ingestion/orchestrator.clj
git commit -m "feat(matching): publish document-ready event after ingestion"
Files created:
resources/migrations/20260224100000-add-document-matching.up.sql
resources/migrations/20260224100000-add-document-matching.down.sql
src/com/getorcha/db/document_matching.clj
src/com/getorcha/workers/matching/searchable_text.clj
src/com/getorcha/workers/matching/evidence.clj
src/com/getorcha/workers/matching/candidates.clj
src/com/getorcha/workers/matching/llm_decision.clj
src/com/getorcha/workers/matching/core.clj
src/com/getorcha/workers/matching/worker.clj
test/com/getorcha/db/document_matching_test.clj
test/com/getorcha/workers/matching/searchable_text_test.clj
test/com/getorcha/workers/matching/evidence_test.clj
test/com/getorcha/workers/matching/candidates_test.clj
test/com/getorcha/workers/matching/llm_decision_test.clj
test/com/getorcha/workers/matching/core_test.clj
test/com/getorcha/workers/matching/integration_test.clj
Files modified:
resources/com/getorcha/config.edn (queue + worker config)
scripts/init_aws.clj (queue creation)
test/com/getorcha/test/fixtures.clj (test queue)
src/com/getorcha/workers/ingestion/orchestrator.clj (publish event)
Not implemented (future work):
com.getorcha.search)