Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Four-Way Document Matching Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Automatically match related financial documents (Invoice ↔ PO ↔ Contract ↔ GRN) using hybrid search and LLM-assisted decision making.

Architecture: An async SQS worker consumes document-ready events, uses hybrid search (BM25 + semantic) to find candidates, scores each candidate with evidence signals, and asks an LLM to decide ambiguous cases. Pairwise match edges are stored in the document_match table, with a denormalized cluster_id on each document row.
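
This section does not say how the BM25 and semantic result lists are combined. As a minimal sketch, assuming reciprocal rank fusion (a technique chosen here for illustration, not one the plan names), two ranked id lists could be merged like this. The function names and the constant k are hypothetical.

```clojure
;; Sketch only: reciprocal rank fusion over ranked id lists.
;; `rrf-scores`, `fuse`, and k are illustrative, not part of the plan.

(defn rrf-scores
  "Each ranked list contributes 1/(k + rank) to an id's score (rank starts at 1)."
  [k ranked-lists]
  (reduce (fn [acc ranked]
            (reduce-kv (fn [acc idx id]
                         (update acc id (fnil + 0.0) (/ 1.0 (+ k (inc idx)))))
                       acc
                       (vec ranked)))
          {}
          ranked-lists))

(defn fuse
  "Return ids ordered by descending fused score; ties break on the id's string form."
  [k & ranked-lists]
  (->> (rrf-scores k ranked-lists)
       (sort-by (fn [[id score]] [(- score) (str id)]))
       (mapv first)))

;; Hypothetical result lists from the two searches.
(def bm25-hits     [:po-1 :po-2 :po-3])
(def semantic-hits [:po-2 :po-4 :po-1])

(fuse 60 bm25-hits semantic-hits)
```

Ids that appear in both lists rise to the top, which suits a candidate stage that only needs a short list for evidence scoring.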

Tech Stack: Clojure, PostgreSQL (pgvector, pg_trgm), HoneySQL, Integrant, Google Vertex AI embeddings, LLM via workers/llm.clj

Design Doc: docs/plans/2026-02-24-document-matching-design.md


Task 1: Database Migration - Schema Setup

Files:

Step 1: Create up migration

-- resources/migrations/20260224100000-add-document-matching.up.sql

-- Add cluster_id and search columns to document table
ALTER TABLE document ADD COLUMN cluster_id uuid;
ALTER TABLE document ADD COLUMN searchable_text text;
ALTER TABLE document ADD COLUMN embedding vector(768);

CREATE INDEX idx_document_cluster_id ON document(cluster_id);
CREATE INDEX idx_document_embedding ON document USING hnsw (embedding vector_cosine_ops);
CREATE INDEX idx_document_searchable_text ON document USING gin (to_tsvector('simple', searchable_text));

-- Pairwise match edges (source of truth for document relationships)
CREATE TABLE document_match (
  document_a_id  uuid REFERENCES document(id) ON DELETE CASCADE,
  document_b_id  uuid REFERENCES document(id) ON DELETE CASCADE,
  confidence     decimal(5,4),
  match_method   text NOT NULL,
  evidence       jsonb,
  created_at     timestamptz NOT NULL DEFAULT now(),
  updated_at     timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (document_a_id, document_b_id),
  CHECK (document_a_id < document_b_id)
);

CREATE INDEX idx_document_match_b ON document_match(document_b_id);
CREATE INDEX idx_document_match_created ON document_match(created_at);

-- Trigger for updated_at
CREATE TRIGGER trigger_document_match_updated_at
  BEFORE UPDATE ON document_match
  FOR EACH ROW
  EXECUTE FUNCTION update_updated_at_column();
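
The embedding column is fixed at 768 dimensions, so a vector of any other length will be rejected on write. A minimal sketch of a guard that also renders pgvector's text form (`[x1,x2,...]`) follows. How the project's db layer actually binds vectors is not shown in this plan, so the helper name and its use are assumptions.

```clojure
;; Sketch only: `->pgvector-literal` is a hypothetical helper.
(require '[clojure.string :as str])

(def embedding-dim 768) ; matches vector(768) in the migration

(defn ->pgvector-literal
  "Render a numeric sequence in pgvector's text form, e.g. \"[0.1,0.2]\".
   Throws when the length does not match the column dimension."
  [xs]
  (let [n (count xs)]
    (when (not= embedding-dim n)
      (throw (ex-info "Unexpected embedding dimension"
                      {:expected embedding-dim :actual n})))
    (str "[" (str/join "," (map double xs)) "]")))
```

Failing early in application code gives a clearer error than a rejected INSERT deep inside the worker.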

Step 2: Create down migration

-- resources/migrations/20260224100000-add-document-matching.down.sql

DROP TRIGGER IF EXISTS trigger_document_match_updated_at ON document_match;
DROP TABLE IF EXISTS document_match;
DROP INDEX IF EXISTS idx_document_searchable_text;
DROP INDEX IF EXISTS idx_document_embedding;
DROP INDEX IF EXISTS idx_document_cluster_id;
ALTER TABLE document DROP COLUMN IF EXISTS embedding;
ALTER TABLE document DROP COLUMN IF EXISTS searchable_text;
ALTER TABLE document DROP COLUMN IF EXISTS cluster_id;

Step 3: Run migration

Run: bb migrate
Expected: Migration applies successfully

Step 4: Verify schema

Run: psql -h localhost -U postgres -d orcha -c "\d document_match"
Expected: Table with columns document_a_id, document_b_id, confidence, match_method, evidence, created_at, updated_at

Step 5: Commit

git add resources/migrations/20260224100000-add-document-matching.up.sql resources/migrations/20260224100000-add-document-matching.down.sql
git commit -m "feat(matching): add document_match table and search columns"

Task 2: SQS Queue Configuration

Files:

Step 1: Add queue to config.edn

Find the :queues map in :com.getorcha/aws and add the matching queue:

:queues {:ingestion   "v1-orcha-global-ingest"
         :acquisition "v1-orcha-global-email-acquire"
         :matching    "v1-orcha-global-doc-matching"}

Step 2: Add queue extraction in init_aws.clj

After the existing queue extractions (~line 96), add:

(def sqs-matching-queue (get-in config [:com.getorcha/aws :queues :matching]))

Step 3: Create queue in init_aws.clj

In create-sqs-queues function, add:

(create-queue-with-dlq! sqs-matching-queue)

Step 4: Update print statement in init_aws.clj

Update the queue print line to include matching queue:

(println "  SQS Queues:" sqs-ingestion-queue "," sqs-acquisition-queue "," sqs-matching-queue)

Step 5: Add queue creation to test fixtures.clj

In with-running-system, find where queues are created and add the matching queue:

(doseq [queue-name [(get-in config [::aws/state :queues :ingestion])
                    (get-in config [::aws/state :queues :acquisition])
                    (get-in config [::aws/state :queues :matching])]]
  (.createQueue sqs-client
                ^CreateQueueRequest
                (.build (doto (CreateQueueRequest/builder)
                          (.queueName queue-name)))))

Step 6: Verify LocalStack setup

Run: bb dev:init-aws
Expected: Output shows matching queue created

Step 7: Commit

git add resources/com/getorcha/config.edn scripts/init_aws.clj test/com/getorcha/test/fixtures.clj
git commit -m "feat(matching): add SQS queue for document matching"

Task 3: Database Operations - Core Queries

Files:

Step 1: Write test for get-matches-for-document

;; test/com/getorcha/db/document_matching_test.clj
(ns com.getorcha.db.document-matching-test
  (:require [clojure.test :refer [deftest is testing use-fixtures]]
            [com.getorcha.db.document-matching :as matching]
            [com.getorcha.test.fixtures :as fixtures]))


(use-fixtures :once fixtures/with-running-system)
(use-fixtures :each fixtures/with-db-rollback)


(deftest get-matches-for-document-test
  (testing "returns empty when no matches exist"
    (let [doc-id (random-uuid)]
      (is (empty? (matching/get-matches-for-document fixtures/*db* doc-id)))))

  (testing "returns matches for document as either a or b"
    (let [le-id (:legal-entity/id (fixtures/create-legal-entity!))
          doc-a (fixtures/create-document! {:legal-entity-id le-id :type "invoice"})
          doc-b (fixtures/create-document! {:legal-entity-id le-id :type "purchase-order"})
          doc-c (fixtures/create-document! {:legal-entity-id le-id :type "contract"})
          _ (matching/create-match! fixtures/*db*
                                    {:document-a-id (:document/id doc-a)
                                     :document-b-id (:document/id doc-b)
                                     :confidence 0.85M
                                     :match-method "rule-based"
                                     :evidence []})
          _ (matching/create-match! fixtures/*db*
                                    {:document-a-id (:document/id doc-b)
                                     :document-b-id (:document/id doc-c)
                                     :confidence 0.72M
                                     :match-method "llm"
                                     :evidence []})]
      (is (= 1 (count (matching/get-matches-for-document fixtures/*db* (:document/id doc-a)))))
      (is (= 2 (count (matching/get-matches-for-document fixtures/*db* (:document/id doc-b))))))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.db.document-matching-test]'
Expected: FAIL - namespace not found

Step 3: Write minimal implementation

;; src/com/getorcha/db/document_matching.clj
(ns com.getorcha.db.document-matching
  "Database operations for document matching."
  (:require [com.getorcha.db.sql :as db.sql]
            [honey.sql.helpers :as h]))


(defn get-matches-for-document
  "Get all matches for a document (as either document_a or document_b)."
  [db document-id]
  (db.sql/execute! db
    (-> (h/select :*)
        (h/from :document-match)
        (h/where [:or
                  [:= :document-a-id document-id]
                  [:= :document-b-id document-id]])
        (h/order-by [:confidence :desc]))))


(defn create-match!
  "Create a match between two documents. Ensures canonical ordering (a < b),
   as required by the CHECK constraint on document_match."
  [db {:keys [document-a-id document-b-id confidence match-method evidence]}]
  ;; Sort by string form: java.util.UUID compares its halves as signed longs,
  ;; which can disagree with PostgreSQL's byte-wise uuid ordering.
  (let [[id-a id-b] (sort-by str [document-a-id document-b-id])]
    (db.sql/execute-one! db
      (-> (h/insert-into :document-match)
          (h/values [{:document-a-id id-a
                      :document-b-id id-b
                      :confidence confidence
                      :match-method match-method
                      :evidence [:lift evidence]}])
          (h/on-conflict :document-a-id :document-b-id)
          (h/do-update-set :confidence :match-method :evidence :updated-at)))))
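
The CHECK (document_a_id < document_b_id) constraint is evaluated by PostgreSQL, which compares uuid values byte by byte. A minimal sketch of an ordering helper that agrees with that comparison is shown here. It relies on the canonical lowercase string form of a UUID sorting the same way as its bytes. The helper name and the sample ids are illustrative.

```clojure
;; Sketch only: `canonical-pair` is a hypothetical helper.

(defn canonical-pair
  "Order two UUIDs by their canonical string form, which matches an
   unsigned byte-wise comparison of the underlying 16 bytes."
  [id-1 id-2]
  (vec (sort-by str [id-1 id-2])))

;; A pair whose leading bytes are 0x00 and 0x80.
(def low-id  #uuid "00000000-0000-0000-0000-000000000001")
(def high-id #uuid "80000000-0000-0000-0000-000000000000")

(canonical-pair high-id low-id)
```

Byte-wise, `low-id` sorts first regardless of argument order, so the pair can be inserted without tripping the constraint.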

Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.db.document-matching-test]'
Expected: PASS

Step 5: Commit

git add src/com/getorcha/db/document_matching.clj test/com/getorcha/db/document_matching_test.clj
git commit -m "feat(matching): add get-matches-for-document and create-match! db operations"

Task 4: Database Operations - Cluster Management

Files:

Step 1: Write test for cluster operations

;; Add to test/com/getorcha/db/document_matching_test.clj

(deftest cluster-operations-test
  (testing "set-cluster-id! updates documents"
    (let [le-id (:legal-entity/id (fixtures/create-legal-entity!))
          doc-a (fixtures/create-document! {:legal-entity-id le-id :type "invoice"})
          doc-b (fixtures/create-document! {:legal-entity-id le-id :type "purchase-order"})
          cluster-id (random-uuid)]
      (matching/set-cluster-id! fixtures/*db* [(:document/id doc-a) (:document/id doc-b)] cluster-id)
      (let [docs (matching/get-documents-by-ids fixtures/*db* [(:document/id doc-a) (:document/id doc-b)])]
        (is (every? #(= cluster-id (:document/cluster-id %)) docs)))))

  (testing "get-cluster-documents returns all documents in cluster"
    (let [le-id (:legal-entity/id (fixtures/create-legal-entity!))
          doc-a (fixtures/create-document! {:legal-entity-id le-id :type "invoice"})
          doc-b (fixtures/create-document! {:legal-entity-id le-id :type "purchase-order"})
          doc-c (fixtures/create-document! {:legal-entity-id le-id :type "contract"})
          cluster-id (random-uuid)]
      (matching/set-cluster-id! fixtures/*db* [(:document/id doc-a) (:document/id doc-b)] cluster-id)
      (let [cluster-docs (matching/get-cluster-documents fixtures/*db* cluster-id)]
        (is (= 2 (count cluster-docs)))
        (is (not (some #(= (:document/id doc-c) (:document/id %)) cluster-docs)))))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.db.document-matching-test]'
Expected: FAIL - set-cluster-id! not found

Step 3: Add cluster operations to implementation

;; Add to src/com/getorcha/db/document_matching.clj

(defn set-cluster-id!
  "Set cluster_id for multiple documents."
  [db document-ids cluster-id]
  (db.sql/execute! db
    (-> (h/update :document)
        (h/set {:cluster-id cluster-id})
        (h/where [:in :id document-ids]))))


(defn get-cluster-documents
  "Get all documents in a cluster."
  [db cluster-id]
  (db.sql/execute! db
    (-> (h/select :*)
        (h/from :document)
        (h/where [:= :cluster-id cluster-id]))))


(defn get-documents-by-ids
  "Get documents by their IDs."
  [db document-ids]
  (db.sql/execute! db
    (-> (h/select :*)
        (h/from :document)
        (h/where [:in :id document-ids]))))


(defn delete-matches-for-document!
  "Delete all matches involving a document."
  [db document-id]
  (db.sql/execute! db
    (-> (h/delete-from :document-match)
        (h/where [:or
                  [:= :document-a-id document-id]
                  [:= :document-b-id document-id]]))))


(defn get-cluster-edges
  "Get all match edges within a cluster."
  [db cluster-id]
  (db.sql/execute! db
    {:select [:dm.*]
     :from [[:document-match :dm]]
     :join [[:document :da] [:= :dm.document-a-id :da.id]]
     :where [:= :da.cluster-id cluster-id]}))
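
These functions store and read cluster_id, but this task does not show how a cluster is derived from match edges. As a minimal sketch, assuming a cluster is a connected component of the match graph (an assumption, not something this section states), the grouping can be computed in memory from edge pairs. The function name is hypothetical.

```clojure
;; Sketch only: assumes clusters are connected components of the match graph.

(defn connected-components
  "Group document ids into clusters given pairwise [a b] match edges.
   Returns a set of sets of ids."
  [edges]
  (let [adjacency (reduce (fn [adj [a b]]
                            (-> adj
                                (update a (fnil conj #{}) b)
                                (update b (fnil conj #{}) a)))
                          {}
                          edges)]
    (loop [unvisited (set (keys adjacency))
           clusters  #{}]
      (if-let [start (first unvisited)]
        ;; Depth-first walk from `start`, collecting everything reachable.
        (let [component (loop [frontier [start]
                               seen     #{start}]
                          (if-let [node (peek frontier)]
                            (let [new-nodes (remove seen (adjacency node))]
                              (recur (into (pop frontier) new-nodes)
                                     (into seen new-nodes)))
                            seen))]
          (recur (reduce disj unvisited component)
                 (conj clusters component)))
        clusters))))

(connected-components [[:inv-1 :po-1] [:po-1 :contract-1] [:grn-2 :po-2]])
```

Each resulting set could then be passed to set-cluster-id! together with a fresh UUID.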

Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.db.document-matching-test]'
Expected: PASS

Step 5: Commit

git add src/com/getorcha/db/document_matching.clj test/com/getorcha/db/document_matching_test.clj
git commit -m "feat(matching): add cluster management db operations"

Task 5: Searchable Text Builder

Files:

Step 1: Write test for build-searchable-text

;; test/com/getorcha/workers/matching/searchable_text_test.clj
(ns com.getorcha.workers.matching.searchable-text-test
  (:require [clojure.string]
            [clojure.test :refer [deftest is testing]]
            [com.getorcha.workers.matching.searchable-text :as searchable]))


(deftest build-searchable-text-test
  (testing "invoice extracts supplier, VAT, number, total, currency, PO ref"
    (let [doc {:type :invoice
               :structured-data {:issuer {:name "ACME Corp"
                                          :vat-id "DE123456789"}
                                 :invoice-number "INV-2024-001"
                                 :total 15000.00
                                 :currency "EUR"
                                 :po-reference "PO-2024-050"}}
          result (searchable/build-searchable-text doc)]
      (is (clojure.string/includes? result "ACME Corp"))
      (is (clojure.string/includes? result "DE123456789"))
      (is (clojure.string/includes? result "INV-2024-001"))
      (is (clojure.string/includes? result "15000"))
      (is (clojure.string/includes? result "EUR"))
      (is (clojure.string/includes? result "PO-2024-050"))))

  (testing "purchase-order extracts supplier, VAT, PO number, total, contract ref"
    (let [doc {:type :purchase-order
               :structured-data {:supplier {:name "Widgets Inc"
                                            :vat-id "DE987654321"}
                                 :po-number "PO-2024-050"
                                 :total 20000.00
                                 :currency "EUR"
                                 :contract-reference "C-2024-010"}}
          result (searchable/build-searchable-text doc)]
      (is (clojure.string/includes? result "Widgets Inc"))
      (is (clojure.string/includes? result "PO-2024-050"))
      (is (clojure.string/includes? result "C-2024-010"))))

  (testing "handles missing fields gracefully"
    (let [doc {:type :invoice
               :structured-data {:issuer {:name "Some Corp"}}}
          result (searchable/build-searchable-text doc)]
      (is (clojure.string/includes? result "Some Corp"))
      (is (string? result)))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.searchable-text-test]'
Expected: FAIL - namespace not found

Step 3: Write implementation

;; src/com/getorcha/workers/matching/searchable_text.clj
(ns com.getorcha.workers.matching.searchable-text
  "Build searchable text from document structured data for hybrid search."
  (:require [clojure.string :as str]))


(defn- join-non-nil
  "Join non-nil values with separator."
  [separator & values]
  (->> values
       (remove nil?)
       (map str)
       (str/join separator)))


(defn build-searchable-text
  "Build searchable text string from document for hybrid search indexing and querying."
  [{:keys [type structured-data]}]
  (case type
    :invoice
    (join-non-nil " | "
                  (get-in structured-data [:issuer :name])
                  (get-in structured-data [:issuer :vat-id])
                  (:invoice-number structured-data)
                  (:total structured-data)
                  (:currency structured-data)
                  (:po-reference structured-data))

    :purchase-order
    (join-non-nil " | "
                  (get-in structured-data [:supplier :name])
                  (get-in structured-data [:supplier :vat-id])
                  (:po-number structured-data)
                  (:total structured-data)
                  (:currency structured-data)
                  (:contract-reference structured-data))

    :contract
    (join-non-nil " | "
                  (get-in structured-data [:party-b :name])
                  (get-in structured-data [:party-b :vat-id])
                  (:contract-number structured-data))

    :goods-received-note
    (join-non-nil " | "
                  (get-in structured-data [:supplier :name])
                  (get-in structured-data [:supplier :vat-id])
                  (:grn-number structured-data)
                  (:po-reference structured-data))

    ""))

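The builder repeats the same join for every document type. As a sketch of one alternative, the field paths can live in a lookup table so that adding a type is a data change. The table and function names here are hypothetical, and only two types are filled in.

```clojure
;; Sketch only: a data-driven variant, not the plan's implementation.
(require '[clojure.string :as str])

(def searchable-paths
  "Hypothetical lookup table: field paths to index, per document type."
  {:invoice        [[:issuer :name] [:issuer :vat-id] [:invoice-number]
                    [:total] [:currency] [:po-reference]]
   :purchase-order [[:supplier :name] [:supplier :vat-id] [:po-number]
                    [:total] [:currency] [:contract-reference]]})

(defn searchable-text
  "Join the present fields with \" | \"; unknown types yield an empty string."
  [{:keys [type structured-data]}]
  (->> (get searchable-paths type)
       (keep #(get-in structured-data %))
       (map str)
       (str/join " | ")))

(searchable-text {:type :invoice
                  :structured-data {:issuer {:name "ACME Corp" :vat-id "DE123456789"}
                                    :invoice-number "INV-2024-001"
                                    :total 15000.00
                                    :currency "EUR"
                                    :po-reference "PO-2024-050"}})
;; => "ACME Corp | DE123456789 | INV-2024-001 | 15000.0 | EUR | PO-2024-050"
```

Note that a double total renders as "15000.0", which is why the test above checks for the substring "15000" and not an exact amount.
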
Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.searchable-text-test]'
Expected: PASS

Step 5: Commit

git add src/com/getorcha/workers/matching/searchable_text.clj test/com/getorcha/workers/matching/searchable_text_test.clj
git commit -m "feat(matching): add searchable text builder for documents"

Task 6: Evidence Signals - Configuration and Scoring

Files:

Step 1: Write test for evidence signal collection

;; test/com/getorcha/workers/matching/evidence_test.clj
(ns com.getorcha.workers.matching.evidence-test
  (:require [clojure.test :refer [deftest is testing]]
            [com.getorcha.workers.matching.evidence :as evidence]))


(deftest collect-signals-test
  (testing "exact PO number match"
    (let [invoice {:type :invoice
                   :structured-data {:po-reference "PO-2024-001"
                                     :issuer {:vat-id "DE123"}}}
          po {:type :purchase-order
              :structured-data {:po-number "PO-2024-001"
                                :supplier {:vat-id "DE123"}}}
          signals (evidence/collect-signals invoice po)]
      (is (some #(= :po-number-exact (:signal %)) signals))
      (is (some #(= :vat-id-match (:signal %)) signals))))

  (testing "VAT ID mismatch is negative signal"
    (let [invoice {:type :invoice
                   :structured-data {:issuer {:vat-id "DE123"}}}
          po {:type :purchase-order
              :structured-data {:supplier {:vat-id "DE999"}}}
          signals (evidence/collect-signals invoice po)]
      (is (some #(= :vat-id-mismatch (:signal %)) signals))))

  (testing "amount within tolerance"
    (let [invoice {:type :invoice
                   :structured-data {:total 9900}}
          po {:type :purchase-order
              :structured-data {:total 10000}}
          signals (evidence/collect-signals invoice po)]
      (is (some #(= :amount-within-2pct (:signal %)) signals)))))


(deftest compute-score-test
  (testing "score normalizes to 0-1 range"
    (let [invoice {:type :invoice
                   :structured-data {:po-reference "PO-001"
                                     :issuer {:vat-id "DE123"}
                                     :total 10000}}
          po {:type :purchase-order
              :structured-data {:po-number "PO-001"
                                :supplier {:vat-id "DE123"}
                                :total 10000}}
          {:keys [score]} (evidence/compute-score invoice po)]
      (is (>= score 0.0))
      (is (<= score 1.0))
      (is (> score 0.7)))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.evidence-test]'
Expected: FAIL - namespace not found

Step 3: Write implementation

;; src/com/getorcha/workers/matching/evidence.clj
(ns com.getorcha.workers.matching.evidence
  "Evidence signal collection and scoring for document matching.")


;; Signal weights: positive signals build confidence, negative reduce it
(def evidence-signals
  {:po-number-exact     60   ; PO number on invoice matches PO
   :contract-ref-exact  55   ; Contract reference matches
   :po-ref-exact        55   ; PO reference on GRN matches
   :vat-id-match        30   ; Supplier VAT IDs match
   :iban-match          25   ; Supplier bank accounts match
   :amount-within-2pct  20   ; Amounts within 2% tolerance
   :amount-within-5pct  10   ; Amounts within 5% tolerance
   :supplier-name-fuzzy 15   ; Supplier names >0.8 similarity
   :date-in-validity    10   ; Document date within validity period
   :vat-id-mismatch    -40}) ; VAT IDs present but don't match


(def match-thresholds
  {:high   0.70
   :medium 0.50
   :low    0.30})


(defn- get-vat-id
  "Extract VAT ID from document based on type."
  [{:keys [type structured-data]}]
  (case type
    :invoice (get-in structured-data [:issuer :vat-id])
    :purchase-order (get-in structured-data [:supplier :vat-id])
    :contract (get-in structured-data [:party-b :vat-id])
    :goods-received-note (get-in structured-data [:supplier :vat-id])
    nil))


(defn- get-total
  "Extract total amount from document."
  [{:keys [structured-data]}]
  (:total structured-data))


(defn- get-po-number
  "Extract PO number/reference from document."
  [{:keys [type structured-data]}]
  (case type
    :invoice (:po-reference structured-data)
    :purchase-order (:po-number structured-data)
    :goods-received-note (:po-reference structured-data)
    nil))


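(defn- within-tolerance?
  "True when a differs from b by at most tolerance-pct percent of b.
   Returns nil when either value is missing or b is not positive."
  [a b tolerance-pct]
  (when (and a b (pos? b))
    ;; Compare as doubles: dividing BigDecimals can throw on a non-terminating
    ;; expansion, and dividing integers yields ratios.
    (<= (abs (- (double a) (double b)))
        (* (double b) (/ tolerance-pct 100.0)))))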


(defn collect-signals
  "Collect all evidence signals between two documents."
  [doc-a doc-b]
  (let [vat-a        (get-vat-id doc-a)
        vat-b        (get-vat-id doc-b)
        po-num-a     (get-po-number doc-a)
        po-num-b     (get-po-number doc-b)
        total-a      (get-total doc-a)
        total-b      (get-total doc-b)
        within-2pct? (within-tolerance? total-a total-b 2)
        within-5pct? (within-tolerance? total-a total-b 5)
        signal       (fn [k value]
                       {:signal k :value value :weight (get evidence-signals k)})]
    ;; Build the vector functionally. Calling conj! on a transient and
    ;; discarding its return value is not guaranteed to keep the additions.
    (cond-> []
      (and po-num-a po-num-b (= po-num-a po-num-b))
      (conj (signal :po-number-exact po-num-a))

      (and vat-a vat-b (= vat-a vat-b))
      (conj (signal :vat-id-match vat-a))

      (and vat-a vat-b (not= vat-a vat-b))
      (conj (signal :vat-id-mismatch (str vat-a " vs " vat-b)))

      within-2pct?
      (conj (signal :amount-within-2pct (str total-a " ~ " total-b)))

      ;; Only report the wider band when the tighter one did not apply.
      (and within-5pct? (not within-2pct?))
      (conj (signal :amount-within-5pct (str total-a " ~ " total-b))))))


(defn compute-score
  "Compute match score from evidence signals. Returns {:score :evidence}."
  [doc-a doc-b]
  (let [signals (collect-signals doc-a doc-b)
        raw-score (reduce + 0 (map :weight signals))
        normalized (-> raw-score (/ 100.0) (max 0.0) (min 1.0))]
    {:score normalized
     :evidence signals}))
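
The normalization divides the summed weights by 100 and clamps to the 0-1 range, so strong evidence saturates at 1.0 and a lone negative signal floors at 0.0. The sketch below works through a few combinations and labels each score with the bands from match-thresholds. What the worker does with each band is not specified in this section, so `confidence-band` is a hypothetical helper that only labels the score.

```clojure
;; Sketch only: a subset of the weights above, plus a hypothetical band helper.

(def weights
  {:po-number-exact 60 :vat-id-match 30 :amount-within-2pct 20 :vat-id-mismatch -40})

(def thresholds {:high 0.70 :medium 0.50 :low 0.30})

(defn normalize
  "Sum the weights of the given signal keys, divide by 100, clamp to [0, 1]."
  [signal-keys]
  (-> (reduce + 0 (map weights signal-keys))
      (/ 100.0)
      (max 0.0)
      (min 1.0)))

(defn confidence-band
  "Label a normalized score with the highest threshold it reaches."
  [score]
  (cond
    (>= score (:high thresholds))   :high
    (>= score (:medium thresholds)) :medium
    (>= score (:low thresholds))    :low
    :else                           :none))

;; 60 + 30 + 20 = 110, clamped to 1.0
(normalize [:po-number-exact :vat-id-match :amount-within-2pct])
;; 60 - 40 + 20 = 40, giving 0.4
(normalize [:po-number-exact :vat-id-mismatch :amount-within-2pct])
```

A matching PO number with a conflicting VAT ID lands in the :low band, which shows how strongly the negative weight pulls an otherwise solid match down.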

Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.evidence-test]'
Expected: PASS

Step 5: Commit

git add src/com/getorcha/workers/matching/evidence.clj test/com/getorcha/workers/matching/evidence_test.clj
git commit -m "feat(matching): add evidence signal collection and scoring"

Task 7: Candidate Retrieval

Files:

Step 1: Write test for matchable types

;; test/com/getorcha/workers/matching/candidates_test.clj
(ns com.getorcha.workers.matching.candidates-test
  (:require [clojure.test :refer [deftest is testing use-fixtures]]
            [com.getorcha.workers.matching.candidates :as candidates]
            [com.getorcha.test.fixtures :as fixtures]))


(use-fixtures :once fixtures/with-running-system)
(use-fixtures :each fixtures/with-db-rollback)


(deftest get-matchable-types-test
  (testing "invoice can match PO and contract"
    (is (= #{:purchase-order :contract}
           (candidates/get-matchable-types :invoice))))

  (testing "purchase-order can match invoice and contract"
    (is (= #{:invoice :contract}
           (candidates/get-matchable-types :purchase-order))))

  (testing "goods-received-note can only match purchase-order"
    (is (= #{:purchase-order}
           (candidates/get-matchable-types :goods-received-note)))))


(deftest find-candidates-by-type-test
  (testing "finds candidates of correct type within same legal entity"
    (let [le-id (:legal-entity/id (fixtures/create-legal-entity!))
          _invoice (fixtures/create-document! {:legal-entity-id le-id
                                               :type "invoice"
                                               :structured-data {:issuer {:name "ACME"}}})
          po (fixtures/create-document! {:legal-entity-id le-id
                                         :type "purchase-order"
                                         :structured-data {:supplier {:name "ACME"}}})
          candidates (candidates/find-candidates-by-type
                       fixtures/*db*
                       {:legal-entity-id le-id
                        :types [:purchase-order]
                        :exclude-id nil})]
      (is (= 1 (count candidates)))
      (is (= (:document/id po) (:document/id (first candidates)))))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.candidates-test]'
Expected: FAIL - namespace not found

Step 3: Write implementation

;; src/com/getorcha/workers/matching/candidates.clj
(ns com.getorcha.workers.matching.candidates
  "Candidate retrieval for document matching."
  (:require [com.getorcha.db.sql :as db.sql]
            [com.getorcha.workers.matching.searchable-text :as searchable]
            [honey.sql.helpers :as h]))


(def ^:private matchable-pairs
  "Valid document type pairs for matching."
  #{#{:invoice :purchase-order}
    #{:invoice :contract}
    #{:purchase-order :contract}
    #{:goods-received-note :purchase-order}})


(defn get-matchable-types
  "Get document types that can match with the given type."
  [doc-type]
  (->> matchable-pairs
       (filter #(contains? % doc-type))
       (mapcat identity)
       (remove #(= % doc-type))
       set))


(defn find-candidates-by-type
  "Find candidate documents by type within same legal entity."
  [db {:keys [legal-entity-id types exclude-id]}]
  (db.sql/execute! db
    (-> (h/select :*)
        (h/from :document)
        (h/where [:and
                  [:= :legal-entity-id legal-entity-id]
                  [:in :type (map name types)]
                  [:is-not :structured-data nil]
                  (when exclude-id
                    [:<> :id exclude-id])])
        (h/limit 50))))


(defn find-candidates
  "Find candidate documents for matching.
   Uses simple type-based query; hybrid search integration is future work."
  [db _search-config doc]
  (let [matchable-types (get-matchable-types (:type doc))]
    (find-candidates-by-type db
                             {:legal-entity-id (:legal-entity-id doc)
                              :types matchable-types
                              :exclude-id (:id doc)})))
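
Because matchable-pairs is a set of unordered pairs, the same data answers two questions: which types a document may match, and whether a specific pair is allowed. A minimal sketch follows. `matchable-index` and `matchable?` are hypothetical helpers, shown as an alternative to filtering the set on every call.

```clojure
;; Sketch only: the pair set is copied from above; the helpers are illustrative.

(def matchable-pairs
  #{#{:invoice :purchase-order}
    #{:invoice :contract}
    #{:purchase-order :contract}
    #{:goods-received-note :purchase-order}})

(def matchable-index
  "Map from document type to the set of types it may match, built once."
  (reduce (fn [index pair]
            (let [[a b] (seq pair)]
              (-> index
                  (update a (fnil conj #{}) b)
                  (update b (fnil conj #{}) a))))
          {}
          matchable-pairs))

(defn matchable?
  "True when the two types form an allowed pair, in either order."
  [type-a type-b]
  (contains? matchable-pairs #{type-a type-b}))

(matchable-index :purchase-order)
(matchable? :invoice :goods-received-note)
```

A pair check like this could guard evidence scoring so that an invoice is never scored against a goods-received note.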

Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.candidates-test]'
Expected: PASS

Step 5: Commit

git add src/com/getorcha/workers/matching/candidates.clj test/com/getorcha/workers/matching/candidates_test.clj
git commit -m "feat(matching): add candidate retrieval with matchable type filtering"

Task 8: LLM Match Decision

Files:

Step 1: Write test for prompt building and response parsing

;; test/com/getorcha/workers/matching/llm_decision_test.clj
(ns com.getorcha.workers.matching.llm-decision-test
  (:require [clojure.string]
            [clojure.test :refer [deftest is testing]]
            [com.getorcha.workers.matching.llm-decision :as llm-decision]))


(deftest format-document-summary-test
  (testing "formats invoice summary"
    (let [doc {:type :invoice
               :structured-data {:invoice-number "INV-001"
                                 :invoice-date "2024-02-20"
                                 :issuer {:name "ACME Corp"
                                          :vat-id "DE123"}
                                 :total 15000
                                 :currency "EUR"}}
          summary (llm-decision/format-document-summary doc)]
      (is (clojure.string/includes? summary "Invoice"))
      (is (clojure.string/includes? summary "INV-001"))
      (is (clojure.string/includes? summary "ACME Corp"))
      (is (clojure.string/includes? summary "15000"))))

  (testing "formats purchase-order summary"
    (let [doc {:type :purchase-order
               :structured-data {:po-number "PO-001"
                                 :supplier {:name "Widgets Inc"}
                                 :total 20000}}
          summary (llm-decision/format-document-summary doc)]
      (is (clojure.string/includes? summary "Purchase Order"))
      (is (clojure.string/includes? summary "PO-001")))))


(deftest build-match-prompt-test
  (testing "builds prompt with source and candidates"
    (let [source {:type :invoice :structured-data {:invoice-number "INV-001"}}
          candidates [{:doc {:type :purchase-order :structured-data {:po-number "PO-001"}}
                       :score 0.65
                       :evidence [{:signal :vat-id-match}]}]
          {:keys [system user]} (llm-decision/build-match-prompt source candidates)]
      (is (string? system))
      (is (clojure.string/includes? user "Source Document"))
      (is (clojure.string/includes? user "Candidate")))))


(deftest parse-llm-response-test
  (testing "parses valid JSON response"
    (let [response "{\"matches\": [{\"candidate\": 1, \"confidence\": \"high\", \"reasoning\": \"PO number matches\"}]}"
          parsed (llm-decision/parse-llm-response response)]
      (is (= 1 (count (:matches parsed))))
      (is (= "high" (-> parsed :matches first :confidence)))))

  (testing "handles no matches"
    (let [response "{\"matches\": []}"
          parsed (llm-decision/parse-llm-response response)]
      (is (empty? (:matches parsed)))))

  (testing "handles malformed JSON gracefully"
    (let [response "not json at all"
          parsed (llm-decision/parse-llm-response response)]
      (is (empty? (:matches parsed))))))
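
The response shape in these tests identifies a match by its 1-based candidate number, so the caller still has to map that number back to a document. A minimal sketch of that step is below, assuming the candidates are passed in the same order they were listed in the prompt. `resolve-matches` is a hypothetical helper, and dropping invalid entries silently is a choice made for this sketch.

```clojure
;; Sketch only: `resolve-matches` is not part of the plan.

(defn resolve-matches
  "Map 1-based candidate numbers from a parsed LLM response back to the
   candidate docs. Entries with a missing, non-integer, or out-of-range
   candidate number are dropped."
  [parsed candidates]
  (let [candidates (vec candidates)]
    (vec
      (for [{:keys [candidate confidence reasoning]} (:matches parsed)
            :when (and (integer? candidate)
                       (<= 1 candidate (count candidates)))]
        {:doc        (:doc (nth candidates (dec candidate)))
         :confidence confidence
         :reasoning  reasoning}))))

;; Hypothetical inputs in the shapes used by the tests above.
(def sample-candidates
  [{:doc {:id :po-1} :score 0.65}
   {:doc {:id :po-2} :score 0.40}])

(resolve-matches {:matches [{:candidate 1 :confidence "high" :reasoning "PO number matches"}
                            {:candidate 5 :confidence "low" :reasoning "out of range"}]}
                 sample-candidates)
```

Validating the index matters because a model can return a candidate number that was never offered.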

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.llm-decision-test]'
Expected: FAIL - namespace not found

Step 3: Write implementation

;; src/com/getorcha/workers/matching/llm_decision.clj
(ns com.getorcha.workers.matching.llm-decision
  "LLM-based document matching decision making."
  (:require [cheshire.core :as json]
            [clojure.string :as str]
            [com.getorcha.workers.llm :as llm]))


(defn format-document-summary
  "Format a document for LLM prompt."
  [{:keys [type structured-data]}]
  (case type
    :invoice
    (format "Type: Invoice\nNumber: %s\nDate: %s\nSupplier: %s (VAT: %s)\nTotal: %s %s"
            (:invoice-number structured-data "N/A")
            (:invoice-date structured-data "N/A")
            (get-in structured-data [:issuer :name] "N/A")
            (get-in structured-data [:issuer :vat-id] "N/A")
            (:total structured-data "N/A")
            (:currency structured-data ""))

    :purchase-order
    (format "Type: Purchase Order\nNumber: %s\nDate: %s\nSupplier: %s (VAT: %s)\nTotal: %s %s\nContract Ref: %s"
            (:po-number structured-data "N/A")
            (:po-date structured-data "N/A")
            (get-in structured-data [:supplier :name] "N/A")
            (get-in structured-data [:supplier :vat-id] "N/A")
            (:total structured-data "N/A")
            (:currency structured-data "")
            (:contract-reference structured-data "none"))

    :contract
    (format "Type: Contract\nNumber: %s\nParty: %s (VAT: %s)\nEffective: %s to %s"
            (:contract-number structured-data "N/A")
            (get-in structured-data [:party-b :name] "N/A")
            (get-in structured-data [:party-b :vat-id] "N/A")
            (:effective-date structured-data "N/A")
            (:expiration-date structured-data "N/A"))

    :goods-received-note
    (format "Type: Goods Received Note\nNumber: %s\nDate: %s\nSupplier: %s\nPO Ref: %s"
            (:grn-number structured-data "N/A")
            (:receipt-date structured-data "N/A")
            (get-in structured-data [:supplier :name] "N/A")
            (:po-reference structured-data "N/A"))

    ;; Unknown or missing type: `name` throws on nil, so guard against it.
    (str "Type: " (if type (name type) "unknown"))))


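(defn- candidate->summary-doc
  "Candidates may be plain maps (:type, :structured-data) or database rows
   (:document/type, :document/structured-data); normalise to the plain shape."
  [doc]
  {:type            (or (:type doc) (some-> (:document/type doc) keyword))
   :structured-data (or (:structured-data doc) (:document/structured-data doc))})


(defn- format-candidates
  "Format candidates list for LLM prompt."
  [candidates]
  (->> candidates
       (map-indexed
         (fn [i {:keys [doc score evidence]}]
           (str "### Candidate " (inc i) "\n"
                (format-document-summary (candidate->summary-doc doc))
                "\nPreliminary score: " (format "%.2f" (double score))
                "\nEvidence: " (pr-str (map :signal evidence)))))
       (str/join "\n\n---\n\n")))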


(defn build-match-prompt
  "Build LLM prompt for match decision."
  [source-doc candidates]
  {:system "You are a document matching assistant for financial documents.
Determine which candidate document(s) match the source document.
Consider: supplier identity, amounts, dates, reference numbers.
Be conservative - only confirm matches you're confident about."

   :user (str "## Source Document\n"
              (format-document-summary source-doc)
              "\n\n## Candidates\n\n"
              (format-candidates candidates)
              "\n\n## Task\n"
              "Which candidate(s) match the source document? Return JSON:\n"
              "{\"matches\": [{\"candidate\": 1, \"confidence\": \"high|medium|low\", \"reasoning\": \"...\"}]}\n\n"
              "If none match confidently, return {\"matches\": []}")})


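(defn parse-llm-response
  "Parse the LLM's JSON response.
   Returns {:matches []} when the response is nil, malformed, or not a JSON
   object with a :matches list."
  [response]
  (try
    (let [parsed (json/parse-string response true)]
      (if (and (map? parsed) (sequential? (:matches parsed)))
        parsed
        {:matches []}))
    (catch Exception _e
      {:matches []})))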


(defn llm-match-decision
  "Ask LLM to decide which candidates match the source document.
   Returns {:matches [{:candidate :confidence :reasoning}]}"
  [llm-config source-doc candidates]
  (let [{:keys [system user]} (build-match-prompt source-doc candidates)
        response (llm/complete llm-config
                               {:system system
                                :messages [{:role "user" :content user}]})]
    (parse-llm-response (:content response))))
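
Some models wrap a JSON reply in a Markdown code fence, which json/parse-string rejects, so parse-llm-response would silently fall back to {:matches []}. The following is a minimal sketch of a pre-processing step, assuming such wrapping can occur with the configured model. strip-code-fence is a hypothetical helper and is not part of workers/llm.clj.

```clojure
(ns example.strip-code-fence
  (:require [clojure.string :as str]))

(defn strip-code-fence
  "Hypothetical helper: returns the text inside a single Markdown code fence,
   or the trimmed input when the whole string is not fenced."
  [s]
  (let [trimmed  (str/trim (or s ""))
        ;; `{3} matches three backticks; [a-zA-Z]* skips an optional language tag.
        [_ body] (re-matches #"(?s)`{3}[a-zA-Z]*\s*(.*?)\s*`{3}" trimmed)]
    (or body trimmed)))
```

If adopted, it would sit in front of the parser, for example (parse-llm-response (strip-code-fence (:content response))).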

Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.llm-decision-test]'
Expected: PASS

Step 5: Commit

git add src/com/getorcha/workers/matching/llm_decision.clj test/com/getorcha/workers/matching/llm_decision_test.clj
git commit -m "feat(matching): add LLM-based match decision making"

Task 9: Core Matching Logic

Files:

Step 1: Write test for decision logic and cluster assignment

;; test/com/getorcha/workers/matching/core_test.clj
(ns com.getorcha.workers.matching.core-test
  (:require [clojure.test :refer [deftest is testing use-fixtures]]
            [com.getorcha.workers.matching.core :as matching]
            [com.getorcha.db.document-matching :as db.matching]
            [com.getorcha.test.fixtures :as fixtures]))


(use-fixtures :once fixtures/with-running-system)
(use-fixtures :each fixtures/with-db-rollback)


(deftest decide-matches-test
  (testing "single high-confidence candidate matches without LLM"
    (let [candidates [{:doc {:id (random-uuid) :type :purchase-order}
                       :score 0.85
                       :evidence [{:signal :po-number-exact}]}]
          result (matching/decide-matches nil nil candidates)]
      (is (= 1 (count result)))
      (is (= "rule-based" (:match-method (first result))))))

  (testing "no candidates above threshold returns empty"
    (let [candidates [{:doc {:id (random-uuid)} :score 0.20 :evidence []}]]
      (is (empty? (matching/decide-matches nil nil candidates))))))


(deftest cluster-assignment-test
  (testing "creates new cluster when neither doc has one"
    (let [le-id (:legal-entity/id (fixtures/create-legal-entity!))
          doc-a (fixtures/create-document! {:legal-entity-id le-id :type "invoice"})
          doc-b (fixtures/create-document! {:legal-entity-id le-id :type "purchase-order"})]
      (matching/assign-cluster! fixtures/*db* (:document/id doc-a) (:document/id doc-b))
      (let [docs (db.matching/get-documents-by-ids fixtures/*db*
                                                    [(:document/id doc-a) (:document/id doc-b)])]
        (is (every? :document/cluster-id docs))
        (is (= (:document/cluster-id (first docs))
               (:document/cluster-id (second docs)))))))

  (testing "assigns existing cluster when one doc has it"
    (let [le-id (:legal-entity/id (fixtures/create-legal-entity!))
          cluster-id (random-uuid)
          doc-a (fixtures/create-document! {:legal-entity-id le-id :type "invoice"})
          doc-b (fixtures/create-document! {:legal-entity-id le-id :type "purchase-order"})
          _ (db.matching/set-cluster-id! fixtures/*db* [(:document/id doc-a)] cluster-id)]
      (matching/assign-cluster! fixtures/*db* (:document/id doc-a) (:document/id doc-b))
      (let [docs (db.matching/get-documents-by-ids fixtures/*db*
                                                    [(:document/id doc-a) (:document/id doc-b)])
            doc-b-updated (first (filter #(= (:document/id doc-b) (:document/id %)) docs))]
        (is (= cluster-id (:document/cluster-id doc-b-updated)))))))
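
The two decide-matches cases above sit at opposite ends of the score range: 0.85 is accepted without the LLM and 0.20 is rejected. A rough sketch of the tiers involved follows. The threshold values are placeholders chosen for illustration; the real ones are defined in evidence/match-thresholds and may differ.

```clojure
(def assumed-thresholds
  "Placeholder values for illustration only (assumption, not from the codebase)."
  {:high 0.70 :medium 0.50 :low 0.30})

(defn score-tier
  "Classify a score into the tier that drives the matching decision."
  [score]
  (let [{:keys [high medium low]} assumed-thresholds]
    (cond
      (>= score high)   :high      ; a sole :high candidate is matched rule-based
      (>= score medium) :medium    ; ambiguous: LLM decides, or top candidate if no LLM
      (>= score low)    :low       ; kept as a candidate, needs the LLM to confirm
      :else             :discard))) ; filtered out before any decision is made
```

With these placeholder values, (score-tier 0.85) falls in :high and (score-tier 0.20) in :discard, mirroring the two test cases.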

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.core-test]'
Expected: FAIL - namespace not found

Step 3: Write implementation

;; src/com/getorcha/workers/matching/core.clj
(ns com.getorcha.workers.matching.core
  "Core document matching logic."
  (:require [com.getorcha.db.document-matching :as db.matching]
            [com.getorcha.workers.matching.candidates :as candidates]
            [com.getorcha.workers.matching.evidence :as evidence]
            [com.getorcha.workers.matching.llm-decision :as llm-decision]))


(defn- score-candidates
  "Score all candidates against source document."
  [source-doc candidate-docs]
  (->> candidate-docs
       (map (fn [candidate]
              (let [candidate-doc {:type (keyword (:document/type candidate))
                                   :structured-data (:document/structured-data candidate)}
                    {:keys [score evidence]} (evidence/compute-score source-doc candidate-doc)]
                {:doc candidate
                 :score score
                 :evidence evidence})))
       (filter #(>= (:score %) (:low evidence/match-thresholds)))
       (sort-by :score >)))


(defn decide-matches
  "Decide which candidates to match based on scores.
   Expects candidates sorted by descending score (as produced by score-candidates).
   Returns a vector of {:doc :score :evidence :match-method}."
  [llm-config source-doc candidates]
  (let [candidates      (vec candidates)
        high-threshold  (:high evidence/match-thresholds)
        high-candidates (filterv #(>= (:score %) high-threshold) candidates)]
    (cond
      (empty? candidates)
      []

      ;; Exactly one strong candidate: accept it without consulting the LLM.
      (= 1 (count high-candidates))
      [(assoc (first high-candidates) :match-method "rule-based")]

      ;; Ambiguous (zero or several strong candidates): let the LLM decide.
      llm-config
      (let [llm-result (llm-decision/llm-match-decision llm-config source-doc candidates)]
        (->> (:matches llm-result)
             (filter #(#{"high" "medium"} (:confidence %)))
             ;; Candidate numbers are 1-based; skip any that are out of range.
             (keep (fn [{:keys [candidate]}]
                     (when (integer? candidate)
                       (some-> (get candidates (dec candidate))
                               (assoc :match-method "llm")))))
             distinct
             vec))

      ;; No LLM configured: fall back to the top candidate if it clears :medium.
      (>= (:score (first candidates)) (:medium evidence/match-thresholds))
      [(assoc (first candidates) :match-method "rule-based")]

      :else
      [])))


(defn assign-cluster!
  "Assign cluster to matched documents."
  [db doc-a-id doc-b-id]
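  ;; Look documents up by id; the query is not assumed to return rows in input order.
  (let [docs-by-id (into {}
                         (map (juxt :document/id identity))
                         (db.matching/get-documents-by-ids db [doc-a-id doc-b-id]))
        cluster-a  (:document/cluster-id (get docs-by-id doc-a-id))
        cluster-b  (:document/cluster-id (get docs-by-id doc-b-id))]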
    (cond
      (and cluster-a cluster-b (= cluster-a cluster-b))
      nil

      (and cluster-a cluster-b)
      (let [target-cluster cluster-a
            docs-to-update (db.matching/get-cluster-documents db cluster-b)]
        (db.matching/set-cluster-id! db (map :document/id docs-to-update) target-cluster))

      cluster-a
      (db.matching/set-cluster-id! db [doc-b-id] cluster-a)

      cluster-b
      (db.matching/set-cluster-id! db [doc-a-id] cluster-b)

      :else
      (let [new-cluster (random-uuid)]
        (db.matching/set-cluster-id! db [doc-a-id doc-b-id] new-cluster)))))


(defn create-match!
  "Create a match between source and candidate, update clusters."
  [db source-doc {:keys [doc score evidence match-method]}]
  (db.matching/create-match! db
                             {:document-a-id (:id source-doc)
                              :document-b-id (:document/id doc)
                              :confidence (bigdec score)
                              :match-method match-method
                              :evidence evidence})
  (assign-cluster! db (:id source-doc) (:document/id doc)))


(defn match-document!
  "Main entry point: match a document against candidates."
  [db search-config llm-config doc]
  (db.matching/delete-matches-for-document! db (:id doc))
  (db.matching/set-cluster-id! db [(:id doc)] nil)

  (let [candidate-docs (candidates/find-candidates db search-config doc)]
    (when (seq candidate-docs)
      (let [scored (score-candidates doc candidate-docs)
            by-type (group-by #(keyword (:document/type (:doc %))) scored)]
        (doseq [[_doc-type type-candidates] by-type]
          (let [matches (decide-matches llm-config doc type-candidates)]
            (doseq [match matches]
              (create-match! db doc match))))))))
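
The four branches of assign-cluster! can be easier to reason about without a database. Below is a pure, in-memory sketch of the same rules, where a plain map from document id to cluster id stands in for the document table. merge-clusters and its new-cluster-id argument are hypothetical names used only for this illustration.

```clojure
(defn merge-clusters
  "Pure model of assign-cluster!. `clusters` maps document id -> cluster id
   (or nil). `new-cluster-id` is used only when neither document has a cluster."
  [clusters doc-a doc-b new-cluster-id]
  (let [cluster-a (get clusters doc-a)
        cluster-b (get clusters doc-b)]
    (cond
      ;; Already in the same cluster: nothing to do.
      (and cluster-a (= cluster-a cluster-b))
      clusters

      ;; Two different clusters: move every member of B's cluster into A's.
      (and cluster-a cluster-b)
      (into {}
            (map (fn [[id c]] [id (if (= c cluster-b) cluster-a c)]))
            clusters)

      ;; Only one side has a cluster: the other document joins it.
      cluster-a (assoc clusters doc-b cluster-a)
      cluster-b (assoc clusters doc-a cluster-b)

      ;; Neither has a cluster: both get the fresh one.
      :else (assoc clusters doc-a new-cluster-id doc-b new-cluster-id))))
```

For example, merging an invoice in :c1 with a PO in :c2 also moves a GRN that was in :c2, which is the behaviour the get-cluster-documents call provides in the real implementation.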

Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.core-test]'
Expected: PASS

Step 5: Commit

git add src/com/getorcha/workers/matching/core.clj test/com/getorcha/workers/matching/core_test.clj
git commit -m "feat(matching): add core matching logic with cluster management"

Task 10: SQS Worker

Files:

Step 1: Create worker namespace

;; src/com/getorcha/workers/matching/worker.clj
(ns com.getorcha.workers.matching.worker
  "SQS worker for document matching."
  (:require [com.getorcha.aws :as aws]
            [com.getorcha.db.sql :as db.sql]
            [com.getorcha.workers.matching.core :as matching]
            [honey.sql.helpers :as h]
            [integrant.core :as ig]
            [taoensso.timbre :as log]))


(defn- fetch-document
  "Fetch document with structured data."
  [db document-id]
  (first
    (db.sql/execute! db
      (-> (h/select :*)
          (h/from :document)
          (h/where [:= :id document-id])))))


(defn process-message!
  "Process a single SQS message for document matching."
  [{:keys [db-pool search-config llm-config]} message]
  (let [{:keys [document-id]} (aws/parse-message-body message)]
    (log/info "Processing document for matching" {:document-id document-id})
    (try
      (let [doc (fetch-document db-pool document-id)]
        (if (and doc (:document/structured-data doc))
          (do
            (matching/match-document! db-pool search-config llm-config
                                      {:id (:document/id doc)
                                       :type (keyword (:document/type doc))
                                       :legal-entity-id (:document/legal-entity-id doc)
                                       :structured-data (:document/structured-data doc)
                                       :cluster-id (:document/cluster-id doc)})
            (log/info "Document matching complete" {:document-id document-id}))
          (log/warn "Document not found or missing structured data" {:document-id document-id})))
      (catch Exception e
        (log/error e "Failed to process document for matching" {:document-id document-id})
        (throw e)))))


(defn- poll-loop!
  "Main polling loop for SQS messages."
  [{:keys [aws-state] :as context} running?]
  (let [queue-url (get-in aws-state [:queue-urls :matching])]
    (log/info "Starting matching worker polling loop" {:queue (get-in aws-state [:queues :matching])})
    (while @running?
      (try
        (let [messages (aws/receive-messages aws-state queue-url
                                             {:max-messages 5
                                              :wait-time-seconds 20})]
          (doseq [message messages]
            (try
              (process-message! context message)
              (aws/delete-message! aws-state queue-url message)
              (catch Exception e
                (log/error e "Failed to process message, will retry")))))
        (catch Exception e
          (log/error e "Error in polling loop")
          (Thread/sleep 5000))))))


(defmethod ig/init-key ::orchestrator
  [_ {:keys [aws-state db-pool search-config llm-config] :as config}]
  (log/info "Initializing document matching orchestrator")
  (let [running? (atom true)
        context {:aws-state aws-state
                 :db-pool db-pool
                 :search-config search-config
                 :llm-config llm-config}
        thread (Thread. #(poll-loop! context running?))]
    (.start thread)
    {:thread thread
     :running? running?
     :config config}))


(defmethod ig/halt-key! ::orchestrator
  [_ {:keys [thread running?]}]
  (log/info "Stopping document matching orchestrator")
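  (reset! running? false)
  ;; The loop re-checks running? only between polls, so a 20-second long poll
  ;; can outlive this 5-second join; the thread then exits on its own afterwards.
  (.join thread 5000))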
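
The retry behaviour of the polling loop rests on one rule: a message is deleted only after it has been processed successfully, so a failed message stays on the queue and SQS redelivers it once its visibility timeout expires. A dependency-free sketch of a single poll iteration is shown below. handle-batch! is a hypothetical name, and process! and delete! stand in for process-message! and aws/delete-message!.

```clojure
(defn handle-batch!
  "Model of one poll iteration. Returns the messages that were both
   processed and deleted; failed messages are left for redelivery."
  [process! delete! messages]
  (reduce (fn [done message]
            (try
              (process! message)
              (delete! message)
              (conj done message)
              (catch Exception _
                ;; Not deleted, so the queue will hand it out again later.
                done)))
          []
          messages))
```

Whether a repeatedly failing message is eventually moved aside depends on the queue having a redrive policy with a dead-letter queue, which this plan does not specify.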

Step 2: Commit worker

git add src/com/getorcha/workers/matching/worker.clj
git commit -m "feat(matching): add SQS worker for document matching"

Task 11: Integrant Configuration

Files:

Step 1: Add matching worker to config.edn

Add the following to config.edn, alongside the other worker configs:

:com.getorcha.workers.matching.worker/orchestrator
{:aws-state     #ig/ref :com.getorcha.aws/state
 :db-pool       #ig/ref :com.getorcha.db/pool
 :search-config nil  ; TODO: wire up when hybrid search is integrated
 :llm-config    #profile {:test nil
                          :default {:model "claude-3-haiku-20240307"}}}

Step 2: Commit config

git add resources/com/getorcha/config.edn
git commit -m "feat(matching): add matching worker to Integrant config"

Task 12: Integration Test

Files:

Step 1: Write integration test

;; test/com/getorcha/workers/matching/integration_test.clj
(ns com.getorcha.workers.matching.integration-test
  (:require [clojure.test :refer [deftest is testing use-fixtures]]
            [com.getorcha.db.document-matching :as db.matching]
            [com.getorcha.workers.matching.core :as matching]
            [com.getorcha.test.fixtures :as fixtures]))


(use-fixtures :once fixtures/with-running-system)
(use-fixtures :each fixtures/with-db-rollback)


(deftest invoice-po-matching-integration-test
  (testing "invoice with PO reference matches PO with same number"
    (let [le-id (:legal-entity/id (fixtures/create-legal-entity!))
          po (fixtures/create-document!
               {:legal-entity-id le-id
                :type "purchase-order"
                :structured-data {:po-number "PO-2024-001"
                                  :supplier {:name "ACME Corp"
                                             :vat-id "DE123456789"}
                                  :total 10000
                                  :currency "EUR"}})
          invoice (fixtures/create-document!
                    {:legal-entity-id le-id
                     :type "invoice"
                     :structured-data {:invoice-number "INV-2024-001"
                                       :po-reference "PO-2024-001"
                                       :issuer {:name "ACME Corp"
                                                :vat-id "DE123456789"}
                                       :total 10000
                                       :currency "EUR"}})]

      (matching/match-document! fixtures/*db* nil nil
                                {:id (:document/id invoice)
                                 :type :invoice
                                 :legal-entity-id le-id
                                 :structured-data (:document/structured-data invoice)
                                 :cluster-id nil})

      (let [matches (db.matching/get-matches-for-document fixtures/*db* (:document/id invoice))]
        (is (= 1 (count matches)))
        (is (>= (:document-match/confidence (first matches)) 0.7M)))

      (let [[inv-doc po-doc] (db.matching/get-documents-by-ids
                               fixtures/*db*
                               [(:document/id invoice) (:document/id po)])]
        (is (some? (:document/cluster-id inv-doc)))
        (is (= (:document/cluster-id inv-doc) (:document/cluster-id po-doc)))))))


(deftest no-match-when-different-suppliers-test
  (testing "invoice does not match PO with different supplier VAT"
    (let [le-id (:legal-entity/id (fixtures/create-legal-entity!))
          _po (fixtures/create-document!
                {:legal-entity-id le-id
                 :type "purchase-order"
                 :structured-data {:po-number "PO-2024-002"
                                   :supplier {:name "Different Corp"
                                              :vat-id "DE999999999"}
                                   :total 5000}})
          invoice (fixtures/create-document!
                    {:legal-entity-id le-id
                     :type "invoice"
                     :structured-data {:invoice-number "INV-2024-002"
                                       :issuer {:name "ACME Corp"
                                                :vat-id "DE123456789"}
                                       :total 5000}})]

      (matching/match-document! fixtures/*db* nil nil
                                {:id (:document/id invoice)
                                 :type :invoice
                                 :legal-entity-id le-id
                                 :structured-data (:document/structured-data invoice)
                                 :cluster-id nil})

      (let [matches (db.matching/get-matches-for-document fixtures/*db* (:document/id invoice))]
        (is (empty? matches))))))

Step 2: Run integration test

Run: clj -X:test:silent :nses '[com.getorcha.workers.matching.integration-test]'
Expected: PASS

Step 3: Commit

git add test/com/getorcha/workers/matching/integration_test.clj
git commit -m "test(matching): add end-to-end integration tests"

Task 13: Publish Document-Ready Event from Ingestion

Files:

Step 1: Find where ingestion completes successfully

In the ingestion orchestrator, find the point where valid_structured_data = true is set or where the ingestion status becomes completed.

Step 2: Add SQS message publish

After successful ingestion, publish the document ID to the matching queue:

(aws/send-message! aws-state
                   (get-in aws-state [:queue-urls :matching])
                   {:document-id document-id})
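
If the message body is serialised as JSON (an assumption; the wire format used by aws/send-message! and aws/parse-message-body is not shown in this plan), document-id reaches the worker as a string rather than a UUID, and comparing it against the uuid id column may fail. A small sketch of a coercion step the worker could apply before fetch-document follows. ->uuid is a hypothetical helper, and parse-uuid requires Clojure 1.11 or later.

```clojure
(defn ->uuid
  "Hypothetical helper: accepts a java.util.UUID or its string form and
   returns a UUID, or nil when the value is neither."
  [x]
  (cond
    (uuid? x)   x
    (string? x) (parse-uuid x) ; returns nil on malformed input
    :else       nil))
```

A nil result can then be treated like the existing "document not found" case and logged instead of being sent to the database.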

Step 3: Commit

git add src/com/getorcha/workers/ingestion/orchestrator.clj
git commit -m "feat(matching): publish document-ready event after ingestion"

Summary

Files created:

Files modified:

Not implemented (future work):