Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Ranked Candidate Retrieval Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Replace unranked candidate retrieval with hybrid BM25+semantic search, enriching searchable_text to include discriminating content (line items, deliverables, quantities).

Architecture: Enrich build-searchable-text with line items/deliverables, persist searchable_text + embedding during matching, replace find-candidates with search/search hybrid query, backfill existing documents.

Tech Stack: PostgreSQL (tsvector GIN, pgvector HNSW), Google Vertex AI text-multilingual-embedding-002, HoneySQL, com.getorcha.search hybrid search


Task 1: Enrich build-searchable-text with Line Items and Deliverables

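Files:
src/com/getorcha/workers/matching/searchable_text.clj
test/com/getorcha/workers/matching/searchable_text_test.clj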

Context: The current function includes only names, IDs, reference numbers, totals, and currency. It must also include line item descriptions, quantities, and units (invoices, POs, GRNs) and deliverable descriptions (contracts) so that BM25 can discriminate between documents from the same counterparty.

The join-non-nil helper already handles nil filtering and stringification. Line items are vectors of maps; deliverables are vectors of strings.

Step 1: Write failing tests for enriched content

Add test cases to searchable_text_test.clj for each document type with line items/deliverables:

(testing "invoice includes line item descriptions, quantities, and units"
  (let [doc {:type :invoice
             :structured-data {:issuer {:name "ACME Corp" :vat-id "DE123"}
                               :invoice-number "INV-001"
                               :total 15000.00
                               :currency "EUR"
                               :line-items [{:description "Lieferung von Biomethan"
                                             :quantity 1200000.0
                                             :unit "kWh Hs"
                                             :amount 15000.00}
                                            {:description "Transport"
                                             :quantity 1.0
                                             :unit "Stück"}]}}
        result (searchable/build-searchable-text doc)]
    (is (str/includes? result "Lieferung von Biomethan"))
    (is (str/includes? result "1200000"))
    (is (str/includes? result "kWh Hs"))
    (is (str/includes? result "Transport"))))

(testing "contract includes deliverable descriptions verbatim"
  (let [doc {:type :contract
             :structured-data {:counterparty {:name "ABO Kraft" :tax-id "DE302"}
                               :contract-number "CFG-ABO-001"
                               :total-value 250000
                               :currency "EUR"
                               :deliverables ["1.200.000 kWh (HS,N) Biomethan am Gastag 22.10.2025"
                                              "800.000 kWh (HS,N) Biomethan am Gastag 08.01.2026"]}}
        result (searchable/build-searchable-text doc)]
    (is (str/includes? result "1.200.000 kWh"))
    (is (str/includes? result "Biomethan"))
    (is (str/includes? result "22.10.2025"))
    (is (str/includes? result "800.000 kWh"))))

(testing "purchase-order includes line item descriptions and quantities"
  (let [doc {:type :purchase-order
             :structured-data {:supplier {:name "Widgets Inc" :vat-id "DE987"}
                               :po-number "PO-050"
                               :total-value 20000.00
                               :currency "EUR"
                               :line-items [{:description "Steel bolts M8x40"
                                             :quantity 5000
                                             :unit "Stück"}]}}
        result (searchable/build-searchable-text doc)]
    (is (str/includes? result "Steel bolts M8x40"))
    (is (str/includes? result "5000"))
    (is (str/includes? result "Stück"))))

(testing "goods-received-note includes line item descriptions and quantities"
  (let [doc {:type :goods-received-note
             :structured-data {:supplier {:name "Parts AG" :vat-id "DE444"}
                               :grn-number "GRN-100"
                               :po-reference "PO-050"
                               :line-items [{:description "Steel bolts M8x40"
                                             :quantity 4950
                                             :unit "Stück"}]}}
        result (searchable/build-searchable-text doc)]
    (is (str/includes? result "Steel bolts M8x40"))
    (is (str/includes? result "4950"))))

(testing "handles missing line-items and deliverables gracefully"
  (let [doc {:type :invoice
             :structured-data {:issuer {:name "Some Corp"}
                               :line-items nil}}
        result (searchable/build-searchable-text doc)]
    (is (str/includes? result "Some Corp"))
    (is (string? result))))

Step 2: Run tests to verify they fail

clj -X:test:silent :nses '[com.getorcha.workers.matching.searchable-text-test]'

Expected: FAIL — line item content not included in output.

Step 3: Implement enriched build-searchable-text

Add a helper to format line items, then append to each type's output:

(defn ^:private format-line-items
  "Format line items as pipe-separated strings: 'description quantity unit'."
  [line-items]
  (when (seq line-items)
    (->> line-items
         (map (fn [{:keys [description quantity unit]}]
                (join-non-nil " " description quantity unit)))
         (str/join " | "))))
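To make the expected output concrete, here is a self-contained sketch of the helper in action. The join-non-nil below is a stand-in written for this example only; the real helper is not shown in this plan and may differ. The sketch assumes the separator comes first, followed by the parts.

```clojure
(require '[clojure.string :as str])

;; Stand-in for the existing helper (assumption): drop nils, stringify, join.
(defn join-non-nil
  [sep & parts]
  (->> parts
       (remove nil?)
       (map str)
       (str/join sep)))

(defn format-line-items
  [line-items]
  (when (seq line-items)
    (->> line-items
         (map (fn [{:keys [description quantity unit]}]
                (join-non-nil " " description quantity unit)))
         (str/join " | "))))

(format-line-items [{:description "Steel bolts M8x40" :quantity 5000 :unit "Stück"}
                    {:description "Transport" :quantity 1.0}])
;; => "Steel bolts M8x40 5000 Stück | Transport 1.0"

(format-line-items nil)
;; => nil
```

Returning nil for empty input lets the caller pass the result straight to join-non-nil, which then omits the segment.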

Then, for each document type, append the formatted line items (invoices, POs, GRNs) or the deliverables (contracts) to the existing output using join-non-nil.

For contracts, deliverables are a vector of strings: join them with " | " and append.
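One possible shape for the contract branch is sketched below. The names format-deliverables and contract-searchable-text, as well as the exact field selection, are illustrative assumptions; the real build-searchable-text is not shown in this plan, so nil handling is done inline here rather than through join-non-nil.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: joins deliverable strings, returns nil when there are none.
(defn format-deliverables
  [deliverables]
  (when (seq deliverables)
    (str/join " | " deliverables)))

;; Illustrative contract branch: header fields first, deliverables last.
(defn contract-searchable-text
  [{:keys [counterparty contract-number total-value currency deliverables]}]
  (->> [(:name counterparty)
        (:tax-id counterparty)
        contract-number
        total-value
        currency
        (format-deliverables deliverables)]
       (remove nil?)
       (map str)
       (str/join " ")))

(contract-searchable-text
 {:counterparty    {:name "ABO Kraft" :tax-id "DE302"}
  :contract-number "CFG-ABO-001"
  :deliverables    ["1.200.000 kWh (HS,N) Biomethan am Gastag 22.10.2025"]})
;; => "ABO Kraft DE302 CFG-ABO-001 1.200.000 kWh (HS,N) Biomethan am Gastag 22.10.2025"
```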

Step 4: Run tests to verify they pass

clj -X:test:silent :nses '[com.getorcha.workers.matching.searchable-text-test]'

Expected: all PASS.

Step 5: Commit

git add src/com/getorcha/workers/matching/searchable_text.clj test/com/getorcha/workers/matching/searchable_text_test.clj
git commit -m "feat(matching): enrich searchable_text with line items and deliverables"

Task 2: Extend set-normalized-fields! to Persist searchable_text and embedding

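Files:
src/com/getorcha/db/document_matching.clj
src/com/getorcha/workers/matching/core.clj
test/com/getorcha/workers/matching/integration_test.clj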

Context: set-normalized-fields! currently persists normalized_counterparty and normalized_references. Extend it to also persist searchable_text and embedding. The embedding is a 768-dimensional float vector, written to a pgvector column via the HoneySQL expression [:cast (str (vec embedding)) :vector].

Step 1: Write failing test

Add to integration_test.clj:

(deftest searchable-text-populated-on-match-test
  (testing "matching populates searchable_text on the source document"
    (let [le-id  (helpers/create-legal-entity!)
          inv-id (create-document-with-type!
                   le-id :invoice
                   {:invoice-number "INV-ST-001"
                    :issuer         {:name   "Test Corp"
                                     :vat-id "DE111222333"}
                    :total          5000
                    :currency       "EUR"
                    :line-items     [{:description "Widget delivery"
                                      :quantity    100
                                      :unit        "Stück"}]})
          source {:id              inv-id
                  :type            :invoice
                  :legal-entity-id le-id
                  :structured-data {:invoice-number "INV-ST-001"
                                    :issuer         {:name   "Test Corp"
                                                     :vat-id "DE111222333"}
                                    :total          5000
                                    :currency       "EUR"
                                    :line-items     [{:description "Widget delivery"
                                                      :quantity    100
                                                      :unit        "Stück"}]}}]

      (matching/match-document! fixtures/*db* {} nil source)

      (let [doc (get-document inv-id)]
        (is (some? (:document/searchable-text doc))
            "searchable_text should be populated")
        (is (clojure.string/includes?
              (:document/searchable-text doc) "Widget delivery")
            "searchable_text should include line item descriptions")))))

Step 2: Run test to verify it fails

clj -X:test:silent :nses '[com.getorcha.workers.matching.integration-test]'

Expected: FAIL — searchable_text is nil.

Step 3: Implement changes

Either rename set-normalized-fields! to reflect its expanded scope, or keep the name and extend its arguments. Simplest: keep the name and take the fields as a single map that also carries searchable-text and embedding.

In db/document_matching.clj, update set-normalized-fields!:

(defn set-normalized-fields!
  "Update normalized fields, searchable text, and embedding for a document."
  [db document-id {:keys [counterparty references searchable-text embedding]}]
  (db.sql/execute! db
    {:update :document
     :set    (cond-> {:normalized-counterparty counterparty
                      :normalized-references   [:lift (or references [])]
                      :searchable-text         searchable-text}
               embedding (assoc :embedding [:cast (str (vec embedding)) :vector]))
     :where  [:= :id document-id]}))
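One detail worth checking when wiring this up: pgvector's documented text input format is a bracketed, comma-separated list such as [0.1,0.2], whereas Clojure's str prints vector elements separated by spaces. If the cast rejects the value, a small helper along the following lines builds the literal explicitly. The name ->vector-literal is hypothetical and not part of the codebase described here.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: render an embedding as a comma-separated pgvector literal.
(defn ->vector-literal
  [embedding]
  (str "[" (str/join "," embedding) "]"))

(->vector-literal [0.1 0.25 -3.0])
;; => "[0.1,0.25,-3.0]"

;; For comparison, plain str uses spaces between elements:
(str (vec [0.1 0.25 -3.0]))
;; => "[0.1 0.25 -3.0]"
```

It would slot into the HoneySQL expression as [:cast (->vector-literal embedding) :vector].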

In core.clj, update match-document! to build searchable_text and pass it through:

(let [counterparty   (normalize/extract-counterparty doc)
      references     (normalize/extract-references doc)
      searchable-txt (searchable-text/build-searchable-text doc)
      doc            (assoc doc
                       :normalized-counterparty counterparty
                       :normalized-references   references
                       :searchable-text         searchable-txt)]
  (db.matching/set-normalized-fields! db (:id doc)
    {:counterparty    counterparty
     :references      references
     :searchable-text searchable-txt}))

Note: embedding is NOT computed here yet — that's Task 3. This task only persists searchable_text.

Update all existing callers of set-normalized-fields! to use the new map signature. Check the backfill migration too.

Step 4: Run tests

clj -X:test:silent :nses '[com.getorcha.workers.matching.integration-test]'

Also run the full matching test suite to verify nothing broke:

clj -X:test:silent :nses '[com.getorcha.workers.matching.candidates-test com.getorcha.workers.matching.integration-test]'

Expected: all PASS.

Step 5: Commit

git add src/com/getorcha/db/document_matching.clj src/com/getorcha/workers/matching/core.clj test/com/getorcha/workers/matching/integration_test.clj
git commit -m "feat(matching): persist searchable_text during match-document!"

Task 3: Compute and Persist Embedding in match-document!

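Files:
src/com/getorcha/search.clj
src/com/getorcha/workers/matching/core.clj
test/com/getorcha/workers/matching/integration_test.clj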

Context: After building searchable_text, compute a RETRIEVAL_DOCUMENT embedding via Vertex AI and persist it alongside the other normalized fields. In tests, stub search/embed-document with with-redefs to avoid real API calls.

The existing search/embed-query is synchronous and returns a 768-dim vector. It uses the RETRIEVAL_QUERY task type. For document indexing, we need the RETRIEVAL_DOCUMENT task type. Either use search/call-embedding-api directly or add a public embed-document function to search.clj that wraps the single-text case with RETRIEVAL_DOCUMENT.

Step 1: Add embed-document to search.clj

(defn embed-document
  "Synchronous single-text embedding for document indexing.
   Returns 768-dimension vector optimized for retrieval document matching."
  [config text]
  (first (call-embedding-api config [text] "RETRIEVAL_DOCUMENT")))
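For orientation, the only difference between the query and document paths is the task type sent with each text. The sketch below builds a request body in the shape used by the Vertex AI text embeddings predict endpoint (instances carrying content and task_type). The function name embedding-request is hypothetical, and the internals of call-embedding-api are not shown in this plan, so treat this as an illustration rather than the actual implementation.

```clojure
;; Hypothetical: build the predict request body for a batch of texts.
(defn embedding-request
  [texts task-type]
  {:instances (mapv (fn [text]
                      {:content   text
                       :task_type task-type})
                    texts)})

(embedding-request ["ACME Corp INV-001"] "RETRIEVAL_DOCUMENT")
;; => {:instances [{:content "ACME Corp INV-001", :task_type "RETRIEVAL_DOCUMENT"}]}
```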

Step 2: Write failing test

In integration_test.clj, add a test that stubs the embedding API:

(deftest embedding-populated-on-match-test
  (testing "matching computes and persists embedding on the source document"
    (let [fake-embedding (vec (repeat 768 0.1))
          le-id  (helpers/create-legal-entity!)
          inv-id (create-document-with-type!
                   le-id :invoice
                   {:invoice-number "INV-EMB-001"
                    :issuer         {:name "Test Corp" :vat-id "DE111"}
                    :total          5000})
          source {:id              inv-id
                  :type            :invoice
                  :legal-entity-id le-id
                  :structured-data {:invoice-number "INV-EMB-001"
                                    :issuer         {:name "Test Corp" :vat-id "DE111"}
                                    :total          5000}}]

      (with-redefs [search/embed-document (constantly fake-embedding)]
        (matching/match-document! fixtures/*db* {} nil source))

      (let [doc (get-document inv-id)]
        (is (some? (:document/embedding doc))
            "embedding should be populated")))))

Step 3: Run test to verify it fails

clj -X:test:silent :nses '[com.getorcha.workers.matching.integration-test]'

Expected: FAIL, because embedding is still nil.

Step 4: Implement embedding computation in match-document!

In core.clj, after building searchable-text:

(let [counterparty    (normalize/extract-counterparty doc)
      references      (normalize/extract-references doc)
      searchable-text (searchable-text/build-searchable-text doc)
      embedding       (when (and (seq searchable-text) (seq search-config))
                        (search/embed-document search-config searchable-text))
      doc             (assoc doc
                        :normalized-counterparty counterparty
                        :normalized-references   references
                        :searchable-text         searchable-text)]
  (db.matching/set-normalized-fields! db (:id doc)
    {:counterparty    counterparty
     :references      references
     :searchable-text searchable-text
     :embedding       embedding}))

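Note: embedding failure propagates and fails the match; there is no try/catch. Retries are handled by the worker. See the Task 5 note. Because of the (seq search-config) guard, embedding is skipped entirely when search-config is empty, so the Step 2 test only passes if match-document! receives a non-empty search-config.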
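The guard's behaviour can be checked in isolation. In this sketch, maybe-embed is a hypothetical extraction of the when expression from Step 4, embed-fn stands in for search/embed-document so the code runs offline, and the :project key is an invented placeholder for whatever the real search-config contains.

```clojure
;; Hypothetical extraction of the guard used in match-document!.
(defn maybe-embed
  [embed-fn search-config searchable-text]
  (when (and (seq searchable-text) (seq search-config))
    (embed-fn search-config searchable-text)))

;; Stand-in for search/embed-document.
(def fake-embed (constantly [0.1 0.2 0.3]))

(maybe-embed fake-embed {} "ACME Corp INV-001")
;; => nil (empty config: embedding skipped)

(maybe-embed fake-embed {:project "demo"} "")
;; => nil (empty text: embedding skipped)

(maybe-embed fake-embed {:project "demo"} "ACME Corp INV-001")
;; => [0.1 0.2 0.3]
```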

Step 5: Run tests

clj -X:test:silent :nses '[com.getorcha.workers.matching.integration-test]'

Expected: all PASS.

Step 6: Commit

git add src/com/getorcha/search.clj src/com/getorcha/workers/matching/core.clj test/com/getorcha/workers/matching/integration_test.clj
git commit -m "feat(matching): compute and persist embedding during match-document!"

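Task 4: Replace find-candidates with Hybrid Search

Files:
src/com/getorcha/workers/matching/candidates.clj
test/com/getorcha/workers/matching/candidates_test.clj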

Context: Replace the plain SQL query with search/search hybrid search. The function signature stays the same: [db search-config doc]. The search-config parameter (previously _search-config) is now used.

Requires search/search to support the :where clause interface (done by user). The doc map must have :searchable-text populated by match-document! before this is called.

Step 1: Update find-candidates implementation

(ns com.getorcha.workers.matching.candidates
  "Candidate retrieval for document matching.

   Uses hybrid BM25+semantic search to find and rank candidate documents
   within the same legal entity, filtered by counterparty identity and
   matchable document types."
  (:require [com.getorcha.db.sql :as db.sql]
            [com.getorcha.search :as search]))


(defn find-candidates
  "Find candidate documents for matching using hybrid BM25+semantic search.

   Searches documents by:
   - Same legal entity
   - Matchable document types (e.g., invoice matches PO and contract)
   - Same normalized counterparty (exact match)
   - Has structured data

   Ranks results using Reciprocal Rank Fusion of BM25 text relevance
   and semantic embedding similarity. Returns top 50 candidates.

   Returns nil when the source document has no normalized counterparty
   or no searchable text for the query."
  [db search-config doc]
  (let [counterparty (:normalized-counterparty doc)
        query-text   (:searchable-text doc)]
    (when (and counterparty (seq query-text))
      (let [matchable-types (get-matchable-types (:type doc))]
        (search/search db
          {:table            :document
           :id-column        :id
           :embedding-column :embedding
           :text-column      :searchable-text
           :where            [:and
                              [:= :legal-entity-id (:legal-entity-id doc)]
                              [:in :type (mapv #(db.sql/->cast % :document-type) matchable-types)]
                              [:is-not :structured-data nil]
                              [:= :normalized-counterparty counterparty]
                              [:<> :id (:id doc)]]}
          query-text
          (merge search-config {:k 50 :semantic-k 200 :bm25-k 200}))))))
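The docstring above mentions Reciprocal Rank Fusion. As a reminder of what that ranking does, here is a minimal standalone sketch: each document scores the sum of 1/(k + rank) over the lists it appears in. The function names and the constant k = 60 are assumptions for illustration; how search/search actually fuses the BM25 and semantic result lists is not shown in this plan.

```clojure
;; Score each id by summing 1/(k + rank) across all ranked lists (rank starts at 1).
(defn rrf-scores
  [k ranked-lists]
  (apply merge-with +
         (for [ranked ranked-lists]
           (into {}
                 (map-indexed (fn [i id] [id (/ 1.0 (+ k (inc i)))]))
                 ranked))))

;; Return the top `limit` ids by fused score, best first.
(defn rrf-rank
  [k limit ranked-lists]
  (->> (rrf-scores k ranked-lists)
       (sort-by (comp - val))
       (map key)
       (take limit)))

(rrf-rank 60 2 [[:po-1 :contract-7 :po-9]    ; BM25 order
                [:contract-7 :po-3 :po-1]])  ; semantic order
;; => (:contract-7 :po-1)
```

A document that ranks well in both lists (:contract-7 here) beats one that tops only a single list, which is why over-fetching 200 candidates per branch before keeping 50 is useful.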

Step 2: Update tests

The existing candidates_test.clj tests use find-candidates with {} as search-config and documents that have no searchable_text or embedding populated. These tests need to be updated:

  1. Candidate documents need searchable_text populated (for BM25 to find them)
  2. The source document needs :searchable-text in its map
  3. search/search calls need to be stubbed with with-redefs since we don't have a real embedding API in tests, OR we test BM25-only mode

Approach: Stub search/embed-query to return a fake embedding vector. The BM25 branch works against real PostgreSQL (embedded postgres in tests), so it exercises the actual ts_rank path. The semantic branch returns results based on the fake embedding. RRF fuses both.
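The stubbing pattern described above looks roughly like this. Both functions are local stand-ins defined only for the example: embed-query plays the role of search/embed-query, and query-dimensions is a hypothetical caller. In the real tests the with-redefs target would be the var in com.getorcha.search.

```clojure
;; Stand-in for search/embed-query: the real one would call the embedding API.
(defn embed-query
  [_config _text]
  (throw (ex-info "would call the embedding API" {})))

;; Hypothetical caller that depends on embed-query.
(defn query-dimensions
  [config text]
  (count (embed-query config text)))

(def fake-embedding (vec (repeat 768 0.1)))

(with-redefs [embed-query (constantly fake-embedding)]
  (query-dimensions {} "Steel bolts M8x40"))
;; => 768
```

Because with-redefs rebinds the var root only for the duration of the body, the call to the code under test must happen inside the with-redefs form.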

Update create-document-with-type! to also set searchable_text:

(defn ^:private create-document-with-type!
  [legal-entity-id type structured-data & {:keys [searchable-text counterparty references embedding]}]
  (let [doc-id (helpers/create-document! legal-entity-id)]
    (db.sql/execute-one!
     fixtures/*db*
     {:update :document
      :set    (cond-> {:type            (db.sql/->cast type :document-type)
                       :structured-data [:lift structured-data]}
                searchable-text (assoc :searchable-text searchable-text)
                counterparty    (assoc :normalized-counterparty counterparty)
                references      (assoc :normalized-references [:lift (or references [])])
                embedding       (assoc :embedding [:cast (str (vec embedding)) :vector]))
      :where  [:= :id doc-id]})
    doc-id))

Update test cases to provide searchable-text on candidates and :searchable-text on source doc map. Stub search/embed-query with with-redefs.

Step 3: Run tests

clj -X:test:silent :nses '[com.getorcha.workers.matching.candidates-test]'

Expected: all PASS.

Step 4: Run full matching test suite

clj -X:test:silent :nses '[com.getorcha.workers.matching.candidates-test com.getorcha.workers.matching.integration-test com.getorcha.workers.matching.core-test com.getorcha.workers.matching.evidence-test com.getorcha.workers.matching.normalize-test com.getorcha.workers.matching.searchable-text-test]'

Expected: all PASS.

Step 5: Commit

git add src/com/getorcha/workers/matching/candidates.clj test/com/getorcha/workers/matching/candidates_test.clj
git commit -m "feat(matching): replace unranked find-candidates with hybrid search"

Task 5: BM25-Only Fallback When Embedding is Unavailable (Removed)

Decision: BM25 fallback was initially implemented but then removed. Embedding failure now fails the match outright: the exception propagates to the worker, which is responsible for retry and dead-lettering (the retries themselves will be implemented in a future iteration). This keeps the code simple and failures visible.

search/search calls embed-query unconditionally. match-document! calls search/embed-document unconditionally (when search-config is present). No try/catch anywhere in the embedding path.


Task 6: Extend Backfill Migration for searchable_text and embedding

Files:
src/com/getorcha/db/migrations/backfill_normalized_matching_columns.clj

Context: The existing backfill populates normalized_counterparty and normalized_references. Extend it to also populate searchable_text and embedding for all documents with structured data. The backfill is a Migratus code migration that runs at application startup.

searchable_text is pure computation (synchronous). embedding requires Vertex AI API calls (async, batched).

Step 1: Extend migrate-up to populate searchable_text

In the existing loop, after computing counterparty/references, also build searchable_text:

(let [doc          {:type            (keyword type)
                    :structured-data structured-data}
      counterparty (normalize/extract-counterparty doc)
      references   (normalize/extract-references doc)
      searchable   (searchable-text/build-searchable-text doc)]
  (db.sql/execute!
   conn
   {:update :document
    :set    {:normalized-counterparty counterparty
             :normalized-references   [:lift (or references [])]
             :searchable-text         searchable}
    :where  [:= :id id]}))

Step 2: Add embedding backfill as second phase

After the synchronous loop, batch-embed all documents that have searchable_text but no embedding:

;; Phase 2: Embed searchable_text
(let [to-embed (db.sql/execute!
                 conn
                 {:select [:id :searchable-text]
                  :from   [:document]
                  :where  [:and
                           [:is-not :searchable-text nil]
                           [:is :embedding nil]]}
                 jdbc-opts)
      texts    (mapv :document/searchable-text to-embed)
      ids      (mapv :document/id to-embed)]
  (when (seq texts)
    (log/info "Backfilling embeddings" {:document-count (count texts)})
    (let [ch (search/embed embed-config texts)]
      (loop []
        (when-let [[idx embedding] (<!! ch)]
          (db.sql/execute!
           conn
           {:update :document
            :set    {:embedding [:cast (str (vec embedding)) :vector]}
            :where  [:= :id (nth ids idx)]})
          (recur))))
    (log/info "Embedding backfill complete")))

The embed-config needs to come from somewhere — either passed into the migration config or read from the system config. Check how the existing migration receives its {:db conn} config and whether search config can be injected.

If search config isn't available in migration context, the embedding backfill could be a separate REPL-driven step or a standalone task rather than a migration. The searchable_text backfill should still be in the migration since it's pure computation.

Step 3: Test the backfill locally

Either create a new migration:

bb migrate create "backfill-searchable-text-and-embeddings"

or extend the existing one. Then run it in a dev REPL to verify:

(require '[com.getorcha.db.migrations.backfill-normalized-matching-columns :as backfill])
(backfill/migrate-up {:db (:com.getorcha.db/pool integrant.repl.state/system)})

Step 4: Commit

git add src/com/getorcha/db/migrations/backfill_normalized_matching_columns.clj
git commit -m "feat(matching): extend backfill to populate searchable_text and embedding"

Task 7: End-to-End Verification with Real Documents

Files:
test/com/getorcha/workers/matching/integration_test.clj

Context: The existing contract-invoice-matching-without-references-test tests pair 01 (ABO Kraft contract ↔ invoice). Update this test to work with the new hybrid search path. The contract candidate needs searchable_text and optionally embedding populated so that search/search can find it.

Step 1: Update contract-invoice-matching-without-references-test

The test currently runs match-document! on the contract first (to populate its normalized columns), then on the invoice. With the new code, the contract's searchable_text gets populated during its own match-document! call. The invoice's find-candidates then searches using hybrid search and should find the contract via BM25 text overlap on "Biomethan", "kWh", and shared VAT ID.

Stub search/embed-document and search/embed-query to return fake embeddings. The BM25 path should be sufficient to find the match.
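If the test fails to find the contract, a quick way to see which terms the two texts actually share is a rough token overlap like the one below. This is only a debugging aid under simplifying assumptions: the tokenizer is a crude lowercase split on non-alphanumerics, not PostgreSQL's tsvector parser, and the two sample texts are invented to resemble the pair 01 documents.

```clojure
(require '[clojure.set :as set]
         '[clojure.string :as str])

;; Crude tokenizer (assumption): lowercase, split on anything that is not a letter or digit.
(defn tokens
  [text]
  (->> (str/split (str/lower-case text) #"[^\p{L}\p{N}]+")
       (remove str/blank?)
       set))

(defn shared-tokens
  [a b]
  (set/intersection (tokens a) (tokens b)))

(def contract-text "ABO Kraft DE302 1.200.000 kWh (HS,N) Biomethan am Gastag 22.10.2025")
(def invoice-text  "ABO Kraft DE302 INV-001 Lieferung von Biomethan 1200000.0 kWh Hs")

(sort (shared-tokens contract-text invoice-text))
;; => ("abo" "biomethan" "de302" "hs" "kraft" "kwh")
```

Note that in this example the quantity does not overlap: "1.200.000" in the deliverable and "1200000.0" in the line item tokenize differently, so the match rests on the names, the VAT ID, and the product terms.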

Verify that the invoice's candidate search returns the contract.

Step 2: Run the full integration test suite

clj -X:test:silent :nses '[com.getorcha.workers.matching.integration-test]'

Step 3: Run ALL matching tests

clj -X:test:silent :nses '[com.getorcha.workers.matching.candidates-test com.getorcha.workers.matching.core-test com.getorcha.workers.matching.evidence-test com.getorcha.workers.matching.integration-test com.getorcha.workers.matching.normalize-test com.getorcha.workers.matching.searchable-text-test com.getorcha.workers.matching.llm-decision-test]'

Step 4: Commit

git add test/com/getorcha/workers/matching/integration_test.clj
git commit -m "test(matching): update integration tests for hybrid search candidate retrieval"