Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Ranked Candidate Retrieval via Hybrid Search

Date: 2026-02-26
Status: Approved
Supersedes: Layer 1 candidate retrieval in 2026-02-25-candidate-retrieval-redesign.md

Problem

The current find-candidates runs SELECT * WHERE normalized_counterparty = :counterparty LIMIT 50 with no ORDER BY. If a counterparty has more than 50 matchable documents, which 50 come back is arbitrary, so the right candidate might not be in that set.

Beyond the LIMIT problem, there is no ranking at all. Layer 2 evidence scoring can only evaluate what Layer 1 returns — garbage in, garbage out.

What humans actually use to match documents

Analysis of three real invoice-contract pairs (dump/matching/01-03) shows:

| Signal | Pair 01 (ABO Kraft biomethan) | Pair 02 (M&M biomethan) | Pair 03 (bikosigma transport) |
|---|---|---|---|
| Shared reference numbers | None | Date ref only | None |
| Amount proximity | Weak (157k/250k) | Weak (408k/520k) | N/A (rate card) |
| Subject matter text | "Biomethan", "1.200.000 kWh" | "Biomethan", "4.000.000 kWh" | "Gas-Transportmanagement", line items |

Text overlap on substantive content is the universal signal. References and amounts are sometimes useful but often absent. The retrieval layer must prioritize textual and semantic similarity.

Current searchable_text is too sparse

build-searchable-text indexes names, IDs, reference numbers, totals, currency — but NOT the discriminating content: line item descriptions, quantities+units, deliverables, service descriptions. Two contracts from the same supplier produce nearly identical searchable_text, making BM25 useless for differentiation.


Design

1. Enriched searchable_text

Extend build-searchable-text to include the content that discriminates between documents from the same counterparty.

Additions per type:

| Type | Current fields | Added fields |
|---|---|---|
| Invoice | issuer name/vat-id, invoice-number, total, currency, po-ref, gr-ref | line item descriptions, quantities+units |
| Contract | counterparty name/tax-id, contract-number, total-value, currency | deliverable descriptions (verbatim, with quantities+units) |
| PO | supplier name/vat-id, po-number, total-value, currency, contract-ref, requisition-number | line item descriptions, quantities+units |
| GRN | supplier name/vat-id, grn-number, po-ref, delivery-note-number | line item descriptions, quantities+units |

Contract deliverables are already stored as strings like "1.200.000 kWh (HS,N) Biomethan am Gastag 22.10.2025" — include them verbatim. They contain the quantities, units, commodity terms, and dates that BM25 and embeddings need.

The | separator convention is unchanged: line items are joined with | between items, and description, quantity, and unit are space-separated within each item.
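A minimal sketch of the line-item formatting described above. The key names (:description, :quantity, :unit), the function names, and the spaces around the | separator are assumptions made for illustration; the real build-searchable-text may read different fields from structured_data.

```clojure
(require '[clojure.string :as str])

(defn item->text
  "Space-joins description, quantity and unit, skipping missing or blank parts.
  The three key names are hypothetical."
  [{:keys [description quantity unit]}]
  (->> [description quantity unit]
       (remove nil?)
       (map str)
       (remove str/blank?)
       (str/join " ")))

(defn line-items->text
  "Joins the per-item strings with the | separator; items that render
  to an empty string are dropped."
  [items]
  (->> items
       (map item->text)
       (remove str/blank?)
       (str/join " | ")))

(line-items->text
  [{:description "Biomethan" :quantity "1.200.000" :unit "kWh"}
   {:description "Transport" :quantity 2 :unit nil}])
;; => "Biomethan 1.200.000 kWh | Transport 2"
```

Dropping blank parts keeps stray separators out of the text when a line item has no unit or quantity.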

2. Populate searchable_text and embedding

Both columns exist on document (migration 20260224100000) with GIN and HNSW indexes. Nothing currently writes to them.

At matching time — extend match-document! to:

1. extract counterparty + references (existing)
2. build searchable_text (new)
3. compute embedding via search/embed-query with RETRIEVAL_DOCUMENT task type (new)
4. persist all fields: counterparty, references, searchable_text, embedding
5. find-candidates using hybrid search (changed)
6. evidence scoring + matching (existing)

Store the embedding with the RETRIEVAL_DOCUMENT task type. At search time, search/search computes a RETRIEVAL_QUERY embedding for the query. These are different task types, each optimized for its purpose by the Vertex AI model, so stored documents and search queries are embedded asymmetrically.

Backfill — extend the existing backfill pattern to populate searchable_text and embedding for all documents with structured data:

  1. searchable_text — synchronous, pure computation. For each document with structured_data, call build-searchable-text and UPDATE.
  2. embedding — async via search/embed (batches of 100, 200ms delay). Fetch documents where searchable_text IS NOT NULL AND embedding IS NULL, batch-embed, write back.

Both phases are idempotent — skip documents that already have values.
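The embedding phase could look roughly like the sketch below. It is synchronous for simplicity, whereas the design calls for an async run via search/embed. embed-batch-fn and write-fn are hypothetical injected functions standing in for the embedding call and the UPDATE, and documents are assumed to be maps with :id, :searchable-text and :embedding. The filter also skips empty strings, in line with the empty searchable_text edge case.

```clojure
(require '[clojure.string :as str])

(defn needs-embedding?
  "True when the document has non-blank searchable text and no embedding yet.
  Skipping already-embedded documents is what makes a re-run idempotent."
  [{:keys [searchable-text embedding]}]
  (and (not (str/blank? searchable-text))
       (nil? embedding)))

(defn backfill-embeddings!
  "Embeds pending documents in batches and returns how many were written.
  embed-batch-fn (hypothetical): seq of texts -> seq of vectors, same order.
  write-fn (hypothetical): called with a document id and its vector."
  [docs embed-batch-fn write-fn {:keys [batch-size delay-ms]
                                 :or   {batch-size 100 delay-ms 200}}]
  (reduce
    (fn [written batch]
      (let [vectors (embed-batch-fn (map :searchable-text batch))]
        (doseq [[doc v] (map vector batch vectors)]
          (write-fn (:id doc) v))
        ;; Pause between batches to stay under the embedding API rate limit.
        (Thread/sleep (long delay-ms))
        (+ written (count batch))))
    0
    (partition-all batch-size (filter needs-embedding? docs))))
```

Running it a second time over the same documents, after the writes have landed, embeds nothing, because every previously pending document now fails needs-embedding?.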

3. Hybrid Search Candidate Retrieval

Replace find-candidates with search/search using the generic :where clause:

(defn find-candidates
  "Returns up to 50 candidates for doc, ranked by hybrid search,
  or nil when doc has no normalized counterparty."
  [db search-config doc]
  (let [counterparty    (:normalized-counterparty doc)
        matchable-types (get-matchable-types (:type doc))]
    (when counterparty
      (search/search db
        {:table            :document
         :id-column        :id
         :embedding-column :embedding
         :text-column      :searchable-text
         ;; Hard filters: same legal entity, matchable type, has structured
         ;; data, same counterparty, and not the document itself.
         :where            [:and
                            [:= :legal-entity-id (:legal-entity-id doc)]
                            [:in :type (mapv #(db.sql/->cast % :document-type) matchable-types)]
                            [:is-not :structured-data nil]
                            [:= :normalized-counterparty counterparty]
                            [:<> :id (:id doc)]]}
        ;; The document's own searchable_text is the query.
        (:searchable-text doc)
        (merge search-config {:k 50 :semantic-k 200 :bm25-k 200})))))

Why no Clojure re-ranking layer: The original design proposed adding reference overlap and amount proximity bonuses between hybrid search and evidence scoring. This is unnecessary — Layer 2 evidence scoring already evaluates these signals precisely. Hybrid search with enriched searchable_text is discriminating enough to produce a good top 50.

Why hybrid over BM25-only: BM25 catches exact token overlap (quantities, reference numbers, "Kapazitätsbuchung"). Semantic embeddings catch vocabulary variation between contracts and invoices ("Transportnetzbetreiber" vs "TSO", "Abwicklung Gas-Transport" vs "volumenabhängiges Entgelt TPM"). RRF fusion gives precision (BM25) and recall (semantic).
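For reference, reciprocal rank fusion can be sketched as follows. This is an illustration of the technique, not the actual search/search implementation: the function names are hypothetical, and the constant k = 60 is the commonly used default rather than a value taken from this design. Each document scores the sum of 1 / (k + rank) over the lists it appears in, with rank starting at 1.

```clojure
(defn rrf-scores
  "ranked-lists: a seq of id sequences, each ordered best first.
  Returns a map of id -> fused score."
  [ranked-lists k]
  (reduce
    (fn [scores ids]
      (reduce-kv
        (fn [m idx id]
          ;; idx is zero-based, so the rank is (inc idx).
          (update m id (fnil + 0.0) (/ 1.0 (+ k (inc idx)))))
        scores
        (vec ids)))
    {}
    ranked-lists))

(defn rrf-top
  "Returns the n ids with the highest fused score, best first."
  [ranked-lists k n]
  (->> (rrf-scores ranked-lists k)
       (sort-by val >)
       (take n)
       (mapv key)))

;; BM25 ranks :a first; the semantic branch ranks :c first and also finds :d.
(rrf-top [[:a :b :c] [:c :a :d]] 60 4)
;; => [:a :c :b :d]
```

Documents ranked well by both branches (:a, :c) rise to the top, while :b and :d, each found by only one branch, still receive a score and remain in the fused list.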

4. Edge Cases

Empty searchable_text: If structured_data has none of the expected fields, store an empty string and skip embedding computation. Such a document won't appear in BM25 results, and it would not be a useful candidate anyway.

Embedding API failure during match-document!: Fall back to BM25-only search. find-candidates runs BM25 search without the semantic branch, without RRF. Degraded but functional.
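One way to express this fallback is a small wrapper like the sketch below. hybrid-fn and bm25-fn are hypothetical zero-argument functions standing in for the hybrid and BM25-only search calls, and catching every Exception is a simplification; real code would likely catch only the embedding client's failure type.

```clojure
(defn search-with-fallback
  "Tries hybrid search first and falls back to BM25-only when it throws.
  Returns a map so callers can log which mode produced the candidates."
  [hybrid-fn bm25-fn]
  (try
    ;; doall forces a lazy result inside the try, so a failure raised
    ;; during realization is still caught here.
    {:mode :hybrid :candidates (doall (hybrid-fn))}
    (catch Exception e
      {:mode       :bm25-only
       :candidates (doall (bm25-fn))
       :error      (ex-message e)})))
```

Returning the mode alongside the candidates makes the degraded path visible in logs and metrics without changing how evidence scoring consumes the result.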

Candidate has no embedding yet (backfill in progress): Vector search filters on embedding IS NOT NULL. These candidates are invisible to semantic search but are still found by BM25. RRF handles mixed result sets: a document that appears in only one list still gets an RRF score.

Counterparty is null: find-candidates returns nil, match-document! skips matching. No change.


Unchanged

Infrastructure Reused