Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Ranked Candidate Retrieval via Hybrid Search

Date: 2026-02-26
Status: Approved
Supersedes: Layer 1 candidate retrieval in 2026-02-25-candidate-retrieval-redesign.md

Problem

The current find-candidates runs SELECT * WHERE normalized_counterparty = :counterparty LIMIT 50 with no ORDER BY. If a counterparty has more than 50 matchable documents, the 50 rows returned are arbitrary, so the correct candidate may be missing from the set.

Beyond the LIMIT problem, there is no ranking at all. Layer 2 evidence scoring can only evaluate what Layer 1 returns, so an arbitrary candidate set caps the quality of every later layer.

What humans actually use to match documents

Analysis of three real invoice-contract pairs (dump/matching/01-03) shows:

Signal	Pair 01 (ABO Kraft biomethan)	Pair 02 (M&M biomethan)	Pair 03 (bikosigma transport)
Shared reference numbers	None	Date ref only	None
Amount proximity	Weak (157k/250k)	Weak (408k/520k)	N/A (rate card)
Subject matter text	"Biomethan", "1.200.000 kWh"	"Biomethan", "4.000.000 kWh"	"Gas-Transportmanagement", line items

Text overlap on substantive content is the universal signal. References and amounts are sometimes useful but often absent. The retrieval layer must prioritize textual and semantic similarity.

Current `searchable_text` is too sparse

build-searchable-text indexes names, IDs, reference numbers, totals, currency — but NOT the discriminating content: line item descriptions, quantities+units, deliverables, service descriptions. Two contracts from the same supplier produce nearly identical searchable_text, making BM25 useless for differentiation.

Design

1. Enriched `searchable_text`

Extend build-searchable-text to include the content that discriminates between documents from the same counterparty.

Additions per type:

Type	Current fields	Added fields
Invoice	issuer name/vat-id, invoice-number, total, currency, po-ref, gr-ref	line item descriptions, quantities+units
Contract	counterparty name/tax-id, contract-number, total-value, currency	deliverable descriptions (verbatim, with quantities+units)
PO	supplier name/vat-id, po-number, total-value, currency, contract-ref, requisition-number	line item descriptions, quantities+units
GRN	supplier name/vat-id, grn-number, po-ref, delivery-note-number	line item descriptions, quantities+units

Contract deliverables are already stored as strings like "1.200.000 kWh (HS,N) Biomethan am Gastag 22.10.2025" — include them verbatim. They contain the quantities, units, commodity terms, and dates that BM25 and embeddings need.

The | separator convention is unchanged: line items are joined with | between items, and description, quantity, and unit are space-separated within each item.
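The line-item convention can be illustrated with a minimal sketch. The keys :description, :quantity, and :unit are hypothetical, since the real structured_data field names are not listed in this document, and the whitespace around | is likewise an assumption.

```clojure
(require '[clojure.string :as str])

;; Assumption: the separator is written with surrounding spaces.
(def separator " | ")

(defn item->text
  "Joins description, quantity and unit of one line item with spaces,
  skipping missing or blank values. Key names are hypothetical."
  [{:keys [description quantity unit]}]
  (->> [description quantity unit]
       (remove nil?)
       (map str)
       (remove str/blank?)
       (str/join " ")))

(defn line-items->text
  "Joins the per-item strings with the separator and drops items that
  contribute no text at all."
  [items]
  (->> items
       (map item->text)
       (remove str/blank?)
       (str/join separator)))
```

Dropping empty items keeps the text free of dangling separators, which would otherwise add noise without adding any searchable tokens.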

2. Populate `searchable_text` and `embedding`

Both columns exist on document (migration 20260224100000) with GIN and HNSW indexes. Nothing currently writes to them.

At matching time — extend match-document! to:

1. extract counterparty + references (existing)
2. build searchable_text (new)
3. compute embedding via search/embed-query with RETRIEVAL_DOCUMENT task type (new)
4. persist all fields: counterparty, references, searchable_text, embedding
5. find-candidates using hybrid search (changed)
6. evidence scoring + matching (existing)

Store the embedding with the RETRIEVAL_DOCUMENT task type. At search time, search/search computes a RETRIEVAL_QUERY embedding for the query text. The Vertex AI model treats these as distinct task types, one optimized for stored documents and one for queries.
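Steps 1 to 4 above could be sketched roughly as follows. Every step is passed in as a function because the real signatures of build-searchable-text, search/embed-query, and the persistence code are not shown here. The :retrieval-document keyword is a hypothetical stand-in for the RETRIEVAL_DOCUMENT task type.

```clojure
(require '[clojure.string :as str])

(defn prepare-for-matching
  "Extracts fields, builds the searchable text, embeds it as a document,
  and persists the result. All four step functions are hypothetical and
  supplied by the caller."
  [{:keys [extract build-text embed persist!]} doc]
  (let [extracted (merge doc (extract doc))
        text      (build-text extracted)
        ;; Blank text carries no signal, so the embedding call is skipped.
        embedding (when-not (str/blank? text)
                    (embed text :retrieval-document))
        enriched  (assoc extracted
                         :searchable-text text
                         :embedding embedding)]
    (persist! enriched)
    enriched))
```

Persisting before candidate retrieval means the document being matched is itself a searchable candidate for documents that arrive later.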

Backfill — extend the existing backfill pattern to populate searchable_text and embedding for all documents with structured data:

searchable_text — synchronous, pure computation. For each document with structured_data, call build-searchable-text and UPDATE.
embedding — async via search/embed (batches of 100, 200ms delay). Fetch documents where searchable_text IS NOT NULL AND embedding IS NULL, batch-embed, write back.

Both phases are idempotent — skip documents that already have values.
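A minimal sketch of the embedding phase, assuming hypothetical caller-supplied functions fetch-pending (documents with searchable text but no embedding), embed-batch (standing in for search/embed), and save!:

```clojure
(defn backfill-embeddings!
  "Embeds pending documents in batches and writes each vector back.
  fetch-pending, embed-batch and save! are hypothetical functions supplied
  by the caller. Defaults follow the batch size and delay stated above."
  [{:keys [fetch-pending embed-batch save! batch-size delay-ms]
    :or   {batch-size 100 delay-ms 200}}]
  (doseq [batch (partition-all batch-size (fetch-pending))]
    ;; One embedding call per batch; vectors come back in input order.
    (let [vectors (embed-batch (mapv :searchable-text batch))]
      (doseq [[doc v] (map vector batch vectors)]
        (save! (:id doc) v)))
    ;; Pause between batches to stay under the embedding API rate limit.
    (Thread/sleep (long delay-ms))))
```

Idempotence comes from the fetch predicate rather than from the loop: a second run finds nothing pending and makes no embedding calls.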

3. Hybrid Search Candidate Retrieval

Reimplement find-candidates on top of search/search, using its generic :where clause:

(defn find-candidates
  "Returns hybrid-search candidates for doc, or nil when doc has no
  normalized counterparty."
  [db search-config doc]
  (let [counterparty    (:normalized-counterparty doc)
        matchable-types (get-matchable-types (:type doc))]
    (when counterparty
      (search/search db
        {:table            :document
         :id-column        :id
         :embedding-column :embedding
         :text-column      :searchable-text
         ;; Hard filters: same legal entity and counterparty, a matchable
         ;; type, structured data present, and never the document itself.
         :where            [:and
                            [:= :legal-entity-id (:legal-entity-id doc)]
                            [:in :type (mapv #(db.sql/->cast % :document-type) matchable-types)]
                            [:is-not :structured-data nil]
                            [:= :normalized-counterparty counterparty]
                            [:<> :id (:id doc)]]}
        ;; The document's own searchable text serves as the query.
        (:searchable-text doc)
        (merge search-config {:k 50 :semantic-k 200 :bm25-k 200})))))

Why no Clojure re-ranking layer: The original design proposed adding reference overlap and amount proximity bonuses between hybrid search and evidence scoring. This is unnecessary — Layer 2 evidence scoring already evaluates these signals precisely. Hybrid search with enriched searchable_text is discriminating enough to produce a good top 50.

Why hybrid over BM25-only: BM25 catches exact token overlap (quantities, reference numbers, "Kapazitätsbuchung"). Semantic embeddings catch vocabulary variation between contracts and invoices ("Transportnetzbetreiber" vs "TSO", "Abwicklung Gas-Transport" vs "volumenabhängiges Entgelt TPM"). RRF (reciprocal rank fusion) combines the two ranked lists, giving precision from BM25 and recall from the semantic branch.
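For readers unfamiliar with reciprocal rank fusion, the following is a minimal sketch of the general technique, not the code inside search/search. Each ranked list contributes 1/(k + rank) for every id it contains. The constant k = 60 is a commonly used default and an assumption here.

```clojure
(defn rrf-scores
  "Reciprocal rank fusion over ranked lists of ids. Ranks start at 1 and
  each list adds 1/(k + rank) to the score of every id it contains.
  k = 60 is an assumed default, not a value taken from search/search."
  ([ranked-lists] (rrf-scores ranked-lists 60))
  ([ranked-lists k]
   (apply merge-with +
          (for [ids ranked-lists]
            (into {} (map-indexed (fn [i id] [id (/ 1.0 (+ k i 1))]) ids))))))

(defn fuse
  "Returns ids ordered by descending fused score."
  [ranked-lists]
  (->> (rrf-scores ranked-lists)
       (sort-by val >)
       (mapv key)))

;; :b appears in both lists and therefore outranks :a, which tops only one.
(fuse [[:a :b :c] [:b :d]])
;; => [:b :a :d :c]
```

An id present in only one list still receives a score from that list, so documents missing from one branch are ranked lower rather than dropped.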

4. Edge Cases

Empty searchable_text: If structured_data has none of the expected fields, store an empty string and skip the embedding computation. Such a document will not appear in BM25 results, and it would not be a useful candidate anyway.

Embedding API failure during match-document!: Fall back to BM25-only search. find-candidates runs the BM25 search alone, with no semantic branch and no RRF. Retrieval is degraded but functional.
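The fallback could look roughly like the sketch below. The functions embed-query, hybrid-search, and bm25-search are hypothetical and supplied by the caller. How search/search actually exposes a BM25-only mode is not specified in this document.

```clojure
(defn search-with-fallback
  "Tries hybrid search and falls back to BM25-only when computing the
  embedding throws. All three functions are hypothetical and supplied by
  the caller."
  [{:keys [embed-query hybrid-search bm25-search]} query-text]
  (let [embedding (try
                    (embed-query query-text)
                    ;; Broad catch for brevity; real code would likely log
                    ;; the failure and catch a narrower exception type.
                    (catch Exception _ nil))]
    (if embedding
      (hybrid-search query-text embedding)
      (bm25-search query-text))))
```

Keeping the embedding call inside its own try block means a failure in the search itself is not mistaken for an embedding outage.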

Candidate has no embedding yet (backfill in progress): Vector search filters on embedding IS NOT NULL, so these candidates are invisible to semantic search but are still found by BM25. RRF handles mixed result sets: a document that appears in only one list still gets an RRF score.

Counterparty is null: find-candidates returns nil and match-document! skips matching. This behavior is unchanged.

Unchanged

Layer 2 evidence scoring (signals, weights, thresholds)
Layer 3 LLM decider (invocation rules, prompt, response parsing)
Match creation and cluster assignment
Matchable type pairs
Counterparty and reference normalization

Infrastructure Reused

search/search — hybrid BM25+semantic+RRF (existing, with :where clause support)
search/embed — batch embedding via Vertex AI (existing)
search/embed-query — single-text query embedding (existing)
document.searchable_text column + GIN index (existing migration)
document.embedding vector(768) column + HNSW index (existing migration)
text-multilingual-embedding-002 model, 768 dimensions (existing config)