Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.
Date: 2026-02-26
Status: Approved
Supersedes: Layer 1 candidate retrieval in 2026-02-25-candidate-retrieval-redesign.md
The current find-candidates does SELECT * WHERE normalized_counterparty = :counterparty LIMIT 50 with no ordering. If a counterparty has >50 matchable documents, the returned 50 are arbitrary. The right candidate might not be in that set.
Beyond the LIMIT problem, there is no ranking at all. Layer 2 evidence scoring can only evaluate what Layer 1 returns — garbage in, garbage out.
Analysis of three real invoice-contract pairs (dump/matching/01-03) shows:
| Signal | Pair 01 (ABO Kraft biomethan) | Pair 02 (M&M biomethan) | Pair 03 (bikosigma transport) |
|---|---|---|---|
| Shared reference numbers | None | Date ref only | None |
| Amount proximity | Weak (157k/250k) | Weak (408k/520k) | N/A (rate card) |
| Subject matter text | "Biomethan", "1.200.000 kWh" | "Biomethan", "4.000.000 kWh" | "Gas-Transportmanagement", line items |
Text overlap on substantive content is the universal signal. References and amounts are sometimes useful but often absent. The retrieval layer must prioritize textual and semantic similarity.
searchable_text is too sparse
build-searchable-text indexes names, IDs, reference numbers, totals, and currency, but NOT the discriminating content: line item descriptions, quantities+units, deliverables, service descriptions. Two contracts from the same supplier produce nearly identical searchable_text, making BM25 useless for differentiation.
searchable_text
Extend build-searchable-text to include the content that discriminates between documents from the same counterparty.
Additions per type:
| Type | Current fields | Added fields |
|---|---|---|
| Invoice | issuer name/vat-id, invoice-number, total, currency, po-ref, gr-ref | line item descriptions, quantities+units |
| Contract | counterparty name/tax-id, contract-number, total-value, currency | deliverable descriptions (verbatim, with quantities+units) |
| PO | supplier name/vat-id, po-number, total-value, currency, contract-ref, requisition-number | line item descriptions, quantities+units |
| GRN | supplier name/vat-id, grn-number, po-ref, delivery-note-number | line item descriptions, quantities+units |
Contract deliverables are already stored as strings like "1.200.000 kWh (HS,N) Biomethan am Gastag 22.10.2025" — include them verbatim. They contain the quantities, units, commodity terms, and dates that BM25 and embeddings need.
The | separator convention is unchanged: line items are joined with | between items, and description, quantity, and unit are space-separated within each item.
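The joining rule can be sketched as a pure function. This is a minimal illustration, not the actual build-searchable-text: the key names (:issuer-name, :invoice-number, :line-items, :description, :quantity, :unit) and the spaces around the | separator are assumptions, since the real structured_data shape is not shown here.

```clojure
(require '[clojure.string :as str])

(defn line-item->text
  "Joins description, quantity, and unit with single spaces, skipping
   missing or blank parts. Key names are hypothetical."
  [{:keys [description quantity unit]}]
  (->> [description quantity unit]
       (remove nil?)
       (map (comp str/trim str))
       (remove str/blank?)
       (str/join " ")))

(defn join-segments
  "Joins non-blank segments with the | separator (assumed to be ` | `)."
  [segments]
  (->> segments
       (remove str/blank?)
       (str/join " | ")))

(defn invoice-searchable-text
  "Header fields first, then one segment per line item."
  [{:keys [issuer-name invoice-number line-items]}]
  (join-segments
   (concat [issuer-name invoice-number]
           (map line-item->text line-items))))
```

Blank segments are dropped so that a document with missing fields does not produce runs of empty separators.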
searchable_text and embedding
Both columns exist on document (migration 20260224100000) with GIN and HNSW indexes. Nothing currently writes to them.
At matching time — extend match-document! to:
1. extract counterparty + references (existing)
2. build searchable_text (new)
3. compute embedding via search/embed-query with RETRIEVAL_DOCUMENT task type (new)
4. persist all fields: counterparty, references, searchable_text, embedding
5. find-candidates using hybrid search (changed)
6. evidence scoring + matching (existing)
Store the embedding with the RETRIEVAL_DOCUMENT task type. At search time, search/search computes a RETRIEVAL_QUERY embedding for the query text. These are different task types, which the Vertex AI model optimizes for their respective roles: text to be indexed versus text to search with.
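The ordering of the six steps can be sketched with injected functions. Every entry in the dependency map below is a hypothetical stand-in supplied by the caller; this is not the real match-document!, only an illustration of the sequence, including the rule that an empty searchable_text skips the embedding call.

```clojure
(defn match-document-sketch
  "Runs the six steps in order. All functions in the first argument are
   hypothetical stand-ins, not the real implementation."
  [{:keys [extract build-text embed persist! find-candidates score]} doc]
  (let [extracted (extract doc)                   ; 1. counterparty + references
        text      (build-text doc)                ; 2. searchable_text
        embedding (when (seq text) (embed text))  ; 3. document embedding, skipped for empty text
        enriched  (merge doc extracted
                         {:searchable-text text
                          :embedding       embedding})]
    (persist! enriched)                           ; 4. persist all fields
    (->> (find-candidates enriched)               ; 5. candidate retrieval
         (score enriched))))                      ; 6. evidence scoring
```

Passing the collaborators in as a map keeps the sketch runnable in isolation and makes the step order easy to assert in a test.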
Backfill — extend the existing backfill pattern to populate searchable_text and embedding for all documents with structured data:
- searchable_text: synchronous, pure computation. For each document with structured_data, call build-searchable-text and UPDATE.
- embedding: async via search/embed (batches of 100, 200ms delay). Fetch documents where searchable_text IS NOT NULL AND embedding IS NULL, batch-embed, write back.

Both phases are idempotent: they skip documents that already have values.
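The embedding phase of the backfill might look roughly like the loop below. It is a synchronous sketch for readability, whereas the design calls for an async job; fetch-pending, embed-batch, and write-embedding! are hypothetical functions passed in by the caller (the second standing in for search/embed, whose signature is not shown in this document).

```clojure
(defn backfill-embeddings!
  "Embeds pending documents in batches, pausing between batches.
   All three functions are hypothetical stand-ins:
   - fetch-pending:    returns docs with searchable_text set and no embedding
   - embed-batch:      takes a vector of texts, returns one vector per text
   - write-embedding!: persists one embedding by document id"
  [{:keys [fetch-pending embed-batch write-embedding! batch-size delay-ms]
    :or   {batch-size 100 delay-ms 200}}]
  (doseq [batch (partition-all batch-size (fetch-pending))]
    (let [vectors (embed-batch (mapv :searchable-text batch))]
      ;; Pair each document with its embedding by position.
      (doseq [[doc v] (map vector batch vectors)]
        (write-embedding! (:id doc) v)))
    ;; Throttle between batches.
    (Thread/sleep (long delay-ms))))
```

Because fetch-pending only returns documents whose embedding is still NULL, re-running the function after an interruption picks up where the previous run stopped.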
Replace find-candidates with search/search using the generic :where clause:
(defn find-candidates
  [db search-config doc]
  (let [counterparty    (:normalized-counterparty doc)
        matchable-types (get-matchable-types (:type doc))]
    ;; Without a counterparty there is nothing to filter on: return nil.
    (when counterparty
      (search/search db
                     {:table            :document
                      :id-column        :id
                      :embedding-column :embedding
                      :text-column      :searchable-text
                      ;; Same legal entity, matchable types only, structured data
                      ;; present, same counterparty, and never the document itself.
                      :where            [:and
                                         [:= :legal-entity-id (:legal-entity-id doc)]
                                         [:in :type (mapv #(db.sql/->cast % :document-type) matchable-types)]
                                         [:is-not :structured-data nil]
                                         [:= :normalized-counterparty counterparty]
                                         [:<> :id (:id doc)]]}
                     ;; The document's own searchable text is the query.
                     (:searchable-text doc)
                     (merge search-config {:k 50 :semantic-k 200 :bm25-k 200})))))
Why no Clojure re-ranking layer: The original design proposed adding reference overlap and amount proximity bonuses between hybrid search and evidence scoring. This is unnecessary — Layer 2 evidence scoring already evaluates these signals precisely. Hybrid search with enriched searchable_text is discriminating enough to produce a good top 50.
Why hybrid over BM25-only: BM25 catches exact token overlap (quantities, reference numbers, "Kapazitätsbuchung"). Semantic embeddings catch vocabulary variation between contracts and invoices ("Transportnetzbetreiber" vs "TSO", "Abwicklung Gas-Transport" vs "volumenabhängiges Entgelt TPM"). RRF fusion gives precision (BM25) and recall (semantic).
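To make the fusion step concrete, here is a minimal sketch of reciprocal rank fusion over two ranked id lists. How search/search implements RRF is not shown in this document; the sketch uses the standard formula, score = sum of 1 / (k + rank), and k = 60 is an assumed constant rather than the configured value.

```clojure
(defn rrf-scores
  "rankings: a seq of ranked id vectors, best first. Returns {id score},
   where each list contributes 1 / (k + rank) with 1-based ranks."
  [k rankings]
  (apply merge-with +
         (for [ranking rankings]
           (into {}
                 (map-indexed (fn [i id] [id (/ 1.0 (+ k (inc i)))]))
                 ranking))))

(defn rrf-fuse
  "Returns the top-n ids by fused score, highest first."
  [k top-n rankings]
  (->> (rrf-scores k rankings)
       (sort-by val >)
       (take top-n)
       (mapv key)))

;; :b appears in both lists, so it outranks :a, which leads only the BM25 list.
(rrf-fuse 60 50 [[:a :b :c]    ; BM25 ranking
                 [:b :d]])     ; semantic ranking
;; => [:b :a :d :c]
```

An id present in only one list still receives a score from that list, which is why candidates without an embedding remain retrievable through the BM25 branch.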
Empty searchable_text: If structured_data has none of the expected fields, store an empty string and skip embedding computation. Such a document will not appear in BM25 results, and it would not be a useful candidate anyway.
Embedding API failure during match-document!: Fall back to BM25-only search. find-candidates runs the BM25 search without the semantic branch and without RRF. Degraded but functional.
Candidate has no embedding yet (backfill in progress): Vector search filters embedding IS NOT NULL. These candidates are invisible to semantic search but still found by BM25. RRF handles mixed result sets — a document in only one list still gets an RRF score.
Counterparty is null: find-candidates returns nil, match-document! skips matching. No change.
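The cases above amount to a small decision table on the query side. The sketch below is one way to express it; try-embed and retrieval-mode are hypothetical helper names, and the assumption that an embedding failure surfaces as an exception is not confirmed by this document.

```clojure
(defn try-embed
  "Calls embed-fn on text and returns nil instead of throwing, so the
   caller can degrade to BM25-only search. Assumes failures surface as
   exceptions; embed-fn is a hypothetical stand-in for the embedding call."
  [embed-fn text]
  (try
    (embed-fn text)
    (catch Exception _ nil)))

(defn retrieval-mode
  "Returns :skip, :bm25-only, or :hybrid for the document being matched."
  [{:keys [normalized-counterparty]} query-embedding]
  (cond
    (nil? normalized-counterparty) :skip       ; no counterparty: no matching
    (nil? query-embedding)         :bm25-only  ; embedding failed: BM25 branch only
    :else                          :hybrid))   ; both branches, fused with RRF
```

Keeping the mode decision separate from the search call makes the degraded path easy to test without a live embedding API.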
- search/search: hybrid BM25+semantic+RRF (existing, with :where clause support)
- search/embed: batch embedding via Vertex AI (existing)
- search/embed-query: single-text query embedding (existing)
- document.searchable_text column + GIN index (existing migration)
- document.embedding vector(768) column + HNSW index (existing migration)
- text-multilingual-embedding-002 model, 768 dimensions (existing config)