Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.
Date: 2026-02-25
Status: Approved
Supersedes: Candidate retrieval section of 2026-02-24-document-matching-design.md
The current candidate retrieval fetches 50 documents from the same legal entity using hybrid search (BM25 + semantic) over all documents. This approach breaks down at scale (100k+ documents per legal entity) because:
Test case: Contract CFG-ABO-001 (ABO Kraft & Wärme ↔ Carbon Farming, biomethane supply) and Invoice 2025-029RAM share zero cross-references, no common VAT ID, no IBAN. They connect through: supplier name (with formatting difference), matching quantity (1,200,000 kWh), date alignment (Oct 2025), and price plausibility.
Document ingestion completes (valid_structured_data = true)
│
▼
Populate normalized_counterparty + normalized_references
│
▼
Layer 1: SQL candidate retrieval
(counterparty filter + BM25 + deterministic rank → 50 candidates)
│
▼
Layer 2: Evidence scoring
(weighted signals on 50 candidates → scored + ranked)
│
├─ 0 candidates ≥ 0.30 → no match
├─ 1 candidate ≥ 0.70 → auto-match (rule-based)
└─ anything else → top 3 → Layer 3
│
▼
Layer 3: LLM decider (optional)
(confirms/rejects matches for ambiguous cases)
│
▼
Create/update match edges + clusters
If Layer 1 returns 0 candidates (counterparty not found due to extraction failure), log and skip matching. Don't compensate with a fallback — fix the extraction instead.
-- Normalized counterparty name for fast supplier-based filtering
ALTER TABLE document ADD COLUMN normalized_counterparty text;
CREATE INDEX idx_document_normalized_counterparty ON document(normalized_counterparty);
-- All reference numbers from the document, normalized (lowercased, separators stripped)
ALTER TABLE document ADD COLUMN normalized_references text[];
CREATE INDEX idx_document_normalized_references ON document USING gin(normalized_references);
The legal entity is always the buyer. The counterparty (supplier/vendor) comes from:
| Document Type | Counterparty Field |
|---|---|
| Invoice | issuer.name |
| Purchase Order | supplier.name |
| Contract | party-b.name |
| GRN | supplier.name |
Normalized via the existing normalize-supplier-name (lowercase, transliterate umlauts, strip punctuation, strip company suffixes):

- "abo kraft waerme ramstein"
- "abo kraft waerme ramstein"

All reference numbers from a document, normalized (lowercased, all separators stripped):

- invoice-number: "2025-029RAM", po-reference: "PO-2024-001" → ["2025029ram", "po2024001"]
- contract-number: "CFG-ABO-001" → ["cfgabo001"]
- po-number: "PO-2024-001", contract-reference: "CFG-ABO-001" → ["po2024001", "cfgabo001"]
- grn-number: "GRN-001", po-reference: "PO-2024-001" → ["grn001", "po2024001"]

Both columns are populated during ingestion when structured_data is written.
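The reference normalization described above can be sketched roughly as follows. This is a minimal illustration, not the existing implementation: `normalize-reference` and `normalized-references` are hypothetical names, and the assumption that references sit as top-level keyword keys in the structured data is ours.

```clojure
(require '[clojure.string :as str])

(defn normalize-reference
  "Lowercases a reference number and strips every character that is not
  a letter or a digit (hyphens, slashes, spaces, dots)."
  [s]
  (-> s
      str/lower-case
      (str/replace #"[^\p{L}\p{N}]" "")))

(defn normalized-references
  "Collects the non-blank values found under reference-keys (hypothetical
  key layout) and normalizes each one, preserving key order."
  [structured-data reference-keys]
  (->> reference-keys
       (keep #(get structured-data %))
       (remove str/blank?)
       (mapv normalize-reference)))

(normalized-references
  {:invoice-number "2025-029RAM" :po-reference "PO-2024-001"}
  [:invoice-number :po-reference])
;; => ["2025029ram" "po2024001"]
```

Stripping all separators rather than only hyphens keeps "PO-2024-001" and "PO 2024/001" comparable, at the cost of occasionally merging references that differ only in punctuation.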
Single query that filters by counterparty + matchable types, then ranks using BM25 + deterministic signals:
SELECT d.id,
       d.type,
       d.structured_data,
       (
         -- BM25: textual overlap (commodity, quantities, descriptions)
         COALESCE(ts_rank(
           to_tsvector('simple', d.searchable_text),
           plainto_tsquery('simple', :query_text)
         ), 0) * 10
         -- Reference overlap: any normalized reference in common
         + CASE WHEN d.normalized_references && :source_refs THEN 50 ELSE 0 END
         -- Amount proximity: within 50% of source total
         + CASE WHEN :source_total > 0
                 AND ABS(COALESCE((d.structured_data->>'total')::numeric,
                                  (d.structured_data->>'total-value')::numeric, 0)
                         - :source_total)
                     / :source_total < 0.5
                THEN 10 ELSE 0 END
       ) AS rank_score
FROM document d
WHERE d.legal_entity_id = :legal_entity_id
  AND d.type = ANY(:matchable_types)
  AND d.normalized_counterparty = :counterparty
  AND d.id != :source_id
ORDER BY rank_score DESC
LIMIT 50
Design choices:
- searchable_text already exists with a GIN index from the current schema
- searchable_text value

Unchanged from current design:
(def matchable-pairs
  #{#{:invoice :purchase-order}
    #{:invoice :contract}
    #{:purchase-order :contract}
    #{:goods-received-note :purchase-order}})
Full weighted signal computation in Clojure on the 50 candidates from Layer 1. Produces scored candidates with evidence trails.
(def evidence-signals
  {;; Reference matches (highest value, when they exist)
   :po-number-exact      60   ; normalized PO number match
   :contract-ref-exact   55   ; normalized contract reference match
   :po-ref-exact         55   ; PO reference on GRN matches PO document
   ;; Identity matches
   :vat-id-match         30   ; supplier VAT/tax IDs match
   :iban-match           25   ; supplier bank accounts match
   ;; Quantity & amount matches
   :quantity-exact       35   ; same quantity appears in both documents
   :amount-within-2pct   20   ; total amounts within 2%
   :amount-within-5pct   10   ; total amounts within 5%
   ;; Temporal alignment
   :date-within-period   20   ; invoice service period falls within contract dates
   :delivery-date-match  25   ; invoice date aligns with contract delivery schedule
   ;; Fuzzy matches
   :supplier-name-fuzzy  15   ; >0.8 Jaro-Winkler after normalization
   :description-overlap  10   ; shared commodity/service terms
   ;; Negative signals
   :currency-mismatch   -30   ; different currencies
   :vat-id-mismatch     -40}) ; VAT IDs present but don't match
Raw score = sum of matched signal weights. Normalized = min(1.0, max(0.0, raw / 100.0)).
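As an illustration of the formula above, the following sketch sums the weights of the fired signals and clamps the result to [0.0, 1.0]. The function names `raw-score` and `normalized-score` are hypothetical, and `example-weights` repeats only a subset of the signal weights so the snippet runs on its own.

```clojure
(def example-weights
  ;; Subset of the evidence signal weights, copied here for a standalone example.
  {:quantity-exact       35
   :delivery-date-match  25
   :supplier-name-fuzzy  15
   :description-overlap  10
   :amount-within-5pct   10
   :currency-mismatch   -30
   :vat-id-mismatch     -40})

(defn raw-score
  "Sums the weights of the signals that fired. Signals missing from the
  weights map contribute 0 (an assumption of this sketch)."
  [weights fired-signals]
  (reduce + 0 (map #(get weights % 0) fired-signals)))

(defn normalized-score
  "Maps a raw score onto [0.0, 1.0]: min(1.0, max(0.0, raw / 100.0))."
  [raw]
  (min 1.0 (max 0.0 (/ raw 100.0))))

(normalized-score
  (raw-score example-weights
             [:quantity-exact :delivery-date-match :supplier-name-fuzzy
              :description-overlap :amount-within-5pct]))
;; => 0.95
```

The clamp matters in both directions: negative signals alone cannot push a candidate below 0.0, and stacked reference matches cannot exceed 1.0.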
(def match-thresholds
  {:high 0.70
   :low  0.30})
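One possible reading of how these thresholds drive the three-way branch after Layer 2 is sketched below. The function `route-candidates` and the `:score` key are hypothetical. The sketch assumes "1 candidate ≥ 0.70" means exactly one candidate at or above the high threshold, and that the top 3 passed to Layer 3 are taken by score regardless of the low threshold.

```clojure
(def thresholds {:high 0.70 :low 0.30})

(defn route-candidates
  "Decides what happens to the scored candidates of one source document.
  Returns a map with :decision set to :no-match, :auto-match or :llm."
  [{:keys [high low]} scored]
  (let [ranked     (sort-by :score > scored)
        above-low  (filter #(>= (:score %) low) ranked)
        above-high (filter #(>= (:score %) high) ranked)]
    (cond
      ;; Nothing reaches the low threshold: no match at all.
      (empty? above-low)
      {:decision :no-match}

      ;; Exactly one strong candidate: rule-based auto-match.
      (= 1 (count above-high))
      {:decision :auto-match :match (first above-high)}

      ;; Everything else is ambiguous: hand the top 3 to the LLM decider.
      :else
      {:decision :llm :candidates (vec (take 3 ranked))})))

(route-candidates thresholds [{:id "a" :score 0.95} {:id "b" :score 0.40}])
;; => {:decision :auto-match, :match {:id "a", :score 0.95}}
```

Two candidates above 0.70 deliberately fall through to the LLM branch here, since a rule-based pick between two strong matches would be arbitrary.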
Contract CFG-ABO-001 ↔ Invoice 2025-029RAM:
| Signal | Value | Weight |
|---|---|---|
| :quantity-exact | 1,200,000 kWh in both | +35 |
| :delivery-date-match | Oct 2025 ↔ Gastag 22.10.2025 | +25 |
| :supplier-name-fuzzy | Normalized names match | +15 |
| :description-overlap | "Lieferung von Biomethan" | +10 |
| :amount-within-5pct | €157k vs ~€150k (first delivery) | +10 |
Total: 95 → normalized 0.95 — well above high threshold, auto-matches without LLM.
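The amount row in this example can be checked with a small sketch. It assumes the relative difference is measured against the larger of the two totals and that only the tightest matching band fires; neither detail is specified in this document, and `amount-signal` is a hypothetical name.

```clojure
(defn amount-signal
  "Returns :amount-within-2pct, :amount-within-5pct or nil for two totals.
  Relative difference is taken against the larger absolute total
  (an assumption of this sketch)."
  [a b]
  (let [a      (double a)
        b      (double b)
        larger (max (Math/abs a) (Math/abs b))]
    (when (pos? larger)
      (let [rel-diff (/ (Math/abs (- a b)) larger)]
        (cond
          (<= rel-diff 0.02) :amount-within-2pct
          (<= rel-diff 0.05) :amount-within-5pct
          :else              nil)))))

;; 7,000 / 157,000 is roughly 4.5%, so only the 5% band applies.
(amount-signal 157000 150000)
;; => :amount-within-5pct
```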
Extract all numeric quantities from both documents (line items, deliverables, contract schedules). If any quantity + unit pair appears in both, fire :quantity-exact. This handles the case where a contract specifies multiple deliveries and an invoice covers one of them.
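A minimal sketch of this check, assuming quantities have already been extracted into maps with `:amount` and `:unit` keys (a hypothetical shape). The names `quantity-key` and `quantity-exact?` are illustrative, and the 800,000 kWh delivery in the example is made-up data.

```clojure
(require '[clojure.set :as set]
         '[clojure.string :as str])

(defn quantity-key
  "Canonical [amount unit] pair: amount as a double, unit trimmed and lowercased."
  [{:keys [amount unit]}]
  [(double amount) (str/lower-case (str/trim unit))])

(defn quantity-exact?
  "True when at least one quantity + unit pair appears in both documents."
  [source-quantities candidate-quantities]
  (boolean
    (seq (set/intersection (set (map quantity-key source-quantities))
                           (set (map quantity-key candidate-quantities))))))

;; A contract with two scheduled deliveries, and an invoice covering one of them.
(quantity-exact?
  [{:amount 1200000 :unit "kWh"} {:amount 800000 :unit "kWh"}]
  [{:amount 1200000.0 :unit "kwh"}])
;; => true
```

Comparing sets of pairs rather than document totals is what lets a single-delivery invoice match a multi-delivery contract.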
Invocation rule:
LLM receives: Source document summary, each candidate's summary, AND the evidence signals with scores. This lets the LLM see what scoring found and make an informed judgment.
LLM response:
- high or medium confidence → create match
- low confidence → no match