Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Candidate Retrieval Redesign for Document Matching

Date: 2026-02-25
Status: Approved
Supersedes: Candidate retrieval section of 2026-02-24-document-matching-design.md

Problem

The current candidate retrieval fetches 50 documents from the same legal entity using hybrid search (BM25 + semantic) over all documents. This approach breaks down at scale (100k+ documents per legal entity) because:

Searching the entire document space is too broad — relevant candidates get drowned out
Many real document pairs lack explicit cross-references (no PO number on the invoice, no contract reference). Party identity and contextual signals are the primary connection.
The current evidence signals only support exact reference matching, so pairs that carry no references at all are missed

Test case: Contract CFG-ABO-001 (ABO Kraft & Wärme ↔ Carbon Farming, biomethane supply) and Invoice 2025-029RAM share no cross-references, no common VAT ID, and no common IBAN. They connect through supplier name (with a formatting difference), matching quantity (1,200,000 kWh), date alignment (Oct 2025), and price plausibility.

Architecture: Three-Layer Pipeline

Document ingestion completes (valid_structured_data = true)
         │
         ▼
Populate normalized_counterparty + normalized_references
         │
         ▼
Layer 1: SQL candidate retrieval
         (counterparty filter + BM25 + deterministic rank → 50 candidates)
         │
         ▼
Layer 2: Evidence scoring
         (weighted signals on 50 candidates → scored + ranked)
         │
         ├─ 0 candidates ≥ 0.30        → no match
         ├─ 1 candidate ≥ 0.70         → auto-match (rule-based)
         └─ anything else              → top 3 → Layer 3
         │
         ▼
Layer 3: LLM decider (optional)
         (confirms/rejects matches for ambiguous cases)
         │
         ▼
Create/update match edges + clusters

If Layer 1 returns 0 candidates (for example, because the counterparty was not found due to an extraction failure), log the event and skip matching. Don't compensate with a fallback; fix the extraction instead.

Schema Changes

-- Normalized counterparty name for fast supplier-based filtering
ALTER TABLE document ADD COLUMN normalized_counterparty text;
CREATE INDEX idx_document_normalized_counterparty ON document(normalized_counterparty);

-- All reference numbers from the document, normalized (lowercased, separators stripped)
ALTER TABLE document ADD COLUMN normalized_references text[];
CREATE INDEX idx_document_normalized_references ON document USING gin(normalized_references);

Counterparty Extraction

The legal entity is always the buyer. The counterparty (supplier/vendor) comes from:

Document Type	Counterparty Field
Invoice	`issuer.name`
Purchase Order	`supplier.name`
Contract	`party-b.name`
GRN	`supplier.name`
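
The table above can be read as a lookup from document type to a field path. The sketch below assumes structured_data has been parsed into a nested map with keyword keys; the names counterparty-path and counterparty-name are hypothetical and not part of the existing codebase.

```clojure
;; Hypothetical lookup: document type -> path of the counterparty name
;; inside the parsed structured_data map (keyword keys are an assumption).
(def counterparty-path
  {:invoice             [:issuer :name]
   :purchase-order      [:supplier :name]
   :contract            [:party-b :name]
   :goods-received-note [:supplier :name]})

(defn counterparty-name
  "Returns the raw counterparty name, or nil when the document type is
   unknown or the field is missing."
  [doc-type structured-data]
  (some->> (counterparty-path doc-type)
           (get-in structured-data)))
```

A nil result corresponds to the extraction-failure case, where matching is skipped rather than retried with a fallback.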

Names are normalized with the existing normalize-supplier-name function (lowercase, transliterate umlauts, strip punctuation, strip company suffixes):

"ABO Kraft & Wärme Ramstein GmbH & Co KG" → "abo kraft waerme ramstein"
"ABO Kraft & Wärme Ramstein GmbH & Co.KG" → "abo kraft waerme ramstein"

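To make the four normalization steps concrete, here is a minimal sketch. It is not the existing normalize-supplier-name; the function name, the suffix list, and the choice to strip only trailing suffix tokens are assumptions made for illustration.

```clojure
(require '[clojure.string :as str])

;; Assumed transliteration table and suffix list; the real function may differ.
(def umlauts {"ä" "ae", "ö" "oe", "ü" "ue", "ß" "ss"})
(def company-suffixes #{"gmbh" "co" "kg" "ag" "ug" "se" "ltd" "inc"})

(defn normalize-supplier-name-sketch
  "Lowercases, transliterates umlauts, replaces punctuation with spaces,
   then drops company-suffix tokens from the end of the name."
  [s]
  (let [tokens (-> (str/lower-case s)
                   (str/replace #"[äöüß]" umlauts)   ; map used as replacement fn
                   (str/replace #"[^a-z0-9]+" " ")   ; punctuation -> single space
                   str/trim
                   (str/split #" "))]
    (->> tokens
         reverse
         (drop-while company-suffixes)               ; strip trailing suffixes only
         reverse
         (str/join " "))))
```

Under these assumptions both spellings shown above ("GmbH & Co KG" and "GmbH & Co.KG") reduce to the same key, which is what the equality filter in Layer 1 relies on.
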
Reference Normalization

All reference numbers from a document, normalized (lowercased, all separators stripped):

Invoice with invoice-number: "2025-029RAM", po-reference: "PO-2024-001" → ["2025029ram", "po2024001"]
Contract with contract-number: "CFG-ABO-001" → ["cfgabo001"]
PO with po-number: "PO-2024-001", contract-reference: "CFG-ABO-001" → ["po2024001", "cfgabo001"]
GRN with grn-number: "GRN-001", po-reference: "PO-2024-001" → ["grn001", "po2024001"]
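
A minimal sketch of this normalization, assuming references are plain strings and that "separators" means every character other than ASCII letters and digits. The function names are hypothetical.

```clojure
(require '[clojure.string :as str])

(defn normalize-reference
  "Lowercases and strips everything except a-z and 0-9."
  [s]
  (-> s str/lower-case (str/replace #"[^a-z0-9]" "")))

(defn normalized-references
  "Normalizes all non-blank reference values, dropping duplicates while
   keeping first-seen order. The result maps onto the text[] column."
  [refs]
  (->> refs
       (remove str/blank?)          ; skips nil and empty fields
       (map normalize-reference)
       (remove str/blank?)          ; skips values that were only separators
       distinct
       vec))
```

For the invoice above, (normalized-references ["2025-029RAM" "PO-2024-001"]) yields ["2025029ram" "po2024001"].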

Both columns are populated during ingestion, at the point where structured_data is written.

Layer 1: SQL Candidate Retrieval

Single query that filters by counterparty + matchable types, then ranks using BM25 + deterministic signals:

SELECT d.id,
       d.type,
       d.structured_data,
       (
         -- Text rank: textual overlap (commodity, quantities, descriptions).
         -- Note: ts_rank is PostgreSQL's built-in rank and stands in for BM25 here.
         COALESCE(ts_rank(
           to_tsvector('simple', d.searchable_text),
           plainto_tsquery('simple', :query_text)
         ), 0) * 10

         -- Reference overlap: any normalized reference in common
         + CASE WHEN d.normalized_references && :source_refs THEN 50 ELSE 0 END

         -- Amount proximity: within 50% of source total
         + CASE WHEN :source_total > 0
                AND ABS(COALESCE((d.structured_data->>'total')::numeric,
                                 (d.structured_data->>'total-value')::numeric, 0)
                        - :source_total)
                    / :source_total < 0.5
                THEN 10 ELSE 0 END
       ) AS rank_score
FROM document d
WHERE d.legal_entity_id = :legal_entity_id
  AND d.type = ANY(:matchable_types)
  AND d.normalized_counterparty = :counterparty
  AND d.id != :source_id
ORDER BY rank_score DESC, d.id  -- tie-break on id so the top 50 is deterministic
LIMIT 50;

Design choices:

Reference overlap (50) is the highest SQL-level weight — when references exist, they're the strongest signal
The ×10 multiplier on the text rank puts text relevance in a range comparable to the bonus signals
Amount tolerance is generous (50%) — this is a cheap pre-filter, not precise scoring. Layer 2 handles precise amount comparison.
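searchable_text already exists with a GIN index from the current schema. The query above only ranks with ts_rank and has no @@ predicate, so that index is not used here; the counterparty filter is what bounds the set of rows that get ranked.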
Query text is the source document's searchable_text value
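
To see how the three terms of rank_score combine, the arithmetic can be mirrored outside SQL. This is an illustrative sketch only: the function name and the shape of its argument map are assumptions, and text-rank stands for whatever ts_rank returned for the candidate.

```clojure
(defn rank-score
  "Mirrors the SQL rank_score expression: text rank x10, plus 50 for any
   shared reference, plus 10 when the totals differ by less than 50%."
  [{:keys [text-rank refs-overlap? candidate-total source-total]}]
  (+ (* 10 (or text-rank 0))
     (if refs-overlap? 50 0)
     (if (and source-total
              (pos? source-total)
              (< (/ (Math/abs (double (- (or candidate-total 0) source-total)))
                    source-total)
                 0.5))
       10
       0)))
```

With a text rank of 0.5 and totals of 150,000 against 157,000, a candidate without shared references scores 15.0 and one with a shared reference scores 65.0, so a reference hit outranks any combination of the other two terms at typical text-rank values.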

Matchable Type Pairs

Unchanged from current design:

(def matchable-pairs
  #{#{:invoice :purchase-order}
    #{:invoice :contract}
    #{:purchase-order :contract}
    #{:goods-received-note :purchase-order}})
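
The :matchable_types query parameter in Layer 1 can be derived from this set. A possible sketch follows; matchable-types is a hypothetical helper name, and the set is repeated so the snippet stands alone.

```clojure
(def matchable-pairs
  #{#{:invoice :purchase-order}
    #{:invoice :contract}
    #{:purchase-order :contract}
    #{:goods-received-note :purchase-order}})

(defn matchable-types
  "Returns the set of document types that a source document of
   `source-type` may be matched against (empty for unknown types)."
  [source-type]
  (into #{}
        (comp (filter #(contains? % source-type))  ; pairs involving the source
              (mapcat #(disj % source-type)))      ; keep the other member
        matchable-pairs))
```

For example, (matchable-types :invoice) gives #{:purchase-order :contract}, while a goods received note is only ever compared against purchase orders.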

Layer 2: Evidence Scoring

Full weighted signal computation in Clojure on the candidates returned by Layer 1 (at most 50). Produces scored candidates with evidence trails.

Evidence Signals

(def evidence-signals
  {;; Reference matches (highest value — when they exist)
   :po-number-exact         60   ; normalized PO number match
   :contract-ref-exact      55   ; normalized contract reference match
   :po-ref-exact            55   ; PO reference on GRN matches PO document

   ;; Identity matches
   :vat-id-match            30   ; supplier VAT/tax IDs match
   :iban-match              25   ; supplier bank accounts match

   ;; Quantity & amount matches
   :quantity-exact           35   ; same quantity appears in both documents
   :amount-within-2pct       20   ; total amounts within 2%
   :amount-within-5pct       10   ; total amounts within 5%

   ;; Temporal alignment
   :date-within-period       20   ; invoice service period falls within contract dates
   :delivery-date-match      25   ; invoice date aligns with contract delivery schedule

   ;; Fuzzy matches
   :supplier-name-fuzzy      15   ; >0.8 Jaro-Winkler after normalization
   :description-overlap      10   ; shared commodity/service terms

   ;; Negative signals
   :currency-mismatch       -30   ; different currencies
   :vat-id-mismatch         -40}) ; VAT IDs present but don't match
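
The two amount signals leave some details open. The sketch below makes two assumptions that the design does not state: the difference is measured relative to the larger of the two totals, and only the tightest applicable signal fires, so a 1% difference earns 20 rather than 30. The function name is hypothetical.

```clojure
(defn amount-signal
  "Returns :amount-within-2pct, :amount-within-5pct, or nil.
   Assumptions: relative difference uses the larger total as the base,
   and only the tightest matching signal is returned."
  [a b]
  (when (and a b (pos? a) (pos? b))
    (let [rel (/ (Math/abs (double (- a b)))
                 (double (max a b)))]
      (cond
        (<= rel 0.02) :amount-within-2pct
        (<= rel 0.05) :amount-within-5pct))))
```

For totals of 157,000 and 150,000 the relative difference is about 4.5%, so only :amount-within-5pct fires.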

Score Normalization

Raw score = sum of matched signal weights. Normalized = min(1.0, max(0.0, raw / 100.0)).
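
The formula can be sketched directly. The function names are hypothetical, and the weights map is passed in so the snippet does not depend on the full evidence-signals definition.

```clojure
(defn raw-score
  "Sums the weights of the fired signals; unknown signals count as 0."
  [weights fired]
  (reduce + 0 (map #(get weights % 0) fired)))

(defn normalize-score
  "Clamps raw / 100 into the range [0.0, 1.0]."
  [raw]
  (min 1.0 (max 0.0 (/ raw 100.0))))
```

The clamp matters in both directions: stacked reference and identity matches can push the raw score past 100, and negative signals such as :vat-id-mismatch can push it below 0.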

Thresholds

(def match-thresholds
  {:high 0.70
   :low  0.30})

Test Case Validation

Contract CFG-ABO-001 ↔ Invoice 2025-029RAM:

Signal	Value	Weight
`:quantity-exact`	1,200,000 kWh in both	+35
`:delivery-date-match`	Oct 2025 ↔ Gastag 22.10.2025	+25
`:supplier-name-fuzzy`	Normalized names match	+15
`:description-overlap`	"Lieferung von Biomethan"	+10
`:amount-within-5pct`	€157k vs ~€150k (first delivery)	+10

Total: 95 → normalized 0.95, well above the high threshold of 0.70. If it is the only candidate at or above 0.70, it auto-matches without the LLM.

Quantity Matching

Extract all numeric quantities from both documents (line items, deliverables, contract schedules). If any quantity + unit pair appears in both, fire :quantity-exact. This handles the case where a contract specifies multiple deliveries and an invoice covers one of them.
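
One way to express this check is as a set intersection over (quantity, unit) pairs. The sketch assumes quantities have already been extracted into maps with :quantity and :unit keys, compares quantities as doubles, and compares units case-insensitively; none of these details are fixed by the design.

```clojure
(require '[clojure.set :as set]
         '[clojure.string :as str])

(defn quantity-key
  "Canonical comparison key: [quantity-as-double lowercased-unit]."
  [{:keys [quantity unit]}]
  [(double quantity) (str/lower-case (str/trim unit))])

(defn quantity-exact?
  "True when at least one quantity + unit pair appears in both documents."
  [source-quantities candidate-quantities]
  (boolean
   (seq (set/intersection (set (map quantity-key source-quantities))
                          (set (map quantity-key candidate-quantities))))))
```

Because any shared pair is enough, a contract listing several deliveries still matches an invoice that covers only one of them.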

Layer 3: LLM Decider

Invocation rule:

0 candidates ≥ low (0.30) → no match, no LLM
1 candidate ≥ high (0.70) → auto-match (rule-based), no LLM
Anything else → send top 3 candidates (by score) to LLM
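
The invocation rule can be sketched as a single routing function. Two readings are assumed here and should be confirmed against the intended behaviour: "1 candidate ≥ high" means exactly one candidate at or above 0.70 (regardless of lower-scoring ones), and the top 3 are taken from all scored candidates. The function name and the shape of the result map are hypothetical.

```clojure
(def match-thresholds
  {:high 0.70
   :low  0.30})

(defn route
  "Decides what happens after Layer 2. `scored` is a collection of maps
   with a :score key. Returns a map with a :decision of :no-match,
   :auto-match, or :llm."
  [scored]
  (let [{:keys [high low]} match-thresholds
        ranked  (sort-by :score > scored)
        at-high (filter #(>= (:score %) high) ranked)]
    (cond
      ;; Nothing reaches the low threshold: no match, no LLM call.
      (not-any? #(>= (:score %) low) ranked)
      {:decision :no-match}

      ;; Exactly one strong candidate: rule-based auto-match.
      (= 1 (count at-high))
      {:decision :auto-match :candidate (first at-high)}

      ;; Several strong candidates, or only mid-range ones: ask the LLM.
      :else
      {:decision :llm :candidates (vec (take 3 ranked))})))
```

Two candidates at 0.80 and 0.75 therefore go to the LLM even though both clear the high threshold, since the rule-based path cannot choose between them.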

LLM receives: the source document summary, each candidate's summary, and the evidence signals with their scores. This lets the LLM see what the scoring found and make an informed judgment.

LLM response:

Confirms or rejects each candidate
high or medium confidence → create match
low confidence → no match

Non-Goals (v1)

Line-item reconciliation (document-level matching only)
Contract commercial term validation
Manual match creation/deletion
Match approval workflow
Warnings and anomaly detection
Configurable thresholds per legal entity
Supplier identity grouping (linking "ABO Energy" to "ABO Kraft & Wärme Ramstein" as same entity)

Future Work

Supplier identity table (group entity name variants)
Line-item reconciliation (match invoice lines to PO lines / contract deliverables)
Contract commercial term validation (price vs contract price, volume discounts)
Warnings (over-invoicing, VAT mismatch flags, duplicate invoice detection)
Match dashboard with filtering
Manual match creation/deletion
Configurable thresholds per legal entity