Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Candidate Retrieval Redesign for Document Matching

Date: 2026-02-25
Status: Approved
Supersedes: Candidate retrieval section of 2026-02-24-document-matching-design.md

Problem

The current candidate retrieval fetches 50 documents from the same legal entity using hybrid search (BM25 + semantic) over all documents. This approach breaks down at scale (100k+ documents per legal entity) because:

  1. Searching the entire document space is too broad — relevant candidates get drowned out
  2. Many real document pairs lack explicit cross-references (no PO number on the invoice, no contract reference). Party identity and contextual signals are the primary connection.
  3. The current evidence signals only support exact reference matching, missing cases where references don't exist

Test case: Contract CFG-ABO-001 (ABO Kraft & Wärme ↔ Carbon Farming, biomethane supply) and Invoice 2025-029RAM share zero cross-references, no common VAT ID, no IBAN. They connect through: supplier name (with formatting difference), matching quantity (1,200,000 kWh), date alignment (Oct 2025), and price plausibility.


Architecture: Three-Layer Pipeline

Document ingestion completes (valid_structured_data = true)
         │
         ▼
Populate normalized_counterparty + normalized_references
         │
         ▼
Layer 1: SQL candidate retrieval
         (counterparty filter + BM25 + deterministic rank → 50 candidates)
         │
         ▼
Layer 2: Evidence scoring
         (weighted signals on 50 candidates → scored + ranked)
         │
         ├─ 0 candidates ≥ 0.30        → no match
         ├─ 1 candidate ≥ 0.70         → auto-match (rule-based)
         └─ anything else              → top 3 → Layer 3
         │
         ▼
Layer 3: LLM decider (optional)
         (confirms/rejects matches for ambiguous cases)
         │
         ▼
Create/update match edges + clusters

If Layer 1 returns 0 candidates (counterparty not found due to extraction failure), log and skip matching. Don't compensate with a fallback — fix the extraction instead.


Schema Changes

-- Normalized counterparty name for fast supplier-based filtering
ALTER TABLE document ADD COLUMN normalized_counterparty text;
CREATE INDEX idx_document_normalized_counterparty ON document(normalized_counterparty);

-- All reference numbers from the document, normalized (lowercased, separators stripped)
ALTER TABLE document ADD COLUMN normalized_references text[];
CREATE INDEX idx_document_normalized_references ON document USING gin(normalized_references);

Counterparty Extraction

The legal entity is always the buyer. The counterparty (supplier/vendor) comes from:

Document Type     Counterparty Field
Invoice           issuer.name
Purchase Order    supplier.name
Contract          party-b.name
GRN               supplier.name
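
The lookup in the table above could be sketched as a small data-driven function. This is only an illustration: the names counterparty-path and raw-counterparty, the keyword spellings of the document types, and the shape of the document map are assumptions, not the existing code.

```clojure
;; Hypothetical mapping from document type to the path of the counterparty
;; name inside structured_data (keyword spellings are assumed).
(def counterparty-path
  {:invoice             [:issuer :name]
   :purchase-order      [:supplier :name]
   :contract            [:party-b :name]
   :goods-received-note [:supplier :name]})

(defn raw-counterparty
  "Returns the un-normalized counterparty name for a document, or nil when
  the type is not in the table or the field is missing."
  [{:keys [type structured-data]}]
  (when-let [path (counterparty-path type)]
    (get-in structured-data path)))
```

A nil result corresponds to the extraction-failure case, where matching is logged and skipped rather than patched with a fallback.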

The name is normalized with the existing normalize-supplier-name function (lowercase, transliterate umlauts, strip punctuation, strip company suffixes).
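
To make the four normalization steps concrete, here is a rough sketch of what such a normalizer might look like. It is not the existing normalize-supplier-name: the function name normalize-supplier-name-sketch, the suffix list, and the exact punctuation rule are all assumptions.

```clojure
(require '[clojure.string :as str])

;; Assumed suffix list; the real function may strip a different set.
(def company-suffixes
  #{"gmbh" "ag" "kg" "ug" "se" "ltd" "inc" "llc" "bv" "co"})

;; German umlaut transliteration applied after lowercasing.
(def umlaut-map {"ä" "ae" "ö" "oe" "ü" "ue" "ß" "ss"})

(defn normalize-supplier-name-sketch
  "Lowercases, transliterates umlauts, replaces punctuation with spaces,
  drops company-suffix tokens, and joins the rest with single spaces."
  [s]
  (let [lowered  (str/lower-case (or s ""))
        translit (str/replace lowered #"[äöüß]" umlaut-map)
        no-punct (str/replace translit #"[^\p{L}\p{N}\s]" " ")
        tokens   (remove str/blank? (str/split no-punct #"\s+"))]
    (->> tokens
         (remove company-suffixes)
         (str/join " "))))
```

With this sketch, "ABO Kraft & Wärme" becomes "abo kraft waerme", and an invented name such as "Müller & Söhne GmbH & Co. KG" becomes "mueller soehne". Layer 1 filters on exact equality of the normalized value, so both sides of a pair must pass through the same function.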

Reference Normalization

All reference numbers from a document are collected and normalized (lowercased, all separators stripped).
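
A minimal sketch of that normalization, assuming "separators" means every character that is not a letter or digit. The function names normalize-reference and normalize-references are hypothetical.

```clojure
(require '[clojure.string :as str])

(defn normalize-reference
  "Lowercases a reference and removes everything except letters and digits."
  [r]
  (-> (str r)
      str/lower-case
      (str/replace #"[^\p{L}\p{N}]" "")))

(defn normalize-references
  "Normalizes a collection of references, dropping blanks and duplicates.
  The result is a vector suitable for a text[] column."
  [refs]
  (->> refs
       (map normalize-reference)
       (remove str/blank?)
       distinct
       vec))
```

For example, "CFG-ABO-001" becomes "cfgabo001" and "2025-029RAM" becomes "2025029ram", so references written with different separators on two documents still overlap.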

Both columns are populated during ingestion, when structured_data is written.


Layer 1: SQL Candidate Retrieval

Single query that filters by counterparty + matchable types, then ranks using BM25 + deterministic signals:

SELECT d.id,
       d.type,
       d.structured_data,
       (
         -- Text relevance (ts_rank, PostgreSQL's built-in ranking, standing in for BM25):
         -- textual overlap (commodity, quantities, descriptions)
         COALESCE(ts_rank(
           to_tsvector('simple', d.searchable_text),
           plainto_tsquery('simple', :query_text)
         ), 0) * 10

         -- Reference overlap: any normalized reference in common
         + CASE WHEN d.normalized_references && :source_refs THEN 50 ELSE 0 END

         -- Amount proximity: within 50% of source total
         + CASE WHEN :source_total > 0
                AND ABS(COALESCE((d.structured_data->>'total')::numeric,
                                 (d.structured_data->>'total-value')::numeric, 0)
                        - :source_total)
                    / :source_total < 0.5
                THEN 10 ELSE 0 END
       ) AS rank_score
FROM document d
WHERE d.legal_entity_id = :legal_entity_id
  AND d.type = ANY(:matchable_types)
  AND d.normalized_counterparty = :counterparty
  AND d.id != :source_id
ORDER BY rank_score DESC, d.id  -- d.id breaks ties so the ordering is deterministic
LIMIT 50;

Design choices:

Matchable Type Pairs

Unchanged from current design:

(def matchable-pairs
  #{#{:invoice :purchase-order}
    #{:invoice :contract}
    #{:purchase-order :contract}
    #{:goods-received-note :purchase-order}})
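
One way this set might be consumed, shown as a sketch: matchable? and matchable-types are hypothetical helper names, and the second one illustrates how the :matchable_types parameter of the Layer 1 query could be derived from the source document's type.

```clojure
(def matchable-pairs
  #{#{:invoice :purchase-order}
    #{:invoice :contract}
    #{:purchase-order :contract}
    #{:goods-received-note :purchase-order}})

(defn matchable?
  "True when the unordered pair of document types may be matched.
  hash-set is used instead of a set literal so equal types do not throw."
  [type-a type-b]
  (contains? matchable-pairs (hash-set type-a type-b)))

(defn matchable-types
  "All types that a document of source-type may be matched against."
  [source-type]
  (into #{}
        (comp (filter #(contains? % source-type))
              (mapcat #(disj % source-type)))
        matchable-pairs))
```

Because the pairs are unordered sets, (matchable? :contract :invoice) and (matchable? :invoice :contract) give the same answer, and (matchable-types :goods-received-note) yields only #{:purchase-order}.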

Layer 2: Evidence Scoring

Full weighted signal computation in Clojure on the 50 candidates from Layer 1. Produces scored candidates with evidence trails.

Evidence Signals

(def evidence-signals
  {;; Reference matches (highest value — when they exist)
   :po-number-exact         60   ; normalized PO number match
   :contract-ref-exact      55   ; normalized contract reference match
   :po-ref-exact            55   ; PO reference on GRN matches PO document

   ;; Identity matches
   :vat-id-match            30   ; supplier VAT/tax IDs match
   :iban-match              25   ; supplier bank accounts match

   ;; Quantity & amount matches
   :quantity-exact           35   ; same quantity appears in both documents
   :amount-within-2pct       20   ; total amounts within 2%
   :amount-within-5pct       10   ; total amounts within 5%

   ;; Temporal alignment
   :date-within-period       20   ; invoice service period falls within contract dates
   :delivery-date-match      25   ; invoice date aligns with contract delivery schedule

   ;; Fuzzy matches
   :supplier-name-fuzzy      15   ; >0.8 Jaro-Winkler after normalization
   :description-overlap      10   ; shared commodity/service terms

   ;; Negative signals
   :currency-mismatch       -30   ; different currencies
   :vat-id-mismatch         -40}) ; VAT IDs present but don't match

Score Normalization

Raw score = sum of matched signal weights. Normalized = min(1.0, max(0.0, raw / 100.0)).
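
The formula can be written as a short function. This is a sketch: the name normalized-score and the idea of passing the fired signals as a collection of keywords are assumptions, and the weights map in the usage note is a subset of evidence-signals.

```clojure
(defn normalized-score
  "Sums the weights of the fired signals and clamps raw / 100 into [0, 1].
  Signals missing from the weights map contribute 0."
  [weights fired-signals]
  (let [raw (reduce + 0 (map #(get weights % 0) fired-signals))]
    {:raw        raw
     :normalized (-> (/ raw 100.0)
                     (max 0.0)
                     (min 1.0))}))
```

With weights {:po-number-exact 60, :contract-ref-exact 55, :vat-id-mismatch -40}, firing both reference signals gives a raw score of 115 that is clamped to 1.0, and firing only :vat-id-mismatch gives -40, clamped to 0.0. The clamp keeps negative signals from producing scores below the no-match floor.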

Thresholds

(def match-thresholds
  {:high 0.70
   :low  0.30})
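
The routing between no match, auto-match, and the LLM decider might look like the sketch below. It rests on two interpretive assumptions: "1 candidate ≥ 0.70" is read as exactly one, and "top 3" is taken from the full ranked list. The function name route and the :score and :decision keys are hypothetical.

```clojure
(def match-thresholds
  {:high 0.70
   :low  0.30})

(defn route
  "Decides what happens after Layer 2. candidates is a seq of maps, each
  carrying a normalized :score in [0, 1]."
  [candidates]
  (let [{:keys [high low]} match-thresholds
        ranked     (sort-by :score > candidates)
        above-low  (filter #(>= (:score %) low) ranked)
        above-high (filter #(>= (:score %) high) ranked)]
    (cond
      ;; nothing reaches the low threshold
      (empty? above-low)       {:decision :no-match}
      ;; exactly one strong candidate (assumed reading of the rule)
      (= 1 (count above-high)) {:decision :auto-match
                                :match    (first above-high)}
      ;; ambiguous: several strong candidates, or only mid-range ones
      :else                    {:decision   :llm
                                :candidates (vec (take 3 ranked))})))
```

Under this reading, two candidates at 0.80 and 0.75 go to the LLM because neither is uniquely strong, and a single candidate at 0.50 also goes to the LLM because it clears the low threshold without reaching the high one.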

Test Case Validation

Contract CFG-ABO-001 ↔ Invoice 2025-029RAM:

Signal                  Value                                 Weight
:quantity-exact         1,200,000 kWh in both                 +35
:delivery-date-match    Oct 2025 ↔ Gastag 22.10.2025          +25
:supplier-name-fuzzy    Normalized names match                +15
:description-overlap    "Lieferung von Biomethan"             +10
:amount-within-5pct     €157k vs ~€150k (first delivery)      +10

Total: 95 → normalized 0.95 — well above high threshold, auto-matches without LLM.

Quantity Matching

Extract all numeric quantities from both documents (line items, deliverables, contract schedules). If any quantity + unit pair appears in both, fire :quantity-exact. This handles the case where a contract specifies multiple deliveries and an invoice covers one of them.
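
The comparison step could be sketched as a set intersection over (value, unit) pairs. This assumes quantities have already been extracted and parsed into maps with a numeric :value and a string :unit; the names quantity-key, shared-quantities, and quantity-exact? are hypothetical, and unit handling is limited to trimming and lowercasing (no conversion between kWh and MWh).

```clojure
(require '[clojure.set :as set]
         '[clojure.string :as str])

(defn quantity-key
  "Canonical [value unit] pair used for exact comparison."
  [{:keys [value unit]}]
  [(double value) (str/lower-case (str/trim unit))])

(defn shared-quantities
  "Set of [value unit] pairs that appear in both documents."
  [quantities-a quantities-b]
  (set/intersection (set (map quantity-key quantities-a))
                    (set (map quantity-key quantities-b))))

(defn quantity-exact?
  "True when at least one quantity + unit pair occurs in both documents."
  [quantities-a quantities-b]
  (boolean (seq (shared-quantities quantities-a quantities-b))))
```

A contract listing several deliveries (say 1,200,000 kWh plus a second, illustrative 800,000 kWh) and an invoice for 1,200,000 kWh share one pair, so the signal fires even though the invoice covers only part of the contract.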


Layer 3: LLM Decider

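Invocation rule: Layer 3 runs only on the "anything else" branch of the Layer 2 routing (neither a clear no-match nor a single auto-match) and receives the top 3 scored candidates.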

LLM receives: Source document summary, each candidate's summary, AND the evidence signals with scores. This lets the LLM see what scoring found and make an informed judgment.
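
As an illustration of that input, the payload might be assembled as below. Every key name here (:summary, :score, :evidence) and the function name llm-input are assumptions; the document only states which three kinds of information are passed.

```clojure
(defn llm-input
  "Builds the map handed to the LLM decider: the source summary plus, for
  each candidate, its summary, its score, and the evidence that fired.
  Other keys on the input maps are dropped."
  [source scored-candidates]
  {:source     (select-keys source [:id :type :summary])
   :candidates (mapv #(select-keys % [:id :type :summary :score :evidence])
                     scored-candidates)})
```

Here :evidence would be a map from signal keyword to the weight it contributed, for example {:quantity-exact 35, :amount-within-2pct 20} for a candidate scored at 0.55.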

LLM response:


Non-Goals (v1)

Future Work