Matching Algorithm Tuning

Problem

Invoice-to-contract matching fails on obvious matches (e.g. the bikosigma invoice<->contract pair), for two reasons:

  1. The blended score is 0.47 because the deterministic signals all return 0 (data is missing on the contract side).
  2. The LLM decider returns {"matches": []} because the prompt lacks the data needed to reason about the match.

Fix 1: Enrich LLM Prompt

format-document-summary in llm_decision.clj currently sends skeletal summaries. Enrich per type:

Invoice — add: line items (description, quantity, unit, amount), service period, PO/GR references, payment terms, due date

Contract — add: title, description, contract type, deliverables, base fee, variable components, payment schedule, PO references, currency, total value

Purchase Order — add: line items (description, quantity, unit, amount), contract reference, requisition number

GRN — add: line items (description, quantity-received, unit)

Send all line items and deliverables, with no cap. Gemini 2.5 Flash has a 1M-token context window, so prompt truncation is deferred until it is actually needed.
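A minimal sketch of the per-type enrichment, written in Python for illustration only (the real format-document-summary lives in llm_decision.clj). Every field key below (service_period, base_fee, line_items, and so on) is a hypothetical name chosen for this example, not the project's actual schema.

```python
from typing import Any

# Hypothetical scalar fields to add per document type (names are assumptions).
SCALAR_FIELDS: dict[str, list[str]] = {
    "invoice": ["service_period", "po_references", "gr_references",
                "payment_terms", "due_date"],
    "contract": ["title", "description", "contract_type", "base_fee",
                 "variable_components", "payment_schedule",
                 "po_references", "currency", "total_value"],
    "purchase_order": ["contract_reference", "requisition_number"],
    "grn": [],
}

# Hypothetical key holding the repeated entries for each type.
LIST_FIELD: dict[str, str] = {
    "invoice": "line_items",
    "contract": "deliverables",
    "purchase_order": "line_items",
    "grn": "line_items",
}


def format_document_summary(doc: dict[str, Any]) -> str:
    """Render one document as plain text for the LLM prompt."""
    doc_type = doc["type"]
    lines = [f"type: {doc_type}"]
    for field in SCALAR_FIELDS.get(doc_type, []):
        value = doc.get(field)
        if value is not None:  # omit missing data rather than printing "None"
            lines.append(f"{field}: {value}")
    list_key = LIST_FIELD.get(doc_type, "line_items")
    # No cap: every line item / deliverable is included.
    for i, item in enumerate(doc.get(list_key, []), start=1):
        if isinstance(item, dict):
            text = ", ".join(f"{k}={v}" for k, v in item.items())
        else:
            text = str(item)
        lines.append(f"{list_key} {i}: {text}")
    return "\n".join(lines)
```

Missing fields are skipped instead of rendered as empty values, so the prompt only carries data the model can actually reason about.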

Fix 2a: description-overlap Signal

Token-level Jaccard similarity between document descriptions.
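As an illustration, token-level Jaccard similarity is the size of the intersection of the two token sets divided by the size of their union. The Python sketch below assumes lowercase alphanumeric tokenization and a score of 0.0 when either description is empty; both choices, and the function names, are assumptions rather than the project's actual implementation.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def description_overlap(a: str, b: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over the two token sets."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a or not tokens_b:
        return 0.0  # assumption: missing description contributes nothing
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# "monthly hosting fee" vs "hosting fee for march":
# 2 shared tokens out of 5 distinct tokens.
score = description_overlap("Monthly hosting fee", "Hosting fee for March")
```

Because the comparison is set-based, repeated words and word order do not affect the score.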

Fix 2b: Relax date-within-period

Current: requires both contract dates (effective and expiration) and the invoice service period. All three must be present for the signal to fire.

New cascade:

  1. Contract dates + invoice service period → strict check (service period within contract period)
  2. Contract effective only (no expiration) → invoice-date >= effective-date
  3. Invoice has no service period → check invoice-date against contract date range
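The cascade above can be sketched as a function that tries each tier in order and falls through when the data it needs is absent. This is a Python illustration under stated assumptions: the parameter names are hypothetical, a partially present service period is treated as missing, and the function returns False when no tier applies.

```python
from datetime import date
from typing import Optional


def date_within_period(
    invoice_date: Optional[date],
    service_start: Optional[date],
    service_end: Optional[date],
    effective: Optional[date],
    expiration: Optional[date],
) -> bool:
    """Relaxed date check; tiers are tried from strictest to loosest."""
    has_service_period = service_start is not None and service_end is not None

    # Tier 1: both contract dates and a full service period -> strict check.
    if effective is not None and expiration is not None and has_service_period:
        return effective <= service_start and service_end <= expiration

    # Tier 2: contract has an effective date but no expiration.
    if effective is not None and expiration is None:
        return invoice_date is not None and invoice_date >= effective

    # Tier 3: no usable service period -> compare the invoice date
    # against the contract date range.
    if effective is not None and expiration is not None:
        return invoice_date is not None and effective <= invoice_date <= expiration

    return False  # assumption: not enough data to evaluate the signal
```

For example, an invoice dated 2025-03-05 with no service period passes against a contract running from 2025-01-01 to 2025-12-31 via tier 3, where the current all-or-nothing rule would not fire at all.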

Expected Impact (bikosigma case)

Before: 0.6 * 0.787 + 0.4 * 0.0 = 0.472

After (description-overlap ~+25, date-within-period +20):

Still below the 0.70 auto-match threshold, so the pair goes to the LLM, which now has enough data to match confidently.
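The arithmetic can be checked with a short Python sketch. It rests on one assumption that the note does not state: the +25 and +20 are points on a 0-100 scale, so together they give a deterministic score of 0.45. The 0.6/0.4 weights, the 0.787 component, and the 0.70 threshold come from the figures above; the variable names are hypothetical.

```python
PRIMARY_WEIGHT = 0.6          # weight on the 0.787 component
DETERMINISTIC_WEIGHT = 0.4    # weight on the deterministic signals
AUTO_MATCH_THRESHOLD = 0.70


def blended_score(primary: float, deterministic: float) -> float:
    """Weighted blend of the two score components."""
    return PRIMARY_WEIGHT * primary + DETERMINISTIC_WEIGHT * deterministic


before = blended_score(0.787, 0.0)

# Assumption: signal points are out of 100, so +25 and +20 -> 0.45.
deterministic_after = (25 + 20) / 100
after = blended_score(0.787, deterministic_after)

print(round(before, 3))              # 0.472
print(round(after, 3))               # 0.652
print(after < AUTO_MATCH_THRESHOLD)  # True
```

Under that assumption the new score is about 0.65, which agrees with the statement that the pair stays below the auto-match threshold. Since the description-overlap contribution is only approximate, the exact value will vary.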

Files to Modify