Matching Algorithm Tuning

Problem

Invoice-to-contract matching fails on obvious matches (bikosigma invoice<->contract):

Blended score is 0.47 because the deterministic signals all return 0 (data is missing on the contract side).
LLM decider returns {"matches": []} because the prompt lacks the data needed to reason about the match.

Fix 1: Enrich LLM Prompt

format-document-summary in llm_decision.clj currently sends skeletal summaries. Enrich per type:

Invoice — add: line items (description, quantity, unit, amount), service period, PO/GR references, payment terms, due date

Contract — add: title, description, contract type, deliverables, base fee, variable components, payment schedule, PO references, currency, total value

Purchase Order — add: line items (description, quantity, unit, amount), contract reference, requisition number

GRN — add: line items (description, quantity-received, unit)

Send all line items and deliverables, with no cap. Gemini 2.5 Flash has a 1M-token context window, so prompt truncation is deferred until needed.
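
A minimal sketch of what the enriched invoice summary could look like. The helper name `format-invoice-summary` and every key (`:line-items`, `:service-period`, `:po-references`, `:payment-terms`, `:due-date`) are assumptions for illustration, not the project's confirmed schema or the actual body of format-document-summary. Fields are rendered only when present, so sparse documents do not produce empty labels.

```clojure
(require '[clojure.string :as str])

;; NOTE: all keys below are assumed names, not the confirmed document schema.
(defn format-line-item
  "Renders one line item as a single indented row."
  [{:keys [description quantity unit amount]}]
  (str "  - " description " | " quantity " " unit " | " amount))

(defn format-invoice-summary
  "Builds the enriched invoice section of the prompt.
   Missing fields are skipped rather than printed as blanks."
  [{:keys [line-items service-period po-references payment-terms due-date]}]
  (->> [(when (seq line-items)
          (str "Line items:\n"
               (str/join "\n" (map format-line-item line-items))))
        (when service-period
          (str "Service period: " (:start service-period)
               " to " (:end service-period)))
        (when (seq po-references)
          (str "PO/GR references: " (str/join ", " po-references)))
        (when payment-terms (str "Payment terms: " payment-terms))
        (when due-date (str "Due date: " due-date))]
       (remove nil?)
       (str/join "\n")))
```

The other document types would follow the same pattern with their own field lists; all line items are emitted, matching the no-cap decision above.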

Fix 2a: `description-overlap` Signal

Token-level Jaccard similarity between document descriptions.

Extract tokens from invoice line-items[*].description and contract deliverables[*]
Normalize: lowercase, strip punctuation, remove stopwords (German + English)
Jaccard index: |A ∩ B| / |A ∪ B|
Fire if Jaccard > 0.15
Weight: +25
Applies to all type pairs
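
The steps above can be sketched roughly as follows. This assumes deliverables are plain strings and line items are maps with a `:description` key; the stopword set is a tiny placeholder, and the returned signal map shape is hypothetical rather than what collect-signals actually expects.

```clojure
(require '[clojure.string :as str]
         '[clojure.set :as set])

;; Placeholder list only; a real one would cover far more German and English words.
(def stopwords #{"the" "and" "of" "for" "der" "die" "das" "und" "für"})

(defn tokenize
  "Lowercases, splits on anything that is not a letter or digit,
   and drops blanks and stopwords. Returns a set of tokens."
  [s]
  (->> (str/split (str/lower-case (or s "")) #"[^\p{L}\p{N}]+")
       (remove str/blank?)
       (remove stopwords)
       set))

(defn jaccard
  "|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."
  [a b]
  (let [union (set/union a b)]
    (if (empty? union)
      0.0
      (double (/ (count (set/intersection a b)) (count union))))))

(defn description-overlap
  "Returns a signal map when the overlap exceeds 0.15, else nil.
   The map shape is an assumption for illustration."
  [invoice contract]
  (let [inv-tokens (into #{} (mapcat (comp tokenize :description)) (:line-items invoice))
        con-tokens (into #{} (mapcat tokenize) (:deliverables contract))
        score      (jaccard inv-tokens con-tokens)]
    (when (> score 0.15)
      {:signal :description-overlap :weight 25 :jaccard score})))
```

For example, an invoice line "Cloud hosting services" against a deliverable "Hosting services for the platform" shares 2 of 4 distinct tokens after stopword removal, giving a Jaccard index of 0.5, so the signal fires.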

Fix 2b: Relax `date-within-period`

Current: requires both contract dates and the invoice service period. The signal cannot fire unless all of them are present.

New cascade:

Contract dates + invoice service period → strict check (service period within contract period)
Contract effective only (no expiration) → invoice-date >= effective-date
Invoice has no service period → check invoice-date against contract date range
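
One way the cascade could be expressed, as a sketch under assumptions: dates are `java.time.LocalDate` values, and the keys (`:effective-date`, `:expiration-date`, `:invoice-date`, `:service-start`, `:service-end`) are hypothetical names. Branch order follows the list above, and any combination with insufficient data yields false.

```clojure
(import 'java.time.LocalDate)

(defn on-or-after?  [^LocalDate a ^LocalDate b] (not (.isBefore a b)))
(defn on-or-before? [^LocalDate a ^LocalDate b] (not (.isAfter a b)))

(defn date-within-period?
  "Relaxed date check. Key names are assumptions, not the confirmed schema."
  [{:keys [effective-date expiration-date]}
   {:keys [invoice-date service-start service-end]}]
  (cond
    ;; 1. Full data on both sides: service period must sit inside the contract period.
    (and effective-date expiration-date service-start service-end)
    (and (on-or-after? service-start effective-date)
         (on-or-before? service-end expiration-date))

    ;; 2. Open-ended contract: invoice date must not precede the effective date.
    (and effective-date (nil? expiration-date) invoice-date)
    (on-or-after? invoice-date effective-date)

    ;; 3. No (complete) service period: fall back to the invoice date.
    (and effective-date expiration-date invoice-date)
    (and (on-or-after? invoice-date effective-date)
         (on-or-before? invoice-date expiration-date))

    ;; Not enough data to decide: the signal does not fire.
    :else false))
```

Note that in this sketch a partially filled service period (only a start or only an end) falls through to the invoice-date check; whether that is the desired behavior is left open by the cascade description.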

Expected Impact (bikosigma case)

Before: 0.6 * 0.787 + 0.4 * 0.0 = 0.472

After (description-overlap ~+25, date-within-period +20):

Deterministic: 45/100 = 0.45
Blended: 0.6 * 0.787 + 0.4 * 0.45 = 0.652

Still below the 0.70 auto-match threshold, so the pair goes to the LLM, which now has enough data to match confidently.
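
The arithmetic above can be reproduced with a small sketch. The function names and the `other-score` parameter are hypothetical (the source does not name the component behind 0.787), and treating the threshold as inclusive is an assumption.

```clojure
(defn blended-score
  "0.6 * other-score + 0.4 * deterministic, where deterministic is the
   sum of fired signal weights divided by 100."
  [other-score signal-weights]
  (let [deterministic (/ (reduce + 0 signal-weights) 100.0)]
    (+ (* 0.6 other-score) (* 0.4 deterministic))))

(defn route
  "Assumes scores at or above 0.70 auto-match; everything else goes to the LLM."
  [score]
  (if (>= score 0.70) :auto-match :llm-decider))

;; Before: no deterministic signals fire.
(blended-score 0.787 [])        ; approximately 0.472

;; After: description-overlap (+25) and date-within-period (+20) fire.
(blended-score 0.787 [25 20])   ; approximately 0.652
(route (blended-score 0.787 [25 20]))
```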

Files to Modify

src/com/getorcha/workers/matching/llm_decision.clj — format-document-summary
src/com/getorcha/workers/matching/evidence.clj — collect-signals, new description-overlap + relaxed date-within-period
test/com/getorcha/workers/matching/llm_decision_test.clj — updated test fixtures
test/com/getorcha/workers/matching/evidence_test.clj — new signal tests