The matching pipeline has two phases: hybrid search (BM25 + semantic) retrieves candidates, then deterministic evidence signals score them independently. The retrieval scores are discarded before scoring.
This causes two problems:
- The retrieval score is itself evidence of a match, and discarding it means the final score ignores what hybrid search already found.
- :supplier-name-fuzzy and :description-overlap poorly re-implement what BM25/semantic search already does.

Example: the bikosigma invoice vs. contract pair scored 0.15 (only :supplier-name-fuzzy fired), even though hybrid search correctly found the contract and the invoice line items were verbatim matches to the contract fee schedule.

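The proposed change keeps the cosine similarity from retrieval and blends it with the deterministic score, weighted by an alpha that depends on the document-type pair:

    final_score = alpha * cosine_similarity + (1 - alpha) * deterministic_score
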
| Document-Type Pair | alpha | Rationale |
|---|---|---|
| invoice <-> contract | 0.6 | Contracts lack VAT, IBAN, amounts. Semantic similarity is primary signal. |
| invoice <-> PO | 0.5 | Rich deterministic fields but retrieval also valuable. |
| invoice <-> GRN | 0.3 | Quantities, dates, supplier info commonly present on both. |
| PO <-> contract | 0.5 | Moderate — contracts may have PO refs but often sparse. |
| PO <-> GRN | 0.3 | Rich — PO refs, quantities, dates. |
| default | 0.4 | Balanced fallback for unlisted pairs. |
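
The formula and the alpha table above can be sketched in Clojure roughly as follows. This is a minimal sketch, not the actual implementation: the names type-pair-alpha and blend-score come from this document, but representing document types as keywords (:invoice, :po, :grn, :contract), keying the map by unordered sets, and the default-alpha var are assumptions made for illustration.

```clojure
;; Alpha per document-type pair. Keys are sets so that
;; [:invoice :po] and [:po :invoice] resolve to the same entry.
;; The keyword names for document types are assumed, not taken from the codebase.
(def type-pair-alpha
  {#{:invoice :contract} 0.6
   #{:invoice :po}       0.5
   #{:invoice :grn}      0.3
   #{:po :contract}      0.5
   #{:po :grn}           0.3})

;; Fallback for pairs not listed in the table (hypothetical var name).
(def default-alpha 0.4)

(defn blend-score
  "Blends the retrieval cosine similarity with the deterministic score.
  `type-pair` is any two-element collection of document-type keywords."
  [type-pair cosine deterministic]
  (let [alpha (get type-pair-alpha (set type-pair) default-alpha)]
    (+ (* alpha cosine)
       (* (- 1.0 alpha) deterministic))))
```

For instance, `(blend-score [:invoice :contract] 0.80 0.0)` weights the cosine similarity by 0.6 and the deterministic score by 0.4, while a pair missing from the map, such as `[:grn :contract]`, falls back to 0.4.
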
Remove (redundant with hybrid search):
- :supplier-name-fuzzy (weight 15): candidate retrieval already filters by normalized_counterparty
- :description-overlap (weight 10): BM25 does bag-of-words better on full searchable text

Keep (structured field comparisons that add value beyond retrieval):
- :po-number-exact (60), :contract-ref-exact (55), :po-ref-exact (55)
- :vat-id-match (30), :vat-id-mismatch (-40)
- :iban-match (25)
- :quantity-exact (35)
- :amount-within-2pct (20), :amount-within-5pct (10)
- :date-within-period (20), :delivery-date-match (25)
- :currency-mismatch (-30)

Thresholds are unchanged: 0.70 (auto-match) and 0.30 (minimum to consider). Recalibrate empirically after deployment.

The resulting data flow:

    candidates/find-candidates -> [rows with cosine similarity preserved]
        |
    evidence/compute-score -> deterministic score (10 signals)
        |
    blend-score(type-pair, cosine, deterministic) -> final score

- candidates/find-candidates already returns :score (cosine) per candidate row. No change.
- core/score-all-candidates reads the candidate :score, computes the deterministic score, and blends them.
- Each scored candidate has the shape {:score final, :retrieval-score cosine, :deterministic-score det, :evidence signals}.
- document_match.confidence stores the final blended score.
- No schema migration needed.

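
To make the per-candidate result shape concrete, here is a hedged sketch of the scoring step for a single candidate. The function name score-candidate is hypothetical, and so is the assumption that the deterministic scorer returns a map with :score and :signals; the scorer and the blend function are passed in as arguments (standing in for evidence/compute-score and blend-score) so the snippet runs on its own.

```clojure
(defn score-candidate
  "Builds the scored result map for one candidate row.
  `compute-det` stands in for evidence/compute-score and is assumed to return
  {:score <0..1> :signals [...]}; `blend` stands in for blend-score."
  [compute-det blend type-pair candidate]
  (let [cosine                  (:score candidate)   ; retrieval score from find-candidates
        {:keys [score signals]} (compute-det candidate)]
    {:score               (blend type-pair cosine score) ; final blended score
     :retrieval-score     cosine
     :deterministic-score score
     :evidence            signals}))

;; Usage with stub functions (the alpha of 0.3 mirrors the invoice <-> GRN row):
(score-candidate
  (fn [_candidate] {:score 0.6 :signals [:quantity-exact]})
  (fn [_type-pair cosine det] (+ (* 0.3 cosine) (* 0.7 det)))
  [:invoice :grn]
  {:id 42 :score 0.75})
```

Keeping :retrieval-score and :deterministic-score alongside the blended :score makes it possible to inspect how each half contributed when recalibrating alpha later.
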
evidence.clj:
- Remove :supplier-name-fuzzy and :description-overlap from evidence-signals
- Remove extract-description-words, stop-words, the Jaro-Winkler import, and the corresponding collect-signals blocks
- Add the type-pair-alpha map
- Add the blend-score function

core.clj:
- score-all-candidates: pass the candidate :score (cosine) through to blend-score

normalize.clj:
- No change. get-counterparty-name is still used by extract-counterparty for candidate retrieval.

Tests:
- blend-score with various type pairs and score combinations

| Scenario | Cosine | Deterministic | alpha | Final | Outcome |
|---|---|---|---|---|---|
| bikosigma invoice <-> contract | 0.80 | 0.00 | 0.6 | 0.48 | LLM decides (was: filtered at 0.15) |
| Invoice <-> PO with matching PO# | 0.85 | 1.00 | 0.5 | 0.925 | Auto-match |
| Invoice <-> PO, no deterministic | 0.85 | 0.00 | 0.5 | 0.425 | LLM decides |
| Invoice <-> GRN with quantities | 0.75 | 0.60 | 0.3 | 0.645 | LLM decides |
| Unrelated documents | 0.35 | 0.00 | 0.4 | 0.14 | Filtered out |
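
The Outcome column follows from comparing the final score against the two thresholds (0.70 for auto-match, 0.30 as the minimum to consider). A minimal sketch of that mapping, assuming both bounds are inclusive and using hypothetical names (outcome, auto-match-threshold, min-consider-threshold) and hypothetical result keywords:

```clojure
(def auto-match-threshold 0.70)   ; at or above: auto-match
(def min-consider-threshold 0.30) ; below: candidate is filtered out

(defn outcome
  "Maps a final blended score to one of the three outcomes in the table.
  Treating both thresholds as inclusive is an assumption."
  [final-score]
  (cond
    (>= final-score auto-match-threshold)   :auto-match
    (>= final-score min-consider-threshold) :llm-decides
    :else                                   :filtered-out))

;; The Final column of the scenario table, in row order:
(map outcome [0.48 0.925 0.425 0.645 0.14])
```
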