Semantic Search for LLM Booking Context

Goal

Replace the current pg_trgm supplier-name matching with semantic search to select a curated, deduplicated set of historical bookings to send to the LLM for account/cost-center assignment.

Background

Orcha's current approach:

Uses pg_trgm fuzzy search on normalized supplier name (threshold 0.7)
Returns up to 50 historical bookings for matching suppliers
Passes them as CSV to the LLM prompt

Problems:

Gating on supplier name misses matches when names don't fuzzy-match well
Sends redundant data (e.g., 30 near-identical "01.2025 BP" entries for BearingPoint)
Wastes tokens without improving LLM decision quality

Scope

In scope:

Re-embed historical data using raw text (no normalization)
Remove Jina, keep Google and MiniLM for comparison
Implement per-line-item semantic search with configurable top-K and threshold
Deduplicate results: collapse near-identical items, preserve diversity within account/CC clusters
Output a curated CSV list suitable for LLM context

Out of scope:

Price/amount inclusion in embeddings (future consideration)
Production Orcha integration (this is a spike to validate the approach)
Actual LLM calls (we output the curated list, don't call the LLM)

Embedding Changes

Text format:

{supplier_name} | {description}

Raw text, no normalization.

Schema changes:

Remove embedding_jina column
Remove supplier_name_normalized column
Remove description_normalized column
Remove pg_trgm index
Keep embedding_google (768d) and embedding_minilm (384d)

Code cleanup:

Remove src/normalize.py
Remove src/embeddings/jina_embed.py
Remove src/search/llm_matching.py
Remove Jina-related code from pgvector_search.py
Update text_prep.py to use raw text
Recreate DB and re-embed all rows

Search & Curation Algorithm

Input: Invoice with N line items from supplier X

Step 1: Per-line-item semantic search

For each line item:

Query: "{supplier_name} | {line_item_description}"
Embed query using selected model
Search pgvector: up to K results where similarity > threshold

Configurable parameters (exposed in UI):

K: default 10
Threshold: default 0.7 for MiniLM, 0.6 for Google

Step 2: Merge results

Collect all results from N searches into one pool. Tag each result with which line item(s) found it.

Step 3: Cluster by (debit_account, cost_center)

Group results by their account/CC assignment.

Step 4: Deduplicate within clusters

For each cluster:

Compare historical descriptions pairwise using embedding similarity
If two items have similarity > 0.9 to each other, keep only the one with highest original search similarity
This collapses "01.2025 BP" / "02.2025 BP" but preserves "Adobe" / "Office 365" / "Azure"

Step 5: Build output

For each cluster:

Include diverse representative items (descriptions, amounts)
Include cluster metadata: total original match count, account, cost_center

Output: Curated list as CSV.

UI / Interface

Search input:

Single text field for query (format: supplier | description)
Both models (Google and MiniLM) always run side-by-side
Configurable parameters: K, similarity threshold per model

Search output:

Side-by-side curated CSV results for each model
Metadata per model: total matches found, clusters formed, items after deduplication
Per-cluster breakdown: account/CC, count, representative items

Batch mode:

Input: CSV file (like regnology.csv) with supplier_name, line_item_description columns
Output: For each row, curated historical bookings from both models
Allows comparing curation quality across models and threshold settings

Code Cleanup Summary

Remove:

src/normalize.py
src/embeddings/jina_embed.py
src/search/llm_matching.py
Jina-related code in pgvector_search.py
Normalized column references throughout

Update:

init.sql - new schema without normalized columns, Jina, pg_trgm
text_prep.py - use raw text
Embedding scripts - remove normalization
Search UI - new curation flow

Add:

Curation module (clustering, deduplication logic)
Updated search endpoint with configurable parameters