Semantic Search for LLM Booking Context
Goal
Replace the current pg_trgm supplier-name matching with semantic search to select a curated, deduplicated set of historical bookings to send to the LLM for account/cost-center assignment.
Background
Orcha's current approach:
- Uses pg_trgm fuzzy search on normalized supplier name (threshold 0.7)
- Returns up to 50 historical bookings for matching suppliers
- Passes them as CSV to the LLM prompt
Problems:
- Gating on supplier name misses matches when names don't fuzzy-match well
- Sends redundant data (e.g., 30 near-identical "01.2025 BP" entries for BearingPoint)
- Wastes tokens without improving LLM decision quality
Scope
In scope:
- Re-embed historical data using raw text (no normalization)
- Remove Jina, keep Google and MiniLM for comparison
- Implement per-line-item semantic search with configurable top-K and threshold
- Deduplicate results: collapse near-identical items, preserve diversity within account/CC clusters
- Output a curated CSV list suitable for LLM context
Out of scope:
- Price/amount inclusion in embeddings (future consideration)
- Production Orcha integration (this is a spike to validate the approach)
- Actual LLM calls (we output the curated list, don't call the LLM)
Embedding Changes
Text format:
{supplier_name} | {description}
Raw text, no normalization.
Schema changes:
- Remove
embedding_jina column
- Remove
supplier_name_normalized column
- Remove
description_normalized column
- Remove pg_trgm index
- Keep
embedding_google (768d) and embedding_minilm (384d)
Code cleanup:
- Remove
src/normalize.py
- Remove
src/embeddings/jina_embed.py
- Remove
src/search/llm_matching.py
- Remove Jina-related code from
pgvector_search.py
- Update
text_prep.py to use raw text
- Recreate DB and re-embed all rows
Search & Curation Algorithm
Input: Invoice with N line items from supplier X
Step 1: Per-line-item semantic search
For each line item:
- Query:
"{supplier_name} | {line_item_description}"
- Embed query using selected model
- Search pgvector: up to K results where similarity > threshold
Configurable parameters (exposed in UI):
- K: default 10
- Threshold: default 0.7 for MiniLM, 0.6 for Google
Step 2: Merge results
Collect all results from N searches into one pool. Tag each result with which line item(s) found it.
Step 3: Cluster by (debit_account, cost_center)
Group results by their account/CC assignment.
Step 4: Deduplicate within clusters
For each cluster:
- Compare historical descriptions pairwise using embedding similarity
- If two items have similarity > 0.9 to each other, keep only the one with highest original search similarity
- This collapses "01.2025 BP" / "02.2025 BP" but preserves "Adobe" / "Office 365" / "Azure"
Step 5: Build output
For each cluster:
- Include diverse representative items (descriptions, amounts)
- Include cluster metadata: total original match count, account, cost_center
Output: Curated list as CSV.
UI / Interface
Search input:
- Single text field for query (format:
supplier | description)
- Both models (Google and MiniLM) always run side-by-side
- Configurable parameters: K, similarity threshold per model
Search output:
- Side-by-side curated CSV results for each model
- Metadata per model: total matches found, clusters formed, items after deduplication
- Per-cluster breakdown: account/CC, count, representative items
Batch mode:
- Input: CSV file (like regnology.csv) with supplier_name, line_item_description columns
- Output: For each row, curated historical bookings from both models
- Allows comparing curation quality across models and threshold settings
Code Cleanup Summary
Remove:
src/normalize.py
src/embeddings/jina_embed.py
src/search/llm_matching.py
- Jina-related code in
pgvector_search.py
- Normalized column references throughout
Update:
init.sql - new schema without normalized columns, Jina, pg_trgm
text_prep.py - use raw text
- Embedding scripts - remove normalization
- Search UI - new curation flow
Add:
- Curation module (clustering, deduplication logic)
- Updated search endpoint with configurable parameters