Semantic Search Comparison

What This Is

A spike comparing semantic search approaches for invoice line item matching: LLM-based context matching (current Orcha approach) vs pgvector semantic search with multiple embedding models. Uses real GDPDU booking history to benchmark quality, latency, and cost.

Core Value

Determine whether pgvector semantic search can match or exceed the quality of LLM-based matching for assigning GL accounts and cost centers to invoice line items.

Requirements

Validated

(None yet — ship to validate)

Active

Docker Postgres 18 with pgvector extension
Import ~6K line items from historical.csv and store an embedding per line item for each model
Support three embedding models: Google text-multilingual-embedding-002, Jina AI, all-MiniLM-L6-v2
Implement LLM-based matching (replicate current Orcha approach using Gemini 2.5 Flash)
Implement pgvector semantic search for each embedding model
Generate synthetic test queries (variations of existing descriptions)
Measure exact match rate for GL account and cost center
Measure top-K accuracy (K=3, 5, 10)
Implement LLM-as-judge evaluation
Track latency per query for each approach
Track cost per query (tokens/API calls)
Interactive HTML page to run searches and view results
Comparison report with hand-picked examples
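
The per-query latency requirement above could be covered by a small timing harness like the one below. This is a minimal sketch, not part of the existing Orcha code: `time_queries` and `summarize` are hypothetical names, and `search_fn` stands for whichever approach is being measured (LLM matching or a pgvector search for one model). It uses wall-clock time, so network round trips to embedding or LLM APIs are included in the figures.

```python
import statistics
import time


def time_queries(search_fn, queries):
    """Run search_fn once per query; return wall-clock latencies in milliseconds."""
    latencies_ms = []
    for query in queries:
        start = time.perf_counter()
        search_fn(query)  # result is discarded; only the duration is recorded
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms):
    """Summarize latencies with median, nearest-rank p95, and max."""
    if not latencies_ms:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # Nearest-rank percentile: rank = ceil(0.95 * n), computed with integers
    # to avoid floating-point rounding surprises.
    rank = (95 * n + 99) // 100
    return {
        "n": n,
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[rank - 1],
        "max_ms": ordered[-1],
    }
```

Reporting median and p95 side by side helps separate typical latency from occasional slow API calls, which a plain mean would blur together.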

Out of Scope

Production deployment — this is a spike
Supplier matching improvements — focus is on line item semantic matching
Multi-tenant support — single dataset comparison
Real-time embedding generation for stored line items (embeddings are pre-computed during import; only the query text is embedded at search time)

Context

Current Orcha implementation:

Suppliers matched via pg_trgm (fuzzy string similarity >= 0.7)
Up to 50 historical bookings for the matched supplier are passed to the LLM as CSV
Gemini 2.5 Flash performs semantic matching to suggest accounts/cost centers
Reference code: orcha/src/com/getorcha/workers/ingestion/post_process.clj:41-69
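
To replicate the context-building step in Python for the spike, something along these lines may be enough. It is a sketch under assumptions and does not reproduce the referenced Clojure code: the function name `bookings_to_csv_context` is hypothetical, the field names are taken from the dataset columns listed in this document, and which columns the current implementation actually sends to the LLM is not confirmed here.

```python
import csv
import io

MAX_BOOKINGS = 50  # upper bound stated in the description of the current approach

# Assumed subset of dataset columns to expose to the LLM (not confirmed).
FIELDS = [
    "Line Item Description",
    "Net Amount",
    "Debit Account",
    "Credit Account",
    "Cost Center",
]


def bookings_to_csv_context(bookings, limit=MAX_BOOKINGS):
    """Serialize at most `limit` booking dicts as CSV text for an LLM prompt."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=FIELDS,
        extrasaction="ignore",  # drop keys that are not in FIELDS
        lineterminator="\n",
    )
    writer.writeheader()
    for row in bookings[:limit]:
        writer.writerow(row)
    return buffer.getvalue()
```

Using the `csv` module rather than joining strings by hand keeps descriptions that contain commas or quotes correctly escaped.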

Dataset:

Source: orcha/dump/regnology/historical.csv
~6,078 rows of GDPDU-style bookings
Columns: Supplier Name, Line Item Description, Net Amount, Debit Account, Credit Account, Cost Center
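
A loader for this file might look like the sketch below. It assumes a comma-delimited file with a header row that uses exactly the column names listed above; the delimiter, encoding, and number format of the real export have not been verified, so amounts are kept as strings. `Booking`, `read_bookings`, and `embedding_text` are hypothetical names, and combining supplier and description into the embedded text is one possible choice, not a decided one.

```python
import csv
from dataclasses import dataclass


@dataclass(frozen=True)
class Booking:
    supplier: str
    description: str
    net_amount: str  # kept as text; decimal format of the export is unverified
    debit_account: str
    credit_account: str
    cost_center: str


def read_bookings(lines):
    """Parse an iterable of CSV lines (such as an open file) into Booking records."""
    reader = csv.DictReader(lines)
    return [
        Booking(
            supplier=row["Supplier Name"].strip(),
            description=row["Line Item Description"].strip(),
            net_amount=row["Net Amount"].strip(),
            debit_account=row["Debit Account"].strip(),
            credit_account=row["Credit Account"].strip(),
            cost_center=row["Cost Center"].strip(),
        )
        for row in reader
    ]


def embedding_text(booking):
    """Text handed to the embedding model (assumed: supplier plus description)."""
    return f"{booking.supplier}: {booking.description}"
```

With a real file this would be called as `read_bookings(open(path, newline="", encoding="utf-8"))`, where the encoding is again an assumption to check against the export.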

Embedding models to compare:

Google text-multilingual-embedding-002 — production-ready, multilingual
Jina AI embeddings — retrieval-optimized
all-MiniLM-L6-v2 — local, free, 384 dimensions
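
Because the models produce vectors of different sizes, one option is a separate table per model with a matching `vector(n)` column. The helpers below sketch the SQL for that layout. Table and column names are hypothetical; only the 384 dimensions of all-MiniLM-L6-v2 come from this document, and the dimensions of the other two models need to be confirmed before creating their tables. The queries assume `CREATE EXTENSION vector` has already been run.

```python
def table_ddl(table, dim):
    """DDL for one per-model table; `table` must come from a fixed internal list."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        "  id bigserial PRIMARY KEY,\n"
        "  description text NOT NULL,\n"
        "  debit_account text,\n"
        "  cost_center text,\n"
        f"  embedding vector({dim}) NOT NULL\n"
        ")"
    )


def vector_literal(values):
    """Format numbers in pgvector's text input form, e.g. '[0.1,2.0]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def top_k_query(table):
    """Parameterized nearest-neighbour query using named placeholders q and k."""
    # <=> is pgvector's cosine distance operator: smaller means more similar.
    return (
        "SELECT id, description, debit_account, cost_center, "
        "embedding <=> %(q)s::vector AS distance "
        f"FROM {table} ORDER BY distance LIMIT %(k)s"
    )
```

With a psycopg cursor this could be run as `cur.execute(top_k_query("items_minilm"), {"q": vector_literal(query_vec), "k": 10})`. At roughly 6K rows a sequential scan is likely fast enough, so an approximate index can be left out of the first comparison.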

Evaluation approach:

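Synthetic variations: generate test queries by slightly modifying existing line item descriptions
Ground truth: the GL account and cost center of the booking each query was derived from
Metrics: exact match rate, top-K accuracy (K=3, 5, 10), LLM-as-judge score, human review of hand-picked examples, latency per query, cost per query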
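
The two ranking metrics can be computed as in the following sketch. It assumes each approach returns, per query, a ranked list of candidate labels (best first), and that the ground truth is one label per query. The function names are hypothetical. A label can be a plain GL account string or a `(gl_account, cost_center)` tuple, depending on whether the two fields are scored separately or jointly.

```python
def exact_match_rate(ranked_predictions, truths):
    """Share of queries whose top-ranked candidate equals the ground truth."""
    if not truths:
        raise ValueError("no ground-truth labels given")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, truths)
        if ranked and ranked[0] == truth
    )
    return hits / len(truths)


def top_k_accuracy(ranked_predictions, truths, k):
    """Share of queries whose ground truth appears among the first k candidates."""
    if not truths:
        raise ValueError("no ground-truth labels given")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, truths)
        if truth in ranked[:k]
    )
    return hits / len(truths)
```

Queries for which an approach returns no candidates count as misses, so every approach is scored against the same denominator.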

Constraints

Tech stack: Python for embeddings/API, Postgres 18 in Docker
Postgres port: use a non-standard port (e.g., 5433), since the standard port 5432 is in use by another project
Credentials: Use existing Orcha Google API credentials for Gemini and embeddings
Budget: Jina AI may require an API key; guide the user through obtaining one if needed

Key Decisions

Decision	Rationale	Outcome
Postgres 18 in Docker	Isolated environment, easy pgvector setup	— Pending
Synthetic test queries	Test robustness to description variations	— Pending
Three embedding models	Compare local vs API, different architectures	— Pending

Last updated: 2026-02-20 after initialization