Roadmap: Semantic Search Comparison

Overview

This spike compares pgvector semantic search against LLM-based matching for invoice line item classification. The work starts with infrastructure (Postgres + pgvector), then generates embeddings from 3 models, implements two search backends (pgvector and LLM), builds an evaluation framework with metrics, creates an interactive comparison dashboard, and concludes with an LLM-as-judge enhancement for assessing semantic equivalence. Each phase delivers a complete, verifiable capability.

Phases

Phase Numbering:

Integer phases (1, 2, 3): Planned milestone work
Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED)

Decimal phases appear between their surrounding integers in numeric order.

Phase 1: Foundation - Docker Postgres with pgvector, Python environment, data import
Phase 2: Embedding Generation - Pre-compute embeddings for all 3 models, train/test split
Phase 3: Search Implementation - pgvector search and LLM context matching backends
Phase 4: Evaluation & Dashboard - Metrics calculation, batch benchmarking, interactive comparison UI
Phase 5: Reporting & LLM Judge - Static reports, hand-picked examples, LLM-as-judge evaluation

Phase Details

Phase 1: Foundation

Goal: Working development environment with pgvector-enabled database and imported historical data
Depends on: Nothing (first phase)
Requirements: INFRA-01, INFRA-02, INFRA-03
Success Criteria (what must be TRUE):

docker compose up starts Postgres 18 with pgvector extension enabled
Python environment activates with all required packages (google-genai, sentence-transformers, ragas)
~6K line items from historical.csv are queryable in the database with supplier, description, amounts, accounts, cost center

Plans: 2 plans

Plans:

01-01-PLAN.md - Docker Compose with Postgres 18 + pgvector, database schema with line_item table
01-02-PLAN.md - Python environment with uv, text normalization, historical CSV import
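
The text normalization step in 01-02-PLAN.md is not specified in this roadmap. The following is a minimal sketch of what such a step might look like, assuming the goal is to make supplier names and descriptions comparable before import. The function name `normalize_text` and the specific rules (NFKC folding, case-folding, whitespace collapsing) are illustrative assumptions, not the plan's actual rules.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical normalizer; the real rules live in 01-02-PLAN.md."""
    # NFKC folds compatibility characters such as full-width letters
    # and non-breaking spaces into their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Case-fold so "ACME" and "acme" compare equal.
    text = text.casefold()
    # Collapse any run of whitespace to one space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()
```

Applying the same function to stored descriptions and to incoming queries keeps both sides of a comparison in the same form.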

Phase 2: Embedding Generation

Goal: All line items have pre-computed embeddings from 3 models, with clean train/test separation
Depends on: Phase 1
Requirements: INFRA-04, INFRA-05, INFRA-06, INFRA-07, EVAL-01, EVAL-02
Success Criteria (what must be TRUE):

Every line item has embedding vectors from Google, Jina, and MiniLM models stored in separate columns
Embedding model metadata (name, dimensions, distance metric) is stored and retrievable per record
20% of data is marked as test set (before embedding) with no leakage to training set
Synthetic query variations exist for test set entries (synonyms, typos, reordering)
HNSW indexes are created after data load for all embedding columns

Plans: 3 plans

Plans:

02-01-PLAN.md - Train/test split schema, embedding model metadata, stratified 80/20 split
02-02-PLAN.md - Google text-multilingual-embedding-002 and Jina embeddings-v3 API embeddings
02-03-PLAN.md - Local MiniLM embeddings, HNSW indexes, synthetic query variations
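
The stratified 80/20 split from 02-01-PLAN.md could be sketched as below. The roadmap does not say which column is used for stratification, so the `key` argument (for example a GL account field) is an assumption, as are the function name `stratified_split` and the fixed seed. The idea is that each stratum contributes roughly 20% of its rows to the test set, and the assignment happens before any embedding is computed.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, test_fraction=0.2, seed=42):
    """Return (train, test); roughly test_fraction of each stratum goes to test.

    rows is a list of dicts; key names the (assumed) stratification column.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for stratum in sorted(groups):  # sorted for a deterministic order
        members = groups[stratum][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test
```

Because every row lands in exactly one of the two lists, the split itself cannot leak a test row into the training set. Very small strata may contribute no test rows at all under this rounding rule.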

Phase 3: Search Implementation

Goal: Users can query the system and get results from both pgvector search and LLM context matching
Depends on: Phase 2
Requirements: SRCH-01, SRCH-02, SRCH-03, SRCH-05, SRCH-06
Success Criteria (what must be TRUE):

User can enter a query text and receive search results
User can adjust K (3, 5, 10) and see top-K results with similarity scores
LLM context matching (Orcha approach replication) returns GL account and cost center suggestions
Results display supplier name, description, GL accounts, cost center for each hit

Plans: 2 plans

Plans:

03-01-PLAN.md - pgvector search backend with query embedding functions for all 3 models
03-02-PLAN.md - LLM context matching with pg_trgm and Flask web interface
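
To make the "top-K results with similarity scores" criterion concrete, here is an in-memory sketch of what a cosine top-K search returns. It is a stand-in for illustration only: the actual backend runs in Postgres, where pgvector's `<=>` operator yields cosine distance and similarity is 1 minus that distance. The function names and the `(item_id, vector)` input shape are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, items, k=5):
    """Brute-force top-K: items is a list of (item_id, vector) pairs.

    Returns up to k (item_id, similarity) pairs, most similar first.
    """
    scored = [(item_id, cosine_similarity(query_vec, vec)) for item_id, vec in items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A brute-force scan like this is exact, which makes it a useful reference when checking the results of an approximate HNSW index on a small sample.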

Phase 4: Evaluation & Dashboard

Goal: Quantitative comparison of all approaches with interactive exploration UI
Depends on: Phase 3
Requirements: SRCH-04, EVAL-03, EVAL-04, EVAL-05, EVAL-06, EVAL-07, EVAL-09, REPT-01
Success Criteria (what must be TRUE):

User can see side-by-side results from all 3 embedding models and LLM baseline for same query
Exact match accuracy for GL account and cost center is calculated per approach
Top-K accuracy (K=3,5,10) is calculated per approach
Latency per query is measured and displayed in milliseconds
Batch benchmark runs over full test set with aggregate statistics
Cost per query (tokens, API calls, estimated $) is tracked
Interactive HTML dashboard allows search exploration and result comparison

Plans: 2 plans

Plans:

04-01-PLAN.md - Metrics calculator with dataclass structures, accuracy calculations, batch benchmark runner
04-02-PLAN.md - Interactive Flask dashboard with benchmark routes and side-by-side comparison view
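
The accuracy and latency criteria above can be illustrated with a short sketch. It assumes each approach returns a ranked list of predicted labels (for example GL accounts) per test query; the names `top_k_accuracy` and `timed` are hypothetical and not taken from the metrics calculator in 04-01-PLAN.md. Exact match accuracy is the K=1 case.

```python
import time


def top_k_accuracy(predictions, truths, k):
    """Fraction of queries whose true label appears in the first k predictions.

    predictions: list of ranked label lists, one per query.
    truths: list of correct labels, aligned with predictions.
    """
    if not truths:
        return 0.0
    hits = sum(1 for ranked, truth in zip(predictions, truths) if truth in ranked[:k])
    return hits / len(truths)


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms
```

Running `top_k_accuracy` for K in (1, 3, 5, 10) over the full test set, once per approach, gives the per-approach table that the batch benchmark aggregates.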

Phase 5: Reporting & LLM Judge

Goal: Complete evaluation with LLM-as-judge for semantic relevance and shareable reports
Depends on: Phase 4
Requirements: EVAL-08, REPT-02, REPT-03, REPT-04
Success Criteria (what must be TRUE):

LLM-as-judge evaluates semantic relevance when exact match fails
Hand-picked example showcase demonstrates best/worst cases per approach
Static HTML report exports with all aggregate metrics
Confusion matrix shows GL account prediction error patterns

Plans: 2 plans

Plans:

05-01-PLAN.md - LLM-as-judge module and example selector for curated showcase
05-02-PLAN.md - Static HTML report with confusion matrix and Flask export route
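
As an illustration of the confusion matrix criterion, the sketch below counts (actual, predicted) GL account pairs and lists the most frequent misclassifications. It is a minimal version under the assumption that each test query has one actual and one top-ranked predicted account; the function names are hypothetical and the report in 05-02-PLAN.md may compute this differently.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count each (actual, predicted) label pair across aligned lists."""
    return Counter(zip(actual, predicted))


def top_confusions(counts, n=3):
    """Return the n most frequent off-diagonal pairs as ((actual, predicted), count)."""
    errors = {pair: c for pair, c in counts.items() if pair[0] != pair[1]}
    # Sort by count descending, then by label pair for a stable order.
    return sorted(errors.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

The off-diagonal pairs are also natural candidates for the LLM-as-judge step, since those are the cases where exact match failed.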

Progress

Execution Order: Phases execute in numeric order: 1 -> 2 -> 3 -> 4 -> 5

Phase	Plans Complete	Status	Completed
1. Foundation	2/2	Complete	2026-02-20
2. Embedding Generation	3/3	Complete	2026-02-20
3. Search Implementation	2/2	Complete	2026-02-20
4. Evaluation & Dashboard	2/2	Complete	2026-02-20
5. Reporting & LLM Judge	2/2	Complete	2026-02-20

Roadmap created: 2026-02-20