Roadmap: Semantic Search Comparison
Overview
This spike compares pgvector semantic search against LLM-based matching for invoice line item classification. Starting with infrastructure (Postgres + pgvector), we generate embeddings from 3 models, implement both search backends (pgvector and LLM), build an evaluation framework with metrics, create an interactive comparison dashboard, and conclude with an LLM-as-judge enhancement for assessing semantic equivalence. Each phase delivers a complete, verifiable capability.
Phases
Phase Numbering:
- Integer phases (1, 2, 3): Planned milestone work
- Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED)
Decimal phases appear between their surrounding integers in numeric order.
Phase Details
Phase 1: Foundation
Goal: Working development environment with pgvector-enabled database and imported historical data
Depends on: Nothing (first phase)
Requirements: INFRA-01, INFRA-02, INFRA-03
Success Criteria (what must be TRUE):
- docker compose up starts Postgres 18 with pgvector extension enabled
- Python environment activates with all required packages (google-genai, sentence-transformers, ragas)
- ~6K line items from historical.csv are queryable in the database with supplier, description, amounts, accounts, cost center
Plans: 2 plans
Phase 2: Embedding Generation
Goal: All line items have pre-computed embeddings from 3 models, with clean train/test separation
Depends on: Phase 1
Requirements: INFRA-04, INFRA-05, INFRA-06, INFRA-07, EVAL-01, EVAL-02
Success Criteria (what must be TRUE):
- Every line item has embedding vectors from Google, Jina, and MiniLM models stored in separate columns
- Embedding model metadata (name, dimensions, distance metric) is stored and retrievable per record
- 20% of the data is marked as the test set before embeddings are generated, with no leakage into the training set
- Synthetic query variations exist for test set entries (synonyms, typos, reordering)
- HNSW indexes are created after data load for all embedding columns
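One way to satisfy the split and query-variation criteria above is sketched below. It is a minimal illustration, not the spike's actual implementation: hashing the row ID to pick the test set is an assumed technique (chosen because it is reproducible and cannot drift between runs), and the function names `is_test_row`, `swap_typo`, and `reorder_words` are hypothetical. Synonym substitution is left out because it needs a domain word list that the roadmap does not specify.

```python
import hashlib
import random


def is_test_row(row_id: str, test_fraction: float = 0.2) -> bool:
    """Deterministically assign a row to the test set from a hash of its ID.

    The same ID always lands in the same split, so the assignment can be
    made before any embedding is computed and never changes afterwards.
    """
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    # Reduce the first 8 bytes to a bucket in 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < round(test_fraction * 10_000)


def swap_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters to simulate a typing error."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def reorder_words(text: str, rng: random.Random) -> str:
    """Shuffle the word order while keeping the same set of words."""
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)
```

Passing a seeded `random.Random` instance keeps the generated variations reproducible across benchmark runs.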
Plans: 3 plans
Phase 3: Search Implementation
Goal: Users can query the system and get results from both pgvector search and LLM context matching
Depends on: Phase 2
Requirements: SRCH-01, SRCH-02, SRCH-03, SRCH-05, SRCH-06
Success Criteria (what must be TRUE):
- User can enter a query text and receive search results
- User can adjust K (3, 5, 10) and see top-K results with similarity scores
- LLM context matching (Orcha approach replication) returns GL account and cost center suggestions
- Results display supplier name, description, GL accounts, cost center for each hit
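The pgvector side of these criteria could be expressed as a single parameterized query per embedding column. The sketch below only builds the SQL and its parameters; it does not connect to a database. The table name `line_items`, the column names, and the `is_test` flag are hypothetical, cosine distance is an assumed metric, and the `%s` placeholders assume a psycopg-style driver. `<=>` is pgvector's cosine distance operator, so `1 - distance` yields the similarity score shown to the user.

```python
from typing import Sequence, Tuple

# Hypothetical column names; the roadmap only says each model has its own column.
EMBEDDING_COLUMNS = {
    "google": "embedding_google",
    "jina": "embedding_jina",
    "minilm": "embedding_minilm",
}
ALLOWED_K = (3, 5, 10)


def to_vector_literal(vector: Sequence[float]) -> str:
    """Format floats in pgvector's text input form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


def build_top_k_query(
    model: str, query_vector: Sequence[float], k: int
) -> Tuple[str, tuple]:
    """Return (sql, params) for a cosine top-K search over one embedding column."""
    if k not in ALLOWED_K:
        raise ValueError(f"k must be one of {ALLOWED_K}, got {k}")
    if model not in EMBEDDING_COLUMNS:
        raise ValueError(f"unknown model: {model}")
    # Column names cannot be bound as parameters, so only allowlisted
    # names are ever interpolated into the SQL text.
    column = EMBEDDING_COLUMNS[model]
    literal = to_vector_literal(query_vector)
    sql = (
        "SELECT supplier, description, gl_account, cost_center, "
        f"1 - ({column} <=> %s::vector) AS similarity "
        "FROM line_items "
        "WHERE is_test = FALSE "  # search only the non-test rows
        f"ORDER BY {column} <=> %s::vector "
        "LIMIT %s"
    )
    return sql, (literal, literal, k)
```

Ordering by the raw distance expression, rather than by the computed similarity, is what lets pgvector use an HNSW index on the column.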
Plans: 2 plans
Phase 4: Evaluation & Dashboard
Goal: Quantitative comparison of all approaches with interactive exploration UI
Depends on: Phase 3
Requirements: SRCH-04, EVAL-03, EVAL-04, EVAL-05, EVAL-06, EVAL-07, EVAL-09, REPT-01
Success Criteria (what must be TRUE):
- User can see side-by-side results from all 3 embedding models and LLM baseline for same query
- Exact match accuracy for GL account and cost center is calculated per approach
- Top-K accuracy (K=3,5,10) is calculated per approach
- Latency per query is measured and displayed in milliseconds
- Batch benchmark runs over full test set with aggregate statistics
- Cost per query (tokens, API calls, estimated $) is tracked
- Interactive HTML dashboard allows search exploration and result comparison
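The accuracy and latency criteria above reduce to a few small functions. The following is a sketch under assumptions: predictions and ground truth are represented as plain strings (for example GL account codes), and the helper names `exact_match_accuracy`, `top_k_accuracy`, and `timed_ms` are hypothetical rather than taken from the spike's code.

```python
import time
from typing import Any, Callable, Sequence, Tuple


def exact_match_accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of queries whose top prediction equals the expected label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


def top_k_accuracy(
    ranked: Sequence[Sequence[str]], expected: Sequence[str], k: int
) -> float:
    """Fraction of queries whose expected label appears in the first k results."""
    if len(ranked) != len(expected):
        raise ValueError("ranked and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(e in results[:k] for results, e in zip(ranked, expected))
    return hits / len(expected)


def timed_ms(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed wall-clock time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms
```

Running each function once per approach (three embedding models plus the LLM baseline) over the same test set keeps the comparison like-for-like.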
Plans: 2 plans
Phase 5: Reporting & LLM Judge
Goal: Complete evaluation with LLM-as-judge for semantic relevance and shareable reports
Depends on: Phase 4
Requirements: EVAL-08, REPT-02, REPT-03, REPT-04
Success Criteria (what must be TRUE):
- LLM-as-judge evaluates semantic relevance when exact match fails
- Hand-picked example showcase demonstrates best/worst cases per approach
- Static HTML report exports with all aggregate metrics
- Confusion matrix shows GL account prediction error patterns
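A confusion matrix over GL accounts can be kept sparse by counting (expected, predicted) pairs, which suits a chart of accounts where most pairs never occur. The sketch below assumes string account codes; the function names `confusion_counts` and `top_errors` are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def confusion_counts(expected: Sequence[str], predicted: Sequence[str]) -> Counter:
    """Count how often each (expected, predicted) account pair occurs."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    return Counter(zip(expected, predicted))


def top_errors(counts: Counter, n: int = 5) -> List[Tuple[Tuple[str, str], int]]:
    """Return the n most frequent misclassifications, ignoring correct pairs."""
    errors = Counter(
        {pair: count for pair, count in counts.items() if pair[0] != pair[1]}
    )
    return errors.most_common(n)
```

The pairs returned by `top_errors` are natural candidates for the hand-picked worst-case examples, and for the cases passed to the LLM judge when exact match fails.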
Plans: 2 plans
Progress
Execution Order:
Phases execute in numeric order: 1 -> 2 -> 3 -> 4 -> 5
| Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------|
| 1. Foundation | 2/2 | Complete | 2026-02-20 |
| 2. Embedding Generation | 3/3 | Complete | 2026-02-20 |
| 3. Search Implementation | 2/2 | Complete | 2026-02-20 |
| 4. Evaluation & Dashboard | 2/2 | Complete | 2026-02-20 |
| 5. Reporting & LLM Judge | 2/2 | Complete | 2026-02-20 |
Roadmap created: 2026-02-20