Roadmap: Semantic Search Comparison

Overview

This spike compares pgvector semantic search against LLM-based matching for invoice line item classification. The work starts with infrastructure (Postgres + pgvector), then generates embeddings with three models, implements both search backends (pgvector and LLM), builds an evaluation framework with metrics, creates an interactive comparison dashboard, and concludes by adding an LLM-as-judge to assess semantic equivalence. Each phase delivers a complete, verifiable capability.

Phases

Phase Numbering:

Decimal phases appear between their surrounding integers in numeric order.

Phase Details

Phase 1: Foundation

Goal: Working development environment with a pgvector-enabled database and imported historical data

Depends on: Nothing (first phase)

Requirements: INFRA-01, INFRA-02, INFRA-03

Success Criteria (what must be TRUE):

  1. docker compose up starts Postgres 18 with pgvector extension enabled
  2. Python environment activates with all required packages (google-genai, sentence-transformers, ragas)
  3. ~6K line items from historical.csv are queryable in the database with supplier, description, amounts, accounts, and cost center

Plans: 2 plans
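Criterion 3 of Phase 1 could be verified with a small loader along the lines of the sketch below. The column names (`supplier`, `description`, `amount`, `gl_account`, `cost_center`) and the `LineItem` type are assumptions made for illustration; the actual header of historical.csv is not specified in this roadmap.

```python
import csv
import io
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LineItem:
    # Hypothetical field names; adjust to the real CSV header.
    supplier: str
    description: str
    amount: Decimal
    gl_account: str
    cost_center: str


def parse_line_items(csv_text: str) -> list[LineItem]:
    """Parse CSV text into LineItem records, skipping rows without a description."""
    reader = csv.DictReader(io.StringIO(csv_text))
    items = []
    for row in reader:
        description = (row.get("description") or "").strip()
        if not description:
            continue  # nothing to embed or search on
        items.append(
            LineItem(
                supplier=(row.get("supplier") or "").strip(),
                description=description,
                amount=Decimal(row["amount"]),
                gl_account=row["gl_account"].strip(),
                cost_center=row["cost_center"].strip(),
            )
        )
    return items


# Made-up sample rows, not taken from historical.csv.
SAMPLE = """supplier,description,amount,gl_account,cost_center
Acme GmbH,Printer paper A4,42.50,6800,CC-100
Acme GmbH,,10.00,6800,CC-100
Cloudy Inc,Hosting March,199.00,6300,CC-200
"""

items = parse_line_items(SAMPLE)
```

Amounts are parsed as `Decimal` rather than `float` so that monetary values survive the import without rounding drift.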

Phase 2: Embedding Generation

Goal: All line items have pre-computed embeddings from 3 models, with clean train/test separation

Depends on: Phase 1

Requirements: INFRA-04, INFRA-05, INFRA-06, INFRA-07, EVAL-01, EVAL-02

Success Criteria (what must be TRUE):

  1. Every line item has embedding vectors from Google, Jina, and MiniLM models stored in separate columns
  2. Embedding model metadata (name, dimensions, distance metric) is stored and retrievable per record
  3. 20% of data is marked as test set (before embedding) with no leakage to training set
  4. Synthetic query variations exist for test set entries (synonyms, typos, reordering)
  5. HNSW indexes are created after data load for all embedding columns

Plans: 3 plans
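Criteria 3 to 5 of Phase 2 can be illustrated with the sketch below. It assumes a hash-based split (so assignment is stable across runs and is decided before any embedding is computed), two simple query perturbations, and a helper that emits a pgvector HNSW index statement. The table name `line_items`, the column name `embedding_minilm`, and all function names are hypothetical. A hash split yields approximately 20%, not exactly 20%.

```python
import hashlib
import random


def is_test_row(row_id: str, test_fraction: float = 0.2) -> bool:
    """Assign a row to the test set from a stable hash of its ID."""
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < test_fraction


def swap_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters to simulate a typing error."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def reorder_words(text: str, rng: random.Random) -> str:
    """Shuffle word order while keeping the same words."""
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)


def hnsw_index_sql(table: str, column: str) -> str:
    """Build a pgvector HNSW index statement using cosine distance."""
    return (
        f"CREATE INDEX {table}_{column}_hnsw "
        f"ON {table} USING hnsw ({column} vector_cosine_ops);"
    )


# Split hypothetical row IDs; the two sets are disjoint by construction.
row_ids = [f"item-{n}" for n in range(6000)]
test_ids = {r for r in row_ids if is_test_row(r)}
train_ids = set(row_ids) - test_ids

rng = random.Random(0)  # fixed seed for reproducible variations
variants = [
    swap_typo("printer paper a4", rng),
    reorder_words("printer paper a4", rng),
]
index_sql = hnsw_index_sql("line_items", "embedding_minilm")
```

If the same description occurs on several rows, hashing a normalised description instead of the row ID would keep duplicates on the same side of the split, which is one way to reduce leakage.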

Phase 3: Search Implementation

Goal: Users can query the system and get results from both pgvector search and LLM context matching

Depends on: Phase 2

Requirements: SRCH-01, SRCH-02, SRCH-03, SRCH-05, SRCH-06

Success Criteria (what must be TRUE):

  1. User can enter a query text and receive search results
  2. User can adjust K (3, 5, 10) and see top-K results with similarity scores
  3. LLM context matching (Orcha approach replication) returns GL account and cost center suggestions
  4. Results display supplier name, description, GL accounts, and cost center for each hit

Plans: 2 plans
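The top-K behaviour in criterion 2 of Phase 3 can be sketched in plain Python as a brute-force stand-in for the database query. The SQL string shows a rough pgvector equivalent, where `<=>` is pgvector's cosine distance operator and similarity is taken as one minus that distance. The table and column names, the `is_test` flag, and the psycopg-style named placeholders are assumptions, and the SQL is an illustration that is not executed here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, rows, k=5):
    """Return the k rows most similar to query_vec as (score, row) pairs, best first."""
    scored = [(cosine_similarity(query_vec, row["embedding"]), row) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical pgvector query; ordering by distance lets an HNSW index be used.
TOP_K_SQL = """
SELECT supplier, description, gl_account, cost_center,
       1 - (embedding_minilm <=> %(query_vec)s) AS similarity
FROM line_items
WHERE is_test = FALSE
ORDER BY embedding_minilm <=> %(query_vec)s
LIMIT %(k)s
"""

# Toy two-dimensional embeddings, for illustration only.
rows = [
    {"description": "printer paper", "gl_account": "6800", "embedding": [1.0, 0.0]},
    {"description": "cloud hosting", "gl_account": "6300", "embedding": [0.0, 1.0]},
    {"description": "copy paper", "gl_account": "6800", "embedding": [0.9, 0.1]},
]
hits = top_k([1.0, 0.05], rows, k=2)
```

Restricting the search to non-test rows keeps held-out items from matching themselves during evaluation.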

Phase 4: Evaluation & Dashboard

Goal: Quantitative comparison of all approaches with interactive exploration UI

Depends on: Phase 3

Requirements: SRCH-04, EVAL-03, EVAL-04, EVAL-05, EVAL-06, EVAL-07, EVAL-09, REPT-01

Success Criteria (what must be TRUE):

  1. User can see side-by-side results from all 3 embedding models and LLM baseline for same query
  2. Exact match accuracy for GL account and cost center is calculated per approach
  3. Top-K accuracy (K=3,5,10) is calculated per approach
  4. Latency per query is measured and displayed in milliseconds
  5. Batch benchmark runs over full test set with aggregate statistics
  6. Cost per query (tokens, API calls, estimated $) is tracked
  7. Interactive HTML dashboard allows search exploration and result comparison

Plans: 2 plans
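The accuracy and latency criteria of Phase 4 (2 to 4) reduce to a few small functions. The sketch below assumes each approach returns a ranked list of predicted GL accounts per query; the function names and the sample data are hypothetical, and cost tracking is left out.

```python
import statistics
import time


def exact_match_accuracy(predictions, expected):
    """Share of queries whose top-ranked prediction equals the expected label."""
    if not expected:
        return 0.0
    hits = sum(1 for p, e in zip(predictions, expected) if p and p[0] == e)
    return hits / len(expected)


def top_k_accuracy(predictions, expected, k):
    """Share of queries whose expected label appears in the first k predictions."""
    if not expected:
        return 0.0
    hits = sum(1 for p, e in zip(predictions, expected) if e in p[:k])
    return hits / len(expected)


def timed_ms(fn, *args):
    """Call fn(*args) and return (result, elapsed wall-clock milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


# Ranked GL account predictions for four made-up queries.
predictions = [
    ["6800", "6300", "6100"],
    ["6300", "6800", "6100"],
    ["6100", "6300", "6800"],
    ["6100", "6200", "6300"],
]
expected = ["6800", "6800", "6800", "6800"]

exact = exact_match_accuracy(predictions, expected)
top2 = top_k_accuracy(predictions, expected, k=2)
top3 = top_k_accuracy(predictions, expected, k=3)

# Time a stand-in workload several times and summarise with the median.
latencies = [timed_ms(sorted, [3, 1, 2])[1] for _ in range(5)]
median_latency_ms = statistics.median(latencies)
```

Reporting the median (and possibly a high percentile) rather than the mean keeps a single slow API call from dominating the latency figure.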

Phase 5: Reporting & LLM Judge

Goal: Complete evaluation with LLM-as-judge for semantic relevance and shareable reports

Depends on: Phase 4

Requirements: EVAL-08, REPT-02, REPT-03, REPT-04

Success Criteria (what must be TRUE):

  1. LLM-as-judge evaluates semantic relevance when exact match fails
  2. Hand-picked example showcase demonstrates best/worst cases per approach
  3. Static HTML report exports with all aggregate metrics
  4. Confusion matrix shows GL account prediction error patterns

Plans: 2 plans
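The confusion matrix in criterion 4 of Phase 5 can be represented as counts of (expected, predicted) GL account pairs. The sketch below is one minimal way to do this with the standard library; the function names and the sample accounts are hypothetical.

```python
from collections import Counter


def confusion_counts(expected, predicted):
    """Count (expected, predicted) GL account pairs."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    return Counter(zip(expected, predicted))


def top_confusions(counts, n=5):
    """Return the n most frequent misclassifications, most frequent first."""
    errors = {pair: c for pair, c in counts.items() if pair[0] != pair[1]}
    return sorted(errors.items(), key=lambda item: (-item[1], item[0]))[:n]


# Made-up labels: 6300 is predicted as 6800 twice, 6800 as 6300 once.
expected = ["6800", "6800", "6300", "6300", "6300"]
predicted = ["6800", "6300", "6300", "6800", "6800"]

counts = confusion_counts(expected, predicted)
worst = top_confusions(counts)
```

Sorting the off-diagonal pairs by frequency surfaces the error patterns directly, which may be easier to read than a full matrix when there are many GL accounts.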

Progress

Execution Order: Phases execute in numeric order: 1 -> 2 -> 3 -> 4 -> 5

Phase                      Plans Complete   Status     Completed
1. Foundation              2/2              Complete   2026-02-20
2. Embedding Generation    3/3              Complete   2026-02-20
3. Search Implementation   2/2              Complete   2026-02-20
4. Evaluation & Dashboard  2/2              Complete   2026-02-20
5. Reporting & LLM Judge   2/2              Complete   2026-02-20

Roadmap created: 2026-02-20