Project Research Summary

Project: Semantic Search Comparison Benchmark
Domain: Vector Database Evaluation (Invoice Line Item Matching)
Researched: 2026-02-20
Confidence: HIGH

Executive Summary

This project is a technical spike to evaluate whether pgvector-based semantic search can replace LLM-based matching for invoice line item classification. The research shows this is a well-established problem domain with mature tools: PostgreSQL 18 with pgvector 0.8.1 for vector search, multiple embedding models (Google gemini-embedding-001, Jina v3, all-MiniLM-L6-v2) for comparison, and standard IR evaluation metrics (exact match, MRR, nDCG).

The recommended approach follows proven patterns from the RAG evaluation ecosystem: pre-compute embeddings for all models, store in separate columns for side-by-side comparison, build HNSW indexes after data load, and evaluate using both exact match metrics and LLM-as-judge for semantic equivalence. The critical differentiator is the side-by-side comparison dashboard that lets stakeholders interactively explore how each model performs on the same queries.

The primary risks are data leakage (testing on training data), prompt sensitivity (LLM baseline varying 10-70% with minor prompt changes), and unfair comparison (different preprocessing per model). These are mitigated by: holding out 20% of data for testing, documenting exact prompts with sensitivity metrics, and standardizing preprocessing across all embedding models. The spike scope is deliberately constrained to ~6K invoice line items to make decisions quickly without production deployment complexity.

Key Findings

Recommended Stack

The recommended stack is PostgreSQL 18 with pgvector 0.8.1 (HNSW indexes, binary quantization support). Python 3.12 is the version supported by all required libraries. The embedding landscape has shifted significantly: google-genai (v1.64.0) replaces the deprecated google-generativeai and vertexai modules; gemini-embedding-001 now outperforms text-multilingual-embedding-002 for multilingual retrieval. For local baselines, all-MiniLM-L6-v2 provides fast, cost-free embeddings at 384 dimensions.

Core technologies:

PostgreSQL 18 + pgvector 0.8.1: Industry-standard vector search, HNSW indexes for sub-100ms queries on 6K rows
google-genai SDK: New unified API for Google embeddings (replaces deprecated libraries, supports both Gemini API and Vertex AI)
sentence-transformers 5.2.3: De-facto standard for local embeddings, runs Jina models and MiniLM baseline
ragas 0.4.3: LLM-as-judge evaluation framework (65% Fortune 500 adoption, 92% human-aligned faithfulness scoring)
Flask + HTMX: Interactive dashboard for side-by-side model comparison (simpler than Dash, more capable than Gradio for benchmarks)

Critical version notes:

Avoid google-generativeai (deprecated Nov 2025) and vertexai.language_models (deprecated June 2025, scheduled for removal June 2026)
Use psycopg v3 (not psycopg2) for async capability and pgvector compatibility
Python 3.12 recommended (3.9 is no longer supported by google-genai, sentence-transformers, streamlit)

Expected Features

Research reveals a clear feature hierarchy for embedding comparison tools. The MTEB leaderboard, RAGAS framework, and production RAG systems all converge on similar patterns.

Must have (table stakes):

Pre-computed embeddings for all 3 models — Foundation for performance comparison; embedding generation measured separately
Single query interface — Interactive exploration with text input; users need to test individual queries
Top-K retrieval (K=3,5,10) — Standard retrieval interface for precision/recall tradeoffs
Side-by-side model comparison — Core differentiation; see all models' results simultaneously for same query
Exact match accuracy — Primary success metric: does retrieved GL account match ground truth?
Latency measurement — Performance comparison; sub-100ms matters for production viability
Batch benchmarking — Run entire test suite; single queries misleading, need aggregate statistics
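
Latency is easier to compare across backends as a distribution than as a single number. The following is a minimal sketch, assuming a hypothetical `search_fn` callable that stands in for any search backend; the function name `measure_latency` and the choice of p50/p95/max are illustrative and not taken from the project.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_latency(search_fn: Callable[[str], object],
                    queries: Sequence[str],
                    warmup: int = 1) -> dict[str, float]:
    """Time search_fn on each query and summarise in milliseconds.

    Needs at least two queries, because percentiles are undefined otherwise.
    """
    # Warm-up calls run but are not recorded (caches, connection setup).
    for q in queries[:warmup]:
        search_fn(q)

    samples_ms = []
    for q in queries:
        start = time.perf_counter()
        search_fn(q)
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    # n=20 yields 19 cut points; index 18 is the 95th percentile.
    # method="inclusive" interpolates within the observed range only.
    cuts = statistics.quantiles(samples_ms, n=20, method="inclusive")
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": cuts[18],
        "max_ms": max(samples_ms),
    }
```

Running the same query list through each backend gives directly comparable summaries; embedding generation time would need to be measured separately, as noted above.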

Should have (competitive):

LLM-as-judge evaluation — When exact match fails (semantic equivalents), LLM provides relevance scores at $0.01-0.10 per assessment
Aggregate metrics (nDCG, MRR) — Industry-standard ranking-aware metrics beyond simple accuracy
Score distribution histograms — Understand model confidence patterns; identify decision thresholds
Cost tracking per query — API costs are real constraint; compare embedding vs LLM costs
HTML export/report — Share findings with stakeholders who won't run the tool

Defer (v2+):

Synthetic query generation — LLM-based paraphrasing adds complexity; may not be needed if existing test set sufficient
Confusion matrix for GL accounts — Useful for production error analysis but not for initial spike decision
Real-time embedding generation — Masks true search performance; pre-computation is industry best practice

Architecture Approach

The research converges on a pipeline architecture with checkpoints: ingestion pipeline (run once) → search layer (multiple implementations) → evaluation framework → presentation layer. This separation enables fair comparison and resumability after API failures.

Major components:

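Ingestion Pipeline — CSV loader → embedding generator (3 models, batched API requests) → pgvector writer → synthetic query generator (optional; deferred unless the held-out test set proves insufficient). Each stage writes to the database so that a failed run can restart from the last completed stage.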
Search Layer — Strategy pattern with common interface: PgvectorSearch (3 model variants) and LLMContextSearch (Orcha replication). Enables fair comparison with same evaluation framework.
Evaluation Framework — Ground truth dataset (held-out 20%) → metrics calculator (exact match, Top-K, MRR, latency) → LLM-as-judge evaluator (for semantic equivalence) → results store.
Dashboard — Flask + HTMX for interactive exploration. Thin presentation layer calling search and evaluation modules. Side-by-side comparison view is core feature.
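
The restart capability of the ingestion pipeline can be sketched as "embed only what is missing, and persist after every batch". This is an illustration under assumptions: `embed_batch` is a hypothetical callable wrapping whichever embedding API is in use, and the `store` dict stands in for the embedding column in the database.

```python
from typing import Callable, Iterable


def batched(ids: list[int], size: int) -> Iterable[list[int]]:
    """Yield consecutive slices of at most `size` ids."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]


def embed_missing(texts: dict[int, str],
                  store: dict[int, list[float]],
                  embed_batch: Callable[[list[str]], list[list[float]]],
                  batch_size: int = 100) -> int:
    """Embed only rows absent from `store`; return how many were written.

    Because results are written after every batch, an API failure loses at
    most one batch of work and a rerun skips everything already stored.
    """
    pending = [row_id for row_id in texts if row_id not in store]
    written = 0
    for chunk in batched(pending, batch_size):
        vectors = embed_batch([texts[row_id] for row_id in chunk])
        for row_id, vec in zip(chunk, vectors):
            store[row_id] = vec
        written += len(chunk)
    return written
```

Calling `embed_missing` a second time after a crash does no duplicate API work, which matters when rate limits make generation slow.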

Database schema:

line_items table with 3 embedding columns (embedding_google vector(768), embedding_jina vector(768), embedding_minilm vector(384))
HNSW index per embedding type (created AFTER data load, not before)
Separate test_queries table with ground truth labels
evaluation_runs and query_results tables for benchmark storage
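
As a rough sketch of the schema above, the DDL can be generated in two groups so that index creation is kept apart from table creation. The table name, embedding column names, and dimensions come from the list above; the `id`, `description`, and `gl_account` columns, the index names, and the use of the cosine operator class are assumptions made for illustration.

```python
# Dimensions as listed in the schema; adjust if a model is configured differently.
EMBEDDING_COLUMNS = {
    "embedding_google": 768,
    "embedding_jina": 768,
    "embedding_minilm": 384,
}


def schema_statements() -> list[str]:
    """DDL to run before loading data (no vector indexes yet)."""
    vector_cols = ",\n    ".join(
        f"{name} vector({dim})" for name, dim in EMBEDDING_COLUMNS.items()
    )
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        "CREATE TABLE IF NOT EXISTS line_items (\n"
        "    id bigserial PRIMARY KEY,\n"
        "    description text NOT NULL,  -- hypothetical column\n"
        "    gl_account text NOT NULL,   -- hypothetical column\n"
        f"    {vector_cols}\n"
        ");",
    ]


def index_statements() -> list[str]:
    """HNSW index DDL, intended to run only after the bulk load."""
    return [
        f"CREATE INDEX IF NOT EXISTS line_items_{name}_hnsw "
        f"ON line_items USING hnsw ({name} vector_cosine_ops);"
        for name in EMBEDDING_COLUMNS
    ]
```

Keeping the two groups separate lets a setup script run `schema_statements()` first, load all rows, and only then execute `index_statements()`.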

Build order rationale: Database schema first → ingestion pipeline → search layer → evaluation framework → dashboard. Cannot search without data; cannot evaluate without search; dashboard is thin integration layer.

Critical Pitfalls

Research identifies 8 critical pitfalls from vector database practitioners and RAG evaluation studies. Top 5 by severity:

Data leakage in evaluation — Testing on training data or overly-similar synthetic variations produces artificially high accuracy (>95%). Prevention: Hold out 20% of historical.csv before embedding; generate diverse test variations (typos, abbreviations, synonyms, paraphrases). Address in Phase 3 (Evaluation Framework).
Prompt sensitivity in LLM matching — Small prompt wording changes cause 10-70% accuracy swings. The same query gets different results with minor formatting changes. Prevention: Test 3-5 prompt variations and report variance (not just best result); use structured JSON output; document exact prompts. Address in Phase 2 (Search Implementation), where the LLM matching backend is built.
IVFFlat index on insufficient data — Creating the index before data load produces poor clusters that never improve, because IVFFlat centroids are fixed at build time; HNSW has no such training step. Prevention: Use HNSW for this spike (can be built without data and suits dynamic datasets, although building after the bulk load is faster); if using IVFFlat, load ALL data first. Address in Phase 1 (Foundation & Data Import).
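Embedding dimension mismatch — Different models have different dimensions (as configured here, Google: 768, Jina: 768, MiniLM: 384; gemini-embedding-001 and Jina v3 emit larger vectors by default, so the 768 figures assume a reduced output dimension is requested). Mixing models causes silent failures or crashes. Prevention: Store embedding_model metadata in database; validate dimensions on insert; use separate columns per model. Address in Phase 1 (Foundation & Data Import).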
LLM context window overflow — Passing too many historical bookings causes "lost in the middle" effect or silent truncation. Current Orcha approach passes up to 50 bookings (~5K tokens just for context). Prevention: Calculate actual token count before API call; limit to 20-30 most relevant bookings; place query at start/end of context. Address in Phase 2 (Search Implementation).
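
The context-overflow prevention in the last item can be sketched as a budgeted selection step. This is only an illustration: `rough_token_count` is a crude whitespace-based stand-in, not a real tokenizer, and the actual token counting library is still an open question for the project; `select_context` and its parameters are hypothetical names.

```python
from typing import Callable, Sequence


def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: whitespace-separated words."""
    return len(text.split())


def select_context(bookings: Sequence[str],
                   max_tokens: int,
                   max_items: int = 30,
                   count_tokens: Callable[[str], int] = rough_token_count
                   ) -> list[str]:
    """Keep bookings in the given (most-relevant-first) order until either
    the item cap or the token budget would be exceeded."""
    selected: list[str] = []
    used = 0
    for booking in bookings[:max_items]:
        cost = count_tokens(booking)
        if used + cost > max_tokens:
            break
        selected.append(booking)
        used += cost
    return selected
```

Passing a real tokenizer's counting function as `count_tokens` would replace the approximation without changing the selection logic.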

Implications for Roadmap

Based on research, suggested 5-phase structure following industry patterns for vector search evaluation:

Phase 1: Foundation & Data Import

Rationale: Database schema and data ingestion must come first. Research shows HNSW indexes should be created AFTER full data load, not during schema setup. Pre-computing embeddings is industry best practice to isolate search performance from generation latency.

Delivers: PostgreSQL 18 with pgvector extension, line_items table with 3 embedding columns, HNSW indexes, ~6K rows of invoice data with embeddings from all 3 models.

Addresses:

Pre-computed embeddings for all 3 models (FEATURES.md table stakes)
Database schema with proper typing (ARCHITECTURE.md foundation)

Avoids:

Embedding dimension mismatch (PITFALLS.md #4) — separate columns per model, metadata tracking
IVFFlat on empty table (PITFALLS.md #3) — use HNSW, create indexes after data load
Unfair comparison (PITFALLS.md #8) — standardize preprocessing across models
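
Validating dimensions before insert can be as small as a lookup and a length check. The sketch below assumes the three model keys `google`, `jina`, and `minilm` (hypothetical short names) and the dimensions from the schema; `validate_embedding` is an illustrative helper, not an existing project function.

```python
import math
from typing import Sequence

# Expected dimension per model column, as listed in the database schema.
EXPECTED_DIMS = {"google": 768, "jina": 768, "minilm": 384}


def validate_embedding(model: str, vector: Sequence[float]) -> None:
    """Raise ValueError if the vector cannot belong to `model`'s column."""
    if model not in EXPECTED_DIMS:
        raise ValueError(f"unknown embedding model: {model!r}")
    expected = EXPECTED_DIMS[model]
    if len(vector) != expected:
        raise ValueError(
            f"{model}: expected {expected} dimensions, got {len(vector)}"
        )
    # NaN or infinite components would corrupt distance calculations.
    if not all(math.isfinite(x) for x in vector):
        raise ValueError(f"{model}: vector contains NaN or infinity")
```

Calling this in the writer stage turns a silent mismatch into an immediate, descriptive failure.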

Research needed: Standard patterns — pgvector setup is well-documented, embedding APIs have clear examples. Skip /gsd:research-phase.

Phase 2: Search Implementation

Rationale: Once data is loaded, implement both search approaches (pgvector with 3 models + LLM context matching). Strategy pattern from ARCHITECTURE.md enables fair comparison. Must implement before evaluation framework can consume results.

Delivers: PgvectorSearch class with 3 model variants, LLMContextSearch replicating Orcha approach, common SearchBackend interface, query endpoint returning top-K results with scores.
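
One way the common SearchBackend interface might look is a structural protocol that every backend satisfies. This is a sketch under assumptions: the `SearchResult` fields, the `compare` helper, and the toy `InMemoryKeywordSearch` backend are hypothetical and exist only to make the example runnable; the real PgvectorSearch and LLMContextSearch classes would implement the same `search` method. The GL account values are made up.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class SearchResult:
    line_item_id: int
    gl_account: str
    score: float


class SearchBackend(Protocol):
    name: str

    def search(self, query: str, k: int) -> list[SearchResult]: ...


class InMemoryKeywordSearch:
    """Toy backend for illustration only: scores by shared lowercase words."""

    name = "keyword-demo"

    def __init__(self, rows: dict[int, tuple[str, str]]):
        self.rows = rows  # id -> (description, gl_account)

    def search(self, query: str, k: int) -> list[SearchResult]:
        q = set(query.lower().split())
        scored = [
            SearchResult(i, gl, float(len(q & set(desc.lower().split()))))
            for i, (desc, gl) in self.rows.items()
        ]
        scored.sort(key=lambda r: r.score, reverse=True)
        return scored[:k]


def compare(backends: list[SearchBackend], query: str, k: int = 3
            ) -> dict[str, list[SearchResult]]:
    """Run the same query through every backend for side-by-side display."""
    return {b.name: b.search(query, k) for b in backends}
```

Because the evaluation code only depends on `search(query, k)`, adding a fourth model or swapping the LLM baseline does not touch the metrics or the dashboard.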

Addresses:

Single query interface (FEATURES.md table stakes)
Top-K retrieval (FEATURES.md table stakes)
Strategy pattern for search backends (ARCHITECTURE.md pattern #2)

Avoids:

Normalization mismatch (PITFALLS.md #2) — document each model's distance metric, use cosine for cosine-trained models
Prompt sensitivity (PITFALLS.md #6) — define and freeze prompt templates early, measure sensitivity explicitly
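
Measuring prompt sensitivity explicitly can mean reporting the spread across prompt variants rather than the best run. A minimal sketch, assuming accuracy has already been computed per variant; the function name and the returned keys are illustrative.

```python
import statistics


def sensitivity_report(accuracy_by_prompt: dict[str, float]) -> dict[str, float]:
    """Summarise accuracy across prompt variants instead of reporting the best."""
    values = list(accuracy_by_prompt.values())
    if len(values) < 2:
        raise ValueError("need at least two prompt variants")
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
        "spread": max(values) - min(values),
    }
```

A large `spread` relative to the gap between the LLM baseline and the embedding models would suggest the comparison is dominated by prompt wording rather than by the approach.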

Research needed: LLM context matching implementation details — how to structure CSV context, token counting, prompt optimization. Candidate for /gsd:research-phase during planning.

Phase 3: Evaluation Framework

Rationale: With search working, build evaluation framework to objectively compare approaches. Ground truth dataset design is critical — research emphasizes holding out data to avoid leakage. Metrics hierarchy: start simple (exact match), graduate to ranking-aware (MRR, nDCG), supplement with LLM-as-judge.

Delivers: Test dataset (20% held-out from historical.csv), ground truth labels (expected GL account/cost center), metrics calculator (exact match, Top-K accuracy, MRR, latency), batch evaluation runner.
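
The metrics calculator can start from a few small functions. The sketch below assumes each query has a single ground-truth label (for example a GL account) and that results are given as ranked label lists; the function names are illustrative. Note that `ndcg_at_k` normalises against the ideal ordering of the retrieved items only, which is a simplification of the full definition.

```python
import math
from typing import Sequence


def hit_at_k(ranked: Sequence[str], truth: str, k: int) -> float:
    """1.0 if the ground-truth label appears in the top k, else 0.0."""
    return 1.0 if truth in ranked[:k] else 0.0


def reciprocal_rank(ranked: Sequence[str], truth: str) -> float:
    """1/rank of the first correct result, or 0.0 if it is absent."""
    for rank, label in enumerate(ranked, start=1):
        if label == truth:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[tuple[Sequence[str], str]]) -> float:
    """Average reciprocal rank over (ranked_labels, truth) pairs."""
    return sum(reciprocal_rank(r, t) for r, t in runs) / len(runs)


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """nDCG over graded relevance `gains`, listed in ranked order."""
    def dcg(values: Sequence[float]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0
```

Exact match at K=1 is `hit_at_k(ranked, truth, 1)`; the graded `gains` for nDCG could later come from LLM-as-judge relevance scores.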

Addresses:

Ground truth comparison (FEATURES.md table stakes)
Exact match accuracy metric (FEATURES.md table stakes)
Latency measurement (FEATURES.md table stakes)
Batch benchmarking (FEATURES.md table stakes)
Ground truth dataset structure (ARCHITECTURE.md pattern #3)

Avoids:

Data leakage (PITFALLS.md #4) — 80/20 train/test split before embedding, test queries not in training
Evaluation on training data (ARCHITECTURE.md anti-pattern #3) — synthetic variations, not exact strings
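
A deterministic split makes the 80/20 hold-out reproducible across reruns. The following sketch is one possible approach, not the project's method: it hashes the normalised description text so that exact duplicates cannot land on both sides of the split. The function name and the normalisation rule (lowercasing, collapsing whitespace) are assumptions.

```python
import hashlib


def assign_split(description: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a line item to 'train' or 'test'.

    Hashing the normalised text (rather than the row id) keeps exact
    duplicates on the same side of the split.
    """
    key = " ".join(description.lower().split())
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"
```

Near-duplicates (typos, abbreviations) still hash differently, so this guards only against exact repeats; stricter grouping would need a similarity check.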

Research needed: Standard patterns — evaluation metrics well-established in IR literature. Skip /gsd:research-phase.

Phase 4: Dashboard & Comparison

Rationale: With search and evaluation complete, build interactive interface for exploration. Research shows side-by-side comparison is core differentiator for model comparison tools. Flask + HTMX simpler than full frontend frameworks for spike.

Delivers: Flask application with routes for single query and batch evaluation, side-by-side comparison view (all 3 pgvector models + LLM baseline), score visualization, latency display, HTML export for stakeholder reports.

Addresses:

Side-by-side model comparison (FEATURES.md differentiator, HIGH priority)
Score distribution histograms (FEATURES.md differentiator)
Cost tracking per query (FEATURES.md differentiator)
HTML export (FEATURES.md differentiator)
Interactive dashboard (ARCHITECTURE.md presentation layer)

Avoids:

Complex frontend (FEATURES.md anti-feature "Complex query DSL") — simple text input, code modifications for advanced queries

Research needed: Standard patterns — Flask + HTMX well-documented for dashboards. Skip /gsd:research-phase.

Phase 5: LLM-as-Judge Enhancement

Rationale: After core comparison complete, add LLM-as-judge for cases where exact match fails but result is semantically equivalent. Research shows LLM judges have biases (position, verbosity, self-preference) requiring multi-model judging and randomization.

Delivers: LLM-as-judge evaluator using ragas framework, multi-model judging (GPT-4, Claude, Gemini), randomized presentation order, explicit rubric in evaluation prompt, human validation sample (10-20 queries).
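
Two of the bias mitigations above, randomized presentation order and multi-model judging, can be sketched independently of any judge API. The helpers below are hypothetical and do not use ragas; the judge names in the usage are placeholders, and treating ties as rejection is an assumption.

```python
import random
from collections import Counter
from typing import Sequence


def shuffled_candidates(candidates: Sequence[str], seed: int) -> list[str]:
    """Return candidates in a seeded random order to offset position bias."""
    order = list(candidates)
    random.Random(seed).shuffle(order)
    return order


def majority_verdict(verdicts: dict[str, bool]) -> tuple[bool, float]:
    """Combine per-judge verdicts; return (accepted, agreement ratio).

    Ties count as rejection so that acceptance needs a strict majority.
    """
    if not verdicts:
        raise ValueError("no judge verdicts supplied")
    counts = Counter(verdicts.values())
    accepted = counts[True] > counts[False]
    agreement = max(counts[True], counts[False]) / len(verdicts)
    return accepted, agreement
```

Storing the seed alongside each judged query keeps the randomized order reproducible, and low agreement ratios point to good candidates for the human validation sample.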

Addresses:

LLM-as-judge evaluation (FEATURES.md differentiator, P2 priority)
Quality assessment beyond exact match

Avoids:

LLM-as-judge bias (PITFALLS.md #7) — multi-model judging, order randomization, human validation sample
Single-run evaluation (PITFALLS.md #6) — multiple runs with variance reported

Research needed: LLM-as-judge implementation with ragas — prompt engineering, bias mitigation strategies. Candidate for /gsd:research-phase during planning.

Phase Ordering Rationale

Phase 1 before 2: Cannot search without data. Embedding generation is slow (API rate limits), so failed runs must be resumable.
Phase 2 before 3: Cannot evaluate without search implementation. Strategy pattern enables fair comparison.
Phase 3 before 4: Dashboard displays evaluation results. Metrics must exist before visualization.
Phase 4 before 5: Core comparison (exact match) before enhancement (LLM judge). Spike decision may not require Phase 5.

Architecture dependencies discovered:

Ground truth dataset structure (Phase 3) depends on pre-computed embeddings (Phase 1)
Side-by-side comparison (Phase 4) depends on search backends (Phase 2) and metrics (Phase 3)
LLM-as-judge (Phase 5) depends on evaluation framework (Phase 3) for result storage

Pitfall avoidance sequencing:

Data leakage prevented by train/test split BEFORE Phase 1 embedding
Dimension mismatch prevented by schema design IN Phase 1
Prompt sensitivity addressed by template freezing IN Phase 2
LLM-as-judge bias addressed by multi-model approach IN Phase 5

Research Flags

Phases likely needing deeper research during planning:

Phase 2 (Search Implementation): LLM context matching has domain-specific considerations (CSV formatting, token counting, prompt optimization for financial data). Research prompt engineering patterns for invoice classification.
Phase 5 (LLM-as-Judge): Bias mitigation strategies are implementation-dependent. Research ragas-specific patterns for multi-model judging and rubric design.

Phases with standard patterns (skip research-phase):

Phase 1 (Foundation): pgvector setup, embedding API usage, and Docker configuration are well-documented with clear examples
Phase 3 (Evaluation): IR metrics (MRR, nDCG) and ground truth dataset creation follow established patterns from MTEB and RAG literature
Phase 4 (Dashboard): Flask + HTMX for data dashboards has extensive documentation and examples

Confidence Assessment

Area	Confidence	Notes
Stack	HIGH	Official pgvector docs, Google SDK migration guides, sentence-transformers PyPI. Version compatibility verified.
Features	MEDIUM	Based on MTEB leaderboard, RAGAS framework, OpenSearch/Pinecone patterns. No spike-specific benchmarks found.
Architecture	HIGH	Multiple RAG evaluation frameworks (ragas, DeepEval) follow same pipeline pattern. Strategy pattern well-established for backend comparison.
Pitfalls	HIGH	Multiple authoritative sources on vector search anti-patterns, LLM evaluation biases, and data leakage. Research papers verify findings.

Overall confidence: HIGH

The stack, architecture, and pitfalls research draws from official documentation and peer-reviewed sources. Feature research relies more on industry patterns and comparative tool analysis. The recommended evaluation approach builds on ragas, for which the research found a reported 65% Fortune 500 adoption figure; that figure describes the tool's reach, not a validation of this specific design.

Gaps to Address

During planning/execution:

Jina API integration details — No official Python SDK found, must use requests library directly. Test API during Phase 1 to validate rate limits and batch endpoint behavior.
German-specific embedding quality — Research focused on general multilingual models. May discover during evaluation that German financial terminology requires domain-specific tuning. Monitor exact match rates; if <50%, consider fine-tuning in separate spike.
Orcha prompt templates — The research had no access to the current Orcha prompts for LLM matching. Replicating the approach in Phase 2 may require experimentation. Document all variations and report sensitivity.
Synthetic query generation scope — Unclear if existing historical.csv provides sufficient test coverage. Evaluate after Phase 3 baseline; only add synthetic generation (complex LLM paraphrasing) if test set proves insufficient.
Token counting implementation — Research identifies context overflow risk but doesn't specify a token counting library. Validate how closely tiktoken and transformers tokenizer counts match Gemini 2.5 Flash during Phase 2, when the LLM matching backend is implemented.

Sources

Primary (HIGH confidence)

pgvector GitHub — v0.8.1 release notes, HNSW index configuration, Docker setup
Google Gen AI SDK — v1.64.0 migration guide, embedding API examples
Vertex AI Deprecation Notice — June 2025 deprecation timeline
ragas PyPI — LLM-as-judge framework, evaluation metrics
pgvector 2026 Guide — Index selection (HNSW vs IVFFlat), performance tuning
RAG Evaluation Survey — Evaluation methodology patterns
Quantifying LLM Prompt Sensitivity — 10-70% accuracy variance from prompt changes

Secondary (MEDIUM confidence)

MTEB Leaderboard — Embedding model benchmarks, standard evaluation metrics
Milvus Semantic Search Metrics — MRR, nDCG, Top-K accuracy patterns
Evidently AI - LLM as Judge Guide — LLM-as-judge best practices
Pinecone Vector Similarity — Cosine vs L2 distance, normalization patterns
Justice or Prejudice LLM-as-Judge — Position bias, verbosity bias, self-preference bias
Databricks Long Context RAG — Context window overflow patterns
Jina AI Embeddings — v3/v4 model specifications, API documentation

Tertiary (LOW confidence, needs validation)

Chroma Comparing Embedding Models — Side-by-side comparison UI patterns
Shaped.ai Cosine Similarity Analysis — Distance metric selection considerations

Research completed: 2026-02-20
Ready for roadmap: yes