Project Research Summary

Project: Semantic Search Comparison Benchmark Domain: Vector Database Evaluation (Invoice Line Item Matching) Researched: 2026-02-20 Confidence: HIGH

Executive Summary

This project is a technical spike to evaluate whether pgvector-based semantic search can replace LLM-based matching for invoice line item classification. The research shows this is a well-established problem domain with mature tools: PostgreSQL 18 with pgvector 0.8.1 for vector search, multiple embedding models (Google gemini-embedding-001, Jina v3, all-MiniLM-L6-v2) for comparison, and standard IR evaluation metrics (exact match, MRR, nDCG).

The recommended approach follows proven patterns from the RAG evaluation ecosystem: pre-compute embeddings for all models, store in separate columns for side-by-side comparison, build HNSW indexes after data load, and evaluate using both exact match metrics and LLM-as-judge for semantic equivalence. The critical differentiator is the side-by-side comparison dashboard that lets stakeholders interactively explore how each model performs on the same queries.

The primary risks are data leakage (testing on training data), prompt sensitivity (LLM baseline varying 10-70% with minor prompt changes), and unfair comparison (different preprocessing per model). These are mitigated by: holding out 20% of data for testing, documenting exact prompts with sensitivity metrics, and standardizing preprocessing across all embedding models. The spike scope is deliberately constrained to ~6K invoice line items to make decisions quickly without production deployment complexity.

Key Findings

Modern vector search requires PostgreSQL 18 with pgvector 0.8.1 (HNSW indexes, binary quantization support). Python 3.12 is the sweet spot supported by all required libraries. The embedding landscape has shifted significantly: google-genai (v1.64.0) replaces deprecated google-generativeai and vertexai modules; gemini-embedding-001 now outperforms text-multilingual-embedding-002 for multilingual retrieval. For local baselines, all-MiniLM-L6-v2 provides fast, cost-free embeddings at 384 dimensions.

Core technologies:

Critical version notes:

Expected Features

Research reveals a clear feature hierarchy for embedding comparison tools. The MTEB leaderboard, RAGAS framework, and production RAG systems all converge on similar patterns.

Must have (table stakes):

Should have (competitive):

Defer (v2+):

Architecture Approach

The research converges on a pipeline architecture with checkpoints: ingestion pipeline (run once) → search layer (multiple implementations) → evaluation framework → presentation layer. This separation enables fair comparison and resumability after API failures.

Major components:

  1. Ingestion Pipeline — CSV loader → embedding generator (3 models, batch API requests) → pgvector writer → synthetic query generator. Each stage writes to database for restart capability.

  2. Search Layer — Strategy pattern with common interface: PgvectorSearch (3 model variants) and LLMContextSearch (Orcha replication). Enables fair comparison with same evaluation framework.

  3. Evaluation Framework — Ground truth dataset (held-out 20%) → metrics calculator (exact match, Top-K, MRR, latency) → LLM-as-judge evaluator (for semantic equivalence) → results store.

  4. Dashboard — Flask + HTMX for interactive exploration. Thin presentation layer calling search and evaluation modules. Side-by-side comparison view is core feature.

Database schema:

Build order rationale: Database schema first → ingestion pipeline → search layer → evaluation framework → dashboard. Cannot search without data; cannot evaluate without search; dashboard is thin integration layer.

Critical Pitfalls

Research identifies 8 critical pitfalls from vector database practitioners and RAG evaluation studies. Top 5 by severity:

  1. Data leakage in evaluation — Testing on training data or overly-similar synthetic variations produces artificially high accuracy (>95%). Prevention: Hold out 20% of historical.csv before embedding; generate diverse test variations (typos, abbreviations, synonyms, paraphrases). Address in Phase 3 (Evaluation Design).

  2. Prompt sensitivity in LLM matching — Small prompt wording changes cause 10-70% accuracy swings. The same query gets different results with minor formatting changes. Prevention: Test 3-5 prompt variations and report variance (not just best result); use structured JSON output; document exact prompts. Address in Phase 4 (LLM Implementation).

  3. IVFFlat index on insufficient data — Creating index before data load produces poor clusters that never improve. IVFFlat is static once built, unlike HNSW. Prevention: Use HNSW for this spike (works without data, better for dynamic datasets); if using IVFFlat, load ALL data first. Address in Phase 1 (Data Import).

  4. Embedding dimension mismatch — Different models have different dimensions (Google: 768, Jina: 768, MiniLM: 384). Mixing models causes silent failures or crashes. Prevention: Store embedding_model metadata in database; validate dimensions on insert; use separate columns per model. Address in Phase 1 (Data Import).

  5. LLM context window overflow — Passing too many historical bookings causes "lost in the middle" effect or silent truncation. Current Orcha approach passes up to 50 bookings (~5K tokens just for context). Prevention: Calculate actual token count before API call; limit to 20-30 most relevant bookings; place query at start/end of context. Address in Phase 4 (LLM Implementation).

Implications for Roadmap

Based on research, suggested 5-phase structure following industry patterns for vector search evaluation:

Phase 1: Foundation & Data Import

Rationale: Database schema and data ingestion must come first. Research shows HNSW indexes should be created AFTER full data load, not during schema setup. Pre-computing embeddings is industry best practice to isolate search performance from generation latency.

Delivers: PostgreSQL 18 with pgvector extension, line_items table with 3 embedding columns, HNSW indexes, ~6K rows of invoice data with embeddings from all 3 models.

Addresses:

Avoids:

Research needed: Standard patterns — pgvector setup is well-documented, embedding APIs have clear examples. Skip /gsd:research-phase.


Phase 2: Search Implementation

Rationale: Once data is loaded, implement both search approaches (pgvector with 3 models + LLM context matching). Strategy pattern from ARCHITECTURE.md enables fair comparison. Must implement before evaluation framework can consume results.

Delivers: PgvectorSearch class with 3 model variants, LLMContextSearch replicating Orcha approach, common SearchBackend interface, query endpoint returning top-K results with scores.

Addresses:

Avoids:

Research needed: LLM context matching implementation details — how to structure CSV context, token counting, prompt optimization. Candidate for /gsd:research-phase during planning.


Phase 3: Evaluation Framework

Rationale: With search working, build evaluation framework to objectively compare approaches. Ground truth dataset design is critical — research emphasizes holding out data to avoid leakage. Metrics hierarchy: start simple (exact match), graduate to ranking-aware (MRR, nDCG), supplement with LLM-as-judge.

Delivers: Test dataset (20% held-out from historical.csv), ground truth labels (expected GL account/cost center), metrics calculator (exact match, Top-K accuracy, MRR, latency), batch evaluation runner.

Addresses:

Avoids:

Research needed: Standard patterns — evaluation metrics well-established in IR literature. Skip /gsd:research-phase.


Phase 4: Dashboard & Comparison

Rationale: With search and evaluation complete, build interactive interface for exploration. Research shows side-by-side comparison is core differentiator for model comparison tools. Flask + HTMX simpler than full frontend frameworks for spike.

Delivers: Flask application with routes for single query and batch evaluation, side-by-side comparison view (all 3 pgvector models + LLM baseline), score visualization, latency display, HTML export for stakeholder reports.

Addresses:

Avoids:

Research needed: Standard patterns — Flask + HTMX well-documented for dashboards. Skip /gsd:research-phase.


Phase 5: LLM-as-Judge Enhancement

Rationale: After core comparison complete, add LLM-as-judge for cases where exact match fails but result is semantically equivalent. Research shows LLM judges have biases (position, verbosity, self-preference) requiring multi-model judging and randomization.

Delivers: LLM-as-judge evaluator using ragas framework, multi-model judging (GPT-4, Claude, Gemini), randomized presentation order, explicit rubric in evaluation prompt, human validation sample (10-20 queries).

Addresses:

Avoids:

Research needed: LLM-as-judge implementation with ragas — prompt engineering, bias mitigation strategies. Candidate for /gsd:research-phase during planning.


Phase Ordering Rationale

Architecture dependencies discovered:

Pitfall avoidance sequencing:

Research Flags

Phases likely needing deeper research during planning:

Phases with standard patterns (skip research-phase):

Confidence Assessment

Area Confidence Notes
Stack HIGH Official pgvector docs, Google SDK migration guides, sentence-transformers PyPI. Version compatibility verified.
Features MEDIUM Based on MTEB leaderboard, RAGAS framework, OpenSearch/Pinecone patterns. No spike-specific benchmarks found.
Architecture HIGH Multiple RAG evaluation frameworks (ragas, DeepEval) follow same pipeline pattern. Strategy pattern well-established for backend comparison.
Pitfalls HIGH Multiple authoritative sources on vector search anti-patterns, LLM evaluation biases, and data leakage. Research papers verify findings.

Overall confidence: HIGH

The stack, architecture, and pitfalls research draws from official documentation and peer-reviewed sources. Feature research relies more on industry patterns and comparative tool analysis. The recommended approach aligns with how 65% of Fortune 500 companies evaluate RAG systems (ragas adoption stat).

Gaps to Address

During planning/execution:

  1. Jina API integration details — No official Python SDK found, must use requests library directly. Test API during Phase 1 to validate rate limits and batch endpoint behavior.

  2. German-specific embedding quality — Research focused on general multilingual models. May discover during evaluation that German financial terminology requires domain-specific tuning. Monitor exact match rates; if <50%, consider fine-tuning in separate spike.

  3. Orcha prompt templates — Research doesn't have access to current Orcha prompts for LLM matching. Replicate approach in Phase 2 may require experimentation. Document all variations and report sensitivity.

  4. Synthetic query generation scope — Unclear if existing historical.csv provides sufficient test coverage. Evaluate after Phase 3 baseline; only add synthetic generation (complex LLM paraphrasing) if test set proves insufficient.

  5. Token counting implementation — Research identifies context overflow risk but doesn't specify token counting library. Validate tiktoken vs transformers tokenizer alignment with Gemini 2.5 Flash during Phase 4.

Sources

Primary (HIGH confidence)

Secondary (MEDIUM confidence)

Tertiary (LOW confidence, needs validation)


Research completed: 2026-02-20 Ready for roadmap: yes