Feature Research

Domain: Semantic Search Comparison/Benchmark Tool (spike for invoice line item matching)
Researched: 2026-02-20
Confidence: MEDIUM (based on multiple web sources, industry patterns)

Feature Landscape

Table Stakes (Users Expect These)

Features users assume exist. Missing these = tool feels incomplete for its purpose.

Feature	Why Expected	Complexity	Notes
Single query search interface	Basic usage pattern; need to test individual queries interactively	LOW	Text input + submit button; display results with scores
Multiple embedding model support	Core purpose of comparison; PROJECT.md specifies 3 models	MEDIUM	Abstraction layer for model switching; pre-computed embeddings
Top-K result retrieval	Standard retrieval interface; adjustable K for precision/recall tradeoff	LOW	K=3,5,10 as per PROJECT.md requirements
Similarity score display	Users need to understand why results ranked as shown	LOW	Show cosine similarity or distance metric per result
Exact match accuracy metric	Primary success criterion; does retrieved GL account match ground truth	LOW	Boolean comparison of predicted vs actual account/cost center
Latency measurement	Performance is key comparison criterion; sub-100ms matters	LOW	Timer around query execution; display ms per query
Result list visualization	Core output display; show retrieved documents with context	LOW	Ordered list with rank, score, document text, GL account
Ground truth comparison	Need reference to evaluate quality; synthetic test framework	MEDIUM	Link query to expected answer; highlight match/mismatch
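
To make the table-stakes rows concrete, here is a minimal Python sketch of how top-K retrieval, similarity score display, and latency measurement could fit together. It assumes embeddings are already pre-computed and held in memory as plain lists. The names `cosine_similarity` and `top_k`, the toy vectors, and the GL account codes are hypothetical and not taken from PROJECT.md. A real spike would presumably delegate the search to pgvector rather than scan in Python.

```python
import math
import time

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, documents, k=3):
    """Return (results, latency_ms).

    results is a list of (score, text, gl_account), highest score first.
    latency_ms covers scoring and sorting only, not embedding generation.
    """
    start = time.perf_counter()
    scored = [
        (cosine_similarity(query_vec, doc["embedding"]), doc["text"], doc["gl_account"])
        for doc in documents
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return scored[:k], latency_ms

# Hypothetical toy data: three line items with made-up 3-dimensional embeddings.
DOCUMENTS = [
    {"text": "Druckerpapier A4", "gl_account": "6815", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Bahnticket Berlin", "gl_account": "6670", "embedding": [0.0, 0.2, 0.9]},
    {"text": "Toner schwarz", "gl_account": "6815", "embedding": [0.8, 0.3, 0.1]},
]

if __name__ == "__main__":
    results, latency_ms = top_k([1.0, 0.2, 0.0], DOCUMENTS, k=2)
    for rank, (score, text, gl_account) in enumerate(results, start=1):
        print(f"{rank}. {score:.3f}  {text}  (GL {gl_account})")
    print(f"latency: {latency_ms:.2f} ms")
```

Timing only the scoring and sorting step keeps the latency figure comparable across models, since embedding generation is excluded.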

Differentiators (Competitive Advantage)

Features that set the product apart. Not required, but valuable for a comprehensive spike.

Feature	Value Proposition	Complexity	Notes
Side-by-side model comparison	Core differentiation - see all 3 models simultaneously for same query	MEDIUM	Multi-column layout; same query to all models; visual diff of results
LLM-as-judge evaluation	Automated quality assessment without manual labeling; catches semantic equivalents	HIGH	LLM API call per evaluation; prompt engineering for relevance scoring
Batch query benchmarking	Run entire test suite at once; statistical validity	MEDIUM	Loop over test queries; aggregate metrics; progress indicator
Aggregate metrics dashboard (nDCG, MRR)	Industry-standard IR metrics beyond simple accuracy	MEDIUM	Ranking-aware metrics; implementation from evaluation libraries
Score distribution histograms	Understand model confidence patterns; identify decision thresholds	MEDIUM	Charting library; binned similarity scores; per-model comparison
Synthetic query generation	Create test variations without manual effort; robustness testing	HIGH	LLM-based paraphrasing; variation types (synonyms, reordering, typos)
Cost tracking per query	API costs are real constraint; compare embedding vs LLM costs	LOW	Token counting; price per model; cumulative totals
Hand-picked example showcase	Curated evidence for decision-making; report-ready outputs	LOW	Save/tag interesting queries; export selection
HTML export/report generation	Share findings with stakeholders who won't run the tool	MEDIUM	Static HTML generation; embedded charts; summary statistics
Confusion matrix for GL accounts	See where models systematically fail; error pattern analysis	MEDIUM	Multi-class confusion matrix; heatmap visualization
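
The ranking-aware metrics named in the table (MRR, nDCG) can be sketched in a few lines. This is an illustrative version under stated assumptions, not a replacement for an evaluation library: relevance is binary (a retrieved GL account either equals the expected one or not), and `num_relevant`, the number of relevant documents in the collection, is a hypothetical parameter the caller must supply to compute the ideal DCG.

```python
import math

def reciprocal_rank(retrieved, expected):
    """1/rank of the first retrieved item equal to expected, or 0.0 if absent."""
    for rank, item in enumerate(retrieved, start=1):
        if item == expected:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs is a list of (retrieved, expected) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, e) for r, e in runs) / len(runs)

def dcg(gains):
    """Discounted cumulative gain: the gain at rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(retrieved, expected, k, num_relevant):
    """nDCG@k with binary gains.

    The ideal ranking places min(k, num_relevant) relevant items at the top.
    """
    gains = [1.0 if item == expected else 0.0 for item in retrieved[:k]]
    ideal = dcg([1.0] * min(k, num_relevant))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

With a single correct answer per query (`num_relevant=1`), nDCG@k reduces to 1/log2(rank + 1) for the matching position, so it penalizes a late hit more gently than the reciprocal rank does.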

Anti-Features (Commonly Requested, Often Problematic)

Features that seem good but create problems for a spike project.

Feature	Why Requested	Why Problematic	Alternative
Real-time embedding generation	Feels more "realistic"	Adds latency to every query; masks true search performance; API costs during exploration	Pre-compute all embeddings during import; measure embedding generation separately
User authentication/multi-tenant	"Production-ready" thinking	Spike scope creep; single dataset; adds complexity with no value	Single-user local tool; explicit scope from PROJECT.md
A/B testing / interleaving experiments	Gold standard for online evaluation	Requires live users, traffic scale, statistical power we don't have	Offline batch evaluation with held-out test set
Custom model fine-tuning	"What if we trained our own"	Major scope expansion; need training data, infrastructure, expertise	Compare off-the-shelf models first; fine-tuning is separate spike if needed
Full RAG pipeline evaluation	Evaluate end-to-end generation	We're evaluating retrieval only; LLM matching is separate system	Keep retrieval and generation evaluations distinct
Continuous monitoring/drift detection	"Production" features	No production traffic; spike is point-in-time evaluation	One-time comprehensive benchmark; revisit if deployed
Complex query DSL	Power user flexibility	Adds learning curve; spike users are developers who can modify code	Simple text input; code modifications for advanced queries
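
The alternative suggested for real-time embedding generation (pre-compute during import, measure embedding generation separately) could look roughly like the sketch below. `fake_embed` is a hypothetical stand-in for a real embedding model, and the JSON file cache is an assumption made for illustration; the spike itself may store vectors elsewhere (for example in pgvector).

```python
import json
import time
from pathlib import Path

def fake_embed(text):
    """Hypothetical stand-in for a real model: returns a tiny 2-dimensional vector."""
    return [float(len(text)), float(text.count("e"))]

def precompute(texts, embed_fn, cache_path):
    """Embed every text once, write the result to cache_path, return (embeddings, seconds).

    The elapsed time covers embedding generation only, so it can be reported
    separately from search latency. Duplicate texts collapse into one entry.
    """
    start = time.perf_counter()
    embeddings = {text: embed_fn(text) for text in texts}
    elapsed = time.perf_counter() - start
    Path(cache_path).write_text(json.dumps(embeddings), encoding="utf-8")
    return embeddings, elapsed

def load_embeddings(cache_path):
    """Read previously computed embeddings back; no model call is needed."""
    return json.loads(Path(cache_path).read_text(encoding="utf-8"))
```

Keeping one cache file per model means later query runs only read vectors, which avoids API costs during exploration.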

Feature Dependencies

[Embedding import/storage]
    |
    +--enables--> [Single query interface]
    |                   |
    |                   +--enables--> [Top-K retrieval]
    |                   |                   |
    |                   |                   +--enables--> [Side-by-side comparison]
    |                   |                   |
    |                   |                   +--enables--> [Batch benchmarking]
    |                   |                                      |
    |                   |                                      +--enables--> [Aggregate metrics (nDCG, MRR)]
    |                   |                                      |
    |                   |                                      +--enables--> [Score distributions]
    |                   |
    |                   +--enables--> [Similarity score display]
    |
    +--enables--> [Latency measurement]

[Ground truth data]
    |
    +--enables--> [Exact match accuracy]
    |
    +--enables--> [LLM-as-judge evaluation]
    |
    +--enables--> [Confusion matrix]

[Synthetic query generation] --independent-- [can run after initial data import]

[HTML export] --requires--> [Any metrics or visualizations to export]

Dependency Notes

Search requires embedding import: Pre-computed embeddings must be imported and stored before any search functionality works
Side-by-side requires Top-K: Need basic retrieval working before comparing across models
Aggregate metrics require batch: nDCG/MRR need multiple query results to be meaningful
LLM-as-judge requires ground truth: Need expected answers to evaluate relevance
Export requires metrics: Nothing to export until we can generate results and metrics
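
One way to turn the dependency tree into a build order is a topological sort. The sketch below reads each arrow in the diagram as "the child feature depends on its parent", which is an interpretation rather than something the diagram states explicitly. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); `DEPENDS_ON` and `build_order` are hypothetical names.

```python
from graphlib import TopologicalSorter

# Maps each feature to the set of features it depends on (its predecessors).
DEPENDS_ON = {
    "Single query interface": {"Embedding import/storage"},
    "Top-K retrieval": {"Single query interface"},
    "Similarity score display": {"Single query interface"},
    "Latency measurement": {"Embedding import/storage"},
    "Side-by-side comparison": {"Top-K retrieval"},
    "Batch benchmarking": {"Top-K retrieval"},
    "Aggregate metrics (nDCG, MRR)": {"Batch benchmarking"},
    "Score distributions": {"Batch benchmarking"},
    "Exact match accuracy": {"Ground truth data"},
    "LLM-as-judge evaluation": {"Ground truth data"},
    "Confusion matrix": {"Ground truth data"},
}

def build_order(depends_on):
    """Return the features in an order where every dependency comes first."""
    return list(TopologicalSorter(depends_on).static_order())

if __name__ == "__main__":
    for step, feature in enumerate(build_order(DEPENDS_ON), start=1):
        print(f"{step:2d}. {feature}")
```

`TopologicalSorter` raises `graphlib.CycleError` if the mapping contains a cycle, which doubles as a sanity check on the diagram.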

MVP Definition

Launch With (v1 - Initial Spike)

Minimum viable spike - what's needed to make the core comparison decision.

Pre-computed embeddings for all 3 models - Foundation for all comparisons
Single query interface - Interactive exploration of results
Top-K retrieval (K=3,5,10) - Standard retrieval output
Similarity scores displayed - Understand ranking basis
Side-by-side comparison view - Core differentiation capability
Exact match accuracy on test set - Primary success metric
Latency per query - Performance comparison
Basic batch benchmarking - Run test suite, get aggregate accuracy
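
The batch benchmarking item can be sketched as a loop over the test set per model. This is a minimal illustration, assuming each model is wrapped in a function that maps a query string to a ranked list of GL accounts; `hit_at_k`, `accuracy_at_k`, and `benchmark` are hypothetical names. Accuracy here means "the expected account appears in the top K"; with K=1 it reduces to exact match on the top result.

```python
def hit_at_k(retrieved_accounts, expected_account, k):
    """True if the expected GL account appears among the first k results."""
    return expected_account in retrieved_accounts[:k]

def accuracy_at_k(runs, k):
    """Share of queries with a hit in the top k; runs is a list of (retrieved, expected)."""
    if not runs:
        return 0.0
    hits = sum(hit_at_k(retrieved, expected, k) for retrieved, expected in runs)
    return hits / len(runs)

def benchmark(models, test_set, ks=(3, 5, 10)):
    """Run every test query through every model and aggregate accuracy per K.

    models:   dict of model name -> search function (query -> ranked GL accounts)
    test_set: list of (query, expected_account)
    returns:  dict of model name -> {k: accuracy}
    """
    report = {}
    for name, search in models.items():
        runs = [(search(query), expected) for query, expected in test_set]
        report[name] = {k: accuracy_at_k(runs, k) for k in ks}
    return report
```

Each query is searched once per model and the ranked list is reused for every K, so adding more K values does not add search calls.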

Add After Validation (v1.x)

Features to add once core comparison shows promise.

LLM-as-judge evaluation - Add when exact match proves insufficient (semantic equivalents)
nDCG and MRR metrics - Add when ranking quality matters beyond top-1
Score distribution visualizations - Add for deeper analysis of model behavior
Cost tracking - Add when comparing embedding API costs vs LLM costs
HTML export - Add when ready to share findings with stakeholders
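
For the score distribution visualizations, the binning step is independent of any charting library. The sketch below is an assumption-laden illustration: `histogram` is a hypothetical helper, and the default range of 0.0 to 1.0 suits non-negative similarity scores; cosine similarity can be negative, in which case `lo=-1.0` would be the natural choice.

```python
def histogram(scores, bins=10, lo=0.0, hi=1.0):
    """Count scores per equal-width bin over [lo, hi].

    Scores outside the range are skipped. A score equal to hi is counted
    in the last bin rather than overflowing.
    """
    counts = [0] * bins
    width = (hi - lo) / bins
    for score in scores:
        if score < lo or score > hi:
            continue
        index = min(int((score - lo) / width), bins - 1)
        counts[index] += 1
    return counts

if __name__ == "__main__":
    # Hypothetical top-1 similarity scores for one model.
    sample = [0.05, 0.15, 0.95, 1.0]
    for i, count in enumerate(histogram(sample, bins=5)):
        print(f"bin {i}: {'#' * count}")
```

Running the same binning per model with identical `bins`, `lo`, and `hi` keeps the per-model histograms directly comparable.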

Future Consideration (v2+ / Separate Spike)

Features to defer until the spike's core comparison decision has been made.

Synthetic query generation - Complex; may not be needed if the existing test set is sufficient
Confusion matrix analysis - Useful for production but not for initial decision
LLM baseline comparison - Already in PROJECT.md scope but could be separate workstream
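
If the confusion matrix analysis is picked up later, its core is a count of (expected, predicted) GL account pairs. A minimal sketch, assuming the top-1 prediction per query is already available; the function names and account codes are hypothetical.

```python
from collections import Counter

def confusion_counts(pairs):
    """Count every (expected, predicted) pair, including correct predictions."""
    return Counter(pairs)

def top_confusions(pairs, n=5):
    """Return the n most frequent mismatches as ((expected, predicted), count)."""
    errors = Counter((expected, predicted) for expected, predicted in pairs if expected != predicted)
    return errors.most_common(n)

if __name__ == "__main__":
    # Hypothetical (expected, predicted) GL accounts for four queries.
    pairs = [("6815", "6815"), ("6815", "6800"), ("6815", "6800"), ("6670", "6815")]
    for (expected, predicted), count in top_confusions(pairs):
        print(f"expected {expected}, got {predicted}: {count}x")
```

Listing only the most frequent mismatches is often enough to spot systematic errors before investing in a heatmap.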

Feature Prioritization Matrix

Feature	User Value	Implementation Cost	Priority
Pre-computed embeddings	HIGH	MEDIUM	P1
Single query interface	HIGH	LOW	P1
Top-K retrieval	HIGH	LOW	P1
Side-by-side comparison	HIGH	MEDIUM	P1
Exact match accuracy	HIGH	LOW	P1
Latency measurement	MEDIUM	LOW	P1
Batch benchmarking	HIGH	MEDIUM	P1
Similarity score display	MEDIUM	LOW	P1
LLM-as-judge	MEDIUM	HIGH	P2
nDCG/MRR metrics	MEDIUM	MEDIUM	P2
Score distributions	LOW	MEDIUM	P2
Cost tracking	LOW	LOW	P2
HTML export	MEDIUM	MEDIUM	P2
Synthetic query generation	LOW	HIGH	P3
Confusion matrix	LOW	MEDIUM	P3

Priority key:

P1: Must have for initial spike decision
P2: Should have if time permits, add based on P1 findings
P3: Nice to have, likely out of spike scope

Competitor/Reference Tool Analysis

Feature	MTEB Leaderboard	OpenSearch Search Quality	RAGAS	DeepEval	Our Approach
Standard benchmarks	56 datasets, 8 tasks	Custom datasets	Synthetic generation	Test-driven	Custom dataset (historical.csv)
Side-by-side comparison	Leaderboard table	A/B experiment view	N/A	N/A	Per-query multi-model view
Metrics	Per-task scores	nDCG, MAP, Precision	Faithfulness, relevancy	14+ metrics	Exact match, Top-K, latency, cost
LLM evaluation	N/A	N/A	Core feature	Core feature	Optional P2 feature
Interactive exploration	Web UI	Dashboard	Programmatic	Programmatic	Interactive HTML page
Export	CSV download	Dashboard export	JSON/logs	pytest reports	HTML report

Key Insights from Research

What the Ecosystem Teaches Us

Metrics hierarchy matters: Start with simple exact match, graduate to ranking-aware metrics (MRR, nDCG) only when needed. Most production teams use multiple metrics.
LLM-as-judge is standard for semantic evaluation: When exact string match fails (synonyms, paraphrases), LLM judges provide reasonable relevance scores. Cost is manageable ($0.01-0.10 per assessment).
Batch evaluation is prerequisite for statistical validity: Single query demos are misleading. Need aggregate metrics over representative test set.
Visualization accelerates insight: Histograms of similarity scores, confusion matrices, and side-by-side comparisons help identify patterns faster than raw numbers.
Pre-computation is best practice: Benchmark tools typically pre-compute embeddings to isolate search performance from embedding generation latency.
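
To put the LLM-as-judge cost range ($0.01-0.10 per assessment) in context, a back-of-the-envelope estimate helps. The sketch assumes one judge call per retrieved result, which is only one possible design (one call per query is another), and the test set size of 100 queries is a hypothetical figure, not taken from the project.

```python
def judge_cost_range(num_queries, num_models, k, low_per_call=0.01, high_per_call=0.10):
    """Return (calls, low_total, high_total) in dollars.

    Assumes one LLM judge call for each retrieved result of each model.
    """
    calls = num_queries * num_models * k
    return calls, calls * low_per_call, calls * high_per_call

if __name__ == "__main__":
    # 100 queries (hypothetical), 3 models, K=3.
    calls, low, high = judge_cost_range(num_queries=100, num_models=3, k=3)
    print(f"{calls} judge calls, roughly ${low:.2f} to ${high:.2f}")
```

Under these assumptions the total scales linearly with K, so judging only the top result per model would cut the cost to a third.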

What Makes Our Spike Different

Domain-specific: Invoice line items in German, not general-purpose retrieval
Decision-focused: Binary outcome (use pgvector vs. continue with LLM), not an academic benchmark
Cost-sensitive: Comparing free local models vs API costs
Ground truth available: Historical bookings provide labeled data
