Feature Research

Domain: Semantic Search Comparison/Benchmark Tool (spike for invoice line item matching)
Researched: 2026-02-20
Confidence: MEDIUM (based on multiple web sources, industry patterns)

Feature Landscape

Table Stakes (Users Expect These)

Features users assume exist. Missing these = tool feels incomplete for its purpose.

| Feature | Why Expected | Complexity | Notes |
|---|---|---|---|
| Single query search interface | Basic usage pattern; need to test individual queries interactively | LOW | Text input + submit button; display results with scores |
| Multiple embedding model support | Core purpose of comparison; PROJECT.md specifies 3 models | MEDIUM | Abstraction layer for model switching; pre-computed embeddings |
| Top-K result retrieval | Standard retrieval interface; adjustable K for precision/recall tradeoff | LOW | K=3,5,10 as per PROJECT.md requirements |
| Similarity score display | Users need to understand why results ranked as shown | LOW | Show cosine similarity or distance metric per result |
| Exact match accuracy metric | Primary success criterion; does retrieved GL account match ground truth | LOW | Boolean comparison of predicted vs actual account/cost center |
| Latency measurement | Performance is key comparison criterion; sub-100ms matters | LOW | Timer around query execution; display ms per query |
| Result list visualization | Core output display; show retrieved documents with context | LOW | Ordered list with rank, score, document text, GL account |
| Ground truth comparison | Need reference to evaluate quality; synthetic test framework | MEDIUM | Link query to expected answer; highlight match/mismatch |
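
Several of the table-stakes rows (top-K retrieval, similarity score display, exact match, latency measurement) fit into one small function. The following is a minimal sketch under stated assumptions, not the project's implementation: the three-dimensional vectors, line item texts, and GL account codes are made up for illustration, and the index is a plain in-memory list rather than whatever storage the spike ends up using.

```python
import math
import time

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=3):
    """Return (results, latency_ms); results are (score, text, gl_account) tuples, best first."""
    start = time.perf_counter()
    scored = [(cosine(query_vec, vec), text, gl) for vec, text, gl in index]
    scored.sort(key=lambda row: row[0], reverse=True)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return scored[:k], latency_ms

# Hypothetical pre-computed embeddings: (vector, line item text, GL account).
INDEX = [
    ([1.0, 0.0, 0.0], "Office chairs", "6100"),
    ([0.0, 1.0, 0.0], "Cloud hosting", "6400"),
    ([0.9, 0.1, 0.0], "Standing desk", "6100"),
]

results, latency_ms = top_k([1.0, 0.05, 0.0], INDEX, k=2)
# Exact match: boolean comparison of the top result's account with the ground truth.
exact_match = results[0][2] == "6100"
```

The timer wraps only scoring and sorting, so the reported latency covers search and excludes any time spent producing the query vector.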

Differentiators (Competitive Advantage)

Features that set the product apart. Not required, but valuable for a comprehensive spike.

| Feature | Value Proposition | Complexity | Notes |
|---|---|---|---|
| Side-by-side model comparison | Core differentiation - see all 3 models simultaneously for same query | MEDIUM | Multi-column layout; same query to all models; visual diff of results |
| LLM-as-judge evaluation | Automated quality assessment without manual labeling; catches semantic equivalents | HIGH | LLM API call per evaluation; prompt engineering for relevance scoring |
| Batch query benchmarking | Run entire test suite at once; statistical validity | MEDIUM | Loop over test queries; aggregate metrics; progress indicator |
| Aggregate metrics dashboard (nDCG, MRR) | Industry-standard IR metrics beyond simple accuracy | MEDIUM | Ranking-aware metrics; implementation from evaluation libraries |
| Score distribution histograms | Understand model confidence patterns; identify decision thresholds | MEDIUM | Charting library; binned similarity scores; per-model comparison |
| Synthetic query generation | Create test variations without manual effort; robustness testing | HIGH | LLM-based paraphrasing; variation types (synonyms, reordering, typos) |
| Cost tracking per query | API costs are real constraint; compare embedding vs LLM costs | LOW | Token counting; price per model; cumulative totals |
| Hand-picked example showcase | Curated evidence for decision-making; report-ready outputs | LOW | Save/tag interesting queries; export selection |
| HTML export/report generation | Share findings with stakeholders who won't run the tool | MEDIUM | Static HTML generation; embedded charts; summary statistics |
| Confusion matrix for GL accounts | See where models systematically fail; error pattern analysis | MEDIUM | Multi-class confusion matrix; heatmap visualization |
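
For the ranking-aware metrics row, the standard definitions are short enough to write by hand if pulling in an evaluation library feels heavy for a spike. The sketch below assumes binary relevance (a result is relevant only when its GL account equals the expected one) and computes the ideal DCG from the relevant items present in the retrieved list rather than in the whole corpus; both are simplifying assumptions, and the account codes are invented.

```python
import math

def reciprocal_rank(retrieved, expected):
    """1/rank of the first result equal to `expected`, or 0.0 if it never appears."""
    for rank, account in enumerate(retrieved, start=1):
        if account == expected:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, expected) pairs."""
    return sum(reciprocal_rank(r, e) for r, e in runs) / len(runs)

def ndcg_at_k(retrieved, expected, k):
    """nDCG@k with binary relevance: gain is 1 for a matching account, else 0."""
    gains = [1.0 if account == expected else 0.0 for account in retrieved[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    # Assumption: the ideal ranking moves every relevant retrieved item to the top.
    n_relevant = sum(1 for account in retrieved if account == expected)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(n_relevant, k)))
    return dcg / ideal if ideal else 0.0

# Hypothetical ranked GL accounts per query, with the expected account.
RUNS = [
    (["6100", "6400", "6200"], "6100"),  # hit at rank 1
    (["6400", "6100", "6200"], "6100"),  # hit at rank 2
    (["6400", "6200", "6300"], "6100"),  # miss
]

mrr = mean_reciprocal_rank(RUNS)
ndcg_scores = [ndcg_at_k(retrieved, expected, k=3) for retrieved, expected in RUNS]
```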

Anti-Features (Commonly Requested, Often Problematic)

Features that seem good but create problems for a spike project.

| Feature | Why Requested | Why Problematic | Alternative |
|---|---|---|---|
| Real-time embedding generation | Feels more "realistic" | Adds latency to every query; masks true search performance; API costs during exploration | Pre-compute all embeddings during import; measure embedding generation separately |
| User authentication/multi-tenant | "Production-ready" thinking | Spike scope creep; single dataset; adds complexity with no value | Single-user local tool; explicit scope from PROJECT.md |
| A/B testing / interleaving experiments | Gold standard for online evaluation | Requires live users, traffic scale, statistical power we don't have | Offline batch evaluation with held-out test set |
| Custom model fine-tuning | "What if we trained our own" | Major scope expansion; need training data, infrastructure, expertise | Compare off-the-shelf models first; fine-tuning is separate spike if needed |
| Full RAG pipeline evaluation | Evaluate end-to-end generation | We're evaluating retrieval only; LLM matching is separate system | Keep retrieval and generation evaluations distinct |
| Continuous monitoring/drift detection | "Production" features | No production traffic; spike is point-in-time evaluation | One-time comprehensive benchmark; revisit if deployed |
| Complex query DSL | Power user flexibility | Adds learning curve; spike users are developers who can modify code | Simple text input; code modifications for advanced queries |
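
The alternative in the first row (pre-compute during import, measure embedding generation separately) could look roughly like this. It is a sketch only: `fake_embed` is a stand-in so the example runs without any API, the JSON file layout is an assumption, and a real import would call whichever embedding model is under test.

```python
import json
import os
import tempfile
import time

def precompute_embeddings(texts, embed, path):
    """Embed every text once, persist the vectors, and return embedding time in ms."""
    start = time.perf_counter()
    records = [{"text": text, "vector": embed(text)} for text in texts]
    embed_ms = (time.perf_counter() - start) * 1000.0
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh)
    return embed_ms

def load_embeddings(path):
    """Load stored vectors so document embedding never runs at query time."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

def fake_embed(text):
    # Stand-in for a real embedding call; NOT a real model.
    return [float(len(text)), float(text.count(" "))]

path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
embed_ms = precompute_embeddings(["Office chairs", "Cloud hosting"], fake_embed, path)
index = load_embeddings(path)
```

Keeping `embed_ms` as its own number means embedding cost can still be reported per model without contaminating the search latency figures.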

Feature Dependencies

[Embedding import/storage]
    |
    +--requires--> [Single query interface]
    |                   |
    |                   +--requires--> [Top-K retrieval]
    |                   |                   |
    |                   |                   +--enables--> [Side-by-side comparison]
    |                   |                   |
    |                   |                   +--enables--> [Batch benchmarking]
    |                   |                                      |
    |                   |                                      +--enables--> [Aggregate metrics (nDCG, MRR)]
    |                   |                                      |
    |                   |                                      +--enables--> [Score distributions]
    |                   |
    |                   +--requires--> [Similarity score display]
    |
    +--enables--> [Latency measurement]

[Ground truth data]
    |
    +--requires--> [Exact match accuracy]
    |
    +--enables--> [LLM-as-judge evaluation]
    |
    +--enables--> [Confusion matrix]

[Synthetic query generation] --independent-- [can run after initial data import]

[HTML export] --requires--> [Any metrics or visualizations to export]
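
One way to turn the diagram into a build order is a topological sort. The sketch below reads every arrow as "the feature at the tail must exist before the feature at the head", which is an interpretation of the diagram rather than something it states, and it leaves out the synthetic query generation and HTML export lines because they name no specific upstream feature.

```python
from graphlib import TopologicalSorter

# Maps each feature to the set of features that must exist before it.
DEPENDS_ON = {
    "Single query interface": {"Embedding import/storage"},
    "Top-K retrieval": {"Single query interface"},
    "Similarity score display": {"Single query interface"},
    "Side-by-side comparison": {"Top-K retrieval"},
    "Batch benchmarking": {"Top-K retrieval"},
    "Aggregate metrics (nDCG, MRR)": {"Batch benchmarking"},
    "Score distributions": {"Batch benchmarking"},
    "Latency measurement": {"Embedding import/storage"},
    "Exact match accuracy": {"Ground truth data"},
    "LLM-as-judge evaluation": {"Ground truth data"},
    "Confusion matrix": {"Ground truth data"},
}

# static_order() yields every feature after all of its prerequisites.
build_order = list(TopologicalSorter(DEPENDS_ON).static_order())

for step, feature in enumerate(build_order, start=1):
    print(f"{step:2d}. {feature}")
```

Several valid orders exist for this graph, so the printed sequence is one acceptable build order, not the only one.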

Dependency Notes

MVP Definition

Launch With (v1 - Initial Spike)

Minimum viable spike - what's needed to make the core comparison decision.

Add After Validation (v1.x)

Features to add once core comparison shows promise.

Future Consideration (v2+ / Separate Spike)

Features to defer until the core comparison has produced a decision, or to move into a separate spike.

Feature Prioritization Matrix

| Feature | User Value | Implementation Cost | Priority |
|---|---|---|---|
| Pre-computed embeddings | HIGH | MEDIUM | P1 |
| Single query interface | HIGH | LOW | P1 |
| Top-K retrieval | HIGH | LOW | P1 |
| Side-by-side comparison | HIGH | MEDIUM | P1 |
| Exact match accuracy | HIGH | LOW | P1 |
| Latency measurement | MEDIUM | LOW | P1 |
| Batch benchmarking | HIGH | MEDIUM | P1 |
| Similarity score display | MEDIUM | LOW | P1 |
| LLM-as-judge | MEDIUM | HIGH | P2 |
| nDCG/MRR metrics | MEDIUM | MEDIUM | P2 |
| Score distributions | LOW | MEDIUM | P2 |
| Cost tracking | LOW | LOW | P2 |
| HTML export | MEDIUM | MEDIUM | P2 |
| Synthetic query generation | LOW | HIGH | P3 |
| Confusion matrix | LOW | MEDIUM | P3 |

Priority key:

Competitor/Reference Tool Analysis

| Feature | MTEB Leaderboard | OpenSearch Search Quality | RAGAS | DeepEval | Our Approach |
|---|---|---|---|---|---|
| Standard benchmarks | 56 datasets, 8 tasks | Custom datasets | Synthetic generation | Test-driven | Custom dataset (historical.csv) |
| Side-by-side comparison | Leaderboard table | A/B experiment view | N/A | N/A | Per-query multi-model view |
| Metrics | Per-task scores | nDCG, MAP, Precision | Faithfulness, relevancy | 14+ metrics | Exact match, Top-K, latency, cost |
| LLM evaluation | N/A | N/A | Core feature | Core feature | Optional P2 feature |
| Interactive exploration | Web UI | Dashboard | Programmatic | Programmatic | Interactive HTML page |
| Export | CSV download | Dashboard export | JSON/logs | pytest reports | HTML report |

Key Insights from Research

What the Ecosystem Teaches Us

  1. Metrics hierarchy matters: Start with simple exact match, graduate to ranking-aware metrics (MRR, nDCG) only when needed. Most production teams use multiple metrics.

  2. LLM-as-judge is standard for semantic evaluation: When exact string match fails (synonyms, paraphrases), LLM judges provide reasonable relevance scores. Cost is manageable ($0.01-0.10 per assessment).

  3. Batch evaluation is a prerequisite for statistical validity: Single-query demos can mislead, so the comparison needs aggregate metrics over a representative test set.

  4. Visualization accelerates insight: Histograms of similarity scores, confusion matrices, and side-by-side comparisons help identify patterns faster than raw numbers.

  5. Pre-computation is best practice: The benchmark tools reviewed pre-compute embeddings to isolate search performance from embedding generation latency.
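
Insights 2 and 3 can be made concrete with a small aggregation sketch. It assumes each model's batch run has already been collected as `(retrieved_accounts, expected_account, latency_ms)` tuples; the model names, account codes, and latencies are invented, and the judge cost bounds simply multiply the $0.01-0.10 per-assessment range quoted above by the number of assessments.

```python
from statistics import mean

def accuracy_at_k(runs, k):
    """Fraction of queries whose expected GL account appears in the top k results."""
    hits = sum(1 for retrieved, expected, _ms in runs if expected in retrieved[:k])
    return hits / len(runs)

def summarize(results_by_model, ks=(3, 5, 10)):
    """Aggregate accuracy@K and mean latency per model over a whole test suite."""
    summary = {}
    for model, runs in results_by_model.items():
        row = {f"acc@{k}": accuracy_at_k(runs, k) for k in ks}
        row["mean_latency_ms"] = mean(ms for _, _, ms in runs)
        summary[model] = row
    return summary

def judge_cost_range(n_assessments, low=0.01, high=0.10):
    """Lower and upper bound in dollars for running an LLM judge n times."""
    return n_assessments * low, n_assessments * high

# Hypothetical batch results: (retrieved accounts, expected account, latency in ms).
RESULTS = {
    "model_a": [
        (["6100", "6400", "6200"], "6100", 12.0),
        (["6400", "6200", "6300", "6100"], "6100", 18.0),
    ],
    "model_b": [
        (["6400", "6100"], "6100", 40.0),
        (["6100"], "6100", 60.0),
    ],
}

summary = summarize(RESULTS)
cost_low, cost_high = judge_cost_range(n_assessments=300)
```

With only two queries per model the numbers here mean nothing statistically; the point is the shape of the aggregation, which stays the same when the test suite grows.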

What Makes Our Spike Different

Sources


Feature research for: Semantic Search Comparison Benchmark Tool
Researched: 2026-02-20