Domain: Semantic Search Comparison/Benchmark Tool (spike for invoice line item matching)

Researched: 2026-02-20

Confidence: MEDIUM (based on multiple web sources and industry patterns)
Features users assume exist. Missing these = tool feels incomplete for its purpose.
| Feature | Why Expected | Complexity | Notes |
|---|---|---|---|
| Single query search interface | Basic usage pattern; need to test individual queries interactively | LOW | Text input + submit button; display results with scores |
| Multiple embedding model support | Core purpose of comparison; PROJECT.md specifies 3 models | MEDIUM | Abstraction layer for model switching; pre-computed embeddings |
| Top-K result retrieval | Standard retrieval interface; adjustable K for precision/recall tradeoff | LOW | K=3,5,10 as per PROJECT.md requirements |
| Similarity score display | Users need to understand why results ranked as shown | LOW | Show cosine similarity or distance metric per result |
| Exact match accuracy metric | Primary success criterion; does retrieved GL account match ground truth | LOW | Boolean comparison of predicted vs actual account/cost center |
| Latency measurement | Performance is key comparison criterion; sub-100ms matters | LOW | Timer around query execution; display ms per query |
| Result list visualization | Core output display; show retrieved documents with context | LOW | Ordered list with rank, score, document text, GL account |
| Ground truth comparison | Need reference to evaluate quality; synthetic test framework | MEDIUM | Link query to expected answer; highlight match/mismatch |
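
The core of the table above (top-K retrieval, similarity scores, and exact match against ground truth) can be sketched in a few lines. This is a minimal illustration using only the standard library; the corpus rows, field names (`text`, `gl_account`, `embedding`), account codes, and toy 3-dimensional vectors are all hypothetical, and a real spike would load pre-computed embeddings for each model instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as having no similarity
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus, k=3):
    """Return the k highest-scoring corpus rows as (score, row) pairs."""
    scored = [(cosine_similarity(query_vec, row["embedding"]), row) for row in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

def exact_match_at_k(results, expected_account):
    """True if any retrieved row carries the ground-truth GL account."""
    return any(row["gl_account"] == expected_account for _, row in results)

# Hypothetical pre-computed embeddings (toy vectors, made-up account codes).
corpus = [
    {"text": "Office chairs", "gl_account": "6100", "embedding": [1.0, 0.0, 0.0]},
    {"text": "Laptop purchase", "gl_account": "6200", "embedding": [0.0, 1.0, 0.0]},
    {"text": "Desk lamps", "gl_account": "6100", "embedding": [0.9, 0.1, 0.0]},
]

results = top_k([1.0, 0.05, 0.0], corpus, k=2)
for rank, (score, row) in enumerate(results, start=1):
    print(rank, f"{score:.4f}", row["text"], row["gl_account"])
print("match:", exact_match_at_k(results, "6100"))
```

A brute-force scan like this is usually adequate at spike scale; an approximate nearest-neighbour index would only matter for a much larger corpus.
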
Features that set the product apart. Not required, but valuable for a comprehensive spike.
| Feature | Value Proposition | Complexity | Notes |
|---|---|---|---|
| Side-by-side model comparison | Core differentiation - see all 3 models simultaneously for same query | MEDIUM | Multi-column layout; same query to all models; visual diff of results |
| LLM-as-judge evaluation | Automated quality assessment without manual labeling; catches semantic equivalents | HIGH | LLM API call per evaluation; prompt engineering for relevance scoring |
| Batch query benchmarking | Run entire test suite at once; statistical validity | MEDIUM | Loop over test queries; aggregate metrics; progress indicator |
| Aggregate metrics dashboard (nDCG, MRR) | Industry-standard IR metrics beyond simple accuracy | MEDIUM | Ranking-aware metrics; implementation from evaluation libraries |
| Score distribution histograms | Understand model confidence patterns; identify decision thresholds | MEDIUM | Charting library; binned similarity scores; per-model comparison |
| Synthetic query generation | Create test variations without manual effort; robustness testing | HIGH | LLM-based paraphrasing; variation types (synonyms, reordering, typos) |
| Cost tracking per query | API costs are real constraint; compare embedding vs LLM costs | LOW | Token counting; price per model; cumulative totals |
| Hand-picked example showcase | Curated evidence for decision-making; report-ready outputs | LOW | Save/tag interesting queries; export selection |
| HTML export/report generation | Share findings with stakeholders who won't run the tool | MEDIUM | Static HTML generation; embedded charts; summary statistics |
| Confusion matrix for GL accounts | See where models systematically fail; error pattern analysis | MEDIUM | Multi-class confusion matrix; heatmap visualization |
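
For the nDCG and MRR row above, the following sketch shows what the ranking-aware metrics compute. It assumes each query's results have already been reduced to a list of relevance labels in rank order (1 for a correct GL account, 0 otherwise), and it normalises nDCG against the best ordering of the retrieved list only, which is a simplification; an evaluation library may normalise against all relevant documents in the corpus.

```python
import math

def reciprocal_rank(relevances):
    """1/rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(all_relevances):
    """Average reciprocal rank over a batch of queries."""
    return sum(reciprocal_rank(r) for r in all_relevances) / len(all_relevances)

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels for three queries, in rank order.
runs = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
print("MRR:", mean_reciprocal_rank(runs))
print("nDCG@3 per query:", [round(ndcg_at_k(r, 3), 4) for r in runs])
```

With binary labels and a single correct account per query, MRR and nDCG tend to move together, which is one reason exact match may be enough for a first pass.
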
Features that seem good but create problems for a spike project.
| Feature | Why Requested | Why Problematic | Alternative |
|---|---|---|---|
| Real-time embedding generation | Feels more "realistic" | Adds latency to every query; masks true search performance; API costs during exploration | Pre-compute all embeddings during import; measure embedding generation separately |
| User authentication/multi-tenant | "Production-ready" thinking | Spike scope creep; single dataset; adds complexity with no value | Single-user local tool; explicit scope from PROJECT.md |
| A/B testing / interleaving experiments | Gold standard for online evaluation | Requires live users, traffic scale, statistical power we don't have | Offline batch evaluation with held-out test set |
| Custom model fine-tuning | "What if we trained our own" | Major scope expansion; need training data, infrastructure, expertise | Compare off-the-shelf models first; fine-tuning is separate spike if needed |
| Full RAG pipeline evaluation | Evaluate end-to-end generation | We're evaluating retrieval only; LLM matching is separate system | Keep retrieval and generation evaluations distinct |
| Continuous monitoring/drift detection | "Production" features | No production traffic; spike is point-in-time evaluation | One-time comprehensive benchmark; revisit if deployed |
| Complex query DSL | Power user flexibility | Adds learning curve; spike users are developers who can modify code | Simple text input; code modifications for advanced queries |
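
The first row above recommends measuring embedding generation separately from search. One way to do that is to time each stage with its own timer, as in the sketch below. `embed_query` and `search` are hypothetical stand-ins for the real embedding call and vector search, and the median over a few repeats is just one reasonable way to damp timer noise.

```python
import statistics
import time

def time_call(fn, *args, repeats=5):
    """Run fn(*args) `repeats` times; return (last result, median latency in ms)."""
    samples = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    return result, statistics.median(samples)

# Hypothetical stand-in for an embedding API call.
def embed_query(text):
    return [float(len(text)), 1.0]

# Hypothetical stand-in for vector search: nearest rows by first component.
def search(query_vec, index):
    return sorted(index, key=lambda row: abs(row["vec"][0] - query_vec[0]))[:3]

index = [{"id": i, "vec": [float(i), 1.0]} for i in range(100)]

query_vec, embed_ms = time_call(embed_query, "office chairs")
hits, search_ms = time_call(search, query_vec, index)
print(f"embedding: {embed_ms:.3f} ms, search: {search_ms:.3f} ms")
```

Reporting the two numbers separately keeps a slow embedding API from being mistaken for slow retrieval.
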

    [Embedding import/storage]
        |
        +--requires--> [Single query interface]
        |                  |
        |                  +--requires--> [Top-K retrieval]
        |                  |                  |
        |                  |                  +--enables--> [Side-by-side comparison]
        |                  |                  |
        |                  |                  +--enables--> [Batch benchmarking]
        |                  |                  |
        |                  |                  +--enables--> [Aggregate metrics (nDCG, MRR)]
        |                  |                  |
        |                  |                  +--enables--> [Score distributions]
        |                  |
        |                  +--requires--> [Similarity score display]
        |
        +--enables--> [Latency measurement]

    [Ground truth data]
        |
        +--requires--> [Exact match accuracy]
        |
        +--enables--> [LLM-as-judge evaluation]
        |
        +--enables--> [Confusion matrix]

    [Synthetic query generation] --independent-- [can run after initial data import]

    [HTML export] --requires--> [Any metrics or visualizations to export]

Minimum viable spike - what's needed to make the core comparison decision.
Features to add once core comparison shows promise.
Features to defer until product-market fit is established.
| Feature | User Value | Implementation Cost | Priority |
|---|---|---|---|
| Pre-computed embeddings | HIGH | MEDIUM | P1 |
| Single query interface | HIGH | LOW | P1 |
| Top-K retrieval | HIGH | LOW | P1 |
| Side-by-side comparison | HIGH | MEDIUM | P1 |
| Exact match accuracy | HIGH | LOW | P1 |
| Latency measurement | MEDIUM | LOW | P1 |
| Batch benchmarking | HIGH | MEDIUM | P1 |
| Similarity score display | MEDIUM | LOW | P1 |
| LLM-as-judge | MEDIUM | HIGH | P2 |
| nDCG/MRR metrics | MEDIUM | MEDIUM | P2 |
| Score distributions | LOW | MEDIUM | P2 |
| Cost tracking | LOW | LOW | P2 |
| HTML export | MEDIUM | MEDIUM | P2 |
| Synthetic query generation | LOW | HIGH | P3 |
| Confusion matrix | LOW | MEDIUM | P3 |
Priority key:
| Feature | MTEB Leaderboard | OpenSearch Search Quality | RAGAS | DeepEval | Our Approach |
|---|---|---|---|---|---|
| Standard benchmarks | 56 datasets, 8 tasks | Custom datasets | Synthetic generation | Test-driven | Custom dataset (historical.csv) |
| Side-by-side comparison | Leaderboard table | A/B experiment view | N/A | N/A | Per-query multi-model view |
| Metrics | Per-task scores | nDCG, MAP, Precision | Faithfulness, relevancy | 14+ metrics | Exact match, Top-K, latency, cost |
| LLM evaluation | N/A | N/A | Core feature | Core feature | Optional P2 feature |
| Interactive exploration | Web UI | Dashboard | Programmatic | Programmatic | Interactive HTML page |
| Export | CSV download | Dashboard export | JSON/logs | pytest reports | HTML report |
Metrics hierarchy matters: Start with simple exact match, graduate to ranking-aware metrics (MRR, nDCG) only when needed. Most production teams use multiple metrics.
LLM-as-judge is standard for semantic evaluation: When exact string match fails (synonyms, paraphrases), LLM judges provide reasonable relevance scores. Cost is manageable ($0.01-0.10 per assessment).
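
To see what that per-assessment range means for a full run, the arithmetic is simply queries × K × models × unit cost. The sketch below assumes every retrieved result is judged individually; the 200-query suite size is a hypothetical figure, while K=5 and the 3 models come from the requirements noted earlier.

```python
def judge_cost_range(num_queries, top_k, num_models, low_usd=0.01, high_usd=0.10):
    """Return (assessments, low estimate, high estimate) for judging every result."""
    assessments = num_queries * top_k * num_models
    return assessments, assessments * low_usd, assessments * high_usd

# 200 test queries is a hypothetical suite size.
assessments, low, high = judge_cost_range(num_queries=200, top_k=5, num_models=3)
print(f"{assessments} assessments: ${low:.2f} to ${high:.2f}")
```

Under these assumptions a full run lands at roughly $30 to $300, so judging only the top-1 result, or only the queries where exact match fails, would cut the bill considerably.
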
Batch evaluation is a prerequisite for statistical validity: Single-query demos are misleading. The comparison needs aggregate metrics over a representative test set.
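
A batch run reduces to collecting one record per (model, query) pair and aggregating afterwards. The sketch below assumes a hypothetical record shape with `model`, `hit` (whether the ground-truth account appeared in the top K), and `latency_ms`; the model names and numbers are invented for illustration.

```python
import statistics

def summarize(records):
    """Aggregate per-query records into per-model summary metrics."""
    by_model = {}
    for rec in records:
        by_model.setdefault(rec["model"], []).append(rec)

    summary = {}
    for model, recs in by_model.items():
        latencies = [r["latency_ms"] for r in recs]
        summary[model] = {
            "queries": len(recs),
            "hit_rate": sum(1 for r in recs if r["hit"]) / len(recs),
            "median_latency_ms": statistics.median(latencies),
            "max_latency_ms": max(latencies),
        }
    return summary

# Hypothetical per-query results from a batch run.
records = [
    {"model": "model_a", "hit": True, "latency_ms": 12.0},
    {"model": "model_a", "hit": False, "latency_ms": 15.0},
    {"model": "model_a", "hit": True, "latency_ms": 11.0},
    {"model": "model_a", "hit": True, "latency_ms": 40.0},
    {"model": "model_b", "hit": True, "latency_ms": 30.0},
    {"model": "model_b", "hit": False, "latency_ms": 20.0},
]

summary = summarize(records)
for model, stats in summary.items():
    print(model, stats)
```

Keeping the raw records around, rather than only the summary, also makes it easy to pull out individual failures for the hand-picked example showcase.
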
Visualization accelerates insight: Histograms of similarity scores, confusion matrices, and side-by-side comparisons help identify patterns faster than raw numbers.
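
Before reaching for a charting library, a score histogram can be prototyped as plain bin counts. This sketch assumes similarity scores fall in a configurable range (0 to 1 by default, though cosine similarity can be negative) and uses invented scores; the text rendering is only a placeholder for a real chart.

```python
def histogram(scores, bins=5, low=0.0, high=1.0):
    """Count scores into `bins` equal-width buckets over [low, high]."""
    counts = [0] * bins
    width = (high - low) / bins
    for s in scores:
        if s < low or s > high:
            continue  # ignore out-of-range scores
        # Clamp so that s == high falls into the last bucket.
        idx = min(int((s - low) / width), bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical top-1 similarity scores for one model.
scores = [0.95, 0.91, 0.88, 0.42, 0.45, 0.15]
counts = histogram(scores, bins=5)

for i, count in enumerate(counts):
    lo_edge = i * 0.2
    print(f"{lo_edge:.1f}-{lo_edge + 0.2:.1f} | {'#' * count}")
```

Plotting correct and incorrect matches as separate series on the same bins is what makes a usable decision threshold visible, if one exists.
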
Pre-computation is best practice: Benchmark tools typically pre-compute embeddings so that measured search latency is isolated from embedding generation latency.