Researched: 2026-02-20
Domain: Metrics calculation, batch benchmarking, Flask dashboard, cost tracking
Confidence: HIGH
This phase focuses on quantitative evaluation of the 4 search approaches (3 embedding models + LLM baseline) with metrics calculation, batch benchmarking, and an interactive dashboard. The existing codebase already has Flask search UI, pgvector search, LLM matching, test set infrastructure, and query variations -- this phase adds measurement, aggregation, and comparison layers.
The technical stack is straightforward: Python's time.perf_counter() for high-precision latency measurement, simple accuracy calculations (exact match and top-K), dataclasses for structured results, and extending the existing Flask app with new routes. No complex libraries needed -- sklearn's accuracy_score could be used but manual calculation is simpler for this use case. Cost tracking requires storing token counts per API call and applying known pricing rates.
Primary recommendation: Build a modular metrics calculator with dataclass-based result structures, batch benchmark runner with progress streaming, and extend the existing Flask app with comparison and benchmark routes.
<user_constraints>
None -- all implementation decisions delegated to Claude.
User delegated all implementation decisions. Claude has full flexibility on:
Comparison view:
Metrics display:
Batch benchmark UX:
Dashboard interaction:
None -- discussion stayed within phase scope.
</user_constraints>
<phase_requirements>
| ID | Description | Research Support |
|---|---|---|
| SRCH-04 | Side-by-side comparison view across all 3 embedding models | Existing Flask template uses 4-column grid; extend to include metric annotations and highlighting |
| EVAL-03 | Exact match accuracy for GL account assignment | Simple calculation: (correct predictions / total predictions); compare top-1 result debit_account to ground truth |
| EVAL-04 | Exact match accuracy for cost center assignment | Same as EVAL-03 but comparing cost_center field |
| EVAL-05 | Top-K accuracy (correct answer in top 3/5/10) | Check if ground truth appears anywhere in top-K results; standard IR metric |
| EVAL-06 | Latency measurement per query (ms) | Use time.perf_counter() for high-precision timing around search calls |
| EVAL-07 | Batch benchmarking over full test set | Iterate test_query_variation table, run all 4 approaches, aggregate statistics |
| EVAL-09 | Cost tracking per query (tokens, API calls, estimated $) | Track token counts from API responses; apply pricing: Gemini Flash $0.30/$2.50 per 1M, Jina ~$0.02 per 1M, Google embeddings $0.15 per 1M |
| REPT-01 | Interactive HTML dashboard for search exploration | Extend existing Flask app with benchmark routes; use Bootstrap grid for comparison view |
</phase_requirements>
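The exact match rows (EVAL-03, EVAL-04) reduce to a single division. A minimal sketch, assuming predictions and ground truth arrive as parallel lists of strings; the function name `exact_match_accuracy` is hypothetical, and treating a missing prediction or missing ground truth as a miss is an assumption, not a decision taken in this research:

```python
from typing import Optional, Sequence


def exact_match_accuracy(
    predicted: Sequence[Optional[str]],
    ground_truth: Sequence[Optional[str]],
) -> float:
    """Fraction of queries whose top-1 prediction equals the ground truth."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not predicted:
        return 0.0
    correct = sum(
        1
        for p, g in zip(predicted, ground_truth)
        # Assumption: a missing value on either side counts as a miss.
        if p is not None and g is not None and p.strip() == g.strip()
    )
    return correct / len(predicted)


gl_accuracy = exact_match_accuracy(["6801", "4400", None], ["6801", "4410", "6801"])
```

The same function would serve cost center accuracy by passing the `cost_center` values instead of the `debit_account` values.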
| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| Flask | 3.x (existing) | Web framework | Already in use; simple extension with new routes |
| dataclasses | stdlib | Structured results | Zero-dependency, clean typing, JSON-serializable with asdict() |
| time.perf_counter | stdlib | Latency measurement | Highest-resolution clock available; monotonic, so unaffected by system clock adjustments |
| statistics | stdlib | Aggregate calculations | mean, stdev, median built-in |
| Library | Version | Purpose | When to Use |
|---|---|---|---|
| orjson | 3.x | Fast JSON export | Benchmark result export; native dataclass support |
| tqdm | existing | Progress bars | CLI progress during batch runs |
| Instead of | Could Use | Tradeoff |
|---|---|---|
| Manual accuracy | sklearn.metrics.accuracy_score | Adds dependency for simple division; not worth it |
| Server-Sent Events | Polling | SSE more complex but real-time; polling simpler for batch progress |
| Complex charting | HTML tables | Charts require JS libraries; tables sufficient for spike |
Installation:
```bash
pip install orjson  # Only if JSON export needed
# All other dependencies already installed
```
```
src/
├── evaluation/
│   ├── __init__.py
│   ├── train_test_split.py   # existing
│   ├── query_variations.py   # existing
│   ├── metrics.py            # NEW: accuracy, top-k calculation
│   ├── benchmark.py          # NEW: batch runner with timing
│   └── cost_tracker.py       # NEW: token/cost aggregation
├── search/
│   ├── pgvector_search.py    # existing (add timing)
│   └── llm_matching.py       # existing (add token tracking)
├── templates/
│   ├── search.html           # existing
│   ├── benchmark.html        # NEW: batch benchmark UI
│   └── results.html          # NEW: aggregate results view
└── app.py                    # extend with benchmark routes
```
What: Use dataclasses for all metric/result structures
When to use: Any structured data that needs to be aggregated or serialized
Example:
```python
# Source: Python dataclasses stdlib + orjson compatibility
from dataclasses import dataclass, asdict, field
from typing import Optional
import time


@dataclass
class SearchResult:
    """Single search result with timing."""
    model: str  # 'google', 'jina', 'minilm', 'llm'
    query_id: int
    latency_ms: float
    predicted_debit_account: Optional[str]
    predicted_cost_center: Optional[str]
    ground_truth_debit_account: str
    ground_truth_cost_center: Optional[str]
    top_k_debit_accounts: list[str] = field(default_factory=list)
    similarity_scores: list[float] = field(default_factory=list)
    # Cost tracking
    tokens_input: int = 0
    tokens_output: int = 0
    api_calls: int = 0


@dataclass
class BenchmarkResults:
    """Aggregated benchmark statistics."""
    model: str
    total_queries: int
    exact_match_gl: float  # 0-1
    exact_match_cc: float  # 0-1
    top_3_accuracy: float
    top_5_accuracy: float
    top_10_accuracy: float
    latency_mean_ms: float
    latency_p50_ms: float
    latency_p95_ms: float
    total_tokens: int
    total_cost_usd: float
```
What: Consistent high-precision timing with context manager
When to use: Wrapping any search operation
Example:
```python
# Source: Python time module, PEP 418
from contextlib import contextmanager
import time


@contextmanager
def measure_latency():
    """Context manager for high-precision timing."""
    start = time.perf_counter()
    result = {'latency_ms': 0.0}
    try:
        yield result
    finally:
        end = time.perf_counter()
        result['latency_ms'] = (end - start) * 1000  # Convert to ms
```
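A short usage sketch for the timing context manager. The manager is repeated so the block runs on its own; `fake_search` is a hypothetical stand-in for a real search call. Note that the latency is only filled in once the `with` block has exited.

```python
import time
from contextlib import contextmanager


@contextmanager
def measure_latency():
    """Same pattern as above, repeated so this block is self-contained."""
    start = time.perf_counter()
    result = {"latency_ms": 0.0}
    try:
        yield result
    finally:
        result["latency_ms"] = (time.perf_counter() - start) * 1000


def fake_search(query: str) -> list[str]:
    """Hypothetical stand-in for a pgvector or LLM search call."""
    time.sleep(0.01)
    return [query.upper()]


with measure_latency() as timing:
    hits = fake_search("office supplies")

# timing["latency_ms"] is set in the finally clause, i.e. after the block exits.
print(f"{len(hits)} hit(s) in {timing['latency_ms']:.1f} ms")
```

Because the value is written in `finally`, a search that raises still leaves a usable latency in `timing`, which can help when logging failed queries.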
What: Stream progress updates during long-running benchmarks
When to use: Batch benchmark runs that take >10 seconds
Example:
```python
# Source: Flask streaming documentation
import json
from dataclasses import asdict

from flask import Response, stream_with_context

# Assumes `app` (the existing Flask app), get_test_count() and
# run_benchmark_iterator() are defined elsewhere in the project.


def benchmark_stream():
    """Generator for streaming benchmark progress."""
    total = get_test_count()
    for i, result in enumerate(run_benchmark_iterator()):
        progress = {
            'current': i + 1,
            'total': total,
            # Guard against an empty test set to avoid division by zero
            'percent': round((i + 1) / total * 100, 1) if total else 0.0,
            'last_result': asdict(result),
        }
        yield f"data: {json.dumps(progress)}\n\n"


@app.route('/benchmark/stream')
def benchmark_sse():
    """Server-Sent Events endpoint for benchmark progress."""
    return Response(
        stream_with_context(benchmark_stream()),
        mimetype='text/event-stream',
    )
```
perf_counter() calls at the outermost layer, not inside search functions

| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| Percentile calculation | Custom implementation | statistics.quantiles(data, n=100)[94] for p95 | Edge cases with small samples |
| JSON serialization | Manual dict conversion | dataclasses.asdict() or orjson | Handles nested structures correctly |
| Progress streaming | WebSocket implementation | Flask SSE with generators | Much simpler for unidirectional updates |
| Query timing | time.time() | time.perf_counter() | time.time() can jump backward on clock sync |
Key insight: The evaluation logic itself is simple arithmetic. The complexity is in data flow (iterating test set, aggregating results, streaming progress) not in the metrics math.
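As an illustration of the percentile row in the table, here is a minimal sketch built on `statistics.quantiles`. The wrapper name `p95` and its fallbacks for empty or single-value input are assumptions added for the example. With the default method, very small samples may produce surprising values, which is the edge case the table alludes to.

```python
import statistics


def p95(latencies_ms: list[float]) -> float:
    """95th percentile via the stdlib; falls back for tiny inputs."""
    if not latencies_ms:
        return 0.0
    if len(latencies_ms) < 2:
        # Assumption: with a single sample, report that sample.
        return latencies_ms[0]
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(latencies_ms, n=100)[94]


sample = [float(x) for x in range(1, 101)]  # 1.0 .. 100.0 ms
cut_points = statistics.quantiles(sample, n=100)
p50_value = cut_points[49]
p95_value = p95(sample)
```

`statistics.quantiles` also accepts `method="inclusive"`, which may be the better fit if the latencies are treated as the whole population rather than a sample.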
What goes wrong: First query takes much longer due to model loading, connection warming
Why it happens: Lazy initialization of embedding models, database connections
How to avoid: Run 1-2 warmup queries before timing; exclude warmup from statistics
Warning signs: First query 5-10x slower than subsequent queries
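One way to keep cold-start cost out of the statistics is to issue a few untimed calls first. A minimal sketch; `timed_runs` and its `warmup` parameter are hypothetical names, and `search_fn` stands for whichever search callable is being benchmarked:

```python
import time
from typing import Callable, Sequence


def timed_runs(
    search_fn: Callable[[str], object],
    queries: Sequence[str],
    warmup: int = 2,
) -> list[float]:
    """Return per-query latencies in ms, preceded by untimed warmup calls."""
    for query in queries[:warmup]:
        # Untimed: triggers lazy model loading and connection setup.
        search_fn(query)
    latencies_ms = []
    for query in queries:
        start = time.perf_counter()
        search_fn(query)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms
```

The warmup queries are re-run in the timed loop, so every test query still contributes exactly one measurement.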
What goes wrong: API rate limits cause failures mid-benchmark
Why it happens: Rapid sequential API calls to Jina, Google, Gemini
How to avoid: Add configurable delays between API calls; handle 429 with exponential backoff
Warning signs: Sporadic HTTP 429 errors, incomplete benchmark runs
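The backoff advice can be sketched without tying it to a particular client library. In the block below, `RateLimitError` is a hypothetical exception standing in for whatever each provider's client raises on HTTP 429, and the injectable `sleep` argument is an assumption added to keep the helper testable:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with exponentially growing delays."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    attempt = 0
    while True:
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # Retries exhausted: surface the error to the caller.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Small random jitter so repeated runs do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
            attempt += 1
```

A real integration would translate each client's rate-limit response into this exception (or catch the client's own exception type) at the call site.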
What goes wrong: OOM when storing all SearchResult objects for 1000+ queries
Why it happens: Accumulating result objects in list
How to avoid: Aggregate incrementally; only keep running statistics
Warning signs: Memory usage growing linearly with test set size
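Incremental aggregation could look like the following constant-memory accumulator. The class name `RunningStats` and its fields are hypothetical; it is a sketch of the idea rather than the planned `benchmark.py` design:

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Constant-memory accumulator for one model's benchmark run."""
    total: int = 0
    exact_hits: int = 0
    latency_sum_ms: float = 0.0
    latency_max_ms: float = 0.0

    def update(self, is_exact_match: bool, latency_ms: float) -> None:
        """Fold one query's outcome into the running totals."""
        self.total += 1
        self.exact_hits += int(is_exact_match)
        self.latency_sum_ms += latency_ms
        self.latency_max_ms = max(self.latency_max_ms, latency_ms)

    @property
    def exact_match(self) -> float:
        return self.exact_hits / self.total if self.total else 0.0

    @property
    def latency_mean_ms(self) -> float:
        return self.latency_sum_ms / self.total if self.total else 0.0


stats = RunningStats()
for matched, latency in [(True, 10.0), (False, 20.0), (True, 30.0)]:
    stats.update(matched, latency)
```

Percentiles (p50, p95) still need the individual latency values, but a plain list of floats stays small even for thousands of queries; it is the full result objects that are worth discarding.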
What goes wrong: "6801" != "6801.0" fails exact match
Why it happens: Mixed string/numeric types from database
How to avoid: Normalize both predicted and ground truth to same format before comparison
Warning signs: Low accuracy despite visually correct predictions
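A normalization helper along these lines would make the comparison type-agnostic. The name `normalize_code` is hypothetical, and the rule it applies (strip whitespace, drop an all-zero fractional part, keep everything else including leading zeros) is an assumption about how account codes are stored:

```python
import re
from typing import Optional

# Matches digits followed by a fractional part made only of zeros, e.g. "6801.0".
_ZERO_FRACTION = re.compile(r"^(\d+)\.0+$")


def normalize_code(value: object) -> Optional[str]:
    """Return a canonical string for an account or cost center code."""
    if value is None:
        return None
    text = str(value).strip()
    if not text:
        return None
    match = _ZERO_FRACTION.match(text)
    # Leading zeros and non-numeric codes are left untouched on purpose.
    return match.group(1) if match else text
```

Applying the same function to both the prediction and the ground truth before comparing keeps the accuracy figure independent of how the database driver types the column.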
What goes wrong: LLM returns single prediction vs pgvector returns top-K list
Why it happens: Different result structures by design
How to avoid: Normalize result handling; LLM gets only exact match (top-1), pgvector gets top-K
Warning signs: Comparing apples to oranges in accuracy calculations
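One possible way to keep the two result shapes comparable is to report top-K as "not applicable" for approaches that return a single prediction. A minimal sketch; `metrics_row` and the `supports_top_k` flag are hypothetical names:

```python
from typing import Optional

TOP_K_VALUES = (3, 5, 10)


def metrics_row(
    ranked: list[str],
    truth: str,
    supports_top_k: bool,
) -> dict[str, Optional[bool]]:
    """Per-query hit flags; top-K entries are None when they do not apply."""
    row: dict[str, Optional[bool]] = {
        "exact": bool(ranked) and ranked[0] == truth,
    }
    for k in TOP_K_VALUES:
        # None (rather than False) keeps single-prediction models out of top-K averages.
        row[f"top_{k}"] = (truth in ranked[:k]) if supports_top_k else None
    return row


llm_row = metrics_row(["6801"], "6801", supports_top_k=False)
vector_row = metrics_row(["4400", "6801", "6802"], "6801", supports_top_k=True)
```

Aggregation code would then skip `None` entries instead of counting them as misses.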
Verified patterns from official sources:
```python
# Source: https://docs.python.org/3/library/time.html#time.perf_counter
import time


def search_with_timing(conn, query: str, k: int = 5) -> tuple[dict, float]:
    """Execute search and return results with latency in ms."""
    start = time.perf_counter()
    results = search_all_models(conn, query, k)
    end = time.perf_counter()
    latency_ms = (end - start) * 1000
    return results, latency_ms
```
```python
# Source: Standard IR metrics (Pinecone, Weaviate docs)
def calculate_top_k_accuracy(results: list[SearchResult], k: int) -> float:
    """
    Calculate accuracy where correct answer appears in top-K.

    Args:
        results: List of SearchResult with top_k_debit_accounts and ground_truth
        k: Number of top results to check

    Returns:
        Accuracy as float 0-1
    """
    if not results:
        return 0.0
    hits = 0
    for r in results:
        ground_truth = str(r.ground_truth_debit_account).strip()
        top_k = [str(x).strip() for x in r.top_k_debit_accounts[:k]]
        if ground_truth in top_k:
            hits += 1
    return hits / len(results)
```
```python
# Source: Python statistics module
import statistics


def calculate_latency_stats(latencies_ms: list[float]) -> dict:
    """Calculate latency statistics."""
    if not latencies_ms:
        # Same keys as the populated result so callers can index safely
        return {'mean': 0, 'median': 0, 'p95': 0, 'min': 0, 'max': 0, 'stdev': 0}
    sorted_latencies = sorted(latencies_ms)
    n = len(sorted_latencies)
    return {
        'mean': statistics.mean(latencies_ms),
        'median': statistics.median(latencies_ms),
        # Below 20 samples there is no meaningful p95; fall back to the max
        'p95': sorted_latencies[int(n * 0.95)] if n >= 20 else sorted_latencies[-1],
        'min': min(latencies_ms),
        'max': max(latencies_ms),
        'stdev': statistics.stdev(latencies_ms) if n > 1 else 0,
    }
```
```python
# Source: Gemini API pricing docs, Jina pricing page
# Prices as of 2026-02
PRICING = {
    'gemini_flash': {
        'input_per_1m': 0.30,   # USD per 1M input tokens
        'output_per_1m': 2.50,  # USD per 1M output tokens
    },
    'google_embedding': {
        'input_per_1m': 0.15,   # USD per 1M tokens
    },
    'jina_embedding': {
        'input_per_1m': 0.02,   # USD per 1M tokens (approximate)
    },
    'minilm': {
        'input_per_1m': 0.0,    # Local model, no API cost
    },
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int = 0) -> float:
    """Calculate USD cost for API call."""
    pricing = PRICING.get(model, {})
    input_cost = (input_tokens / 1_000_000) * pricing.get('input_per_1m', 0)
    output_cost = (output_tokens / 1_000_000) * pricing.get('output_per_1m', 0)
    return input_cost + output_cost
```
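To make the arithmetic concrete, here is a worked example at the Gemini Flash rates quoted above. The token counts (1,200 input, 150 output per query) are purely illustrative assumptions, not measurements, and `cost_usd` is a hypothetical compact variant of the cost function:

```python
# Rates quoted in this document (USD per 1M tokens).
GEMINI_FLASH = {"input_per_1m": 0.30, "output_per_1m": 2.50}


def cost_usd(input_tokens: int, output_tokens: int, rates: dict) -> float:
    """USD cost of one call given per-1M-token rates."""
    return (
        input_tokens * rates["input_per_1m"]
        + output_tokens * rates.get("output_per_1m", 0.0)
    ) / 1_000_000


# Illustrative token counts, assumed for the example only.
per_query = cost_usd(1_200, 150, GEMINI_FLASH)  # (360 + 375) / 1e6 USD
per_1000_queries = per_query * 1_000

print(f"per query: ${per_query:.6f}, per 1000 queries: ${per_1000_queries:.3f}")
```

At these assumed sizes the output tokens account for slightly more than half of the cost despite being far fewer, which is why tracking input and output counts separately matters.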
| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| time.time() | time.perf_counter() | Python 3.3+ | Monotonic, higher precision timing |
| Custom JSON export | orjson with dataclass support | 2024+ | 10x faster, native dataclass handling |
| Polling for progress | Server-Sent Events | 2020s | Real-time updates without WebSocket complexity |
| Flask sync only | Flask 2.0+ async support | 2021 | Async routes available but not needed here |
Deprecated/outdated:
time.clock(): Removed in Python 3.8, replaced by perf_counter()

Token counting for Gemini/Jina API calls
usage_metadata with token counts; need to verify Jina response structure

Test set size vs benchmark duration

Cost center accuracy handling
Confidence breakdown:
Research date: 2026-02-20
Valid until: 2026-03-20 (30 days - stable domain, no fast-moving dependencies)