Create metrics calculator and batch benchmarking infrastructure for evaluating all 4 search approaches.
Purpose: Enable quantitative comparison of embedding search vs LLM matching with accuracy, latency, and cost metrics.
Output: Python modules for metrics calculation, benchmark execution, and cost tracking.
<execution_context>
@./.claude/get-shit-done/workflows/execute-plan.md
@./.claude/get-shit-done/templates/summary.md
</execution_context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/04-evaluation-dashboard/04-RESEARCH.md
@src/search/pgvector_search.py
@src/search/llm_matching.py
@src/evaluation/train_test_split.py
@src/evaluation/query_variations.py
Task 1: Create metrics module with dataclass structures and accuracy calculations
src/evaluation/metrics.py, src/evaluation/cost_tracker.py
Create src/evaluation/metrics.py with:
- SearchResult dataclass:
  - model: str (google, jina, minilm, llm)
  - query_id: int
  - latency_ms: float
  - predicted_debit_account: Optional[str]
  - predicted_cost_center: Optional[str]
  - ground_truth_debit_account: str
  - ground_truth_cost_center: Optional[str]
  - top_k_debit_accounts: list[str] (for embedding results)
  - similarity_scores: list[float]
  - tokens_input: int = 0
  - tokens_output: int = 0
  - api_calls: int = 0
- BenchmarkResults dataclass:
  - model: str
  - total_queries: int
  - exact_match_gl: float (0-1)
  - exact_match_cc: float (0-1)
  - top_3_accuracy: float
  - top_5_accuracy: float
  - top_10_accuracy: float
  - latency_mean_ms: float
  - latency_p50_ms: float
  - latency_p95_ms: float
  - total_tokens: int
  - total_cost_usd: float
- Functions:
  - calculate_exact_match_accuracy(results: list[SearchResult], field: str) -> float
    Normalize both predicted and ground truth to a canonical string before comparing, so that "6801" and "6801.0" count as equal (a plain str() cast leaves them different, so strip the trailing ".0").
    NULL == NULL counts as a correct match for cost_center.
  - calculate_top_k_accuracy(results: list[SearchResult], k: int) -> float
    Check whether ground_truth_debit_account appears in top_k_debit_accounts[:k].
    Apply the same string normalization to all values before comparing.
  - calculate_latency_stats(latencies: list[float]) -> dict
    Return mean, median (p50), p95, min, max, stdev using the statistics module.
    For p95: sorted_latencies[int(n * 0.95)] if n >= 20 else sorted_latencies[-1].
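The metric functions above could be sketched roughly as follows. This is a minimal illustration, not the final module: the `_normalize` helper, the `predicted_<field>` / `ground_truth_<field>` attribute lookup, and the handling of empty inputs are assumptions that the plan does not spell out.

```python
import re
import statistics
from dataclasses import dataclass, field as dc_field  # alias avoids clashing with the `field` argument
from typing import Optional


@dataclass
class SearchResult:
    model: str
    query_id: int
    latency_ms: float
    predicted_debit_account: Optional[str]
    predicted_cost_center: Optional[str]
    ground_truth_debit_account: str
    ground_truth_cost_center: Optional[str]
    top_k_debit_accounts: list[str] = dc_field(default_factory=list)
    similarity_scores: list[float] = dc_field(default_factory=list)
    tokens_input: int = 0
    tokens_output: int = 0
    api_calls: int = 0


def _normalize(value) -> Optional[str]:
    """Hypothetical helper: None stays None; '6801.0' and 6801.0 become '6801'."""
    if value is None:
        return None
    text = str(value).strip()
    match = re.fullmatch(r"(-?\d+)\.0+", text)
    return match.group(1) if match else text


def calculate_exact_match_accuracy(results: list[SearchResult], field: str) -> float:
    # Assumption: `field` is "debit_account" or "cost_center".
    if not results:
        return 0.0
    hits = sum(
        _normalize(getattr(r, f"predicted_{field}"))
        == _normalize(getattr(r, f"ground_truth_{field}"))  # None == None counts as a match
        for r in results
    )
    return hits / len(results)


def calculate_top_k_accuracy(results: list[SearchResult], k: int) -> float:
    if not results:
        return 0.0
    hits = sum(
        _normalize(r.ground_truth_debit_account)
        in [_normalize(a) for a in r.top_k_debit_accounts[:k]]
        for r in results
    )
    return hits / len(results)


def calculate_latency_stats(latencies: list[float]) -> dict:
    if not latencies:
        raise ValueError("latencies must not be empty")
    ordered = sorted(latencies)
    n = len(ordered)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        # Index rule taken from the plan; small samples fall back to the max.
        "p95": ordered[int(n * 0.95)] if n >= 20 else ordered[-1],
        "min": ordered[0],
        "max": ordered[-1],
        "stdev": statistics.stdev(ordered) if n > 1 else 0.0,
    }
```

The regex-based normalization only strips a trailing `.0`, so account codes with leading zeros are preserved, which a float round-trip would not guarantee.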
Create src/evaluation/cost_tracker.py with:
- PRICING dict (USD per 1M tokens):
  - gemini_flash: input_per_1m=0.30, output_per_1m=2.50
  - google_embedding: input_per_1m=0.15
  - jina_embedding: input_per_1m=0.02
  - minilm: input_per_1m=0.0 (local, no API cost)
- calculate_cost(model: str, input_tokens: int, output_tokens: int = 0) -> float
  Look up the model's rates in PRICING and return (input_tokens * input_per_1m + output_tokens * output_per_1m) / 1,000,000 in USD.
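A minimal sketch of the cost tracker, assuming the function is called with the PRICING keys directly. How the SearchResult model names (google, jina, minilm, llm) map onto these keys is not specified in the plan, and raising ValueError for an unknown model is likewise an assumption.

```python
# Prices in USD per 1M tokens, as listed in the plan.
PRICING = {
    "gemini_flash": {"input_per_1m": 0.30, "output_per_1m": 2.50},
    "google_embedding": {"input_per_1m": 0.15},
    "jina_embedding": {"input_per_1m": 0.02},
    "minilm": {"input_per_1m": 0.0},  # local model, no API cost
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int = 0) -> float:
    """Estimate the USD cost of a call from its token counts."""
    if model not in PRICING:
        # Assumption: unknown models are an error rather than free.
        raise ValueError(f"Unknown model: {model!r}")
    rates = PRICING[model]
    input_cost = input_tokens * rates["input_per_1m"]
    # Embedding models have no output price, so default it to zero.
    output_cost = output_tokens * rates.get("output_per_1m", 0.0)
    return (input_cost + output_cost) / 1_000_000
```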
python -c "from src.evaluation.metrics import SearchResult, BenchmarkResults, calculate_exact_match_accuracy, calculate_top_k_accuracy; print('Metrics module imports OK')"
python -c "from src.evaluation.cost_tracker import PRICING, calculate_cost; print('Cost tracker imports OK')"
SearchResult and BenchmarkResults dataclasses importable. calculate_exact_match_accuracy returns float. calculate_top_k_accuracy returns float. calculate_cost returns USD estimate.
Task 2: Create batch benchmark runner with test set iteration and aggregate statistics
src/evaluation/benchmark.py, src/evaluation/__init__.py
Create src/evaluation/benchmark.py with:
- measure_latency context manager:
  - Use time.perf_counter() for high-precision timing
  - Yield dict with latency_ms field that gets populated on exit
  - Convert to milliseconds: (end - start) * 1000
- fetch_test_queries(conn) -> list[dict]:
  - Query test_query_variation table joined with line_item
  - Return list with: variation_id, original_item_id, variation_text, variation_type, ground_truth_debit_account, ground_truth_cost_center
  - Order by original_item_id for consistent iteration
- run_single_query(conn, query_text: str, k: int = 5) -> dict[str, SearchResult]:
  - Execute search_all_models() with timing
  - Execute llm_context_match() with timing
  - Return dict with 'google', 'jina', 'minilm', 'llm' keys containing SearchResult objects
  - For embedding models: top_k_debit_accounts from results, predicted = top-1
  - For LLM: top_k empty (single prediction)
  - WARMUP: Cold-start latency must not be recorded; run_benchmark issues the warmup queries for each model before timing starts
- run_benchmark(conn, k: int = 5, limit: Optional[int] = None) -> dict[str, BenchmarkResults]:
  - Iterate test queries (optionally limit for quick mode)
  - Run 2 warmup queries per model before timing starts
  - Aggregate SearchResults per model
  - Calculate BenchmarkResults for each model
  - Add configurable delay between API calls (0.1s default) to avoid rate limits
  - Print progress with tqdm
  - Return dict with model names as keys
- aggregate_results(results: list[SearchResult]) -> BenchmarkResults:
  - Use calculate_exact_match_accuracy for GL and CC
  - Use calculate_top_k_accuracy for k=3,5,10 (top-10 only differs from top-5 when at least 10 candidates were retrieved, i.e. the benchmark ran with k >= 10)
  - Use calculate_latency_stats
  - Sum tokens and calculate total cost
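The measure_latency context manager described above might look like the sketch below. Recording the elapsed time in a `finally` clause, so that failed calls are also timed, is an assumption beyond what the plan states.

```python
import time
from contextlib import contextmanager


@contextmanager
def measure_latency():
    """Yield a dict whose 'latency_ms' entry is filled in when the block exits."""
    timing = {"latency_ms": None}
    start = time.perf_counter()
    try:
        yield timing
    finally:
        # perf_counter() returns seconds; convert to milliseconds.
        timing["latency_ms"] = (time.perf_counter() - start) * 1000


# Usage: wrap the call being timed, then read the value after the block.
with measure_latency() as timing:
    time.sleep(0.01)  # stand-in for a search call
print(f"latency: {timing['latency_ms']:.1f} ms")
```

Yielding a mutable dict lets the caller keep a reference that is only populated after the timed block has finished, which is why the value is read outside the `with` statement.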
Update src/evaluation/__init__.py to export:
- SearchResult, BenchmarkResults from metrics
- run_benchmark, fetch_test_queries from benchmark
- calculate_cost from cost_tracker
python -c "from src.evaluation import SearchResult, BenchmarkResults, run_benchmark; print('Benchmark module imports OK')"
python -c "
from src.db import get_connection
from src.evaluation.benchmark import fetch_test_queries
conn = get_connection()
queries = fetch_test_queries(conn)
print(f'Found {len(queries)} test queries')
conn.close()
"
run_benchmark returns dict with BenchmarkResults per model. fetch_test_queries returns test set variations. Latency measured in ms. Progress displayed with tqdm.
Task 3: Verify benchmark execution with small sample
src/evaluation/run_benchmark.py
Create src/evaluation/run_benchmark.py as a CLI script:
- Parse command line args:
  - --limit: Number of queries (default 10 for quick test)
  - --k: Top-K value (default 5)
  - --full: Run full benchmark (no limit)
- Connect to database, run benchmark, print results table:
  - Model | GL Acc | CC Acc | Top-3 | Top-5 | Top-10 | Latency (ms) | Cost ($)
- Handle rate limit errors gracefully (print warning, continue)
- Print summary statistics at end
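The argument handling for the CLI could be sketched as below. The `parse_args` helper name and the choice to let `--full` override `--limit` by setting it to None are assumptions; the plan only lists the three flags and their defaults.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the benchmark CLI flags; argv=None reads from sys.argv."""
    parser = argparse.ArgumentParser(description="Run the search benchmark.")
    parser.add_argument("--limit", type=int, default=10,
                        help="number of test queries to run (default: 10)")
    parser.add_argument("--k", type=int, default=5,
                        help="top-K value passed to the searches (default: 5)")
    parser.add_argument("--full", action="store_true",
                        help="run the full benchmark, ignoring --limit")
    args = parser.parse_args(argv)
    if args.full:
        # Assumption: limit=None means "no limit" for run_benchmark.
        args.limit = None
    return args
```

The parsed values would then be forwarded as `run_benchmark(conn, k=args.k, limit=args.limit)`, matching the signature given in Task 2.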
Run: python -m src.evaluation.run_benchmark --limit 10
This validates the entire pipeline works end-to-end on a small sample before full benchmark.
python -m src.evaluation.run_benchmark --limit 5
CLI runs 5-query benchmark without errors. Results table shows all 4 models with accuracy, latency, cost metrics. No rate limit failures.
After all tasks:
1. `python -c "from src.evaluation import SearchResult, BenchmarkResults, run_benchmark"` succeeds
2. `python -m src.evaluation.run_benchmark --limit 5` completes without errors
3. Output shows all 4 models with GL accuracy, CC accuracy, top-K metrics, latency, cost
<success_criteria>
- SearchResult and BenchmarkResults dataclasses defined with all fields
- Exact match accuracy calculates correctly (normalized string comparison)
- Top-K accuracy calculates correctly (ground truth in top K)
- Latency measured with perf_counter in milliseconds
- Batch benchmark iterates test set with progress display
- Cost tracked per model with estimated USD
- Small sample benchmark (5-10 queries) runs successfully
</success_criteria>