04-01-PLAN

Create metrics calculator and batch benchmarking infrastructure for evaluating all 4 search approaches.

Purpose: Enable quantitative comparison of embedding search vs LLM matching with accuracy, latency, and cost metrics.

Output: Python modules for metrics calculation, benchmark execution, and cost tracking.

<execution_context> @./.claude/get-shit-done/workflows/execute-plan.md @./.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @.planning/phases/04-evaluation-dashboard/04-RESEARCH.md

@src/search/pgvector_search.py @src/search/llm_matching.py @src/evaluation/train_test_split.py @src/evaluation/query_variations.py

Task 1: Create metrics module with dataclass structures and accuracy calculations

Files: src/evaluation/metrics.py, src/evaluation/cost_tracker.py

Create src/evaluation/metrics.py with:

SearchResult dataclass:
- model: str (google, jina, minilm, llm)
- query_id: int
- latency_ms: float
- predicted_debit_account: Optional[str]
- predicted_cost_center: Optional[str]
- ground_truth_debit_account: str
- ground_truth_cost_center: Optional[str]
- top_k_debit_accounts: list[str] (for embedding results)
- similarity_scores: list[float]
- tokens_input: int = 0
- tokens_output: int = 0
- api_calls: int = 0
BenchmarkResults dataclass:
- model: str
- total_queries: int
- exact_match_gl: float (0-1)
- exact_match_cc: float (0-1)
- top_3_accuracy: float
- top_5_accuracy: float
- top_10_accuracy: float
- latency_mean_ms: float
- latency_p50_ms: float
- latency_p95_ms: float
- total_tokens: int
- total_cost_usd: float
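The two field lists above map fairly directly onto Python dataclasses. The following is a minimal sketch rather than the final module. The plan does not give defaults for `top_k_debit_accounts` and `similarity_scores`; the empty-list defaults here are an assumption, made so that LLM results (which have no top-K list) can omit them.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SearchResult:
    """One model's answer to one test query."""
    model: str  # "google", "jina", "minilm", or "llm"
    query_id: int
    latency_ms: float
    predicted_debit_account: Optional[str]
    predicted_cost_center: Optional[str]
    ground_truth_debit_account: str
    ground_truth_cost_center: Optional[str]
    # Assumption: default to empty lists so LLM results can leave these out.
    top_k_debit_accounts: list[str] = field(default_factory=list)
    similarity_scores: list[float] = field(default_factory=list)
    tokens_input: int = 0
    tokens_output: int = 0
    api_calls: int = 0


@dataclass
class BenchmarkResults:
    """Aggregate metrics for one model over the whole test set."""
    model: str
    total_queries: int
    exact_match_gl: float  # fraction in [0, 1]
    exact_match_cc: float  # fraction in [0, 1]
    top_3_accuracy: float
    top_5_accuracy: float
    top_10_accuracy: float
    latency_mean_ms: float
    latency_p50_ms: float
    latency_p95_ms: float
    total_tokens: int
    total_cost_usd: float


# Example: an LLM result carries a single prediction and no top-K list.
example = SearchResult(
    model="llm",
    query_id=1,
    latency_ms=850.0,
    predicted_debit_account="6801",
    predicted_cost_center=None,
    ground_truth_debit_account="6801",
    ground_truth_cost_center=None,
    tokens_input=1200,
    tokens_output=40,
    api_calls=1,
)
```

Using `field(default_factory=list)` avoids sharing one mutable list between instances, which a plain `= []` default would not allow in a dataclass.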
Functions:
- calculate_exact_match_accuracy(results: list[SearchResult], field: str) -> float
  Normalize both predicted and ground truth to str and strip a trailing ".0" before comparing (so "6801" matches "6801.0"). For cost_center, NULL == NULL counts as a correct match.
- calculate_top_k_accuracy(results: list[SearchResult], k: int) -> float
  Check whether ground_truth_debit_account appears in top_k_debit_accounts[:k]. Normalize all values to str for comparison.
- calculate_latency_stats(latencies: list[float]) -> dict
  Return mean, median (p50), p95, min, max, and stdev using the statistics module. For p95: sorted_latencies[int(n * 0.95)] if n >= 20 else sorted_latencies[-1].
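The three functions above can be sketched as follows. This is an illustration under stated assumptions, not the final module: the helper `normalize` is a hypothetical name, `field` is assumed to be either `"debit_account"` or `"cost_center"`, and the handling of empty inputs (returning 0.0 for accuracies, raising for latencies) is not specified in the plan. The functions only read attributes, so the example uses `SimpleNamespace` stand-ins in place of real SearchResult objects.

```python
import re
import statistics
from types import SimpleNamespace
from typing import Any, Optional


def normalize(value: Any) -> Optional[str]:
    """Hypothetical helper: stringify and drop a trailing '.0'; None stays None."""
    if value is None:
        return None
    return re.sub(r"\.0+$", "", str(value).strip())


def calculate_exact_match_accuracy(results: list, field: str) -> float:
    """Fraction of results where predicted_<field> equals ground_truth_<field>."""
    if not results:
        return 0.0  # assumption: empty input yields 0.0
    hits = sum(
        normalize(getattr(r, f"predicted_{field}"))
        == normalize(getattr(r, f"ground_truth_{field}"))
        for r in results
    )  # None == None is True, so NULL/NULL cost centers count as matches
    return hits / len(results)


def calculate_top_k_accuracy(results: list, k: int) -> float:
    """Fraction of results whose ground truth appears in the first k candidates."""
    if not results:
        return 0.0
    hits = sum(
        normalize(r.ground_truth_debit_account)
        in [normalize(a) for a in r.top_k_debit_accounts[:k]]
        for r in results
    )
    return hits / len(results)


def calculate_latency_stats(latencies: list[float]) -> dict:
    """Summary statistics in the same unit as the input (milliseconds here)."""
    if not latencies:
        raise ValueError("latencies must not be empty")  # assumption
    ordered = sorted(latencies)
    n = len(ordered)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[int(n * 0.95)] if n >= 20 else ordered[-1],
        "min": ordered[0],
        "max": ordered[-1],
        "stdev": statistics.stdev(ordered) if n > 1 else 0.0,
    }


# Stand-in results: one float-formatted prediction, one wrong top-1.
sample = [
    SimpleNamespace(
        predicted_debit_account="6801.0", ground_truth_debit_account="6801",
        predicted_cost_center=None, ground_truth_cost_center=None,
        top_k_debit_accounts=["6801.0", "6400"],
    ),
    SimpleNamespace(
        predicted_debit_account="6400", ground_truth_debit_account="6500",
        predicted_cost_center="100", ground_truth_cost_center="200",
        top_k_debit_accounts=["6400", "6450", "6500"],
    ),
]
gl_accuracy = calculate_exact_match_accuracy(sample, "debit_account")
top_3 = calculate_top_k_accuracy(sample, 3)
```

Stripping the trailing ".0" with a regular expression, rather than round-tripping through `float`, keeps leading zeros and alphanumeric account codes intact.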

Create src/evaluation/cost_tracker.py with:

PRICING dict:
- gemini_flash: input_per_1m=0.30, output_per_1m=2.50
- google_embedding: input_per_1m=0.15
- jina_embedding: input_per_1m=0.02
- minilm: input_per_1m=0.0 (local, no API cost)

calculate_cost(model: str, input_tokens: int, output_tokens: int = 0) -> float
- Apply pricing based on model name.

Verify:
- `python -c "from src.evaluation.metrics import SearchResult, BenchmarkResults, calculate_exact_match_accuracy, calculate_top_k_accuracy; print('Metrics module imports OK')"`
- `python -c "from src.evaluation.cost_tracker import PRICING, calculate_cost; print('Cost tracker imports OK')"`

Done: SearchResult and BenchmarkResults dataclasses importable. calculate_exact_match_accuracy returns float. calculate_top_k_accuracy returns float. calculate_cost returns USD estimate.
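The PRICING table and calculate_cost from this task could look roughly like the sketch below. The prices are the ones listed in the plan (USD per one million tokens). Two points are assumptions: models without an `output_per_1m` entry are charged nothing for output tokens, and the function is called with PRICING keys directly, since the plan does not say how the SearchResult model names ("google", "jina", "minilm", "llm") map onto them.

```python
# Prices in USD per 1,000,000 tokens, copied from the plan.
PRICING = {
    "gemini_flash": {"input_per_1m": 0.30, "output_per_1m": 2.50},
    "google_embedding": {"input_per_1m": 0.15},
    "jina_embedding": {"input_per_1m": 0.02},
    "minilm": {"input_per_1m": 0.0},  # local model, no API cost
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int = 0) -> float:
    """Estimate the USD cost of a call; raises KeyError for unknown models."""
    rates = PRICING[model]
    input_cost = input_tokens / 1_000_000 * rates["input_per_1m"]
    # Assumption: embedding models have no output price, so default to 0.
    output_cost = output_tokens / 1_000_000 * rates.get("output_per_1m", 0.0)
    return input_cost + output_cost


# Example: one LLM call with 2,000 prompt tokens and 100 completion tokens.
llm_call_cost = calculate_cost("gemini_flash", 2_000, 100)
```

Letting an unknown model name raise `KeyError` surfaces typos early instead of silently reporting a zero cost.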

Task 2: Create batch benchmark runner with test set iteration and aggregate statistics

Files: src/evaluation/benchmark.py, src/evaluation/__init__.py

Create src/evaluation/benchmark.py with:

measure_latency context manager:
- Use time.perf_counter() for high-precision timing
- Yield dict with latency_ms field that gets populated on exit
- Convert to milliseconds: (end - start) * 1000
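A context manager matching the three points above might look like this. It is a minimal sketch; recording the latency in a `finally` clause (so it is filled in even when the timed code raises) is an assumption beyond what the plan states.

```python
import time
from contextlib import contextmanager


@contextmanager
def measure_latency():
    """Yield a dict whose 'latency_ms' entry is filled in when the block exits."""
    timing = {"latency_ms": None}
    start = time.perf_counter()
    try:
        yield timing
    finally:
        # perf_counter returns seconds; convert to milliseconds.
        timing["latency_ms"] = (time.perf_counter() - start) * 1000


# Usage: the value is only available after the with-block has finished.
with measure_latency() as timing:
    sum(range(10_000))  # stand-in for a search call
elapsed_ms = timing["latency_ms"]
```

Inside the `with` block, `timing["latency_ms"]` is still `None`, so callers should read it only after the block exits.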
fetch_test_queries(conn) -> list[dict]:
- Query test_query_variation table joined with line_item
- Return list with: variation_id, original_item_id, variation_text, variation_type, ground_truth_debit_account, ground_truth_cost_center
- Order by original_item_id for consistent iteration
run_single_query(conn, query_text: str, k: int = 5) -> dict[str, SearchResult]:
- Execute search_all_models() with timing
- Execute llm_context_match() with timing
- Return dict with 'google', 'jina', 'minilm', 'llm' keys containing SearchResult objects
- For embedding models: top_k_debit_accounts from results, predicted = top-1
- For LLM: top_k empty (single prediction)
- WARMUP: Skip timing for first query of each model (cold start)
run_benchmark(conn, k: int = 5, limit: Optional[int] = None) -> dict[str, BenchmarkResults]:
- Iterate test queries (optionally limit for quick mode)
- Run 2 warmup queries per model before timing starts
- Aggregate SearchResults per model
- Calculate BenchmarkResults for each model
- Add configurable delay between API calls (0.1s default) to avoid rate limits
- Print progress with tqdm
- Return dict with model names as keys
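The warmup, delay, and progress behaviour listed for run_benchmark can be illustrated in isolation. In this sketch `run_timed_queries` and `search_fn` are hypothetical names: `search_fn` stands in for whichever search call is being timed, and reusing the first test queries as warmup queries is an assumption. The real runner would build SearchResult objects instead of tuples.

```python
import time
from typing import Any, Callable, Optional

try:
    from tqdm import tqdm  # progress bar requested by the plan
except ImportError:
    # Fallback so the sketch still runs where tqdm is not installed.
    def tqdm(iterable, **kwargs):
        return iterable


def run_timed_queries(
    queries: list,
    search_fn: Callable[[Any], Any],
    warmup: int = 2,
    delay_s: float = 0.1,
    limit: Optional[int] = None,
) -> list:
    """Return (query, result, latency_ms) tuples; warmup calls are not recorded."""
    if limit is not None:
        queries = queries[:limit]  # quick mode
    # Warmup: absorb cold-start cost (connections, model loading) untimed.
    for query in queries[:warmup]:
        search_fn(query)
    timed = []
    for query in tqdm(queries, desc="benchmark"):
        start = time.perf_counter()
        result = search_fn(query)
        latency_ms = (time.perf_counter() - start) * 1000
        timed.append((query, result, latency_ms))
        if delay_s > 0:
            time.sleep(delay_s)  # spacing between API calls, outside the timed span
    return timed


# Usage with a dummy search function that upper-cases the query text.
calls = []
def fake_search(text):
    calls.append(text)
    return text.upper()

timed = run_timed_queries(["a", "b", "c"], fake_search, delay_s=0.0)
```

The sleep sits after the latency is recorded, so the rate-limit delay never inflates the measured numbers.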
aggregate_results(results: list[SearchResult]) -> BenchmarkResults:
- Use calculate_exact_match_accuracy for GL and CC
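- Use calculate_top_k_accuracy for k=3,5,10 (top_k_debit_accounts must hold at least 10 candidates for top-10 to differ from top-5, so retrieve with k >= 10)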
- Use calculate_latency_stats
- Sum tokens and calculate total cost

Update src/evaluation/__init__.py to export:

- SearchResult, BenchmarkResults from metrics
- run_benchmark, fetch_test_queries from benchmark
- calculate_cost from cost_tracker

Verify:
- `python -c "from src.evaluation import SearchResult, BenchmarkResults, run_benchmark; print('Benchmark module imports OK')"`
- `python -c "from src.db import get_connection; from src.evaluation.benchmark import fetch_test_queries; conn = get_connection(); queries = fetch_test_queries(conn); print(f'Found {len(queries)} test queries'); conn.close()"`

Done: run_benchmark returns dict with BenchmarkResults per model. fetch_test_queries returns test set variations. Latency measured in ms. Progress displayed with tqdm.

Task 3: Verify benchmark execution with small sample

Files: src/evaluation/run_benchmark.py

Create src/evaluation/run_benchmark.py as a CLI script:

Parse command line args:
- --limit: Number of queries (default 10 for quick test)
- --k: Top-K value (default 5)
- --full: Run full benchmark (no limit)
Connect to database, run benchmark, print results table:
- Model | GL Acc | CC Acc | Top-3 | Top-5 | Top-10 | Latency (ms) | Cost ($)
Handle rate limit errors gracefully (print warning, continue)
Print summary statistics at end
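The argument parsing and results table described above could be sketched as follows. The flag names and defaults come from the plan; `parse_args` and `format_table` are hypothetical helper names, and showing the mean latency in the "Latency (ms)" column is an assumption, since the plan does not say which latency statistic the table reports. Database access and the benchmark call are left out.

```python
import argparse
from types import SimpleNamespace
from typing import Optional, Sequence

HEADER = ["Model", "GL Acc", "CC Acc", "Top-3", "Top-5", "Top-10",
          "Latency (ms)", "Cost ($)"]


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run the search benchmark.")
    parser.add_argument("--limit", type=int, default=10,
                        help="number of queries (default 10 for a quick test)")
    parser.add_argument("--k", type=int, default=5, help="top-K value")
    parser.add_argument("--full", action="store_true",
                        help="run the full benchmark (no limit)")
    args = parser.parse_args(argv)
    if args.full:
        args.limit = None  # --full overrides --limit
    return args


def format_table(results: dict) -> str:
    """Render one row per model from objects with BenchmarkResults-like fields."""
    rows = [HEADER]
    for model, r in results.items():
        rows.append([
            model,
            f"{r.exact_match_gl:.1%}",
            f"{r.exact_match_cc:.1%}",
            f"{r.top_3_accuracy:.1%}",
            f"{r.top_5_accuracy:.1%}",
            f"{r.top_10_accuracy:.1%}",
            f"{r.latency_mean_ms:.1f}",  # assumption: mean latency
            f"{r.total_cost_usd:.4f}",
        ])
    widths = [max(len(row[i]) for row in rows) for i in range(len(HEADER))]
    return "\n".join(
        " | ".join(cell.ljust(width) for cell, width in zip(row, widths))
        for row in rows
    )


# Usage with a stand-in result object in place of a real BenchmarkResults.
demo = {
    "minilm": SimpleNamespace(
        exact_match_gl=0.5, exact_match_cc=0.4, top_3_accuracy=0.7,
        top_5_accuracy=0.8, top_10_accuracy=0.9,
        latency_mean_ms=12.34, total_cost_usd=0.0,
    )
}
table = format_table(demo)
```

Passing `argv` explicitly makes the parser testable without touching `sys.argv`; calling `parse_args()` with no argument falls back to the real command line.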

Run: python -m src.evaluation.run_benchmark --limit 10

This validates that the entire pipeline works end-to-end on a small sample before the full benchmark.

Verify: `python -m src.evaluation.run_benchmark --limit 5`

Done: CLI runs a 5-query benchmark without errors. Results table shows all 4 models with accuracy, latency, and cost metrics. No rate limit failures.

After all tasks:
1. `python -c "from src.evaluation import SearchResult, BenchmarkResults, run_benchmark"` succeeds
2. `python -m src.evaluation.run_benchmark --limit 5` completes without errors
3. Output shows all 4 models with GL accuracy, CC accuracy, top-K metrics, latency, cost

<success_criteria>

- SearchResult and BenchmarkResults dataclasses defined with all fields
- Exact match accuracy calculates correctly (normalized string comparison)
- Top-K accuracy calculates correctly (ground truth in top K)
- Latency measured with perf_counter in milliseconds
- Batch benchmark iterates test set with progress display
- Cost tracked per model with estimated USD
- Small sample benchmark (5-10 queries) runs successfully

</success_criteria>

After completion, create `.planning/phases/04-evaluation-dashboard/04-01-SUMMARY.md`