Create metrics calculator and batch benchmarking infrastructure for evaluating all 4 search approaches.

Purpose: Enable quantitative comparison of embedding search vs LLM matching on accuracy, latency, and cost.
Output: Python modules for metrics calculation, benchmark execution, and cost tracking.

<execution_context> @./.claude/get-shit-done/workflows/execute-plan.md @./.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @.planning/phases/04-evaluation-dashboard/04-RESEARCH.md

@src/search/pgvector_search.py @src/search/llm_matching.py @src/evaluation/train_test_split.py @src/evaluation/query_variations.py

Task 1: Create metrics module with dataclass structures and accuracy calculations
Files: src/evaluation/metrics.py, src/evaluation/cost_tracker.py

Create src/evaluation/metrics.py with:

  1. SearchResult dataclass:

    • model: str (google, jina, minilm, llm)
    • query_id: int
    • latency_ms: float
    • predicted_debit_account: Optional[str]
    • predicted_cost_center: Optional[str]
    • ground_truth_debit_account: str
    • ground_truth_cost_center: Optional[str]
    • top_k_debit_accounts: list[str] (for embedding results)
    • similarity_scores: list[float]
    • tokens_input: int = 0
    • tokens_output: int = 0
    • api_calls: int = 0
  2. BenchmarkResults dataclass:

    • model: str
    • total_queries: int
    • exact_match_gl: float (0-1)
    • exact_match_cc: float (0-1)
    • top_3_accuracy: float
    • top_5_accuracy: float
    • top_10_accuracy: float
    • latency_mean_ms: float
    • latency_p50_ms: float
    • latency_p95_ms: float
    • total_tokens: int
    • total_cost_usd: float
  3. Functions:

    • calculate_exact_match_accuracy(results: list[SearchResult], field: str) -> float
      Normalize both predicted and ground truth to str before comparing (handles "6801" vs "6801.0"). NULL == NULL counts as a correct match for cost_center.

    • calculate_top_k_accuracy(results: list[SearchResult], k: int) -> float
      Check whether ground_truth_debit_account appears in top_k_debit_accounts[:k]. Normalize all values to str for comparison.

    • calculate_latency_stats(latencies: list[float]) -> dict
      Return mean, median, p95, min, max, and stdev using the statistics module. For p95, use sorted_latencies[int(n * 0.95)] if n >= 20 else sorted_latencies[-1].
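
The three functions above could be sketched roughly as follows. This is a minimal illustration, not the final module: the `SearchResult` here is trimmed to the fields the functions read, the `_normalize` helper is hypothetical, and stripping a trailing `.0` is an assumption about how "6801" and "6801.0" are meant to compare equal. The accepted values of `field` (`debit_account`, `cost_center`) are also assumed.

```python
import dataclasses
import statistics
from typing import Optional


@dataclasses.dataclass
class SearchResult:
    """Trimmed-down stand-in for the SearchResult described above."""
    model: str
    predicted_debit_account: Optional[str]
    ground_truth_debit_account: str
    predicted_cost_center: Optional[str] = None
    ground_truth_cost_center: Optional[str] = None
    top_k_debit_accounts: list = dataclasses.field(default_factory=list)


def _normalize(value) -> Optional[str]:
    """Hypothetical helper: str(), strip, drop a trailing '.0' (assumption)."""
    if value is None:
        return None
    text = str(value).strip()
    return text[:-2] if text.endswith(".0") else text


def calculate_exact_match_accuracy(results, field: str) -> float:
    """field is assumed to be 'debit_account' or 'cost_center'."""
    if not results:
        return 0.0
    hits = 0
    for r in results:
        predicted = _normalize(getattr(r, f"predicted_{field}"))
        truth = _normalize(getattr(r, f"ground_truth_{field}"))
        hits += predicted == truth  # None == None counts as a match
    return hits / len(results)


def calculate_top_k_accuracy(results, k: int) -> float:
    """Fraction of results whose ground truth is among the first k candidates."""
    if not results:
        return 0.0
    hits = 0
    for r in results:
        truth = _normalize(r.ground_truth_debit_account)
        candidates = [_normalize(c) for c in r.top_k_debit_accounts[:k]]
        hits += truth in candidates
    return hits / len(results)


def calculate_latency_stats(latencies) -> dict:
    """Summary statistics; p95 falls back to the max when n < 20."""
    if not latencies:
        return {}
    ordered = sorted(latencies)
    n = len(ordered)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[int(n * 0.95)] if n >= 20 else ordered[-1],
        "min": ordered[0],
        "max": ordered[-1],
        "stdev": statistics.stdev(ordered) if n >= 2 else 0.0,
    }
```

Returning 0.0 for an empty result list and an empty dict for an empty latency list are choices made for this sketch; the plan does not say how empty inputs should be handled.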

Create src/evaluation/cost_tracker.py with:

  1. PRICING dict:

    • gemini_flash: input_per_1m=0.30, output_per_1m=2.50
    • google_embedding: input_per_1m=0.15
    • jina_embedding: input_per_1m=0.02
    • minilm: input_per_1m=0.0 (local, no API cost)
  2. calculate_cost(model: str, input_tokens: int, output_tokens: int = 0) -> float
     Apply pricing based on model name.

Verify:

  • `python -c "from src.evaluation.metrics import SearchResult, BenchmarkResults, calculate_exact_match_accuracy, calculate_top_k_accuracy; print('Metrics module imports OK')"`
  • `python -c "from src.evaluation.cost_tracker import PRICING, calculate_cost; print('Cost tracker imports OK')"`

Done when: SearchResult and BenchmarkResults dataclasses are importable, calculate_exact_match_accuracy and calculate_top_k_accuracy return floats, and calculate_cost returns a USD estimate.
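
One possible shape for the cost tracker, using the per-million-token rates listed above. The nested-dict layout and the `ValueError` on unknown models are assumptions of this sketch. The plan also does not say how the `SearchResult.model` labels (google, jina, minilm, llm) map to the `PRICING` keys, so the sketch takes `PRICING` keys directly.

```python
# Rates in USD per 1M tokens, copied from the plan above.
PRICING = {
    "gemini_flash": {"input_per_1m": 0.30, "output_per_1m": 2.50},
    "google_embedding": {"input_per_1m": 0.15},
    "jina_embedding": {"input_per_1m": 0.02},
    "minilm": {"input_per_1m": 0.0},  # local model, no API cost
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int = 0) -> float:
    """Estimate the USD cost of a call from its token counts."""
    try:
        rates = PRICING[model]
    except KeyError:
        # Assumption: unknown model names are treated as an error.
        raise ValueError(f"Unknown model: {model!r}") from None
    input_cost = input_tokens / 1_000_000 * rates["input_per_1m"]
    # Embedding models have no output rate, so default it to zero.
    output_cost = output_tokens / 1_000_000 * rates.get("output_per_1m", 0.0)
    return input_cost + output_cost
```

For example, `calculate_cost("gemini_flash", 2_000, 500)` adds the input share (2,000 tokens at $0.30 per 1M) to the output share (500 tokens at $2.50 per 1M).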

Task 2: Create batch benchmark runner with test set iteration and aggregate statistics
Files: src/evaluation/benchmark.py, src/evaluation/__init__.py

Create src/evaluation/benchmark.py with:

  1. measure_latency context manager:

    • Use time.perf_counter() for high-precision timing
    • Yield dict with latency_ms field that gets populated on exit
    • Convert to milliseconds: (end - start) * 1000
  2. fetch_test_queries(conn) -> list[dict]:

    • Query test_query_variation table joined with line_item
    • Return list with: variation_id, original_item_id, variation_text, variation_type, ground_truth_debit_account, ground_truth_cost_center
    • Order by original_item_id for consistent iteration
  3. run_single_query(conn, query_text: str, k: int = 5) -> dict[str, SearchResult]:

    • Execute search_all_models() with timing
    • Execute llm_context_match() with timing
    • Return dict with 'google', 'jina', 'minilm', 'llm' keys containing SearchResult objects
    • For embedding models: top_k_debit_accounts from results, predicted = top-1
    • For LLM: top_k empty (single prediction)
    • WARMUP: Skip timing for first query of each model (cold start)
  4. run_benchmark(conn, k: int = 5, limit: Optional[int] = None) -> dict[str, BenchmarkResults]:

    • Iterate test queries (optionally limit for quick mode)
    • Run 2 warmup queries per model before timing starts
    • Aggregate SearchResults per model
    • Calculate BenchmarkResults for each model
    • Add configurable delay between API calls (0.1s default) to avoid rate limits
    • Print progress with tqdm
    • Return dict with model names as keys
  5. aggregate_results(results: list[SearchResult]) -> BenchmarkResults:

    • Use calculate_exact_match_accuracy for GL and CC
    • Use calculate_top_k_accuracy for k=3,5,10
    • Use calculate_latency_stats
    • Sum tokens and calculate total cost
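
The timing pieces of the list above (the `measure_latency` context manager, untimed warmup queries, and the delay between calls) might look like this. `run_timed_queries` and its `search_fn` parameter are hypothetical stand-ins for the real `run_benchmark` loop, which would call `search_all_models()` and `llm_context_match()` rather than an arbitrary callable.

```python
import time
from contextlib import contextmanager


@contextmanager
def measure_latency():
    """Yield a dict whose 'latency_ms' entry is filled in on exit."""
    timing = {"latency_ms": None}
    start = time.perf_counter()
    try:
        yield timing
    finally:
        # Recorded even if the wrapped call raises.
        timing["latency_ms"] = (time.perf_counter() - start) * 1000


def run_timed_queries(search_fn, queries, warmup=2, delay_s=0.1):
    """Hypothetical loop: untimed warmup calls, then timed calls with a pause."""
    for query in queries[:warmup]:
        search_fn(query)  # warmup only, result and timing discarded
    latencies = []
    for query in queries:
        with measure_latency() as timing:
            search_fn(query)
        latencies.append(timing["latency_ms"])
        time.sleep(delay_s)  # spacing between calls to avoid rate limits
    return latencies
```

Because the sleep sits outside the `with` block, the rate-limit delay is not counted in the measured latency.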

Update src/evaluation/__init__.py to export:

  • SearchResult, BenchmarkResults from metrics
  • run_benchmark, fetch_test_queries from benchmark
  • calculate_cost from cost_tracker

Verify:

  • `python -c "from src.evaluation import SearchResult, BenchmarkResults, run_benchmark; print('Benchmark module imports OK')"`
  • `python -c "from src.db import get_connection; from src.evaluation.benchmark import fetch_test_queries; conn = get_connection(); queries = fetch_test_queries(conn); print(f'Found {len(queries)} test queries'); conn.close()"`

Done when: run_benchmark returns a dict with BenchmarkResults per model, fetch_test_queries returns the test set variations, latency is measured in ms, and progress is displayed with tqdm.

Task 3: Verify benchmark execution with small sample
Files: src/evaluation/run_benchmark.py

Create src/evaluation/run_benchmark.py as a CLI script:

  1. Parse command line args:

    • --limit: Number of queries (default 10 for quick test)
    • --k: Top-K value (default 5)
    • --full: Run full benchmark (no limit)
  2. Connect to database, run benchmark, print results table:

    • Model | GL Acc | CC Acc | Top-3 | Top-5 | Top-10 | Latency (ms) | Cost ($)
  3. Handle rate limit errors gracefully (print warning, continue)

  4. Print summary statistics at end

Run: python -m src.evaluation.run_benchmark --limit 10

This validates that the entire pipeline works end-to-end on a small sample before the full benchmark.

Verify: `python -m src.evaluation.run_benchmark --limit 5`

Done when: The CLI runs a 5-query benchmark without errors, the results table shows all 4 models with accuracy, latency, and cost metrics, and there are no rate limit failures.
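
A rough sketch of the argument parsing and table row formatting for the CLI described in this task. The helper names (`build_parser`, `resolve_limit`, `format_row`) and the number formats are illustrative assumptions. The row is built from a plain dict keyed by the BenchmarkResults field names so the sketch stays self-contained.

```python
import argparse

HEADER = "Model | GL Acc | CC Acc | Top-3 | Top-5 | Top-10 | Latency (ms) | Cost ($)"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the search benchmark")
    parser.add_argument("--limit", type=int, default=10,
                        help="number of queries (default 10 for a quick test)")
    parser.add_argument("--k", type=int, default=5, help="top-K value")
    parser.add_argument("--full", action="store_true",
                        help="run the full benchmark (no limit)")
    return parser


def resolve_limit(args: argparse.Namespace):
    """--full overrides --limit; None means 'no limit' (assumed convention)."""
    return None if args.full else args.limit


def format_row(model: str, r: dict) -> str:
    """Format one results row in the column order of HEADER."""
    return " | ".join([
        model,
        f"{r['exact_match_gl']:.1%}",
        f"{r['exact_match_cc']:.1%}",
        f"{r['top_3_accuracy']:.1%}",
        f"{r['top_5_accuracy']:.1%}",
        f"{r['top_10_accuracy']:.1%}",
        f"{r['latency_mean_ms']:.1f}",
        f"{r['total_cost_usd']:.4f}",
    ])
```

In the real script, `resolve_limit(build_parser().parse_args())` would supply the `limit` argument of `run_benchmark`, and each BenchmarkResults object would be printed as one row under `HEADER`.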

After all tasks:

  1. `python -c "from src.evaluation import SearchResult, BenchmarkResults, run_benchmark"` succeeds
  2. `python -m src.evaluation.run_benchmark --limit 5` completes without errors
  3. Output shows all 4 models with GL accuracy, CC accuracy, top-K metrics, latency, and cost

<success_criteria>

After completion, create `.planning/phases/04-evaluation-dashboard/04-01-SUMMARY.md`