Phase 4: Evaluation & Dashboard - Research

Researched: 2026-02-20
Domain: Metrics calculation, batch benchmarking, Flask dashboard, cost tracking
Confidence: HIGH

Summary

This phase adds quantitative evaluation of the 4 search approaches (3 embedding models + LLM baseline): metrics calculation, batch benchmarking, and an interactive dashboard. The existing codebase already has a Flask search UI, pgvector search, LLM matching, test set infrastructure, and query variations; this phase adds the measurement, aggregation, and comparison layers.

The technical stack is straightforward: Python's time.perf_counter() for high-precision latency measurement, simple accuracy calculations (exact match and top-K), dataclasses for structured results, and new routes on the existing Flask app. No complex libraries are needed: sklearn's accuracy_score could be used, but manual calculation is simpler for this use case. Cost tracking requires storing token counts per API call and applying known pricing rates.

Primary recommendation: Build a modular metrics calculator with dataclass-based result structures and a batch benchmark runner with progress streaming, then extend the existing Flask app with comparison and benchmark routes.

<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

None -- all implementation decisions delegated to Claude.

Claude's Discretion

User delegated all implementation decisions. Claude has full flexibility on:

Comparison view:

Metrics display:

Batch benchmark UX:

Dashboard interaction:

Deferred Ideas (OUT OF SCOPE)

None -- discussion stayed within phase scope.

</user_constraints>

<phase_requirements>

Phase Requirements

| ID | Description | Research Support |
|----|-------------|------------------|
| SRCH-04 | Side-by-side comparison view across all 3 embedding models | Existing Flask template uses 4-column grid; extend to include metric annotations and highlighting |
| EVAL-03 | Exact match accuracy for GL account assignment | Simple calculation: (correct predictions / total predictions); compare top-1 result debit_account to ground truth |
| EVAL-04 | Exact match accuracy for cost center assignment | Same as EVAL-03 but comparing cost_center field |
| EVAL-05 | Top-K accuracy (correct answer in top 3/5/10) | Check if ground truth appears anywhere in top-K results; standard IR metric |
| EVAL-06 | Latency measurement per query (ms) | Use time.perf_counter() for high-precision timing around search calls |
| EVAL-07 | Batch benchmarking over full test set | Iterate test_query_variation table, run all 4 approaches, aggregate statistics |
| EVAL-09 | Cost tracking per query (tokens, API calls, estimated $) | Track token counts from API responses; apply pricing: Gemini Flash $0.30/$2.50 per 1M, Jina ~$0.02 per 1M, Google embeddings $0.15 per 1M |
| REPT-01 | Interactive HTML dashboard for search exploration | Extend existing Flask app with benchmark routes; use Bootstrap grid for comparison view |
</phase_requirements>

Standard Stack

Core

| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| Flask | 3.x (existing) | Web framework | Already in use; simple extension with new routes |
| dataclasses | stdlib | Structured results | Zero-dependency, clean typing; asdict() yields JSON-ready dicts |
| time.perf_counter | stdlib | Latency measurement | Highest-resolution clock available; monotonic (unaffected by system clock adjustments) |
| statistics | stdlib | Aggregate calculations | mean, stdev, median built-in |

Supporting

| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| orjson | 3.x | Fast JSON export | Benchmark result export; native dataclass support |
| tqdm | existing | Progress bars | CLI progress during batch runs |

Alternatives Considered

| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| Manual accuracy | sklearn.metrics.accuracy_score | Adds dependency for simple division; not worth it |
| Server-Sent Events | Polling | SSE more complex but real-time; polling simpler for batch progress |
| Complex charting | HTML tables | Charts require JS libraries; tables sufficient for spike |

Installation:

pip install orjson  # Only if JSON export needed
# All other dependencies already installed

Architecture Patterns

src/
├── evaluation/
│   ├── __init__.py
│   ├── train_test_split.py     # existing
│   ├── query_variations.py     # existing
│   ├── metrics.py              # NEW: accuracy, top-k calculation
│   ├── benchmark.py            # NEW: batch runner with timing
│   └── cost_tracker.py         # NEW: token/cost aggregation
├── search/
│   ├── pgvector_search.py      # existing (add timing)
│   └── llm_matching.py         # existing (add token tracking)
├── templates/
│   ├── search.html             # existing
│   ├── benchmark.html          # NEW: batch benchmark UI
│   └── results.html            # NEW: aggregate results view
└── app.py                      # extend with benchmark routes

Pattern 1: Dataclass Result Structures

What: Use dataclasses for all metric/result structures
When to use: Any structured data that needs to be aggregated or serialized
Example:

# Source: Python dataclasses stdlib + orjson compatibility
from dataclasses import dataclass, asdict, field
from typing import Optional
import time

@dataclass
class SearchResult:
    """Single search result with timing."""
    model: str  # 'google', 'jina', 'minilm', 'llm'
    query_id: int
    latency_ms: float
    predicted_debit_account: Optional[str]
    predicted_cost_center: Optional[str]
    ground_truth_debit_account: str
    ground_truth_cost_center: Optional[str]
    top_k_debit_accounts: list[str] = field(default_factory=list)
    similarity_scores: list[float] = field(default_factory=list)
    # Cost tracking
    tokens_input: int = 0
    tokens_output: int = 0
    api_calls: int = 0

@dataclass
class BenchmarkResults:
    """Aggregated benchmark statistics."""
    model: str
    total_queries: int
    exact_match_gl: float  # 0-1
    exact_match_cc: float  # 0-1
    top_3_accuracy: float
    top_5_accuracy: float
    top_10_accuracy: float
    latency_mean_ms: float
    latency_p50_ms: float
    latency_p95_ms: float
    total_tokens: int
    total_cost_usd: float
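
As a rough illustration of why dataclasses suit the export step, the sketch below round-trips an aggregated result through JSON using only the stdlib. `ModelSummary` is a hypothetical, trimmed-down stand-in for BenchmarkResults, and the field values are made up for the example, not measured.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ModelSummary:
    """Hypothetical, trimmed-down stand-in for BenchmarkResults."""
    model: str
    total_queries: int
    exact_match_gl: float
    latency_p95_ms: float

# Illustrative values only; not real benchmark output
summary = ModelSummary(model="minilm", total_queries=200,
                       exact_match_gl=0.5, latency_p95_ms=12.5)

# asdict() produces a plain dict, which json.dumps can serialize directly
payload = json.dumps(asdict(summary), sort_keys=True)

# Loading the JSON back and unpacking it rebuilds an equal dataclass instance
restored = ModelSummary(**json.loads(payload))
```

If orjson is installed, the `json.dumps(asdict(...))` step could presumably be swapped for its native dataclass serialization, but the stdlib path above needs no extra dependency.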

Pattern 2: Timing Context Manager

What: Consistent high-precision timing with context manager
When to use: Wrapping any search operation
Example:

# Source: Python time module, PEP 418
from contextlib import contextmanager
import time

@contextmanager
def measure_latency():
    """Context manager for high-precision timing."""
    start = time.perf_counter()
    result = {'latency_ms': 0.0}
    try:
        yield result
    finally:
        end = time.perf_counter()
        result['latency_ms'] = (end - start) * 1000  # Convert to ms
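
A minimal usage sketch for the context manager above. The manager is repeated so the snippet runs on its own; `fake_search` is a hypothetical stand-in for a real search call and simply sleeps for about 10 ms.

```python
import time
from contextlib import contextmanager

@contextmanager
def measure_latency():
    """Same timing pattern as above, repeated so this snippet is self-contained."""
    start = time.perf_counter()
    result = {"latency_ms": 0.0}
    try:
        yield result
    finally:
        result["latency_ms"] = (time.perf_counter() - start) * 1000

def fake_search(query: str) -> list[str]:
    """Hypothetical stand-in for a real search call."""
    time.sleep(0.01)
    return ["6801"]

with measure_latency() as timing:
    hits = fake_search("office supplies")

# The dict is only populated when the with-block exits, so read it afterwards
latency_ms = timing["latency_ms"]
```

Because the value is written in the `finally` clause, the latency is recorded even when the wrapped call raises, which keeps failed queries visible in the timing data.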

Pattern 3: Progress Streaming for Batch Operations

What: Stream progress updates during long-running benchmarks
When to use: Batch benchmark runs that take >10 seconds
Example:

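# Source: Flask streaming documentation
# Assumes app, get_test_count() and run_benchmark_iterator() are defined elsewhere in the project
from dataclasses import asdict
from flask import Response, stream_with_context
import json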

def benchmark_stream():
    """Generator for streaming benchmark progress."""
    total = get_test_count()
    for i, result in enumerate(run_benchmark_iterator()):
        progress = {
            'current': i + 1,
            'total': total,
            'percent': round((i + 1) / total * 100, 1),
            'last_result': asdict(result)
        }
        yield f"data: {json.dumps(progress)}\n\n"

@app.route('/benchmark/stream')
def benchmark_sse():
    """Server-Sent Events endpoint for benchmark progress."""
    return Response(
        stream_with_context(benchmark_stream()),
        mimetype='text/event-stream'
    )

Anti-Patterns to Avoid

Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Percentile calculation | Custom implementation | statistics.quantiles(data, n=100)[94] for p95 | Edge cases with small samples |
| JSON serialization | Manual dict conversion | dataclasses.asdict() or orjson | Handles nested structures correctly |
| Progress streaming | WebSocket implementation | Flask SSE with generators | Much simpler for unidirectional updates |
| Query timing | time.time() | time.perf_counter() | time.time() can jump backward on clock sync |

Key insight: The evaluation logic itself is simple arithmetic. The complexity is in data flow (iterating test set, aggregating results, streaming progress) not in the metrics math.

Common Pitfalls

Pitfall 1: Cold Start Affecting Latency

What goes wrong: First query takes much longer due to model loading, connection warming
Why it happens: Lazy initialization of embedding models, database connections
How to avoid: Run 1-2 warmup queries before timing; exclude warmup from statistics
Warning signs: First query 5-10x slower than subsequent queries
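
One way the warmup advice could look in code is sketched below. `timed_runs` and its parameters are hypothetical names, and `search_fn` stands for whatever callable wraps a single search; the warmup calls are executed but neither timed nor recorded.

```python
import time
from typing import Callable

def timed_runs(search_fn: Callable[[str], object],
               queries: list[str],
               warmup: int = 2) -> list[float]:
    """Run `warmup` untimed queries, then return per-query latency in ms."""
    # Warmup: trigger lazy model loading / connection setup, discard everything
    for q in queries[:warmup]:
        search_fn(q)

    latencies_ms: list[float] = []
    for q in queries:
        start = time.perf_counter()
        search_fn(q)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms
```

The warmup queries are re-run in the timed loop, so the statistics still cover the full query list while excluding the cold-start cost.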

Pitfall 2: Rate Limiting During Batch Benchmark

What goes wrong: API rate limits cause failures mid-benchmark
Why it happens: Rapid sequential API calls to Jina, Google, Gemini
How to avoid: Add configurable delays between API calls; handle 429 with exponential backoff
Warning signs: Sporadic HTTP 429 errors, incomplete benchmark runs
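
The backoff idea can be sketched as follows. `RateLimitError` is a hypothetical exception that a client wrapper would raise on HTTP 429 (the real Jina, Google, and Gemini clients signal rate limits in their own ways), and the retry count and base delay are placeholder defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class RateLimitError(Exception):
    """Hypothetical exception raised by a client wrapper on HTTP 429."""

def call_with_backoff(fn: Callable[[], T],
                      max_retries: int = 5,
                      base_delay_s: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(), retrying on RateLimitError with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted; surface the error to the caller
            # Delays of 1s, 2s, 4s, ... plus up to 10% random jitter
            delay = base_delay_s * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter is injectable so the retry logic can be exercised in tests without actually waiting.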

Pitfall 3: Memory Growth During Large Benchmarks

What goes wrong: OOM when storing all SearchResult objects for 1000+ queries
Why it happens: Accumulating result objects in list
How to avoid: Aggregate incrementally; only keep running statistics
Warning signs: Memory usage growing linearly with test set size
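
A possible shape for incremental aggregation is sketched below. `RunningStats` is a hypothetical helper, not part of the existing codebase. It keeps counters instead of full SearchResult objects; it still stores one float per query so that exact percentiles remain computable, which is a deliberate trade-off rather than constant memory.

```python
from dataclasses import dataclass, field

@dataclass
class RunningStats:
    """Hypothetical incremental aggregator; keeps counters, not result objects."""
    count: int = 0
    exact_hits: int = 0
    latency_sum_ms: float = 0.0
    # One float per query, retained only so percentiles can be computed later
    latencies_ms: list[float] = field(default_factory=list)

    def update(self, is_exact_match: bool, latency_ms: float) -> None:
        """Fold one query outcome into the running totals."""
        self.count += 1
        self.exact_hits += int(is_exact_match)
        self.latency_sum_ms += latency_ms
        self.latencies_ms.append(latency_ms)

    @property
    def exact_match(self) -> float:
        return self.exact_hits / self.count if self.count else 0.0

    @property
    def latency_mean_ms(self) -> float:
        return self.latency_sum_ms / self.count if self.count else 0.0

stats = RunningStats()
stats.update(is_exact_match=True, latency_ms=10.0)
stats.update(is_exact_match=False, latency_ms=30.0)
```

One such aggregator per model would let the benchmark loop discard each SearchResult right after calling `update`.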

Pitfall 4: Inconsistent Ground Truth Comparison

What goes wrong: "6801" != "6801.0" fails exact match
Why it happens: Mixed string/numeric types from database
How to avoid: Normalize both predicted and ground truth to same format before comparison
Warning signs: Low accuracy despite visually correct predictions
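
A minimal normalization sketch, assuming account and cost center codes are digit strings that may have picked up a trailing ".0" from a float round-trip. `normalize_code` is a hypothetical helper; the actual code format in the database may call for different rules.

```python
from typing import Optional

def normalize_code(value: object) -> Optional[str]:
    """Return a canonical string for an account or cost center code, or None if missing."""
    if value is None:
        return None
    text = str(value).strip()
    if not text:
        return None
    # "6801.0" (float round-trip) becomes "6801"; codes such as "6801.5"
    # or "A-12" are left untouched, and leading zeros are preserved
    if text.endswith(".0") and text[:-2].isdigit():
        text = text[:-2]
    return text
```

Applying the same function to both the prediction and the ground truth before comparing keeps the exact-match check symmetric.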

Pitfall 5: LLM Returns Different Format Than pgvector

What goes wrong: LLM returns single prediction vs pgvector returns top-K list
Why it happens: Different result structures by design
How to avoid: Normalize result handling; LLM gets only exact match (top-1), pgvector gets top-K
Warning signs: Comparing apples to oranges in accuracy calculations
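
One way to keep the two result shapes comparable is sketched below, assuming the model labels used elsewhere in this document ('google', 'jina', 'minilm', 'llm'). Both helper names are hypothetical.

```python
from typing import Optional

def as_ranked_list(prediction: Optional[str]) -> list[str]:
    """Wrap a single LLM prediction as a one-element ranked list (empty if None)."""
    return [prediction] if prediction is not None else []

def supported_metrics(model: str) -> list[str]:
    """Metrics that are meaningful per approach, assuming only the LLM returns one answer."""
    if model == "llm":
        return ["exact_match"]
    return ["exact_match", "top_3", "top_5", "top_10"]
```

With this, the dashboard could leave the top-K cells empty for the LLM column instead of showing a number that only looks comparable.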

Code Examples

Verified patterns from official sources:

High-Precision Latency Measurement

# Source: https://docs.python.org/3/library/time.html#time.perf_counter
import time

def search_with_timing(conn, query: str, k: int = 5) -> tuple[dict, float]:
    """Execute search and return results with latency in ms."""
    start = time.perf_counter()
    results = search_all_models(conn, query, k)
    end = time.perf_counter()
    latency_ms = (end - start) * 1000
    return results, latency_ms

Top-K Accuracy Calculation

# Source: Standard IR metrics (Pinecone, Weaviate docs)
def calculate_top_k_accuracy(results: list[SearchResult], k: int) -> float:
    """
    Calculate accuracy where correct answer appears in top-K.

    Args:
        results: List of SearchResult with top_k_debit_accounts and ground_truth
        k: Number of top results to check

    Returns:
        Accuracy as float 0-1
    """
    if not results:
        return 0.0

    hits = 0
    for r in results:
        ground_truth = str(r.ground_truth_debit_account).strip()
        top_k = [str(x).strip() for x in r.top_k_debit_accounts[:k]]
        if ground_truth in top_k:
            hits += 1

    return hits / len(results)
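
The exact-match metrics (EVAL-03, EVAL-04) follow the same pattern. The sketch below works on plain (predicted, ground_truth) pairs so it stays independent of the result classes. Skipping rows without ground truth is an assumption made here for illustration, since cost center handling is still listed as an open question.

```python
from typing import Optional, Sequence

def exact_match_accuracy(
    pairs: Sequence[tuple[Optional[str], Optional[str]]]
) -> float:
    """
    Fraction of pairs where the prediction equals the ground truth.

    pairs: (predicted, ground_truth) tuples. Rows whose ground truth is None
    are skipped (assumption); a missing prediction counts as a miss.
    """
    scored = [(p, g) for p, g in pairs if g is not None]
    if not scored:
        return 0.0
    hits = sum(
        1 for p, g in scored
        if p is not None and str(p).strip() == str(g).strip()
    )
    return hits / len(scored)
```

For GL accounts the pairs would come from predicted_debit_account and ground_truth_debit_account; for cost centers, from predicted_cost_center and ground_truth_cost_center.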

Aggregate Statistics Calculation

# Source: Python statistics module
import statistics

def calculate_latency_stats(latencies_ms: list[float]) -> dict:
    """Calculate latency statistics (all values in ms)."""
    if not latencies_ms:
        # Same keys as the populated result, so callers can rely on them
        return {'mean': 0, 'median': 0, 'p95': 0, 'min': 0, 'max': 0, 'stdev': 0}

    n = len(latencies_ms)

    # statistics.quantiles needs at least 2 data points; method='inclusive'
    # interpolates within the observed range, so p95 never exceeds max
    if n >= 2:
        p95 = statistics.quantiles(latencies_ms, n=100, method='inclusive')[94]
    else:
        p95 = latencies_ms[0]

    return {
        'mean': statistics.mean(latencies_ms),
        'median': statistics.median(latencies_ms),
        'p95': p95,
        'min': min(latencies_ms),
        'max': max(latencies_ms),
        'stdev': statistics.stdev(latencies_ms) if n > 1 else 0,
    }

Cost Tracking by Model

# Source: Gemini API pricing docs, Jina pricing page
# Prices as of 2026-02
PRICING = {
    'gemini_flash': {
        'input_per_1m': 0.30,   # USD per 1M input tokens
        'output_per_1m': 2.50,  # USD per 1M output tokens
    },
    'google_embedding': {
        'input_per_1m': 0.15,   # USD per 1M tokens
    },
    'jina_embedding': {
        'input_per_1m': 0.02,   # USD per 1M tokens (approximate)
    },
    'minilm': {
        'input_per_1m': 0.0,    # Local model, no API cost
    },
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int = 0) -> float:
    """Calculate USD cost for API call."""
    pricing = PRICING.get(model, {})
    input_cost = (input_tokens / 1_000_000) * pricing.get('input_per_1m', 0)
    output_cost = (output_tokens / 1_000_000) * pricing.get('output_per_1m', 0)
    return input_cost + output_cost
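
A short worked example of the cost arithmetic, using the Gemini Flash rates from the table above. The function is repeated in compact form so the snippet runs on its own; the token counts and the batch size of 500 are hypothetical.

```python
PRICING = {
    "gemini_flash": {"input_per_1m": 0.30, "output_per_1m": 2.50},
    "jina_embedding": {"input_per_1m": 0.02},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int = 0) -> float:
    """Compact copy of the function above: USD cost for one API call."""
    pricing = PRICING.get(model, {})
    return ((input_tokens / 1_000_000) * pricing.get("input_per_1m", 0)
            + (output_tokens / 1_000_000) * pricing.get("output_per_1m", 0))

# Hypothetical token counts for a single LLM matching call
llm_cost = calculate_cost("gemini_flash", input_tokens=1_200, output_tokens=150)
# input:  1,200 / 1,000,000 * 0.30 = 0.00036 USD
# output:   150 / 1,000,000 * 2.50 = 0.000375 USD
# total:                             0.000735 USD

# Scaling to a hypothetical batch of 500 identical queries
batch_cost = llm_cost * 500  # about 0.37 USD
```

Note that an unrecognized model name falls back to an empty pricing dict and therefore costs 0, so a typo in the model key would silently under-report spend; raising on unknown keys may be the safer choice.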

State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| time.time() | time.perf_counter() | Python 3.3+ | Monotonic, higher precision timing |
| Custom JSON export | orjson with dataclass support | 2024+ | 10x faster, native dataclass handling |
| Polling for progress | Server-Sent Events | 2020s | Real-time updates without WebSocket complexity |
| Flask sync only | Flask 2.0+ async support | 2021 | Async routes available but not needed here |

Deprecated/outdated:

Open Questions

  1. Token counting for Gemini/Jina API calls

  2. Test set size vs benchmark duration

  3. Cost center accuracy handling

Sources

Primary (HIGH confidence)

Secondary (MEDIUM confidence)

Tertiary (LOW confidence)

Metadata

Confidence breakdown:

Research date: 2026-02-20
Valid until: 2026-03-20 (30 days - stable domain, no fast-moving dependencies)