Phase 4: Evaluation & Dashboard - Research

Researched: 2026-02-20
Domain: Metrics calculation, batch benchmarking, Flask dashboard, cost tracking
Confidence: HIGH

Summary

This phase focuses on quantitative evaluation of the 4 search approaches (3 embedding models + LLM baseline) with metrics calculation, batch benchmarking, and an interactive dashboard. The existing codebase already has Flask search UI, pgvector search, LLM matching, test set infrastructure, and query variations -- this phase adds measurement, aggregation, and comparison layers.

The technical stack is straightforward: Python's time.perf_counter() for high-precision latency measurement, simple accuracy calculations (exact match and top-K), dataclasses for structured results, and extending the existing Flask app with new routes. No complex libraries needed -- sklearn's accuracy_score could be used but manual calculation is simpler for this use case. Cost tracking requires storing token counts per API call and applying known pricing rates.

Primary recommendation: Build a modular metrics calculator with dataclass-based result structures, batch benchmark runner with progress streaming, and extend the existing Flask app with comparison and benchmark routes.

<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

None -- all implementation decisions delegated to Claude.

Claude's Discretion

User delegated all implementation decisions. Claude has full flexibility on:

Comparison view:

Side-by-side layout design for 4 approaches
How to highlight differences between results
Result card design and information density

Metrics display:

Tables vs charts vs combination
Grouping and emphasis of accuracy, latency, cost
Aggregate statistics presentation

Batch benchmark UX:

Progress indication during benchmark runs
Results presentation format
Any export functionality

Dashboard interaction:

Query input flow and form design
Filtering and navigation patterns
View organization (single page vs tabs vs routes)

Deferred Ideas (OUT OF SCOPE)

None -- discussion stayed within phase scope.

</user_constraints>

<phase_requirements>

Phase Requirements

ID	Description	Research Support
SRCH-04	Side-by-side comparison view across all 3 embedding models	Existing Flask template uses 4-column grid; extend to include metric annotations and highlighting
EVAL-03	Exact match accuracy for GL account assignment	Simple calculation: (correct predictions / total predictions); compare top-1 result debit_account to ground truth
EVAL-04	Exact match accuracy for cost center assignment	Same as EVAL-03 but comparing cost_center field
EVAL-05	Top-K accuracy (correct answer in top 3/5/10)	Check if ground truth appears anywhere in top-K results; standard IR metric
EVAL-06	Latency measurement per query (ms)	Use `time.perf_counter()` for high-precision timing around search calls
EVAL-07	Batch benchmarking over full test set	Iterate test_query_variation table, run all 4 approaches, aggregate statistics
EVAL-09	Cost tracking per query (tokens, API calls, estimated $)	Track token counts from API responses; apply pricing: Gemini Flash $0.30/$2.50 per 1M, Jina ~$0.02 per 1M, Google embeddings $0.15 per 1M
REPT-01	Interactive HTML dashboard for search exploration	Extend existing Flask app with benchmark routes; use Bootstrap grid for comparison view
</phase_requirements>

Standard Stack

Core

Library	Version	Purpose	Why Standard
Flask	3.x (existing)	Web framework	Already in use; simple extension with new routes
dataclasses	stdlib	Structured results	Zero-dependency, clean typing, JSON-serializable with `asdict()`
time.perf_counter	stdlib	Latency measurement	Highest-resolution timer; monotonic (unaffected by system clock adjustments)
statistics	stdlib	Aggregate calculations	mean, stdev, median built-in

Supporting

Library	Version	Purpose	When to Use
orjson	3.x	Fast JSON export	Benchmark result export; native dataclass support
tqdm	existing	Progress bars	CLI progress during batch runs

Alternatives Considered

Instead of	Could Use	Tradeoff
Manual accuracy	sklearn.metrics.accuracy_score	Adds dependency for simple division; not worth it
Server-Sent Events	Polling	SSE more complex but real-time; polling simpler for batch progress
Complex charting	HTML tables	Charts require JS libraries; tables sufficient for spike

Installation:

pip install orjson  # Only if JSON export needed
# All other dependencies already installed

Architecture Patterns

Recommended Project Structure

src/
├── evaluation/
│   ├── __init__.py
│   ├── train_test_split.py     # existing
│   ├── query_variations.py     # existing
│   ├── metrics.py              # NEW: accuracy, top-k calculation
│   ├── benchmark.py            # NEW: batch runner with timing
│   └── cost_tracker.py         # NEW: token/cost aggregation
├── search/
│   ├── pgvector_search.py      # existing (add timing)
│   └── llm_matching.py         # existing (add token tracking)
├── templates/
│   ├── search.html             # existing
│   ├── benchmark.html          # NEW: batch benchmark UI
│   └── results.html            # NEW: aggregate results view
└── app.py                      # extend with benchmark routes

Pattern 1: Dataclass Result Structures

What: Use dataclasses for all metric/result structures
When to use: Any structured data that needs to be aggregated or serialized
Example:

# Source: Python dataclasses stdlib + orjson compatibility
from dataclasses import dataclass, asdict, field
from typing import Optional
import time

@dataclass
class SearchResult:
    """Single search result with timing."""
    model: str  # 'google', 'jina', 'minilm', 'llm'
    query_id: int
    latency_ms: float
    predicted_debit_account: Optional[str]
    predicted_cost_center: Optional[str]
    ground_truth_debit_account: str
    ground_truth_cost_center: Optional[str]
    top_k_debit_accounts: list[str] = field(default_factory=list)
    similarity_scores: list[float] = field(default_factory=list)
    # Cost tracking
    tokens_input: int = 0
    tokens_output: int = 0
    api_calls: int = 0

@dataclass
class BenchmarkResults:
    """Aggregated benchmark statistics."""
    model: str
    total_queries: int
    exact_match_gl: float  # 0-1
    exact_match_cc: float  # 0-1
    top_3_accuracy: float
    top_5_accuracy: float
    top_10_accuracy: float
    latency_mean_ms: float
    latency_p50_ms: float
    latency_p95_ms: float
    total_tokens: int
    total_cost_usd: float
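
As a usage sketch, result structures like these can be exported with the standard library alone. The trimmed `QueryResult` class and its sample values are hypothetical stand-ins for the fuller `SearchResult` above, kept short so the round trip is easy to follow.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class QueryResult:
    """Hypothetical, trimmed-down stand-in for SearchResult."""
    model: str
    query_id: int
    latency_ms: float
    predicted_debit_account: Optional[str]
    top_k_debit_accounts: list[str] = field(default_factory=list)


result = QueryResult(
    model="minilm",
    query_id=1,
    latency_ms=12.5,
    predicted_debit_account="6801",
    top_k_debit_accounts=["6801", "6805"],
)

# asdict() recursively converts the dataclass into plain dicts and lists,
# which json.dumps can serialize without a custom encoder.
payload = json.dumps(asdict(result))

# Round trip: the JSON keys match the field names, so the dict can be
# unpacked straight back into the dataclass.
restored = QueryResult(**json.loads(payload))
```

If export speed ever matters, orjson could replace `json.dumps(asdict(...))`, but for a spike the stdlib path avoids the extra dependency.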

Pattern 2: Timing Context Manager

What: Consistent high-precision timing with context manager
When to use: Wrapping any search operation
Example:

# Source: Python time module, PEP 418
from contextlib import contextmanager
import time

@contextmanager
def measure_latency():
    """Context manager for high-precision timing."""
    start = time.perf_counter()
    result = {'latency_ms': 0.0}
    try:
        yield result
    finally:
        end = time.perf_counter()
        result['latency_ms'] = (end - start) * 1000  # Convert to ms
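
A short usage sketch for the context manager above. `fake_search` is a hypothetical placeholder for a real search call; the point is that the latency is only filled in once the `with` block has exited.

```python
import time
from contextlib import contextmanager


@contextmanager
def measure_latency():
    """Yield a dict whose 'latency_ms' is filled in when the block exits."""
    start = time.perf_counter()
    result = {"latency_ms": 0.0}
    try:
        yield result
    finally:
        result["latency_ms"] = (time.perf_counter() - start) * 1000


def fake_search(query: str) -> list[str]:
    """Hypothetical placeholder for a real search call."""
    time.sleep(0.01)  # simulate roughly 10 ms of work
    return ["6801", "6805"]


with measure_latency() as timing:
    hits = fake_search("office supplies")

# Read the value after the with-block; inside the block it is still 0.0.
latency_ms = timing["latency_ms"]
```

Because the measurement happens in `finally`, the latency is recorded even when the wrapped call raises, which helps when failed queries should still count toward timing statistics.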

Pattern 3: Progress Streaming for Batch Operations

What: Stream progress updates during long-running benchmarks
When to use: Batch benchmark runs that take >10 seconds
Example:

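# Source: Flask streaming documentation
# Assumes `app` is the existing Flask instance; get_test_count() and
# run_benchmark_iterator() are placeholder helpers not defined here.
from dataclasses import asdict
from flask import Response, stream_with_context
import json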

def benchmark_stream():
    """Generator for streaming benchmark progress."""
    total = get_test_count()
    for i, result in enumerate(run_benchmark_iterator()):
        progress = {
            'current': i + 1,
            'total': total,
            'percent': round((i + 1) / total * 100, 1),
            'last_result': asdict(result)
        }
        yield f"data: {json.dumps(progress)}\n\n"

@app.route('/benchmark/stream')
def benchmark_sse():
    """Server-Sent Events endpoint for benchmark progress."""
    return Response(
        stream_with_context(benchmark_stream()),
        mimetype='text/event-stream'
    )

Anti-Patterns to Avoid

Mixing timing with business logic: Keep perf_counter() calls at the outermost layer, not inside search functions
Storing raw results in memory: For large test sets, aggregate incrementally instead of keeping all SearchResult objects
Blocking UI during benchmarks: Use SSE or polling; never block the main thread for >30s operations

Don't Hand-Roll

Problem	Don't Build	Use Instead	Why
Percentile calculation	Custom implementation	`statistics.quantiles(data, n=100)[94]` for p95	Edge cases with small samples
JSON serialization	Manual dict conversion	`dataclasses.asdict()` or orjson	Handles nested structures correctly
Progress streaming	WebSocket implementation	Flask SSE with generators	Much simpler for unidirectional updates
Query timing	`time.time()`	`time.perf_counter()`	`time.time()` can jump backward on clock sync

Key insight: The evaluation logic itself is simple arithmetic. The complexity is in data flow (iterating test set, aggregating results, streaming progress) not in the metrics math.

Common Pitfalls

Pitfall 1: Cold Start Affecting Latency

What goes wrong: First query takes much longer due to model loading, connection warming
Why it happens: Lazy initialization of embedding models, database connections
How to avoid: Run 1-2 warmup queries before timing; exclude warmup from statistics
Warning signs: First query 5-10x slower than subsequent queries
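
One way to keep warmup out of the statistics is sketched below. The function name `timed_runs` and its `warmup` parameter are illustrative assumptions, and `search` stands for whichever search callable is being measured.

```python
import time
from typing import Callable


def timed_runs(search: Callable[[str], object], queries: list[str],
               warmup: int = 2) -> list[float]:
    """Run `warmup` untimed queries first, then time every query in ms."""
    # Warmup: reuse the first queries so lazy initialization (model load,
    # connection setup) happens before any measurement is taken.
    for query in queries[:warmup]:
        search(query)

    latencies_ms: list[float] = []
    for query in queries:
        start = time.perf_counter()
        search(query)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms
```

Only the timed pass contributes to the returned list, so the mean and p95 are not skewed by the first slow call.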

Pitfall 2: Rate Limiting During Batch Benchmark

What goes wrong: API rate limits cause failures mid-benchmark
Why it happens: Rapid sequential API calls to Jina, Google, Gemini
How to avoid: Add configurable delays between API calls; handle 429 with exponential backoff
Warning signs: Sporadic HTTP 429 errors, incomplete benchmark runs
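
A minimal sketch of exponential backoff, under the assumption that the API client wrapper raises a dedicated exception on HTTP 429. `RateLimitError` and `call_with_backoff` are hypothetical names; the real exception type depends on the client library in use.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error a client wrapper raises on HTTP 429."""


def call_with_backoff(call: Callable[[], T], max_retries: int = 5,
                      base_delay_s: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `call` on RateLimitError, doubling the delay each attempt."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # give up after the final retry
            # 1s, 2s, 4s, ... plus up to 10% jitter so parallel workers
            # do not retry in lockstep.
            delay = base_delay_s * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the retry logic can be tested without real waiting.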

Pitfall 3: Memory Growth During Large Benchmarks

What goes wrong: OOM when storing all SearchResult objects for 1000+ queries
Why it happens: Accumulating result objects in list
How to avoid: Aggregate incrementally; only keep running statistics
Warning signs: Memory usage growing linearly with test set size
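
Incremental aggregation could look roughly like the sketch below. `RunningMetrics` is a hypothetical accumulator, not part of the existing codebase. It keeps counters plus one float per query (needed for percentiles) instead of full result objects.

```python
from dataclasses import dataclass, field


@dataclass
class RunningMetrics:
    """Hypothetical accumulator: counters instead of full result objects."""
    total: int = 0
    exact_hits: int = 0
    top_k_hits: int = 0
    # One float per query is cheap and still allows percentile calculation.
    latencies_ms: list[float] = field(default_factory=list)

    def update(self, predicted: str, top_k: list[str],
               ground_truth: str, latency_ms: float) -> None:
        """Fold one query outcome into the running totals."""
        self.total += 1
        self.exact_hits += int(predicted == ground_truth)
        self.top_k_hits += int(ground_truth in top_k)
        self.latencies_ms.append(latency_ms)

    @property
    def exact_match(self) -> float:
        return self.exact_hits / self.total if self.total else 0.0

    @property
    def top_k_accuracy(self) -> float:
        return self.top_k_hits / self.total if self.total else 0.0


metrics = RunningMetrics()
metrics.update("6801", ["6801", "6805"], "6801", 12.0)  # exact hit
metrics.update("6805", ["6805", "6801"], "6801", 15.0)  # top-k hit only
metrics.update("7000", ["7000", "7100"], "6801", 9.0)   # miss
```

One accumulator per model keeps memory roughly constant apart from the latency list, which for a few thousand queries is negligible.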

Pitfall 4: Inconsistent Ground Truth Comparison

What goes wrong: "6801" != "6801.0" fails exact match
Why it happens: Mixed string/numeric types from database
How to avoid: Normalize both predicted and ground truth to same format before comparison
Warning signs: Low accuracy despite visually correct predictions
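
A possible normalization helper is sketched here, assuming account and cost-center codes are integer-like identifiers. The name `normalize_code` is hypothetical, and the rule for stripping a trailing `.0` should be checked against the actual code formats in the database.

```python
from typing import Optional


def normalize_code(value: object) -> Optional[str]:
    """Normalize an account or cost-center code to a canonical string.

    Assumes codes are integer-like identifiers, so "6801", 6801 and
    "6801.0" should all compare equal. None stays None.
    """
    if value is None:
        return None
    text = str(value).strip()
    # Drop a trailing ".0" left over from float conversion (e.g. 6801.0),
    # but only when the remainder is purely numeric.
    if text.endswith(".0") and text[:-2].isdigit():
        text = text[:-2]
    return text
```

Applying the same helper to both the prediction and the ground truth before comparing avoids type-driven mismatches without touching codes that are not numeric.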

Pitfall 5: LLM Returns Different Format Than pgvector

What goes wrong: LLM returns single prediction vs pgvector returns top-K list
Why it happens: Different result structures by design
How to avoid: Normalize result handling; LLM gets only exact match (top-1), pgvector gets top-K
Warning signs: Comparing apples to oranges in accuracy calculations
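
One way to avoid the mismatch is to report top-K accuracy as "not applicable" for approaches that return a single prediction, rather than computing a misleading number. The function name and the `supports_top_k` flag below are illustrative assumptions.

```python
from typing import Optional


def top_k_accuracy_or_none(ground_truths: list[str],
                           ranked_lists: list[list[str]],
                           k: int,
                           supports_top_k: bool) -> Optional[float]:
    """Return top-K accuracy, or None when the approach yields one prediction."""
    if not supports_top_k:
        return None  # report "n/a" rather than a misleading number
    if not ground_truths:
        return 0.0
    hits = sum(truth in ranked[:k]
               for truth, ranked in zip(ground_truths, ranked_lists))
    return hits / len(ground_truths)
```

A dashboard table can then render `None` as a dash for the LLM column while still showing exact match for all four approaches.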

Code Examples

Verified patterns from official sources:

High-Precision Latency Measurement

# Source: https://docs.python.org/3/library/time.html#time.perf_counter
import time

def search_with_timing(conn, query: str, k: int = 5) -> tuple[dict, float]:
    """Execute search and return results with latency in ms."""
    start = time.perf_counter()
    results = search_all_models(conn, query, k)
    end = time.perf_counter()
    latency_ms = (end - start) * 1000
    return results, latency_ms

Top-K Accuracy Calculation

# Source: Standard IR metrics (Pinecone, Weaviate docs)
def calculate_top_k_accuracy(results: list[SearchResult], k: int) -> float:
    """
    Calculate accuracy where correct answer appears in top-K.

    Args:
        results: List of SearchResult with top_k_debit_accounts and ground_truth_debit_account
        k: Number of top results to check

    Returns:
        Accuracy as float 0-1
    """
    if not results:
        return 0.0

    hits = 0
    for r in results:
        ground_truth = str(r.ground_truth_debit_account).strip()
        top_k = [str(x).strip() for x in r.top_k_debit_accounts[:k]]
        if ground_truth in top_k:
            hits += 1

    return hits / len(results)

Aggregate Statistics Calculation

# Source: Python statistics module
import statistics

def calculate_latency_stats(latencies_ms: list[float]) -> dict:
    """Calculate latency statistics."""
    if not latencies_ms:
        return {'mean': 0, 'median': 0, 'p95': 0, 'min': 0, 'max': 0, 'stdev': 0}

    n = len(latencies_ms)
    # statistics.quantiles needs at least two points; method='inclusive'
    # keeps the estimate within the observed min/max for small samples.
    p95 = (
        statistics.quantiles(latencies_ms, n=100, method='inclusive')[94]
        if n > 1 else latencies_ms[0]
    )

    return {
        'mean': statistics.mean(latencies_ms),
        'median': statistics.median(latencies_ms),
        'p95': p95,
        'min': min(latencies_ms),
        'max': max(latencies_ms),
        'stdev': statistics.stdev(latencies_ms) if n > 1 else 0,
    }

Cost Tracking by Model

# Source: Gemini API pricing docs, Jina pricing page
# Prices as of 2026-02
PRICING = {
    'gemini_flash': {
        'input_per_1m': 0.30,   # USD per 1M input tokens
        'output_per_1m': 2.50,  # USD per 1M output tokens
    },
    'google_embedding': {
        'input_per_1m': 0.15,   # USD per 1M tokens
    },
    'jina_embedding': {
        'input_per_1m': 0.02,   # USD per 1M tokens (approximate)
    },
    'minilm': {
        'input_per_1m': 0.0,    # Local model, no API cost
    },
}

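def calculate_cost(model: str, input_tokens: int, output_tokens: int = 0) -> float:
    """Calculate USD cost for API call.

    `model` must be a PRICING key (e.g. 'gemini_flash'), not the short
    approach label stored on SearchResult (e.g. 'llm').
    """
    if model not in PRICING:
        # Fail loudly: a silent 0.0 would understate total cost.
        raise ValueError(f"No pricing configured for model: {model!r}")
    pricing = PRICING[model]
    input_cost = (input_tokens / 1_000_000) * pricing.get('input_per_1m', 0)
    output_cost = (output_tokens / 1_000_000) * pricing.get('output_per_1m', 0)
    return input_cost + output_cost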
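
A worked example of the arithmetic, using the Gemini Flash rates from the table above. The per-query token counts (500 input, 50 output) are hypothetical placeholders; real values would come from API usage metadata.

```python
# Rates copied from the PRICING table above (USD per 1M tokens).
GEMINI_FLASH_INPUT_PER_1M = 0.30
GEMINI_FLASH_OUTPUT_PER_1M = 2.50

# Hypothetical workload: 3,600 queries, 500 input and 50 output tokens each.
queries = 3_600
input_tokens = queries * 500    # 1,800,000 tokens
output_tokens = queries * 50    # 180,000 tokens

# 1.8M input tokens at $0.30 per 1M, 0.18M output tokens at $2.50 per 1M.
input_cost = input_tokens / 1_000_000 * GEMINI_FLASH_INPUT_PER_1M
output_cost = output_tokens / 1_000_000 * GEMINI_FLASH_OUTPUT_PER_1M
total_cost_usd = input_cost + output_cost

print(f"${total_cost_usd:.2f}")  # prints $0.99
```

Under these assumptions a full LLM benchmark pass stays around one dollar, so cost is unlikely to constrain how often the benchmark is rerun.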

State of the Art

Old Approach	Current Approach	When Changed	Impact
`time.time()`	`time.perf_counter()`	Python 3.3+	Monotonic, higher precision timing
Custom JSON export	orjson with dataclass support	2024+	10x faster, native dataclass handling
Polling for progress	Server-Sent Events	2020s	Real-time updates without WebSocket complexity
Flask sync only	Flask 2.0+ async support	2021	Async routes available but not needed here

Deprecated/outdated:

time.clock(): Deprecated since Python 3.3 and removed in 3.8; use perf_counter() (or process_time() for CPU time)
Blocking batch operations: Modern UX expects progress feedback

Open Questions

Token counting for Gemini/Jina API calls
- What we know: Gemini responses include usage_metadata with token counts; need to verify Jina response structure
- What's unclear: Exact response field names for token counts in current API versions
- Recommendation: Extract from API response objects during implementation; log if fields missing
Test set size vs benchmark duration
- What we know: ~1200 test items with 3 variations each = ~3600 queries
- What's unclear: Acceptable benchmark duration (10 min? 30 min?)
- Recommendation: Support both "quick" (subset) and "full" benchmark modes
Cost center accuracy handling
- What we know: Some items have no cost center (NULL)
- What's unclear: Should NULL predictions for NULL ground truth count as correct?
- Recommendation: Yes -- NULL == NULL is a correct prediction
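
The NULL-handling recommendation for cost centers can be sketched as a plain exact-match function in which `None == None` counts as a hit. The name `exact_match_accuracy` is illustrative, and inputs are assumed to be already normalized to strings or `None`.

```python
from typing import Optional


def exact_match_accuracy(predicted: list[Optional[str]],
                         ground_truth: list[Optional[str]]) -> float:
    """Fraction of positions where prediction equals ground truth.

    None == None counts as correct, so predicting "no cost center" for an
    item that has none is scored as a hit.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have equal length")
    if not ground_truth:
        return 0.0
    hits = sum(p == g for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)


# Two hits (one of them NULL == NULL), two misses.
accuracy = exact_match_accuracy(
    ["CC1", None, "CC2", None],
    ["CC1", None, "CC3", "CC4"],
)
```

The same function covers GL account accuracy (EVAL-03), since ground-truth debit accounts are never NULL and the `None` branch simply never triggers there.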

Sources

Primary (HIGH confidence)

Python time.perf_counter() documentation - timing precision
Python dataclasses module - result structures
Flask streaming documentation - progress updates
Gemini API pricing - cost tracking ($0.30/$2.50 per 1M for 2.5 Flash)

Secondary (MEDIUM confidence)

Weaviate evaluation metrics - IR metric definitions
Pinecone offline evaluation - top-k accuracy patterns
orjson GitHub - fast JSON with dataclass support

Tertiary (LOW confidence)

Jina pricing (~$0.02 per 1M tokens) - not found in official docs, community reports only
Vertex AI embedding pricing - official page loads dynamically, couldn't extract exact rates

Metadata

Confidence breakdown:

Standard stack: HIGH - all stdlib or existing dependencies
Architecture: HIGH - extending existing Flask app with standard patterns
Metrics: HIGH - well-documented IR metrics with simple implementations
Cost tracking: MEDIUM - Gemini pricing verified, Jina/Google embedding pricing approximate
Pitfalls: HIGH - based on existing codebase patterns and common issues

Research date: 2026-02-20
Valid until: 2026-03-20 (30 days - stable domain, no fast-moving dependencies)