Create interactive Flask dashboard with comparison view and benchmark visualization.

Purpose: Enable side-by-side comparison of all 4 search approaches, with metrics display for data-driven evaluation.

Output: Extended Flask app with benchmark route, comparison UI, and aggregate metrics display.

<execution_context> @./.claude/get-shit-done/workflows/execute-plan.md @./.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @.planning/phases/04-evaluation-dashboard/04-RESEARCH.md @.planning/phases/04-evaluation-dashboard/04-01-SUMMARY.md

@src/app.py @src/templates/search.html @src/evaluation/benchmark.py @src/evaluation/metrics.py

Task 1: Extend Flask app with benchmark route and results view

Files: src/app.py, src/templates/benchmark.html

Update src/app.py to add the benchmark route:
  1. Add imports:

    • from dataclasses import asdict
    • from src.evaluation.benchmark import run_benchmark, fetch_test_queries
  2. Add /benchmark route (GET + POST):

    • GET: Show benchmark form with options (limit, k)
    • POST: Run benchmark with specified parameters
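    • Pass results directly to the template, or store them in the session only if they must survive a redirect (Flask's default session is cookie-based, so large result sets may not fit)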
    • Handle errors gracefully (rate limits, missing API keys)
  3. Add /benchmark/quick route:

    • Runs benchmark with limit=20 for fast feedback
    • Returns results page
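
The parameter handling for the POST branch of /benchmark could be sketched as below. This is a minimal sketch, not the app's actual code: the function name `parse_benchmark_params`, the form field names `limit` and `k`, and the default K of 5 are assumptions; only the limits (default 20, max 3648) and the K choices (3, 5, 10) come from the plan. It accepts any mapping with `.get`, which is how Flask's `request.form` behaves.

```python
from typing import Mapping

MAX_LIMIT = 3648        # upper bound from the form spec
ALLOWED_K = (3, 5, 10)  # K choices offered by the select
DEFAULT_LIMIT = 20      # default from the form spec
DEFAULT_K = 5           # assumption: the plan does not name a default K


def parse_benchmark_params(form: Mapping[str, str]) -> tuple[int, int]:
    """Return (limit, k) from submitted form values.

    Raises ValueError with a user-facing message on bad input, so the
    route can re-render the form with the error instead of crashing.
    """
    raw_limit = form.get("limit", "").strip() or str(DEFAULT_LIMIT)
    raw_k = form.get("k", "").strip() or str(DEFAULT_K)
    try:
        limit = int(raw_limit)
        k = int(raw_k)
    except ValueError:
        raise ValueError("limit and k must be whole numbers") from None
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    if k not in ALLOWED_K:
        raise ValueError(f"k must be one of {ALLOWED_K}")
    return limit, k
```

In the route, a `ValueError` from this helper (and any rate-limit or missing-API-key exception raised by the benchmark run) can be caught and passed to the template as an error message.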

Create src/templates/benchmark.html:

  1. Form section:

    • Number input for limit (default 20, max 3648)
    • Select for K value (3, 5, 10)
    • "Run Quick" button (limit=20)
    • "Run Full" button (full test set, warning about time)
  2. Results section (shown when results exist):

    • Summary cards: Total queries, Total time, Test set coverage
    • Main metrics table with Bootstrap styling: | Model | GL Accuracy | CC Accuracy | Top-3 | Top-5 | Top-10 | Latency (p50) | Latency (p95) | Cost |
    • Format: Accuracy as percentage (85.2%), latency in ms, cost in USD
    • Highlight best accuracy per column with green background
    • Highlight worst latency with yellow background
  3. Use existing Bootstrap CDN from search.html

  4. Navigation: Links to search page and benchmark page in both templates

Verify:

    • `python -c "from src.app import app; print('Flask app imports OK')"`
    • `curl -s http://localhost:5000/benchmark 2>/dev/null | grep -q "benchmark" || echo "Start Flask first: python -m src.app"`

Done: The /benchmark route responds with the form. POST /benchmark runs the benchmark and displays the results table. Metrics are displayed as percentages with highlighting.
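
The formatting and highlighting rules for the Task 1 metrics table could be sketched as follows. This is an illustration under assumptions: the row keys (`model`, `gl_accuracy`, `latency_p95`) and helper names are hypothetical, and accuracy is assumed to be stored as a fraction between 0 and 1. `table-success` and `table-warning` are Bootstrap's contextual table classes for green and yellow backgrounds.

```python
def format_accuracy(value: float) -> str:
    """Render a fraction such as 0.852 as '85.2%'."""
    return f"{value * 100:.1f}%"


def format_latency(ms: float) -> str:
    """Render latency in whole milliseconds."""
    return f"{ms:.0f} ms"


def format_cost(usd: float) -> str:
    """Render cost in USD with four decimals (small benchmarks cost fractions of a cent)."""
    return f"${usd:.4f}"


def highlight_classes(rows, accuracy_cols, latency_cols):
    """Map (model, column) -> CSS class for cells that need highlighting.

    Best (highest) value in each accuracy column gets 'table-success';
    worst (highest) value in each latency column gets 'table-warning'.
    Ties highlight every tied cell.
    """
    classes = {}
    for col in accuracy_cols:
        best = max(row[col] for row in rows)
        for row in rows:
            if row[col] == best:
                classes[(row["model"], col)] = "table-success"
    for col in latency_cols:
        worst = max(row[col] for row in rows)
        for row in rows:
            if row[col] == worst:
                classes[(row["model"], col)] = "table-warning"
    return classes
```

Computing the class map in Python keeps the Jinja template to a simple dictionary lookup per cell.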

Task 2: Create side-by-side comparison view with enhanced metrics display

Files: src/app.py, src/templates/comparison.html, src/templates/search.html

Update src/app.py to enhance the comparison view:
  1. Add /compare route:

    • GET: Show comparison form
    • POST: Run search on single query, display all 4 approaches side-by-side with timing
  2. Modify existing search results to include timing per model:

    • Add latency_ms to each model's results dict
    • Display in results cards
  3. Add metrics annotations to search results:

    • Show similarity score prominently
    • Highlight top-1 prediction
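
Adding `latency_ms` to each model's results dict could look like the sketch below. The wrapper name `timed_search` and the assumed search signature `search_fn(query, k)` are hypothetical; the existing search functions in src/app.py may take different arguments.

```python
import time
from typing import Any, Callable


def timed_search(search_fn: Callable[[str, int], Any], query: str, k: int) -> dict:
    """Run one model's search and return its results together with wall-clock latency."""
    start = time.perf_counter()  # monotonic clock, suited to measuring short intervals
    results = search_fn(query, k)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {"results": results, "latency_ms": round(latency_ms, 1)}
```

Calling the wrapper once per model yields a dict per column that the template can read for both the result cards and the latency badge.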

Create src/templates/comparison.html:

  1. Query input form (same as search.html)

  2. 4-column grid layout for results (responsive: 4 cols on lg, 2 on md, 1 on sm):

    • Google column
    • Jina column
    • MiniLM column
    • LLM column
  3. Each column shows:

    • Model name header with latency badge (e.g., "Google - 45ms")
    • Result cards with:
      • Similarity score (for embeddings) or "LLM Prediction"
      • Supplier name
      • Description (truncated to 100 chars)
      • Debit account (highlighted as primary prediction)
      • Cost center
    • For embeddings: Show top-K results as cards
    • For LLM: Show single prediction card
  4. Comparison highlights:

    • If all 4 models agree on debit_account, show green "Consensus" badge
    • If models disagree, show amber "Divergent" badge with unique predictions listed
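
The consensus check described above could be sketched as follows. The function name, the shape of the input (model name mapped to its top-1 `debit_account`), and the returned keys are assumptions for illustration. The sketch also assumes that a model returning no prediction counts as disagreement, which the plan does not specify.

```python
from typing import Optional


def consensus_badge(predictions: dict[str, Optional[str]]) -> dict:
    """Decide which badge to show for a set of top-1 debit_account predictions.

    predictions maps model name -> predicted debit_account, or None when the
    model returned nothing (treated as disagreement; this is an assumption).
    """
    values = list(predictions.values())
    unique = sorted({v for v in values if v is not None})
    all_answered = bool(values) and all(v is not None for v in values)
    if all_answered and len(unique) == 1:
        return {"label": "Consensus", "style": "success", "unique": unique}
    # Amber badge: list the distinct predictions so the template can show them
    return {"label": "Divergent", "style": "warning", "unique": unique}
```

The `style` value is left generic so the template can map it to whichever badge class the Bootstrap version in search.html provides.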

Update src/templates/search.html:

  • Add latency display to each column header
  • Add navigation to comparison and benchmark pages

Verify: `python -c "from src.app import app; print('Flask routes OK')"`

Done: The /compare route shows a 4-column grid with all approaches. Latency is displayed per model. Consensus/divergent badges highlight agreement. search.html is updated with latency and navigation.

Task 3: Verify dashboard functionality

Built: Interactive Flask dashboard with:

  • Search page (/): Query input, 4-column results with latency
  • Comparison page (/compare): Side-by-side view with consensus detection
  • Benchmark page (/benchmark): Run and view aggregate metrics

How to verify: Start the Flask app with `python -m src.app`, open http://localhost:5000, then:

  1. Test search functionality:

    • Enter query: "office supplies | printer paper"
    • Select K=5
    • Verify all 4 columns show results (similarity scores for the embedding models, a single prediction for the LLM)
    • Verify latency displayed in column headers
  2. Navigate to /compare:

    • Enter same query
    • Verify side-by-side layout
    • Check for consensus/divergent badge
  3. Navigate to /benchmark:

    • Click "Run Quick" (limit=20)
    • Wait for completion (~30-60 seconds)
    • Verify results table shows all 4 models
    • Verify accuracy percentages, latency stats, cost estimates

Expected outcome:

  • All pages render without errors
  • Metrics appear reasonable (accuracy 30-90%, latency 50-500ms per query)
  • Cost estimates show for API models ($0.001-0.01 range for small benchmarks)

Type "approved" or describe any issues with the dashboard.

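
The "metrics appear reasonable" check can also be done programmatically. The sketch below is one possible approach, assuming a hypothetical per-model row with `model`, `accuracy` (as a fraction), and `latency_ms` keys; the thresholds are the expected ranges listed above, and values outside them are flagged for a closer look rather than treated as failures.

```python
def sanity_check(row: dict) -> list[str]:
    """Return human-readable warnings for metrics outside the expected ranges."""
    problems = []
    # Expected accuracy range from the plan: 30-90%
    if not 0.30 <= row["accuracy"] <= 0.90:
        problems.append(f"{row['model']}: accuracy {row['accuracy']:.1%} outside 30-90%")
    # Expected per-query latency range from the plan: 50-500 ms
    if not 50.0 <= row["latency_ms"] <= 500.0:
        problems.append(f"{row['model']}: latency {row['latency_ms']:.0f} ms outside 50-500 ms")
    return problems
```

An empty list for every model means the quick benchmark landed inside the expected ranges.
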
After all tasks:

  1. Flask app runs: `python -m src.app`
  2. http://localhost:5000/ shows search with 4-column results
  3. http://localhost:5000/compare shows side-by-side comparison view
  4. http://localhost:5000/benchmark shows form and runs benchmark
  5. Metrics table displays accuracy, latency, cost for all 4 models

<success_criteria>

After completion, create `.planning/phases/04-evaluation-dashboard/04-02-SUMMARY.md`