04-02-PLAN

Create interactive Flask dashboard with comparison view and benchmark visualization.

Purpose: Enable side-by-side comparison of all 4 search approaches with metrics display for data-driven evaluation.

Output: Extended Flask app with benchmark route, comparison UI, and aggregate metrics display.

<execution_context> @./.claude/get-shit-done/workflows/execute-plan.md @./.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @.planning/phases/04-evaluation-dashboard/04-RESEARCH.md @.planning/phases/04-evaluation-dashboard/04-01-SUMMARY.md

@src/app.py @src/templates/search.html @src/evaluation/benchmark.py @src/evaluation/metrics.py

Task 1: Extend Flask app with benchmark route and results view

Files: src/app.py, src/templates/benchmark.html

Update src/app.py to add benchmark route:

Add imports:
- from dataclasses import asdict
- from src.evaluation.benchmark import run_benchmark, fetch_test_queries
Add /benchmark route (GET + POST):
- GET: Show benchmark form with options (limit, k)
- POST: Run benchmark with specified parameters
- Store results in session or pass directly to template
- Handle errors gracefully (rate limits, missing API keys)
Add /benchmark/quick route:
- Runs benchmark with limit=20 for fast feedback
- Returns results page
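The parameter handling and "handle errors gracefully" items can be sketched without touching Flask itself. This is a minimal sketch, not the project's implementation: `parse_benchmark_params` and `run_safely` are hypothetical helper names, the default K of 5 is an assumption (the plan gives no default), and the `limit`/`k` keyword arguments passed to the benchmark function are assumed, since the real `run_benchmark` signature is not shown here. The limit bounds (default 20, max 3648) and the K choices (3, 5, 10) come from the form spec in this plan.

```python
from typing import Any, Callable, Mapping, Optional, Tuple

# Bounds taken from the form spec in this plan.
DEFAULT_LIMIT = 20
MAX_LIMIT = 3648
ALLOWED_K = (3, 5, 10)
DEFAULT_K = 5  # assumption: the plan does not name a default K


def parse_benchmark_params(form: Mapping[str, str]) -> Tuple[int, int]:
    """Return (limit, k) from submitted form values, falling back to defaults."""
    try:
        limit = int(form.get("limit", DEFAULT_LIMIT))
    except (TypeError, ValueError):
        limit = DEFAULT_LIMIT
    limit = max(1, min(limit, MAX_LIMIT))  # clamp into [1, MAX_LIMIT]

    try:
        k = int(form.get("k", DEFAULT_K))
    except (TypeError, ValueError):
        k = DEFAULT_K
    if k not in ALLOWED_K:
        k = DEFAULT_K
    return limit, k


def run_safely(
    run_fn: Callable[..., Any], limit: int, k: int
) -> Tuple[Optional[Any], Optional[str]]:
    """Call run_fn(limit=..., k=...) and return (results, error_message).

    Any exception (for example a rate limit or a missing API key) is turned
    into a message the template can display instead of a 500 page.
    """
    try:
        return run_fn(limit=limit, k=k), None
    except Exception as exc:  # broad on purpose: this is the UI boundary
        return None, f"Benchmark failed: {exc}"
```

In a route handler, something like `limit, k = parse_benchmark_params(request.form)` followed by `results, error = run_safely(run_benchmark, limit, k)` would pass both values to the template; the /benchmark/quick route could call `run_safely` with `limit=20` directly.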

Create src/templates/benchmark.html:

Form section:
- Number input for limit (default 20, max 3648)
- Select for K value (3, 5, 10)
- "Run Quick" button (limit=20)
- "Run Full" button (full test set, warning about time)
Results section (shown when results exist):
- Summary cards: Total queries, Total time, Test set coverage
- Main metrics table with Bootstrap styling: | Model | GL Accuracy | CC Accuracy | Top-3 | Top-5 | Top-10 | Latency (p50) | Latency (p95) | Cost |
- Format: Accuracy as percentage (85.2%), latency in ms, cost in USD
- Highlight best accuracy per column with green background
- Highlight worst latency with yellow background
Use existing Bootstrap CDN from search.html
Navigation: Links to search page and benchmark page in both templates

Verify:
- `python -c "from src.app import app; print('Flask app imports OK')"`
- `curl -s http://localhost:5000/benchmark 2>/dev/null | grep -q "benchmark" || echo "Start Flask first: python -m src.app"`

Done: /benchmark route responds with form. POST /benchmark runs benchmark and displays results table. Metrics displayed as percentages with highlighting.
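The formatting and highlighting rules for the metrics table can be sketched as plain helpers computed before rendering. This is a minimal sketch under assumptions: the row keys (`gl_accuracy`, `latency_p50_ms`, and so on) are hypothetical names, accuracies are assumed to be stored as fractions in [0, 1], and four decimal places for cost is an arbitrary choice. `table-success` and `table-warning` are Bootstrap's contextual table classes for green and yellow backgrounds.

```python
from typing import Dict, List, Tuple

# Hypothetical column keys; the real names depend on the benchmark results schema.
ACCURACY_COLS = ("gl_accuracy", "cc_accuracy", "top_3", "top_5", "top_10")
LATENCY_COLS = ("latency_p50_ms", "latency_p95_ms")


def fmt_pct(fraction: float) -> str:
    """Format a fraction in [0, 1] as a percentage, e.g. 0.852 -> '85.2%'."""
    return f"{fraction * 100:.1f}%"


def fmt_ms(ms: float) -> str:
    """Format a latency in milliseconds with no decimals."""
    return f"{ms:.0f} ms"


def fmt_usd(usd: float) -> str:
    """Format a cost in USD; four decimals keep sub-cent costs visible."""
    return f"${usd:.4f}"


def highlight_classes(rows: List[Dict[str, object]]) -> Dict[Tuple[str, str], str]:
    """Map (model, column) to a CSS class: best accuracy and worst latency per column."""
    classes: Dict[Tuple[str, str], str] = {}
    if not rows:
        return classes
    for col in ACCURACY_COLS:
        best = max(rows, key=lambda r: r[col])  # on ties, the first row wins
        classes[(str(best["model"]), col)] = "table-success"
    for col in LATENCY_COLS:
        worst = max(rows, key=lambda r: r[col])  # highest latency is the worst
        classes[(str(worst["model"]), col)] = "table-warning"
    return classes
```

Computing the class map in Python keeps the Jinja template to a simple lookup per cell, which is easier to test than comparison logic embedded in the template.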

Task 2: Create side-by-side comparison view with enhanced metrics display

Files: src/app.py, src/templates/comparison.html, src/templates/search.html

Update src/app.py to enhance comparison view:

Add /compare route:
- GET: Show comparison form
- POST: Run search on single query, display all 4 approaches side-by-side with timing
Modify existing search results to include timing per model:
- Add latency_ms to each model's results dict
- Display in results cards
Add metrics annotations to search results:
- Show similarity score prominently
- Highlight top-1 prediction
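Per-model timing can be added with a small wrapper around each search call. This is a minimal sketch: `timed_search` is a hypothetical helper, and the `search_fn(query, k)` call shape is an assumption about how the existing search functions are invoked.

```python
import time
from typing import Any, Callable, Dict, List


def timed_search(
    search_fn: Callable[[str, int], List[dict]], query: str, k: int
) -> Dict[str, Any]:
    """Run one model's search and attach wall-clock latency in milliseconds."""
    start = time.perf_counter()  # monotonic clock, suitable for short intervals
    results = search_fn(query, k)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {"results": results, "latency_ms": round(latency_ms, 1)}
```

Wrapping each of the four approaches this way gives every results dict the same `latency_ms` key, so the template can render the latency badge uniformly.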

Create src/templates/comparison.html:

Query input form (same as search.html)
4-column grid layout for results (responsive: 4 cols on lg, 2 on md, 1 on sm):
- Google column
- Jina column
- MiniLM column
- LLM column
Each column shows:
- Model name header with latency badge (e.g., "Google - 45ms")
- Result cards with:
  - Similarity score (for embeddings) or "LLM Prediction"
  - Supplier name
  - Description (truncated to 100 chars)
  - Debit account (highlighted as primary prediction)
  - Cost center
- For embeddings: Show top-K results as cards
- For LLM: Show single prediction card
Comparison highlights:
- If all 4 models agree on debit_account, show green "Consensus" badge
- If models disagree, show amber "Divergent" badge with unique predictions listed
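The consensus check reduces to comparing each model's top-1 debit account. This is a minimal sketch: `agreement_badge` and its return keys are hypothetical, and treating a missing prediction as "Divergent" is an assumption the plan does not specify.

```python
from typing import Dict, List, Optional, Union


def agreement_badge(
    top1_by_model: Dict[str, Optional[str]]
) -> Dict[str, Union[str, List[str]]]:
    """Return badge info from each model's top-1 debit account.

    "Consensus" requires every model to return the same non-empty prediction.
    Anything else, including a model with no prediction (assumption), is "Divergent".
    """
    predictions = [p for p in top1_by_model.values() if p is not None]
    unique = sorted(set(predictions))
    everyone_answered = len(predictions) == len(top1_by_model)
    if top1_by_model and everyone_answered and len(unique) == 1:
        return {"label": "Consensus", "color": "green", "unique": unique}
    return {"label": "Divergent", "color": "amber", "unique": unique}
```

The `unique` list gives the template the distinct predictions to show next to the "Divergent" badge.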

Update src/templates/search.html:

Add latency display to each column header
Add navigation to comparison and benchmark pages

Verify: `python -c "from src.app import app; print('Flask routes OK')"`

Done: /compare route shows 4-column grid with all approaches. Latency displayed per model. Consensus/divergent badges highlight agreement. Search.html updated with latency and navigation.

Task 3: Verify dashboard functionality

Interactive Flask dashboard with:
- Search page (/): Query input, 4-column results with latency
- Comparison page (/compare): Side-by-side view with consensus detection
- Benchmark page (/benchmark): Run and view aggregate metrics

1. Start Flask app: `python -m src.app`
2. Open http://localhost:5000

Test search functionality:
- Enter query: "office supplies | printer paper"
- Select K=5
- Verify all 4 columns show results with similarity scores
- Verify latency displayed in column headers
Navigate to /compare:
- Enter same query
- Verify side-by-side layout
- Check for consensus/divergent badge
Navigate to /benchmark:
- Click "Run Quick" (limit=20)
- Wait for completion (~30-60 seconds)
- Verify results table shows all 4 models
- Verify accuracy percentages, latency stats, cost estimates

Expected outcome:

All pages render without errors
Metrics appear reasonable (accuracy 30-90%, latency 50-500ms per query)
Cost estimates show for API models ($0.001-0.01 range for small benchmarks)

Type "approved" or describe any issues with the dashboard

After all tasks:
1. Flask app runs: `python -m src.app`
2. http://localhost:5000/ shows search with 4-column results
3. http://localhost:5000/compare shows side-by-side comparison view
4. http://localhost:5000/benchmark shows form and runs benchmark
5. Metrics table displays accuracy, latency, cost for all 4 models

<success_criteria>

Side-by-side comparison view shows all 4 approaches (SRCH-04)
Interactive HTML dashboard allows search exploration (REPT-01)
Benchmark results display aggregate metrics (accuracy, latency, cost)
User can adjust K and run benchmarks from the UI
Navigation between search, comparison, and benchmark pages
Human verification confirms functionality

</success_criteria>

After completion, create `.planning/phases/04-evaluation-dashboard/04-02-SUMMARY.md`