Create interactive Flask dashboard with comparison view and benchmark visualization.
Purpose: Enable side-by-side comparison of all 4 search approaches with metrics display for data-driven evaluation.
Output: Extended Flask app with benchmark route, comparison UI, and aggregate metrics display.
<execution_context>
@./.claude/get-shit-done/workflows/execute-plan.md
@./.claude/get-shit-done/templates/summary.md
</execution_context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/04-evaluation-dashboard/04-RESEARCH.md
@.planning/phases/04-evaluation-dashboard/04-01-SUMMARY.md
@src/app.py
@src/templates/search.html
@src/evaluation/benchmark.py
@src/evaluation/metrics.py
Task 1: Extend Flask app with benchmark route and results view
src/app.py, src/templates/benchmark.html
Update src/app.py to add benchmark route:
- Add imports:
  - from dataclasses import asdict
  - from src.evaluation.benchmark import run_benchmark, fetch_test_queries
- Add /benchmark route (GET + POST):
  - GET: Show benchmark form with options (limit, k)
  - POST: Run benchmark with specified parameters
  - Pass results directly to the template; store them in the session only if they stay small, since Flask's default cookie-based session holds roughly 4 KB
  - Handle errors gracefully (rate limits, missing API keys)
- Add /benchmark/quick route:
  - Runs benchmark with limit=20 for fast feedback
  - Returns results page
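The POST handler needs to turn raw form fields into a safe `limit` and `k` before calling the benchmark. A minimal sketch of that validation follows, using only the standard library. The function name `parse_benchmark_params`, the field names `limit` and `k`, and the default K of 5 are assumptions for illustration; the limit bounds and allowed K values come from the form spec in this plan.

```python
from typing import Mapping, Tuple

MAX_LIMIT = 3648        # upper bound from the form spec in this plan
ALLOWED_K = (3, 5, 10)  # K values offered by the form's select
DEFAULT_LIMIT = 20      # default from the form spec
DEFAULT_K = 5           # assumption: the plan does not name a default K


def parse_benchmark_params(form: Mapping[str, str]) -> Tuple[int, int]:
    """Return (limit, k) from submitted form fields, falling back to defaults."""
    try:
        limit = int(form.get("limit", DEFAULT_LIMIT))
    except (TypeError, ValueError):
        limit = DEFAULT_LIMIT
    # Clamp into the valid range instead of rejecting the request.
    limit = max(1, min(limit, MAX_LIMIT))

    try:
        k = int(form.get("k", DEFAULT_K))
    except (TypeError, ValueError):
        k = DEFAULT_K
    if k not in ALLOWED_K:
        k = DEFAULT_K
    return limit, k
```

Inside a Flask view this could be called as `parse_benchmark_params(request.form)`, since `request.form` supports `.get`. Clamping rather than raising keeps a mistyped value from producing an error page; rejecting bad input with a flash message would be an equally valid choice.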
Create src/templates/benchmark.html:
- Form section:
  - Number input for limit (default 20, max 3648)
  - Select for K value (3, 5, 10)
  - "Run Quick" button (limit=20)
  - "Run Full" button (full test set, warning about time)
- Results section (shown when results exist):
  - Summary cards: Total queries, Total time, Test set coverage
  - Main metrics table with Bootstrap styling:
    | Model | GL Accuracy | CC Accuracy | Top-3 | Top-5 | Top-10 | Latency (p50) | Latency (p95) | Cost |
  - Format: Accuracy as percentage (85.2%), latency in ms, cost in USD
  - Highlight best accuracy per column with green background
  - Highlight worst latency with yellow background
- Use existing Bootstrap CDN from search.html
- Navigation: Links to search page and benchmark page in both templates
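The formatting and highlighting rules for the metrics table can be computed in Python before rendering, so the template only loops over ready-made cells. The sketch below is one way to do that. The row keys (`gl_accuracy`, `latency_p50_ms`, `cost_usd`, and so on) are hypothetical and would need to match whatever `src/evaluation/metrics.py` actually returns; `table-success` and `table-warning` are Bootstrap's standard green and yellow contextual table classes.

```python
from typing import Any, Dict, List

# Hypothetical key names; align these with the real metrics output.
ACCURACY_KEYS = ("gl_accuracy", "cc_accuracy", "top_3", "top_5", "top_10")
LATENCY_KEYS = ("latency_p50_ms", "latency_p95_ms")


def format_cell(key: str, value: Any) -> str:
    """Format one metric for display: percent, milliseconds, or USD."""
    if key in ACCURACY_KEYS:
        return f"{value * 100:.1f}%"
    if key in LATENCY_KEYS:
        return f"{value:.0f} ms"
    if key == "cost_usd":
        return f"${value:.4f}"
    return str(value)


def build_table_rows(rows: List[Dict[str, Any]]) -> List[Dict[str, Dict[str, str]]]:
    """Attach display text and a Bootstrap class to every cell."""
    if not rows:
        return []
    # Best accuracy is the column maximum; worst latency is also the maximum.
    best = {k: max(r[k] for r in rows) for k in ACCURACY_KEYS}
    worst = {k: max(r[k] for r in rows) for k in LATENCY_KEYS}
    table = []
    for row in rows:
        cells = {}
        for key, value in row.items():
            css = ""
            if key in best and value == best[key]:
                css = "table-success"   # green background
            elif key in worst and value == worst[key]:
                css = "table-warning"   # yellow background
            cells[key] = {"text": format_cell(key, value), "css": css}
        table.append(cells)
    return table
```

In the template, each cell would then render as `<td class="{{ cell.css }}">{{ cell.text }}</td>`. Ties are highlighted on every tied row, which seems the least surprising behaviour for a small four-model table.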
python -c "from src.app import app; print('Flask app imports OK')"
curl -s http://localhost:5000/benchmark 2>/dev/null | grep -q "benchmark" || echo "Start Flask first: python -m src.app"
/benchmark route responds with form. POST /benchmark runs benchmark and displays results table. Metrics displayed as percentages with highlighting.
Task 2: Create side-by-side comparison view with enhanced metrics display
src/app.py, src/templates/comparison.html, src/templates/search.html
Update src/app.py to enhance comparison view:
- Add /compare route:
  - GET: Show comparison form
  - POST: Run search on single query, display all 4 approaches side-by-side with timing
- Modify existing search results to include timing per model:
  - Add latency_ms to each model's results dict
  - Display in results cards
- Add metrics annotations to search results:
  - Show similarity score prominently
  - Highlight top-1 prediction
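Per-model timing can be added with a thin wrapper around each search call. The following is a sketch under assumptions: `timed_search` and `run_all_models` are hypothetical helper names, and each search function is assumed to take `(query, k)` and return a list of results, which may not match the existing signatures in `src/app.py`.

```python
import time
from typing import Any, Callable, Dict, List

SearchFn = Callable[[str, int], List[Any]]  # assumed signature: (query, k) -> results


def timed_search(search_fn: SearchFn, query: str, k: int) -> Dict[str, Any]:
    """Run one search and record its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    results = search_fn(query, k)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {"results": results, "latency_ms": round(latency_ms, 1)}


def run_all_models(searchers: Dict[str, SearchFn], query: str, k: int) -> Dict[str, Dict[str, Any]]:
    """Return {model_name: {"results": [...], "latency_ms": float}} for every model."""
    return {name: timed_search(fn, query, k) for name, fn in searchers.items()}
```

`time.perf_counter()` is used instead of `time.time()` because it is monotonic and intended for measuring short intervals. Note that this measures the full call, including any network round trip to an embedding or LLM API.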
Create src/templates/comparison.html:
- Query input form (same as search.html)
- 4-column grid layout for results (responsive: 4 cols on lg, 2 on md, 1 on sm):
  - Google column
  - Jina column
  - MiniLM column
  - LLM column
- Each column shows:
  - Model name header with latency badge (e.g., "Google - 45ms")
  - Result cards with:
    - Similarity score (for embeddings) or "LLM Prediction"
    - Supplier name
    - Description (truncated to 100 chars)
    - Debit account (highlighted as primary prediction)
    - Cost center
  - For embeddings: Show top-K results as cards
  - For LLM: Show single prediction card
- Comparison highlights:
  - If all 4 models agree on the top-1 debit_account, show green "Consensus" badge
  - If models disagree, show amber "Divergent" badge with unique predictions listed
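The consensus check reduces to comparing each model's top-1 debit account. A minimal sketch follows; `consensus_badge` is a hypothetical helper, the input is assumed to map model name to its top-1 `debit_account` (or `None` when a model returned nothing), and the `bg-success` / `bg-warning` classes assume Bootstrap 5 badge naming, which should be checked against the CDN version used in search.html.

```python
from typing import Any, Dict, Optional


def consensus_badge(top_predictions: Dict[str, Optional[str]]) -> Dict[str, Any]:
    """Decide between a "Consensus" and a "Divergent" badge.

    top_predictions maps model name -> top-1 debit_account, or None if the
    model produced no prediction.
    """
    accounts = [a for a in top_predictions.values() if a is not None]
    unique = sorted(set(accounts))
    every_model_answered = len(accounts) == len(top_predictions)
    if every_model_answered and len(unique) == 1:
        return {"label": "Consensus", "css": "bg-success", "unique": unique}
    # Disagreement, or at least one model gave no prediction.
    return {"label": "Divergent", "css": "bg-warning", "unique": unique}
```

A missing prediction is treated as divergence here, on the reasoning that a green badge should only appear when all four approaches actually returned the same account. The `unique` list supplies the predictions to display next to the amber badge.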
Update src/templates/search.html:
- Add latency display to each column header
- Add navigation to comparison and benchmark pages
python -c "from src.app import app; print('Flask routes OK')"
/compare route shows 4-column grid with all approaches. Latency displayed per model. Consensus/divergent badges highlight agreement. Search.html updated with latency and navigation.
Task 3: Verify dashboard functionality
Interactive Flask dashboard with:
- Search page (/): Query input, 4-column results with latency
- Comparison page (/compare): Side-by-side view with consensus detection
- Benchmark page (/benchmark): Run and view aggregate metrics
1. Start Flask app: `python -m src.app`
2. Open http://localhost:5000
- Test search functionality:
  - Enter query: "office supplies | printer paper"
  - Select K=5
  - Verify all 4 columns show results (similarity scores for the embedding models, a single prediction for the LLM)
  - Verify latency displayed in column headers
- Navigate to /compare:
  - Enter same query
  - Verify side-by-side layout
  - Check for consensus/divergent badge
- Navigate to /benchmark:
  - Click "Run Quick" (limit=20)
  - Wait for completion (~30-60 seconds)
  - Verify results table shows all 4 models
  - Verify accuracy percentages, latency stats, cost estimates
Expected outcome:
- All pages render without errors
- Metrics appear reasonable (accuracy 30-90%, latency 50-500ms per query)
- Cost estimates are shown for API models ($0.001-0.01 range for small benchmarks)
Type "approved" or describe any issues with the dashboard
After all tasks:
1. Flask app runs: `python -m src.app`
2. http://localhost:5000/ shows search with 4-column results
3. http://localhost:5000/compare shows side-by-side comparison view
4. http://localhost:5000/benchmark shows form and runs benchmark
5. Metrics table displays accuracy, latency, cost for all 4 models
<success_criteria>
- Side-by-side comparison view shows all 4 approaches (SRCH-04)
- Interactive HTML dashboard allows search exploration (REPT-01)
- Benchmark results display aggregate metrics (accuracy, latency, cost)
- User can adjust K and run benchmarks from the UI
- Navigation between search, comparison, and benchmark pages
- Human verification confirms functionality
</success_criteria>