cards for examples
Footer with "Generated by Semantic Search Comparison Spike"
Template should render without JavaScript (pure HTML/CSS for maximum compatibility).
Run test:
cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
source .venv/bin/activate
python -c "
from src.reporting.report_generator import generate_report
# Quick test with minimal data
results = {'google': {'model': 'google', 'total_queries': 10, 'exact_match_gl': 0.8, 'exact_match_cc': 0.0, 'top_3_accuracy': 0.9, 'top_5_accuracy': 0.95, 'top_10_accuracy': 1.0, 'latency_mean_ms': 50, 'latency_p50_ms': 45, 'latency_p95_ms': 100, 'total_tokens': 0, 'total_cost_usd': 0.0}}
raw = [{'query_text': 'test', 'google_correct': True, 'jina_correct': True, 'minilm_correct': True, 'llm_correct': True, 'google_prediction': '6801', 'jina_prediction': '6801', 'minilm_prediction': '6801', 'llm_prediction': '6801', 'google_similarity': 0.95, 'ground_truth_debit_account': '6801', 'ground_truth_cost_center': None}]
path = generate_report(results, raw, '/tmp/test_report.html')
import os
print(f'Report exists: {os.path.exists(path)}')"
Report generator creates self-contained HTML with embedded confusion matrix and showcase examples
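The embedding step behind a self-contained report can be sketched with the standard library alone. This is a minimal sketch, not the actual `report_generator` code: the helper names `png_to_data_uri` and `is_valid_png_data_uri` are hypothetical, and it assumes the confusion matrix is rendered as a PNG.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file
DATA_URI_PREFIX = "data:image/png;base64,"


def png_to_data_uri(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as a data URI usable in <img src="...">."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return DATA_URI_PREFIX + encoded


def is_valid_png_data_uri(uri: str) -> bool:
    """Check the prefix, then decode and confirm the PNG signature."""
    if not uri.startswith(DATA_URI_PREFIX):
        return False
    try:
        # validate=True rejects characters outside the base64 alphabet
        decoded = base64.b64decode(uri[len(DATA_URI_PREFIX):], validate=True)
    except ValueError:
        return False
    return decoded.startswith(PNG_SIGNATURE)
```

In a matplotlib-based generator the bytes would typically come from `fig.savefig(buf, format="png")` on an `io.BytesIO` buffer, followed by `buf.getvalue()`.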
Task 3: Add report export route to Flask app
src/app.py
1. Add imports at top of app.py:
- from src.reporting.report_generator import generate_report
- import os, tempfile
2. Add new route /report/export (POST):
@app.route('/report/export', methods=['POST'])
def export_report():
"""Generate and download static HTML report."""
- Get parameters from form: limit (default 100), k (default 5)
- Run benchmark with run_benchmark(conn, k=k, limit=limit)
- Collect raw results needed for confusion matrix and examples:
  - Iterate through benchmark results, storing query_text, predictions, correct flags, and similarity
- Generate report to tempfile
- Return file as download (send_file with as_attachment=True)
- Use filename format: semantic_search_report_{timestamp}.html
3. Add link/button to /benchmark page template to trigger export:
- Form with POST to /report/export
- Include current limit/k values
- Button text: "Export Report"
Note: The export reruns the benchmark before generating the report, so it may take a while for large limit values. Consider showing a loading indicator or warning about the expected duration.
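The parameter handling and filename rule for the export route can be sketched independently of Flask. The helper names `parse_export_params` and `report_filename` are hypothetical, and the timestamp format is an assumption, since the plan only specifies `{timestamp}`.

```python
from datetime import datetime
from typing import Mapping, Optional, Tuple


def parse_export_params(form: Mapping[str, str]) -> Tuple[int, int]:
    """Read limit and k from form data, falling back to the plan's defaults."""

    def to_positive_int(raw, default):
        try:
            value = int(raw)
        except (TypeError, ValueError):
            return default  # missing or non-numeric input
        return value if value > 0 else default

    limit = to_positive_int(form.get("limit"), 100)
    k = to_positive_int(form.get("k"), 5)
    return limit, k


def report_filename(now: Optional[datetime] = None) -> str:
    """Build semantic_search_report_{timestamp}.html."""
    # Assumption: the plan does not fix a timestamp format; this one sorts
    # chronologically and contains no characters that are awkward in filenames.
    now = now or datetime.now()
    return f"semantic_search_report_{now.strftime('%Y%m%d_%H%M%S')}.html"
```

Inside the route, `request.form` would be passed to `parse_export_params`, and the filename would go to `send_file(path, as_attachment=True, download_name=...)`. Note that `download_name` is the Flask 2.0+ parameter; older versions call it `attachment_filename`.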
Run Flask app and test endpoint:
cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
source .venv/bin/activate
# Just verify import works and route is registered
python -c "
from src.app import app
routes = [rule.rule for rule in app.url_map.iter_rules()]
print('/report/export in routes:', '/report/export' in routes)"
Flask app has /report/export route that generates and downloads self-contained HTML report
Task 4: Verify complete reporting system
Complete reporting system with:
- LLM-as-judge for semantic equivalence (from Plan 01)
- Confusion matrix visualization
- Curated example showcase
- Self-contained HTML report export
1. Start Flask app:
```bash
cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
source .venv/bin/activate
python -m src.app
```
2. Open http://localhost:5000/benchmark
3. Run a small benchmark (limit=20)
4. Click "Export Report" button
5. Verify downloaded report:
- Opens in browser without errors
- Contains aggregate metrics table
- Shows confusion matrix image (embedded, not broken)
- Displays curated example categories
- No external dependencies (works offline)
- File size reasonable (<5MB)
Type "approved" if report works correctly, or describe any issues
1. seaborn and matplotlib installed via uv
2. Confusion matrix generates valid seaborn heatmap with 'Other' grouping
3. Base64 encoding produces valid data URIs
4. HTML template renders without errors
5. Flask route triggers download
6. Report is fully self-contained (no external assets)
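The "no external assets" check can be partly automated. The following is a rough sketch using only the standard library; `ExternalAssetFinder` and `find_external_assets` are hypothetical names, and the check covers only `src`/`href` attributes, not `url(...)` or `@import` inside CSS.

```python
from html.parser import HTMLParser


class ExternalAssetFinder(HTMLParser):
    """Collect src/href values that would require a separate file or request."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            return  # hyperlinks may point elsewhere; they are not loaded assets
        for name, value in attrs:
            if name not in ("src", "href") or not value:
                continue
            # Embedded data URIs and in-page fragments need no external file.
            if not value.startswith(("data:", "#")):
                self.external.append((tag, value))


def find_external_assets(html_text: str):
    """Return a list of (tag, url) pairs for assets not embedded in the HTML."""
    finder = ExternalAssetFinder()
    finder.feed(html_text)
    finder.close()
    return finder.external
```

An empty result for the exported file, for example `find_external_assets(open('/tmp/test_report.html').read())`, suggests the report should open offline; a manual browser check is still worthwhile.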
<success_criteria>
- Confusion matrix shows top-15 GL accounts with 'Other' grouping
- Static HTML report exports as single file
- Report includes all aggregate metrics, confusion matrix, and examples
- Report opens correctly in browser without external dependencies
- Export route accessible from benchmark page
</success_criteria>
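For reference, the top-15 plus 'Other' grouping can be sketched in plain Python. This is an illustration under assumptions, not the spike's implementation: `build_confusion_counts` is a hypothetical name, and ranking accounts by ground-truth frequency is an assumed choice.

```python
from collections import Counter


def build_confusion_counts(truth, predicted, n=15, other="Other"):
    """Return (axis_labels, matrix) with rare labels folded into `other`.

    Rows are ground-truth labels, columns are predicted labels.
    """
    # Assumption: the "top" accounts are the n most frequent ground-truth labels.
    top = [label for label, _ in Counter(truth).most_common(n)]
    keep = set(top)
    axis = top + [other]

    # Map both sides through the same label set so rows and columns line up.
    counts = Counter(
        (t if t in keep else other, p if p in keep else other)
        for t, p in zip(truth, predicted)
    )
    matrix = [[counts[(row, col)] for col in axis] for row in axis]
    return axis, matrix
```

The resulting matrix could then be drawn with `seaborn.heatmap(matrix, xticklabels=axis, yticklabels=axis, annot=True)`.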