Existing evaluation infrastructure

@src/evaluation/metrics.py @src/evaluation/benchmark.py @src/evaluation/example_selector.py @src/app.py

Task 1: Add seaborn/matplotlib dependencies and create confusion matrix module

Files: pyproject.toml, src/reporting/__init__.py, src/reporting/confusion_matrix.py

1. Add seaborn and matplotlib to pyproject.toml dependencies:
```bash
cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
source .venv/bin/activate
uv add seaborn matplotlib
```

Create src/reporting/__init__.py (empty or with exports)
Create src/reporting/confusion_matrix.py:
- At top: import matplotlib; matplotlib.use('Agg') for non-interactive backend
- Import: matplotlib.pyplot, seaborn, sklearn.metrics.confusion_matrix, numpy
- Import: BytesIO, base64 from stdlib
Implement figure_to_base64(fig, format='png', dpi=150) -> str:
- Save figure to BytesIO buffer
- Encode as base64
- Return data URI string: "data:image/png;base64,{encoded}"
- Close figure after encoding (plt.close(fig))
Implement create_confusion_matrix(y_true: list[str], y_pred: list[str], top_n: int = 15) -> tuple[plt.Figure, list[str]]:
- Count label frequencies in y_true
- Keep top_n most frequent labels
- Map all other labels to 'Other'
- Compute confusion matrix with sklearn
- Create seaborn heatmap:
  - figsize=(12, 10)
  - annot=True, fmt='d', cmap='Blues'
  - square=True, linewidths=0.5
- Set labels: xlabel='Predicted GL Account', ylabel='True GL Account'
- Set title: 'GL Account Prediction Confusion Matrix'
- Rotate x labels 45 degrees for readability
- Return (figure, labels_used)
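
The label-grouping steps above (count frequencies in y_true, keep the top_n most frequent, map everything else to 'Other') can be sketched with the standard library alone. The helper name `group_top_n` and its return shape are hypothetical, not part of the plan; the sketch also assumes no real GL account is literally named 'Other'.

```python
from collections import Counter

OTHER = "Other"


def group_top_n(y_true, y_pred, top_n=15):
    """Hypothetical helper: collapse infrequent labels into 'Other'.

    Frequencies are counted on y_true only, as the plan describes.
    Returns (grouped_true, grouped_pred, labels_used).
    """
    counts = Counter(y_true)
    # most_common keeps first-seen order among labels with equal counts
    keep = [label for label, _ in counts.most_common(top_n)]
    keep_set = set(keep)

    grouped_true = [label if label in keep_set else OTHER for label in y_true]
    grouped_pred = [label if label in keep_set else OTHER for label in y_pred]

    # Only add the 'Other' bucket when something was actually grouped into it
    labels = list(keep)
    if OTHER in grouped_true or OTHER in grouped_pred:
        labels.append(OTHER)
    return grouped_true, grouped_pred, labels
```

The returned `labels` list could then be passed as the `labels=` argument of `sklearn.metrics.confusion_matrix` so the row and column order matches the heatmap tick labels.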

Key: Use plt.tight_layout() before returning to avoid label cutoff.

Run test:

cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
source .venv/bin/activate
python -c "
from src.reporting.confusion_matrix import create_confusion_matrix, figure_to_base64
fig, labels = create_confusion_matrix(['A','A','B','B','C'], ['A','B','A','B','C'], top_n=3)
uri = figure_to_base64(fig)
print(f'Labels: {labels}')
print(f'URI starts with data:image: {uri.startswith(\"data:image/png;base64,\")}')"
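
For reference, figure_to_base64 as specified in this task might look like the sketch below. It assumes matplotlib is installed and follows the listed steps (save to a BytesIO buffer, base64-encode, return a data URI, close the figure). Deriving the MIME subtype from `format` is an assumption that only holds for formats such as png; the plan itself only names the PNG URI.

```python
import base64
from io import BytesIO

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt


def figure_to_base64(fig, format: str = "png", dpi: int = 150) -> str:
    """Render a matplotlib figure and return it as a data URI string."""
    buf = BytesIO()
    fig.savefig(buf, format=format, dpi=dpi)
    # Close the figure once its bytes are captured to release memory
    plt.close(fig)
    encoded = base64.b64encode(buf.getvalue()).decode("ascii")
    # Assumption: "image/<format>" is a valid MIME type (true for png)
    return f"data:image/{format};base64,{encoded}"
```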

Done: the confusion matrix module generates a seaborn heatmap with top-N label grouping and base64 encoding.

Task 2: Create report generator and HTML template

Files: src/reporting/report_generator.py, src/reporting/templates/report.html

1. Create src/reporting/templates/ directory

Create src/reporting/report_generator.py:
- Import: jinja2.Environment, FileSystemLoader, select_autoescape
- Import: datetime, os
- Import: create_confusion_matrix, figure_to_base64 from confusion_matrix
- Import: select_showcase_examples, format_showcase_for_display from example_selector
Implement generate_report(benchmark_results: dict, raw_results: list[dict], output_path: str) -> str:
- Extract y_true and y_pred from raw_results for confusion matrix
- Call create_confusion_matrix with top_n=15
- Convert figure to base64
- Call select_showcase_examples on raw_results
- Format showcase for display
- Setup Jinja2 environment with FileSystemLoader pointing to templates/
- Enable autoescape for HTML
- Render report.html template with context:
  - benchmark_results (dict of model -> BenchmarkResults)
  - confusion_matrix_img (base64 data URI)
  - showcase (formatted examples)
  - generated_at (ISO timestamp)
- Write rendered HTML to output_path
- Return output_path
Create src/reporting/templates/report.html:
- Self-contained HTML5 document
- Inline CSS (no external stylesheets)
- Use simple, clean styling (inspired by Bootstrap but inline)
- Sections:
  a. Header with title, generation timestamp
  b. Summary table with aggregate metrics per model (accuracy, latency, cost)
  c. Confusion matrix image (embedded base64)
  d. Showcase examples organized by category
- Each showcase category in collapsible section or distinct card
- Display: query text, predictions per model, ground truth, similarity scores
- Use `<table>` for metrics, `<img>` for confusion matrix, cards for examples
- Footer with "Generated by Semantic Search Comparison Spike"
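
The rendering step can be sketched as follows, assuming jinja2 is installed. To keep the example self-contained it uses `DictLoader` with an inline template string as a stand-in for the `FileSystemLoader` and `templates/report.html` file the plan calls for; the template body is a reduced illustration, not the final layout. The metric field names are taken from the quick-test data in this plan.

```python
from datetime import datetime

from jinja2 import DictLoader, Environment, select_autoescape

# Reduced illustrative template: header, metrics table, embedded image, footer
TEMPLATE = """<!DOCTYPE html>
<html lang="en">
<head><meta charset="utf-8"><title>Semantic Search Report</title></head>
<body>
<h1>Semantic Search Report</h1>
<p>Generated at {{ generated_at }}</p>
<table>
<tr><th>Model</th><th>Exact match (GL)</th><th>Mean latency (ms)</th><th>Cost (USD)</th></tr>
{% for name, r in benchmark_results.items() %}
<tr><td>{{ name }}</td><td>{{ r.exact_match_gl }}</td><td>{{ r.latency_mean_ms }}</td><td>{{ r.total_cost_usd }}</td></tr>
{% endfor %}
</table>
<img src="{{ confusion_matrix_img }}" alt="Confusion matrix">
<footer>Generated by Semantic Search Comparison Spike</footer>
</body>
</html>
"""

# DictLoader stands in for FileSystemLoader("templates/") in this sketch
env = Environment(
    loader=DictLoader({"report.html": TEMPLATE}),
    autoescape=select_autoescape(["html"]),  # escape values in .html templates
)

html = env.get_template("report.html").render(
    benchmark_results={
        "google": {"exact_match_gl": 0.8, "latency_mean_ms": 50, "total_cost_usd": 0.0}
    },
    confusion_matrix_img="data:image/png;base64,AAAA",  # placeholder data URI
    generated_at=datetime.now().isoformat(timespec="seconds"),
)
```

Because the image arrives as a data URI and the template references no scripts or stylesheets, the rendered string is a single self-contained document.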
  
Template should render without JavaScript (pure HTML/CSS for maximum compatibility).

Run test:
```
cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
source .venv/bin/activate
python -c "
from src.reporting.report_generator import generate_report
# Quick test with minimal data
results = {'google': {'model': 'google', 'total_queries': 10, 'exact_match_gl': 0.8, 'exact_match_cc': 0.0, 'top_3_accuracy': 0.9, 'top_5_accuracy': 0.95, 'top_10_accuracy': 1.0, 'latency_mean_ms': 50, 'latency_p50_ms': 45, 'latency_p95_ms': 100, 'total_tokens': 0, 'total_cost_usd': 0.0}}
raw = [{'query_text': 'test', 'google_correct': True, 'jina_correct': True, 'minilm_correct': True, 'llm_correct': True, 'google_prediction': '6801', 'jina_prediction': '6801', 'minilm_prediction': '6801', 'llm_prediction': '6801', 'google_similarity': 0.95, 'ground_truth_debit_account': '6801', 'ground_truth_cost_center': None}]
path = generate_report(results, raw, '/tmp/test_report.html')
import os
print(f'Report exists: {os.path.exists(path)}')"
```
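
The first step of generate_report, extracting y_true and y_pred from raw_results, might look like the sketch below. It assumes the key names used in the quick test above (`ground_truth_debit_account` and `<model>_prediction`). The helper name `extract_labels` and its `model` parameter are hypothetical, since the plan does not say which model's predictions feed the confusion matrix.

```python
def extract_labels(raw_results, model="google"):
    """Hypothetical helper: collect (y_true, y_pred) label lists for one model.

    Rows missing either the ground truth or the prediction are skipped.
    """
    y_true, y_pred = [], []
    for row in raw_results:
        truth = row.get("ground_truth_debit_account")
        pred = row.get(f"{model}_prediction")
        if truth is None or pred is None:
            continue
        # Cast to str so account codes are compared as labels, not numbers
        y_true.append(str(truth))
        y_pred.append(str(pred))
    return y_true, y_pred
```
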
Done: the report generator creates self-contained HTML with embedded confusion matrix and showcase examples.

Task 3: Add report export route to Flask app

Files: src/app.py

1. Add imports at top of app.py:
   - from src.reporting.report_generator import generate_report
   - import os, tempfile
  
Add new route /report/export (POST):

    @app.route('/report/export', methods=['POST'])
    def export_report():
        """Generate and download static HTML report."""

- Get parameters from form: limit (default 100), k (default 5)
- Run benchmark with run_benchmark(conn, k=k, limit=limit)
- Collect raw results needed for confusion matrix and examples:
  - Iterate through benchmark, storing: query_text, predictions, correct flags, similarity
- Generate report to tempfile
- Return file as download (send_file with as_attachment=True)
- Use filename format: semantic_search_report_{timestamp}.html

Add link/button to /benchmark page template to trigger export:
- Form with POST to /report/export
- Include current limit/k values
- Button text: "Export Report"
  
Note: The export may take time for large datasets. Consider showing a loading indicator or warning about expected duration.

Run Flask app and test endpoint:

cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
source .venv/bin/activate
# Just verify import works and route is registered
python -c "
from src.app import app
routes = [rule.rule for rule in app.url_map.iter_rules()]
print('/report/export in routes:', '/report/export' in routes)"
Done: the Flask app has a /report/export route that generates and downloads a self-contained HTML report.

Task 4: Verify complete reporting system

Complete reporting system with:
- LLM-as-judge for semantic equivalence (from Plan 01)
- Confusion matrix visualization
- Curated example showcase
- Self-contained HTML report export

1. Start Flask app:
```bash
cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
source .venv/bin/activate
python -m src.app
```
  
  Open http://localhost:5000/benchmark
  
  Run a small benchmark (limit=20)
  
  Click "Export Report" button
  
Verify downloaded report:
- Opens in browser without errors
- Contains aggregate metrics table
- Shows confusion matrix image (embedded, not broken)
- Displays curated example categories
- No external dependencies (works offline)
- File size reasonable (<5MB)

Type "approved" if report works correctly, or describe any issues
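
Parts of the manual check above (no external dependencies, no JavaScript, size under 5MB) could be automated with a small standard-library script. This is a rough sketch under stated assumptions: `check_report` and `ExternalRefFinder` are hypothetical names, only `src` and `href` attributes are inspected, and `url(...)` references inside inline CSS are not detected.

```python
from html.parser import HTMLParser


class ExternalRefFinder(HTMLParser):
    """Collects tags that would load assets from outside the file."""

    def __init__(self):
        super().__init__()
        self.external = []
        self.has_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.has_script = True
        for name, value in attrs:
            if name not in ("src", "href") or not value:
                continue
            # Plain <a href> links navigate away but do not load assets
            if tag == "a":
                continue
            if value.startswith(("http://", "https://", "//")):
                self.external.append((tag, value))


def check_report(html: str, max_bytes: int = 5 * 1024 * 1024) -> dict:
    """Hypothetical checker for the self-contained report criteria."""
    finder = ExternalRefFinder()
    finder.feed(html)
    return {
        "size_ok": len(html.encode("utf-8")) < max_bytes,
        "external_refs": finder.external,
        "has_script": finder.has_script,
    }
```

Running `check_report(open(path, encoding="utf-8").read())` on a downloaded report should yield an empty `external_refs` list if the file is truly self-contained; it does not replace opening the report in a browser.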
  
1. seaborn and matplotlib installed via uv
2. Confusion matrix generates valid seaborn heatmap with 'Other' grouping
3. Base64 encoding produces valid data URIs
4. HTML template renders without errors
5. Flask route triggers download
6. Report is fully self-contained (no external assets)
<success_criteria>
- Confusion matrix shows top-15 GL accounts with 'Other' grouping
- Static HTML report exports as single file
- Report includes all aggregate metrics, confusion matrix, and examples
- Report opens correctly in browser without external dependencies
- Export route accessible from benchmark page
</success_criteria>
  
  After completion, create `.planning/phases/05-reporting-llm-judge/05-02-SUMMARY.md`