Implement static HTML report generation with confusion matrix and integrated curated examples.

Purpose: Generate a self-contained HTML report that can be shared without external dependencies. The report includes aggregate benchmark metrics, a confusion matrix visualizing GL account prediction errors (with low-frequency accounts grouped), and the curated showcase examples from Plan 01.

Output: New src/reporting/ module with a confusion matrix generator and an HTML report exporter, plus a Flask route to trigger the export.

<execution_context> @./.claude/get-shit-done/workflows/execute-plan.md @./.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @.planning/phases/05-reporting-llm-judge/05-RESEARCH.md @.planning/phases/05-reporting-llm-judge/05-01-SUMMARY.md

Existing evaluation infrastructure

@src/evaluation/metrics.py @src/evaluation/benchmark.py @src/evaluation/example_selector.py @src/app.py

Task 1: Add seaborn/matplotlib dependencies and create confusion matrix module

Files: pyproject.toml, src/reporting/__init__.py, src/reporting/confusion_matrix.py

  1. Add seaborn and matplotlib to pyproject.toml dependencies:

     ```bash
     cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
     source .venv/bin/activate
     uv add seaborn matplotlib
     ```

  2. Create src/reporting/__init__.py (empty or with exports)

  3. Create src/reporting/confusion_matrix.py:

    • At top, before importing matplotlib.pyplot: import matplotlib; matplotlib.use('Agg') for non-interactive backend
    • Import: matplotlib.pyplot, seaborn, sklearn.metrics.confusion_matrix, numpy
    • Import: BytesIO (from io) and base64 from the stdlib
  4. Implement figure_to_base64(fig, format='png', dpi=150) -> str:

    • Save figure to BytesIO buffer
    • Encode as base64
    • Return data URI string: "data:image/png;base64,{encoded}"
    • Close figure after encoding (plt.close(fig))
  5. Implement create_confusion_matrix(y_true: list[str], y_pred: list[str], top_n: int = 15) -> tuple[plt.Figure, list[str]]:

    • Count label frequencies in y_true
    • Keep top_n most frequent labels
    • Map all other labels to 'Other'
    • Compute confusion matrix with sklearn
    • Create seaborn heatmap:
      • figsize=(12, 10)
      • annot=True, fmt='d', cmap='Blues'
      • square=True, linewidths=0.5
    • Set labels: xlabel='Predicted GL Account', ylabel='True GL Account'
    • Set title: 'GL Account Prediction Confusion Matrix'
    • Rotate x labels 45 degrees for readability
    • Return (figure, labels_used)
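The label-grouping steps above (count frequencies, keep the top N, map the rest to 'Other') can be sketched with the standard library alone. This is a minimal illustration, not the module itself: the helper name `group_rare_labels` and the choice to append 'Other' only when something falls into it are assumptions, and the sample labels are made up.

```python
from collections import Counter

OTHER_LABEL = "Other"


def group_rare_labels(y_true, y_pred, top_n=15):
    """Keep the top_n most frequent true labels; map all others to 'Other'.

    Hypothetical helper: frequencies are counted on y_true only, as the
    plan describes, and the same mapping is applied to y_pred.
    """
    counts = Counter(y_true)
    kept = [label for label, _ in counts.most_common(top_n)]
    kept_set = set(kept)

    grouped_true = [t if t in kept_set else OTHER_LABEL for t in y_true]
    grouped_pred = [p if p in kept_set else OTHER_LABEL for p in y_pred]

    labels = list(kept)
    # Assumption: only add the 'Other' bucket when it is actually used.
    if OTHER_LABEL in grouped_true or OTHER_LABEL in grouped_pred:
        labels.append(OTHER_LABEL)
    return grouped_true, grouped_pred, labels


# Made-up labels for illustration only.
y_true = ["A", "A", "B", "B", "C"]
y_pred = ["A", "B", "A", "B", "D"]
grouped_true, grouped_pred, labels = group_rare_labels(y_true, y_pred, top_n=2)
print(labels)
print(grouped_true)
print(grouped_pred)
```

The grouped lists and the label order can then be passed to `sklearn.metrics.confusion_matrix(grouped_true, grouped_pred, labels=labels)`, which keeps the heatmap axes in frequency order with 'Other' last.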

Key: Call plt.tight_layout() before returning to avoid label cutoff.

Run test:

cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
source .venv/bin/activate
python -c "
from src.reporting.confusion_matrix import create_confusion_matrix, figure_to_base64
fig, labels = create_confusion_matrix(['A','A','B','B','C'], ['A','B','A','B','C'], top_n=3)
uri = figure_to_base64(fig)
print(f'Labels: {labels}')
print(f'URI starts with data:image: {uri.startswith(\"data:image/png;base64,\")}')"
Done: The confusion matrix module generates a seaborn heatmap with top-N label grouping and base64 encoding.
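The encoding half of figure_to_base64 can be sketched without matplotlib. This is a minimal sketch: the helper name `buffer_to_data_uri` is hypothetical, and the buffer holds stand-in bytes rather than a rendered figure. In the real function the buffer would be filled with `fig.savefig(buffer, format=format, dpi=dpi)`, and the figure closed with `plt.close(fig)` afterwards.

```python
import base64
from io import BytesIO


def buffer_to_data_uri(buffer: BytesIO, image_format: str = "png") -> str:
    """Encode the buffer contents as a base64 data URI for an <img> src."""
    encoded = base64.b64encode(buffer.getvalue()).decode("ascii")
    return f"data:image/{image_format};base64,{encoded}"


# Stand-in payload: the 8-byte PNG signature, not a complete image.
buffer = BytesIO(b"\x89PNG\r\n\x1a\n")
uri = buffer_to_data_uri(buffer)
print(uri)
```

Using `buffer.getvalue()` rather than `buffer.read()` avoids having to rewind the buffer after the figure has been written to it.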
Task 2: Create report generator and HTML template

Files: src/reporting/report_generator.py, src/reporting/templates/report.html

  1. Create src/reporting/templates/ directory

  2. Create src/reporting/report_generator.py:

    • Import: Environment, FileSystemLoader, select_autoescape from jinja2
    • Import: datetime, os
    • Import: create_confusion_matrix, figure_to_base64 from confusion_matrix
    • Import: select_showcase_examples, format_showcase_for_display from example_selector
  3. Implement generate_report(benchmark_results: dict, raw_results: list[dict], output_path: str) -> str:

    • Extract y_true and y_pred from raw_results for confusion matrix
    • Call create_confusion_matrix with top_n=15
    • Convert figure to base64
    • Call select_showcase_examples on raw_results
    • Format showcase for display
    • Setup Jinja2 environment with FileSystemLoader pointing to templates/
    • Enable autoescape for HTML
    • Render report.html template with context:
      • benchmark_results (dict of model -> BenchmarkResults)
      • confusion_matrix_img (base64 data URI)
      • showcase (formatted examples)
      • generated_at (ISO timestamp)
    • Write rendered HTML to output_path
    • Return output_path
  4. Create src/reporting/templates/report.html:

    • Self-contained HTML5 document
    • Inline CSS (no external stylesheets)
    • Use simple, clean styling (inspired by Bootstrap but inline)
    • Sections: a. Header with title, generation timestamp b. Summary table with aggregate metrics per model (accuracy, latency, cost) c. Confusion matrix image (embedded base64) d. Showcase examples organized by category
    • Each showcase category in collapsible section or distinct card
    • Display: query text, predictions per model, ground truth, similarity scores
    • Use a `<table>` for metrics, an `<img>` for the confusion matrix,
      and cards for examples
    • Footer with "Generated by Semantic Search Comparison Spike"
    • Template should render without JavaScript (pure HTML/CSS for maximum compatibility).

Run test:

      cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
      source .venv/bin/activate
      python -c "
      from src.reporting.report_generator import generate_report
      # Quick test with minimal data
      results = {'google': {'model': 'google', 'total_queries': 10, 'exact_match_gl': 0.8, 'exact_match_cc': 0.0, 'top_3_accuracy': 0.9, 'top_5_accuracy': 0.95, 'top_10_accuracy': 1.0, 'latency_mean_ms': 50, 'latency_p50_ms': 45, 'latency_p95_ms': 100, 'total_tokens': 0, 'total_cost_usd': 0.0}}
      raw = [{'query_text': 'test', 'google_correct': True, 'jina_correct': True, 'minilm_correct': True, 'llm_correct': True, 'google_prediction': '6801', 'jina_prediction': '6801', 'minilm_prediction': '6801', 'llm_prediction': '6801', 'google_similarity': 0.95, 'ground_truth_debit_account': '6801', 'ground_truth_cost_center': None}]
      path = generate_report(results, raw, '/tmp/test_report.html')
      import os
      print(f'Report exists: {os.path.exists(path)}')"
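
The first step of generate_report, extracting y_true and y_pred from raw_results, might look like the sketch below. The key names follow the test fixture above (`ground_truth_debit_account`, `{model}_prediction`). The helper name `extract_labels`, the default model, and the choice to skip rows with missing values are assumptions, and the sample rows are made up.

```python
def extract_labels(raw_results, model="google"):
    """Return (y_true, y_pred) for one model, skipping incomplete rows.

    Hypothetical helper: the plan does not say which model feeds the
    confusion matrix, so the model name is a parameter here.
    """
    y_true, y_pred = [], []
    for row in raw_results:
        truth = row.get("ground_truth_debit_account")
        pred = row.get(f"{model}_prediction")
        if truth is None or pred is None:
            continue  # Assumption: rows without a label or prediction are ignored.
        y_true.append(str(truth))
        y_pred.append(str(pred))
    return y_true, y_pred


# Made-up rows for illustration only.
raw = [
    {"query_text": "office chairs", "google_prediction": "6801", "ground_truth_debit_account": "6801"},
    {"query_text": "taxi fare", "google_prediction": "6500", "ground_truth_debit_account": "6510"},
    {"query_text": "unlabeled", "google_prediction": "6801", "ground_truth_debit_account": None},
]
y_true, y_pred = extract_labels(raw)
print(y_true, y_pred)
```

Casting to `str` keeps account codes comparable even if some rows store them as integers.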
      
Done: The report generator creates a self-contained HTML file with an embedded confusion matrix and showcase examples.

Task 3: Add report export route to Flask app

Files: src/app.py

  1. Add imports at top of app.py:

    • from src.reporting.report_generator import generate_report
    • import os, tempfile
  2. Add new route /report/export (POST):

        @app.route('/report/export', methods=['POST'])
        def export_report():
            """Generate and download static HTML report."""

    • Get parameters from form: limit (default 100), k (default 5)
    • Run benchmark with run_benchmark(conn, k=k, limit=limit)
    • Collect raw results needed for confusion matrix and examples:
      • Iterate through benchmark, storing: query_text, predictions, correct flags, similarity
    • Generate report to tempfile
    • Return file as download (send_file with as_attachment=True)
    • Use filename format: semantic_search_report_{timestamp}.html
  3. Add link/button to /benchmark page template to trigger export:

    • Form with POST to /report/export
    • Include current limit/k values
    • Button text: "Export Report"
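
The parameter handling and download filename for the export route can be sketched independently of Flask. The helper names `parse_export_params` and `report_filename` are hypothetical, and the timestamp layout (`YYYYMMDD_HHMMSS`) and the fallback to defaults on invalid input are assumptions, since the plan only gives the `semantic_search_report_{timestamp}.html` pattern.

```python
from datetime import datetime


def parse_export_params(form, default_limit=100, default_k=5):
    """Read limit and k from a dict-like form, falling back to defaults."""

    def to_positive_int(value, default):
        try:
            parsed = int(value)
        except (TypeError, ValueError):
            return default  # Missing or non-numeric input.
        return parsed if parsed > 0 else default

    limit = to_positive_int(form.get("limit"), default_limit)
    k = to_positive_int(form.get("k"), default_k)
    return limit, k


def report_filename(now):
    """Build the download name; the timestamp layout is an assumption."""
    return f"semantic_search_report_{now:%Y%m%d_%H%M%S}.html"


limit, k = parse_export_params({"limit": "20", "k": "abc"})
filename = report_filename(datetime(2025, 1, 2, 3, 4, 5))
print(limit, k, filename)
```

Inside the route, `request.form` supports the same `.get` access, and in Flask 2.0 and later the filename is passed to `send_file` as `download_name` alongside `as_attachment=True`.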

Note: The export may take time for large datasets. Consider showing a loading indicator or warning about expected duration.

Run Flask app and test endpoint:

      cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
      source .venv/bin/activate
      # Just verify import works and route is registered
      python -c "
      from src.app import app
      routes = [rule.rule for rule in app.url_map.iter_rules()]
      print('/report/export in routes:', '/report/export' in routes)"
      
Done: The Flask app has a /report/export route that generates and downloads a self-contained HTML report.

Task 4: Verify complete reporting system

Complete reporting system with:

    • LLM-as-judge for semantic equivalence (from Plan 01)
    • Confusion matrix visualization
    • Curated example showcase
    • Self-contained HTML report export

  1. Start Flask app:

     ```bash
     cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
     source .venv/bin/activate
     python -m src.app
     ```

  2. Open http://localhost:5000/benchmark

  3. Run a small benchmark (limit=20)

  4. Click "Export Report" button

  5. Verify downloaded report:

    • Opens in browser without errors
    • Contains aggregate metrics table
    • Shows confusion matrix image (embedded, not broken)
    • Displays curated example categories
    • No external dependencies (works offline)
    • File size reasonable (<5MB)

Type "approved" if the report works correctly, or describe any issues.
Verification:

  1. seaborn and matplotlib installed via uv
  2. Confusion matrix generates valid seaborn heatmap with 'Other' grouping
  3. Base64 encoding produces valid data URIs
  4. HTML template renders without errors
  5. Flask route triggers download
  6. Report is fully self-contained (no external assets)
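
The "fully self-contained" and "<5MB" checks can be partly automated. The sketch below is an optional aid, not part of the plan: the class and function names are hypothetical, the sample HTML is made up, and it only inspects `src` attributes and `<link href>`, so URLs inside inline CSS (`url(...)`) would not be caught.

```python
from html.parser import HTMLParser

MAX_BYTES = 5 * 1024 * 1024  # The 5MB ceiling from the checklist.
REMOTE_PREFIXES = ("http://", "https://", "//")


class ExternalAssetFinder(HTMLParser):
    """Collect (tag, url) pairs for assets loaded from a remote origin."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value or not value.startswith(REMOTE_PREFIXES):
                continue
            # src on any tag loads an asset; href only does so on <link>.
            # Plain <a href> hyperlinks do not break offline viewing.
            if name == "src" or (name == "href" and tag == "link"):
                self.external.append((tag, value))


def find_external_assets(html_text):
    finder = ExternalAssetFinder()
    finder.feed(html_text)
    return finder.external


# Made-up document for illustration only.
sample = (
    '<html><head><link rel="stylesheet" href="https://cdn.example.com/x.css"></head>'
    '<body><img src="data:image/png;base64,AAAA">'
    '<a href="https://example.com">docs</a></body></html>'
)
external = find_external_assets(sample)
size_ok = len(sample.encode("utf-8")) < MAX_BYTES
print(external, size_ok)
```

An empty list from `find_external_assets` on the exported file, together with `size_ok` being true, covers the last two items of the manual check.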

      <success_criteria>

      • Confusion matrix shows top-15 GL accounts with 'Other' grouping
      • Static HTML report exports as single file
      • Report includes all aggregate metrics, confusion matrix, and examples
      • Report opens correctly in browser without external dependencies
      • Export route accessible from benchmark page

      </success_criteria>

      After completion, create `.planning/phases/05-reporting-llm-judge/05-02-SUMMARY.md`