Phase 5: Reporting & LLM Judge - Research

Researched: 2026-02-20
Domain: LLM-as-judge evaluation, static HTML report generation, confusion matrix visualization, curated example showcase
Confidence: HIGH

Summary

This phase completes the evaluation framework with LLM-as-judge for semantic relevance scoring when exact matches fail, generates static HTML reports with aggregate metrics, creates confusion matrices for GL account error analysis, and curates hand-picked examples showcasing best/worst cases per approach.

The core technical challenge is implementing a reliable LLM-as-judge that can determine semantic equivalence between predicted and ground truth GL accounts (e.g., both "6801" and "6800" might be valid for "office supplies" depending on context). The project already has google-genai for Gemini Flash calls and scikit-learn for confusion matrix computation. Static HTML report generation uses Jinja2 (already a Flask dependency) with base64-encoded matplotlib/seaborn visualizations for self-contained, shareable files.

The LLM judge should use a binary classification approach (semantically equivalent: YES/NO) with chain-of-thought reasoning and few-shot examples. Key best practices include: single-criterion prompts, explicit scoring rubrics, and confidence levels in the response. For confusion matrices, sklearn's confusion_matrix paired with seaborn heatmaps provides standard visualization. Hand-picked examples are selected by identifying prediction patterns: highest-confidence correct, lowest-confidence correct, highest-confidence wrong, and edge cases where models disagree.

Primary recommendation: Implement LLM judge as a standalone module invoked when exact match fails, generate self-contained HTML reports using Jinja2 with embedded base64 images, and use sklearn/seaborn for confusion matrix generation.
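The "exact match first, judge only on mismatch" flow in the recommendation can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the project's implementation: `score_predictions`, the row keys, and the injected `judge` callable are hypothetical names, and the judge is passed in so the sketch runs with a stub instead of a real Gemini call.

```python
from typing import Callable

# Hypothetical signature: judge(line_item, predicted, ground_truth) -> bool
Judge = Callable[[str, str, str], bool]


def score_predictions(rows: list[dict], judge: Judge) -> dict:
    """Score rows with exact match first; call the judge only on mismatches.

    Each row is assumed to carry 'line_item', 'predicted', 'ground_truth' keys.
    """
    exact = 0
    judged_equivalent = 0
    judge_calls = 0
    for row in rows:
        if row["predicted"] == row["ground_truth"]:
            exact += 1
            continue
        # Exact match failed: fall back to the (more expensive) semantic judge
        judge_calls += 1
        if judge(row["line_item"], row["predicted"], row["ground_truth"]):
            judged_equivalent += 1

    total = len(rows)
    return {
        "exact_accuracy": exact / total if total else 0.0,
        "judge_adjusted_accuracy": (exact + judged_equivalent) / total if total else 0.0,
        "judge_calls": judge_calls,
    }
```

Reporting both accuracies side by side keeps the judge's contribution visible, and `judge_calls` gives a direct handle on how many API calls the fallback costs.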

<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

None -- all implementation decisions delegated to Claude.

Claude's Discretion

All implementation decisions deferred to Claude -- user comfortable with standard approaches:

LLM judge criteria:

Report structure:

Example selection:

Confusion matrix:

Deferred Ideas (OUT OF SCOPE)

None -- discussion stayed within phase scope. </user_constraints>

<phase_requirements>

Phase Requirements

ID | Description | Research Support
EVAL-08 | LLM-as-judge evaluation for semantic relevance | Use Gemini Flash with binary classification (YES/NO) prompt, chain-of-thought reasoning, few-shot examples; invoke only when exact match fails
REPT-02 | Hand-picked example showcase (curated searches) | Select by prediction confidence bands and model agreement patterns; 5-10 examples per category (best, worst, divergent)
REPT-03 | Static HTML report export with aggregate metrics | Jinja2 template with embedded base64 images; single-file output requiring no external assets
REPT-04 | Confusion matrix for GL account prediction errors | sklearn.metrics.confusion_matrix + seaborn heatmap; group low-frequency accounts into "Other" category
</phase_requirements>

Standard Stack

Core

Library | Version | Purpose | Why Standard
google-genai | 1.64+ (existing) | LLM judge API calls | Already in use for LLM matching; consistent with existing patterns
scikit-learn | 1.8.0 (installed) | Confusion matrix computation | Standard ML library; confusion_matrix, ConfusionMatrixDisplay
Jinja2 | (Flask dep) | HTML report templates | Already used by Flask; powerful templating for static output
matplotlib | 3.x | Figure generation for reports | Standard Python plotting; save to BytesIO for base64 embedding

Supporting

Library | Version | Purpose | When to Use
seaborn | 0.13+ | Confusion matrix heatmap styling | More aesthetic than raw matplotlib; sns.heatmap()
base64 | stdlib | Image embedding in HTML | Convert matplotlib figures to data URIs
io.BytesIO | stdlib | In-memory figure storage | Avoid temp files when generating base64 images

Alternatives Considered

Instead of | Could Use | Tradeoff
Gemini Flash judge | GPT-4 judge | Gemini already in use; GPT-4 would add new dependency and cost
seaborn heatmap | plotly interactive | Plotly requires JS bundle; seaborn PNG is simpler for static reports
Jinja2 standalone | html-reports package | Extra dependency; Jinja2 already available
Base64 embedding | External image files | External files break shareability; base64 is self-contained

Installation:

# Seaborn and matplotlib may need explicit installation
uv add seaborn matplotlib
# All other dependencies already installed

Architecture Patterns

src/
├── evaluation/
│   ├── metrics.py              # existing
│   ├── benchmark.py            # existing
│   ├── cost_tracker.py         # existing
│   ├── llm_judge.py            # NEW: semantic relevance judge
│   └── example_selector.py     # NEW: best/worst case selection
├── reporting/
│   ├── __init__.py             # NEW
│   ├── confusion_matrix.py     # NEW: sklearn/seaborn CM generation
│   ├── report_generator.py     # NEW: Jinja2 static HTML output
│   └── templates/
│       └── report.html         # NEW: static report template
└── app.py                      # extend with report export route

Pattern 1: LLM-as-Judge with Binary Classification

What: Single-criterion evaluation prompt that returns YES/NO for semantic equivalence
When to use: When predicted GL account differs from ground truth but might be semantically valid
Example:

# Source: LLM-as-a-Judge best practices (Evidently AI, Monte Carlo)
JUDGE_PROMPT = """You are evaluating whether two GL account codes are semantically equivalent for a given line item.

Context: {line_item_description}
Predicted GL Account: {predicted}
Expected GL Account: {ground_truth}

Consider:
1. Are both accounts in the same category (e.g., both are operating expenses)?
2. Would an accountant accept either for this line item?
3. Is the difference meaningful for financial reporting?

Examples:
- "6801" vs "6800" for "Office supplies" -> YES (both are office/admin expenses)
- "6801" vs "4000" for "Office supplies" -> NO (expense vs revenue account)
- "7110" vs "7120" for "Consulting fees" -> YES (both are external services)

Answer with a single word: YES or NO
Then provide a one-sentence explanation.

Response format:
VERDICT: [YES/NO]
REASON: [one sentence]"""

def judge_semantic_equivalence(
    line_item: str,
    predicted: str,
    ground_truth: str
) -> dict:
    """
    Use LLM to judge if prediction is semantically equivalent to ground truth.

    Returns:
        dict with 'equivalent' (bool), 'reason' (str), 'raw_response' (str)
    """
    prompt = JUDGE_PROMPT.format(
        line_item_description=line_item,
        predicted=predicted,
        ground_truth=ground_truth
    )

    response = call_gemini_judge(prompt)  # helper defined under "LLM Judge API Call"

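    # Parse response; tolerate missing or malformed lines instead of raising IndexError
    lines = [line.strip() for line in response.strip().split('\n')]
    verdict_line = next((l for l in lines if l.upper().startswith('VERDICT:')), '')
    # A missing VERDICT line is treated as "not equivalent"
    verdict = verdict_line[len('VERDICT:'):].strip().upper().startswith('YES')

    reason_line = next((l for l in lines if l.upper().startswith('REASON:')), '')
    reason = reason_line[len('REASON:'):].strip()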

    return {
        'equivalent': verdict,
        'reason': reason,
        'raw_response': response
    }
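
A somewhat more defensive parser might look like the sketch below. It assumes the prompt is extended with an optional CONFIDENCE: line, which the prompt above does not currently request, and `parse_judge_response` is a hypothetical name. Unlike a boolean-only parse, it returns None for an unparseable verdict so the caller can decide whether to retry or count the case as NO.

```python
import re

# Matches "VERDICT: ...", "REASON: ...", "CONFIDENCE: ..." (case-insensitive)
_FIELD = re.compile(r"^\s*(VERDICT|REASON|CONFIDENCE)\s*:\s*(.*)$", re.IGNORECASE)


def parse_judge_response(text: str) -> dict:
    """Parse labelled lines from a judge reply; unknown verdicts become None."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        match = _FIELD.match(line)
        if match:
            # Keep the first occurrence of each field
            fields.setdefault(match.group(1).upper(), match.group(2).strip())

    verdict_text = fields.get("VERDICT", "").upper()
    # Compare the first token only, so "NOT SURE" is not mistaken for "NO"
    first_token = verdict_text.split()[0].strip(".,[]") if verdict_text else ""
    if first_token == "YES":
        equivalent = True
    elif first_token == "NO":
        equivalent = False
    else:
        equivalent = None  # Unparseable: caller decides how to handle it

    return {
        "equivalent": equivalent,
        "reason": fields.get("REASON", ""),
        "confidence": fields.get("CONFIDENCE", "").upper() or None,
    }
```

Counting how often `equivalent` comes back as None is a cheap signal that the response format in the prompt needs tightening.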

Pattern 2: Self-Contained HTML Report Generation

What: Jinja2 template rendering with embedded base64 images
When to use: Generating shareable static reports
Example:

# Source: Practical Business Python (pbpython.com), Matplotlib docs
from datetime import datetime
from io import BytesIO
import base64
from jinja2 import Environment, FileSystemLoader, select_autoescape
import matplotlib.pyplot as plt

def fig_to_base64(fig) -> str:
    """Convert matplotlib figure to base64 data URI."""
    buf = BytesIO()
    fig.savefig(buf, format='png', dpi=150, bbox_inches='tight')
    buf.seek(0)
    data = base64.b64encode(buf.read()).decode('utf-8')
    plt.close(fig)  # Release figure memory once it has been serialized
    return f"data:image/png;base64,{data}"

def generate_report(
    benchmark_results: dict,
    confusion_fig,
    examples: list,
    output_path: str
) -> None:
    """Generate self-contained HTML report."""
    env = Environment(
        loader=FileSystemLoader('src/reporting/templates'),
        # Escape line-item text so it cannot inject markup into the report
        autoescape=select_autoescape(['html', 'xml']),
    )
    template = env.get_template('report.html')

    html = template.render(
        results=benchmark_results,
        confusion_matrix_img=fig_to_base64(confusion_fig),
        examples=examples,
        generated_at=datetime.now().isoformat(),
    )

    # Explicit encoding avoids platform-dependent defaults
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(html)

Pattern 3: Confusion Matrix with Grouped Labels

What: Group low-frequency GL accounts into "Other" to keep matrix readable
When to use: When unique GL accounts > 15-20 (matrix becomes unreadable)
Example:

# Source: sklearn.metrics.confusion_matrix, seaborn heatmap docs
from collections import Counter

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def create_confusion_matrix(
    y_true: list[str],
    y_pred: list[str],
    top_n: int = 15
) -> tuple:
    """
    Create confusion matrix with top N labels, grouping rest as 'Other'.

    Returns:
        (figure, labels) tuple
    """
    # Find top N most frequent ground truth labels
    label_counts = Counter(y_true)
    top_labels = [label for label, _ in label_counts.most_common(top_n)]
    top_label_set = set(top_labels)  # O(1) membership checks

    # Map non-top labels to 'Other'
    def map_label(label):
        return label if label in top_label_set else 'Other'

    y_true_mapped = [map_label(y) for y in y_true]
    y_pred_mapped = [map_label(y) for y in y_pred]

    # Include 'Other' in labels if needed
    all_labels = top_labels + ['Other'] if 'Other' in y_true_mapped or 'Other' in y_pred_mapped else top_labels

    # Compute confusion matrix
    cm = confusion_matrix(y_true_mapped, y_pred_mapped, labels=all_labels)

    # Create heatmap
    fig, ax = plt.subplots(figsize=(12, 10))
    sns.heatmap(
        cm,
        annot=True,
        fmt='d',
        cmap='Blues',
        xticklabels=all_labels,
        yticklabels=all_labels,
        ax=ax
    )
    ax.set_xlabel('Predicted GL Account')
    ax.set_ylabel('True GL Account')
    ax.set_title('GL Account Prediction Confusion Matrix')

    # Rotate x labels for readability
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()

    return fig, all_labels

Pattern 4: Example Selection by Prediction Pattern

What: Select representative examples across different outcome categories
When to use: Creating curated showcase of best/worst/interesting cases
Example:

# Source: Evaluation best practices, manual curation patterns
def select_showcase_examples(
    results: list[dict],
    per_category: int = 5
) -> dict[str, list]:
    """
    Select hand-picked examples for showcase.

    Categories:
    - best_cases: High confidence, correct prediction, all models agree
    - worst_cases: High confidence, wrong prediction
    - edge_cases: Models disagree significantly
    - llm_saves: LLM correct where embedding failed

    Returns:
        dict mapping category name to list of examples
    """
    showcase = {
        'best_cases': [],
        'worst_cases': [],
        'edge_cases': [],
        'llm_saves': [],
    }

    for r in results:
        # Best: all embedding models correct and agree
        if r['google_correct'] and r['jina_correct'] and r['minilm_correct']:
            showcase['best_cases'].append(r)

        # Worst: high similarity but wrong
        elif r['google_similarity'] > 0.9 and not r['google_correct']:
            showcase['worst_cases'].append(r)

        # Edge: models disagree (a chained `a != b != c` never compares a with c)
        elif len({r['google_prediction'], r['jina_prediction'], r['minilm_prediction']}) > 1:
            showcase['edge_cases'].append(r)

        # LLM saves: embedding wrong, LLM correct
        elif r['llm_correct'] and not r['google_correct']:
            showcase['llm_saves'].append(r)

    # Take top N per category, sorted by interestingness
    for cat in showcase:
        # Sort by similarity score descending (most confident cases)
        showcase[cat] = sorted(
            showcase[cat],
            key=lambda x: x.get('google_similarity', 0),
            reverse=True
        )[:per_category]

    return showcase

Anti-Patterns to Avoid

Don't Hand-Roll

Problem | Don't Build | Use Instead | Why
Confusion matrix calculation | Manual counting | sklearn.metrics.confusion_matrix | Handles label ordering, zero counts correctly
Heatmap visualization | Custom plotting code | seaborn.heatmap() | Better defaults, annotations, color scaling
Base64 encoding | Manual implementation | base64.b64encode() | Standard library, battle-tested
HTML templating | String concatenation | Jinja2 | Escaping, inheritance, conditionals
Statistical comparisons | Manual chi-square | scipy.stats | Edge cases in significance testing

Key insight: The evaluation logic is straightforward (LLM prompt + parse response), but the visualization and report generation have many edge cases (font sizes, label rotation, color scales) that established libraries handle well.

Common Pitfalls

Pitfall 1: LLM Judge Inconsistency

What goes wrong: Same input produces different verdicts on repeated calls
Why it happens: Temperature > 0, prompt ambiguity, model variance
How to avoid: Set temperature=0, use explicit few-shot examples, structured response format
Warning signs: Flaky tests, verdict flips on re-run
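
One way to surface verdict flips before they show up as flaky tests is to re-run the judge on a sample of inputs and measure agreement. The helper below is a rough sketch under assumptions: `verdict_stability` is a hypothetical name, and the judge is passed as a zero-argument closure so any judge function can be wrapped (and stubbed in tests).

```python
from collections import Counter
from typing import Callable


def verdict_stability(judge: Callable[[], bool], runs: int = 5) -> dict:
    """Call a zero-argument judge closure `runs` times and summarize agreement."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    verdicts = [judge() for _ in range(runs)]
    # Most frequent verdict and how often it occurred
    majority, majority_count = Counter(verdicts).most_common(1)[0]
    return {
        "majority": majority,
        "agreement": majority_count / runs,
        "stable": majority_count == runs,
    }
```

Each run is a separate API call, so this is better suited to a small audit sample than to every judged item.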

Pitfall 2: Prompt Injection in Judge Inputs

What goes wrong: Malicious or weird line item descriptions affect judge behavior
Why it happens: User input included directly in prompt
How to avoid: Truncate/sanitize line item text, use clear delimiters
Warning signs: Judge returns unexpected verdicts for unusual inputs
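
The truncate/sanitize/delimit advice could be sketched as follows. The 500-character limit, the `<line_item>` delimiter tag, and both function names are assumptions made for illustration; none of them come from the existing codebase.

```python
import re

MAX_LINE_ITEM_CHARS = 500  # Assumed limit; tune to the real data


def sanitize_line_item(text: str, max_chars: int = MAX_LINE_ITEM_CHARS) -> str:
    """Normalize whitespace, strip delimiter look-alikes, and truncate."""
    # Replace control characters (including newlines) with spaces
    cleaned = re.sub(r"[\x00-\x1f\x7f]", " ", text)
    # Remove anything resembling the delimiter tags so input cannot close the block
    cleaned = re.sub(r"</?\s*line_item\s*>", "", cleaned, flags=re.IGNORECASE)
    # Collapse runs of whitespace left behind by the steps above
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned[:max_chars]


def wrap_line_item(text: str) -> str:
    """Wrap sanitized text in explicit delimiters for inclusion in the prompt."""
    return f"<line_item>\n{sanitize_line_item(text)}\n</line_item>"
```

This only keeps the input from breaking out of its delimiters; it does not remove instruction-like text. The prompt would still need to tell the model to treat everything inside the delimiters as data, not instructions.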

Pitfall 3: Confusion Matrix Label Explosion

What goes wrong: Matrix is 100x100 with mostly empty cells
Why it happens: Many unique GL accounts, most with <5 samples
How to avoid: Group low-frequency labels into "Other" category
Warning signs: Matrix renders as tiny unreadable squares
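
As an alternative to a fixed top-N cut, grouping can be driven by a minimum sample count, which matches the "<5 samples" cause directly. The sketch below uses only the standard library; `group_rare_labels` and its `min_count` default are hypothetical choices, not existing project code.

```python
from collections import Counter


def group_rare_labels(
    y_true: list[str],
    y_pred: list[str],
    min_count: int = 5,
    other: str = "Other",
) -> tuple[list[str], list[str], list[str]]:
    """Remap labels with fewer than `min_count` ground-truth samples to `other`."""
    counts = Counter(y_true)
    # Frequent labels, most common first, for a stable axis order
    kept = [label for label, n in counts.most_common() if n >= min_count]
    kept_set = set(kept)

    def remap(label: str) -> str:
        return label if label in kept_set else other

    true_mapped = [remap(y) for y in y_true]
    pred_mapped = [remap(y) for y in y_pred]
    # Only add the catch-all bucket if something actually landed in it
    labels = kept + [other] if other in true_mapped or other in pred_mapped else kept
    return true_mapped, pred_mapped, labels
```

The three return values can be passed straight to `confusion_matrix(true_mapped, pred_mapped, labels=labels)`.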

Pitfall 4: Base64 Images Too Large

What goes wrong: HTML report is 50MB+ because of high-res images
Why it happens: DPI too high, figure size too large
How to avoid: Use dpi=150 or lower, reasonable figure sizes (10-12 inches)
Warning signs: Report slow to load/render, email attachment limits

Pitfall 5: Missing Dependencies for Matplotlib Backend

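What goes wrong: Figure creation fails with display errors, or plt.show() emits a UserWarning that Matplotlib is using the non-GUI agg backend
Why it happens: No display available in server environment
How to avoid: Select the Agg backend explicitly with matplotlib.use('Agg') before importing pyplot, and save figures with savefig() rather than calling plt.show()
Warning signs: Errors mentioning Tkinter, display, or backend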

Pitfall 6: Example Selection Bias

What goes wrong: Showcase only shows easy cases or only failures
Why it happens: Selection criteria too narrow
How to avoid: Explicitly select from multiple categories; include edge cases
Warning signs: Examples don't represent true distribution of outcomes

Code Examples

Verified patterns from official sources:

LLM Judge API Call

# Source: google-genai SDK documentation
import os

from google import genai

def call_gemini_judge(prompt: str) -> str:
    """Call Gemini Flash for judge evaluation."""
    client = genai.Client(api_key=os.environ['GOOGLE_API_KEY'])

    response = client.models.generate_content(
        model='gemini-2.5-flash',
        contents=prompt,
        config={
            'temperature': 0,  # Reduces run-to-run variance; not a hard determinism guarantee
        }
    )

    # response.text may be None (e.g. blocked or empty response); return '' in that case
    return (response.text or '').strip()

Confusion Matrix with seaborn

# Source: scikit-learn 1.8.0 docs, seaborn heatmap docs
import matplotlib
matplotlib.use('Agg')  # Non-interactive backend

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true, y_pred, labels):
    """Create publication-quality confusion matrix heatmap."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)

    # Normalize for percentages (optional)
    # cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    fig, ax = plt.subplots(figsize=(12, 10))
    sns.heatmap(
        cm,
        annot=True,
        fmt='d',
        cmap='Blues',
        xticklabels=labels,
        yticklabels=labels,
        square=True,
        linewidths=0.5,
        ax=ax
    )
    ax.set_xlabel('Predicted', fontsize=12)
    ax.set_ylabel('Actual', fontsize=12)
    ax.set_title('GL Account Confusion Matrix', fontsize=14)

    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    plt.tight_layout()

    return fig

Jinja2 Standalone Template Rendering

# Source: Jinja2 documentation, Flask patterns
import os

from jinja2 import Environment, FileSystemLoader, select_autoescape

def render_report(template_path: str, output_path: str, **context):
    """Render Jinja2 template to static HTML file."""
    template_dir = os.path.dirname(template_path)
    template_name = os.path.basename(template_path)

    env = Environment(
        loader=FileSystemLoader(template_dir),
        autoescape=select_autoescape(['html', 'xml'])
    )

    template = env.get_template(template_name)
    html = template.render(**context)

    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(html)

    return output_path

Figure to Base64 Data URI

# Source: matplotlib docs, Saturn Cloud blog
from io import BytesIO
import base64
import matplotlib.pyplot as plt

def figure_to_data_uri(fig, format='png', dpi=150) -> str:
    """Convert matplotlib figure to base64 data URI for HTML embedding."""
    buf = BytesIO()
    fig.savefig(buf, format=format, dpi=dpi, bbox_inches='tight')
    buf.seek(0)
    encoded = base64.b64encode(buf.read()).decode('utf-8')
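    # Map common formats to their registered MIME types (e.g. svg is image/svg+xml)
    mime = {'png': 'image/png', 'jpg': 'image/jpeg', 'jpeg': 'image/jpeg',
            'svg': 'image/svg+xml'}.get(format, f'image/{format}')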
    return f"data:{mime};base64,{encoded}"

State of the Art

Old Approach | Current Approach | When Changed | Impact
Human evaluation | LLM-as-judge | 2023-2024 | 10x faster, 80%+ human agreement
Rule-based equivalence | Semantic judge | 2024-2025 | Catches valid variations
External image files | Base64 embedding | Always available | Self-contained shareable reports
Manual example selection | Automated by pattern | 2024+ | Consistent, reproducible showcases
Single evaluator LLM | Multi-agent judges | 2025-2026 | Better for complex tasks

Deprecated/outdated:

Open Questions

  1. GL Account Semantic Groupings

  2. Judge Cost Impact

  3. Report Format for Stakeholders

Sources

Primary (HIGH confidence)

Secondary (MEDIUM confidence)

Tertiary (LOW confidence)

Metadata

Confidence breakdown:

Research date: 2026-02-20
Valid until: 2026-03-20 (30 days - stable domain, LLM judge patterns evolving but core approach stable)