Phase 5: Reporting & LLM Judge - Research

Researched: 2026-02-20
Domain: LLM-as-judge evaluation, static HTML report generation, confusion matrix visualization, curated example showcase
Confidence: HIGH

Summary

This phase completes the evaluation framework with LLM-as-judge for semantic relevance scoring when exact matches fail, generates static HTML reports with aggregate metrics, creates confusion matrices for GL account error analysis, and curates hand-picked examples showcasing best/worst cases per approach.

The core technical challenge is implementing a reliable LLM-as-judge that can determine semantic equivalence between predicted and ground truth GL accounts (e.g., both "6801" and "6800" might be valid for "office supplies" depending on context). The project already has google-genai for Gemini Flash calls and scikit-learn for confusion matrix computation. Static HTML report generation uses Jinja2 (already a Flask dependency) with base64-encoded matplotlib/seaborn visualizations for self-contained, shareable files.

The LLM judge should use a binary classification approach (semantically equivalent: YES/NO) with chain-of-thought reasoning and few-shot examples. Key best practices include: single-criterion prompts, explicit scoring rubrics, and confidence levels in the response. For confusion matrices, sklearn's confusion_matrix paired with seaborn heatmaps provides standard visualization. Hand-picked examples are selected by identifying prediction patterns: highest-confidence correct, lowest-confidence correct, highest-confidence wrong, and edge cases where models disagree.

Primary recommendation: Implement LLM judge as a standalone module invoked when exact match fails, generate self-contained HTML reports using Jinja2 with embedded base64 images, and use sklearn/seaborn for confusion matrix generation.

<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

None -- all implementation decisions delegated to Claude.

Claude's Discretion

All implementation decisions deferred to Claude -- user comfortable with standard approaches:

LLM judge criteria:

When to invoke the judge (exact match failures)
What constitutes "semantically equivalent"
Scoring scale and thresholds
Prompt design for the judge

Report structure:

Sections and layout
Which metrics to highlight
Format (HTML)
Visual design choices

Example selection:

Criteria for best/worst case selection
Number of examples per category
Whether to group by model or show aggregate

Confusion matrix:

What dimensions to visualize
Top N error patterns
Grouping strategies for GL accounts

Deferred Ideas (OUT OF SCOPE)

None -- discussion stayed within phase scope.

</user_constraints>

<phase_requirements>

Phase Requirements

ID	Description	Research Support
EVAL-08	LLM-as-judge evaluation for semantic relevance	Use Gemini Flash with binary classification (YES/NO) prompt, chain-of-thought reasoning, few-shot examples; invoke only when exact match fails
REPT-02	Hand-picked example showcase (curated searches)	Select by prediction confidence bands and model agreement patterns; 5-10 examples per category (best, worst, divergent)
REPT-03	Static HTML report export with aggregate metrics	Jinja2 template with embedded base64 images; single-file output requiring no external assets
REPT-04	Confusion matrix for GL account prediction errors	sklearn.metrics.confusion_matrix + seaborn heatmap; group low-frequency accounts into "Other" category
</phase_requirements>

Standard Stack

Core

Library	Version	Purpose	Why Standard
google-genai	1.64+ (existing)	LLM judge API calls	Already in use for LLM matching; consistent with existing patterns
scikit-learn	1.8.0 (installed)	Confusion matrix computation	Standard ML library; `confusion_matrix`, `ConfusionMatrixDisplay`
Jinja2	(Flask dep)	HTML report templates	Already used by Flask; powerful templating for static output
matplotlib	3.x	Figure generation for reports	Standard Python plotting; save to BytesIO for base64 embedding

Supporting

Library	Version	Purpose	When to Use
seaborn	0.13+	Confusion matrix heatmap styling	More aesthetic than raw matplotlib; `sns.heatmap()`
base64	stdlib	Image embedding in HTML	Convert matplotlib figures to data URIs
io.BytesIO	stdlib	In-memory figure storage	Avoid temp files when generating base64 images

Alternatives Considered

Instead of	Could Use	Tradeoff
Gemini Flash judge	GPT-4 judge	Gemini already in use; GPT-4 would add new dependency and cost
seaborn heatmap	plotly interactive	Plotly requires JS bundle; seaborn PNG is simpler for static reports
Jinja2 standalone	html-reports package	Extra dependency; Jinja2 already available
Base64 embedding	External image files	External files break shareability; base64 is self-contained

Installation:

# Seaborn and matplotlib may need explicit installation
uv add seaborn matplotlib
# All other dependencies already installed

Architecture Patterns

Recommended Project Structure

src/
├── evaluation/
│   ├── metrics.py              # existing
│   ├── benchmark.py            # existing
│   ├── cost_tracker.py         # existing
│   ├── llm_judge.py            # NEW: semantic relevance judge
│   └── example_selector.py     # NEW: best/worst case selection
├── reporting/
│   ├── __init__.py             # NEW
│   ├── confusion_matrix.py     # NEW: sklearn/seaborn CM generation
│   ├── report_generator.py     # NEW: Jinja2 static HTML output
│   └── templates/
│       └── report.html         # NEW: static report template
└── app.py                      # extend with report export route

Pattern 1: LLM-as-Judge with Binary Classification

What: Single-criterion evaluation prompt that returns YES/NO for semantic equivalence
When to use: When the predicted GL account differs from ground truth but might be semantically valid
Example:

# Source: LLM-as-a-Judge best practices (Evidently AI, Monte Carlo)
JUDGE_PROMPT = """You are evaluating whether two GL account codes are semantically equivalent for a given line item.

Context: {line_item_description}
Predicted GL Account: {predicted}
Expected GL Account: {ground_truth}

Consider:
1. Are both accounts in the same category (e.g., both are operating expenses)?
2. Would an accountant accept either for this line item?
3. Is the difference meaningful for financial reporting?

Examples:
- "6801" vs "6800" for "Office supplies" -> YES (both are office/admin expenses)
- "6801" vs "4000" for "Office supplies" -> NO (expense vs revenue account)
- "7110" vs "7120" for "Consulting fees" -> YES (both are external services)

Answer with a single word: YES or NO
Then provide a one-sentence explanation.

Response format:
VERDICT: [YES/NO]
REASON: [one sentence]"""

def judge_semantic_equivalence(
    line_item: str,
    predicted: str,
    ground_truth: str
) -> dict:
    """
    Use LLM to judge if prediction is semantically equivalent to ground truth.

    Returns:
        dict with 'equivalent' (bool), 'reason' (str), 'raw_response' (str)
    """
    prompt = JUDGE_PROMPT.format(
        line_item_description=line_item,
        predicted=predicted,
        ground_truth=ground_truth
    )

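    # Any Gemini Flash wrapper that returns the response text works here,
    # e.g. call_gemini_judge under "LLM Judge API Call"
    response = call_gemini_flash(prompt)

    # Parse the structured response; tolerate indentation, casing, and missing lines
    lines = [l.strip() for l in response.strip().splitlines()]
    verdict_line = next((l for l in lines if l.upper().startswith('VERDICT:')), '')
    # A missing or malformed VERDICT line counts as "not equivalent" instead of raising
    verdict = verdict_line[len('VERDICT:'):].strip().upper().startswith('YES')

    reason_line = next((l for l in lines if l.upper().startswith('REASON:')), '')
    reason = reason_line[len('REASON:'):].strip()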

    return {
        'equivalent': verdict,
        'reason': reason,
        'raw_response': response
    }
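
Once verdicts exist, the report can show exact-match accuracy next to a judge-adjusted figure. The following is a minimal sketch under assumptions: `summarize_accuracy` and the record keys `predicted`, `ground_truth`, and `judge_equivalent` are hypothetical names, not part of the existing benchmark code.

```python
def summarize_accuracy(records: list[dict]) -> dict:
    """Report exact-match accuracy next to judge-adjusted accuracy.

    Each record is assumed to hold 'predicted' and 'ground_truth', plus an
    optional 'judge_equivalent' bool that is present only when the judge ran.
    """
    total = len(records)
    if total == 0:
        return {'total': 0, 'exact': 0.0, 'judge_adjusted': 0.0, 'judged': 0}

    # Strict score: predicted code equals the ground truth code
    exact = sum(1 for r in records if r['predicted'] == r['ground_truth'])

    # Mismatches that were actually sent to the judge
    judged = [
        r for r in records
        if r['predicted'] != r['ground_truth'] and 'judge_equivalent' in r
    ]
    # Mismatches the judge accepted as semantically equivalent
    rescued = sum(1 for r in judged if r['judge_equivalent'])

    return {
        'total': total,
        'exact': exact / total,
        'judge_adjusted': (exact + rescued) / total,
        'judged': len(judged),
    }
```

Reporting both numbers keeps the strict metric visible, so the judge's leniency cannot hide a regression in exact matching.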

Pattern 2: Self-Contained HTML Report Generation

What: Jinja2 template rendering with embedded base64 images
When to use: Generating shareable static reports
Example:

# Source: Practical Business Python (pbpython.com), Matplotlib docs
from datetime import datetime
from io import BytesIO
import base64
from jinja2 import Environment, FileSystemLoader, select_autoescape
import matplotlib.pyplot as plt

def fig_to_base64(fig) -> str:
    """Convert matplotlib figure to base64 data URI."""
    buf = BytesIO()
    fig.savefig(buf, format='png', dpi=150, bbox_inches='tight')
    buf.seek(0)
    data = base64.b64encode(buf.read()).decode('utf-8')
    plt.close(fig)  # release the figure once it has been serialized
    return f"data:image/png;base64,{data}"

def generate_report(
    benchmark_results: dict,
    confusion_fig,
    examples: list,
    output_path: str
) -> None:
    """Generate self-contained HTML report."""
    env = Environment(
        loader=FileSystemLoader('src/reporting/templates'),
        # Escape line item text and judge reasons before they reach the HTML
        autoescape=select_autoescape(['html', 'xml']),
    )
    template = env.get_template('report.html')

    html = template.render(
        results=benchmark_results,
        confusion_matrix_img=fig_to_base64(confusion_fig),
        examples=examples,
        generated_at=datetime.now().isoformat(),
    )

    # Explicit encoding so non-ASCII line items survive on every platform
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(html)

Pattern 3: Confusion Matrix with Grouped Labels

What: Group low-frequency GL accounts into "Other" to keep the matrix readable
When to use: When there are more than 15-20 unique GL accounts (the matrix becomes unreadable)
Example:

# Source: sklearn.metrics.confusion_matrix, seaborn heatmap docs
from collections import Counter

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def create_confusion_matrix(
    y_true: list[str],
    y_pred: list[str],
    top_n: int = 15
) -> tuple:
    """
    Create confusion matrix with top N labels, grouping rest as 'Other'.

    Returns:
        (figure, labels) tuple
    """
    # Find top N most frequent ground truth labels
    label_counts = Counter(y_true)
    top_labels = [label for label, _ in label_counts.most_common(top_n)]
    top_label_set = set(top_labels)  # constant-time membership checks

    # Map non-top labels to 'Other'
    def map_label(label):
        return label if label in top_label_set else 'Other'

    y_true_mapped = [map_label(y) for y in y_true]
    y_pred_mapped = [map_label(y) for y in y_pred]

    # Append 'Other' only if some label was actually grouped into it
    has_other = 'Other' in y_true_mapped or 'Other' in y_pred_mapped
    all_labels = top_labels + ['Other'] if has_other else top_labels

    # Compute confusion matrix
    cm = confusion_matrix(y_true_mapped, y_pred_mapped, labels=all_labels)

    # Create heatmap
    fig, ax = plt.subplots(figsize=(12, 10))
    sns.heatmap(
        cm,
        annot=True,
        fmt='d',
        cmap='Blues',
        xticklabels=all_labels,
        yticklabels=all_labels,
        ax=ax
    )
    ax.set_xlabel('Predicted GL Account')
    ax.set_ylabel('True GL Account')
    ax.set_title('GL Account Prediction Confusion Matrix')

    # Rotate x labels for readability
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()

    return fig, all_labels
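
The "Top N error patterns" item can be served without a plot by counting the most frequent (true, predicted) mismatches. The sketch below uses only the standard library; `top_confusions` is a hypothetical helper name, and the account codes in the usage note are illustrative.

```python
from collections import Counter

def top_confusions(
    y_true: list[str],
    y_pred: list[str],
    n: int = 10
) -> list[tuple[str, str, int]]:
    """Return the n most frequent (true, predicted, count) mismatches."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')

    # Count only off-diagonal cells, i.e. actual prediction errors
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)

    return [(t, p, count) for (t, p), count in pairs.most_common(n)]
```

For instance, `top_confusions(['6800', '6800', '7110'], ['6801', '6801', '7120'])` lists the 6800 to 6801 confusion first because it occurs twice. A table built from this output can sit beside the heatmap in the report.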

Pattern 4: Example Selection by Prediction Pattern

What: Select representative examples across different outcome categories
When to use: Creating a curated showcase of best/worst/interesting cases
Example:

# Source: Evaluation best practices, manual curation patterns
def select_showcase_examples(
    results: list[dict],
    per_category: int = 5
) -> dict[str, list]:
    """
    Select hand-picked examples for showcase.

    Categories:
    - best_cases: High confidence, correct prediction, all models agree
    - worst_cases: High confidence, wrong prediction
    - edge_cases: Models disagree significantly
    - llm_saves: LLM correct where embedding failed

    Returns:
        dict mapping category name to list of examples
    """
    showcase = {
        'best_cases': [],
        'worst_cases': [],
        'edge_cases': [],
        'llm_saves': [],
    }

    for r in results:
        # Best: all embedding models correct and agree
        if r['google_correct'] and r['jina_correct'] and r['minilm_correct']:
            showcase['best_cases'].append(r)

        # Worst: high similarity but wrong
        elif r['google_similarity'] > 0.9 and not r['google_correct']:
            showcase['worst_cases'].append(r)

        # Edge: at least two models disagree
        # (a chained a != b != c never compares google with minilm)
        elif len({r['google_prediction'], r['jina_prediction'], r['minilm_prediction']}) > 1:
            showcase['edge_cases'].append(r)

        # LLM saves: embedding wrong, LLM correct
        elif r['llm_correct'] and not r['google_correct']:
            showcase['llm_saves'].append(r)

    # Take top N per category, sorted by interestingness
    for cat in showcase:
        # Sort by similarity score descending (most confident cases)
        showcase[cat] = sorted(
            showcase[cat],
            key=lambda x: x.get('google_similarity', 0),
            reverse=True
        )[:per_category]

    return showcase
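
The Summary also mentions confidence bands (highest-confidence correct, lowest-confidence correct, highest-confidence wrong), which the category logic above does not separate. A possible sketch follows, assuming the same `google_similarity` and `google_correct` keys; `select_by_confidence_band` and the band names are hypothetical.

```python
def select_by_confidence_band(
    results: list[dict],
    per_band: int = 5,
    score_key: str = 'google_similarity',
    correct_key: str = 'google_correct'
) -> dict[str, list]:
    """Pick examples from three confidence bands for one model."""
    # Correct predictions, least confident first
    correct = sorted(
        (r for r in results if r[correct_key]),
        key=lambda r: r[score_key]
    )
    # Wrong predictions, most confident first
    wrong = sorted(
        (r for r in results if not r[correct_key]),
        key=lambda r: r[score_key],
        reverse=True
    )

    return {
        'confident_correct': correct[::-1][:per_band],
        'borderline_correct': correct[:per_band],
        'confident_wrong': wrong[:per_band],
    }
```

When fewer than `2 * per_band` correct results exist, the two "correct" bands overlap, so a caller may want to deduplicate before rendering.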

Anti-Patterns to Avoid

Calling LLM judge for every prediction: Expensive and slow; only invoke when exact match fails
Unstructured judge responses: Always use explicit response format (VERDICT/REASON) for reliable parsing
Large confusion matrices: 50x50 matrices are unreadable; group to top 15-20 labels
External dependencies in reports: Images, CSS, or JS files break shareability; embed everything
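
The first anti-pattern can be avoided with a small gate in front of the judge. The sketch below is illustrative only: `score_prediction`, the `judge` callable, and the in-memory `cache` dict are hypothetical names, not part of the existing codebase.

```python
from typing import Callable

def score_prediction(
    line_item: str,
    predicted: str,
    ground_truth: str,
    judge: Callable[[str, str, str], bool],
    cache: dict
) -> dict:
    """Score one prediction, calling the judge only on exact-match failure."""
    # Cheap path: an exact match never needs an LLM call
    if predicted.strip() == ground_truth.strip():
        return {'correct': True, 'method': 'exact'}

    # Identical (line item, predicted, truth) triples are judged once
    key = (line_item, predicted, ground_truth)
    if key not in cache:
        cache[key] = judge(line_item, predicted, ground_truth)

    return {'correct': cache[key], 'method': 'judge'}
```

Recording `method` alongside the result makes it possible to report the judge invocation rate later.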

Don't Hand-Roll

Problem	Don't Build	Use Instead	Why
Confusion matrix calculation	Manual counting	`sklearn.metrics.confusion_matrix`	Handles label ordering, zero counts correctly
Heatmap visualization	Custom plotting code	`seaborn.heatmap()`	Better defaults, annotations, color scaling
Base64 encoding	Manual implementation	`base64.b64encode()`	Standard library, battle-tested
HTML templating	String concatenation	Jinja2	Escaping, inheritance, conditionals
Statistical comparisons	Manual chi-square	scipy.stats	Edge cases in significance testing

Key insight: The evaluation logic is straightforward (LLM prompt + parse response), but the visualization and report generation have many edge cases (font sizes, label rotation, color scales) that established libraries handle well.

Common Pitfalls

Pitfall 1: LLM Judge Inconsistency

What goes wrong: Same input produces different verdicts on repeated calls
Why it happens: Temperature > 0, prompt ambiguity, model variance
How to avoid: Set temperature=0, use explicit few-shot examples, structured response format
Warning signs: Flaky tests, verdict flips on re-run
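
One way to detect verdict flips before they reach the report is to repeat a judge call and measure agreement. This is a sketch under assumptions: `verdict_stability` is a hypothetical helper, and `judge` stands for any callable that maps a prompt to a verdict string.

```python
from collections import Counter
from typing import Callable

def verdict_stability(
    judge: Callable[[str], str],
    prompt: str,
    runs: int = 3
) -> dict:
    """Call the judge several times and report how often verdicts agree."""
    if runs < 1:
        raise ValueError('runs must be at least 1')

    verdicts = [judge(prompt) for _ in range(runs)]
    counts = Counter(verdicts)
    majority, majority_count = counts.most_common(1)[0]

    return {
        'majority': majority,
        'agreement': majority_count / runs,  # 1.0 means every run agreed
        'stable': len(counts) == 1,
    }
```

Running this on a small sample of judged pairs is cheaper than repeating every call, and an `agreement` below 1.0 points to prompts worth tightening.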

Pitfall 2: Prompt Injection in Judge Inputs

What goes wrong: Malicious or unusual line item descriptions affect judge behavior
Why it happens: User input is included directly in the prompt
How to avoid: Truncate/sanitize line item text, use clear delimiters
Warning signs: Judge returns unexpected verdicts for unusual inputs
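
The truncate/sanitize/delimit advice could look roughly like the sketch below. The 500-character limit, the `<<<LINE_ITEM ... LINE_ITEM>>>` delimiters, and both function names are assumptions chosen for illustration. This reduces the attack surface but does not guarantee protection against injection.

```python
import re

MAX_LINE_ITEM_CHARS = 500  # assumed limit; tune to the real data

def sanitize_line_item(text: str, max_chars: int = MAX_LINE_ITEM_CHARS) -> str:
    """Normalize untrusted line item text before it enters the judge prompt."""
    # Replace control characters (including newlines and tabs) with spaces
    cleaned = re.sub(r'[\x00-\x1f\x7f]', ' ', text)
    # Replace runs of 3+ angle brackets so the input cannot mimic the delimiters
    cleaned = re.sub(r'[<>]{3,}', ' ', cleaned)
    # Collapse repeated whitespace and trim the ends
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()
    return cleaned[:max_chars]

def wrap_line_item(text: str) -> str:
    """Wrap sanitized text in explicit delimiters for the prompt."""
    return f"<<<LINE_ITEM\n{sanitize_line_item(text)}\nLINE_ITEM>>>"
```

The prompt would then tell the judge to treat everything between the delimiters as data, not as instructions.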

Pitfall 3: Confusion Matrix Label Explosion

What goes wrong: Matrix is 100x100 with mostly empty cells
Why it happens: Many unique GL accounts, most with <5 samples
How to avoid: Group low-frequency labels into "Other" category
Warning signs: Matrix renders as tiny unreadable squares

Pitfall 4: Base64 Images Too Large

What goes wrong: HTML report is 50MB+ because of high-res images
Why it happens: DPI too high, figure size too large
How to avoid: Use dpi=150 or lower, reasonable figure sizes (10-12 inches)
Warning signs: Report slow to load/render, email attachment limits
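
Base64 inflates every 3 raw bytes to 4 characters, so the embedded size can be checked before the report is written. A minimal sketch follows; `base64_encoded_size`, `check_image_budget`, and the 2 MB default budget are hypothetical.

```python
def base64_encoded_size(raw_bytes: int) -> int:
    """Length in characters of padded base64 output for raw_bytes bytes."""
    # Every started group of 3 input bytes becomes 4 output characters
    return 4 * ((raw_bytes + 2) // 3)

def check_image_budget(image_bytes: bytes, max_encoded_bytes: int = 2_000_000) -> bool:
    """Return True if the encoded image fits the per-image budget."""
    return base64_encoded_size(len(image_bytes)) <= max_encoded_bytes
```

If a figure exceeds the budget, lowering `dpi` or the figure size before re-rendering is usually the simplest fix.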

Pitfall 5: Missing Dependencies for Matplotlib Backend

What goes wrong: "UserWarning: Matplotlib is currently using agg" warnings, or display errors
Why it happens: No display available in server environment
How to avoid: Select the Agg backend explicitly with matplotlib.use('Agg') before importing pyplot
Warning signs: Errors mentioning Tkinter, display, or backend

Pitfall 6: Example Selection Bias

What goes wrong: Showcase only shows easy cases or only failures
Why it happens: Selection criteria too narrow
How to avoid: Explicitly select from multiple categories; include edge cases
Warning signs: Examples don't represent true distribution of outcomes

Code Examples

Verified patterns from official sources:

LLM Judge API Call

# Source: google-genai SDK documentation
import os

from google import genai

def call_gemini_judge(prompt: str) -> str:
    """Call Gemini Flash for judge evaluation."""
    client = genai.Client(api_key=os.environ['GOOGLE_API_KEY'])

    response = client.models.generate_content(
        model='gemini-2.5-flash',
        contents=prompt,
        config={
            'temperature': 0,  # Reduces run-to-run variance; not a strict determinism guarantee
        }
    )

    # response.text can be None when no text part is returned (e.g. a blocked response)
    return (response.text or '').strip()

Confusion Matrix with seaborn

# Source: scikit-learn 1.8.0 docs, seaborn heatmap docs
import matplotlib
matplotlib.use('Agg')  # Non-interactive backend

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true, y_pred, labels):
    """Create publication-quality confusion matrix heatmap."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)

    # Normalize for percentages (optional)
    # cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    fig, ax = plt.subplots(figsize=(12, 10))
    sns.heatmap(
        cm,
        annot=True,
        fmt='d',
        cmap='Blues',
        xticklabels=labels,
        yticklabels=labels,
        square=True,
        linewidths=0.5,
        ax=ax
    )
    ax.set_xlabel('Predicted', fontsize=12)
    ax.set_ylabel('Actual', fontsize=12)
    ax.set_title('GL Account Confusion Matrix', fontsize=14)

    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    plt.tight_layout()

    return fig

Jinja2 Standalone Template Rendering

# Source: Jinja2 documentation, Flask patterns
import os

from jinja2 import Environment, FileSystemLoader, select_autoescape

def render_report(template_path: str, output_path: str, **context):
    """Render Jinja2 template to static HTML file."""
    template_dir = os.path.dirname(template_path)
    template_name = os.path.basename(template_path)

    env = Environment(
        loader=FileSystemLoader(template_dir),
        autoescape=select_autoescape(['html', 'xml'])
    )

    template = env.get_template(template_name)
    html = template.render(**context)

    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(html)

    return output_path

Figure to Base64 Data URI

# Source: matplotlib docs, Saturn Cloud blog
from io import BytesIO
import base64
import matplotlib.pyplot as plt

def figure_to_data_uri(fig, format='png', dpi=150) -> str:
    """Convert matplotlib figure to base64 data URI for HTML embedding."""
    buf = BytesIO()
    fig.savefig(buf, format=format, dpi=dpi, bbox_inches='tight')
    buf.seek(0)
    encoded = base64.b64encode(buf.read()).decode('utf-8')
    # 'svg' and 'jpg' do not map directly onto their MIME subtypes
    mime = {'svg': 'image/svg+xml', 'jpg': 'image/jpeg'}.get(format, f'image/{format}')
    return f"data:{mime};base64,{encoded}"

State of the Art

Old Approach	Current Approach	When Changed	Impact
Human evaluation	LLM-as-judge	2023-2024	10x faster, 80%+ human agreement
Rule-based equivalence	Semantic judge	2024-2025	Catches valid variations
External image files	Base64 embedding	Always available	Self-contained shareable reports
Manual example selection	Automated by pattern	2024+	Consistent, reproducible showcases
Single evaluator LLM	Multi-agent judges	2025-2026	Better for complex tasks

Deprecated/outdated:

Using response_mime_type: 'application/json' for judge: Simple YES/NO is more reliable than JSON parsing
Single monolithic judge prompt: Decompose into single-criterion evaluations
vertexai.generative_models module: Deprecated June 2025, use google-genai SDK

Open Questions

GL Account Semantic Groupings
- What we know: Account codes like 6801/6800 might be semantically equivalent
- What's unclear: What is the actual account chart hierarchy?
- Recommendation: Use LLM judge with domain context; it can infer relationships
Judge Cost Impact
- What we know: Only invoked when exact match fails; Gemini Flash is cheap ($0.30/1M input)
- What's unclear: What percentage of predictions will need judging?
- Recommendation: Track judge invocation rate; set budget cap if needed
Report Format for Stakeholders
- What we know: HTML is specified; self-contained is required
- What's unclear: Who will consume the report? (technical vs business)
- Recommendation: Include both summary metrics and detailed examples
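
For the "Judge Cost Impact" question, a back-of-the-envelope estimate can be made once an exact-match rate is known. The sketch below counts input tokens only and ignores output tokens. `estimate_judge_cost` is a hypothetical helper, and the figures in the usage note are placeholders, not measurements.

```python
def estimate_judge_cost(
    total_predictions: int,
    exact_match_rate: float,
    avg_prompt_tokens: int,
    price_per_million_input: float = 0.30
) -> dict:
    """Estimate judge calls and input-token cost for one benchmark run."""
    if not 0.0 <= exact_match_rate <= 1.0:
        raise ValueError('exact_match_rate must be between 0 and 1')

    # Only exact-match failures are sent to the judge
    judged_calls = round(total_predictions * (1 - exact_match_rate))
    input_tokens = judged_calls * avg_prompt_tokens

    return {
        'judged_calls': judged_calls,
        'input_tokens': input_tokens,
        'input_cost_usd': input_tokens / 1_000_000 * price_per_million_input,
    }
```

As an illustration with made-up inputs, 10,000 predictions at a 75% exact-match rate and 400 prompt tokens per call give 2,500 judge calls and 1,000,000 input tokens. Comparing the estimate with the tracked invocation rate shows whether a budget cap is needed.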

Sources

Primary (HIGH confidence)

sklearn.metrics.confusion_matrix - confusion matrix computation
seaborn.heatmap - heatmap visualization
Jinja2 documentation - template rendering
matplotlib savefig to BytesIO - base64 embedding

Secondary (MEDIUM confidence)

LLM-as-a-Judge: Complete Guide (Evidently AI) - judge best practices
LLM-As-Judge: 7 Best Practices (Monte Carlo) - prompt engineering
3 Ways to Embed Matplotlib in HTML (Medium) - base64 patterns
Confusion Matrix Visualization (Medium) - styling tips

Tertiary (LOW confidence)

GL account semantic equivalence rules - domain-specific, inferred from context
Judge invocation rate estimate - depends on model accuracy, unknown until runtime

Metadata

Confidence breakdown:

Standard stack: HIGH - all libraries verified installed or as Flask dependencies
LLM judge pattern: MEDIUM - best practices well-documented, but domain-specific tuning needed
Confusion matrix: HIGH - sklearn/seaborn standard patterns
Static HTML generation: HIGH - Jinja2/base64 well-documented
Example selection: MEDIUM - criteria are subjective; patterns are recommendations

Research date: 2026-02-20
Valid until: 2026-03-20 (30 days - stable domain, LLM judge patterns evolving but core approach stable)