Researched: 2026-02-20 Domain: LLM-as-judge evaluation, static HTML report generation, confusion matrix visualization, curated example showcase Confidence: HIGH
This phase completes the evaluation framework with LLM-as-judge for semantic relevance scoring when exact matches fail, generates static HTML reports with aggregate metrics, creates confusion matrices for GL account error analysis, and curates hand-picked examples showcasing best/worst cases per approach.
The core technical challenge is implementing a reliable LLM-as-judge that can determine semantic equivalence between predicted and ground truth GL accounts (e.g., both "6801" and "6800" might be valid for "office supplies" depending on context). The project already has google-genai for Gemini Flash calls and scikit-learn for confusion matrix computation. Static HTML report generation uses Jinja2 (already a Flask dependency) with base64-encoded matplotlib/seaborn visualizations for self-contained, shareable files.
The LLM judge should use a binary classification approach (semantically equivalent: YES/NO) with chain-of-thought reasoning and few-shot examples. Key best practices include: single-criterion prompts, explicit scoring rubrics, and confidence levels in the response. For confusion matrices, sklearn's confusion_matrix paired with seaborn heatmaps provides standard visualization. Hand-picked examples are selected by identifying prediction patterns: highest-confidence correct, lowest-confidence correct, highest-confidence wrong, and edge cases where models disagree.
Primary recommendation: Implement LLM judge as a standalone module invoked when exact match fails, generate self-contained HTML reports using Jinja2 with embedded base64 images, and use sklearn/seaborn for confusion matrix generation.
<user_constraints>
None -- all implementation decisions delegated to Claude.
All implementation decisions deferred to Claude -- user comfortable with standard approaches:
LLM judge criteria:
Report structure:
Example selection:
Confusion matrix:
None -- discussion stayed within phase scope. </user_constraints>
<phase_requirements>
| ID | Description | Research Support |
|---|---|---|
| EVAL-08 | LLM-as-judge evaluation for semantic relevance | Use Gemini Flash with binary classification (YES/NO) prompt, chain-of-thought reasoning, few-shot examples; invoke only when exact match fails |
| REPT-02 | Hand-picked example showcase (curated searches) | Select by prediction confidence bands and model agreement patterns; 5-10 examples per category (best, worst, divergent) |
| REPT-03 | Static HTML report export with aggregate metrics | Jinja2 template with embedded base64 images; single-file output requiring no external assets |
| REPT-04 | Confusion matrix for GL account prediction errors | sklearn.metrics.confusion_matrix + seaborn heatmap; group low-frequency accounts into "Other" category |
| </phase_requirements> |
| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| google-genai | 1.64+ (existing) | LLM judge API calls | Already in use for LLM matching; consistent with existing patterns |
| scikit-learn | 1.8.0 (installed) | Confusion matrix computation | Standard ML library; confusion_matrix, ConfusionMatrixDisplay |
| Jinja2 | (Flask dep) | HTML report templates | Already used by Flask; powerful templating for static output |
| matplotlib | 3.x | Figure generation for reports | Standard Python plotting; save to BytesIO for base64 embedding |
| Library | Version | Purpose | When to Use |
|---|---|---|---|
| seaborn | 0.13+ | Confusion matrix heatmap styling | More aesthetic than raw matplotlib; sns.heatmap() |
| base64 | stdlib | Image embedding in HTML | Convert matplotlib figures to data URIs |
| io.BytesIO | stdlib | In-memory figure storage | Avoid temp files when generating base64 images |
| Instead of | Could Use | Tradeoff |
|---|---|---|
| Gemini Flash judge | GPT-4 judge | Gemini already in use; GPT-4 would add new dependency and cost |
| seaborn heatmap | plotly interactive | Plotly requires JS bundle; seaborn PNG is simpler for static reports |
| Jinja2 standalone | html-reports package | Extra dependency; Jinja2 already available |
| Base64 embedding | External image files | External files break shareability; base64 is self-contained |
Installation:
# Seaborn and matplotlib may need explicit installation
uv add seaborn matplotlib
# All other dependencies already installed
src/
├── evaluation/
│ ├── metrics.py # existing
│ ├── benchmark.py # existing
│ ├── cost_tracker.py # existing
│ ├── llm_judge.py # NEW: semantic relevance judge
│ └── example_selector.py # NEW: best/worst case selection
├── reporting/
│ ├── __init__.py # NEW
│ ├── confusion_matrix.py # NEW: sklearn/seaborn CM generation
│ ├── report_generator.py # NEW: Jinja2 static HTML output
│ └── templates/
│ └── report.html # NEW: static report template
└── app.py # extend with report export route
What: Single-criterion evaluation prompt that returns YES/NO for semantic equivalence When to use: When predicted GL account differs from ground truth but might be semantically valid Example:
# Source: LLM-as-a-Judge best practices (Evidently AI, Monte Carlo)
JUDGE_PROMPT = """You are evaluating whether two GL account codes are semantically equivalent for a given line item.
Context: {line_item_description}
Predicted GL Account: {predicted}
Expected GL Account: {ground_truth}
Consider:
1. Are both accounts in the same category (e.g., both are operating expenses)?
2. Would an accountant accept either for this line item?
3. Is the difference meaningful for financial reporting?
Examples:
- "6801" vs "6800" for "Office supplies" -> YES (both are office/admin expenses)
- "6801" vs "4000" for "Office supplies" -> NO (expense vs revenue account)
- "7110" vs "7120" for "Consulting fees" -> YES (both are external services)
Answer with a single word: YES or NO
Then provide a one-sentence explanation.
Response format:
VERDICT: [YES/NO]
REASON: [one sentence]"""
def judge_semantic_equivalence(
line_item: str,
predicted: str,
ground_truth: str
) -> dict:
"""
Use LLM to judge if prediction is semantically equivalent to ground truth.
Returns:
dict with 'equivalent' (bool), 'reason' (str), 'raw_response' (str)
"""
prompt = JUDGE_PROMPT.format(
line_item_description=line_item,
predicted=predicted,
ground_truth=ground_truth
)
response = call_gemini_flash(prompt)
# Parse response
lines = response.strip().split('\n')
verdict_line = [l for l in lines if l.startswith('VERDICT:')][0]
verdict = 'YES' in verdict_line.upper()
reason_line = [l for l in lines if l.startswith('REASON:')]
reason = reason_line[0].replace('REASON:', '').strip() if reason_line else ''
return {
'equivalent': verdict,
'reason': reason,
'raw_response': response
}
What: Jinja2 template rendering with embedded base64 images When to use: Generating shareable static reports Example:
# Source: Practical Business Python (pbpython.com), Matplotlib docs
from io import BytesIO
import base64
from jinja2 import Environment, FileSystemLoader
import matplotlib.pyplot as plt
def fig_to_base64(fig) -> str:
"""Convert matplotlib figure to base64 data URI."""
buf = BytesIO()
fig.savefig(buf, format='png', dpi=150, bbox_inches='tight')
buf.seek(0)
data = base64.b64encode(buf.read()).decode('utf-8')
plt.close(fig)
return f"data:image/png;base64,{data}"
def generate_report(
benchmark_results: dict,
confusion_fig,
examples: list,
output_path: str
):
"""Generate self-contained HTML report."""
env = Environment(loader=FileSystemLoader('src/reporting/templates'))
template = env.get_template('report.html')
html = template.render(
results=benchmark_results,
confusion_matrix_img=fig_to_base64(confusion_fig),
examples=examples,
generated_at=datetime.now().isoformat(),
)
with open(output_path, 'w') as f:
f.write(html)
What: Group low-frequency GL accounts into "Other" to keep matrix readable When to use: When unique GL accounts > 15-20 (matrix becomes unreadable) Example:
# Source: sklearn.metrics.confusion_matrix, seaborn heatmap docs
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
def create_confusion_matrix(
y_true: list[str],
y_pred: list[str],
top_n: int = 15
) -> tuple:
"""
Create confusion matrix with top N labels, grouping rest as 'Other'.
Returns:
(figure, labels) tuple
"""
from collections import Counter
# Find top N most frequent ground truth labels
label_counts = Counter(y_true)
top_labels = [label for label, _ in label_counts.most_common(top_n)]
# Map non-top labels to 'Other'
def map_label(label):
return label if label in top_labels else 'Other'
y_true_mapped = [map_label(y) for y in y_true]
y_pred_mapped = [map_label(y) for y in y_pred]
# Include 'Other' in labels if needed
all_labels = top_labels + ['Other'] if 'Other' in y_true_mapped or 'Other' in y_pred_mapped else top_labels
# Compute confusion matrix
cm = confusion_matrix(y_true_mapped, y_pred_mapped, labels=all_labels)
# Create heatmap
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(
cm,
annot=True,
fmt='d',
cmap='Blues',
xticklabels=all_labels,
yticklabels=all_labels,
ax=ax
)
ax.set_xlabel('Predicted GL Account')
ax.set_ylabel('True GL Account')
ax.set_title('GL Account Prediction Confusion Matrix')
# Rotate x labels for readability
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
return fig, all_labels
What: Select representative examples across different outcome categories When to use: Creating curated showcase of best/worst/interesting cases Example:
# Source: Evaluation best practices, manual curation patterns
def select_showcase_examples(
results: list[dict],
per_category: int = 5
) -> dict[str, list]:
"""
Select hand-picked examples for showcase.
Categories:
- best_cases: High confidence, correct prediction, all models agree
- worst_cases: High confidence, wrong prediction
- edge_cases: Models disagree significantly
- llm_saves: LLM correct where embedding failed
Returns:
dict mapping category name to list of examples
"""
showcase = {
'best_cases': [],
'worst_cases': [],
'edge_cases': [],
'llm_saves': [],
}
for r in results:
# Best: all embedding models correct and agree
if r['google_correct'] and r['jina_correct'] and r['minilm_correct']:
showcase['best_cases'].append(r)
# Worst: high similarity but wrong
elif r['google_similarity'] > 0.9 and not r['google_correct']:
showcase['worst_cases'].append(r)
# Edge: models disagree
elif r['google_prediction'] != r['jina_prediction'] != r['minilm_prediction']:
showcase['edge_cases'].append(r)
# LLM saves: embedding wrong, LLM correct
elif r['llm_correct'] and not r['google_correct']:
showcase['llm_saves'].append(r)
# Take top N per category, sorted by interestingness
for cat in showcase:
# Sort by similarity score descending (most confident cases)
showcase[cat] = sorted(
showcase[cat],
key=lambda x: x.get('google_similarity', 0),
reverse=True
)[:per_category]
return showcase
| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| Confusion matrix calculation | Manual counting | sklearn.metrics.confusion_matrix |
Handles label ordering, zero counts correctly |
| Heatmap visualization | Custom plotting code | seaborn.heatmap() |
Better defaults, annotations, color scaling |
| Base64 encoding | Manual implementation | base64.b64encode() |
Standard library, battle-tested |
| HTML templating | String concatenation | Jinja2 | Escaping, inheritance, conditionals |
| Statistical comparisons | Manual chi-square | scipy.stats | Edge cases in significance testing |
Key insight: The evaluation logic is straightforward (LLM prompt + parse response), but the visualization and report generation have many edge cases (font sizes, label rotation, color scales) that established libraries handle well.
What goes wrong: Same input produces different verdicts on repeated calls Why it happens: Temperature > 0, prompt ambiguity, model variance How to avoid: Set temperature=0, use explicit few-shot examples, structured response format Warning signs: Flaky tests, verdict flips on re-run
What goes wrong: Malicious or weird line item descriptions affect judge behavior Why it happens: User input included directly in prompt How to avoid: Truncate/sanitize line item text, use clear delimiters Warning signs: Judge returns unexpected verdicts for unusual inputs
What goes wrong: Matrix is 100x100 with mostly empty cells Why it happens: Many unique GL accounts, most with <5 samples How to avoid: Group low-frequency labels into "Other" category Warning signs: Matrix renders as tiny unreadable squares
What goes wrong: HTML report is 50MB+ because of high-res images Why it happens: DPI too high, figure size too large How to avoid: Use dpi=150 or lower, reasonable figure sizes (10-12 inches) Warning signs: Report slow to load/render, email attachment limits
What goes wrong: UserWarning: Matplotlib is currently using agg or display errors
Why it happens: No display available in server environment
How to avoid: Use Agg backend explicitly: matplotlib.use('Agg')
Warning signs: Errors mentioning Tkinter, display, or backend
What goes wrong: Showcase only shows easy cases or only failures Why it happens: Selection criteria too narrow How to avoid: Explicitly select from multiple categories; include edge cases Warning signs: Examples don't represent true distribution of outcomes
Verified patterns from official sources:
# Source: google-genai SDK documentation
from google import genai
def call_gemini_judge(prompt: str) -> str:
"""Call Gemini Flash for judge evaluation."""
client = genai.Client(api_key=os.environ['GOOGLE_API_KEY'])
response = client.models.generate_content(
model='gemini-2.5-flash',
contents=prompt,
config={
'temperature': 0, # Deterministic for consistency
}
)
return response.text.strip()
# Source: scikit-learn 1.8.0 docs, seaborn heatmap docs
import matplotlib
matplotlib.use('Agg') # Non-interactive backend
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(y_true, y_pred, labels):
"""Create publication-quality confusion matrix heatmap."""
cm = confusion_matrix(y_true, y_pred, labels=labels)
# Normalize for percentages (optional)
# cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(
cm,
annot=True,
fmt='d',
cmap='Blues',
xticklabels=labels,
yticklabels=labels,
square=True,
linewidths=0.5,
ax=ax
)
ax.set_xlabel('Predicted', fontsize=12)
ax.set_ylabel('Actual', fontsize=12)
ax.set_title('GL Account Confusion Matrix', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
return fig
# Source: Jinja2 documentation, Flask patterns
from jinja2 import Environment, FileSystemLoader, select_autoescape
def render_report(template_path: str, output_path: str, **context):
"""Render Jinja2 template to static HTML file."""
template_dir = os.path.dirname(template_path)
template_name = os.path.basename(template_path)
env = Environment(
loader=FileSystemLoader(template_dir),
autoescape=select_autoescape(['html', 'xml'])
)
template = env.get_template(template_name)
html = template.render(**context)
with open(output_path, 'w', encoding='utf-8') as f:
f.write(html)
return output_path
# Source: matplotlib docs, Saturn Cloud blog
from io import BytesIO
import base64
import matplotlib.pyplot as plt
def figure_to_data_uri(fig, format='png', dpi=150) -> str:
"""Convert matplotlib figure to base64 data URI for HTML embedding."""
buf = BytesIO()
fig.savefig(buf, format=format, dpi=dpi, bbox_inches='tight')
buf.seek(0)
encoded = base64.b64encode(buf.read()).decode('utf-8')
mime = 'image/png' if format == 'png' else f'image/{format}'
return f"data:{mime};base64,{encoded}"
| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| Human evaluation | LLM-as-judge | 2023-2024 | 10x faster, 80%+ human agreement |
| Rule-based equivalence | Semantic judge | 2024-2025 | Catches valid variations |
| External image files | Base64 embedding | Always available | Self-contained shareable reports |
| Manual example selection | Automated by pattern | 2024+ | Consistent, reproducible showcases |
| Single evaluator LLM | Multi-agent judges | 2025-2026 | Better for complex tasks |
Deprecated/outdated:
response_mime_type: 'application/json' for judge: Simple YES/NO is more reliable than JSON parsingGL Account Semantic Groupings
Judge Cost Impact
Report Format for Stakeholders
Confidence breakdown:
Research date: 2026-02-20 Valid until: 2026-03-20 (30 days - stable domain, LLM judge patterns evolving but core approach stable)