# Phase 5: Reporting & LLM Judge - Context

Gathered: 2026-02-20
Status: Ready for planning

## Phase Boundary

Complete the evaluation framework with LLM-as-judge for semantic relevance when exact matches fail, plus static HTML reports with hand-picked examples and confusion matrices. This phase produces shareable artifacts that demonstrate the spike's findings.

## Implementation Decisions

### Claude's Discretion

All implementation decisions are deferred to Claude; the user is comfortable with standard approaches:

LLM judge criteria:

- When to invoke the judge (exact match failures)
- What constitutes "semantically equivalent"
- Scoring scale and thresholds
- Prompt design for the judge
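
The judge criteria above could be wired together roughly as follows. This is a minimal sketch, not a decided design: the 1-5 scale, `ACCEPT_THRESHOLD`, the whitespace/case normalization, and the `stub_judge` stand-in (token overlap instead of a real LLM call) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumption: a 1-5 judge scale where 4 or higher counts as "semantically equivalent".
ACCEPT_THRESHOLD = 4


@dataclass
class Verdict:
    exact_match: bool
    judge_score: Optional[int]  # None when the judge was not invoked
    accepted: bool


def evaluate(predicted: str, expected: str,
             judge: Callable[[str, str], int]) -> Verdict:
    """Try an exact match first; call the judge only when it fails."""
    # Assumption: "exact" ignores surrounding whitespace and letter case.
    if predicted.strip().casefold() == expected.strip().casefold():
        return Verdict(exact_match=True, judge_score=None, accepted=True)
    score = judge(predicted, expected)
    return Verdict(exact_match=False, judge_score=score,
                   accepted=score >= ACCEPT_THRESHOLD)


def stub_judge(predicted: str, expected: str) -> int:
    """Hypothetical stand-in for an LLM call: token overlap mapped onto 1-5."""
    a = set(predicted.casefold().split())
    b = set(expected.casefold().split())
    overlap = len(a & b) / len(a | b) if (a | b) else 0.0
    return 1 + round(4 * overlap)


if __name__ == "__main__":
    print(evaluate("Office Supplies", "office supplies ", stub_judge))
    print(evaluate("Travel", "Office Supplies", stub_judge))
```

Passing the judge in as a callable keeps the fallback logic testable without network access; a real judge would replace `stub_judge` with a prompted model call that returns a score on the same scale.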

Report structure:

- Sections and layout
- Which metrics to highlight
- Format (HTML)
- Visual design choices
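
One possible shape for the static HTML output is a single self-contained string built with the standard library. This is only a sketch: the function name `render_report`, the metric names, and the bare table layout are hypothetical, and the actual sections and styling remain open.

```python
from html import escape
from typing import Mapping


def render_report(title: str, metrics: Mapping[str, float]) -> str:
    """Render a minimal static HTML page with one metrics table."""
    rows = "\n".join(
        f"<tr><td>{escape(name)}</td><td>{value:.3f}</td></tr>"
        for name, value in metrics.items()
    )
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        f"<body><h1>{escape(title)}</h1>\n"
        f"<table><tr><th>Metric</th><th>Value</th></tr>\n{rows}\n</table>\n"
        "</body></html>"
    )


if __name__ == "__main__":
    # Hypothetical metric names and values, for illustration only.
    page = render_report("Spike results", {"exact_match": 0.5, "judge_accepted": 0.75})
    print(page)
```

Escaping every interpolated string matters here because example text from model outputs may contain `<`, `>`, or `&`.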

Example selection:

- Criteria for best/worst case selection
- Number of examples per category
- Whether to group by model or show aggregate
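
The selection options above could be prototyped as a shortlist generator whose output is then reviewed by hand. The sketch below assumes each example carries a numeric score; the `Example` fields, the default of two examples per category, and the `"all"` aggregate key are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Example(NamedTuple):
    model: str
    text: str
    score: float  # assumption: higher means a better result


def pick_examples(examples: Iterable[Example], n: int = 2,
                  by_model: bool = True) -> Dict[str, Dict[str, List[Example]]]:
    """Shortlist the n best and n worst examples per model, or in aggregate."""
    groups: Dict[str, List[Example]] = defaultdict(list)
    for ex in examples:
        groups[ex.model if by_model else "all"].append(ex)

    picked = {}
    for key, items in groups.items():
        ranked = sorted(items, key=lambda e: e.score, reverse=True)
        # Note: with fewer than 2*n items, "best" and "worst" will overlap.
        picked[key] = {"best": ranked[:n], "worst": ranked[-n:][::-1]}
    return picked


if __name__ == "__main__":
    data = [
        Example("model-a", "invoice 1", 0.9),
        Example("model-a", "invoice 2", 0.2),
        Example("model-a", "invoice 3", 0.5),
        Example("model-b", "invoice 1", 0.7),
    ]
    print(pick_examples(data, n=1))
```

Switching `by_model` to `False` produces one aggregate shortlist, which makes it easy to compare both presentations before committing to one.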

Confusion matrix:

- What dimensions to visualize
- Top N error patterns
- Grouping strategies for GL accounts
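
The "top N error patterns" and grouping ideas can be sketched by counting (expected, predicted) pairs that disagree. This assumes GL accounts are fixed-width code strings and that grouping by leading digits is meaningful; both are assumptions, as are the helper names `group_account` and `top_error_patterns`.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

Pair = Tuple[str, str]  # (expected, predicted)


def group_account(code: str, prefix_len: int = 1) -> str:
    """Hypothetical grouping: keep the leading digits, mask the rest with 'x'."""
    return code[:prefix_len] + "x" * (len(code) - prefix_len)


def top_error_patterns(pairs: Iterable[Pair], n: int = 3,
                       prefix_len: Optional[int] = None) -> List[Tuple[Pair, int]]:
    """Return the n most frequent off-diagonal cells of the confusion matrix."""
    counts: Counter = Counter()
    for expected, predicted in pairs:
        if prefix_len is not None:
            expected = group_account(expected, prefix_len)
            predicted = group_account(predicted, prefix_len)
        if expected != predicted:  # only misclassifications count as errors
            counts[(expected, predicted)] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    # Made-up account codes, for illustration only.
    pairs = [("6100", "6200"), ("6100", "6200"), ("6100", "6100"), ("4000", "6100")]
    print(top_error_patterns(pairs, n=1))
    print(top_error_patterns(pairs, n=1, prefix_len=1))
```

Grouping before counting shrinks the matrix: confusions between accounts in the same group disappear, leaving only cross-group errors.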

## Specific Ideas

No specific requirements — open to standard approaches.

## Deferred Ideas

None — discussion stayed within phase scope.

Phase: 05-reporting-llm-judge
Context gathered: 2026-02-20