Implement an LLM-as-judge for semantic equivalence evaluation and an example selector for the curated showcase.

Purpose: When the exact GL account match fails, the LLM judge determines whether the prediction is semantically equivalent to the expected account (e.g., 6801 and 6800 are both valid for "office supplies"). The example selector identifies representative cases (best predictions, worst failures, edge cases, LLM saves) for the final report.

Output: Two modules: llm_judge.py for semantic equivalence evaluation and example_selector.py for showcase curation.

<execution_context> @./.claude/get-shit-done/workflows/execute-plan.md @./.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @.planning/phases/05-reporting-llm-judge/05-RESEARCH.md

Existing evaluation infrastructure

@src/evaluation/metrics.py @src/evaluation/benchmark.py @src/search/llm_matching.py

Task 1: Create LLM-as-judge module (src/evaluation/llm_judge.py)

Create the LLM judge module for semantic equivalence evaluation:
  1. Import google-genai and existing patterns from llm_matching.py
  2. Create JUDGE_PROMPT constant with:
    • Context template: line item description, predicted account, expected account
    • Few-shot examples (6801 vs 6800 = YES, 6801 vs 4000 = NO)
    • Explicit response format: VERDICT: [YES/NO]\nREASON: [one sentence]
  3. Implement judge_semantic_equivalence(line_item: str, predicted: str, ground_truth: str) -> dict:
    • Build prompt from template
    • Call Gemini Flash with temperature=0 for deterministic responses
    • Parse response for VERDICT and REASON lines
    • Return dict with keys: 'equivalent' (bool), 'reason' (str), 'raw_response' (str)
    • Handle parsing errors gracefully (default to equivalent=False)
  4. Implement batch_judge_mismatches(results: list[SearchResult]) -> list[dict]:
    • Filter to only results where predicted != ground_truth for GL account
    • Use existing _normalize_account from metrics.py for comparison
    • Call judge_semantic_equivalence for each mismatch
    • Return list of dicts with: query_id, predicted, ground_truth, equivalent, reason
    • Add tqdm progress bar for long batches
  5. Handle missing GOOGLE_API_KEY gracefully with early return and warning
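
The prompt and parsing steps in items 2 and 3 could be sketched roughly as below. This is a minimal sketch under stated assumptions, not the final module: the helper names build_judge_prompt and parse_judge_response are hypothetical, the prompt wording and the example reasons are placeholders, and the Gemini call itself is left out so the parsing logic can be checked on its own.

```python
import re

# Placeholder wording; only the few-shot pairs and the response format come from the plan.
JUDGE_PROMPT = """Decide whether the predicted GL account is semantically equivalent
to the expected GL account for the given line item.

Example 1: predicted 6801, expected 6800
VERDICT: YES
REASON: Both accounts are valid for this line item.

Example 2: predicted 6801, expected 4000
VERDICT: NO
REASON: The accounts cover unrelated categories.

Line item: {line_item}
Predicted account: {predicted}
Expected account: {ground_truth}

Respond in exactly this format:
VERDICT: [YES/NO]
REASON: [one sentence]
"""

_VERDICT_RE = re.compile(r"^\s*VERDICT:\s*(YES|NO)\b", re.IGNORECASE | re.MULTILINE)
_REASON_RE = re.compile(r"^\s*REASON:\s*(.+)$", re.IGNORECASE | re.MULTILINE)


def build_judge_prompt(line_item: str, predicted: str, ground_truth: str) -> str:
    """Fill the template with one mismatch (hypothetical helper)."""
    return JUDGE_PROMPT.format(
        line_item=line_item, predicted=predicted, ground_truth=ground_truth
    )


def parse_judge_response(raw) -> dict:
    """Extract VERDICT and REASON; fall back to equivalent=False when parsing fails."""
    text = raw or ""
    verdict = _VERDICT_RE.search(text)
    reason = _REASON_RE.search(text)
    return {
        "equivalent": bool(verdict) and verdict.group(1).upper() == "YES",
        "reason": reason.group(1).strip() if reason else "",
        "raw_response": text,
    }
```

Defaulting to equivalent=False on a malformed response keeps the judge conservative: an unparseable answer never upgrades a mismatch to a match.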

Use existing patterns:

  • Lazy client initialization (see llm_matching.py)
  • dotenv loading for API key
  • Error handling for API failures

Run Python import test:
cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
source .venv/bin/activate
python -c "from src.evaluation.llm_judge import judge_semantic_equivalence, batch_judge_mismatches; print('Import OK')"
Done when llm_judge.py exports judge_semantic_equivalence and batch_judge_mismatches, JUDGE_PROMPT contains few-shot examples, and the Gemini call uses temperature=0 for deterministic output.
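
The lazy client initialization and missing-key handling asked for in Task 1 might look like the sketch below. It assumes the google-genai SDK's genai.Client and client.models.generate_content interface; the names get_client and call_judge are hypothetical, the model name is passed in by the caller because the plan only says "Gemini Flash", and the dotenv loading that llm_matching.py performs is omitted here.

```python
import os
import warnings

_client = None  # created on first use, then reused across calls


def get_client():
    """Return a shared client, or None (with a warning) when GOOGLE_API_KEY is unset."""
    global _client
    if _client is not None:
        return _client
    api_key = os.environ.get("GOOGLE_API_KEY")
    if not api_key:
        warnings.warn("GOOGLE_API_KEY is not set; skipping LLM judge calls")
        return None
    from google import genai  # lazy import: the module loads even without the SDK

    _client = genai.Client(api_key=api_key)
    return _client


def call_judge(prompt: str, model: str):
    """Send one prompt at temperature=0; return the response text or None on failure."""
    client = get_client()
    if client is None:
        return None
    from google.genai import types

    try:
        response = client.models.generate_content(
            model=model,  # caller supplies the Gemini Flash model identifier
            contents=prompt,
            config=types.GenerateContentConfig(temperature=0),
        )
    except Exception as exc:  # API failures should not abort a whole batch
        warnings.warn(f"Judge call failed: {exc}")
        return None
    return response.text
```

Returning None instead of raising lets batch_judge_mismatches skip a failed item and continue with the rest of the batch.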

Task 2: Create example selector module (src/evaluation/example_selector.py)

Create the example selector module for showcase curation:
  1. Import SearchResult and _normalize_account from metrics.py

  2. Implement select_showcase_examples(results: list[dict], per_category: int = 5) -> dict[str, list]:

    • Input: list of benchmark result dicts with keys: query_text, google_correct, jina_correct, minilm_correct, llm_correct, google_prediction, jina_prediction, minilm_prediction, llm_prediction, google_similarity, ground_truth_debit_account, ground_truth_cost_center
    • Categories to select:
      • best_cases: All embedding models correct AND agree (highest similarity first)
      • worst_cases: High similarity (>0.9) but wrong prediction (highest similarity first)
      • edge_cases: Models disagree (3+ different predictions)
      • llm_saves: LLM correct where all embedding models failed
    • Sort each category by google_similarity descending
    • Return top per_category examples for each category
  3. Implement format_showcase_for_display(showcase: dict) -> dict:

    • Format examples for HTML rendering
    • Add category descriptions for each category
    • Truncate long text fields (>100 chars) with ellipsis
    • Return dict ready for Jinja2 template
  4. Add helper _is_correct(prediction: str, ground_truth: str) -> bool:

    • Use _normalize_account for comparison
    • Handle None values
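
The category rules in item 2 and the truncation rule in item 3 could be sketched as below. This is a minimal sketch, with several readings that the plan leaves open and that are assumptions here: "wrong prediction" in worst_cases is taken to mean the Google prediction (since the threshold uses google_similarity), the "3+ different predictions" in edge_cases are counted across all four predictors including the LLM, and the helper name truncate is hypothetical. The precomputed *_correct flags are used directly rather than _is_correct.

```python
EMBEDDING_MODELS = ("google", "jina", "minilm")
CATEGORIES = ("best_cases", "worst_cases", "edge_cases", "llm_saves")


def truncate(text, limit=100):
    """Shorten text longer than `limit` characters and append an ellipsis."""
    if text is None:
        return ""
    return text if len(text) <= limit else text[:limit] + "..."


def select_showcase_examples(results, per_category=5):
    """Bucket benchmark rows into the four showcase categories."""
    showcase = {name: [] for name in CATEGORIES}
    for row in results:
        embed_correct = [bool(row.get(f"{m}_correct")) for m in EMBEDDING_MODELS]
        embed_preds = {row.get(f"{m}_prediction") for m in EMBEDDING_MODELS}
        # Assumption: disagreement is counted over all four predictors, ignoring None.
        all_preds = (embed_preds | {row.get("llm_prediction")}) - {None}
        similarity = row.get("google_similarity") or 0.0

        if all(embed_correct) and len(embed_preds) == 1:
            showcase["best_cases"].append(row)
        # Assumption: "wrong prediction" refers to the Google model.
        if similarity > 0.9 and not row.get("google_correct"):
            showcase["worst_cases"].append(row)
        if len(all_preds) >= 3:
            showcase["edge_cases"].append(row)
        if row.get("llm_correct") and not any(embed_correct):
            showcase["llm_saves"].append(row)

    for name in CATEGORIES:
        # Highest google_similarity first, then keep the top per_category rows.
        showcase[name].sort(key=lambda r: r.get("google_similarity") or 0.0, reverse=True)
        showcase[name] = showcase[name][:per_category]
    return showcase
```

A row can land in more than one category (a confident wrong answer that the LLM fixed is both a worst case and an LLM save), which seems acceptable for a showcase but is worth confirming against the report layout.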

Include docstrings explaining the selection criteria for each category.

Run Python import test:

cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
source .venv/bin/activate
python -c "from src.evaluation.example_selector import select_showcase_examples, format_showcase_for_display; print('Import OK')"
Done when example_selector.py exports select_showcase_examples and format_showcase_for_display, with selection logic for all four categories (best_cases, worst_cases, edge_cases, llm_saves).

Verification:

  1. Both modules import without errors
  2. judge_semantic_equivalence returns a dict with 'equivalent', 'reason', and 'raw_response' keys
  3. batch_judge_mismatches accepts a list of SearchResult and returns a list of judge verdicts
  4. select_showcase_examples returns a dict with the 4 category keys
  5. format_showcase_for_display truncates long text fields

<success_criteria>

After completion, create `.planning/phases/05-reporting-llm-judge/05-01-SUMMARY.md`