Purpose: When an exact GL account match fails, the LLM judge determines whether the prediction is semantically equivalent to the ground truth (e.g., 6801 vs. 6800, both valid for "office supplies"). The example selector picks representative cases (best predictions, worst failures, edge cases, LLM saves) for the final report.
Output: Two modules, llm_judge.py for semantic equivalence evaluation and example_selector.py for showcase curation.
<execution_context> @./.claude/get-shit-done/workflows/execute-plan.md @./.claude/get-shit-done/templates/summary.md </execution_context>
@src/evaluation/metrics.py @src/evaluation/benchmark.py @src/search/llm_matching.py
judge_semantic_equivalence(line_item: str, predicted: str, ground_truth: str) -> dict:
batch_judge_mismatches(results: list[SearchResult]) -> list[dict]:
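A minimal sketch of how the two signatures above could fit together. Everything beyond the two function names and their listed parameters is an assumption made for illustration: the `ask_llm` callable (a hypothetical injection point standing in for whatever client the project uses), the `SearchResult` field names, and the shape of the returned dict (`equivalent`, `reasoning`).

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class SearchResult:
    """Stand-in for the real class in metrics.py; field names are assumed."""
    line_item: str
    predicted: str
    ground_truth: str


def _stub_llm(prompt: str) -> str:
    """Hypothetical placeholder for the real LLM call; always says 'no'."""
    return json.dumps({"equivalent": False, "reasoning": "stub"})


def judge_semantic_equivalence(
    line_item: str,
    predicted: str,
    ground_truth: str,
    ask_llm: Callable[[str], str] = _stub_llm,  # assumed extra parameter
) -> dict:
    """Ask the judge whether two GL accounts are interchangeable for an item."""
    # Exact matches never need the judge.
    if predicted.strip() == ground_truth.strip():
        return {"equivalent": True, "reasoning": "exact match"}
    prompt = (
        'Reply with JSON: {"equivalent": true|false, "reasoning": "..."}\n'
        f"Line item: {line_item}\n"
        f"Predicted GL account: {predicted}\n"
        f"Ground-truth GL account: {ground_truth}\n"
        "Would both accounts be acceptable for this line item?"
    )
    try:
        verdict = json.loads(ask_llm(prompt))
        return {
            # Only a literal JSON true counts as equivalent.
            "equivalent": verdict["equivalent"] is True,
            "reasoning": str(verdict.get("reasoning", "")),
        }
    except (ValueError, KeyError, TypeError, AttributeError):
        # Malformed judge output is treated conservatively as a mismatch.
        return {"equivalent": False, "reasoning": "unparseable judge response"}


def batch_judge_mismatches(
    results: list[SearchResult],
    ask_llm: Callable[[str], str] = _stub_llm,  # assumed extra parameter
) -> list[dict]:
    """Judge only the results whose prediction differs from the ground truth."""
    judged = []
    for r in results:
        if r.predicted.strip() == r.ground_truth.strip():
            continue  # exact match, nothing to judge
        verdict = judge_semantic_equivalence(
            r.line_item, r.predicted, r.ground_truth, ask_llm
        )
        judged.append(
            {
                "line_item": r.line_item,
                "predicted": r.predicted,
                "ground_truth": r.ground_truth,
                **verdict,
            }
        )
    return judged
```

Passing the LLM call in as a plain callable keeps the sketch testable without network access; the real module would presumably reuse the client pattern from src/search/llm_matching.py instead.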
Use existing patterns:
cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
source .venv/bin/activate
python -c "from src.evaluation.llm_judge import judge_semantic_equivalence, batch_judge_mismatches; print('Import OK')"
Import SearchResult from metrics.py and _normalize_account
Implement select_showcase_examples(results: list[dict], per_category: int = 5) -> dict[str, list]:
best_cases: All embedding models correct AND agree (highest similarity first)
worst_cases: High similarity (>0.9) but wrong prediction (highest similarity first)
edge_cases: Models disagree (3+ different predictions)
llm_saves: LLM correct where all embedding models failed
Implement format_showcase_for_display(showcase: dict) -> dict:
Add helper _is_correct(prediction: str, ground_truth: str) -> bool:
Include docstrings explaining selection criteria for each category.
cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
source .venv/bin/activate
python -c "from src.evaluation.example_selector import select_showcase_examples, format_showcase_for_display; print('Import OK')"
<success_criteria>