For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Refactor evaluation to use per-invoice LLM calls with separate AccountsMatcher and CostCenterMatcher, including full CoA/CC datasets.
Architecture: Export Regnology's CoA (642 accounts) and CC (104 centers) from Orcha DB to static CSV files. Modify llm_eval.py to call two matchers per invoice instead of one LLM call per line item.
Tech Stack: Python, psycopg, Gemini Flash, existing curation module
Task 1: Export CoA CSV

Files:
data/regnology_coa.csv

Step 1: Create data directory
mkdir -p /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search/data
Step 2: Export GL accounts from Orcha DB
psql -h localhost -U postgres -d orcha -c "
COPY (
SELECT
elem->>'number' as number,
elem->>'name' as name,
elem->>'description' as description,
elem->>'balance-position' as balance_position
FROM gl_accounts_dataset,
jsonb_array_elements(data) as elem
WHERE legal_entity_id = '00000000-0000-0000-0000-000000000001'
AND is_active = true
ORDER BY elem->>'number'
) TO STDOUT CSV HEADER
" > /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search/data/regnology_coa.csv
Step 3: Verify export
head -5 /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search/data/regnology_coa.csv
wc -l /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search/data/regnology_coa.csv
Expected: CSV header + 642 rows, so wc -l should report 643 lines (more if any description contains an embedded newline)
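Because `wc -l` counts physical lines, a quoted description that spans several lines would inflate the count. A minimal sketch of a stricter check, assuming the export uses standard CSV quoting; the helper name `count_csv_rows` and the sample rows are hypothetical, not part of the plan:

```python
import csv
import io


def count_csv_rows(csv_text: str) -> int:
    """Count data rows (excluding the header) in CSV text."""
    # newline="" lets the csv module handle newlines inside quoted fields
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    rows = list(reader)
    # The first parsed row is the header written by COPY ... CSV HEADER
    return max(len(rows) - 1, 0)


# Illustrative data only: the second record has a multi-line description
sample = 'number,name,description\n1000,Cash,"Petty cash\nand tills"\n1200,Bank,\n'
print(count_csv_rows(sample))
```

Run against the contents of data/regnology_coa.csv, this should report 642 if the export matches the expectation above.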
Step 4: Commit
git add data/regnology_coa.csv
git commit -m "data: export Regnology chart of accounts from Orcha DB"
Task 2: Export CC CSV

Files:
data/regnology_cc.csv

Step 1: Export cost centers from Orcha DB
psql -h localhost -U postgres -d orcha -c "
COPY (
SELECT
elem->>'Cost Center Num' as number,
elem->>'Cost Center' as name,
elem->>'Description' as description
FROM cost_center_dataset,
jsonb_array_elements(data) as elem
WHERE legal_entity_id = '00000000-0000-0000-0000-000000000001'
AND position IS NOT NULL
ORDER BY elem->>'Cost Center Num'
) TO STDOUT CSV HEADER
" > /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search/data/regnology_cc.csv
Step 2: Verify export
head -5 /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search/data/regnology_cc.csv
wc -l /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search/data/regnology_cc.csv
Expected: CSV header + 104 rows, so wc -l should report 105 lines (more if any description contains an embedded newline)
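The export unnests a JSON array per dataset row, so if more than one dataset row matched the WHERE clause the same number could appear twice. A small sketch of a duplicate check, assuming the `number` column name from the export above; `duplicate_keys` is a hypothetical helper:

```python
import csv
import io
from collections import Counter


def duplicate_keys(csv_text: str, key: str = "number") -> list[str]:
    """Return the sorted values of `key` that occur more than once."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    counts = Counter(row[key] for row in reader)
    return sorted(value for value, n in counts.items() if n > 1)


# Illustrative data only
sample_cc = "number,name,description\n100,Sales,\n200,Ops,\n100,Sales EU,\n"
print(duplicate_keys(sample_cc))
```

An empty list means every cost center number is unique.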
Step 3: Commit
git add data/regnology_cc.csv
git commit -m "data: export Regnology cost centers from Orcha DB"
Task 3: Add CSV loading functions

Files:
src/evaluation/llm_eval.py

Step 1: Add path constants and load functions after imports (around line 30)
Add after the ORCHA_DB_DSN line:
# Static data paths (requires `from pathlib import Path` if not already imported)
DATA_DIR = Path(__file__).parent.parent.parent / 'data'
COA_CSV_PATH = DATA_DIR / 'regnology_coa.csv'
CC_CSV_PATH = DATA_DIR / 'regnology_cc.csv'


def load_coa_csv() -> str:
    """Load Chart of Accounts CSV content."""
    return COA_CSV_PATH.read_text()


def load_cc_csv() -> str:
    """Load Cost Centers CSV content."""
    return CC_CSV_PATH.read_text()
Step 2: Verify the functions work
cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
python -c "from src.evaluation.llm_eval import load_coa_csv, load_cc_csv; print(len(load_coa_csv()), len(load_cc_csv()))"
Expected: Two numbers (character counts of the CSV files)
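If the export tasks were skipped, `read_text()` raises a bare `FileNotFoundError`. A sketch of a loader that fails with a more actionable message and pins the encoding; `load_csv_text` is a hypothetical helper, not something the plan requires:

```python
from pathlib import Path


def load_csv_text(path: Path) -> str:
    """Read a CSV file as UTF-8 text, failing with an actionable message."""
    if not path.is_file():
        raise FileNotFoundError(
            f"Missing static data file: {path} (run the export tasks first)"
        )
    return path.read_text(encoding="utf-8")
```

Both `load_coa_csv()` and `load_cc_csv()` could delegate to it with their respective path constants.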
Step 3: Commit
git add src/evaluation/llm_eval.py
git commit -m "feat: add CSV loading functions for CoA and CC data"
Task 4: AccountsMatcher prompt builder

Files:
src/evaluation/llm_eval.py

Step 1: Add the accounts matcher prompt function after `load_cc_csv()`
def build_accounts_matcher_prompt(
    supplier_name: str,
    line_items_json: str,
    coa_csv: str,
    curated_csv: str,
) -> str:
    """Build prompt for AccountsMatcher (GL account prediction)."""
    return f"""You are a double-entry bookkeeping specialist for accounts payable invoices.
Your task: assign the correct GL debit account for each line item.

## Chart of Accounts
```csv
{coa_csv}
```

Below are similar historical bookings. The cluster_count shows how often this pattern appeared:
{curated_csv}

Supplier: {supplier_name}
Line Items: {line_items_json}

Respond with ONLY a JSON object. The line_items array must have the same length as the input:
{{
  "line_items": [
    {{"debit_account": "XXXXXX", "confidence": 0.85, "reasoning": "Brief explanation"}},
    ...
  ]
}}
Do not include any text before or after the JSON."""
**Step 2: Commit**
```bash
git add src/evaluation/llm_eval.py
git commit -m "feat: add AccountsMatcher prompt builder"
```
Task 5: CostCenterMatcher prompt builder

Files:
src/evaluation/llm_eval.py

Step 1: Add the cost center matcher prompt function after `build_accounts_matcher_prompt()`
def build_cost_center_matcher_prompt(
    supplier_name: str,
    line_items_json: str,
    cc_csv: str,
    curated_csv: str,
) -> str:
    """Build prompt for CostCenterMatcher (cost center prediction)."""
    return f"""You are a cost center allocation specialist for accounts payable invoices.
Your task: assign the correct cost center for each line item.

## Cost Centers
```csv
{cc_csv}
```

Below are similar historical bookings. The cluster_count shows how often this pattern appeared:
{curated_csv}

Supplier: {supplier_name}
Line Items: {line_items_json}

Respond with ONLY a JSON object. The line_items array must have the same length as the input:
{{
  "line_items": [
    {{"cost_center": "XXXXXX", "confidence": 0.85, "reasoning": "Brief explanation"}},
    ...
  ]
}}
Do not include any text before or after the JSON."""
**Step 2: Commit**
```bash
git add src/evaluation/llm_eval.py
git commit -m "feat: add CostCenterMatcher prompt builder"
```
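Both prompt builders use f-strings, so the literal braces of the JSON example are doubled (`{{` and `}}`) while interpolated values keep single braces. A small self-contained demonstration; `build_example_prompt` is a hypothetical stand-in, not one of the real builders:

```python
def build_example_prompt(items_json: str) -> str:
    """Show how doubled braces in an f-string become literal braces."""
    return f"""Line Items: {items_json}
Respond with:
{{
  "line_items": [{{"cost_center": "XXXXXX"}}]
}}"""


# Braces inside the interpolated value are inserted as-is, not re-parsed
prompt = build_example_prompt('[{"index": 0}]')
print(prompt)
```

Forgetting to double a brace in the template would raise an error or silently interpolate a variable, so it is worth printing one rendered prompt before running the full evaluation.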
Task 6: Matcher call functions

Files:
src/evaluation/llm_eval.py

Step 1: Add matcher call functions after the prompt builders
def call_accounts_matcher(
    supplier_name: str,
    line_items_json: str,
    coa_csv: str,
    curated_csv: str,
) -> tuple[list[dict], str, str]:
    """
    Call LLM for GL account matching.

    Returns: (results_list, prompt_used, raw_response)
    """
    prompt = build_accounts_matcher_prompt(supplier_name, line_items_json, coa_csv, curated_csv)
    parsed, raw = call_llm(prompt)
    if parsed and 'line_items' in parsed:
        return parsed['line_items'], prompt, raw
    return [], prompt, raw


def call_cost_center_matcher(
    supplier_name: str,
    line_items_json: str,
    cc_csv: str,
    curated_csv: str,
) -> tuple[list[dict], str, str]:
    """
    Call LLM for cost center matching.

    Returns: (results_list, prompt_used, raw_response)
    """
    prompt = build_cost_center_matcher_prompt(supplier_name, line_items_json, cc_csv, curated_csv)
    parsed, raw = call_llm(prompt)
    if parsed and 'line_items' in parsed:
        return parsed['line_items'], prompt, raw
    return [], prompt, raw
Step 2: Commit
git add src/evaluation/llm_eval.py
git commit -m "feat: add matcher call functions"
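The prompts ask for a `line_items` array of the same length as the input, but nothing in the call functions checks that. A minimal sketch of such a check; `describe_length_mismatch` is a hypothetical helper whose message could, for example, be logged alongside the raw response:

```python
from typing import Optional


def describe_length_mismatch(
    matcher: str, results: list, line_item_count: int
) -> Optional[str]:
    """Return a warning message if the result count differs from the input, else None."""
    if len(results) == line_item_count:
        return None
    return f"{matcher} returned {len(results)} results for {line_item_count} line items"


print(describe_length_mismatch("AccountsMatcher", [], 3))
```

An empty result list (the fallback when parsing fails) is reported the same way as a short one.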
Task 7: Update dataclasses

Files:
src/evaluation/llm_eval.py

Step 1: Update LineItemResult to track both matcher prompts/responses (around line 46)
Replace the existing LineItemResult class:
@dataclass
class LineItemResult:
    """Result for a single line item evaluation."""
    description: str
    amount: Optional[float]

    # LLM predictions
    llm_gl: Optional[str] = None
    llm_gl_confidence: Optional[float] = None
    llm_gl_reasoning: Optional[str] = None
    llm_cc: Optional[str] = None
    llm_cc_confidence: Optional[float] = None
    llm_cc_reasoning: Optional[str] = None

    # Current Orcha values (for reference)
    orcha_gl: Optional[str] = None
    orcha_cc: Optional[str] = None

    # Historical ground truth (from file)
    historical_gl: str = ""
    historical_cc: str = ""

    # Match status
    gl_match: bool = False
    cc_match: bool = False
Step 2: Add InvoiceResult fields for debug info (update InvoiceResult around line 69)
Replace the existing InvoiceResult class:
@dataclass
class InvoiceResult:
    """Result for an entire invoice evaluation."""
    invoice_number: str
    supplier: str
    issue_type: str
    historical_gl: str
    historical_cc: str
    line_items: list[LineItemResult] = field(default_factory=list)
    error: Optional[str] = None
    elapsed_seconds: float = 0.0

    # Debug info (per-invoice, not per-line-item)
    curation_csv: str = ""
    accounts_matcher_prompt: str = ""
    accounts_matcher_response: str = ""
    cost_center_matcher_prompt: str = ""
    cost_center_matcher_response: str = ""
Step 3: Commit
git add src/evaluation/llm_eval.py
git commit -m "refactor: update dataclasses for per-invoice matching"
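The `gl_match` and `cc_match` flags are what the evaluation ultimately aggregates. A sketch of computing match rates from them; `LineItemMatch` is a hypothetical stand-in carrying only the two flags so the example runs on its own:

```python
from dataclasses import dataclass


@dataclass
class LineItemMatch:
    """Hypothetical stand-in with only the match flags of LineItemResult."""
    gl_match: bool = False
    cc_match: bool = False


def match_rates(items: list[LineItemMatch]) -> tuple[float, float]:
    """Return (GL match rate, CC match rate); (0.0, 0.0) when there are no items."""
    if not items:
        return 0.0, 0.0
    n = len(items)
    gl_rate = sum(item.gl_match for item in items) / n
    cc_rate = sum(item.cc_match for item in items) / n
    return gl_rate, cc_rate


# Illustrative flags only
demo_items = [
    LineItemMatch(True, True),
    LineItemMatch(True, False),
    LineItemMatch(False, False),
    LineItemMatch(True, False),
]
print(match_rates(demo_items))
```

The same function would work on real `LineItemResult` objects, since it only reads the two flags.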
Task 8: Refactor evaluate_invoice

Files:
src/evaluation/llm_eval.py

Step 1: Replace the `evaluate_invoice` function (around line 216)
def evaluate_invoice(
    issue: IssueRecord,
    orcha_conn,
    semantic_conn,
    coa_csv: str,
    cc_csv: str,
    k: int = 10,
    threshold: float = 0.6,
) -> InvoiceResult:
    """Evaluate a single invoice with per-invoice matcher calls."""
    start_time = time.perf_counter()
    result = InvoiceResult(
        invoice_number=issue.invoice_number,
        supplier=issue.supplier,
        issue_type=issue.issue_type,
        historical_gl=issue.historical_dim1,
        historical_cc=issue.historical_dim3,
    )

    # Fetch invoice from Orcha
    invoice_data = fetch_invoice_from_orcha(orcha_conn, issue.invoice_number)
    if not invoice_data:
        result.error = f"Invoice not found in Orcha DB: {issue.invoice_number}"
        result.elapsed_seconds = time.perf_counter() - start_time
        return result

    supplier_name = invoice_data['issuer_name']
    line_items = invoice_data['line_items']
    if not line_items:
        result.error = "No line items found"
        result.elapsed_seconds = time.perf_counter() - start_time
        return result

    try:
        # Step 1: Collect all line item descriptions for curation
        descriptions = [li.get('description', '') for li in line_items]

        # Step 2: Curate bookings once for entire invoice
        curated = curate_bookings_for_invoice(
            semantic_conn,
            supplier_name,
            descriptions,
            model='google',
            k=k,
            threshold=threshold,
        )
        curated_csv = curated_bookings_to_csv(curated)
        result.curation_csv = curated_csv

        # Step 3: Build line items JSON for prompts
        line_items_for_prompt = [
            {"index": i, "description": li.get('description', ''), "amount": li.get('amount', 0) or 0}
            for i, li in enumerate(line_items)
        ]
        line_items_json = json.dumps(line_items_for_prompt, indent=2)

        # Step 4: Call AccountsMatcher (one call for all line items)
        gl_results, gl_prompt, gl_raw = call_accounts_matcher(
            supplier_name, line_items_json, coa_csv, curated_csv
        )
        result.accounts_matcher_prompt = gl_prompt
        result.accounts_matcher_response = gl_raw

        # Step 5: Call CostCenterMatcher (one call for all line items)
        cc_results, cc_prompt, cc_raw = call_cost_center_matcher(
            supplier_name, line_items_json, cc_csv, curated_csv
        )
        result.cost_center_matcher_prompt = cc_prompt
        result.cost_center_matcher_response = cc_raw

        # Step 6: Merge results into LineItemResults
        for i, li in enumerate(line_items):
            li_result = LineItemResult(
                description=li.get('description', ''),
                amount=li.get('amount', 0) or 0,
                orcha_gl=li.get('debit-account', {}).get('number') if li.get('debit-account') else None,
                orcha_cc=li.get('cost-center', {}).get('number') if li.get('cost-center') else None,
                historical_gl=issue.historical_dim1,
                historical_cc=issue.historical_dim3,
            )

            # Extract GL prediction
            if i < len(gl_results):
                gl_pred = gl_results[i]
                li_result.llm_gl = gl_pred.get('debit_account')
                li_result.llm_gl_confidence = gl_pred.get('confidence')
                li_result.llm_gl_reasoning = gl_pred.get('reasoning')

            # Extract CC prediction
            if i < len(cc_results):
                cc_pred = cc_results[i]
                li_result.llm_cc = cc_pred.get('cost_center')
                li_result.llm_cc_confidence = cc_pred.get('confidence')
                li_result.llm_cc_reasoning = cc_pred.get('reasoning')

            # Check matches against historical ground truth
            li_result.gl_match = check_match(li_result.llm_gl, issue.historical_dim1)
            li_result.cc_match = check_match(li_result.llm_cc, issue.historical_dim3)

            result.line_items.append(li_result)
    except Exception as e:
        result.error = f"Error during evaluation: {str(e)}"

    result.elapsed_seconds = time.perf_counter() - start_time
    return result
Step 2: Commit
git add src/evaluation/llm_eval.py
git commit -m "refactor: evaluate_invoice to use per-invoice matcher calls"
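The merge in Step 6 is purely positional: prediction `i` is assigned to line item `i`, and a missing prediction leaves the fields as `None`. A stripped-down sketch of that behavior using plain dicts; `merge_predictions` and the account and cost center numbers are illustrative only:

```python
def merge_predictions(
    line_items: list[dict], gl_results: list[dict], cc_results: list[dict]
) -> list[dict]:
    """Pair each line item with the prediction at the same index, if any."""
    merged = []
    for i, li in enumerate(line_items):
        # Fall back to an empty dict when a matcher returned too few results
        gl = gl_results[i] if i < len(gl_results) else {}
        cc = cc_results[i] if i < len(cc_results) else {}
        merged.append({
            "description": li.get("description", ""),
            "llm_gl": gl.get("debit_account"),
            "llm_cc": cc.get("cost_center"),
        })
    return merged


demo_merged = merge_predictions(
    [{"description": "Hosting"}, {"description": "Support"}],
    [{"debit_account": "6800"}, {"debit_account": "6810"}],
    [{"cost_center": "100"}],  # one CC prediction missing
)
print(demo_merged)
```

Because alignment depends on order, a response that drops or reorders an item shifts every later prediction; the `index` field sent in the prompt is not read back here.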
Task 9: Update run_evaluation

Files:
src/evaluation/llm_eval.py

Step 1: Update `run_evaluation` function (around line 528)
Find the line `orcha_conn = psycopg.connect(ORCHA_DB_DSN)` and add CSV loading before it:
# Load static data once
update_progress("Loading Chart of Accounts and Cost Centers...")
coa_csv = load_coa_csv()
cc_csv = load_cc_csv()
update_progress(f"Loaded CoA ({len(coa_csv)} chars) and CC ({len(cc_csv)} chars)")
Step 2: Update the evaluate_invoice call to pass CSV data
Find the loop `for i, issue in enumerate(issues):` and update the `evaluate_invoice` call:
result = evaluate_invoice(issue, orcha_conn, semantic_conn, coa_csv, cc_csv, k, threshold)
Step 3: Remove the rate limiting sleep
Find and remove these lines (around line 300 in the old code):
# Rate limiting
time.sleep(0.5)
The old per-line-item loop needed this sleep because it made one LLM call per line item. With only 2 LLM calls per invoice, the total number of calls is much lower, so the delay is no longer needed.
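To make the reduction concrete, a quick arithmetic sketch comparing call counts under both schemes. The line item counts are made up, and `llm_call_counts` is a hypothetical helper; invoices without line items are skipped because `evaluate_invoice` returns before any LLM call in that case:

```python
def llm_call_counts(line_items_per_invoice: list[int]) -> tuple[int, int]:
    """Return (old per-line-item calls, new per-invoice calls)."""
    # Old scheme: one LLM call per line item
    old_calls = sum(line_items_per_invoice)
    # New scheme: two calls (GL + CC) per invoice that has line items
    new_calls = 2 * sum(1 for n in line_items_per_invoice if n > 0)
    return old_calls, new_calls


print(llm_call_counts([5, 12, 3, 0]))
```

The saving grows with invoice size; for single-line invoices the new scheme actually makes two calls instead of one.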
Step 4: Commit
git add src/evaluation/llm_eval.py
git commit -m "refactor: run_evaluation loads and passes CSV data"
Task 10: Update HTML report

Files:
src/evaluation/llm_eval.py

Step 1: Update the HTML report generation to show per-invoice debug info
In `generate_html_report()`, find the table row generation for line items (around line 429) and update the debug section. Replace the debug row (starting with `<tr id="debug_{li_id}"`) with:
<tr id="debug_{li_id}" style="display: none;">
    <td colspan="8">
        <div class="p-3 bg-light">
            <p class="reasoning"><strong>GL Reasoning:</strong> {_escape(li.llm_gl_reasoning or 'N/A')} (confidence: {li.llm_gl_confidence if li.llm_gl_confidence is not None else 'N/A'})</p>
            <p class="reasoning"><strong>CC Reasoning:</strong> {_escape(li.llm_cc_reasoning or 'N/A')} (confidence: {li.llm_cc_confidence if li.llm_cc_confidence is not None else 'N/A'})</p>
        </div>
    </td>
</tr>
Step 2: Add invoice-level debug section
After the table closing tag (`</table>`), add invoice-level debug info:
# Add invoice-level debug (prompts/responses)
inv_debug_id = inv.invoice_number.replace('/', '_')
html += f"""
<div class="mt-3">
<div class="collapsible-header p-2 border rounded mb-2" onclick="toggleCollapsible('curation_{inv_debug_id}')">
<span class="expand-icon">▶</span> <strong>Curated Historical Bookings</strong>
</div>
<div id="curation_{inv_debug_id}" class="collapsible-content">
<pre>{_escape(inv.curation_csv)}</pre>
</div>
<div class="collapsible-header p-2 border rounded mb-2" onclick="toggleCollapsible('gl_prompt_{inv_debug_id}')">
<span class="expand-icon">▶</span> <strong>AccountsMatcher Prompt</strong>
</div>
<div id="gl_prompt_{inv_debug_id}" class="collapsible-content">
<pre>{_escape(inv.accounts_matcher_prompt)}</pre>
</div>
<div class="collapsible-header p-2 border rounded mb-2" onclick="toggleCollapsible('gl_response_{inv_debug_id}')">
<span class="expand-icon">▶</span> <strong>AccountsMatcher Response</strong>
</div>
<div id="gl_response_{inv_debug_id}" class="collapsible-content">
<pre>{_escape(inv.accounts_matcher_response)}</pre>
</div>
<div class="collapsible-header p-2 border rounded mb-2" onclick="toggleCollapsible('cc_prompt_{inv_debug_id}')">
<span class="expand-icon">▶</span> <strong>CostCenterMatcher Prompt</strong>
</div>
<div id="cc_prompt_{inv_debug_id}" class="collapsible-content">
<pre>{_escape(inv.cost_center_matcher_prompt)}</pre>
</div>
<div class="collapsible-header p-2 border rounded mb-2" onclick="toggleCollapsible('cc_response_{inv_debug_id}')">
<span class="expand-icon">▶</span> <strong>CostCenterMatcher Response</strong>
</div>
<div id="cc_response_{inv_debug_id}" class="collapsible-content">
<pre>{_escape(inv.cost_center_matcher_response)}</pre>
</div>
</div>
"""
Step 3: Commit
git add src/evaluation/llm_eval.py
git commit -m "refactor: update HTML report for per-invoice debug info"
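The debug id is interpolated into both an `id` attribute and a quoted JavaScript argument, and `replace('/', '_')` only handles slashes. If invoice numbers can contain spaces or quotes, a stricter sanitizer may be safer. A sketch, assuming ids only need to be unique per invoice; `html_safe_id` is a hypothetical helper:

```python
import re


def html_safe_id(invoice_number: str) -> str:
    """Replace every character outside [A-Za-z0-9_-] with an underscore."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", invoice_number)


print(html_safe_id("INV/2024 001'x"))
```

Two invoice numbers that differ only in punctuation would map to the same id, so this is only suitable if such collisions do not occur in the data.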
Task 11: Remove old code

Files:
src/evaluation/llm_eval.py

Step 1: Remove the old `build_llm_prompt` function
Find and delete the function `def build_llm_prompt(supplier_name: str, description: str, amount: float, curation_csv: str) -> str:` (around lines 128-167)
Step 2: Verify no references to old function
cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
grep -n "build_llm_prompt" src/evaluation/llm_eval.py
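Expected: No matches. The new builders (build_accounts_matcher_prompt, build_cost_center_matcher_prompt) do not contain the string build_llm_prompt, so any hit is a leftover reference that still needs to be removed.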
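The grep above only inspects llm_eval.py. If other modules might import the old function, a tree-wide scan is a cheap extra check. A sketch using only the standard library; `find_references` is a hypothetical helper and assumes the sources are UTF-8:

```python
from pathlib import Path


def find_references(root: Path, needle: str) -> list[tuple[str, int]]:
    """Return (relative path, line number) for every line containing `needle`."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if needle in line:
                hits.append((str(path.relative_to(root)), lineno))
    return hits


# Example call (not executed here): find_references(Path("src"), "build_llm_prompt")
```

An empty list corresponds to the "no matches" result expected from grep.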
Step 3: Commit
git add src/evaluation/llm_eval.py
git commit -m "chore: remove old per-line-item prompt builder"
Task 12: Test pipeline

Files:
src/evaluation/llm_eval.py

Step 1: Run evaluation on a small sample
cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
python -m src.evaluation.llm_eval old-issues.txt test_evaluation_report.html 3
Expected: Should process 3 invoices and generate test_evaluation_report.html
Step 2: Check the report structure
firefox test_evaluation_report.html &
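Verify: each line item's debug row shows GL and CC reasoning with confidence, and each invoice has collapsible sections for Curated Historical Bookings, the AccountsMatcher prompt and response, and the CostCenterMatcher prompt and response.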
Step 3: Run full evaluation (optional)
python -m src.evaluation.llm_eval old-issues.txt evaluation_report_v2.html
Step 4: Commit test results if successful
git add -A
git commit -m "test: verify orcha-aligned evaluation pipeline works"
| Task | Description | Est. Changes |
|---|---|---|
| 1 | Export CoA CSV | New file |
| 2 | Export CC CSV | New file |
| 3 | Add CSV loading functions | +15 lines |
| 4 | AccountsMatcher prompt builder | +40 lines |
| 5 | CostCenterMatcher prompt builder | +40 lines |
| 6 | Matcher call functions | +30 lines |
| 7 | Update dataclasses | ~20 lines changed |
| 8 | Refactor evaluate_invoice | ~80 lines changed |
| 9 | Update run_evaluation | ~10 lines changed |
| 10 | Update HTML report | ~50 lines changed |
| 11 | Remove old code | -40 lines |
| 12 | Test pipeline | Verification |