Orcha-Aligned Evaluation Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Refactor evaluation to use per-invoice LLM calls with separate AccountsMatcher and CostCenterMatcher, including full CoA/CC datasets.

Architecture: Export Regnology's CoA (642 accounts) and CC (104 centers) from Orcha DB to static CSV files. Modify llm_eval.py to call two matchers per invoice instead of one LLM call per line item.

Tech Stack: Python, psycopg, Gemini Flash, existing curation module

Task 1: Export Chart of Accounts to CSV

Files:

Create: data/regnology_coa.csv

Step 1: Create data directory

mkdir -p /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search/data

Step 2: Export GL accounts from Orcha DB

psql -h localhost -U postgres -d orcha -c "
COPY (
  SELECT
    elem->>'number' as number,
    elem->>'name' as name,
    elem->>'description' as description,
    elem->>'balance-position' as balance_position
  FROM gl_accounts_dataset,
       jsonb_array_elements(data) as elem
  WHERE legal_entity_id = '00000000-0000-0000-0000-000000000001'
    AND is_active = true
  ORDER BY elem->>'number'
) TO STDOUT CSV HEADER
" > /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search/data/regnology_coa.csv

Step 3: Verify export

head -5 /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search/data/regnology_coa.csv
wc -l /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search/data/regnology_coa.csv

Expected: CSV header + 642 rows (wc -l reports 643 lines, assuming no field contains an embedded newline)
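
Because wc -l counts physical lines, a description with a quoted line break would inflate the total. A minimal sketch of a stricter check, assuming the export uses standard CSV quoting; the helper name count_data_rows and the sample rows are hypothetical:

```python
import csv
import io


def count_data_rows(csv_text: str) -> int:
    """Count CSV records after the header, treating quoted newlines as data."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    rows = list(reader)
    # The first record is the header written by COPY ... CSV HEADER.
    return max(len(rows) - 1, 0)


# Hypothetical sample: the second record spans two physical lines.
sample = 'number,name,description\n1000,Cash,"Petty cash\nand bank"\n2000,Rent,\n'
print(count_data_rows(sample))  # 2 records, although the text has 4 lines
```

For the real file, the same function could be applied to the text of data/regnology_coa.csv and compared against 642.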

Step 4: Commit

cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
git add data/regnology_coa.csv
git commit -m "data: export Regnology chart of accounts from Orcha DB"

Task 2: Export Cost Centers to CSV

Files:

Create: data/regnology_cc.csv

Step 1: Export cost centers from Orcha DB

psql -h localhost -U postgres -d orcha -c "
COPY (
  SELECT
    elem->>'Cost Center Num' as number,
    elem->>'Cost Center' as name,
    elem->>'Description' as description
  FROM cost_center_dataset,
       jsonb_array_elements(data) as elem
  WHERE legal_entity_id = '00000000-0000-0000-0000-000000000001'
    AND position IS NOT NULL
  ORDER BY elem->>'Cost Center Num'
) TO STDOUT CSV HEADER
" > /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search/data/regnology_cc.csv

Step 2: Verify export

head -5 /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search/data/regnology_cc.csv
wc -l /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search/data/regnology_cc.csv

Expected: CSV header + 104 rows (wc -l reports 105 lines, assuming no field contains an embedded newline)

Step 3: Commit

git add data/regnology_cc.csv
git commit -m "data: export Regnology cost centers from Orcha DB"

Task 3: Add CSV Loading Functions

Files:

Modify: src/evaluation/llm_eval.py

Step 1: Add path constants and load functions after imports (around line 30)

Add after the ORCHA_DB_DSN line (the code needs from pathlib import Path among the imports; add it if it is missing):

# Static data paths
DATA_DIR = Path(__file__).parent.parent.parent / 'data'
COA_CSV_PATH = DATA_DIR / 'regnology_coa.csv'
CC_CSV_PATH = DATA_DIR / 'regnology_cc.csv'


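def load_coa_csv() -> str:
    """Load Chart of Accounts CSV content."""
    # Explicit encoding avoids locale-dependent decoding; assumes a UTF-8 export.
    return COA_CSV_PATH.read_text(encoding='utf-8')


def load_cc_csv() -> str:
    """Load Cost Centers CSV content."""
    return CC_CSV_PATH.read_text(encoding='utf-8')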

Step 2: Verify the functions work

cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
python -c "from src.evaluation.llm_eval import load_coa_csv, load_cc_csv; print(len(load_coa_csv()), len(load_cc_csv()))"

Expected: Two numbers (character counts of the CSV files)
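
If an export was skipped, the loaders above would raise a bare FileNotFoundError, or silently return an empty string for an empty file. The following sketch shows one way to fail with a clearer message. The name load_csv_or_fail is hypothetical and not part of llm_eval.py:

```python
import tempfile
from pathlib import Path


def load_csv_or_fail(path: Path) -> str:
    """Return the file's text, raising a descriptive error if it is unusable."""
    if not path.is_file():
        raise FileNotFoundError(f"Static CSV not found: {path} (run the export task first)")
    text = path.read_text(encoding="utf-8")
    if not text.strip():
        raise ValueError(f"Static CSV is empty: {path}")
    return text


# Usage with a throwaway file standing in for data/regnology_coa.csv.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_coa.csv"
    demo_path.write_text("number,name\n1000,Cash\n", encoding="utf-8")
    loaded = load_csv_or_fail(demo_path)
    try:
        load_csv_or_fail(Path(tmp) / "missing.csv")
        missing_error = None
    except FileNotFoundError as exc:
        missing_error = str(exc)
```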

Step 3: Commit

git add src/evaluation/llm_eval.py
git commit -m "feat: add CSV loading functions for CoA and CC data"

Task 4: Create AccountsMatcher Prompt Builder

Files:

Modify: src/evaluation/llm_eval.py

Step 1: Add the accounts matcher prompt function after load_cc_csv()

def build_accounts_matcher_prompt(
    supplier_name: str,
    line_items_json: str,
    coa_csv: str,
    curated_csv: str,
) -> str:
    """Build prompt for AccountsMatcher (GL account prediction)."""
    return f"""You are a double-entry bookkeeping specialist for accounts payable invoices.
Your task: assign the correct GL debit account for each line item.

## Chart of Accounts

```csv
{coa_csv}
```

## Historical Booking Patterns (from similar invoices)

Below are similar historical bookings. The cluster_count shows how often this pattern appeared:

{curated_csv}

## Invoice Data

Supplier: {supplier_name}

Line Items: {line_items_json}

## Instructions

1. For each line item, find the most appropriate GL account from the Chart of Accounts
2. Consider historical patterns for similar suppliers/descriptions
3. Prefer specific accounts over generic ones
4. Match by the nature of the expense (software, consulting, travel, etc.)

## Response Format

Respond with ONLY a JSON object. The line_items array must have the same length as the input:

{{
  "line_items": [
    {{"debit_account": "XXXXXX", "confidence": 0.85, "reasoning": "Brief explanation"}},
    ...
  ]
}}

Do not include any text before or after the JSON."""

Step 2: Commit

git add src/evaluation/llm_eval.py
git commit -m "feat: add AccountsMatcher prompt builder"
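
The doubled braces in the prompt builder matter because the prompt is an f-string: single braces interpolate a variable, while {{ and }} produce literal braces for the JSON example. A reduced sketch of that behaviour, with a hypothetical function name and a shortened prompt:

```python
import json


def build_demo_prompt(line_items_json: str) -> str:
    # {line_items_json} is interpolated; {{ and }} become literal braces.
    return f"""Line Items: {line_items_json}

Respond with ONLY a JSON object:

{{
  "line_items": [
    {{"debit_account": "XXXXXX", "confidence": 0.85}}
  ]
}}"""


prompt = build_demo_prompt(json.dumps([{"index": 0, "description": "SaaS license"}]))
# The example after the instruction line is itself valid JSON once rendered.
example = prompt.split("Respond with ONLY a JSON object:")[1]
parsed = json.loads(example)
print(parsed["line_items"][0]["debit_account"])  # XXXXXX
```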

Task 5: Create CostCenterMatcher Prompt Builder

Files:

Modify: src/evaluation/llm_eval.py

Step 1: Add the cost center matcher prompt function after build_accounts_matcher_prompt()

def build_cost_center_matcher_prompt(
    supplier_name: str,
    line_items_json: str,
    cc_csv: str,
    curated_csv: str,
) -> str:
    """Build prompt for CostCenterMatcher (cost center prediction)."""
    return f"""You are a cost center allocation specialist for accounts payable invoices.
Your task: assign the correct cost center for each line item.

## Cost Centers

```csv
{cc_csv}
```

## Historical Booking Patterns (from similar invoices)

Below are similar historical bookings. The cluster_count shows how often this pattern appeared:

{curated_csv}

## Invoice Data

Supplier: {supplier_name}

Line Items: {line_items_json}

## Instructions

1. For each line item, find the most appropriate cost center
2. Consider historical patterns for similar suppliers/descriptions
3. Different line items on the same invoice may belong to different cost centers
4. Match based on which department/team consumes the goods/services

## Response Format

Respond with ONLY a JSON object. The line_items array must have the same length as the input:

{{
  "line_items": [
    {{"cost_center": "XXXXXX", "confidence": 0.85, "reasoning": "Brief explanation"}},
    ...
  ]
}}

Do not include any text before or after the JSON."""

Step 2: Commit

git add src/evaluation/llm_eval.py
git commit -m "feat: add CostCenterMatcher prompt builder"
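
Since the model is asked to pick a code from the supplied CSV, it may be worth checking that each returned code actually exists there. This is a sketch only, assuming the CSV has a number column as in the export queries; known_numbers and unknown_predictions are hypothetical helpers, and the sample data is made up:

```python
import csv
import io


def known_numbers(csv_text: str) -> set[str]:
    """Collect the values of the 'number' column from a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    return {row["number"].strip() for row in reader if row.get("number")}


def unknown_predictions(predictions: list[dict], key: str, valid: set[str]) -> list[int]:
    """Return indexes of predictions whose code is missing from the dataset."""
    return [
        i for i, pred in enumerate(predictions)
        if str(pred.get(key, "")).strip() not in valid
    ]


# Hypothetical data for illustration only.
cc_sample = "number,name,description\n100,Engineering,\n200,Sales,\n"
preds = [{"cost_center": "100"}, {"cost_center": "999"}, {}]
print(unknown_predictions(preds, "cost_center", known_numbers(cc_sample)))  # [1, 2]
```

The same check would apply to debit_account values against the Chart of Accounts CSV.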

Task 6: Create Matcher Call Functions

Files:

Modify: src/evaluation/llm_eval.py

Step 1: Add matcher call functions after the prompt builders

def call_accounts_matcher(
    supplier_name: str,
    line_items_json: str,
    coa_csv: str,
    curated_csv: str,
) -> tuple[list[dict], str, str]:
    """
    Call LLM for GL account matching.

    Returns: (results_list, prompt_used, raw_response)
    """
    prompt = build_accounts_matcher_prompt(supplier_name, line_items_json, coa_csv, curated_csv)
    parsed, raw = call_llm(prompt)

    if parsed and 'line_items' in parsed:
        return parsed['line_items'], prompt, raw
    return [], prompt, raw


def call_cost_center_matcher(
    supplier_name: str,
    line_items_json: str,
    cc_csv: str,
    curated_csv: str,
) -> tuple[list[dict], str, str]:
    """
    Call LLM for cost center matching.

    Returns: (results_list, prompt_used, raw_response)
    """
    prompt = build_cost_center_matcher_prompt(supplier_name, line_items_json, cc_csv, curated_csv)
    parsed, raw = call_llm(prompt)

    if parsed and 'line_items' in parsed:
        return parsed['line_items'], prompt, raw
    return [], prompt, raw

Step 2: Commit

git add src/evaluation/llm_eval.py
git commit -m "feat: add matcher call functions"
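
The call functions return whatever list the model produced, so a short, long, or malformed line_items array passes through unchanged. One possible guard is sketched below. It assumes the parsed value is either a dict or None, as the checks in the call functions suggest; normalize_items is a hypothetical name:

```python
def normalize_items(parsed: object, expected_len: int) -> list[dict]:
    """Return exactly expected_len dicts, padding or truncating as needed."""
    items = parsed.get("line_items") if isinstance(parsed, dict) else None
    if not isinstance(items, list):
        items = []
    # Replace non-dict entries with empty dicts so .get() stays safe downstream.
    cleaned = [item if isinstance(item, dict) else {} for item in items[:expected_len]]
    cleaned.extend({} for _ in range(expected_len - len(cleaned)))
    return cleaned


print(normalize_items({"line_items": [{"debit_account": "6000"}]}, 2))
# [{'debit_account': '6000'}, {}]
```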

Task 7: Update LineItemResult Dataclass

Files:

Modify: src/evaluation/llm_eval.py

Step 1: Update LineItemResult to track both matcher prompts/responses (around line 46)

Replace the existing LineItemResult class:

@dataclass
class LineItemResult:
    """Result for a single line item evaluation."""
    description: str
    amount: Optional[float]
    # LLM predictions
    llm_gl: Optional[str] = None
    llm_gl_confidence: Optional[float] = None
    llm_gl_reasoning: Optional[str] = None
    llm_cc: Optional[str] = None
    llm_cc_confidence: Optional[float] = None
    llm_cc_reasoning: Optional[str] = None
    # Current Orcha values (for reference)
    orcha_gl: Optional[str] = None
    orcha_cc: Optional[str] = None
    # Historical ground truth (from file)
    historical_gl: str = ""
    historical_cc: str = ""
    # Match status
    gl_match: bool = False
    cc_match: bool = False

Step 2: Add InvoiceResult fields for debug info (update InvoiceResult around line 69)

Replace the existing InvoiceResult class:

@dataclass
class InvoiceResult:
    """Result for an entire invoice evaluation."""
    invoice_number: str
    supplier: str
    issue_type: str
    historical_gl: str
    historical_cc: str
    line_items: list[LineItemResult] = field(default_factory=list)
    error: Optional[str] = None
    elapsed_seconds: float = 0.0
    # Debug info (per-invoice, not per-line-item)
    curation_csv: str = ""
    accounts_matcher_prompt: str = ""
    accounts_matcher_response: str = ""
    cost_center_matcher_prompt: str = ""
    cost_center_matcher_response: str = ""

Step 3: Commit

git add src/evaluation/llm_eval.py
git commit -m "refactor: update dataclasses for per-invoice matching"
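
The line_items field uses field(default_factory=list) rather than a plain [] default. Dataclasses reject a bare list as a default, and the factory gives each result its own list. A reduced illustration with a hypothetical DemoInvoice class:

```python
from dataclasses import dataclass, field


@dataclass
class DemoInvoice:  # hypothetical, reduced version of InvoiceResult
    invoice_number: str
    line_items: list[str] = field(default_factory=list)


a = DemoInvoice("INV-1")
b = DemoInvoice("INV-2")
a.line_items.append("row")
print(b.line_items)  # [] because each instance owns its list
```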

Task 8: Refactor evaluate_invoice Function

Files:

Modify: src/evaluation/llm_eval.py

Step 1: Replace the evaluate_invoice function (around line 216)

def evaluate_invoice(
    issue: IssueRecord,
    orcha_conn,
    semantic_conn,
    coa_csv: str,
    cc_csv: str,
    k: int = 10,
    threshold: float = 0.6,
) -> InvoiceResult:
    """Evaluate a single invoice with per-invoice matcher calls."""
    start_time = time.perf_counter()

    result = InvoiceResult(
        invoice_number=issue.invoice_number,
        supplier=issue.supplier,
        issue_type=issue.issue_type,
        historical_gl=issue.historical_dim1,
        historical_cc=issue.historical_dim3,
    )

    # Fetch invoice from Orcha
    invoice_data = fetch_invoice_from_orcha(orcha_conn, issue.invoice_number)
    if not invoice_data:
        result.error = f"Invoice not found in Orcha DB: {issue.invoice_number}"
        result.elapsed_seconds = time.perf_counter() - start_time
        return result

    supplier_name = invoice_data['issuer_name']
    line_items = invoice_data['line_items']

    if not line_items:
        result.error = "No line items found"
        result.elapsed_seconds = time.perf_counter() - start_time
        return result

    try:
        # Step 1: Collect all line item descriptions for curation
        descriptions = [li.get('description', '') for li in line_items]

        # Step 2: Curate bookings once for entire invoice
        curated = curate_bookings_for_invoice(
            semantic_conn,
            supplier_name,
            descriptions,
            model='google',
            k=k,
            threshold=threshold,
        )
        curated_csv = curated_bookings_to_csv(curated)
        result.curation_csv = curated_csv

        # Step 3: Build line items JSON for prompts
        line_items_for_prompt = [
            {"index": i, "description": li.get('description', ''), "amount": li.get('amount', 0) or 0}
            for i, li in enumerate(line_items)
        ]
        line_items_json = json.dumps(line_items_for_prompt, indent=2)

        # Step 4: Call AccountsMatcher (one call for all line items)
        gl_results, gl_prompt, gl_raw = call_accounts_matcher(
            supplier_name, line_items_json, coa_csv, curated_csv
        )
        result.accounts_matcher_prompt = gl_prompt
        result.accounts_matcher_response = gl_raw

        # Step 5: Call CostCenterMatcher (one call for all line items)
        cc_results, cc_prompt, cc_raw = call_cost_center_matcher(
            supplier_name, line_items_json, cc_csv, curated_csv
        )
        result.cost_center_matcher_prompt = cc_prompt
        result.cost_center_matcher_response = cc_raw

        # Step 6: Merge results into LineItemResults
        for i, li in enumerate(line_items):
            li_result = LineItemResult(
                description=li.get('description', ''),
                amount=li.get('amount', 0) or 0,
                orcha_gl=li.get('debit-account', {}).get('number') if li.get('debit-account') else None,
                orcha_cc=li.get('cost-center', {}).get('number') if li.get('cost-center') else None,
                historical_gl=issue.historical_dim1,
                historical_cc=issue.historical_dim3,
            )

            # Extract GL prediction
            if i < len(gl_results):
                gl_pred = gl_results[i]
                li_result.llm_gl = gl_pred.get('debit_account')
                li_result.llm_gl_confidence = gl_pred.get('confidence')
                li_result.llm_gl_reasoning = gl_pred.get('reasoning')

            # Extract CC prediction
            if i < len(cc_results):
                cc_pred = cc_results[i]
                li_result.llm_cc = cc_pred.get('cost_center')
                li_result.llm_cc_confidence = cc_pred.get('confidence')
                li_result.llm_cc_reasoning = cc_pred.get('reasoning')

            # Check matches against historical ground truth
            li_result.gl_match = check_match(li_result.llm_gl, issue.historical_dim1)
            li_result.cc_match = check_match(li_result.llm_cc, issue.historical_dim3)

            result.line_items.append(li_result)

    except Exception as e:
        result.error = f"Error during evaluation: {str(e)}"

    result.elapsed_seconds = time.perf_counter() - start_time
    return result

Step 2: Commit

git add src/evaluation/llm_eval.py
git commit -m "refactor: evaluate_invoice to use per-invoice matcher calls"
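
The function above sets elapsed_seconds in three places (two early returns and the end). As an optional alternative, the same timing and error capture can be expressed once with try/finally. This is a sketch with a hypothetical TimedResult stand-in, not a replacement for the code in this task:

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TimedResult:  # hypothetical stand-in for InvoiceResult
    error: Optional[str] = None
    elapsed_seconds: float = 0.0


def run_timed(step: Callable[[], None]) -> TimedResult:
    """Run step, recording any exception and the elapsed wall-clock time."""
    result = TimedResult()
    start = time.perf_counter()
    try:
        step()
    except Exception as exc:
        result.error = f"Error during evaluation: {exc}"
    finally:
        # Runs on success and on failure alike.
        result.elapsed_seconds = time.perf_counter() - start
    return result


ok = run_timed(lambda: None)
failed = run_timed(lambda: 1 / 0)
```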

Task 9: Update run_evaluation to Load and Pass CSV Data

Files:

Modify: src/evaluation/llm_eval.py

Step 1: Update run_evaluation function (around line 528)

Find the line orcha_conn = psycopg.connect(ORCHA_DB_DSN) and add CSV loading before it:

    # Load static data once
    update_progress("Loading Chart of Accounts and Cost Centers...")
    coa_csv = load_coa_csv()
    cc_csv = load_cc_csv()
    update_progress(f"Loaded CoA ({len(coa_csv)} chars) and CC ({len(cc_csv)} chars)")

Step 2: Update the evaluate_invoice call to pass CSV data

Find the loop for i, issue in enumerate(issues): and update the evaluate_invoice call:

            result = evaluate_invoice(issue, orcha_conn, semantic_conn, coa_csv, cc_csv, k, threshold)

Step 3: Remove the rate limiting sleep

Find and remove these lines if they are still present (around line 300 in the old code; no longer needed):

        # Rate limiting
        time.sleep(0.5)

The old per-line-item loop needed this sleep. With only two LLM calls per invoice, the call volume is much lower for multi-line invoices.
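
To make the change in call volume concrete, the arithmetic can be sketched as follows. The line-item counts are hypothetical, and the function names are illustrative only:

```python
def llm_calls_before(line_items_per_invoice: list[int]) -> int:
    """Old design: one LLM call per line item."""
    return sum(line_items_per_invoice)


def llm_calls_after(line_items_per_invoice: list[int]) -> int:
    """New design: two matcher calls per invoice that has line items."""
    return 2 * sum(1 for n in line_items_per_invoice if n > 0)


counts = [5, 12, 1]  # hypothetical invoices with 5, 12 and 1 line items
print(llm_calls_before(counts))  # 18
print(llm_calls_after(counts))   # 6
```

Note that a single-line invoice now costs two calls instead of one, so the saving comes from invoices with three or more line items.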

Step 4: Commit

git add src/evaluation/llm_eval.py
git commit -m "refactor: run_evaluation loads and passes CSV data"

Task 10: Update HTML Report for New Debug Structure

Files:

Modify: src/evaluation/llm_eval.py

Step 1: Update the HTML report generation to show per-invoice debug info

In generate_html_report(), find the table row generation for line items (around line 429) and update the debug section. Replace the debug row (starting with <tr id="debug_{li_id}") with:

                    <tr id="debug_{li_id}" style="display: none;">
                        <td colspan="8">
                            <div class="p-3 bg-light">
                                <p class="reasoning"><strong>GL Reasoning:</strong> {_escape(li.llm_gl_reasoning or 'N/A')} (confidence: {li.llm_gl_confidence if li.llm_gl_confidence is not None else 'N/A'})</p>
                                <p class="reasoning"><strong>CC Reasoning:</strong> {_escape(li.llm_cc_reasoning or 'N/A')} (confidence: {li.llm_cc_confidence if li.llm_cc_confidence is not None else 'N/A'})</p>
                            </div>
                        </td>
                    </tr>

Step 2: Add invoice-level debug section

After the table closing tag (</table>), add invoice-level debug info:

            # Add invoice-level debug (prompts/responses)
            inv_debug_id = inv.invoice_number.replace('/', '_')
            html += f"""
            <div class="mt-3">
                <div class="collapsible-header p-2 border rounded mb-2" onclick="toggleCollapsible('curation_{inv_debug_id}')">
                    <span class="expand-icon">&#9654;</span> <strong>Curated Historical Bookings</strong>
                </div>
                <div id="curation_{inv_debug_id}" class="collapsible-content">
                    <pre>{_escape(inv.curation_csv)}</pre>
                </div>

                <div class="collapsible-header p-2 border rounded mb-2" onclick="toggleCollapsible('gl_prompt_{inv_debug_id}')">
                    <span class="expand-icon">&#9654;</span> <strong>AccountsMatcher Prompt</strong>
                </div>
                <div id="gl_prompt_{inv_debug_id}" class="collapsible-content">
                    <pre>{_escape(inv.accounts_matcher_prompt)}</pre>
                </div>

                <div class="collapsible-header p-2 border rounded mb-2" onclick="toggleCollapsible('gl_response_{inv_debug_id}')">
                    <span class="expand-icon">&#9654;</span> <strong>AccountsMatcher Response</strong>
                </div>
                <div id="gl_response_{inv_debug_id}" class="collapsible-content">
                    <pre>{_escape(inv.accounts_matcher_response)}</pre>
                </div>

                <div class="collapsible-header p-2 border rounded mb-2" onclick="toggleCollapsible('cc_prompt_{inv_debug_id}')">
                    <span class="expand-icon">&#9654;</span> <strong>CostCenterMatcher Prompt</strong>
                </div>
                <div id="cc_prompt_{inv_debug_id}" class="collapsible-content">
                    <pre>{_escape(inv.cost_center_matcher_prompt)}</pre>
                </div>

                <div class="collapsible-header p-2 border rounded mb-2" onclick="toggleCollapsible('cc_response_{inv_debug_id}')">
                    <span class="expand-icon">&#9654;</span> <strong>CostCenterMatcher Response</strong>
                </div>
                <div id="cc_response_{inv_debug_id}" class="collapsible-content">
                    <pre>{_escape(inv.cost_center_matcher_response)}</pre>
                </div>
            </div>
"""

Step 3: Commit

git add src/evaluation/llm_eval.py
git commit -m "refactor: update HTML report for per-invoice debug info"
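
Two details of the report code may deserve hardening: invoice numbers can contain characters other than / that are awkward in element ids, and a confidence of 0.0 should not be displayed as missing. A minimal sketch, with hypothetical helper names that are not part of llm_eval.py:

```python
import re


def dom_id(invoice_number: str) -> str:
    """Replace every character outside [A-Za-z0-9_-] with an underscore."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", invoice_number)


def format_confidence(value: object) -> str:
    """Render a confidence for display, keeping 0.0 distinct from missing."""
    if value is None:
        return "N/A"
    try:
        return f"{float(value):.2f}"
    except (TypeError, ValueError):
        # The model may return a non-numeric value; show it as-is.
        return str(value)


print(dom_id("RE 2024/001"))    # RE_2024_001
print(format_confidence(0.0))   # 0.00
print(format_confidence(None))  # N/A
```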

Task 11: Remove Old Per-Line-Item Code

Files:

Modify: src/evaluation/llm_eval.py

Step 1: Remove the old build_llm_prompt function

Find and delete the function def build_llm_prompt(supplier_name: str, description: str, amount: float, curation_csv: str) -> str: (around line 128-167)

Step 2: Verify no references to old function

cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
grep -n "build_llm_prompt" src/evaluation/llm_eval.py

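Expected: No matches (grep prints nothing and exits with status 1). The new builders, build_accounts_matcher_prompt and build_cost_center_matcher_prompt, do not contain this string.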

Step 3: Commit

git add src/evaluation/llm_eval.py
git commit -m "chore: remove old per-line-item prompt builder"

Task 12: Test the Full Pipeline

Files:

Run: src/evaluation/llm_eval.py

Step 1: Run evaluation on a small sample

cd /home/volrath/code/worktrees-orcha-semantic-search/spikes/semantic-search
python -m src.evaluation.llm_eval old-issues.txt test_evaluation_report.html 3

Expected: Should process 3 invoices and generate test_evaluation_report.html

Step 2: Check the report structure

firefox test_evaluation_report.html &

Verify:

Summary stats display correctly
Each invoice shows all line items
Debug sections for AccountsMatcher and CostCenterMatcher prompts/responses
GL and CC matches highlighted correctly
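
Part of this checklist can be automated. The sketch below looks for the five per-invoice debug element ids introduced in Task 10; it assumes the ids are rendered as id="prefix_invoiceid" exactly as in that template, and missing_sections is a hypothetical helper:

```python
def missing_sections(html: str, invoice_debug_id: str) -> list[str]:
    """Return the debug section prefixes that have no element in the report."""
    prefixes = ["curation", "gl_prompt", "gl_response", "cc_prompt", "cc_response"]
    return [p for p in prefixes if f'id="{p}_{invoice_debug_id}"' not in html]


# Hypothetical fragment containing only two of the five sections.
fragment = '<div id="curation_INV_1"></div><div id="gl_prompt_INV_1"></div>'
print(missing_sections(fragment, "INV_1"))
# ['gl_response', 'cc_prompt', 'cc_response']
```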

Step 3: Run full evaluation (optional)

python -m src.evaluation.llm_eval old-issues.txt evaluation_report_v2.html

Step 4: Commit test results if successful

git add -A
git commit -m "test: verify orcha-aligned evaluation pipeline works"

Summary

Task	Description	Est. Changes
1	Export CoA CSV	New file
2	Export CC CSV	New file
3	Add CSV loading functions	+15 lines
4	AccountsMatcher prompt builder	+40 lines
5	CostCenterMatcher prompt builder	+40 lines
6	Matcher call functions	+30 lines
7	Update dataclasses	~20 lines changed
8	Refactor evaluate_invoice	~80 lines changed
9	Update run_evaluation	~10 lines changed
10	Update HTML report	~50 lines changed
11	Remove old code	-40 lines
12	Test pipeline	Verification