Orcha-Aligned Evaluation Report Design

Overview

Refactor the evaluation report generation to more closely match Orcha's ingestion process, enabling meaningful comparison and eventual integration.

Key changes:

Process all line items per invoice in single LLM calls (not one call per line item)
Separate LLM calls for AccountsMatcher and CostCenterMatcher (like Orcha)
Include full Chart of Accounts and Cost Centers in prompts (like Orcha)
Continue using semantic search for historical bookings (replacing Orcha's trigram matching)

Current vs Target State

Aspect	Current	Target
LLM calls per invoice	N (one per line item)	2 (AccountsMatcher + CostCenterMatcher)
CoA/CC in prompt	None	Full datasets as CSV
Historical bookings	Curated per line item	Curated once per invoice
Output structure	Combined GL+CC	Separate GL and CC results

Data Fetching

Static CSV Files

Export CoA and CC from Orcha DB once, commit to repo:

# Chart of Accounts
psql -h localhost -U postgres -d orcha -c "\COPY (
  SELECT number, name, description, balance_position
  FROM gl_accounts_dataset
  WHERE legal_entity_id = '<REGNOLOGY_ID>' AND is_active = true
  ORDER BY number
) TO STDOUT CSV HEADER" > data/regnology_coa.csv

# Cost Centers
psql -h localhost -U postgres -d orcha -c "\COPY (
  SELECT *
  FROM cost_center_dataset
  WHERE legal_entity_id = '<REGNOLOGY_ID>' AND position IS NOT NULL
  ORDER BY number
) TO STDOUT CSV HEADER" > data/regnology_cc.csv

In Script

COA_CSV_PATH = Path(__file__).parent.parent.parent / 'data' / 'regnology_coa.csv'
CC_CSV_PATH = Path(__file__).parent.parent.parent / 'data' / 'regnology_cc.csv'

# Loaded once at startup
coa_csv = COA_CSV_PATH.read_text()
cc_csv = CC_CSV_PATH.read_text()

Prompt Structures

AccountsMatcher Prompt

You are a double-entry bookkeeping specialist for accounts payable invoices.
Your task: assign the correct GL debit account for each line item.

## Chart of Accounts
${coa_csv}

## Historical Booking Patterns (from similar invoices)
${curated_bookings_csv}

## Invoice Data
Supplier: ${supplier_name}
Line Items:
${line_items_json}

## Instructions
1. For each line item, find the most appropriate GL account from the Chart of Accounts
2. Consider historical patterns for similar suppliers/descriptions
3. Prefer specific accounts over generic ones

## Response Format
{
  "line_items": [
    {"debit_account": "XXXXXX", "confidence": 0.85, "reasoning": "..."},
    ...
  ]
}

CostCenterMatcher Prompt

You are a cost center allocation specialist for accounts payable invoices.
Your task: assign the correct cost center for each line item.

## Cost Centers
${cc_csv}

## Historical Booking Patterns (from similar invoices)
${curated_bookings_csv}

## Invoice Data
Supplier: ${supplier_name}
Line Items:
${line_items_json}

## Instructions
1. For each line item, find the most appropriate cost center
2. Consider historical patterns for similar suppliers/descriptions
3. Different line items may belong to different cost centers

## Response Format
{
  "line_items": [
    {"cost_center": "XXXXXX", "confidence": 0.85, "reasoning": "..."},
    ...
  ]
}

Per-Invoice Curation

The existing curate_bookings_for_invoice() already supports multiple line items. Change from calling it per line item to calling it once per invoice with all descriptions.

Current (per line item):

for li in line_items:
    curated = curate_bookings_for_invoice(conn, supplier_name, [li['description']], ...)

Target (per invoice):

all_descriptions = [li.get('description', '') for li in line_items]
curated = curate_bookings_for_invoice(conn, supplier_name, all_descriptions, ...)

How Curation Works

Invoice with 3 line items:
  - "Software license renewal"
  - "Professional services - implementation"
  - "Training workshop"

Step 1: Semantic search per line item (k=10 each)
  → Up to 30 raw matches

Step 2: Merge and dedupe by ID
  → ~20 unique matches

Step 3: Cluster by (debit_account, cost_center)
  → Groups matches by their GL/CC combination

Step 4: Deduplicate within clusters (similarity > 0.9)
  → ~2-3 diverse examples per cluster

Step 5: Output CSV sorted by cluster frequency

Output includes cluster_count to indicate pattern frequency.

Evaluation Flow

def evaluate_invoice(issue, orcha_conn, semantic_conn, coa_csv, cc_csv, ...):
    # 1. Fetch invoice from Orcha (unchanged)
    invoice_data = fetch_invoice_from_orcha(orcha_conn, issue.invoice_number)

    # 2. Collect ALL line item descriptions for curation
    descriptions = [li.get('description', '') for li in invoice_data['line_items']]

    # 3. Curate bookings once for entire invoice
    curated = curate_bookings_for_invoice(conn, supplier_name, descriptions, ...)
    curated_csv = curated_bookings_to_csv(curated)

    # 4. Build line items JSON for prompt
    line_items_json = json.dumps([
        {"description": li['description'], "amount": li['amount']}
        for li in invoice_data['line_items']
    ], indent=2)

    # 5. Call AccountsMatcher (one LLM call for all line items)
    gl_results = call_accounts_matcher(supplier_name, line_items_json, coa_csv, curated_csv)

    # 6. Call CostCenterMatcher (one LLM call for all line items)
    cc_results = call_cost_center_matcher(supplier_name, line_items_json, cc_csv, curated_csv)

    # 7. Merge results and compare against ground truth
    for i, li in enumerate(invoice_data['line_items']):
        li_result = LineItemResult(
            llm_gl=gl_results[i]['debit_account'],
            llm_cc=cc_results[i]['cost_center'],
            ...
        )

Changes to `run_evaluation()`

def run_evaluation(issues_file, output_file, ...):
    # Load static data once
    coa_csv = COA_CSV_PATH.read_text()
    cc_csv = CC_CSV_PATH.read_text()

    # ... existing setup ...

    for issue in issues:
        result = evaluate_invoice(
            issue, orcha_conn, semantic_conn,
            coa_csv, cc_csv,  # Pass pre-loaded data
            k, threshold
        )
        results.append(result)

Files to Modify

src/evaluation/llm_eval.py
- Add call_accounts_matcher() and call_cost_center_matcher() functions
- Refactor evaluate_invoice() to call matchers once per invoice
- Update run_evaluation() to load and pass CoA/CC data
- Remove per-line-item build_llm_prompt() and call_llm()
New: data/regnology_coa.csv
- Exported from Orcha's gl_accounts_dataset table
New: data/regnology_cc.csv
- Exported from Orcha's cost_center_dataset table

Success Criteria

Evaluation report shows similar or better accuracy compared to current implementation
LLM calls reduced from N per invoice to 2 per invoice
Report still shows per-line-item results with match status
Debug info includes both AccountsMatcher and CostCenterMatcher prompts/responses