Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

SCC Test Ingestion — Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Ingest 12 test invoices from SCC Group through the full SES acquisition→ingestion pipeline, with custom prompts for Austrian AP account assignment, and generate an HTML report with the results.

Architecture: Four Babashka scripts + one Python helper. scc_setup.clj provisions tenant/LE/doc-source/master-data/prompts in PostgreSQL. scc_convert.py converts .msg→.eml (with CATS Excel data embedded for freelancer emails). scc_ingest.clj uploads EMLs to LocalStack S3 (S3→SQS notification triggers the pipeline automatically). scc_report.clj queries completed documents and generates a self-contained HTML file.

Tech Stack: Babashka (postgresql pod, honeysql, cheshire, data.csv), Python 3 (extract-msg, openpyxl, email stdlib), LocalStack (S3/SQS)

Design doc: docs/plans/2026-03-26-scc-test-ingestion-design.md

Source data: ~/code/orcha/drive/SCC/ (12 .msg files in Eingangsrechnungen/, 1 .xlsx mapping file)


Task 1: Create Feature Branch

Step 1: Create and switch to branch

git checkout -b feature/scc-test-ingestion

Step 2: Commit

Not needed yet — just branch creation.


Task 2: Python MSG→EML Converter

Files:

This script converts all 12 .msg files to .eml format suitable for the SES acquisition pipeline.

Dependencies (use the existing venv or create one):

python3 -m venv /tmp/scc-venv
/tmp/scc-venv/bin/pip install extract-msg openpyxl

Step 1: Write the conversion script

scripts/scc_convert.py — accepts 3 arguments:

  1. --input-dir — path to the directory with .msg files (default: ~/code/orcha/drive/SCC/Eingangsrechnungen)
  2. --output-dir — path to save .eml files (default: /tmp/scc-eml)
  3. --token — the SES doc source token (used for TO header rewriting)

The script:

#!/usr/bin/env python3
"""Convert .msg files to .eml format for SES pipeline ingestion.

For freelancer emails containing CATS Excel timesheets, the timesheet data
is parsed and appended as a structured HTML table to the email body.
This allows the extraction prompt to access project/order information.

Attachment filenames from extract-msg may contain null bytes (\x00) —
strip them before any string operations.
"""
import argparse
import email
import email.mime.multipart
import email.mime.text
import email.mime.base
import email.mime.image
import email.utils
import io
import os
import sys
from datetime import datetime, timezone

import extract_msg
import openpyxl


MAIL_DOMAIN = "mail.getorcha.com"


def clean_filename(name):
    """Strip null bytes and whitespace from MSG attachment filenames."""
    if name is None:
        return None
    return name.replace('\x00', '').strip()


def parse_cats_excel(data_bytes):
    """Parse a CATS timesheet Excel file into an HTML table.

    CATS files have columns: Persnr, Projekt, Pos, Kurztext, Stunden, RZ, Datum, PSP, Langtext
    Returns an HTML string with a summary table grouped by Projekt+Pos.
    """
    wb = openpyxl.load_workbook(io.BytesIO(data_bytes), data_only=True)
    ws = wb.active

    rows = list(ws.iter_rows(min_row=1, values_only=True))
    if not rows:
        return ""

    headers = [str(h) if h else "" for h in rows[0]]
    data_rows = [r for r in rows[1:] if any(v is not None for v in r)]

    # Build summary: group by (Projekt, Pos) and sum hours
    # Columns: Persnr=0, Projekt=1, Pos=2, Kurztext=3, Stunden=4
    summary = {}
    detail_lines = []
    for row in data_rows:
        projekt = row[1]
        pos = row[2]
        kurztext = row[3] or ""
        stunden = row[4] or 0
        datum = row[6]

        key = (projekt, pos)
        if key not in summary:
            summary[key] = {"total_hours": 0, "descriptions": set()}
        summary[key]["total_hours"] += float(stunden)
        summary[key]["descriptions"].add(str(kurztext).strip())

        # Detail line
        date_str = datum.strftime("%Y-%m-%d") if isinstance(datum, datetime) else str(datum)
        detail_lines.append(f"  {projekt} | {pos} | {stunden:.2f}h | {date_str} | {kurztext}")

    # Format as structured text block (readable by LLM, included in email body)
    lines = []
    lines.append("")
    lines.append("=" * 70)
    lines.append("TIMESHEET DATA (CATS) — extracted from attached Excel")
    lines.append("Use this data for order/position-based account assignment (CO Kontierung)")
    lines.append("=" * 70)
    lines.append("")
    lines.append("SUMMARY BY ORDER/POSITION:")
    lines.append(f"  {'Order (Projekt)':<20} {'Position (Pos)':<18} {'Total Hours':<14} Descriptions")
    lines.append("  " + "-" * 66)
    for (projekt, pos), info in sorted(summary.items()):
        descs = "; ".join(sorted(info["descriptions"]))[:60]
        lines.append(f"  {str(projekt):<20} {str(pos):<18} {info['total_hours']:<14.2f} {descs}")
    lines.append("")
    lines.append("DETAILED TIME ENTRIES:")
    lines.append(f"  {'Order':<10} {'Pos':<8} {'Hours':<10} {'Date':<12} Description")
    lines.append("  " + "-" * 66)
    for line in detail_lines:
        lines.append(line)
    lines.append("=" * 70)

    return "\n".join(lines)


def is_cats_excel(filename):
    """Check if an attachment is a CATS timesheet Excel file."""
    if not filename:
        return False
    return filename.upper().startswith("CATS_") and filename.endswith(".xlsx")


def convert_msg_to_eml(msg_path, token, output_dir):
    """Convert a single .msg file to .eml format.

    - Preserves From, Subject, Date, HTML body, and all attachments
    - Rewrites TO header to documents+{token}@{MAIL_DOMAIN}
    - For emails with CATS Excel: parses timesheet and appends to body
    """
    msg = extract_msg.openMsg(msg_path)

    # Build MIME message
    mime_msg = email.mime.multipart.MIMEMultipart("mixed")

    # Headers
    mime_msg["From"] = str(msg.sender or "unknown@scc.at")
    mime_msg["To"] = f"documents+{token}@{MAIL_DOMAIN}"
    mime_msg["Subject"] = str(msg.subject or "")
    if msg.date:
        mime_msg["Date"] = email.utils.format_datetime(msg.date)
    mime_msg["Message-ID"] = email.utils.make_msgid(domain="scc.at")

    # Body — start with original HTML or plain text
    body = msg.htmlBody or msg.body or ""
    if isinstance(body, bytes):
        body = body.decode("utf-8", errors="replace")

    # Check for CATS Excel attachments and embed timesheet data
    cats_text = ""
    for att in msg.attachments:
        fname = clean_filename(att.longFilename or att.shortFilename)
        if is_cats_excel(fname) and att.data:
            cats_text += parse_cats_excel(att.data)

    if cats_text:
        # Append to body — as preformatted text if HTML, or plain text
        if "<html" in body.lower():
            body += f"\n<pre>{cats_text}</pre>\n"
        else:
            body += cats_text

    # Determine body content type
    if "<html" in body.lower():
        body_part = email.mime.text.MIMEText(body, "html", "utf-8")
    else:
        body_part = email.mime.text.MIMEText(body, "plain", "utf-8")
    mime_msg.attach(body_part)

    # Attachments — include all (PDFs, images, Excel files)
    for att in msg.attachments:
        fname = clean_filename(att.longFilename or att.shortFilename)
        if not fname or not att.data:
            continue

        content_type = att.mimetype or "application/octet-stream"
        if content_type and "/" in content_type:
            maintype, subtype = content_type.strip().split("/", 1)
        else:
            maintype, subtype = "application", "octet-stream"

        att_part = email.mime.base.MIMEBase(maintype, subtype)
        att_part.set_payload(att.data)
        email.encoders.encode_base64(att_part)
        att_part.add_header("Content-Disposition", "attachment", filename=fname)
        mime_msg.attach(att_part)

    msg.close()

    # Save .eml
    basename = os.path.splitext(os.path.basename(msg_path))[0]
    # Sanitize filename
    safe_name = "".join(c if c.isalnum() or c in "-_ " else "_" for c in basename)
    eml_path = os.path.join(output_dir, f"{safe_name}.eml")

    with open(eml_path, "wb") as f:
        f.write(mime_msg.as_bytes())

    return eml_path


def main():
    parser = argparse.ArgumentParser(description="Convert .msg files to .eml for SES pipeline")
    parser.add_argument("--input-dir", required=True, help="Directory containing .msg files")
    parser.add_argument("--output-dir", required=True, help="Directory to save .eml files")
    parser.add_argument("--token", required=True, help="SES doc source token for TO header")
    args = parser.parse_args()

    os.makedirs(args.output_dir, exist_ok=True)

    msg_files = sorted(f for f in os.listdir(args.input_dir) if f.endswith(".msg"))
    if not msg_files:
        print(f"No .msg files found in {args.input_dir}")
        sys.exit(1)

    print(f"Converting {len(msg_files)} .msg files...")
    for fname in msg_files:
        msg_path = os.path.join(args.input_dir, fname)
        eml_path = convert_msg_to_eml(msg_path, args.token, args.output_dir)
        print(f"  {fname} -> {os.path.basename(eml_path)}")

    print(f"\nDone. EML files saved to {args.output_dir}")


if __name__ == "__main__":
    main()

Step 2: Test the conversion manually

/tmp/scc-venv/bin/python3 scripts/scc_convert.py \
  --input-dir ~/code/orcha/drive/SCC/Eingangsrechnungen \
  --output-dir /tmp/scc-eml \
  --token test-token

Expected: 12 .eml files in /tmp/scc-eml/. Verify:

Spot-check a freelancer EML:

grep -A 5 "TIMESHEET DATA" /tmp/scc-eml/*Novak*.eml

Expected: should see the CATS summary with Projekt 16873, Pos 40, total hours.

Step 3: Commit

git add scripts/scc_convert.py
git commit -m "feat(scc): add MSG to EML converter with CATS Excel embedding"

Task 3: BB Setup Script

Files:

This script provisions all database records needed for the SCC test tenant.

Step 1: Write the setup script

scripts/scc_setup.clj:

(ns scc-setup
  "Sets up SCC test tenant with legal entity, SES doc source, master data, and custom prompts.

   Usage: bb scc:setup
   Prerequisite: PostgreSQL running with migrations applied.

   Idempotent — checks for existing tenant by slug before inserting."
  (:require [babashka.pods :as pods]
            [cheshire.core :as json]
            [clojure.java.io :as io]
            [honey.sql :as sql]))


(pods/load-pod 'org.babashka/postgresql "0.1.2")
(require '[pod.babashka.postgresql :as pg])


;; ============================================================================
;; Configuration
;; ============================================================================

(def db {:dbtype "postgresql" :host "localhost" :port 5432 :dbname "orcha" :user "postgres"})

(def tenant-slug "scc-group")
(def tenant-name "SCC Group")
(def legal-entity-name "SCC EDV-Beratung AG")
(def ses-token "scc-test-2026")
(def mail-domain "mail.getorcha.com")

(def msg-source-dir (str (System/getProperty "user.home") "/code/orcha/drive/SCC"))
(def mapping-file (str msg-source-dir "/Mappinginformationen.xlsx"))


;; ============================================================================
;; Master Data: GL Accounts (from Kontenplan sheet)
;; ============================================================================

(def gl-accounts
  "GL accounts from Mappinginformationen.xlsx Kontenplan sheet.
   Format: [{:number \"0000009900\" :name \"Anlagenverrechnung\"} ...]
   Extracted from column B (Kontonummer) and column C (Kontonummer und Bezeichnung).
   Name is derived by stripping the number prefix from column C."
  ;; NOTE: This is populated by reading the Excel file.
  ;; For the implementation plan, the actual data extraction is done at runtime
  ;; by calling a Python helper or by pre-extracting to EDN.
  ;; Since bb can't read xlsx natively, we'll pre-extract to EDN files.
  :load-from-edn)


;; ============================================================================
;; Master Data: Cost Centers (flattened from 3 sheets)
;; ============================================================================

(def cost-centers
  [{:code "10154"  :employee "Christoph Paumann" :vehicle_plate "W-78547A" :location ""}
   {:code "100920" :employee ""                  :vehicle_plate ""         :location "Hofgasse 3, 8010 Graz"}
   {:code "10141"  :employee "Franz Dorfer"      :vehicle_plate ""         :location ""}
   {:code "20001"  :employee "Michael Homole"    :vehicle_plate "W-12068J" :location ""}
   {:code "100910" :employee ""                  :vehicle_plate ""         :location "Mantlergasse 30-32, 1130 Wien"}])

(def cost-center-headers ["code" "employee" "vehicle_plate" "location"])


;; ============================================================================
;; Custom Prompts
;; ============================================================================

(def extraction-additions
  "IMPORTANT: This is an Austrian group of companies (SCC Group). All invoices are in German.

BUCHUNGSKREIS (Company Code) IDENTIFICATION:
The SCC Group has three companies. Determine which company is the RECIPIENT of the invoice:
- SCC EDV-Beratung AG → Buchungskreis 1000
- HR Force EDV-Beratung GmbH → Buchungskreis 2000
- XIT-CROSS INFORMATION TECHNOLOGIES GMBH → Buchungskreis 3000
Add a top-level field \"buchungskreis\" with the company code number (1000, 2000, or 3000).

INVOICE TYPE:
Add a top-level field \"document_subtype\" with value \"rechnung\" (invoice) or \"gutschrift\" (credit note).

ADDITIONAL REQUIRED FIELDS — extract and add as top-level fields:
- \"zahlungsziel\": payment terms / due date (as text, e.g. \"30 Tage netto\" or a specific date)
- \"bankverbindung\": bank details object with {\"iban\": \"...\", \"bic\": \"...\", \"bank_name\": \"...\"}
  Extract the SUPPLIER's bank details (where the invoice should be paid to).

FREELANCER INVOICES WITH TIMESHEET DATA:
If the email body contains a section marked 'TIMESHEET DATA (CATS)', this is a freelancer invoice.
The timesheet data contains project/order numbers (Projekt), positions (Pos), and hours.
Add a top-level field \"timesheet_summary\" as an array of objects:
  [{\"projekt\": 16873, \"pos\": 40, \"total_hours\": 27.0, \"descriptions\": [\"...\"]}, ...]
This data is critical for account assignment — it determines which order/position each amount is booked to.")


(def cost-center-match-additions
  "SPECIAL COST CENTER LOOKUP RULES FOR SCC GROUP:

The cost center dataset has 4 columns: code, employee, vehicle_plate, location.
Use these rules IN ORDER to find the correct cost center:

1. VEHICLE-BASED MATCHING:
   If the invoice or email subject mentions a license plate number (Austrian format like W-78547A or W-12068J),
   look up the vehicle_plate column to find the employee, then return that employee's cost center.
   Example: Invoice for car service mentioning 'W-12068J' → employee Michael Homole → cost center 20001.

2. EMPLOYEE-BASED MATCHING:
   If the invoice mentions an employee name (in subject, body, or forwarding chain), match against
   the employee column. The email subject often contains the employee name after the vendor.
   Example: Subject 'ASFINAG, Ersatzmaut EUR 200,00 - Paumann' → Christoph Paumann → cost center 10154.

3. LOCATION-BASED MATCHING:
   For utility invoices (electricity, rent, heating) or office-related services, match the
   service delivery address against the location column.
   Example: Electricity bill for 'Hofgasse 3, 8010 Graz' → cost center 100920.

4. FREELANCER INVOICES — NO COST CENTER:
   If the email body contains 'TIMESHEET DATA (CATS)' or the invoice is from a freelancer/external
   consultant, do NOT assign a cost center. These invoices use order-based CO assignment
   (Auftragsnummer + Auftragsposition) instead. Return null for cost center with reasoning
   'Freelancer invoice — uses order/position-based assignment, not cost center.'")


(def accounts-match-additions
  "SPECIAL ACCOUNT ASSIGNMENT RULES FOR SCC GROUP:

1. FIXED ASSET INVOICES:
   Invoices for vehicles (car purchases, Ankaufsvereinbarung), machinery, or other fixed assets
   (Anlagevermögen) MUST be booked to account 9900 (Anlagenverrechnung).
   The asset accounting department will rebook them to the correct asset account later.
   Examples: Vehicle purchase invoices from Peter Göndle, BMW/Porsche invoices.

2. GENERAL EXPENSE INVOICES:
   All other invoices are booked to the appropriate expense account from the chart of accounts.
   Match based on the nature of the expense:
   - Vehicle services/maintenance → relevant vehicle cost account
   - Office supplies → office materials account
   - Electricity/utilities → energy cost account
   - Legal services → legal/consulting fees account
   - Hotel/travel → travel cost account
   - Catering → entertainment/event cost account
   - IT consulting/freelancer services → IT consulting/external services account

3. When in doubt between two accounts, prefer the more specific one.")


;; ============================================================================
;; Setup Functions
;; ============================================================================

(defn exec! [query]
  (pg/execute! db (sql/format query)))

(defn exec-one! [query]
  (first (pg/execute! db (sql/format query))))

(defn tenant-exists? []
  (some? (exec-one! {:select [:id] :from [:tenant] :where [:= :slug tenant-slug]})))


(defn create-tenant! []
  (println "Creating tenant...")
  (let [tenant (exec-one! {:insert-into :tenant
                           :values [{:name tenant-name :slug tenant-slug}]
                           :returning [:id]})]
    (println "  Tenant ID:" (:tenant/id tenant))
    (:tenant/id tenant)))


(defn create-legal-entity! [tenant-id]
  (println "Creating legal entity...")
  (let [le (exec-one! {:insert-into :legal-entity
                       :values [{:tenant-id tenant-id
                                 :name legal-entity-name
                                 :company-country "AT"}]
                       :returning [:id]})]
    (println "  Legal Entity ID:" (:legal-entity/id le))
    (:legal-entity/id le)))


(defn create-ses-doc-source! [legal-entity-id]
  (println "Creating SES doc source...")
  (let [ds (exec-one! {:insert-into :ap-doc-source
                       :values [{:legal-entity-id legal-entity-id
                                 :type [:cast "ses" :doc_source_type]
                                 :is-active true}]
                       :returning [:id]})]
    (exec! {:insert-into :ap-doc-source-ses
            :values [{:doc-source-id (:ap-doc-source/id ds)
                      :ses-receiving-address (str "documents+" ses-token "@" mail-domain)
                      :tenant-address-token ses-token}]})
    (println "  Doc Source ID:" (:ap-doc-source/id ds))
    (println "  SES Address: documents+" ses-token "@" mail-domain)
    (:ap-doc-source/id ds)))


(defn insert-gl-accounts! [legal-entity-id accounts-data]
  (println (str "Inserting GL accounts dataset (" (count accounts-data) " accounts)..."))
  (exec! {:insert-into :gl-accounts-dataset
          :values [{:legal-entity-id legal-entity-id
                    :data [:cast (json/generate-string accounts-data) :jsonb]
                    :is-active true}]})
  (println "  Done."))


(defn insert-cost-centers! [legal-entity-id]
  (println (str "Inserting cost center dataset (" (count cost-centers) " rows)..."))
  (exec! {:insert-into :cost-center-dataset
          :values [{:legal-entity-id legal-entity-id
                    :data [:cast (json/generate-string cost-centers) :jsonb]
                    :headers [:cast (json/generate-string cost-center-headers) :jsonb]
                    :position 1}]})
  (println "  Done."))


(defn insert-prompt-customization! [legal-entity-id prompt-key additions]
  (println (str "  Inserting prompt customization: " prompt-key))
  (exec! {:insert-into :legal-entity-prompt-customization
          :values [{:legal-entity-id legal-entity-id
                    :prompt-key prompt-key
                    :additions additions}]}))


(defn insert-custom-prompts! [legal-entity-id]
  (println "Inserting custom prompt additions...")
  (insert-prompt-customization! legal-entity-id "extraction" extraction-additions)
  (insert-prompt-customization! legal-entity-id "cost-center-match" cost-center-match-additions)
  (insert-prompt-customization! legal-entity-id "accounts-match" accounts-match-additions)
  (println "  Done."))


;; ============================================================================
;; GL Accounts Extraction (calls Python to parse Excel)
;; ============================================================================

(defn extract-gl-accounts-from-excel
  "Calls Python to extract GL accounts from the Kontenplan sheet.
   Returns a vector of {:number \"...\" :name \"...\"} maps."
  []
  (println "Extracting GL accounts from Excel...")
  (let [py-code (str
                  "import openpyxl, json, sys\n"
                  "wb = openpyxl.load_workbook('" mapping-file "', data_only=True)\n"
                  "ws = wb['Kontenplan']\n"
                  "accounts = []\n"
                  "for row in ws.iter_rows(min_row=3, values_only=True):\n"
                  "    num = row[1]\n"
                  "    combined = row[2]\n"
                  "    if num and combined:\n"
                  "        name = combined.split(' - ', 1)[1] if ' - ' in str(combined) else str(combined)\n"
                  "        accounts.append({'number': str(num).strip(), 'name': name.strip()})\n"
                  "json.dump(accounts, sys.stdout)\n")
        result (clojure.java.shell/sh "/tmp/scc-venv/bin/python3" "-c" py-code)]
    (if (zero? (:exit result))
      (let [accounts (json/parse-string (:out result) true)]
        (println (str "  Found " (count accounts) " accounts"))
        accounts)
      (do
        (println "  ERROR extracting GL accounts:" (:err result))
        (System/exit 1)))))


;; ============================================================================
;; Entry Point
;; ============================================================================

(defn -main [& _args]
  (println "=== SCC Test Setup ===\n")

  (when (tenant-exists?)
    (println "Tenant" tenant-slug "already exists. Delete it first if you want to re-run setup.")
    (println "  psql -h localhost -U postgres -d orcha -c \"DELETE FROM tenant WHERE slug = 'scc-group'\"")
    (System/exit 0))

  (let [gl-accounts-data (extract-gl-accounts-from-excel)
        tenant-id        (create-tenant!)
        le-id            (create-legal-entity! tenant-id)]
    (create-ses-doc-source! le-id)
    (insert-gl-accounts! le-id gl-accounts-data)
    (insert-cost-centers! le-id)
    (insert-custom-prompts! le-id)

    (println "\n=== Setup Complete ===")
    (println "Tenant ID:       " tenant-id)
    (println "Legal Entity ID: " le-id)
    (println "SES Token:       " ses-token)
    (println "SES Address:      documents+" ses-token "@" mail-domain)
    (println "\nNext: run bb scc:convert, then bb scc:ingest")))

Step 2: Add bb tasks to bb.edn

Add these tasks under the existing tasks section:

  ;; SCC Test
  scc:setup
  {:doc      "Set up SCC test tenant, legal entity, master data, and custom prompts"
   :requires ([scc-setup])
   :task     (apply scc-setup/-main *command-line-args*)}

  scc:convert
  {:doc      "Convert SCC .msg files to .eml: bb scc:convert [--input-dir DIR] [--output-dir DIR]"
   :requires ([scc-ingest])
   :task     (apply scc-ingest/convert! *command-line-args*)}

  scc:ingest
  {:doc      "Upload converted SCC .eml files to S3 for pipeline processing"
   :requires ([scc-ingest])
   :task     (apply scc-ingest/ingest! *command-line-args*)}

  scc:report
  {:doc      "Generate HTML report for SCC test results"
   :requires ([scc-report])
   :task     (apply scc-report/-main *command-line-args*)}

Step 3: Run setup and verify

bb scc:setup

Expected output:

=== SCC Test Setup ===

Extracting GL accounts from Excel...
  Found 211 accounts
Creating tenant...
  Tenant ID: <uuid>
Creating legal entity...
  Legal Entity ID: <uuid>
Creating SES doc source...
  Doc Source ID: <uuid>
  SES Address: documents+scc-test-2026@mail.getorcha.com
Inserting GL accounts dataset (211 accounts)...
  Done.
Inserting cost center dataset (5 rows)...
  Done.
Inserting custom prompt additions...
  Inserting prompt customization: extraction
  Inserting prompt customization: cost-center-match
  Inserting prompt customization: accounts-match
  Done.

=== Setup Complete ===

Verify in database:

psql -h localhost -U postgres -d orcha -c "SELECT id, name, slug FROM tenant WHERE slug = 'scc-group'"
psql -h localhost -U postgres -d orcha -c "SELECT id, name FROM legal_entity WHERE name = 'SCC EDV-Beratung AG'"
psql -h localhost -U postgres -d orcha -c "SELECT prompt_key FROM legal_entity_prompt_customization WHERE legal_entity_id = (SELECT id FROM legal_entity WHERE name = 'SCC EDV-Beratung AG')"

Step 4: Commit

git add scripts/scc_setup.clj
git commit -m "feat(scc): add setup script for tenant, master data, and custom prompts"

Task 4: BB Ingest Script (Convert + Upload to S3)

Files:

This script wraps the Python conversion and uploads EMLs to S3.

Step 1: Write the ingest script

scripts/scc_ingest.clj:

(ns scc-ingest
  "Converts SCC .msg files to .eml and uploads to S3 for pipeline processing.

   Two entry points:
   - convert!  — runs Python MSG→EML conversion
   - ingest!   — uploads .eml files to S3 (triggers pipeline via S3→SQS notification)

   Usage:
     bb scc:convert             # Convert .msg → .eml
     bb scc:ingest              # Upload .eml files to S3"
  (:require [babashka.process :as p]
            [clojure.java.io :as io]
            [clojure.string :as string]
            [script-config :as cfg]))


(def default-input-dir (str (System/getProperty "user.home") "/code/orcha/drive/SCC/Eingangsrechnungen"))
(def default-output-dir "/tmp/scc-eml")
(def ses-token "scc-test-2026")
(def python-bin "/tmp/scc-venv/bin/python3")
(def convert-script "scripts/scc_convert.py")


(defn ^:private aws [& args]
  (let [cmd (concat ["aws" "--endpoint-url" (cfg/aws-endpoint) "--region" (cfg/aws-region)] args)
        env {"AWS_ACCESS_KEY_ID"     "test"
             "AWS_SECRET_ACCESS_KEY" "test"
             "AWS_DEFAULT_REGION"    (cfg/aws-region)}
        {:keys [exit out err]} (apply p/shell {:out :string :err :string :continue true :extra-env env} cmd)]
    (when (not= 0 exit)
      (println "ERROR:" (string/join " " cmd))
      (println err)
      (System/exit 1))
    (when (seq out) (string/trim out))))


(defn convert! [& _args]
  (println "=== Converting MSG → EML ===")
  (when-not (.exists (io/file python-bin))
    (println "Python venv not found. Create it:")
    (println "  python3 -m venv /tmp/scc-venv")
    (println "  /tmp/scc-venv/bin/pip install extract-msg openpyxl")
    (System/exit 1))
  (let [{:keys [exit err]} (p/shell {:continue true}
                                    python-bin convert-script
                                    "--input-dir" default-input-dir
                                    "--output-dir" default-output-dir
                                    "--token" ses-token)]
    (when (not= 0 exit)
      (println "Conversion failed:" err)
      (System/exit 1))))


(defn ingest! [& _args]
  (println "=== Uploading EML files to S3 ===")
  (let [eml-dir (io/file default-output-dir)
        eml-files (sort (filter #(string/ends-with? (.getName %) ".eml") (.listFiles eml-dir)))]
    (when (empty? eml-files)
      (println "No .eml files found in" default-output-dir)
      (println "Run bb scc:convert first.")
      (System/exit 1))

    (let [bucket (cfg/ses-emails-bucket)]
      (println (str "Uploading " (count eml-files) " emails to " bucket "...\n"))
      (doseq [f eml-files]
        (let [s3-key (str "incoming/" (random-uuid) ".eml")]
          (print (str "  " (.getName f) " → " s3-key " ... "))
          (flush)
          (aws "s3" "cp" (str f) (str "s3://" bucket "/" s3-key))
          (println "done")))

      (println (str "\n" (count eml-files) " emails uploaded."))
      (println "S3 → SQS notifications will trigger the acquisition worker automatically.")
      (println "Monitor app logs for processing status."))))

Step 2: Test conversion + upload

bb scc:convert
bb scc:ingest

Expected: 12 .eml files uploaded to S3. The S3→SQS notification (configured by init_aws.clj) triggers the acquisition worker automatically.

Step 3: Verify pipeline is processing

Check app logs for acquisition and ingestion activity. Documents should appear in the database within a few minutes.

psql -h localhost -U postgres -d orcha -c "
  SELECT d.id, d.file_original_name, i.status
  FROM document d
  JOIN ap_ingestion i ON i.document_id = d.id
  JOIN legal_entity le ON le.id = d.legal_entity_id
  WHERE le.name = 'SCC EDV-Beratung AG'
  ORDER BY d.created_at"

Step 4: Commit

git add scripts/scc_ingest.clj
git commit -m "feat(scc): add EML conversion wrapper and S3 upload script"

Task 5: Update bb.edn with SCC Tasks

Files:

Step 1: Add the SCC tasks

Add after the existing sandbox tasks in bb.edn:

  ;; SCC Test
  scc:setup
  {:doc      "Set up SCC test tenant, legal entity, master data, and custom prompts"
   :requires ([scc-setup])
   :task     (apply scc-setup/-main *command-line-args*)}

  scc:convert
  {:doc      "Convert SCC .msg files to .eml format"
   :requires ([scc-ingest])
   :task     (apply scc-ingest/convert! *command-line-args*)}

  scc:ingest
  {:doc      "Upload converted SCC .eml files to S3 for pipeline processing"
   :requires ([scc-ingest])
   :task     (apply scc-ingest/ingest! *command-line-args*)}

  scc:report
  {:doc      "Generate HTML report for SCC test results"
   :requires ([scc-report])
   :task     (apply scc-report/-main *command-line-args*)}

Step 2: Commit

git add bb.edn
git commit -m "feat(scc): add bb tasks for SCC test workflow"

Task 6: BB HTML Report Generator

Files:

This script queries ingested SCC documents and generates a self-contained HTML report.

Step 1: Write the report script

scripts/scc_report.clj:

(ns scc-report
  "Generates an HTML report of SCC test ingestion results.

   Queries all documents for the SCC legal entity, extracts structured_data,
   and renders an HTML file with account assignment information per invoice.

   Usage: bb scc:report [--output path/to/report.html] [--base-url https://app.example.com]"
  (:require [babashka.pods :as pods]
            [cheshire.core :as json]
            [clojure.string :as string]
            [honey.sql :as sql]))


(pods/load-pod 'org.babashka/postgresql "0.1.2")
(require '[pod.babashka.postgresql :as pg])


(def db {:dbtype "postgresql" :host "localhost" :port 5432 :dbname "orcha" :user "postgres"})
(def default-output "/tmp/scc-report.html")
(def default-base-url "http://localhost:3000")


(defn fetch-documents
  "Fetch all SCC documents with their structured data and ingestion info."
  []
  (pg/execute!
   db
   (sql/format
    {:select [:document.id
              :document.file-original-name
              :document.created-at
              [:document.structured-data :doc-structured-data]
              :ap-ingestion.status
              [:ap-ingestion.structured-data :ing-structured-data]]
     :from [:document]
     :join [:legal-entity [:= :legal-entity.id :document.legal-entity-id]
            :ap-ingestion [:= :ap-ingestion.document-id :document.id]]
     :where [:= :legal-entity.name "SCC EDV-Beratung AG"]
     :order-by [[:document.created-at :asc]]})))


(defn parse-structured-data
  "Parse structured data from JSON string or return map as-is."
  [data]
  (cond
    (string? data) (json/parse-string data true)
    (map? data) data
    :else nil))


(defn format-amount
  "Format amount with currency."
  [amount currency]
  (when amount
    (str (format "%.2f" (double amount)) " " (or currency "EUR"))))


(defn escape-html [s]
  (when s
    (-> (str s)
        (string/replace "&" "&amp;")
        (string/replace "<" "&lt;")
        (string/replace ">" "&gt;")
        (string/replace "\"" "&quot;"))))


(defn render-line-item
  "Render a single line item row."
  [idx item]
  (let [desc  (or (:description item) (:name item) "—")
        acct  (or (get-in item [:debit-account :number])
                  (get-in item [:account :number]) "—")
        acct-name (or (get-in item [:debit-account :name])
                      (get-in item [:account :name]) "")
        cc    (get-in item [:cost-center :number])
        cc-name (get-in item [:cost-center :name])
        order (get-in item [:order :number])
        order-pos (get-in item [:order :position])
        vat   (or (:tax-rate item) (:vat-rate item) "—")
        net   (:net-amount item)
        side  (if (get-in item [:credit-account :number]) "Haben" "Soll")]
    (str "<tr>"
         "<td>" (inc idx) "</td>"
         "<td>" (escape-html desc) "</td>"
         "<td>" (escape-html acct) (when (seq acct-name) (str " — " (escape-html acct-name))) "</td>"
         "<td>" (if cc
                  (str (escape-html cc) (when cc-name (str " — " (escape-html cc-name))))
                  (if order
                    (str "Auftrag " (escape-html (str order)) (when order-pos (str " / Pos " (escape-html (str order-pos)))))
                    "—"))
         "</td>"
         "<td>" (escape-html (str vat)) (when (number? vat) "%") "</td>"
         "<td class='amount'>" (if net (format "%.2f" (double net)) "—") "</td>"
         "<td>" side "</td>"
         "</tr>")))


(defn render-invoice
  "Render a single invoice section."
  [doc base-url]
  (let [data       (or (parse-structured-data (:doc-structured-data doc))
                       (parse-structured-data (:ing-structured-data doc))
                       {})
        doc-id     (:document/id doc)
        filename   (:document/file-original-name doc)
        vendor     (or (:vendor-name data) (get-in data [:vendor :name]) "Unknown")
        inv-num    (or (:invoice-number data) (:reference-number data) "—")
        gross      (:total-gross-amount data)
        currency   (or (:currency data) "EUR")
        inv-date   (or (:invoice-date data) (:document-date data) "—")
        buchungskreis (or (:buchungskreis data) "—")
        doc-type   (or (:document_subtype data) (if (neg? (or gross 0)) "gutschrift" "rechnung"))
        zahlungsziel (or (:zahlungsziel data) (:payment-terms data) "—")
        bank       (:bankverbindung data)
        iban       (or (:iban bank) (get-in data [:vendor :iban]) "—")
        bic        (or (:bic bank) (get-in data [:vendor :bic]) "—")
        line-items (or (:line-items data) [])
        link       (str base-url "/documents/view/" doc-id)]
    (str
     "<div class='invoice'>"
     "<h2>" (escape-html vendor) " — " (escape-html (str inv-num)) "</h2>"
     "<p class='meta'>Quelldatei: " (escape-html filename)
     " | <a href='" link "' target='_blank'>In Orcha öffnen →</a></p>"

     "<table class='header-info'>"
     "<tr><th>Buchungskreis</th><td>" (escape-html (str buchungskreis)) "</td>"
     "    <th>Typ</th><td>" (escape-html (str doc-type)) "</td></tr>"
     "<tr><th>Buchungsdatum</th><td>" (escape-html (str inv-date)) "</td>"
     "    <th>Rechnungsnummer</th><td>" (escape-html (str inv-num)) "</td></tr>"
     "<tr><th>Bruttobetrag</th><td>" (or (format-amount gross currency) "—") "</td>"
     "    <th>Währung</th><td>" (escape-html (str currency)) "</td></tr>"
     "<tr><th>Zahlungsziel</th><td>" (escape-html (str zahlungsziel)) "</td>"
     "    <th>Bankverbindung</th><td>IBAN: " (escape-html (str iban)) " / BIC: " (escape-html (str bic)) "</td></tr>"
     "</table>"

     (if (seq line-items)
       (str "<table class='line-items'>"
            "<thead><tr>"
            "<th>#</th><th>Bezeichnung</th><th>Konto</th>"
            "<th>Kostenstelle / Auftrag</th><th>VSt</th>"
            "<th>Nettobetrag</th><th>S/H</th>"
            "</tr></thead><tbody>"
            (string/join "\n" (map-indexed render-line-item line-items))
            "</tbody></table>")
       "<p class='no-items'>Keine Positionen extrahiert.</p>")

     "</div>")))


(def css
  "body { font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; max-width: 1200px; margin: 0 auto; padding: 20px; background: #f8f9fa; color: #1a1a1a; }
h1 { border-bottom: 3px solid #2563eb; padding-bottom: 12px; }
.invoice { background: white; border-radius: 8px; padding: 24px; margin: 20px 0; box-shadow: 0 1px 3px rgba(0,0,0,0.1); }
.invoice h2 { margin-top: 0; color: #1e40af; }
.meta { color: #6b7280; font-size: 0.9em; }
.meta a { color: #2563eb; }
table { border-collapse: collapse; width: 100%; margin: 12px 0; }
.header-info th { text-align: left; padding: 6px 12px; background: #f0f4ff; width: 15%; font-size: 0.85em; color: #374151; }
.header-info td { padding: 6px 12px; width: 35%; }
.line-items th { text-align: left; padding: 8px 12px; background: #1e40af; color: white; font-size: 0.85em; }
.line-items td { padding: 6px 12px; border-bottom: 1px solid #e5e7eb; font-size: 0.9em; }
.line-items tr:nth-child(even) { background: #f9fafb; }
.amount { text-align: right; font-variant-numeric: tabular-nums; }
.no-items { color: #9ca3af; font-style: italic; }
.summary { background: #eff6ff; border-radius: 8px; padding: 16px 24px; margin: 16px 0; }
.status-completed { color: #059669; }
.status-failed { color: #dc2626; }
.status-in-progress { color: #d97706; }")


(defn render-report
  "Render the full HTML report."
  [documents base-url]
  (let [completed (count (filter #(= "completed" (:ap-ingestion/status %)) documents))
        failed    (count (filter #(= "failed" (:ap-ingestion/status %)) documents))
        pending   (- (count documents) completed failed)]
    (str
     "<!DOCTYPE html>\n<html lang='de'>\n<head>\n"
     "<meta charset='utf-8'>\n"
     "<title>SCC Group — Orcha Testbericht</title>\n"
     "<style>" css "</style>\n"
     "</head>\n<body>\n"
     "<h1>SCC Group — Kontierungsergebnisse</h1>\n"
     "<p>Generiert am " (java.time.LocalDate/now) "</p>\n"

     "<div class='summary'>"
     "<strong>" (count documents) " Rechnungen verarbeitet:</strong> "
     "<span class='status-completed'>" completed " abgeschlossen</span>"
     (when (pos? failed) (str " | <span class='status-failed'>" failed " fehlgeschlagen</span>"))
     (when (pos? pending) (str " | <span class='status-in-progress'>" pending " in Bearbeitung</span>"))
     "</div>\n"

     (string/join "\n" (map #(render-invoice % base-url) documents))

     "\n</body>\n</html>")))


(defn -main [& args]
  (let [output   (or (some #(when (string/starts-with? % "--output") (second (string/split % #"=" 2))) args)
                     default-output)
        base-url (or (some #(when (string/starts-with? % "--base-url") (second (string/split % #"=" 2))) args)
                     default-base-url)]
    (println "=== SCC Report Generator ===\n")
    (let [documents (fetch-documents)]
      (if (empty? documents)
        (do (println "No SCC documents found. Run bb scc:ingest first and wait for processing.")
            (System/exit 1))
        (do
          (println (str "Found " (count documents) " documents"))
          (let [html (render-report documents base-url)]
            (spit output html)
            (println (str "Report saved to: " output))
            (println (str "Open: file://" output))))))))

Step 2: Test (after ingestion completes)

bb scc:report

Expected: HTML file at /tmp/scc-report.html with 12 invoice sections.

Step 3: Commit

git add scripts/scc_report.clj
git commit -m "feat(scc): add HTML report generator for test results"

Task 7: End-to-End Test Run

This task is manual verification of the full pipeline.

Step 1: Ensure prerequisites

Step 2: Run the full workflow

bb scc:setup      # Provision tenant, LE, master data, prompts
bb scc:convert    # Convert .msg → .eml
bb scc:ingest     # Upload .eml files to S3

Step 3: Wait for ingestion to complete

Monitor progress:

watch -n 5 'psql -h localhost -U postgres -d orcha -t -c "
  SELECT i.status, count(*)
  FROM ap_ingestion i
  JOIN document d ON d.id = i.document_id
  JOIN legal_entity le ON le.id = d.legal_entity_id
  WHERE le.name = '\''SCC EDV-Beratung AG'\''
  GROUP BY i.status"'

Wait until all 12 show completed (or identify failures).

Step 4: Generate report

bb scc:report

Open /tmp/scc-report.html in browser and verify:

Step 5: Iterate on prompts if needed

If account assignment results aren't satisfactory, update the custom prompts in the database and re-ingest specific documents.


Task Summary

# Task Files Depends On
1 Create branch
2 Python MSG→EML converter scripts/scc_convert.py 1
3 BB setup script scripts/scc_setup.clj 1
4 BB ingest script scripts/scc_ingest.clj 2, 3
5 Update bb.edn bb.edn 3, 4, 6
6 BB HTML report scripts/scc_report.clj 1
7 End-to-end test 2, 3, 4, 5, 6