Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

SCC Test Ingestion Design

Context

SCC Group is a potential client that sent 12 test invoices (as .msg email files) and a mapping spreadsheet. They want to see Orcha extract account assignment information (Kontierungsinformationen) from their invoices.

Their current process: a single inbox (eingangsrechnung@scc.at) receives invoices for 3 group companies. Invoices are forwarded to SAP, triaged, sent for approval, then booked with full account assignment (Buchungskreis, creditor, GL account, cost center or order number, VAT, etc.).

Decisions

Data Inventory

12 MSG Files (invoices)

#  | File                                            | Vendor                      | Attachments          | Notes
---|-------------------------------------------------|-----------------------------|----------------------|-----------------------------------------------
1  | ASFINAG, Ersatzmaut EUR 200,00 - Paumann        | ASFINAG                     | 1 PDF                | Toll: vehicle W-78547A → Paumann → KST 10154
2  | Amazon, 5 x Ohrpolster f. Kopfhörer, EUR 40,75  | Amazon                      | 1 PDF + 2 images     | Office supplies
3  | Brotkost RE für Onboarding am 01.04.            | Brotkost                    | 1 PDF                | Catering
4  | Ihre Rechnungen (Hotel Schillerpark)            | Hotel Schillerpark Linz     | 1 PDF                | Hotel
5  | Ihre neue oekostrom AG-Rechnung ist verfügbar   | oekostrom AG                | 1 PDF                | Electricity: location-based KST
6  | Ihre neue oekostrom AG-Rechnung ist verfügbar 2 | oekostrom AG                | 1 PDF                | Electricity: location-based KST
7  | Merbag GmbH, Re. 63416607, EUR 25,33 - Homole   | Merbag                      | 1 PDF                | Vehicle service: W-12068J → Homole → KST 20001
8  | ÖRAG, Dauerrechnung EUR 22.604,09               | ÖRAG                        | 1 PDF                | Legal services / recurring
9  | Peter A. Novak, Rechung 02/2026                 | Peter A. Novak (freelancer) | 1 PDF + 1 CATS Excel | Freelancer: order/position from Excel
10 | Peter Göndle, Ankaufsvereinbarung Porsche 911   | Peter Göndle                | 1 PDF + 5 images     | Fixed asset → account 9900
11 | Peter Göndle GmbH, Rechnung BMW i4 M60 xDrive   | Peter Göndle GmbH           | 1 PDF + 5 images     | Fixed asset → account 9900
12 | WG: Leistungsnachweis und Rechnung März 2026    | Krammel (freelancer)        | 1 PDF + 1 CATS Excel | Freelancer: order/position from Excel

Mapping Spreadsheet (Mappinginformationen.xlsx)

Architecture

flowchart LR
    MSG[".msg files"] -->|Python extract-msg| EML[".eml files"]
    XLSX["Mappinginformationen.xlsx"] -->|BB script| DB[(PostgreSQL)]
    EML -->|BB upload| S3["LocalStack S3"]
    S3 -->|SQS trigger| ACQ["SES Acquisition"]
    ACQ --> TRIAGE["Triage"]
    TRIAGE --> INGEST["Ingestion Pipeline"]
    INGEST --> DOC["Documents + structured_data"]
    DOC -->|BB report script| HTML["HTML Report"]

Component Design

1. Setup Script (Babashka)

Inserts into PostgreSQL:

2. MSG→EML Conversion (Python, called from BB)

For each .msg file:

  1. Parse with extract-msg: sender, subject, date, HTML body, attachments
  2. Construct MIME/EML with Python email stdlib
  3. Rewrite the To header to documents+{TOKEN}@mail.getorcha.com
  4. For freelancer emails (Novak, Krammel): parse CATS Excel, append structured timesheet table to HTML body
  5. Save .eml to staging directory
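
Steps 2 to 4 can be sketched with the Python standard library alone. This is a minimal sketch, not the actual conversion script: it assumes the fields have already been pulled out of the .msg file (step 1, via extract-msg) and that the CATS Excel has already been read into a list of dicts. The function names, the argument shapes, and the timesheet column names are hypothetical.

```python
from datetime import datetime, timezone
from email.message import EmailMessage
from email.utils import format_datetime
from html import escape


def append_timesheet_table(html_body, rows):
    """Step 4: append timesheet rows (list of dicts) to the body as an HTML table.

    Assumption: plain concatenation is good enough for the downstream parser.
    """
    if not rows:
        return html_body
    columns = list(rows[0])
    head = "".join(f"<th>{escape(str(c))}</th>" for c in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(r.get(c, '')))}</td>" for c in columns) + "</tr>"
        for r in rows
    )
    return f"{html_body}<table><tr>{head}</tr>{body}</table>"


def build_eml(sender, subject, sent_at, html_body, attachments, token):
    """Steps 2 and 3: build EML bytes from already-parsed .msg fields.

    attachments is a list of (filename, mime_type, payload_bytes) tuples.
    """
    msg = EmailMessage()
    msg["From"] = sender
    # The original recipient is replaced by the tokenized ingestion address.
    msg["To"] = f"documents+{token}@mail.getorcha.com"
    msg["Subject"] = subject
    msg["Date"] = format_datetime(sent_at)
    msg.set_content(html_body, subtype="html")
    for filename, mime_type, payload in attachments:
        maintype, subtype = mime_type.split("/", 1)
        msg.add_attachment(payload, maintype=maintype, subtype=subtype, filename=filename)
    return msg.as_bytes()


# Usage with made-up values; step 5 would write the bytes to the staging directory.
html = append_timesheet_table("<p>Invoice attached.</p>", [{"order": "A-1", "hours": 8}])
eml_bytes = build_eml(
    "freelancer@example.com",
    "Rechnung",
    datetime(2026, 4, 1, 9, 0, tzinfo=timezone.utc),
    html,
    [("invoice.pdf", "application/pdf", b"%PDF-1.4")],
    "TOKEN",
)
```

Keeping the parsing (extract-msg, Excel reading) outside these functions means the MIME construction can be tested without any .msg fixtures.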

3. Pipeline Trigger (Babashka)

  1. Upload each .eml to S3: incoming/{uuid}.eml
  2. Send one SQS message per EML to the acquisition queue (S3 event notification format)
  3. Pipeline takes over: SES acquisition → triage → ingestion → extraction → post-processing
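
The message body in step 2 mimics what S3 would publish for a newly created object. The sketch below only builds the object key and the JSON payload; the upload and the send themselves would go through an AWS SDK pointed at LocalStack and are left out. The bucket name is hypothetical, and which fields the acquisition consumer actually reads is an assumption, so only the commonly used subset of the S3 event structure is shown.

```python
import json
import uuid


def new_eml_key():
    """Return an object key of the form incoming/{uuid}.eml."""
    return f"incoming/{uuid.uuid4()}.eml"


def s3_event_body(bucket, key, size):
    """Return a JSON string shaped like an S3 ObjectCreated event notification."""
    event = {
        "Records": [
            {
                "eventVersion": "2.1",
                "eventSource": "aws:s3",
                "eventName": "ObjectCreated:Put",
                "s3": {
                    "bucket": {"name": bucket, "arn": f"arn:aws:s3:::{bucket}"},
                    "object": {"key": key, "size": size},
                },
            }
        ]
    }
    return json.dumps(event)


# Usage: "scc-test-bucket" is a placeholder, not the real bucket name.
key = new_eml_key()
body = s3_event_body("scc-test-bucket", key, size=20480)
```

Sending a hand-built notification rather than relying on a bucket trigger keeps the test run deterministic: each EML enters the pipeline exactly once, at the moment the script sends its message.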

4. Custom Prompts

:extraction additions:

:cost-center-match additions:

:accounts-match additions:

5. HTML Report (Babashka)

Standalone script that queries all SCC documents from the DB after ingestion completes.

Generates a self-contained HTML file with per-invoice sections:

Header info:

Line items:

Per invoice: link to the document in the app
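
A rough sketch of the report generation, written in Python for consistency with the conversion step even though the design calls for a Babashka script. The input shape (title, header, line_items, url) and the example URL are hypothetical, since the concrete header and line-item fields are not fixed here. The point it illustrates is that every value is HTML-escaped and the output is one file with no external assets.

```python
from html import escape


def render_table(rows):
    """Render a list of dicts as an HTML table; columns come from the first row."""
    if not rows:
        return "<p>No line items.</p>"
    columns = list(rows[0])
    head = "".join(f"<th>{escape(str(c))}</th>" for c in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(r.get(c, '')))}</td>" for c in columns) + "</tr>"
        for r in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def render_invoice(invoice):
    """Render one per-invoice section: header info, line items, link to the app."""
    header = "".join(
        f"<dt>{escape(str(k))}</dt><dd>{escape(str(v))}</dd>"
        for k, v in invoice["header"].items()
    )
    title = escape(invoice["title"])
    url = escape(invoice["url"], quote=True)
    table = render_table(invoice["line_items"])
    return (
        f"<section><h2>{title}</h2><dl>{header}</dl>{table}"
        f'<p><a href="{url}">Open in app</a></p></section>'
    )


def render_report(invoices):
    """Return a complete, self-contained HTML document for all invoices."""
    sections = "".join(render_invoice(i) for i in invoices)
    return (
        '<!DOCTYPE html><html><head><meta charset="utf-8">'
        f"<title>SCC test ingestion</title></head><body>{sections}</body></html>"
    )


# Usage with placeholder data; real rows would come from the DB query.
report = render_report([
    {
        "title": "Example <invoice>",
        "header": {"Vendor": "Example GmbH"},
        "line_items": [{"description": "Service", "cost_center": "00000"}],
        "url": "https://app.example.test/documents/1",
    }
])
```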

6. Cost Center Dataset Structure

Flattened from three sheets of the mapping spreadsheet:

code    | employee           | vehicle_plate | location
--------|--------------------|---------------|---------------------------
10154   | Christoph Paumann  | W-78547A      |
100920  |                    |               | Hofgasse 3, 8010 Graz
10141   | Franz Dorfer       |               |
20001   | Michael Homole     | W-12068J      |
100910  |                    |               | Mantlergasse 30-32, 1130 Wien
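
To show how the flattened rows can resolve an invoice to a cost center, here is a small sketch using the five rows above. The real matching is expected to happen through the :cost-center-match prompt; the substring matcher below is a deliberately simple stand-in for illustration, and the class name, function name, and matching order (plate, then location, then employee surname) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostCenter:
    code: str
    employee: str = ""
    vehicle_plate: str = ""
    location: str = ""


COST_CENTERS = [
    CostCenter("10154", employee="Christoph Paumann", vehicle_plate="W-78547A"),
    CostCenter("100920", location="Hofgasse 3, 8010 Graz"),
    CostCenter("10141", employee="Franz Dorfer"),
    CostCenter("20001", employee="Michael Homole", vehicle_plate="W-12068J"),
    CostCenter("100910", location="Mantlergasse 30-32, 1130 Wien"),
]


def find_cost_center(text, rows=COST_CENTERS):
    """Return the first row whose plate, location, or employee surname occurs in text."""
    haystack = text.casefold()
    # More specific attributes are tried first.
    for attr in ("vehicle_plate", "location", "employee"):
        for row in rows:
            value = getattr(row, attr)
            if not value:
                continue
            # For employees, match on the surname only (last word of the name).
            needle = value.split()[-1] if attr == "employee" else value
            if needle.casefold() in haystack:
                return row
    return None


# Usage: the subject of test invoice 7 names the employee.
match = find_cost_center("Merbag GmbH, Re. 63416607, EUR 25,33 - Homole")
```

One flat table with sparse columns keeps the lookup uniform: a row is a candidate whenever any of its non-empty attributes appears in the invoice text, regardless of which sheet it came from.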