O3 Sub-Processor List -- Codebase Verification Review

Date: 12.04.2026
Reviewer: Automated codebase analysis
Scope: Verified every claim in O3 v3.0 against the actual Orcha codebase (orcha/src/, orcha/infra/, orcha/resources/) and infrastructure (CDK stacks)

Sources Consulted

Source	Access Method
`infra/stacks/foundation_stack.py`	Direct read -- CDK stack defining S3, SQS, SES, Cognito, KMS, SSM
`infra/stacks/data_stack.py`	Direct read -- RDS PostgreSQL configuration
`infra/stacks/compute_stack.py`	Direct read -- ALB, EC2, security groups, TLS certificates
`infra/app.py`	Direct read -- region configuration (`eu-central-1`)
`src/com/getorcha/ai/llm.clj`	Direct read -- Anthropic and Google Gemini HTTP clients
`src/com/getorcha/ai/agent.clj`	Direct read -- LangChain4j agent loop (Anthropic + Google)
`src/com/getorcha/workers/ap/ingestion/extraction.clj`	Direct read -- invoice extraction prompts sent to Anthropic
`src/com/getorcha/workers/ap/ingestion/classification.clj`	Direct read -- document classification via Anthropic/Google
`src/com/getorcha/workers/ap/ingestion/transcription.clj`	Direct read -- Google Document AI OCR + Gemini vision
`src/com/getorcha/workers/ap/acquisition/email/triage.clj`	Direct read -- email triage via Google Gemini
`src/com/getorcha/workers/ap/acquisition/email/ses.clj`	Direct read -- SES inbound email parsing
`src/com/getorcha/workers/ap/acquisition/email/outlook.clj`	Direct read -- Microsoft Graph API Outlook sync
`src/com/getorcha/workers/ap/acquisition/email/gmail.clj`	Direct read -- Gmail API + Pub/Sub integration
`src/com/getorcha/oauth/providers/microsoft.clj`	Direct read -- Microsoft Entra ID OAuth/OIDC
`src/com/getorcha/app/http/login.clj`	Direct read -- Microsoft SSO (bypasses Cognito)
`src/com/getorcha/notifications.clj`	Direct read -- Slack API, Microsoft Bot Framework, SES email sending
`src/com/getorcha/search.clj`	Direct read -- Google Vertex AI embeddings
`src/com/getorcha/integrations/ap/maesn.clj`	Direct read -- Maesn/DATEV API integration
`src/com/getorcha/workers/ap/matching/searchable_text.clj`	Direct read -- searchable text construction for embeddings
`src/com/getorcha/link/mcp/file_store/google_drive.clj`	Direct read -- Google Drive file access
`resources/com/getorcha/config.edn`	Direct read -- all service configuration
`deps.edn`	Direct read -- dependency declarations
Full-text search across `src/`, `infra/`, `resources/`	Grep for: openai, anthropic, google, microsoft, slack, teams, maesn, datev, mask, redact, pii, cdn, cloudfront, zero-retention

Changes Made

1. REMOVED: "CDN" from AWS purpose (Row 1)

Original claim: "Hosting, database, storage, CDN, authentication"
Finding: No CloudFront or CDN configuration exists anywhere in infra/. Zero grep hits for cloudfront or cdn in the infra directory. Traffic flows directly Internet -> ALB -> EC2.
Confidence: High. Searched all infra files.
Action: Removed "CDN". Expanded AWS purpose to list actual services: RDS PostgreSQL, S3, SQS, SES, Cognito, KMS, Secrets Manager.

2. CORRECTED: Anthropic data categories (Row 2)

Original claim: "Document contents (minimized)"
Finding: No data minimization exists. Full OCR text -- including IBANs, VAT IDs, bank details (BIC, account numbers), addresses, personal names -- is sent verbatim to https://api.anthropic.com/v1/messages.

Source: extraction.clj sends the complete transcription string (full OCR output) plus optionally email-body-text. The extraction prompt requests all fields including "IBAN", "BIC", "bank_name", "account_name".

Source: Grep for mask, redact, anonymiz, sanitiz.*pii, scrub across all of src/ returned only one hit -- the word "fieldMask" in the Google Document AI request (an API field selector, not PII masking).

Confidence: High. Comprehensive search confirms no masking/redaction logic exists.
Action: Removed "(minimized)". Updated to: "Document contents (full text including financial data, IBANs, VAT IDs, addresses)".
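The pattern search described above can be reproduced with a short script. The following is a minimal sketch, not the tooling used for this review: the function name `find_hits` and the file-suffix filter are assumptions, and only the pattern list comes from the finding itself.

```python
import re
from pathlib import Path

# Patterns quoted in the finding; matching is case-insensitive.
PATTERNS = ["mask", "redact", "anonymiz", r"sanitiz.*pii", "scrub"]


def find_hits(root, patterns=PATTERNS, suffixes=(".clj", ".edn", ".py")):
    """Return (path, line_number, line) for every line matching any pattern.

    The suffix filter is an assumption made for this sketch.
    """
    regex = re.compile("|".join(f"(?:{p})" for p in patterns), re.IGNORECASE)
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

A hit such as "fieldMask" still needs manual review: the search only shows where a pattern occurs, not whether the surrounding code performs PII masking.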

3. CORRECTED: Google DC location and purpose (Row 3)

Original claim: DC Location "EU (configured)" for all Google services.
Finding: Only partially true.

Source: config.edn -- Document AI uses location: "eu" (endpoint: eu-documentai.googleapis.com)
Source: config.edn -- Vertex AI embeddings use location: "europe-west1" (endpoint: europe-west1-aiplatform.googleapis.com)
Source: llm.clj -- Gemini uses generativelanguage.googleapis.com -- this is a global endpoint, not EU-configured.

Confidence: High. Verified endpoints directly in code.
Action: Updated DC Location to: "EU for OCR (eu) and embeddings (europe-west1); global endpoint for Gemini AI".
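The endpoint comparison can be expressed as a small check on hostnames. This is a sketch under stated assumptions: the function name `endpoint_residency` and the prefix list are hypothetical, and a hostname prefix only indicates which endpoint is addressed, not where the provider ultimately processes the data.

```python
from urllib.parse import urlparse

# Location prefixes treated as EU for this sketch (an assumption, not an
# exhaustive list of Google Cloud locations).
EU_PREFIXES = ("eu-", "europe-")

# Hostnames as cited in the finding above.
ENDPOINTS = {
    "Document AI": "https://eu-documentai.googleapis.com",
    "Vertex AI embeddings": "https://europe-west1-aiplatform.googleapis.com",
    "Gemini": "https://generativelanguage.googleapis.com",
}


def endpoint_residency(url):
    """Return 'eu' if the hostname carries an EU location prefix, else 'not-eu'."""
    host = urlparse(url).hostname or ""
    if not host.endswith(".googleapis.com"):
        raise ValueError(f"not a googleapis.com endpoint: {url!r}")
    return "eu" if host.startswith(EU_PREFIXES) else "not-eu"


if __name__ == "__main__":
    for service, url in ENDPOINTS.items():
        print(f"{service}: {endpoint_residency(url)}")
```

A check of this kind could run in CI against the configured endpoints so that a change of location in config.edn is noticed before the sub-processor list goes stale.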

Original claim: Purpose "OCR, AI post-processing, email triage, semantic search embeddings"
Finding: Incomplete. Google is used for significantly more:

OCR (Document AI)
Vision transcription (Gemini -- PDF page images at 150 DPI)
Document classification (Gemini)
Email triage (Gemini -- multimodal with thumbnails)
Post-processing: cost center, accounts, accruals, financial validation, uncertain validations (Gemini)
Document matching / LLM decision (Gemini)
Supplier verification with Google Search (Gemini)
Semantic search embeddings (Vertex AI)

Confidence: High. Enumerated from actual callsites in code.
Action: Expanded purpose list. Also expanded data categories to include document images and supplier data.

4. ADDED: Microsoft Corporation (New Row 4)

Original: Not listed at all.
Finding: Microsoft services are used in three distinct ways:

SSO Authentication (Entra ID): oauth/providers/microsoft.clj implements full OAuth/OIDC against login.microsoftonline.com. Microsoft auth bypasses Cognito entirely due to multi-tenant issuer validation issues. Processes: user email, name, tenant ID.
Outlook Email Acquisition (Graph API): workers/ap/acquisition/email/outlook.clj uses graph.microsoft.com/v1.0 for delta sync of email messages and webhook subscriptions. Accesses: full email contents, attachments, metadata.
Teams Notifications (Bot Framework): notifications.clj sends adaptive card messages via Bot Framework. Sends: notification title and body containing document status, supplier names.

Confidence: High. Code clearly establishes all three integration points.
Action: Added Microsoft as sub-processor #4.

5. ADDED: Slack Technologies (New Row 5)

Original: Not listed.
Finding: notifications.clj sends messages to Slack via slack.com/api/chat.postMessage using OAuth access tokens. Additionally, app/http/settings/notifications.clj implements the Slack OAuth flow for channel setup. Notification messages contain document processing status, which may include supplier names and document references.

Confidence: High. Direct API calls confirmed in code.
Action: Added Slack as sub-processor #5.

6. ADDED: Maesn GmbH (New Row 6)

Original: Not listed.
Finding: integrations/ap/maesn.clj sends full invoice structured data to api.maesn.dev for DATEV booking proposal creation. Data sent includes:

Issuer/recipient names, addresses, IBANs, VAT IDs
Invoice amounts, line items, payment terms
File uploads (cover pages with invoice images)
Ledger queries and account lookups

Maesn is a German company acting as intermediary to the DATEV accounting system.

Confidence: High. API base URL https://api.maesn.dev and full payload construction visible in code.
Action: Added Maesn as sub-processor #6 with "No third country transfer" (German company).

7. REMOVED: OpenAI (Former Row 5)

Original claim: "OpenAI OpCo, LLC -- AI data extraction, classification (pre-approved, not yet active)"
Finding: Zero references to OpenAI anywhere in the codebase. Grep for openai, open-ai, gpt-4, gpt-3 (case-insensitive) across the entire repository returned no results. No SSM parameters, no config entries, no dependencies, no code.

Confidence: High. Exhaustive search.
Action: Removed from sub-processor table entirely. Removed from pre-approved AI providers list (Section 10). If OpenAI is desired as pre-approved, it should go through the approval procedure in Section 3 when actually needed.

8. CORRECTED: Amazon SES purpose and data categories (Former Row 4, now part of Row 1)

Original claim: Purpose "System and notification emails"; Data "Email addresses, sender name"
Finding: SES is used for both inbound and outbound:

Inbound (primary function): Receipt rule store-to-s3 receives emails at documents@mail.{env}.getorcha.com, stores raw .eml to S3, triggers SQS. Data processed: full email contents (from, to, subject, body, all attachments, forwarding chain, message-id).

Outbound: aws/send-email! sends notification emails and admin alerts via SES v2.

Confidence: High. Infrastructure (foundation_stack.py lines 504-517) and application code (ses.clj, notifications.clj) both confirm.
Action: Merged SES into the AWS row (Row 1) since it's the same legal entity (AWS EMEA SARL). Expanded purpose to "email receiving and sending (SES)".
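To illustrate why the inbound path widens the data categories, the sketch below parses one raw .eml message with Python's standard library and lists what it carries. It is not the parsing code in ses.clj; the function name `summarize_eml` is hypothetical and the sketch only shows the kind of content a stored raw message contains.

```python
from email import message_from_bytes, policy


def summarize_eml(raw_bytes):
    """List the data categories present in one raw inbound message."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    # Prefer the plain-text body, fall back to HTML if that is all there is.
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "from": str(msg.get("From", "")),
        "to": str(msg.get("To", "")),
        "subject": str(msg.get("Subject", "")),
        "message_id": str(msg.get("Message-ID", "")),
        "body_chars": len(body.get_content()) if body is not None else 0,
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }
```

Every field in the returned summary is personal or business data held in the raw message, which is why "Email addresses, sender name" understates what SES processes on the inbound side.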

9. CORRECTED: Section 8 -- Data Minimization claims

Original claim: "verification of masking of sensitive fields (IBAN, tax numbers)"
Finding: False. No masking, redaction, or field-level filtering exists before AI API calls. Full document text with all PII is sent verbatim.

The only IBAN-related processing is:

normalize-issuer-iban in extraction.clj -- strips spaces from extracted IBANs after LLM returns them (output normalization, not input masking)
format-iban in components.clj -- display formatting (groups of 4 chars)
searchable_text.clj -- explicitly excludes IBAN from embedding text (but this is for search, not for AI extraction)

Confidence: High. Comprehensive grep confirms no masking logic.
Action: Rewrote Section 8 to accurately describe what data is sent to each provider, removed false masking claim, added TODO recommending PII masking implementation.
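The difference between output normalization, display formatting, and input masking can be made concrete. The sketch below is illustrative only: the codebase is Clojure, the Python function names are hypothetical stand-ins for normalize-issuer-iban and format-iban, and `mask_ibans` shows one possible shape of the masking step that does not exist today. The regular expression checks the IBAN shape only, not the checksum, and may over-match an adjacent uppercase token.

```python
import re

# Loose IBAN shape: country code, two check digits, then groups of four
# alphanumerics with optional single spaces, plus a short tail.
IBAN_RE = re.compile(
    r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"
)


def normalize_iban(iban):
    """Strip whitespace and uppercase (output normalization, not masking)."""
    return re.sub(r"\s+", "", iban).upper()


def format_iban(iban):
    """Group into blocks of four characters (display formatting, not masking)."""
    compact = normalize_iban(iban)
    return " ".join(compact[i:i + 4] for i in range(0, len(compact), 4))


def mask_ibans(text):
    """Mask each IBAN-shaped token, keeping country code and last four characters."""
    def _mask(match):
        compact = normalize_iban(match.group(0))
        return compact[:2] + "*" * (len(compact) - 6) + compact[-4:]
    return IBAN_RE.sub(_mask, text)
```

Only the last function changes what leaves the system. Masking before extraction would also stop the model from returning the IBAN, so a real design would need tokenization with a reversible mapping or a separate non-LLM extraction path for bank details.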

10. CORRECTED: Section 8 -- Zero data retention

Original claim: "Zero data retention policy applies" and "Binding provisions in O13 DPA Section 4.4"
Finding: No API-level zero-retention settings are configured in the code. Specifically:

No anthropic-beta headers for Anthropic's zero-retention feature
No Google-specific data retention configuration
Whether zero-retention is enforced depends entirely on contractual (DPA) terms, which cannot be verified from code.

Confidence: Medium. The contractual provisions may exist but are not technically enforced at the API level. This is a gap worth investigating.
Action: Added TODO flag: "Verify that zero-retention provisions are contractually binding in current DPAs -- no API-level zero-retention headers are currently set in the application code."

11. CORRECTED: Version number inconsistency

Original: Header says "Version 5.0" but change history only goes to 3.0.
Action: Set to Version 4.0 to follow sequentially from the change history.

12. CORRECTED: Section 9 -- TIA references

Original claim: "TIAs for Anthropic, Google, and OpenAI are available as separate documents"
Action: Removed OpenAI reference. Added TODO to verify TIA documents exist for Microsoft and Slack.

13. CORRECTED: Section 10 -- Pre-approved AI providers

Original: Listed Anthropic, Google, and OpenAI.
Action: Removed OpenAI (not integrated). List now contains only Anthropic and Google.

14. ADDED: Notes on Gmail/Outlook/Drive data flows

New: Added explanatory section under the sub-processor table clarifying:

Gmail and Outlook email acquisition involves Orcha accessing data held at Google/Microsoft via OAuth
Google Drive integration reads files from customer's Drive
Slack/Teams notifications are customer-configured channels

Confidence Assessment

Finding	Confidence	Basis
No CDN/CloudFront exists	High	Exhaustive search of infra/
No data minimization/masking before AI calls	High	Exhaustive grep of src/ for mask/redact/pii patterns
No OpenAI integration	High	Exhaustive search, zero hits
Microsoft used for auth + Outlook + Teams	High	Multiple source files, direct API URLs
Slack used for notifications	High	Direct `slack.com/api` calls in code
Maesn used for DATEV integration	High	Direct `api.maesn.dev` calls in code
Gemini uses global endpoint (not EU)	High	URL pattern in `llm.clj` uses `generativelanguage.googleapis.com`
Document AI uses EU endpoint	High	Config value `location: "eu"`, endpoint pattern includes location
Vertex AI uses europe-west1	High	Config value `location: "europe-west1"`
Full financial PII sent to Anthropic	High	Extraction prompt and transcription pass-through verified
No API-level zero-retention headers	Medium	No headers found, but contractual terms not verifiable from code
SCCs/DPA/TIA existence for each provider	Unknown	Legal documents not in codebase; cannot verify
§203 StGB contractual compliance	Unknown	Legal assessment not verifiable from code

Retractions

None. All claims in this review are supported by direct code evidence cited above.

Recommended Actions (TODOs)

Implement PII masking before AI API calls -- IBANs, tax numbers, and bank details should be masked or tokenized before transmission to Anthropic and Google. This is stated as a principle in Section 8 but not implemented.
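Confirm how Anthropic zero data retention is enforced -- The application sets no retention-related request header. Check with Anthropic whether zero retention can be set per request (for example through an anthropic-beta header) or only as an account-level arrangement, and enforce it technically where possible rather than relying solely on contractual terms.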
Investigate Gemini data residency -- The Gemini API (generativelanguage.googleapis.com) is a global endpoint. If EU data residency is required, consider migrating to Vertex AI Gemini endpoints which support regional configuration.
Verify/create TIA documents -- Ensure Transfer Impact Assessments exist and are current for all third-country sub-processors: Anthropic, Google, Microsoft, Slack.
Verify DPA coverage -- Ensure DPAs are in place with Microsoft (for Graph API, Bot Framework), Slack, and Maesn.
Review notification content -- Assess whether notification messages sent to Slack/Teams contain personal data that requires additional safeguards.
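For the Gemini data residency item in the list above, the difference between the two endpoint styles is visible in the request URL. The sketch below builds a regional Vertex AI generateContent URL. The project ID, location, and model name are placeholders, and whether a given model is offered in a given region has to be checked against Google's documentation.

```python
# Host currently used for Gemini calls according to this review.
GEMINI_API_HOST = "generativelanguage.googleapis.com"


def vertex_generate_content_url(project, location, model):
    """Build a regional Vertex AI generateContent URL.

    All three arguments are placeholders supplied by the caller; nothing
    here is taken from the Orcha configuration.
    """
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{location}/"
        f"publishers/google/models/{model}:generateContent"
    )


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(vertex_generate_content_url("example-project", "europe-west1", "example-model"))
```

Unlike the global host, the regional URL carries the location in both the hostname and the path, so the region is explicit in configuration and can be audited the same way as the Document AI and embeddings endpoints.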