O3 Sub-Processor List -- Codebase Verification Review

Date: 12.04.2026
Reviewer: Automated codebase analysis
Scope: Verified every claim in O3 v3.0 against the actual Orcha codebase (orcha/src/, orcha/infra/, orcha/resources/) and infrastructure (CDK stacks)

Sources Consulted

Source | Access Method
infra/stacks/foundation_stack.py | Direct read -- CDK stack defining S3, SQS, SES, Cognito, KMS, SSM
infra/stacks/data_stack.py | Direct read -- RDS PostgreSQL configuration
infra/stacks/compute_stack.py | Direct read -- ALB, EC2, security groups, TLS certificates
infra/app.py | Direct read -- region configuration (eu-central-1)
src/com/getorcha/ai/llm.clj | Direct read -- Anthropic and Google Gemini HTTP clients
src/com/getorcha/ai/agent.clj | Direct read -- LangChain4j agent loop (Anthropic + Google)
src/com/getorcha/workers/ap/ingestion/extraction.clj | Direct read -- invoice extraction prompts sent to Anthropic
src/com/getorcha/workers/ap/ingestion/classification.clj | Direct read -- document classification via Anthropic/Google
src/com/getorcha/workers/ap/ingestion/transcription.clj | Direct read -- Google Document AI OCR + Gemini vision
src/com/getorcha/workers/ap/acquisition/email/triage.clj | Direct read -- email triage via Google Gemini
src/com/getorcha/workers/ap/acquisition/email/ses.clj | Direct read -- SES inbound email parsing
src/com/getorcha/workers/ap/acquisition/email/outlook.clj | Direct read -- Microsoft Graph API Outlook sync
src/com/getorcha/workers/ap/acquisition/email/gmail.clj | Direct read -- Gmail API + Pub/Sub integration
src/com/getorcha/oauth/providers/microsoft.clj | Direct read -- Microsoft Entra ID OAuth/OIDC
src/com/getorcha/app/http/login.clj | Direct read -- Microsoft SSO (bypasses Cognito)
src/com/getorcha/notifications.clj | Direct read -- Slack API, Microsoft Bot Framework, SES email sending
src/com/getorcha/search.clj | Direct read -- Google Vertex AI embeddings
src/com/getorcha/integrations/ap/maesn.clj | Direct read -- Maesn/DATEV API integration
src/com/getorcha/workers/ap/matching/searchable_text.clj | Direct read -- searchable text construction for embeddings
src/com/getorcha/link/mcp/file_store/google_drive.clj | Direct read -- Google Drive file access
resources/com/getorcha/config.edn | Direct read -- all service configuration
deps.edn | Direct read -- dependency declarations
Full-text search across src/, infra/, resources/ | Grep for: openai, anthropic, google, microsoft, slack, teams, maesn, datev, mask, redact, pii, cdn, cloudfront, zero-retention

Changes Made

1. REMOVED: "CDN" from AWS purpose (Row 1)

Original claim: "Hosting, database, storage, CDN, authentication"

Finding: No CloudFront or CDN configuration exists anywhere in infra/. Zero grep hits for cloudfront or cdn in the infra directory. Traffic flows directly Internet -> ALB -> EC2.

Confidence: High. Searched all infra files.

Action: Removed "CDN". Expanded AWS purpose to list actual services: RDS PostgreSQL, S3, SQS, SES, Cognito, KMS, Secrets Manager.

2. CORRECTED: Anthropic data categories (Row 2)

Original claim: "Document contents (minimized)"

Finding: No data minimization exists. Full OCR text -- including IBANs, VAT IDs, bank details (BIC, account numbers), addresses, personal names -- is sent verbatim to https://api.anthropic.com/v1/messages.

Source: extraction.clj sends the complete transcription string (full OCR output) plus optionally email-body-text. The extraction prompt requests all fields including "IBAN", "BIC", "bank_name", "account_name".

Source: Grep for mask, redact, anonymiz, sanitiz.*pii, scrub across all of src/ returned only one hit -- the word "fieldMask" in the Google Document AI request (an API field selector, not PII masking).

Confidence: High. Comprehensive search confirms no masking/redaction logic exists.

Action: Removed "(minimized)". Updated to: "Document contents (full text including financial data, IBANs, VAT IDs, addresses)".
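
The pattern search described in this item could be reproduced with a short script along the following lines. This is a minimal sketch, not the tooling actually used for the review: the function name scan_tree, the file suffix list, and the case-insensitive matching are assumptions; only the search terms themselves come from the finding above.

```python
import re
from pathlib import Path

# Search terms taken from the review; everything else here is illustrative.
PATTERNS = ["mask", "redact", "anonymiz", r"sanitiz.*pii", "scrub"]


def scan_tree(root, patterns=PATTERNS, suffixes=(".clj", ".cljc", ".edn", ".py")):
    """Return (path, line_number, line) for every source line matching any pattern."""
    regex = re.compile("|".join(patterns), re.IGNORECASE)
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and files outside the assumed set of source suffixes.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

Each hit still needs manual review: a match such as "fieldMask" satisfies the pattern but is an API field selector, not PII masking.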

3. CORRECTED: Google DC location and purpose (Row 3)

Original claim: DC Location "EU (configured)" for all Google services.

Finding: Only partially true.

Source: config.edn -- Document AI uses location: "eu" (endpoint: eu-documentai.googleapis.com)

Source: config.edn -- Vertex AI embeddings use location: "europe-west1" (endpoint: europe-west1-aiplatform.googleapis.com)

Source: llm.clj -- Gemini uses generativelanguage.googleapis.com -- this is a global endpoint, not EU-configured.

Confidence: High. Verified endpoints directly in code.

Action: Updated DC Location to: "EU for OCR (eu) and embeddings (europe-west1); global endpoint for Gemini AI".
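
The endpoint distinction above can be checked mechanically. The following is a minimal sketch that reads the region prefix from a hostname; the function name endpoint_region and the prefix pattern are assumptions made for illustration, and the three hostnames are the ones cited in this finding. A regional hostname only shows where requests are sent; it does not by itself establish contractual data residency.

```python
import re
from urllib.parse import urlparse

# Matches an "eu-" multi-region prefix or a "europe-<zone><n>-" regional prefix.
REGION_PREFIX = re.compile(r"^(eu|europe-[a-z]+\d+)-")


def endpoint_region(url):
    """Return the EU region prefix encoded in the hostname, or 'global' if there is none."""
    # Accept bare hostnames as well as full URLs.
    host = urlparse(url if "://" in url else f"https://{url}").hostname or ""
    match = REGION_PREFIX.match(host)
    return match.group(1) if match else "global"


for endpoint in (
    "eu-documentai.googleapis.com",
    "europe-west1-aiplatform.googleapis.com",
    "generativelanguage.googleapis.com",
):
    print(endpoint, "->", endpoint_region(endpoint))
```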

Original claim: Purpose "OCR, AI post-processing, email triage, semantic search embeddings"

Finding: Incomplete. Google is used for significantly more:

Confidence: High. Enumerated from actual callsites in code.

Action: Expanded purpose list. Also expanded data categories to include document images and supplier data.

4. ADDED: Microsoft Corporation (New Row 4)

Original: Not listed at all.

Finding: Microsoft services are used in three distinct ways:

  1. SSO Authentication (Entra ID): oauth/providers/microsoft.clj implements full OAuth/OIDC against login.microsoftonline.com. Microsoft auth bypasses Cognito entirely due to multi-tenant issuer validation issues. Processes: user email, name, tenant ID.

  2. Outlook Email Acquisition (Graph API): workers/ap/acquisition/email/outlook.clj uses graph.microsoft.com/v1.0 for delta sync of email messages and webhook subscriptions. Accesses: full email contents, attachments, metadata.

  3. Teams Notifications (Bot Framework): notifications.clj sends adaptive card messages via Bot Framework. Sends: notification title and body containing document status, supplier names.

Confidence: High. Code clearly establishes all three integration points.

Action: Added Microsoft as sub-processor #4.

5. ADDED: Slack Technologies (New Row 5)

Original: Not listed.

Finding: notifications.clj sends messages to Slack via slack.com/api/chat.postMessage using OAuth access tokens. Additionally, app/http/settings/notifications.clj implements the Slack OAuth flow for channel setup. Notification messages contain document processing status, which may include supplier names and document references.

Confidence: High. Direct API calls confirmed in code.

Action: Added Slack as sub-processor #5.

6. ADDED: Maesn GmbH (New Row 6)

Original: Not listed.

Finding: integrations/ap/maesn.clj sends full invoice structured data to api.maesn.dev for DATEV booking proposal creation. Data sent includes:

Maesn is a German company acting as intermediary to the DATEV accounting system.

Confidence: High. API base URL https://api.maesn.dev and full payload construction visible in code.

Action: Added Maesn as sub-processor #6 with "No third country transfer" (German company).

7. REMOVED: OpenAI (Former Row 5)

Original claim: "OpenAI OpCo, LLC -- AI data extraction, classification (pre-approved, not yet active)"

Finding: Zero references to OpenAI anywhere in the codebase. Grep for openai, open-ai, gpt-4, gpt-3 (case-insensitive) across the entire repository returned no results. No SSM parameters, no config entries, no dependencies, no code.

Confidence: High. Exhaustive search.

Action: Removed from sub-processor table entirely. Removed from pre-approved AI providers list (Section 10). If OpenAI is desired as pre-approved, it should go through the approval procedure in Section 3 when actually needed.

8. CORRECTED: Amazon SES purpose and data categories (Former Row 4, now part of Row 1)

Original claim: Purpose "System and notification emails"; Data "Email addresses, sender name"

Finding: SES is used for both inbound and outbound:

Inbound (primary function): Receipt rule store-to-s3 receives emails at documents@mail.{env}.getorcha.com, stores raw .eml to S3, triggers SQS. Data processed: full email contents (from, to, subject, body, all attachments, forwarding chain, message-id).

Outbound: aws/send-email! sends notification emails and admin alerts via SES v2.

Confidence: High. Infrastructure (foundation_stack.py lines 504-517) and application code (ses.clj, notifications.clj) both confirm.

Action: Merged SES into the AWS row (Row 1) since it's the same legal entity (AWS EMEA SARL). Expanded purpose to "email receiving and sending (SES)".
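
To make the inbound data categories in this item concrete, the sketch below parses a raw .eml payload with Python's standard email package and lists the fields a stored message exposes. It is illustrative only and is not the parsing logic in ses.clj; the function name summarize_eml and the shape of the returned dictionary are assumptions.

```python
from email import message_from_bytes
from email.policy import default


def summarize_eml(raw_bytes):
    """Summarize the data categories present in one raw .eml message."""
    msg = message_from_bytes(raw_bytes, policy=default)
    # Prefer the plain-text body, falling back to HTML if that is all there is.
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "from": str(msg.get("From", "")),
        "to": str(msg.get("To", "")),
        "subject": str(msg.get("Subject", "")),
        "message_id": str(msg.get("Message-ID", "")),
        "body_chars": len(body.get_content()) if body is not None else 0,
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }
```

Even this small summary touches sender and recipient addresses, subject, body text, and attachment names, which is considerably broader than the "Email addresses, sender name" stated in the original claim.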

9. CORRECTED: Section 8 -- Data Minimization claims

Original claim: "verification of masking of sensitive fields (IBAN, tax numbers)"

Finding: False. No masking, redaction, or field-level filtering exists before AI API calls. Full document text with all PII is sent verbatim.

The only IBAN-related processing is:

Confidence: High. Comprehensive grep confirms no masking logic.

Action: Rewrote Section 8 to accurately describe what data is sent to each provider, removed false masking claim, added TODO recommending PII masking implementation.

10. CORRECTED: Section 8 -- Zero data retention

Original claim: "Zero data retention policy applies" and "Binding provisions in O13 DPA Section 4.4"

Finding: No API-level zero-retention settings are configured in the code. Specifically:

Confidence: Medium. The contractual provisions may exist but are not technically enforced at the API level. This is a gap worth investigating.

Action: Added TODO flag: "Verify that zero-retention provisions are contractually binding in current DPAs -- no API-level zero-retention headers are currently set in the application code."

11. CORRECTED: Version number inconsistency

Original: Header says "Version 5.0" but change history only goes to 3.0.

Action: Set to Version 4.0 to follow sequentially from the change history.

12. CORRECTED: Section 9 -- TIA references

Original claim: "TIAs for Anthropic, Google, and OpenAI are available as separate documents"

Action: Removed OpenAI reference. Added TODO to verify TIA documents exist for Microsoft and Slack.

13. CORRECTED: Section 10 -- Pre-approved AI providers

Original: Listed Anthropic, Google, and OpenAI.

Action: Removed OpenAI (not integrated). List now contains only Anthropic and Google.

14. ADDED: Notes on Gmail/Outlook/Drive data flows

New: Added explanatory section under the sub-processor table clarifying:


Confidence Assessment

Finding | Confidence | Basis
No CDN/CloudFront exists | High | Exhaustive search of infra/
No data minimization/masking before AI calls | High | Exhaustive grep of src/ for mask/redact/pii patterns
No OpenAI integration | High | Exhaustive search, zero hits
Microsoft used for auth + Outlook + Teams | High | Multiple source files, direct API URLs
Slack used for notifications | High | Direct slack.com/api calls in code
Maesn used for DATEV integration | High | Direct api.maesn.dev calls in code
Gemini uses global endpoint (not EU) | High | URL pattern in llm.clj uses generativelanguage.googleapis.com
Document AI uses EU endpoint | High | Config value location: "eu", endpoint pattern includes location
Vertex AI uses europe-west1 | High | Config value location: "europe-west1"
Full financial PII sent to Anthropic | High | Extraction prompt and transcription pass-through verified
No API-level zero-retention headers | Medium | No headers found, but contractual terms not verifiable from code
SCCs/DPA/TIA existence for each provider | Unknown | Legal documents not in codebase; cannot verify
§203 StGB contractual compliance | Unknown | Legal assessment not verifiable from code

Retractions

None. All claims in this review are supported by direct code evidence cited above.

Recommendations

  1. Implement PII masking before AI API calls -- IBANs, tax numbers, and bank details should be masked or tokenized before transmission to Anthropic and Google. This is stated as a principle in Section 8 but not implemented.
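  2. Enforce Anthropic zero data retention technically where possible -- Confirm with Anthropic whether zero data retention is enabled through an account-level arrangement or a request-level setting (such as an API header), and configure it accordingly, rather than relying solely on contractual terms.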
  3. Investigate Gemini data residency -- The Gemini API (generativelanguage.googleapis.com) is a global endpoint. If EU data residency is required, consider migrating to Vertex AI Gemini endpoints which support regional configuration.
  4. Verify/create TIA documents -- Ensure Transfer Impact Assessments exist and are current for all third-country sub-processors: Anthropic, Google, Microsoft, Slack.
  5. Verify DPA coverage -- Ensure DPAs are in place with Microsoft (for Graph API, Bot Framework), Slack, and Maesn.
  6. Review notification content -- Assess whether notification messages sent to Slack/Teams contain personal data that requires additional safeguards.
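
Recommendation 1 could start from a reversible tokenization step such as the one sketched below. Because the extraction prompt asks for the IBAN itself, plain masking would lose the value; replacing it with a placeholder and restoring it in the model output keeps the field recoverable without transmitting it. This is a minimal sketch under stated assumptions: the names tokenize_ibans and restore_ibans, the placeholder format, and the regular expression are all hypothetical and do not exist in the Orcha codebase, and the pattern performs no checksum validation.

```python
import re

# IBAN-like strings: country code, two check digits, then 4-character groups
# with optional single spaces between them. No mod-97 checksum is applied.
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b")


def tokenize_ibans(text):
    """Replace IBAN-like strings with placeholders; return (masked_text, mapping)."""
    mapping = {}

    def _swap(match):
        token = f"<IBAN_{len(mapping) + 1}>"
        mapping[token] = match.group(0)
        return token

    return IBAN_RE.sub(_swap, text), mapping


def restore_ibans(text, mapping):
    """Substitute the original values back into text returned by the provider."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


masked, mapping = tokenize_ibans("Pay to DE89 3704 0044 0532 0130 00, BIC COBADEFFXXX")
print(masked)
```

The same approach would need separate patterns for tax numbers and other bank details, and the mapping would have to stay inside the application boundary for the lifetime of the request.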