Date: 12.04.2026
Reviewer: Automated codebase analysis
Scope: Verified every claim in O3 v3.0 against the actual Orcha codebase (orcha/src/, orcha/infra/, orcha/resources/) and infrastructure (CDK stacks)
| Source | Access Method |
|---|---|
| infra/stacks/foundation_stack.py | Direct read -- CDK stack defining S3, SQS, SES, Cognito, KMS, SSM |
| infra/stacks/data_stack.py | Direct read -- RDS PostgreSQL configuration |
| infra/stacks/compute_stack.py | Direct read -- ALB, EC2, security groups, TLS certificates |
| infra/app.py | Direct read -- region configuration (eu-central-1) |
| src/com/getorcha/ai/llm.clj | Direct read -- Anthropic and Google Gemini HTTP clients |
| src/com/getorcha/ai/agent.clj | Direct read -- LangChain4j agent loop (Anthropic + Google) |
| src/com/getorcha/workers/ap/ingestion/extraction.clj | Direct read -- invoice extraction prompts sent to Anthropic |
| src/com/getorcha/workers/ap/ingestion/classification.clj | Direct read -- document classification via Anthropic/Google |
| src/com/getorcha/workers/ap/ingestion/transcription.clj | Direct read -- Google Document AI OCR + Gemini vision |
| src/com/getorcha/workers/ap/acquisition/email/triage.clj | Direct read -- email triage via Google Gemini |
| src/com/getorcha/workers/ap/acquisition/email/ses.clj | Direct read -- SES inbound email parsing |
| src/com/getorcha/workers/ap/acquisition/email/outlook.clj | Direct read -- Microsoft Graph API Outlook sync |
| src/com/getorcha/workers/ap/acquisition/email/gmail.clj | Direct read -- Gmail API + Pub/Sub integration |
| src/com/getorcha/oauth/providers/microsoft.clj | Direct read -- Microsoft Entra ID OAuth/OIDC |
| src/com/getorcha/app/http/login.clj | Direct read -- Microsoft SSO (bypasses Cognito) |
| src/com/getorcha/notifications.clj | Direct read -- Slack API, Microsoft Bot Framework, SES email sending |
| src/com/getorcha/search.clj | Direct read -- Google Vertex AI embeddings |
| src/com/getorcha/integrations/ap/maesn.clj | Direct read -- Maesn/DATEV API integration |
| src/com/getorcha/workers/ap/matching/searchable_text.clj | Direct read -- searchable text construction for embeddings |
| src/com/getorcha/link/mcp/file_store/google_drive.clj | Direct read -- Google Drive file access |
| resources/com/getorcha/config.edn | Direct read -- all service configuration |
| deps.edn | Direct read -- dependency declarations |
| Full-text search across src/, infra/, resources/ | Grep for: openai, anthropic, google, microsoft, slack, teams, maesn, datev, mask, redact, pii, cdn, cloudfront, zero-retention |
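The full-text search in the last row can be approximated with a short script. This is a minimal sketch, not the tool actually used for the review: the function name `scan_tree` and the returned data shape are assumptions for illustration, and only the keyword list is taken from the table above. Matching is by case-insensitive substring, so a hit such as "fieldMask" is reported under "mask" and still needs manual triage.

```python
import re
from pathlib import Path

# Keywords copied from the review's grep list.
KEYWORDS = [
    "openai", "anthropic", "google", "microsoft", "slack", "teams",
    "maesn", "datev", "mask", "redact", "pii", "cdn", "cloudfront",
    "zero-retention",
]


def scan_tree(roots, keywords=KEYWORDS):
    """Return {keyword: [(path, line_number), ...]} for case-insensitive hits.

    Hypothetical helper: walks every file under each root and records the
    file and line where a keyword occurs as a substring.
    """
    lowered = [k.lower() for k in keywords]
    pattern = re.compile("|".join(re.escape(k) for k in lowered), re.IGNORECASE)
    hits = {k: [] for k in lowered}
    for root in roots:
        for path in sorted(Path(root).rglob("*")):
            if not path.is_file():
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            for number, line in enumerate(text.splitlines(), start=1):
                for match in pattern.finditer(line):
                    hits[match.group(0).lower()].append((str(path), number))
    return hits
```

Calling `scan_tree(["src", "infra", "resources"])` from the repository root would give one list of locations per keyword; an empty list corresponds to a "zero hits" finding.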
Original claim: "Hosting, database, storage, CDN, authentication"
Finding: No CloudFront or CDN configuration exists anywhere in infra/. Zero grep hits for cloudfront or cdn in the infra directory. Traffic flows directly Internet -> ALB -> EC2.
Confidence: High. Searched all infra files.
Action: Removed "CDN". Expanded AWS purpose to list actual services: RDS PostgreSQL, S3, SQS, SES, Cognito, KMS, Secrets Manager.
Original claim: "Document contents (minimized)"
Finding: No data minimization exists. Full OCR text -- including IBANs, VAT IDs, bank details (BIC, account numbers), addresses, personal names -- is sent verbatim to https://api.anthropic.com/v1/messages.
Source: `extraction.clj` sends the complete `transcription` string (full OCR output) plus optionally `email-body-text`. The extraction prompt requests all fields including "IBAN", "BIC", "bank_name", "account_name".
Source: Grep for `mask`, `redact`, `anonymiz`, `sanitiz.*pii`, `scrub` across all of `src/` returned only one hit -- the word "fieldMask" in the Google Document AI request (an API field selector, not PII masking).
Confidence: High. Comprehensive search confirms no masking/redaction logic exists.
Action: Removed "(minimized)". Updated to: "Document contents (full text including financial data, IBANs, VAT IDs, addresses)".
Original claim: DC Location "EU (configured)" for all Google services.
Finding: Only partially true.
Source: `config.edn` -- Document AI uses `location: "eu"` (endpoint: `eu-documentai.googleapis.com`)
Source: `config.edn` -- Vertex AI embeddings use `location: "europe-west1"` (endpoint: `europe-west1-aiplatform.googleapis.com`)
Source: `llm.clj` -- Gemini uses `generativelanguage.googleapis.com` -- this is a global endpoint, not EU-configured.
Confidence: High. Verified endpoints directly in code.
Action: Updated DC Location to: "EU for OCR (eu) and embeddings (europe-west1); global endpoint for Gemini AI".
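The endpoint check above can be expressed as a small classifier. This is a sketch under a stated assumption: regional Google API hosts follow the shape `<location>-<service>.googleapis.com`, which holds for the three hosts cited in this review but is a heuristic rather than a documented rule. The function name `google_endpoint_region` is hypothetical.

```python
from urllib.parse import urlparse


def google_endpoint_region(url_or_host):
    """Return the location prefix of a googleapis.com host, or "global".

    Heuristic (assumption): a regional host is "<location>-<service>",
    e.g. "eu-documentai"; a host without a hyphen is treated as global.
    """
    host = urlparse(url_or_host).hostname or url_or_host
    suffix = ".googleapis.com"
    if not host.endswith(suffix):
        raise ValueError(f"not a googleapis.com host: {host}")
    service = host[: -len(suffix)]
    if "-" not in service:
        return "global"
    # Split on the last hyphen so "europe-west1-aiplatform" -> "europe-west1".
    location, _, _ = service.rpartition("-")
    return location
```

Applied to the three hosts from the sources above, this separates the two EU-configured services from the global Gemini endpoint.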
Original claim: Purpose "OCR, AI post-processing, email triage, semantic search embeddings"
Finding: Incomplete. Google is used for significantly more:
Confidence: High. Enumerated from actual callsites in code.
Action: Expanded purpose list. Also expanded data categories to include document images and supplier data.
Original: Not listed at all.
Finding: Microsoft services are used in three distinct ways:
SSO Authentication (Entra ID): oauth/providers/microsoft.clj implements full OAuth/OIDC against login.microsoftonline.com. Microsoft auth bypasses Cognito entirely due to multi-tenant issuer validation issues. Processes: user email, name, tenant ID.
Outlook Email Acquisition (Graph API): workers/ap/acquisition/email/outlook.clj uses graph.microsoft.com/v1.0 for delta sync of email messages and webhook subscriptions. Accesses: full email contents, attachments, metadata.
Teams Notifications (Bot Framework): notifications.clj sends adaptive card messages via Bot Framework. Sends: notification title and body containing document status, supplier names.
Confidence: High. Code clearly establishes all three integration points.
Action: Added Microsoft as sub-processor #4.
Original: Not listed.
Finding: notifications.clj sends messages to Slack via slack.com/api/chat.postMessage using OAuth access tokens. Additionally, app/http/settings/notifications.clj implements Slack OAuth flow for channel setup. Notification messages contain document processing status which may include supplier names and document references.
Confidence: High. Direct API calls confirmed in code.
Action: Added Slack as sub-processor #5.
Original: Not listed.
Finding: integrations/ap/maesn.clj sends full invoice structured data to api.maesn.dev for DATEV booking proposal creation. Data sent includes:
Maesn is a German company acting as intermediary to the DATEV accounting system.
Confidence: High. API base URL https://api.maesn.dev and full payload construction visible in code.
Action: Added Maesn as sub-processor #6 with "No third country transfer" (German company).
Original claim: "OpenAI OpCo, LLC -- AI data extraction, classification (pre-approved, not yet active)"
Finding: Zero references to OpenAI anywhere in the codebase. Grep for openai, open-ai, gpt-4, gpt-3 (case-insensitive) across the entire repository returned no results. No SSM parameters, no config entries, no dependencies, no code.
Confidence: High. Exhaustive search.
Action: Removed from sub-processor table entirely. Removed from pre-approved AI providers list (Section 10). If OpenAI is desired as pre-approved, it should go through the approval procedure in Section 3 when actually needed.
Original claim: Purpose "System and notification emails"; Data "Email addresses, sender name"
Finding: SES is used for both inbound and outbound:
Inbound (primary function): Receipt rule store-to-s3 receives emails at documents@mail.{env}.getorcha.com, stores raw .eml to S3, triggers SQS. Data processed: full email contents (from, to, subject, body, all attachments, forwarding chain, message-id).
Outbound: aws/send-email! sends notification emails and admin alerts via SES v2.
Confidence: High. Infrastructure (foundation_stack.py lines 504-517) and application code (ses.clj, notifications.clj) both confirm.
Action: Merged SES into the AWS row (Row 1) since it's the same legal entity (AWS EMEA SARL). Expanded purpose to "email receiving and sending (SES)".
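To make the inbound data categories concrete, the following sketch parses a raw .eml message and lists the fields named in the finding. It is an illustration only: the production parser is the Clojure code in ses.clj, and the helper name `summarize_eml` and its return shape are assumptions. It uses only Python's standard `email` package.

```python
from email import message_from_bytes
from email.policy import default


def summarize_eml(raw_bytes):
    """Summarize which data categories a raw .eml message carries.

    Hypothetical helper for illustration; it does not mirror ses.clj.
    """
    msg = message_from_bytes(raw_bytes, policy=default)
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "from": msg.get("From"),
        "to": msg.get("To"),
        "subject": msg.get("Subject"),
        "message_id": msg.get("Message-ID"),
        # Length only, to show that body text is present without echoing it.
        "body_chars": len(body.get_content()) if body is not None else 0,
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }
```

Running this over a stored message shows that sender, recipient, subject, body, and attachment names are all present in what SES writes to S3, which is why "Email addresses, sender name" understated the data processed.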
Original claim: "verification of masking of sensitive fields (IBAN, tax numbers)"
Finding: False. No masking, redaction, or field-level filtering exists before AI API calls. Full document text with all PII is sent verbatim.
The only IBAN-related processing is:
- `normalize-issuer-iban` in extraction.clj -- strips spaces from extracted IBANs after LLM returns them (output normalization, not input masking)
- `format-iban` in components.clj -- display formatting (groups of 4 chars)
- `searchable_text.clj` -- explicitly excludes IBAN from embedding text (but this is for search, not for AI extraction)
Confidence: High. Comprehensive grep confirms no masking logic.
Action: Rewrote Section 8 to accurately describe what data is sent to each provider, removed false masking claim, added TODO recommending PII masking implementation.
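The difference between the existing IBAN handling and the input masking recommended in the TODO can be sketched as follows. None of this exists in the codebase: `normalize_iban` and `format_iban` are Python stand-ins that only mirror the behavior described above, and `mask_ibans` is a hypothetical example of what pre-call masking could look like. The regular expression is deliberately rough (assumption: IBANs appear in upper case, optionally in space-separated groups of four) and would need validation against real documents.

```python
import re

# Rough IBAN shape (assumption): country code, two check digits, then
# 2-7 groups of four characters and an optional shorter final group.
IBAN_PATTERN = re.compile(
    r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"
)


def normalize_iban(value):
    """Output normalization: strip whitespace from an extracted IBAN."""
    return re.sub(r"\s+", "", value)


def format_iban(value):
    """Display formatting: groups of four characters."""
    compact = normalize_iban(value)
    return " ".join(compact[i:i + 4] for i in range(0, len(compact), 4))


def mask_ibans(text):
    """Input masking (hypothetical): keep the first and last four characters."""
    def _mask(match):
        compact = normalize_iban(match.group(0))
        return compact[:4] + "*" * (len(compact) - 8) + compact[-4:]
    return IBAN_PATTERN.sub(_mask, text)
```

Only a function like `mask_ibans`, applied to the transcription before the API request is built, would change what the provider receives. Masking would also prevent the model from extracting the IBAN at all, so any real implementation would have to decide how extracted bank details are obtained instead.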
Original claim: "Zero data retention policy applies" and "Binding provisions in O13 DPA Section 4.4"
Finding: No API-level zero-retention settings are configured in the code. Specifically:
- No `anthropic-beta` headers for Anthropic's zero-retention feature
Confidence: Medium. The contractual provisions may exist but are not technically enforced at the API level. This is a gap worth investigating.
Action: Added TODO flag: "Verify that zero-retention provisions are contractually binding in current DPAs -- no API-level zero-retention headers are currently set in the application code."
Original: Header says "Version 5.0" but change history only goes to 3.0.
Action: Set to Version 4.0 to follow sequentially from the change history.
Original claim: "TIAs for Anthropic, Google, and OpenAI are available as separate documents"
Action: Removed OpenAI reference. Added TODO to verify TIA documents exist for Microsoft and Slack.
Original: Listed Anthropic, Google, and OpenAI.
Action: Removed OpenAI (not integrated). List now contains only Anthropic and Google.
New: Added explanatory section under the sub-processor table clarifying:
| Finding | Confidence | Basis |
|---|---|---|
| No CDN/CloudFront exists | High | Exhaustive search of infra/ |
| No data minimization/masking before AI calls | High | Exhaustive grep of src/ for mask/redact/pii patterns |
| No OpenAI integration | High | Exhaustive search, zero hits |
| Microsoft used for auth + Outlook + Teams | High | Multiple source files, direct API URLs |
| Slack used for notifications | High | Direct slack.com/api calls in code |
| Maesn used for DATEV integration | High | Direct api.maesn.dev calls in code |
| Gemini uses global endpoint (not EU) | High | URL pattern in llm.clj uses generativelanguage.googleapis.com |
| Document AI uses EU endpoint | High | Config value location: "eu", endpoint pattern includes location |
| Vertex AI uses europe-west1 | High | Config value location: "europe-west1" |
| Full financial PII sent to Anthropic | High | Extraction prompt and transcription pass-through verified |
| No API-level zero-retention headers | Medium | No headers found, but contractual terms not verifiable from code |
| SCCs/DPA/TIA existence for each provider | Unknown | Legal documents not in codebase; cannot verify |
| §203 StGB contractual compliance | Unknown | Legal assessment not verifiable from code |
None. All claims in this review are supported by direct code evidence cited above.
- Consider setting the `anthropic-beta` header to technically enforce zero data retention, rather than relying solely on contractual terms.
- The Gemini endpoint (`generativelanguage.googleapis.com`) is a global endpoint. If EU data residency is required, consider migrating to Vertex AI Gemini endpoints which support regional configuration.
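For the Gemini recommendation, the practical difference is in how the request URL is built. The sketch below contrasts the two URL shapes; the function names, project ID, and model name are placeholders, and the paths reflect Google's public REST conventions as best understood here, so they should be checked against current Google documentation before use. Authentication also differs between the two APIs and is not shown.

```python
def gemini_global_url(model):
    """URL shape of the global Gemini API host (no region in the hostname)."""
    return (
        "https://generativelanguage.googleapis.com/v1beta/"
        f"models/{model}:generateContent"
    )


def gemini_vertex_url(project, location, model):
    """URL shape of a regional Vertex AI host, e.g. location="europe-west1"."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{location}/"
        f"publishers/google/models/{model}:generateContent"
    )
```

With the Vertex AI form, the location would come from configuration in the same way as the existing `location: "europe-west1"` value for embeddings, so the region is visible in both the hostname and the path.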