O14 Transfer Impact Assessments -- Codebase Verification Review

Date: 12.04.2026
Reviewer: Automated codebase analysis
Scope: Verified every claim in O14 v2.0 against the Orcha codebase (orcha/src/, orcha/infra/) and CDK infrastructure

Sources Consulted

Source	Key findings area
`infra/stacks/foundation_stack.py`	Cognito, S3, SQS, KMS, SSM parameters
`infra/stacks/data_stack.py`	RDS PostgreSQL, storage encryption
`infra/stacks/compute_stack.py`	ALB, EC2, TLS certificates
`src/com/getorcha/ai/llm.clj`	AI provider HTTP clients, endpoints
`src/com/getorcha/workers/ap/ingestion/extraction.clj`	Data actually sent to Anthropic
`src/com/getorcha/workers/ap/ingestion/transcription.clj`	Data sent to Google Document AI and Gemini Vision
`src/com/getorcha/workers/ap/acquisition/email/outlook.clj`	Microsoft Graph API usage
`src/com/getorcha/workers/ap/acquisition/email/gmail.clj`	Gmail API usage
`src/com/getorcha/oauth/providers/microsoft.clj`	Microsoft Entra ID authentication
`src/com/getorcha/notifications.clj`	Slack API, Microsoft Bot Framework
`src/com/getorcha/search.clj`	Google Vertex AI embeddings
`resources/com/getorcha/config.edn`	Provider endpoints and regions
Full-text grep across `src/`, `infra/`	Claim verification
Prior O3 and O1 review findings	Cross-referenced

Changes Made

1. REMOVED: OpenAI (entire Section 6 deleted)

Original: Section 6 was a full TIA for "OpenAI OpCo, LLC (Pre-Approved)" including company profile, data flow, transfer mechanism, US law exposure analysis, and risk determination. The document stated OpenAI was "pre-approved as a sub-processor but not yet active".

Finding: OpenAI is not integrated in the codebase. Exhaustive grep for openai, open-ai, gpt-4, gpt-3 across the entire repository returned zero matches. No SSM parameters, no config entries, no dependencies, no code.
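
The repository-wide search behind this finding can be reproduced with a short script. The following is a minimal sketch, not the tool used for this review: the function name `find_matches` and the choice to skip `.git` and undecodable files are assumptions. The search terms are the ones listed above.

```python
import re
from pathlib import Path

# Search terms taken from the finding above.
TERMS = ["openai", "open-ai", "gpt-4", "gpt-3"]


def find_matches(root, terms=TERMS):
    """Return (path, line_number, line) for every case-insensitive hit under root."""
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Assumption: version-control metadata is out of scope for the search.
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

An empty list from `find_matches(".")` at the repository root corresponds to the "zero matches" result reported here.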

Confidence: High. This matches the O3 review finding.

Action: Removed the entire OpenAI section. A TIA should assess actual data transfers. If OpenAI is wanted as a future sub-processor, its TIA should be conducted as part of the activation procedure rather than speculatively in advance. Removed OpenAI from the scope, overview table, conclusion, and pre-approved list references.

2. ADDED: Microsoft Corporation (new Section 6)

Original: Microsoft was not covered. The original document treated Microsoft as though it was not a sub-processor.

Finding: Microsoft is a significant US-based sub-processor processing personal data in three distinct ways:

Microsoft Entra ID for SSO authentication (user identity, email, tenant ID)
Microsoft Graph API for Outlook email acquisition (full email contents and attachments)
Microsoft Bot Framework for Teams notification delivery

Confidence: High. Direct API URLs found in oauth/providers/microsoft.clj, workers/ap/acquisition/email/outlook.clj, and notifications.clj.

Action: Added a full Section 6 individual TIA for Microsoft Corporation with company profile, services used, data flow, transfer mechanism, US law exposure analysis, supplementary measures, and risk determination.

3. ADDED: Slack Technologies, LLC (new Section 7)

Original: Slack was not covered.

Finding: Slack Technologies (Salesforce subsidiary) receives notification messages containing document processing status, which may include supplier names and document references. Direct API calls to slack.com/api/chat.postMessage found in notifications.clj.

Confidence: High. Direct API usage confirmed in code.

Action: Added a full Section 7 individual TIA for Slack Technologies.
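
The Microsoft and Slack findings both rest on outbound API hosts that appear literally in source files. A rough way to inventory such hosts is sketched below. The suffix-to-vendor table and the function name `vendors_in_source` are hypothetical, and the hosts are the ones named in this review. This is not the method used to produce the findings.

```python
import re
from collections import defaultdict

# Hypothetical mapping from host suffix to sub-processor; extend as needed.
VENDOR_BY_SUFFIX = {
    "microsoft.com": "Microsoft",
    "microsoftonline.com": "Microsoft",
    "slack.com": "Slack",
}

# Captures the host part of any http(s) URL literal.
HOST_RE = re.compile(r"https?://([A-Za-z0-9.-]+)")


def vendors_in_source(source_text):
    """Group outbound hosts found in source text by the vendor that owns them."""
    found = defaultdict(set)
    for host in HOST_RE.findall(source_text):
        host = host.lower().rstrip(".")
        for suffix, vendor in VENDOR_BY_SUFFIX.items():
            if host == suffix or host.endswith("." + suffix):
                found[vendor].add(host)
    return {vendor: sorted(hosts) for vendor, hosts in found.items()}
```

A host-level match only shows that a call site exists. Which data categories flow to the vendor still has to be read from the calling code.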

4. CORRECTED: AWS services list (Section 8)

Original claim: "Cloud hosting, compute (EC2), database (RDS/DynamoDB), object storage (S3), content delivery (CloudFront), authentication (Cognito), email delivery (SES), and backup services."

Findings:

DynamoDB: FALSE. Zero matches for dynamodb or DynamoDB across the codebase. Only RDS PostgreSQL is used.
CloudFront: FALSE. No CloudFront configuration anywhere. Zero matches in infra/. This matches the O1 review finding.
SQS, KMS, secrets management: These AWS services ARE used but were not listed.

Action: Removed DynamoDB and CloudFront. Kept RDS (PostgreSQL only) and added SQS, KMS, and secrets management so that the list reflects the services actually used.

5. CORRECTED: AWS encryption at rest claim (Section 8)

Original claim: "Encryption at rest: AES-256 encryption for all stored data, with customer-managed keys (AWS KMS)"

Finding: This overstates the actual encryption configuration.

RDS database uses default AWS-managed RDS key (no customer-managed key specified)
S3 buckets use S3-managed encryption (SSE-S3, not SSE-KMS with customer-managed keys)
SQS queues use SQS-managed encryption
Only specific sensitive application fields (authentication tokens, third-party credentials) are encrypted with a customer-managed KMS key

Evidence: data_stack.py uses storage_encrypted=True without storage_encryption_key. foundation_stack.py S3 buckets use encryption=s3.BucketEncryption.S3_MANAGED. Single customer-managed KMS key (db-secrets-key) is used for specific field-level encryption only.
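
The evidence above amounts to a text-level check on the CDK stack sources. A minimal sketch of that check follows, assuming the stacks are written with the standard aws_cdk Python constructs. The function name `classify_encryption` and the label strings are illustrative only, and a text match does not replace inspecting the synthesized template.

```python
import re


def classify_encryption(stack_source):
    """Label storage layers by who manages the encryption key, from CDK source text."""
    result = {}
    # RDS: storage_encrypted=True without storage_encryption_key falls back
    # to the default AWS-managed RDS key.
    if re.search(r"storage_encrypted\s*=\s*True", stack_source):
        has_key = re.search(r"storage_encryption_key\s*=", stack_source)
        result["rds"] = "customer-managed KMS key" if has_key else "aws-managed key"
    # S3: check the more specific enum names before the bare KMS member.
    if "BucketEncryption.S3_MANAGED" in stack_source:
        result["s3"] = "s3-managed (SSE-S3)"
    elif "BucketEncryption.KMS_MANAGED" in stack_source:
        result["s3"] = "aws-managed KMS key"
    elif "BucketEncryption.KMS" in stack_source:
        result["s3"] = "customer-managed KMS key"
    return result
```

Run against the fragments quoted in the evidence, this reports an AWS-managed key for RDS and SSE-S3 for the buckets, which is the distinction the corrected text draws.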

Confidence: High.

Action: Corrected to distinguish between AWS-managed encryption (storage-level) and customer-managed KMS (sensitive application fields).

6. CORRECTED: TLS 1.3 claim throughout

Original claim: "TLS 1.3 encryption for all API calls" (appears multiple times)

Finding: TLS 1.3 is not explicitly configured. The ALB uses the AWS default TLS policy. Outbound HTTP calls use the JVM default TLS settings via the hato HTTP client. No TLS version is pinned in code or infrastructure. The TLS version actually negotiated therefore depends on what each server supports (likely TLS 1.3 for modern endpoints, with TLS 1.2 as a fallback).

Confidence: High. Zero matches for TLS version configuration in code.

Action: Replaced "TLS 1.3" with "TLS" throughout to reflect the actual measure (transport encryption) without claiming a specific version that isn't enforced.
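
The difference between "TLS by default" and "TLS at a pinned minimum version" is illustrated below with Python's standard `ssl` module. This is only an illustration of the concept: Orcha's outbound calls run on the JVM via hato, so the actual change would be made there and in the ALB security policy. The function names are hypothetical.

```python
import socket
import ssl


def pinned_context(minimum=ssl.TLSVersion.TLSv1_2):
    """Build a client context that refuses any protocol version below `minimum`."""
    context = ssl.create_default_context()
    context.minimum_version = minimum
    return context


def negotiated_version(host, port=443, context=None, timeout=5.0):
    """Connect to host and return the negotiated protocol name as reported by ssl."""
    context = context or pinned_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version()
```

Calling `negotiated_version` against each provider endpoint would give an auditable record of the version in use. With a pinned context, a handshake below the minimum fails instead of silently downgrading.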

7. CORRECTED: PII masking / data minimization claims

Original claims:

"Data minimisation is applied where technically feasible: IBAN masking and tax ID redaction before API transmission"
"Data minimisation: IBAN masking and tax ID redaction applied before API transmission where technically feasible"
"Data minimisation: Where technically feasible, IBAN masking and tax ID redaction are applied before API transmission"

Finding: FALSE. No PII masking exists. The extraction prompt explicitly instructs the LLM: "IBAN & BIC: CRITICAL payment fields -- always extract the supplier's bank details". Full document text with all financial PII is transmitted unmasked. This matches the O3 and O1 review findings.

Confidence: High.

Action: Removed all PII masking / IBAN masking / tax ID redaction claims. The actual data minimization measures in place are: Vertex AI embeddings exclude IBANs (for search indexing only), email triage truncates body text, previews are low-resolution. Described these accurately.

8. CORRECTED: Zero-retention claims

Original claims:

"Zero data retention: Anthropic's Claude API is configured with zero-retention settings"
"Zero data retention: Google Document AI processes data in-memory without persistent storage"
"Orcha's zero data retention policy ensures that data sent to sub-processors is transient"
Multiple mentions of "zero-retention API configuration" as a technical measure

Finding: There is no API-level zero-retention enforcement in the code. Anthropic's anthropic-beta header for zero-retention is not set. No Google-specific data retention configuration is applied. Zero-retention, if it applies, is enforced only contractually (via DPA), not technically at the API level.

Confidence: Medium. The contractual provisions may exist but cannot be verified from code; the claim that this is a technical measure is not supported.

Action: Reclassified "zero retention" from a technical measure to a contractual commitment throughout the document. Rephrased risk mitigation language to attribute these protections to DPA provisions rather than API configuration.

9. CORRECTED: Google data flow and regions

Original claim: "Customer document is retrieved from AWS Frankfurt, transmitted over TLS 1.3 to Google's Document AI and Gemini endpoints, processed with no persistent storage"

Finding: This is partially accurate but conflates services that have very different data residency:

Document AI is configured for EU region (eu-documentai.googleapis.com)
Vertex AI embeddings use europe-west1
Gemini API uses generativelanguage.googleapis.com, which is a global endpoint with no EU guarantee

Confidence: High. Matches O3 review finding.

Action: Corrected the Google section to distinguish between EU-regional Google services (Document AI, Vertex AI) and global Gemini endpoint. Updated the overview table to reflect this split.
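
The regional split can be read off the endpoint hostnames. The sketch below classifies an endpoint by its first hostname label. The function name `residency` is hypothetical, and the Vertex AI hostname used in the usage note is an assumption based on Google's regional naming pattern (the review only records the region europe-west1). A hostname indicates the configured endpoint, not where processing is guaranteed to occur.

```python
from urllib.parse import urlparse


def residency(endpoint):
    """Classify a Google API endpoint as 'eu' or 'global' from its hostname prefix."""
    # Accept bare hostnames as well as full URLs.
    url = endpoint if "//" in endpoint else "//" + endpoint
    host = urlparse(url).hostname or ""
    first_label = host.split(".")[0]
    if first_label.startswith("eu-") or first_label.startswith("europe-"):
        return "eu"
    # No region marker in the hostname: treat as a global endpoint.
    return "global"
```

With the endpoints from this finding, `residency("eu-documentai.googleapis.com")` yields "eu" and `residency("generativelanguage.googleapis.com")` yields "global". An assumed regional Vertex AI host such as `europe-west1-aiplatform.googleapis.com` would also classify as "eu".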

10. EXPANDED: Data scope for AI services

Original claim: "Orcha makes an API call to Anthropic or Google, transmitting only the document content and extracted fields"

Finding: The actual scope is broader:

For Anthropic: full document text plus optionally email body context
For Google Gemini vision: full PDF page images
For Google Gemini triage: email body (truncated), sender, subject, and thumbnail images of attachments
For Google Gemini supplier verification: supplier name, country, VAT ID, address (used as input to Google Search)
For Google Vertex AI: derived text fields (excluding IBANs)

Confidence: High. Matches O3 review findings.

Action: Expanded the data flow and data categories descriptions to reflect the actual scope of data transferred.

11. CORRECTED: DPF certifications list

Original: DPF certifications were mentioned for Google, OpenAI, and AWS only. Microsoft's DPF certification was not referenced.

Finding: Microsoft is DPF-certified (public information). Since Microsoft is now in scope, its DPF certification should be referenced. OpenAI's DPF certification is no longer relevant as OpenAI is not used.

Confidence: Medium. DPF certifications are external facts recorded on the public DPF list. Current certification status was not independently verified in this review and should be checked against dataprivacyframework.gov at document review time.

Action: Added Microsoft DPF reference; removed OpenAI DPF reference.

12. REMOVED: Specific risk-analysis claims that rely on removed measures

Throughout the document, the original risk analysis relied heavily on the "zero retention" technical claim to justify MEDIUM risk ratings. For example:

"Orcha's zero data retention policy ensures that data sent to sub-processors is transient; there is no persistent data store at the AI provider to produce in response to a 702 order."

This was the primary justification for mitigating FISA 702, EO 12333, and CLOUD Act risks. Since the zero-retention protection is contractual only (not technically enforced), the mitigation language has been softened accordingly.

Action: Rephrased FISA 702, EO 12333, and CLOUD Act mitigation language to attribute protection to contractual commitments rather than technical enforcement. The overall MEDIUM risk rating is retained but the justification is more accurate.

13. CORRECTED: Change history and version

Original: Version 2.0, April 2026. No change history table.

Action: Set to Version 3.0, 12.04.2026. Added change history table documenting v1.0, v2.0, and v3.0 changes.

Confidence Assessment

Finding	Confidence	Basis
OpenAI not integrated	High	Exhaustive grep; matches O3 review
Microsoft is a US sub-processor	High	Direct API URLs to graph.microsoft.com, login.microsoftonline.com
Slack is a US sub-processor	High	Direct calls to slack.com/api
DynamoDB not used	High	Zero matches
CloudFront not used	High	Zero matches in infra/
TLS 1.3 not explicitly configured	High	No TLS version config in code
No PII masking before AI calls	High	Matches O3 and O1 findings; extraction prompt requests IBANs
No API-level zero-retention	Medium	No retention headers found; contractual terms not verifiable from code
Gemini uses global endpoint	High	Direct URL in llm.clj
Document AI uses EU region	High	Config value location=eu
Vertex AI uses europe-west1	High	Config value
AWS encryption uses mix of managed/customer keys	High	CDK configuration directly inspected
DPF certifications of Google/Microsoft/Amazon	Medium	Public facts; verifiable at DPF list but not from code
§203 StGB contractual coverage	Unknown	Legal content not verifiable from code
Specific SCCs and UK Addendum versions	Unknown	Contractual documents not in codebase

Retractions

None. All claims in this review are supported by code evidence or cross-referenced with prior reviews.

Recommended Actions (TODOs)

Critical

Remove OpenAI from operational planning unless actual integration begins -- A proactive TIA for a non-integrated provider creates maintenance burden and risk of stale commitments. If OpenAI is genuinely planned, the TIA should be produced at the point of activation as part of the O3 approval procedure.
Add Microsoft DPA coverage -- Ensure DPAs are in place with Microsoft covering Graph API, Entra ID, and Bot Framework usage. This has been flagged in the O3 review as well.
Add Slack DPA coverage -- Ensure a DPA with Salesforce (parent of Slack) covers Slack API usage.
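Implement technical enforcement of zero-retention -- Enable API-level zero-retention for Anthropic where the API offers it (the anthropic-beta header referred to in item 8 should be confirmed against Anthropic's current API documentation before relying on it). Verify Google's default API retention and configure explicit no-retention where available. This converts the zero-retention guarantee from contract-only to technically enforced.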
Implement PII masking before AI API calls -- The document's supplementary measures cannot currently include data minimization of PII fields because none is implemented. Either build the masking or keep the claim out of the document.
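
For the masking item above, a minimal sketch of IBAN masking is shown below. Nothing like this exists in the codebase today. The regular expression, the function name `mask_ibans`, and the choice to keep the first four and last four characters are all assumptions for illustration. Because the extraction prompt currently asks the LLM for full bank details, masking of this kind would conflict with extraction as implemented and could only apply to calls that do not need the IBAN.

```python
import re

# Assumed pattern: country code, two check digits, then 8 to 31 further
# alphanumerics, optionally printed in space-separated groups of four.
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b")


def mask_ibans(text):
    """Replace the middle of every IBAN-like token with asterisks."""

    def _mask(match):
        compact = match.group(0).replace(" ", "")
        # Keep country code + check digits and the last four characters.
        return compact[:4] + "*" * (len(compact) - 8) + compact[-4:]

    return IBAN_RE.sub(_mask, text)
```

The pattern is deliberately loose and does not validate the checksum, so it can produce false positives on other uppercase alphanumeric codes. A production version would need per-country length checks and the mod-97 test.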

High Priority

Migrate Gemini to Vertex AI regional endpoint -- Using Vertex AI Gemini endpoints in europe-west1 instead of the global generativelanguage.googleapis.com would significantly reduce data residency risk and simplify the TIA.
Verify DPF certification status annually -- Google, Microsoft, and Amazon DPF certifications should be verified against dataprivacyframework.gov at each TIA review.
Document formal escalation path for government requests -- The escalation procedure names roles (DPO, Technical Lead, Managing Director) but should map to specific individuals and contact details.

Medium Priority

Pin TLS version explicitly -- Configure ALB TLS security policy explicitly and set minimum TLS 1.2 (or 1.3) in outbound HTTP client configuration, so the "TLS encryption" claim is auditable at a specific version.
Review TIA scope when sub-processors change -- Maesn (DATEV integration) is an EU sub-processor and therefore does not require US transfer analysis. If third-country sub-processors are added in the future, extend this TIA to cover them.