O1 TOM Documentation -- Codebase Verification Review

Date: 12.04.2026
Reviewer: Automated codebase analysis
Scope: Verified every technical claim in O1 v4.0 against the Orcha codebase (orcha/src/, orcha/infra/, orcha/resources/) and CDK infrastructure

Sources Consulted

| Source | Key findings area |
|---|---|
| infra/stacks/foundation_stack.py | VPC, Cognito, S3, SQS, SES, KMS, SSM |
| infra/stacks/data_stack.py | RDS PostgreSQL configuration |
| infra/stacks/compute_stack.py | ALB, EC2, ASG, CloudWatch agent, security groups |
| infra/stacks/ops_stack.py | CloudWatch alarms, SNS, CodePipeline, log groups |
| infra/future-improvements.md | Planned but not-yet-deployed features |
| infra/app.py | Environment/region configuration |
| src/com/getorcha/app/http/middleware/auth.clj | Session handling, auth middleware |
| src/com/getorcha/admin/http/middleware/auth.clj | Admin auth middleware |
| src/com/getorcha/ai/llm.clj | AI provider HTTP clients |
| src/com/getorcha/workers/ap/ingestion/extraction.clj | Extraction prompts sent to LLMs |
| src/com/getorcha/notifications.clj | Slack/Teams/SES notification delivery |
| src/com/getorcha/link/api/middleware.clj | API request audit logging |
| resources/com/getorcha/config.edn | Runtime configuration |
| resources/migrations/init.sql | Database schema |
| .prd/authentication.md | Authentication PRD (MFA scope) |
| buildspec.yml | CI/CD build specification |
| deps.edn | Dependencies and aliases |
| Full-text grep across src/, infra/ | Pattern matching for specific claims |

Changes Made

Section 2.2: System Access Control -- MAJOR REWRITE

7 false claims corrected:

| Original Claim | Actual State | Evidence |
|---|---|---|
| "MFA via AWS Cognito" | MFA is OFF (CDK default). PRD explicitly says "MFA: Optional (can enable later)" and lists MFA as "Out of Scope" | foundation_stack.py: no mfa parameter on either UserPool. .prd/authentication.md line 74, line 812 |
| "TOTP as second factor" | Not configured anywhere | Zero matches for TOTP, software_token across all files |
| "Password min 12 chars + complexity" | No custom password policy set. Irrelevant: all users authenticate via Google/Microsoft OAuth, never setting a Cognito password | foundation_stack.py: no password_policy parameter. Both pools use supported_identity_providers=[GOOGLE] |
| "15 min inactivity timeout" | Fixed 24-hour session cookie. No inactivity detection | foundation_stack.py line 77: session_max_age_seconds = 86400. SSM parameter description: "24h" |
| "Max session 8 hours" | 24 hours (see above) | Same evidence |
| "Last 5 passwords not reused" | Not configurable in Cognito; moot since OAuth-only auth | No password_policy in CDK; Cognito has no password history feature |
| "Lockout after 6 failed attempts" | No Advanced Security add-on enabled; moot since OAuth-only | No advanced_security_mode in CDK |

Action: Rewrote section to accurately describe OAuth-based authentication via Google and Microsoft, 24-hour session duration, KMS-encrypted refresh tokens. Added TODO for MFA evaluation.
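
To make the gap between the claimed and the actual session policy concrete, the following is a minimal Python sketch, not the application's Clojure code. The Session type, its last_seen_at field, and both function names are hypothetical; only the 86400-second value comes from the evidence cited above. The second function models the 15-minute/8-hour policy the original document claimed, which the review found is not implemented.

```python
from dataclasses import dataclass

# 24 h, the value the review cites from foundation_stack.py.
SESSION_MAX_AGE_SECONDS = 86400


@dataclass
class Session:
    created_at: float    # Unix time when the session cookie was issued
    last_seen_at: float  # Unix time of the latest request (hypothetical field)


def valid_fixed_lifetime(session: Session, now: float,
                         max_age: int = SESSION_MAX_AGE_SECONDS) -> bool:
    """Fixed-lifetime check: only the time since issuance matters."""
    return now - session.created_at < max_age


def valid_with_inactivity(session: Session, now: float,
                          max_age: int = 8 * 3600,
                          idle_timeout: int = 15 * 60) -> bool:
    """Claimed policy: 8 h absolute cap plus 15 min inactivity timeout."""
    within_cap = now - session.created_at < max_age
    recently_active = now - session.last_seen_at < idle_timeout
    return within_cap and recently_active
```

Under these assumptions, a session that has been idle for two hours still passes the fixed-lifetime check but would be rejected by the inactivity-aware check.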

Section 2.3: Data Access Control -- CORRECTED

Original claims:

Actual state: Tenant isolation via tenant_membership, super-admin boolean flag, separate admin Cognito pool restricted to @getorcha.com, API key JSONB permissions.

Evidence: init.sql line 77: is_super_admin BOOLEAN NOT NULL DEFAULT false. admin/http/middleware/auth.clj lines 48-52: @getorcha.com domain check.

Action: Rewrote to describe actual access model. Added TODO for formal RBAC and periodic reviews.

Section 2.4: Network Security -- MAJOR REWRITE

6 false claims removed:

| Original Claim | Actual State | Evidence |
|---|---|---|
| "Private subnets and isolated network segmentation" | EC2 instances run in PUBLIC subnets (nat_gateways=0). Only RDS is in private isolated subnets | foundation_stack.py line 91: nat_gateways=0. compute_stack.py line 543: subnet_type=ec2.SubnetType.PUBLIC |
| "AWS WAF against DDoS and injection attacks" | WAF not configured. Listed in future-improvements.md as "Priority: Medium" | Zero aws_wafv2 imports in any CDK stack. future-improvements.md: "Consider adding when: Processing customer documents at scale" |
| "API Gateway with rate limiting (100 req/min)" | No API Gateway used. Architecture is ALB -> EC2. No rate limiting anywhere | Zero aws_apigateway imports. Zero rate-limiting middleware in src/ |
| "VPN and bastion host" | Neither exists. SSM Session Manager used instead | No VPN gateway or bastion instance in CDK. CloudWatch log group /v1-orcha/ssm-sessions confirms SSM usage |
| "IDS/IPS with continuous monitoring" | Not configured. No GuardDuty, Inspector, or traffic mirroring | Zero matches for guardduty, inspector, ids, ips in infra/ |
| "CloudFront CDN with edge DDoS protection" | CloudFront not configured | Zero aws_cloudfront imports in any CDK stack |

Action: Rewrote to describe actual architecture: VPC with public/private subnets, ALB with HTTPS, SSM Session Manager, security groups. Moved unimplemented items to a note referencing future-improvements.md.
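
Several findings in this section rest on "zero matches" searches. The following is a minimal Python sketch of how such a search might be reproduced; the function name count_matches, the file-suffix filter, and the example patterns are illustrative assumptions and not the tooling this review actually ran.

```python
import re
from pathlib import Path
from typing import Dict, Iterable, Tuple


def count_matches(root: str, patterns: Iterable[str],
                  suffixes: Tuple[str, ...] = (".py", ".md")) -> Dict[str, int]:
    """Count case-insensitive literal occurrences of each pattern under root."""
    compiled = {p: re.compile(re.escape(p), re.IGNORECASE) for p in patterns}
    counts = {p: 0 for p in compiled}
    for path in Path(root).rglob("*"):
        # Only scan regular files with one of the selected suffixes.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for pattern, regex in compiled.items():
            counts[pattern] += len(regex.findall(text))
    return counts
```

A call such as count_matches("infra", ["aws_wafv2", "aws_apigateway", "aws_cloudfront", "guardduty"]) would return one count per pattern. Very short patterns such as ids or ips also match inside longer words, so a zero count is strong evidence of absence while a nonzero count still needs manual inspection.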

Section 3.1: Transfer Control -- CORRECTED

3 false claims removed:

| Original Claim | Actual State | Evidence |
|---|---|---|
| "HSTS headers with long validity" | No HSTS headers set. No security header middleware at all | Zero matches for Strict-Transport, HSTS, x-frame-options, content-security-policy in src/ |
| "Certificate pinning for critical API connections" | Not implemented. Default JVM TLS with hato HTTP client | Zero matches for pinning, TrustManager, SSLContext in src/ |
| "Perfect Forward Secrecy through ephemeral key exchanges" | No explicit TLS policy set. AWS ALB default may include ECDHE suites, but this is not an intentional configuration | No TLS security policy named in compute_stack.py ALB listener config |

Action: Simplified to describe actual TLS measures: ALB HTTPS termination, HTTP redirect, S3 SSL enforcement, secure cookies.

Section 3.2: Input Control -- MAJOR REWRITE

5 false claims corrected:

| Original Claim | Actual State | Evidence |
|---|---|---|
| "AWS CloudTrail for all infrastructure operations" | CloudTrail not configured in CDK | Zero matches for cloudtrail in infra/ |
| "12 months retention of all audit logs" | Max 30 days (SSM/CI logs). App/admin logs: 14 days | compute_stack.py lines 50,56,62,68: retention_in_days: 14. ops_stack.py line 121: ONE_MONTH |
| "Tamper-proof log storage with S3 Object Lock" | Object Lock not configured on any bucket | No object_lock_enabled on any S3 bucket in foundation_stack.py |
| "Automated anomaly detection in audit logs" | Only cost anomaly detection exists (spending, not security) | ops_stack.py lines 571-587: CfnAnomalyMonitor for cost only |
| "All data access logged with timestamp, user ID, and access purpose" | Only Link API requests and DATEV exports are audited. No general user-action audit trail | link/api/middleware.clj logs API requests. No audit table for main app user actions |

Actual logging: API request log (Link API), DATEV export audit, email processing records, CloudWatch logs (14-30 day retention), ingestion statistics, MDC context in all log entries.

Action: Rewrote to accurately describe actual logging capabilities and their retention periods.

Section 4: Availability -- CORRECTED

4 false/misleading claims:

| Original Claim | Actual State | Evidence |
|---|---|---|
| "Multi-AZ deployment" | RDS: multi_az=False. EC2: ASG min=max=desired=1 | data_stack.py line 112: multi_az=False. compute_stack.py lines 553-554: capacity=1 |
| "Redundant database configuration with failover" | No standby replica, no automatic failover | Directly follows from multi_az=False |
| "DDoS protection via AWS Shield and CloudFront" | Only Shield Standard (free, automatic). No CloudFront | No aws_shield CDK module. No CloudFront |
| "Rate limiting and throttling at API level" | No rate limiting in any middleware | Zero rate-limiting code in request handling paths |

Action: Described actual state: single-instance ASG, ALB across 2 AZs, no Multi-AZ DB, Shield Standard only. Added note acknowledging these as current trade-offs.

Section 5: Recoverability -- CORRECTED

4 false claims:

| Original Claim | Actual State | Evidence |
|---|---|---|
| "RTO: maximum 4 hours / RPO: maximum 1 hour" | No RTO/RPO targets documented anywhere | Zero matches for RTO, RPO across entire repo |
| "Backup encryption with separate KMS keys" | Default AWS-managed keys, not separate KMS keys | data_stack.py: no storage_encryption_key parameter. foundation_stack.py: S3_MANAGED encryption |
| "Cross-region backup replication" | Not configured | Zero matches for cross-region replication in any CDK stack |
| "Backup retention at least 30 days" | 14 days for RDS backups | data_stack.py line 116: backup_retention=Duration.days(14) |

Action: Described actual backup configuration: daily RDS backups, 14-day retention, AWS-managed encryption. Added TODOs.

Section 6.1: Encryption -- CORRECTED

2 false claims:

| Original Claim | Actual State | Evidence |
|---|---|---|
| "AES-256-GCM via AWS KMS" | KMS used for specific application fields only (refresh tokens, DATEV creds). S3 uses SSE-S3. RDS uses aws/rds managed key. GCM mode claim unverifiable | foundation_stack.py line 966: one shared KMS key. S3: S3_MANAGED. RDS: storage_encrypted=True without custom key |
| "Separate encryption keys per customer" | One shared KMS key for all tenants | Single db-secrets-key construct. future-improvements.md discusses per-tenant as future goal |

Action: Described actual encryption architecture accurately, distinguishing between managed-key storage encryption and application-level KMS field encryption.

Section 6.2: Pseudonymization in AI -- CORRECTED

Original claim: "Sensitive fields (bank data, tax numbers) are masked/redacted before transmission to AI services"

Finding: FALSE. The extraction prompt in extraction.clj explicitly instructs the LLM to extract IBANs, BICs, bank names, VAT IDs, and tax numbers from documents. Full document text is sent without any masking.

Evidence: The extraction prompt states: "7. IBAN & BIC: CRITICAL payment fields -- always extract the supplier's bank details." In addition, uncertain_validations.clj passes the full structured data, including the IBAN, to the LLM with the instruction "look at the actual IBAN printed on the invoice."

Action: Removed false masking claim. Documented current state honestly. Added TODO.
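
As an illustration of what field-level masking before an AI call could look like, the following Python sketch replaces IBAN-like tokens with placeholders and keeps a mapping so the original values can be restored locally after the response returns. The function name mask_ibans, the placeholder format, and the regular expression are assumptions made for this sketch; the pattern is deliberately loose and can over-match adjacent uppercase tokens. Because the current extraction prompt asks the model to read the IBAN itself, any such masking would also require changing how bank details are extracted.

```python
import re
from typing import Dict, Tuple

# Loose IBAN shape: country code, two check digits, then 2-7 groups of four
# alphanumerics and an optional shorter tail; groups may be space-separated.
IBAN_RE = re.compile(
    r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"
)


def mask_ibans(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace IBAN-like tokens with placeholders; return text and mapping."""
    mapping: Dict[str, str] = {}

    def _substitute(match: re.Match) -> str:
        placeholder = f"[IBAN_{len(mapping) + 1}]"
        mapping[placeholder] = match.group(0)  # kept locally, never sent out
        return placeholder

    return IBAN_RE.sub(_substitute, text), mapping
```

For example, mask_ibans("Pay to DE89 3704 0044 0532 0130 00 by Friday.") yields the text "Pay to [IBAN_1] by Friday." together with a mapping from the placeholder to the original IBAN.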

Section 7: Secure Development -- CORRECTED

4 overclaimed items:

| Original Claim | Actual State | Evidence |
|---|---|---|
| "Four-eyes principle (two independent reviewers)" | No branch protection rules. No .github/ directory. Pipeline has no approval stage | No .github/ dir. ops_stack.py pipeline: Source -> Build -> Deploy, no approval |
| "SAST/DAST in CI/CD" | Not configured. buildspec.yml only builds and pushes Docker image | buildspec.yml contains no security scanning commands |
| "Dependency scanning weekly" | nvd-clojure exists but not scheduled | nvd-clojure.edn with fail-threshold 7 exists. No cron/scheduled job |
| "Strict separation of dev/test/prod" | Only prod cloud env. Dev uses Docker. No staging | infra/app.py: if env_name not in ("prod",): |

Action: Described actual development practices: PR-based workflow (without enforced review), dependency scanning tools available but not automated, single cloud environment.

Section 8: Vulnerability Management -- SIMPLIFIED

Original claims without evidence: Bug bounty program, patching SLAs, automated security updates with tested rollout, documentation of all vulnerabilities.

Actual state: nvd-clojure for dependency scanning, RDS minor version auto-upgrade.

Action: Simplified to describe what actually exists. Added TODOs.

Section 10: Sub-Processors -- CORRECTED

Original claim: Lists "SendGrid (Email delivery)"

Finding: FALSE. SendGrid is not used anywhere. Zero matches for sendgrid across entire codebase. Email is sent/received via AWS SES.

Evidence: deps.edn includes software.amazon.awssdk/sesv2. aws.clj implements send-email! via SES v2.

Action: Replaced sub-processor list with accurate categories referencing O3 document.

Section 14: AI-Specific Measures -- MAJOR REWRITE

Key corrections:

| Original Claim | Actual State | Evidence |
|---|---|---|
| "OpenAI GPT" mentioned as AI system | OpenAI not used. Zero references in codebase | Exhaustive grep: zero matches for openai, gpt-4, gpt-3 |
| "Sensitive fields masked/redacted before AI" | FALSE (same as Section 6.2) | Extraction prompt explicitly requests IBANs, bank details |
| "Complete documentation of all AI inputs and outputs" | Outputs/token counts stored. Prompt inputs not stored | ingestion.clj: stores extraction-response, token counts; no input prompts |
| "Audit log of all AI requests with timestamp and user" | No AI-specific audit log with user attribution | api_request_log covers REST API, not LLM calls |

Confirmed: "No fully automated decisions with legal effect" -- TRUE. needs_human_review flag gates DATEV export. UI requires explicit hx-confirm before export.

Evidence: app/http/ap.clj: exportable? predicate checks (not needs-human-review). DATEV export has hx-confirm "Export N invoice(s) to DATEV?".

Action: Listed actual AI systems with specific models. Described actual data minimization measures (embedding text excludes IBAN, triage truncates email body). Described actual audit capabilities. Flagged masking gap.
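
The human-review gate confirmed above can be paraphrased in Python as follows. This is a sketch of the described behavior only: the Invoice type and the select_for_export helper are hypothetical, and the real exportable? predicate in app/http/ap.clj may check further conditions that this review does not list.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Invoice:
    invoice_id: str
    needs_human_review: bool


def exportable(invoice: Invoice) -> bool:
    """Mirror of the described check: export only when no review is pending."""
    return not invoice.needs_human_review


def select_for_export(invoices: Iterable[Invoice],
                      user_confirmed: bool) -> List[Invoice]:
    """Return invoices eligible for export; none without explicit confirmation."""
    if not user_confirmed:
        return []
    return [invoice for invoice in invoices if exportable(invoice)]
```

The user_confirmed flag stands in for the hx-confirm dialog: without it, nothing is selected, and with it, only invoices that no longer need review pass through.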


Confidence Assessment

| Finding | Confidence | Basis |
|---|---|---|
| MFA/TOTP not configured | High | CDK source + PRD explicitly says "Out of Scope" |
| Session duration is 24h, not 15min/8h | High | session_max_age_seconds = 86400 in foundation_stack.py and config.edn |
| No WAF, no CloudFront, no API Gateway | High | Exhaustive search of all CDK imports and constructs |
| EC2 in public subnet | High | SubnetType.PUBLIC in compute_stack.py |
| No IDS/IPS, no VPN/bastion | High | Exhaustive search, zero matches |
| No HSTS, no cert pinning | High | Exhaustive grep of src/ |
| RDS multi_az=False, single EC2 instance | High | Direct CDK parameters |
| No cross-region backups | High | Zero replication config in any stack |
| 14-day backup retention, not 30 | High | Duration.days(14) in data_stack.py |
| No per-tenant encryption keys | High | Single KMS key construct; future-improvements.md confirms planned |
| No PII masking before AI calls | High | Extraction prompt requests IBANs; grep finds no masking code |
| SendGrid not used | High | Zero matches; SES confirmed in deps and code |
| OpenAI not used | High | Zero matches across entire repo |
| CloudTrail not in CDK | High | Zero matches in infra/. May exist at account level outside CDK |
| No formal RBAC | High | Single boolean flag, no role definitions |
| needs_human_review gates DATEV export | High | Direct code in exportable? predicate |
| Employee measures (§203, training, offboarding) | Unknown | Organizational/procedural claims not verifiable from code |
| SLA 99.5% | Unknown | Contractual term, not verifiable from code |
| Incident response procedures | Unknown | Procedural claims not verifiable from code |

Retractions

None. All claims in this review are supported by direct code evidence.

Recommendations

Critical (security/compliance gaps)

  1. Implement PII masking before AI API calls -- IBANs, bank details, VAT IDs, and tax numbers are sent unmasked to US-based AI providers. For §203 StGB compliance, field-level masking/tokenization should be implemented.
  2. Enable MFA -- No multi-factor authentication is configured. Evaluate Cognito MFA or IdP-level enforcement.
  3. Add HSTS headers -- No HTTP security headers are set at all. Add HSTS, X-Frame-Options, Content-Security-Policy at minimum.
  4. Deploy WAF -- No web application firewall. Listed in future-improvements.md but critical for a platform handling financial data.
  5. Implement application-level rate limiting -- No rate limiting on any endpoint. Risk of abuse and denial-of-service.
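
For the rate-limiting recommendation in item 5, a token bucket is one common approach. The Python sketch below is an assumption-laden illustration rather than a proposal for the actual middleware: the class name, the parameters, and the injectable clock are hypothetical, and per-client keying as well as state shared across instances are left out.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens/s."""

    def __init__(self, capacity: int, rate: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; return whether the request may pass."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(float(self.capacity), self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Under these assumptions, TokenBucket(capacity=100, rate=100 / 60) would approximate the "100 req/min" limit that the original document claimed; a request for which allow() returns False would typically be answered with HTTP 429.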

High Priority

  1. Extend log retention -- Current max is 30 days. Financial/audit requirements typically need 12+ months.
  2. Enable CloudTrail -- Infrastructure API audit trail not configured in CDK.
  3. Implement general audit trail -- User actions in the main app (document approvals, settings changes, logins) are not audited.
  4. Configure branch protection -- No enforced code review before deployment.
  5. Add SAST/DAST to CI/CD -- No security scanning in the build pipeline.
  6. Move EC2 to private subnets -- Application instances are in public subnets. Add NAT Gateway.
  7. Enable RDS Multi-AZ -- Currently single-AZ with no failover.

Medium Priority

  1. Define RTO/RPO targets -- No documented recovery objectives.
  2. Extend backup retention -- 14 days may be insufficient for compliance.
  3. Evaluate cross-region backup replication -- No disaster recovery across regions.
  4. Automate dependency scanning -- nvd-clojure exists but is not on a schedule.
  5. Implement per-tenant encryption keys -- Currently one shared KMS key.
  6. Add S3 Object Lock for audit logs -- Tamper-proof storage not configured.
  7. Establish staging environment -- Only prod exists in AWS.
  8. Add PII filtering to log output -- VAT numbers appear in log statements.