O1 TOM Documentation -- Codebase Verification Review

Date: 12.04.2026
Reviewer: Automated codebase analysis
Scope: Verified every technical claim in O1 v4.0 against the Orcha codebase (orcha/src/, orcha/infra/, orcha/resources/) and CDK infrastructure

Sources Consulted

Source	Key findings area
`infra/stacks/foundation_stack.py`	VPC, Cognito, S3, SQS, SES, KMS, SSM
`infra/stacks/data_stack.py`	RDS PostgreSQL configuration
`infra/stacks/compute_stack.py`	ALB, EC2, ASG, CloudWatch agent, security groups
`infra/stacks/ops_stack.py`	CloudWatch alarms, SNS, CodePipeline, log groups
`infra/future-improvements.md`	Planned but not-yet-deployed features
`infra/app.py`	Environment/region configuration
`src/com/getorcha/app/http/middleware/auth.clj`	Session handling, auth middleware
`src/com/getorcha/admin/http/middleware/auth.clj`	Admin auth middleware
`src/com/getorcha/ai/llm.clj`	AI provider HTTP clients
`src/com/getorcha/workers/ap/ingestion/extraction.clj`	Extraction prompts sent to LLMs
`src/com/getorcha/notifications.clj`	Slack/Teams/SES notification delivery
`src/com/getorcha/link/api/middleware.clj`	API request audit logging
`resources/com/getorcha/config.edn`	Runtime configuration
`resources/migrations/init.sql`	Database schema
`.prd/authentication.md`	Authentication PRD (MFA scope)
`buildspec.yml`	CI/CD build specification
`deps.edn`	Dependencies and aliases
Full-text grep across `src/`, `infra/`	Pattern matching for specific claims
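
The "zero matches" evidence used throughout this review comes from pattern searches of this kind. A minimal sketch of such a search follows, assuming a plain recursive scan; the function name `find_matches` and the suffix list are illustrative and not part of the Orcha repo or of the tooling actually used.

```python
import re
from pathlib import Path


def find_matches(root, patterns, suffixes=(".py", ".clj", ".edn", ".md", ".yml", ".sql")):
    """Return {pattern: [(path, line_number), ...]} for case-insensitive regex hits.

    Hypothetical helper: only files under `root` with one of `suffixes` are scanned.
    """
    compiled = {p: re.compile(p, re.IGNORECASE) for p in patterns}
    hits = {p: [] for p in patterns}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            for pattern, regex in compiled.items():
                if regex.search(line):
                    hits[pattern].append((str(path), line_number))
    return hits
```

A call such as `find_matches("infra", ["aws_wafv2", "guardduty"])` returns an empty list per pattern when nothing matches. Short patterns like `ids` or `ips` would likely need word boundaries (for example `\bids\b`) to avoid substring hits.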

Changes Made

Section 2.2: System Access Control -- MAJOR REWRITE

7 false claims corrected:

Original Claim	Actual State	Evidence
"MFA via AWS Cognito"	MFA is OFF (CDK default). PRD explicitly says "MFA: Optional (can enable later)" and lists MFA as "Out of Scope"	`foundation_stack.py`: no `mfa` parameter on either UserPool. `.prd/authentication.md` line 74, line 812
"TOTP as second factor"	Not configured anywhere	Zero matches for TOTP, software_token across all files
"Password min 12 chars + complexity"	No custom password policy set. Irrelevant: all users authenticate via Google/Microsoft OAuth, never setting a Cognito password	`foundation_stack.py`: no `password_policy` parameter. Both pools use `supported_identity_providers=[GOOGLE]`
"15 min inactivity timeout"	Fixed 24-hour session cookie. No inactivity detection	`foundation_stack.py` line 77: `session_max_age_seconds = 86400`. SSM parameter description: "24h"
"Max session 8 hours"	24 hours (see above)	Same evidence
"Last 5 passwords not reused"	Not configurable in Cognito; moot since OAuth-only auth	No `password_policy` in CDK; Cognito has no password history feature
"Lockout after 6 failed attempts"	No Advanced Security add-on enabled; moot since OAuth-only	No `advanced_security_mode` in CDK

Action: Rewrote section to accurately describe OAuth-based authentication via Google and Microsoft, 24-hour session duration, KMS-encrypted refresh tokens. Added TODO for MFA evaluation.
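
To make the gap between the original claim and the actual behaviour concrete, the sketch below contrasts a fixed-lifetime session (what the 24-hour cookie amounts to) with the claimed 15-minute inactivity timeout plus 8-hour maximum. It is an illustration only: the `Session` type and both function names are hypothetical, and the real session handling lives in the Clojure auth middleware.

```python
from dataclasses import dataclass


@dataclass
class Session:
    created_at: float    # seconds since epoch when the session was issued
    last_seen_at: float  # seconds since epoch of the most recent request


def fixed_lifetime_valid(session, now, max_age_seconds=86400):
    """Fixed lifetime: valid until max age elapses, regardless of activity."""
    return now - session.created_at < max_age_seconds


def idle_timeout_valid(session, now, max_age_seconds=8 * 3600, idle_seconds=15 * 60):
    """Claimed policy: 8-hour hard cap and 15-minute inactivity limit."""
    within_cap = now - session.created_at < max_age_seconds
    recently_active = now - session.last_seen_at < idle_seconds
    return within_cap and recently_active
```

Under the fixed-lifetime rule a session that has been idle for an hour is still accepted; under the originally claimed policy it would be rejected.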

Section 2.3: Data Access Control -- CORRECTED

Original claims:

"Documented role matrix with defined permissions per role" -- FALSE. Single is_super_admin boolean flag, no named roles
"Quarterly access reviews" -- NO EVIDENCE in code or docs
"Audit trail for all permission changes" -- NO EVIDENCE

Actual state: Tenant isolation via tenant_membership, super-admin boolean flag, separate admin Cognito pool restricted to @getorcha.com, API key JSONB permissions.

Evidence: init.sql line 77: is_super_admin BOOLEAN NOT NULL DEFAULT false. admin/http/middleware/auth.clj lines 48-52: @getorcha.com domain check.

Action: Rewrote to describe actual access model. Added TODO for formal RBAC and periodic reviews.

Section 2.4: Network Security -- MAJOR REWRITE

6 false claims removed:

Original Claim	Actual State	Evidence
"Private subnets and isolated network segmentation"	EC2 instances run in PUBLIC subnets (`nat_gateways=0`). Only RDS is in private isolated subnets	`foundation_stack.py` line 91: `nat_gateways=0`. `compute_stack.py` line 543: `subnet_type=ec2.SubnetType.PUBLIC`
"AWS WAF against DDoS and injection attacks"	WAF not configured. Listed in `future-improvements.md` as "Priority: Medium"	Zero `aws_wafv2` imports in any CDK stack. `future-improvements.md`: "Consider adding when: Processing customer documents at scale"
"API Gateway with rate limiting (100 req/min)"	No API Gateway used. Architecture is ALB -> EC2. No rate limiting anywhere	Zero `aws_apigateway` imports. Zero rate-limiting middleware in `src/`
"VPN and bastion host"	Neither exists. SSM Session Manager used instead	No VPN gateway or bastion instance in CDK. CloudWatch log group `/v1-orcha/ssm-sessions` confirms SSM usage
"IDS/IPS with continuous monitoring"	Not configured. No GuardDuty, Inspector, or traffic mirroring	Zero matches for guardduty, inspector, ids, ips in infra/
"CloudFront CDN with edge DDoS protection"	CloudFront not configured	Zero `aws_cloudfront` imports in any CDK stack

Action: Rewrote to describe actual architecture: VPC with public/private subnets, ALB with HTTPS, SSM Session Manager, security groups. Moved unimplemented items to a note referencing future-improvements.md.

Section 3.1: Transfer Control -- CORRECTED

3 false claims removed:

Original Claim	Actual State	Evidence
"HSTS headers with long validity"	No HSTS headers set. No security header middleware at all	Zero matches for `Strict-Transport`, `HSTS`, `x-frame-options`, `content-security-policy` in src/
"Certificate pinning for critical API connections"	Not implemented. Default JVM TLS with hato HTTP client	Zero matches for `pinning`, `TrustManager`, `SSLContext` in src/
"Perfect Forward Secrecy through ephemeral key exchanges"	No explicit TLS policy set. AWS ALB default may include ECDHE suites, but this is not an intentional configuration	No TLS security policy named in `compute_stack.py` ALB listener config

Action: Simplified to describe actual TLS measures: ALB HTTPS termination, HTTP redirect, S3 SSL enforcement, secure cookies.
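
For reference, the missing security headers amount to a small response-decorating step. The sketch below shows the logic in Python for illustration only; the application itself is Clojure, the function name is hypothetical, and the header values are common starting points rather than a reviewed policy for Orcha.

```python
# Illustrative defaults; the values are assumptions, not taken from the codebase.
SECURITY_HEADERS = {
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
    "X-Frame-Options": "DENY",
    "Content-Security-Policy": "default-src 'self'",
}


def with_security_headers(headers):
    """Return a copy of `headers` with any missing security headers added.

    Header names are compared case-insensitively; values already set by a
    handler are left untouched.
    """
    present = {name.lower() for name in headers}
    result = dict(headers)
    for name, value in SECURITY_HEADERS.items():
        if name.lower() not in present:
            result[name] = value
    return result
```

A restrictive Content-Security-Policy can break inline scripts and styles, so the policy value would need testing against the actual UI before rollout.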

Section 3.2: Input Control -- MAJOR REWRITE

5 false claims corrected:

Original Claim	Actual State	Evidence
"AWS CloudTrail for all infrastructure operations"	CloudTrail not configured in CDK	Zero matches for `cloudtrail` in infra/
"12 months retention of all audit logs"	Max 30 days (SSM/CI logs). App/admin logs: 14 days	`compute_stack.py` lines 50,56,62,68: `retention_in_days: 14`. `ops_stack.py` line 121: `ONE_MONTH`
"Tamper-proof log storage with S3 Object Lock"	Object Lock not configured on any bucket	No `object_lock_enabled` on any S3 bucket in `foundation_stack.py`
"Automated anomaly detection in audit logs"	Only cost anomaly detection exists (spending, not security)	`ops_stack.py` lines 571-587: `CfnAnomalyMonitor` for cost only
"All data access logged with timestamp, user ID, and access purpose"	Only Link API requests and DATEV exports are audited. No general user-action audit trail	`link/api/middleware.clj` logs API requests. No audit table for main app user actions

Actual logging: API request log (Link API), DATEV export audit, email processing records, CloudWatch logs (14-30 day retention), ingestion statistics, MDC context in all log entries.

Action: Rewrote to accurately describe actual logging capabilities and their retention periods.

Section 4: Availability -- CORRECTED

4 false/misleading claims:

Original Claim	Actual State	Evidence
"Multi-AZ deployment"	RDS: `multi_az=False`. EC2: ASG `min=max=desired=1`	`data_stack.py` line 112: `multi_az=False`. `compute_stack.py` lines 553-554: capacity=1
"Redundant database configuration with failover"	No standby replica, no automatic failover	Directly follows from `multi_az=False`
"DDoS protection via AWS Shield and CloudFront"	Only Shield Standard (free, automatic). No CloudFront	No `aws_shield` CDK module. No CloudFront
"Rate limiting and throttling at API level"	No rate limiting in any middleware	Zero rate-limiting code in request handling paths

Action: Described actual state: single-instance ASG, ALB across 2 AZs, no Multi-AZ DB, Shield Standard only. Added note acknowledging these as current trade-offs.

Section 5: Recoverability -- CORRECTED

4 false claims:

Original Claim	Actual State	Evidence
"RTO: maximum 4 hours / RPO: maximum 1 hour"	No RTO/RPO targets documented anywhere	Zero matches for RTO, RPO across entire repo
"Backup encryption with separate KMS keys"	Default AWS-managed keys, not separate KMS keys	`data_stack.py`: no `storage_encryption_key` parameter. `foundation_stack.py`: `S3_MANAGED` encryption
"Cross-region backup replication"	Not configured	Zero matches for cross-region replication in any CDK stack
"Backup retention at least 30 days"	14 days for RDS backups	`data_stack.py` line 116: `backup_retention=Duration.days(14)`

Action: Described actual backup configuration: daily RDS backups, 14-day retention, AWS-managed encryption. Added TODOs.

Section 6.1: Encryption -- CORRECTED

2 false claims:

Original Claim	Actual State	Evidence
"AES-256-GCM via AWS KMS"	KMS used for specific application fields only (refresh tokens, DATEV creds). S3 uses SSE-S3. RDS uses aws/rds managed key. GCM mode claim unverifiable	`foundation_stack.py` line 966: one shared KMS key. S3: `S3_MANAGED`. RDS: `storage_encrypted=True` without custom key
"Separate encryption keys per customer"	One shared KMS key for all tenants	Single `db-secrets-key` construct. `future-improvements.md` discusses per-tenant as future goal

Action: Described actual encryption architecture accurately, distinguishing between managed-key storage encryption and application-level KMS field encryption.

Section 6.2: Pseudonymization in AI -- CORRECTED

Original claim: "Sensitive fields (bank data, tax numbers) are masked/redacted before transmission to AI services"

Finding: FALSE. The extraction prompt in extraction.clj explicitly instructs the LLM to extract IBANs, BICs, bank names, VAT IDs, and tax numbers from documents. Full document text is sent without any masking.

Evidence: The extraction prompt includes "7. IBAN & BIC: CRITICAL payment fields -- always extract the supplier's bank details." In addition, uncertain_validations.clj passes full structured data, including the IBAN, to the LLM with the instruction "look at the actual IBAN printed on the invoice."

Action: Removed false masking claim. Documented current state honestly. Added TODO.
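
One possible shape for the masking TODO is reversible tokenization: replace each IBAN with a placeholder before the text leaves the system and restore it afterwards. The sketch below is an assumption about how this could look, not existing Orcha code; the names `mask_ibans` and `unmask` are hypothetical, and the regex is a loose structural match without checksum validation.

```python
import re

# Loose IBAN shape: country code, two check digits, then groups of four
# alphanumerics with optional single spaces. No mod-97 checksum is verified.
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}(?:\s?[A-Z0-9]{4}){3,7}(?:\s?[A-Z0-9]{1,3})?\b")


def mask_ibans(text):
    """Replace each IBAN-like match with a token; return (masked_text, mapping)."""
    mapping = {}

    def _replace(match):
        token = f"[IBAN_{len(mapping) + 1}]"
        mapping[token] = match.group(0)
        return token

    return IBAN_RE.sub(_replace, text), mapping


def unmask(text, mapping):
    """Restore original values for every token present in `text`."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```

Because the extraction step exists precisely to read bank details from the document, any masking approach would have to be weighed against extraction accuracy. The pattern can also over-match when an IBAN is directly followed by an uppercase four-character word.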

Section 7: Secure Development -- CORRECTED

4 overclaimed items:

Original Claim	Actual State	Evidence
"Four-eyes principle (two independent reviewers)"	No branch protection rules. No `.github/` directory. Pipeline has no approval stage	No `.github/` dir. `ops_stack.py` pipeline: Source -> Build -> Deploy, no approval
"SAST/DAST in CI/CD"	Not configured. `buildspec.yml` only builds and pushes Docker image	`buildspec.yml` contains no security scanning commands
"Dependency scanning weekly"	`nvd-clojure` exists but not scheduled	`nvd-clojure.edn` with `fail-threshold 7` exists. No cron/scheduled job
"Strict separation of dev/test/prod"	Only `prod` cloud env. Dev uses Docker. No staging	`infra/app.py`: `if env_name not in ("prod",):`

Action: Described actual development practices: PR-based workflow (without enforced review), dependency scanning tools available but not automated, single cloud environment.

Section 8: Vulnerability Management -- SIMPLIFIED

Original claims without evidence: Bug bounty program, patching SLAs, automated security updates with tested rollout, documentation of all vulnerabilities.

Actual state: nvd-clojure for dependency scanning, RDS minor version auto-upgrade.

Action: Simplified to describe what actually exists. Added TODOs.

Section 10: Sub-Processors -- CORRECTED

Original claim: Lists "SendGrid (Email delivery)"

Finding: FALSE. SendGrid is not used anywhere. Zero matches for sendgrid across entire codebase. Email is sent/received via AWS SES.

Evidence: deps.edn includes software.amazon.awssdk/sesv2. aws.clj implements send-email! via SES v2.

Action: Replaced sub-processor list with accurate categories referencing O3 document.

Section 14: AI-Specific Measures -- MAJOR REWRITE

Key corrections:

Original Claim	Actual State	Evidence
"OpenAI GPT" mentioned as AI system	OpenAI not used. Zero references in codebase	Exhaustive grep: zero matches for openai, gpt-4, gpt-3
"Sensitive fields masked/redacted before AI"	FALSE (same as Section 6.2)	Extraction prompt explicitly requests IBANs, bank details
"Complete documentation of all AI inputs and outputs"	Outputs/token counts stored. Prompt inputs not stored	`ingestion.clj`: stores `extraction-response`, token counts; no input prompts
"Audit log of all AI requests with timestamp and user"	No AI-specific audit log with user attribution	`api_request_log` covers REST API, not LLM calls

Confirmed: "No fully automated decisions with legal effect" -- TRUE. The needs_human_review flag gates DATEV export, and the UI requires an explicit hx-confirm before export.

Evidence: app/http/ap.clj: exportable? predicate checks (not needs-human-review). DATEV export has hx-confirm "Export N invoice(s) to DATEV?".

Action: Listed actual AI systems with specific models. Described actual data minimization measures (embedding text excludes IBAN, triage truncates email body). Described actual audit capabilities. Flagged masking gap.

Confidence Assessment

Finding	Confidence	Basis
MFA/TOTP not configured	High	CDK source + PRD explicitly says "Out of Scope"
Session duration is 24h, not 15min/8h	High	`session_max_age_seconds = 86400` in foundation_stack.py and config.edn
No WAF, no CloudFront, no API Gateway	High	Exhaustive search of all CDK imports and constructs
EC2 in public subnet	High	`SubnetType.PUBLIC` in compute_stack.py
No IDS/IPS, no VPN/bastion	High	Exhaustive search, zero matches
No HSTS, no cert pinning	High	Exhaustive grep of src/
RDS multi_az=False, single EC2 instance	High	Direct CDK parameters
No cross-region backups	High	Zero replication config in any stack
14-day backup retention, not 30	High	`Duration.days(14)` in data_stack.py
No per-tenant encryption keys	High	Single KMS key construct; future-improvements.md confirms planned
No PII masking before AI calls	High	Extraction prompt requests IBANs; grep finds no masking code
SendGrid not used	High	Zero matches; SES confirmed in deps and code
OpenAI not used	High	Zero matches across entire repo
CloudTrail not in CDK	High	Zero matches in infra/. May exist at account level outside CDK
No formal RBAC	High	Single boolean flag, no role definitions
`needs_human_review` gates DATEV export	High	Direct code in `exportable?` predicate
Employee measures (§203, training, offboarding)	Unknown	Organizational/procedural claims not verifiable from code
SLA 99.5%	Unknown	Contractual term, not verifiable from code
Incident response procedures	Unknown	Procedural claims not verifiable from code

Retractions

None. Every finding in this review is supported by direct code evidence; claims that cannot be verified from code are marked Unknown in the Confidence Assessment.

Recommended Actions (TODOs)

Critical (security/compliance gaps)

Implement PII masking before AI API calls -- IBANs, bank details, VAT IDs, and tax numbers are sent unmasked to US-based AI providers. For §203 StGB compliance, field-level masking/tokenization should be implemented.
Enable MFA -- No multi-factor authentication is configured. Evaluate Cognito MFA or IdP-level enforcement.
Add HSTS headers -- No HTTP security headers are set at all. Add HSTS, X-Frame-Options, Content-Security-Policy at minimum.
Deploy WAF -- No web application firewall is configured. It is listed in future-improvements.md as "Priority: Medium", but is critical for a platform handling financial data.
Implement application-level rate limiting -- No rate limiting on any endpoint. Risk of abuse and denial-of-service.
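
As a rough illustration of the rate-limiting item, a token bucket is one common technique. The sketch below is in-memory and single-process, so it only demonstrates the algorithm; the class name and parameters are hypothetical, and a real deployment would need per-client buckets and shared state once more than one instance runs.

```python
class TokenBucket:
    """Minimal token bucket: `capacity` burst size, refilled continuously."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = float(capacity)
        self.refill_per_second = float(refill_per_second)
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.updated_at = now

    def allow(self, now):
        """Return True and consume one token if the request may proceed."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = max(self.updated_at, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The caller passes the current time explicitly (for example `time.monotonic()`), which keeps the class deterministic and easy to test.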

High Priority

Extend log retention -- Current max is 30 days. Financial/audit requirements typically need 12+ months.
Enable CloudTrail -- Infrastructure API audit trail not configured in CDK.
Implement general audit trail -- User actions in the main app (document approvals, settings changes, logins) are not audited.
Configure branch protection -- No enforced code review before deployment.
Add SAST/DAST to CI/CD -- No security scanning in the build pipeline.
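Move EC2 to private subnets -- Application instances currently run in public subnets. Moving them requires outbound connectivity such as a NAT Gateway (the VPC is deployed with nat_gateways=0).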
Enable RDS Multi-AZ -- Currently single-AZ with no failover.

Medium Priority

Define RTO/RPO targets -- No documented recovery objectives.
Extend backup retention -- 14 days may be insufficient for compliance.
Evaluate cross-region backup replication -- No disaster recovery across regions.
Automate dependency scanning -- nvd-clojure exists but is not on a schedule.
Implement per-tenant encryption keys -- Currently one shared KMS key.
Add S3 Object Lock for audit logs -- Tamper-proof storage not configured.
Establish staging environment -- Only prod exists in AWS.
Add PII filtering to log output -- VAT numbers appear in log statements.