Debug Session: orcha autonomous bug hunt

Started: 2026-05-01 20:40
Scope: src/ + test/ (215 + 103 + 3 Clojure files)
Iterations: 20 (bounded)
Severity threshold: medium+
Mode: report-only

Confirmed Bugs

[MEDIUM] No backoff in SQS polling loops on persistent error

Location: Repeated across 5 worker namespaces:
- src/com/getorcha/workers/document_output.clj:236
- src/com/getorcha/workers/ap/acquisition.clj (catch in polling loop)
- src/com/getorcha/workers/ap/ingestion.clj (~line 1159+)
- src/com/getorcha/workers/diagnostics_recompute.clj (~line 108+)
- src/com/getorcha/workers/ap/processors/matching/worker.clj (~line 218+)

Pattern (all five):

(while @polling?
  (try
    (let [messages (aws/receive-messages ...)]
      (doseq [^Message message messages] (.submit executor ...)))
    (catch Exception e
      (when @polling?
        (log/error e "Error in X polling loop, will retry")))))

Hypothesis: On persistent AWS errors (network outage, IAM rotation, region-wide event), the polling loop spins with no Thread/sleep between failures. aws/receive-messages throws immediately on error; the long-poll wait happens server-side, so a request that never reaches SQS never waits.
Evidence: aws.clj:99-117 receive-messages calls .receiveMessage which throws SdkException on transport errors before any wait; the outer while immediately re-enters.
Impact: During an AWS outage, 5 workers spin at full CPU, generating one error log per loop tick (~thousands/sec). Cascading log volume can overwhelm CloudWatch budget; CPU starves co-located components.
Root cause: No backoff (exponential or otherwise) between failed receive attempts.
Suggested fix: Add (Thread/sleep (long backoff-ms)) in the catch with a small exponential schedule (e.g., 1s → 5s → 30s capped). Reset on successful receive. Extract into a shared helper (com.getorcha.workers.util/with-polling-backoff) since the pattern repeats verbatim.
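
A minimal sketch of the suggested helper, assuming the 1s → 5s → 30s schedule above. poll-with-backoff!, backoff-ms, and the option keys are hypothetical names, not existing functions in com.getorcha.workers.util. poll-fn stands in for the receive-and-submit body of each worker's loop, and sleep-fn is injectable so tests do not have to sleep.

```clojure
(def default-schedule-ms
  "Delays for consecutive failures; the last entry is the cap."
  [1000 5000 30000])

(defn backoff-ms
  "Delay for the nth consecutive failure (1-based), capped at the last entry."
  [schedule failures]
  (nth schedule (min (dec failures) (dec (count schedule)))))

(defn poll-with-backoff!
  "Calls (poll-fn) repeatedly while @polling? is true. After each exception it
  sleeps according to the schedule; any successful call resets the schedule.
  An InterruptedException from sleep-fn propagates and ends the loop."
  [polling? poll-fn {:keys [schedule sleep-fn on-error]
                     :or   {schedule default-schedule-ms
                            sleep-fn #(Thread/sleep (long %))
                            on-error (fn [_e _delay-ms] nil)}}]
  (loop [failures 0]
    (when @polling?
      ;; recur is not allowed inside try, so the try only reports the outcome.
      (let [ok? (try
                  (poll-fn)
                  true
                  (catch Exception e
                    (when @polling?
                      (on-error e (backoff-ms schedule (inc failures))))
                    false))]
        (if ok?
          (recur 0)
          (do
            (when @polling?
              (sleep-fn (backoff-ms schedule (inc failures))))
            (recur (inc failures))))))))
```

Each worker would pass its existing loop body as poll-fn and its log/error call as on-error. Keeping the sleep outside the try means a shutdown interrupt is not swallowed by the catch that handles AWS errors.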

[HIGH] Acquisition SQS message deleted before async handler commits — risk of lost messages

Location: src/com/getorcha/workers/ap/acquisition.clj:121-139
Hypothesis: The acquisition orchestrator submits the work runnable to a virtual-thread executor (lines 122-138) and then immediately calls aws/delete-message! (line 139) in the same iteration. The runnable runs asynchronously; nothing calls .get or otherwise awaits it before the delete. If the worker process crashes between the SQS delete and the runnable's first durable side effect (claim-inbox! upsert into ap-doc-source-email-pending-sync for :oauth-sync, or record-processed! for :ses), the trigger is lost.

Evidence:

(when (not= ::ignored q-message)
  (.submit executor (mdc/wrap-runnable (fn [] ...))))   ; non-blocking
(aws/delete-message! sqs-client queue-url message))     ; runs immediately

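Executors/newVirtualThreadPerTaskExecutor has no bounded queue and does not reject tasks while it is open, so .submit returns at once; the task's virtual thread has been started but may not have executed any of the handler yet.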

Impact:
- :oauth-sync (Gmail/Microsoft push webhook): a lost trigger means the inbox is not synced until the next webhook (the next inbound email) arrives. Legitimate emails sit unprocessed in the user's inbox indefinitely if no further mail arrives.
- :ses (direct email ingestion via SES → S3 → SQS): a lost trigger leaves the .eml in S3 (7-day lifecycle) but unprocessed; recovery requires manual replay tooling. If no such tooling exists, the email is permanently missed, and at-least-once delivery degrades to at-most-once.
Root cause: The standard SQS pattern is delete-on-success. This codebase's delete-before-handler ordering decouples SQS retry from in-process failures, so a failed or interrupted handler is never retried.
Suggested fix: Move aws/delete-message! inside the runnable's success path (after claim-inbox! / record-processed! commits). Visibility timeout will keep messages safe during normal processing; failed work returns to the queue for natural retry. Pair with a per-message visibility heartbeat for long-running handlers (the document-output worker already does this — pattern can be lifted).
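
The reordering above can be sketched without the AWS client. submit-then-ack! is a hypothetical helper: handle! stands in for the existing runnable body (through claim-inbox! / record-processed!), ack! for aws/delete-message!, and on-failure for logging. The real change would also need to keep mdc/wrap-runnable and set the visibility timeout to cover the handler's duration.

```clojure
(import 'java.util.concurrent.ExecutorService)

(defn submit-then-ack!
  "Runs (handle! message) on the executor and calls (ack! message) only after
  handle! returns normally. If handle! throws, the message is never acked, so
  the queue redelivers it once its visibility timeout expires."
  [^ExecutorService executor message handle! ack! on-failure]
  (let [^Runnable task (fn []
                         (try
                           (handle! message)
                           (ack! message)
                           (catch Throwable t
                             ;; Reached when handle! or ack! throws.
                             (on-failure message t))))]
    (.submit executor task)))
```

With this ordering a message can be handled more than once (for example when ack! itself fails), so the handler's durable writes need to be idempotent.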

[HIGH] Output dispatch jobs orphaned when app crashes between commit and SQS send (no sweeper)

Location:
- src/com/getorcha/app/http/documents/view/approval.clj:124-156 — final approval flow
- src/com/getorcha/app/document_output.clj:75-87 — create-and-enqueue-job! (used by manual re-export from invoice view)
Hypothesis: A document_output_job row is INSERTed inside a DB transaction (status=pending); after the transaction commits, the app calls enqueue-job! (SQS send-message) outside the transaction. If the app process dies between commit and SQS send, the row exists in pending state but no SQS message is ever published. The worker only processes jobs that arrive via SQS — claim-job! is invoked with the job-id from the message body — so the orphaned row is never picked up.
Evidence:
- approval.clj:130-141 creates the job inside with-transaction tx, returning job from the tx body.
- approval.clj:144-155 calls enqueue-job! AFTER the tx (line 162 begins outside [tx db-pool]).
- document_output.clj:75-87 (create-and-enqueue-job!) follows the same pattern: (create-job! db-pool ...) commits immediately, then (enqueue-job! ...) runs separately. The try/catch only handles synchronous enqueue-job! errors — not process death.
- Searched for sweepers/reconcilers in src/com/getorcha/workers/ and src/com/getorcha/app/: only db.run/reap-stuck-running! exists for document_processor_run rows. No equivalent reaper exists for document_output_job rows stuck in pending without an SQS message.
Impact: When OOM/SIGKILL/container restart hits between transaction commit and SQS send, the user's UI shows "approved" but the DATEV export never happens. Operators must manually re-export. Silent data flow break — the kind of bug that's invisible until a customer complains.
Root cause: Classic dual-write problem, the failure mode the transactional-outbox pattern exists to prevent. The DB and message queue are independent durable stores; a write to one followed by a write to the other is not atomic.
Suggested fix (pick one):
1. Outbox pattern: Treat the document_output_job row itself as the outbox. Add a periodic sweeper in workers/document_output.clj that scans for status='pending' rows older than e.g. 60s and re-enqueues their id to SQS. Idempotent: the worker's claim-job! is already conflict-safe (status='pending' OR stale-running), so duplicate SQS messages are absorbed.
2. PG NOTIFY → re-enqueue: The document_output_job table already fires a NOTIFY on insert (per app/ingestion.clj:291-303). Add a small listener on that channel that calls enqueue-job!. This ensures the SQS send happens after commit even if the original caller crashes, but any notification sent while the listener is down is lost, so a startup sweep is still needed.
3. Crash-recovery sweep at boot: On worker startup, scan for pending rows older than boot-grace-seconds and re-enqueue. Cheaper than a periodic sweeper, but orphans created while the worker stays up wait until its next restart.
Test gap: test/com/getorcha/workers/document_output_test.clj does not cover the orphaned-pending case.
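
Option 1 can be sketched as a pure selection step plus a re-enqueue loop. All names here are hypothetical: list-pending-jobs stands in for a query over document_output_job rows, enqueue! for enqueue-job!, and the :status / :created-at keys for whatever columns the table actually has. The 60-second threshold is the example value from option 1.

```clojure
(import '(java.time Duration Instant))

(defn stale-pending?
  "True when the job is still pending and was created more than min-age ago."
  [^Instant now ^Duration min-age {:keys [status created-at]}]
  (and (= "pending" status)
       (.isBefore ^Instant created-at (.minus now min-age))))

(defn sweep-orphaned-jobs!
  "Re-enqueues the id of every stale pending job. A failed enqueue is reported
  under :failed and left for the next sweep; it does not abort the pass."
  [{:keys [list-pending-jobs enqueue! now min-age]}]
  (reduce (fn [acc {:keys [id]}]
            (try
              (enqueue! id)
              (update acc :enqueued conj id)
              (catch Exception _
                (update acc :failed conj id))))
          {:enqueued [] :failed []}
          (filter #(stale-pending? now min-age %) (list-pending-jobs))))
```

Passing now and min-age in as arguments keeps the selection logic testable with fixed timestamps, which is also one way to cover the orphaned-pending case named in the test gap. A scheduler would call sweep-orphaned-jobs! with (Instant/now) and (Duration/ofSeconds 60) on each tick.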

[MEDIUM] SSELooper leaks subscriptions when initial "connected" write fails

Location: src/com/getorcha/app/http/sse.clj:60-91
Hypothesis: The looper writes connected as the first SSE frame at line 64. If that write fails (client disconnected between request handling and Jetty starting to write the response body), the (when ...) test is false, the io-thread is never started, and the (finally (clean-up-fn) ...) block at lines 88-91 never runs. Whatever subscription or state the caller registered before invoking the looper is never cleaned up.

Evidence:

(write-body-to-stream [_ _response output-stream]
  (let [writer (io/writer output-stream)]
    (when (write! writer "connected" "{}")     ; if false, fall-through with no cleanup
      (a/io-thread
       (try
         (loop [...] ...)
         (finally
           (when clean-up-fn (clean-up-fn))
           (a/close! input-ch)))))))

write! returns false on IOException (lines 55-57). When that happens before the io-thread starts, clean-up-fn is never called.

Impact: Each disconnected-before-stream SSE attempt leaks one subscription. With document-events publisher (app/ingestion.clj:330+) using a/pub over a channel of capacity 100, leaked subs slowly fill capacity → eventually the publisher blocks on a/put! → backpressure spreads to the source NOTIFY listener thread, which can starve real-time updates for all users.
Root cause: Clean-up logic is nested inside the success branch of the initial write check.
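Suggested fix: Run the cleanup on the failed-write branch as well, and leave the success-path cleanup inside the io-thread. Restructure as:
```
(let [writer (io/writer output-stream)]
  (if (write! writer "connected" "{}")
    (a/io-thread
     (try
       (loop [...] ...)
       (finally
         (when clean-up-fn (clean-up-fn))
         (a/close! input-ch))))
    ;; Initial write failed: no io-thread will run, so clean up here.
    (do
      (when clean-up-fn (clean-up-fn))
      (a/close! input-ch))))
```
This runs clean-up-fn once on either path out of write-body-to-stream. Wrapping the whole form in an outer try/finally would not work: a/io-thread returns as soon as the thread is launched, so an outer finally would call clean-up-fn while the loop is still streaming and tear down the subscription of a healthy connection. Note: on the success path (a/close! input-ch) stays inside the io-thread's finally, since the io-thread owns the channel for the duration of the loop. Closing it on the failure branch mirrors that finally and assumes nothing else still needs the channel at that point.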