Debug Session Summary — 2026-05-01

Scope: Whole codebase autonomous bug hunt
Iterations: 20 (bounded)
Severity threshold: medium+
Mode: report-only

Score

debug_score = bugs_found * 15
            + hypotheses_tested * 3
            + (files_investigated / files_in_scope) * 40
            + (techniques_used / 7) * 10

= 4 * 15  +  14 * 3  +  (~25/318) * 40  +  (5/7) * 10
= 60 + 42 + 3.1 + 7.1
= ~112
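
The arithmetic above can be reproduced with a short script. This is only a sketch of the formula as written in this report; the function and parameter names are illustrative, and the file count of 25 is the approximate figure quoted above.

```python
def debug_score(bugs_found, hypotheses_tested, files_investigated,
                files_in_scope, techniques_used, total_techniques=7):
    """Sum the four weighted terms of the score formula."""
    coverage = files_investigated / files_in_scope
    technique_ratio = techniques_used / total_techniques
    return (bugs_found * 15
            + hypotheses_tested * 3
            + coverage * 40
            + technique_ratio * 10)


# Inputs taken from this session (files_investigated is approximate).
score = debug_score(bugs_found=4, hypotheses_tested=14,
                    files_investigated=25, files_in_scope=318,
                    techniques_used=5)
print(round(score, 1))  # 112.3
```

The coverage term contributes little here because only about 8% of the files in scope were inspected.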

Findings

| # | Severity | Title | Location |
|---|----------|-------|----------|
| 1 | HIGH | Acquisition SQS message deleted before async handler commits | workers/ap/acquisition.clj:121-139 |
| 2 | HIGH | Output dispatch jobs orphaned when app crashes between commit and SQS send (no sweeper) | app/http/documents/view/approval.clj:124-156, app/document_output.clj:75-87 |
| 3 | MEDIUM | No backoff in SQS polling loops on persistent error (5 workers) | workers/document_output.clj:236, workers/ap/acquisition.clj, workers/ap/ingestion.clj, workers/diagnostics_recompute.clj, workers/ap/processors/matching/worker.clj |
| 4 | MEDIUM | SSELooper leaks subscriptions when initial "connected" write fails | app/http/sse.clj:60-91 |

Full evidence and suggested fixes in findings.md. Disproven hypotheses (10) in eliminated.md. Per-iteration log in debug-results.tsv.
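
For finding 2, the missing piece is a sweeper that re-sends jobs which were committed but never reached the queue. The sketch below is an assumption about what such a sweeper could look like, not the fix recorded in findings.md. It is written in Python for brevity even though the codebase is Clojure, and `DispatchJob`, `sweep_orphaned_jobs`, and the 60-second grace period are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DispatchJob:
    job_id: str
    created_at: float       # epoch seconds when the job row was committed
    enqueued: bool = False  # set once the queue send is known to have succeeded


def sweep_orphaned_jobs(jobs, send, now, grace_seconds=60.0):
    """Re-send jobs older than the grace period that were never enqueued.

    Returns the ids that were re-sent. If send raises, the job stays
    un-enqueued and is picked up again by the next sweep.
    """
    resent = []
    for job in jobs:
        if job.enqueued or now - job.created_at < grace_seconds:
            continue
        send(job.job_id)
        job.enqueued = True
        resent.append(job.job_id)
    return resent


# Usage: "a" was already sent, "b" is orphaned, "c" is still inside the grace period.
sent = []
jobs = [DispatchJob("a", created_at=0.0, enqueued=True),
        DispatchJob("b", created_at=0.0),
        DispatchJob("c", created_at=950.0)]
resent = sweep_orphaned_jobs(jobs, sent.append, now=1000.0)
print(resent)  # ['b']
```

A sweeper like this can send the same job twice (for example, if the original send succeeded but the flag was never written), so the consumer has to tolerate duplicates.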

Common Theme

Three of the four bugs are at-least-once → at-most-once degradations in async pipelines: work that should be retried until it succeeds can instead be dropped permanently after a single failure.

The codebase already shows some awareness of this pattern (the matching worker correctly deletes on success at worker.clj:244; the document-output processor uses with-completion-retry at engine.clj:171-189), so the fix is a matter of consistency rather than discovery: lift the existing patterns into the gaps.
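
The difference between the two orderings can be shown with a toy queue. This is a simplified illustration in Python, not the worker code: `FakeQueue` is a hypothetical in-memory stand-in, and a real SQS consumer would rely on the visibility timeout to make an undeleted message available again.

```python
class FakeQueue:
    """In-memory stand-in for a queue with an explicit delete (acknowledge)."""

    def __init__(self, messages):
        self.messages = list(messages)

    def receive(self):
        return self.messages[0] if self.messages else None

    def delete(self, message):
        self.messages.remove(message)


def process_delete_first(queue, handler):
    """Buggy ordering: the message is gone before the handler finishes."""
    message = queue.receive()
    queue.delete(message)
    handler(message)


def process_delete_on_success(queue, handler):
    """Safe ordering: delete only after the handler returns normally."""
    message = queue.receive()
    handler(message)
    queue.delete(message)


def crashing_handler(message):
    raise RuntimeError("simulated crash mid-task")


def survives_crash(process):
    """Return True if the message is still queued after the handler crashes."""
    queue = FakeQueue(["invoice-1"])
    try:
        process(queue, crashing_handler)
    except RuntimeError:
        pass
    return "invoice-1" in queue.messages


lost = not survives_crash(process_delete_first)
kept = survives_crash(process_delete_on_success)
print(lost, kept)  # True True
```

With delete-first, a crash loses the message; with delete-on-success, it stays available for another attempt, which is the at-least-once behavior the pipeline is meant to have.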

Investigation Techniques Used

  1. Pattern search: grepped for (str "SELECT/INSERT/UPDATE/... and [:raw (str "...") to map the SQL injection surface
  2. Direct inspection: read engine.clj, oauth.clj, acquisition.clj, sse.clj for control flow
  3. Differential: compared the matching worker (correct delete-on-success) with the acquisition worker (delete-before-handler)
  4. Working backwards: traced from the dispatch-job! symptom upward to approval.clj → discovered the missing sweeper
  5. Library source dive: read buddy-sign and buddy-core source to verify that JWT alg confusion is blocked by a type error rather than by design
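
The pattern search in item 1 can be approximated with two regular expressions. The exact grep patterns used in the session are elided in this report, so the keyword list below (including DELETE) is an assumption, and the sample Clojure snippet is invented for illustration.

```python
import re

# Assumed patterns: SQL built with (str "SELECT ...") and [:raw (str "...")].
SQL_CONCAT = re.compile(r'\(str\s+"\s*(SELECT|INSERT|UPDATE|DELETE)\b', re.IGNORECASE)
RAW_CONCAT = re.compile(r'\[:raw\s+\(str\s+"')


def find_sql_concat(source):
    """Return (line_number, line) pairs that match either pattern."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if SQL_CONCAT.search(line) or RAW_CONCAT.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical Clojure source: lines 2 and 5 concatenate strings, line 4 is parameterized.
sample = '''(defn find-user [id]
  (jdbc/query db (str "SELECT * FROM users WHERE id = " id)))
(defn safe [id]
  (jdbc/query db ["SELECT * FROM users WHERE id = ?" id]))
(def clause [:raw (str "created_at > " ts)])
'''
hits = find_sql_concat(sample)
print([number for number, _ in hits])  # [2, 5]
```

A match only marks a candidate. Each hit still needs direct inspection to see whether user-controlled input reaches the concatenated string.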

The user requested report-only mode, so no fixes were applied. Suggested follow-ups, in priority order:

  1. Run /autoresearch:fix --from-debug (bounded ~5 iterations) to address the two HIGH-severity items first. Both have concrete suggested fixes in findings.md.
  2. Add a regression test in test/com/getorcha/workers/ap/acquisition_test.clj for the lost-message scenario before fixing. The bug should be reproducible by killing the executor mid-task.
  3. The polling backoff fix is a mechanical refactor: extract a shared helper into workers/util.clj and apply it to all 5 polling loops in one PR.
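
One possible shape for the shared backoff helper in item 3 is sketched below, in Python rather than Clojure. It assumes exponential backoff with full jitter, which is a common choice and not necessarily what findings.md proposes. The names `backoff_delay` and `poll_loop` and the 1-second base and 60-second cap are hypothetical.

```python
import random


def backoff_delay(consecutive_errors, base=1.0, cap=60.0, rng=random.random):
    """Exponential backoff with full jitter; no delay when there are no errors."""
    if consecutive_errors <= 0:
        return 0.0
    ceiling = min(cap, base * 2 ** (consecutive_errors - 1))
    return rng() * ceiling


def poll_loop(poll_once, sleep, max_iterations, delay=backoff_delay):
    """Call poll_once repeatedly, sleeping longer after each consecutive failure."""
    errors = 0
    for _ in range(max_iterations):
        try:
            poll_once()
            errors = 0  # a successful poll resets the backoff
        except Exception:
            errors += 1
            sleep(delay(errors))


# Usage with a poll that always fails and jitter pinned to its maximum.
def always_fails():
    raise ConnectionError("queue unreachable")


sleeps = []
poll_loop(always_fails, sleeps.append, max_iterations=4,
          delay=lambda n: backoff_delay(n, rng=lambda: 1.0))
print(sleeps)  # [1.0, 2.0, 4.0, 8.0]
```

Passing `sleep` and `delay` as parameters keeps the loop testable without real waiting. A production version would also log the caught exception instead of swallowing it.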