Started: 2026-05-01 20:40 Scope: src/ + test/ (215 + 103 + 3 Clojure files) Iterations: 20 (bounded) Severity threshold: medium+ Mode: report-only
src/com/getorcha/workers/document_output.clj:236src/com/getorcha/workers/ap/acquisition.clj (catch in polling loop)src/com/getorcha/workers/ap/ingestion.clj (~line 1159+)src/com/getorcha/workers/diagnostics_recompute.clj (~line 108+)src/com/getorcha/workers/ap/processors/matching/worker.clj (~line 218+)(while @polling?
(try
(let [messages (aws/receive-messages ...)]
(doseq [^Message message messages] (.submit executor ...)))
(catch Exception e
(when @polling?
(log/error e "Error in X polling loop, will retry")))))
Thread/sleep between failures. aws/receive-messages returns immediately on error — long-poll wait is server-side.aws.clj:99-117 receive-messages calls .receiveMessage which throws SdkException on transport errors before any wait; the outer while immediately re-enters.(Thread/sleep (long backoff-ms)) in the catch with a small exponential schedule (e.g., 1s → 5s → 30s capped). Reset on successful receive. Extract into a shared helper (com.getorcha.workers.util/with-polling-backoff) since the pattern repeats verbatim.src/com/getorcha/workers/ap/acquisition.clj:121-139aws/delete-message! (line 139) on the same iteration. The runnable runs asynchronously — there is no .get / await before delete. If the worker process crashes between the SQS delete and the runnable's first durable side-effect (claim-inbox! upsert into ap-doc-source-email-pending-sync for :oauth-sync, or record-processed! for :ses), the trigger is lost.(when (not= ::ignored q-message)
(.submit executor (mdc/wrap-runnable (fn [] ...)))) ; non-blocking
(aws/delete-message! sqs-client queue-url message)) ; runs immediately
Executors/newVirtualThreadPerTaskExecutor never rejects, so submit always returns; the runnable is queued but not yet running.:oauth-sync — Gmail/Microsoft push webhook. Lost trigger means the inbox is not synced until the NEXT webhook (next inbound email) arrives. Legitimate emails sit unprocessed in the user's inbox indefinitely if no further mail arrives.:ses — direct email ingestion via SES → S3 → SQS. Lost trigger leaves the .eml in S3 (7-day lifecycle) but unprocessed; recovery requires manual replay tooling. If no such tooling exists, the email is permanently missed at-least-once-becomes-at-most-once.aws/delete-message! inside the runnable's success path (after claim-inbox! / record-processed! commits). Visibility timeout will keep messages safe during normal processing; failed work returns to the queue for natural retry. Pair with a per-message visibility heartbeat for long-running handlers (the document-output worker already does this — pattern can be lifted).src/com/getorcha/app/http/documents/view/approval.clj:124-156 — final approval flowsrc/com/getorcha/app/document_output.clj:75-87 — create-and-enqueue-job! (used by manual re-export from invoice view)document_output_job row is INSERTed inside a DB transaction (status=pending); after the transaction commits, the app calls enqueue-job! (SQS send-message) outside the transaction. If the app process dies between commit and SQS send, the row exists in pending state but no SQS message is ever published. The worker only processes jobs that arrive via SQS — claim-job! is invoked with the job-id from the message body — so the orphaned row is never picked up.approval.clj:130-141 creates the job inside with-transaction tx, returning job from the tx body.approval.clj:144-155 calls enqueue-job! AFTER the tx (line 162 begins outside [tx db-pool]).document_output.clj:75-87 (create-and-enqueue-job!) follows the same pattern: (create-job! db-pool ...) commits immediately, then (enqueue-job! ...) runs separately. The try/catch only handles synchronous enqueue-job! errors — not process death.src/com/getorcha/workers/ and src/com/getorcha/app/: only db.run/reap-stuck-running! exists for document_processor_run rows. No equivalent reaper exists for document_output_job rows stuck in pending without an SQS message.document_output_job row itself as the outbox. Add a periodic sweeper in workers/document_output.clj that scans for status='pending' rows older than e.g. 60s and re-enqueues their id to SQS. Idempotent: the worker's claim-job! is already conflict-safe (status='pending' OR stale-running), so duplicate SQS messages are absorbed.document_output_job already fires a NOTIFY on insert (per app/ingestion.clj:291-303). Add a small listener on that channel that calls enqueue-job!. Ensures the SQS send happens after commit even if the original caller crashes — but loses any notify if the listener is down (still need a startup sweep).pending rows older than boot-grace-seconds and re-enqueue. Cheaper than a periodic sweeper.test/com/getorcha/workers/document_output_test.clj does not cover the orphaned-pending case.src/com/getorcha/app/http/sse.clj:60-91connected as the first SSE frame at line 64. If that write fails (client disconnected between request handling and Jetty starting to write the response body), the (when ...) predicate is false → the io-thread is never started → the (finally (clean-up-fn) ...) block at line 88-91 is unreachable. Whatever subscription/state the caller registered before invoking looper is never cleaned up.(write-body-to-stream [_ _response output-stream]
(let [writer (io/writer output-stream)]
(when (write! writer "connected" "{}") ; if false, fall-through with no cleanup
(a/io-thread
(try
(loop [...] ...)
(finally
(when clean-up-fn (clean-up-fn))
(a/close! input-ch)))))))
write! returns false on IOException (line 55-57). When that happens before the io-thread starts, clean-up-fn is unreferenced.app/ingestion.clj:330+) using a/pub over a channel of capacity 100, leaked subs slowly fill capacity → eventually the publisher blocks on a/put! → backpressure spreads to the source NOTIFY listener thread, which can starve real-time updates for all users.(when (write! ...)) predicate. Restructure as:
(let [writer (io/writer output-stream)]
(try
(when (write! writer "connected" "{}")
(a/io-thread (try (loop [...] ...) (finally (a/close! input-ch)))))
(finally
(when clean-up-fn (clean-up-fn)))))
This guarantees clean-up-fn runs on any path out of write-body-to-stream. Note: (a/close! input-ch) should still live inside the io-thread's finally, since the io-thread owns the channel for the duration of the loop.