Date: 2026-02-26 Branch: matching-candidate-retrieval
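The matching worker deletes SQS messages in a finally block, regardless of success or failure. This means a message whose processing failed is removed from the queue anyway: SQS never redelivers it, and it never reaches the DLQ.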
Fix: move delete-message! out of the finally block. Messages are only deleted on definitive outcomes (success or non-transient failure). Transient failures use ChangeMessageVisibility to implement backoff before SQS redelivers the message.
Per-message flow:
Receive message (SQS tracks ApproximateReceiveCount)
  → Mark document matching_status = :in-progress
  → Run matching (with in-process LLM retries)
      → SUCCESS → mark :succeeded, delete message
      → NON-TRANSIENT → mark :failed, notify admins, delete message
      → TRANSIENT (in-process retries exhausted):
          receive-count = 1 → ChangeMessageVisibility to 5min
          receive-count = 2 → ChangeMessageVisibility to 30min
          receive-count = 3 → mark :failed, notify admins, do not delete → SQS sends to DLQ
maxReceiveCount remains at 3. After the third failed attempt the message is left undeleted; once its visibility timeout expires, SQS moves it to the DLQ instead of delivering it a fourth time, preserving it for debugging. The 60s visibility timeout on the main queue only matters during active processing; an explicit ChangeMessageVisibility call overrides it for retry delays.
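The per-message decision can be sketched as a pure function, which keeps the SQS calls at the edge of the worker. This is a minimal sketch, not the worker's actual code: `next-action`, `retry-delay-seconds`, and the shape of the returned map are hypothetical names chosen for illustration, and `receive-count` is assumed to be ApproximateReceiveCount already parsed to a number.

```clojure
;; Visibility timeout (seconds) to apply after a transient failure,
;; keyed by the SQS receive count. Values come from the flow above.
(def retry-delay-seconds
  {1 (* 5 60)
   2 (* 30 60)})

(defn next-action
  "Decide what to do with a message after one processing attempt.
  `outcome` is :success, :non-transient or :transient (hypothetical keywords).
  Throws IllegalArgumentException for any other outcome."
  [outcome receive-count]
  (case outcome
    :success       {:status :succeeded :delete? true}
    :non-transient {:status :failed :notify? true :delete? true}
    :transient     (if-let [delay-s (get retry-delay-seconds receive-count)]
                     ;; Retry later: keep the message and push out its visibility.
                     {:delete? false :visibility-timeout-s delay-s}
                     ;; Final attempt: keep the message so SQS can move it to the DLQ.
                     {:status :failed :notify? true :delete? false})))
```

The caller would translate `:visibility-timeout-s` into a ChangeMessageVisibility request and `:delete?` into a `delete-message!` call. The retry branch deliberately sets no status, since the flow does not say what matching_status should be while a message waits for redelivery.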
In-process retry: up to 3 attempts for each LLM call, with exponential backoff: 2s → 4s → 8s. Applied to llm-match-decision only.
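A minimal sketch of the in-process retry follows. `with-retries` and `backoff-ms` are hypothetical names, and the schedule is an interpretation: the delay before retry n is taken to be 2^n seconds, so with 3 attempts only the 2s and 4s delays are actually used, and the 8s delay would apply only if a fourth attempt were allowed. `sleep!` is injectable so the function can be exercised without waiting.

```clojure
(defn backoff-ms
  "Delay in milliseconds before retry number `n` (1-based): 2s, 4s, 8s, ..."
  [n]
  (* 1000 (bit-shift-left 1 n)))

(defn with-retries
  "Call `f` up to `max-attempts` times. When `f` throws and `retry?` returns
  truthy for the exception, sleep and try again; otherwise rethrow."
  [{:keys [max-attempts retry? sleep!]
    :or   {max-attempts 3
           retry?       (constantly true)
           sleep!       #(Thread/sleep (long %))}}
   f]
  (loop [attempt 1]
    ;; recur is not allowed inside try/catch, so capture the outcome first.
    (let [result (try
                   {:value (f)}
                   (catch Exception e
                     (if (and (< attempt max-attempts) (retry? e))
                       {:retry e}
                       (throw e))))]
      (if (contains? result :retry)
        (do (sleep! (backoff-ms attempt))
            (recur (inc attempt)))
        (:value result)))))
```

Wrapping only `llm-match-decision` would look like `(with-retries {} #(llm-match-decision doc))`, where `doc` stands in for whatever arguments that function really takes.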
Silent degradation removed: llm-match-decision currently returns {:matches []} on any failure, conflating two distinct outcomes: the LLM call failed, and the LLM call succeeded but the document has no match edges.
Under the new design, llm-match-decision either returns a valid response (which may legitimately contain {:matches []}) or throws. The caller propagates the exception into the retry machinery.
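The return-or-throw contract might look like the sketch below. `validate-match-response` and the `:llm/schema-validation-failure` type are hypothetical, and the check shown (a map with a sequential :matches value) is only a stand-in for the real schema.

```clojure
(defn validate-match-response
  "Return `response` unchanged when it looks like a match decision;
  throw otherwise so the failure cannot be mistaken for 'no matches'."
  [response]
  (if (and (map? response) (sequential? (:matches response)))
    response
    (throw (ex-info "LLM response failed schema validation"
                    {:type     :llm/schema-validation-failure
                     :response response}))))
```

An empty result such as `{:matches []}` passes through, while `nil` or a malformed body raises an exception that the retry layer can classify.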
Error classification:
| Error | Classification |
|---|---|
| HTTP 429 (rate limit) | Transient |
| HTTP 5xx (server error) | Transient |
| Network timeout / connection refused | Transient |
| LLM response parse failure | Transient |
| LLM response schema validation failure | Transient |
| HTTP 400 context-window exceeded | Non-transient |
| HTTP 401 / 403 | Non-transient |
LLM schema validation failures are transient because LLMs are non-deterministic — the same input may produce a valid response on a subsequent attempt.
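The table can be expressed as a small classifier. This is a sketch under assumptions: the error is represented as a hypothetical map such as `{:kind :http :status 429}`, the `:kind` keywords are invented for illustration, and errors the table does not list fall through to :non-transient, a default the design does not specify.

```clojure
(defn classify-error
  "Map an error description to :transient or :non-transient."
  [{:keys [kind status context-window-exceeded?]}]
  (case kind
    :http (cond
            (= 429 status)                           :transient
            (and (int? status) (<= 500 status 599))  :transient
            (and (= 400 status)
                 context-window-exceeded?)           :non-transient
            (#{401 403} status)                      :non-transient
            ;; Assumption: unlisted HTTP statuses are not retried.
            :else                                    :non-transient)
    (:network-timeout :connection-refused)           :transient
    (:llm-parse-failure :llm-schema-failure)         :transient
    ;; Assumption: unknown error kinds are not retried.
    :non-transient))
```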
One migration adds four columns to document:
ALTER TABLE document ADD COLUMN matching_status text;
ALTER TABLE document ADD COLUMN matching_error text;
ALTER TABLE document ADD COLUMN matching_attempts integer NOT NULL DEFAULT 0;
ALTER TABLE document ADD COLUMN matching_failed_at timestamptz;
State transitions:
| Status | Set when |
|---|---|
| NULL | Document exists but not yet queued (non-matchable type, or pre-feature) |
| pending | Ingestion publishes document ID to matching queue |
| in-progress | Worker picks up message and begins processing |
| succeeded | Matching completed (with or without match edges) |
| failed | Permanently failed after all retries |
matching_attempts increments on each SQS-level delivery (not in-process LLM retries). matching_error stores the last error message. matching_failed_at is set on transition to failed.
in-progress records that persist longer than ~10 minutes indicate a crashed worker — useful as a future monitoring alarm, not addressed in this design.
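The column updates implied by these transitions can be sketched as pure functions over a document row. The function names are hypothetical, statuses are assumed to be stored as the plain strings shown in the table, and recording matching_error on every failed delivery is one reading of "stores the last error message".

```clojure
(defn on-delivery
  "A worker received the message: mark in-progress and count the SQS delivery."
  [doc]
  (-> doc
      (assoc :matching_status "in-progress")
      (update :matching_attempts (fnil inc 0))))

(defn on-transient-failure
  "Record the latest error; the status is left as-is while SQS redelivers."
  [doc error-message]
  (assoc doc :matching_error error-message))

(defn on-permanent-failure
  "Terminal failure: set status, last error, and the failure timestamp."
  [doc error-message now]
  (assoc doc
         :matching_status    "failed"
         :matching_error     error-message
         :matching_failed_at now))

(defn on-success
  "Matching completed, with or without match edges."
  [doc]
  (assoc doc :matching_status "succeeded"))
```

In-process LLM retries never call `on-delivery`, so matching_attempts counts SQS deliveries only.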
The "write results" phase (clear previous matches → insert new match edges → update cluster assignments) is wrapped in a single DB transaction. A failure at any step rolls back the entire phase, leaving the previous match state intact.
Pre-processing writes (normalize counterparty, generate embedding, persist searchable_text / normalized_counterparty / normalized_references / embedding) are outside this transaction. They are safe to commit independently: if matching fails after them, a retry overwrites them with the same values.
Transaction boundary:
BEGIN
DELETE FROM document_match WHERE document_a_id = ? OR document_b_id = ?
INSERT INTO document_match (new edges)
UPDATE document SET cluster_id = ? WHERE id IN (affected cluster members)
COMMIT
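One way to keep the three writes together is to build them as data and run them through a single transactional entry point. The sketch below makes several assumptions: `write-results-statements` and `write-results!` are hypothetical names, only the document_a_id / document_b_id columns named above are inserted, all affected members move to one cluster, and `with-tx` / `execute!` are caller-supplied functions rather than a specific database library's API.

```clojure
(require '[clojure.string :as str])

(defn write-results-statements
  "Build the write-results statements in execution order.
  Each statement is a [sql & params] vector."
  [document-id new-edges cluster-id member-ids]
  (let [placeholders (fn [n] (str/join ", " (repeat n "?")))]
    (concat
     [["DELETE FROM document_match WHERE document_a_id = ? OR document_b_id = ?"
       document-id document-id]]
     (for [[a b] new-edges]
       ["INSERT INTO document_match (document_a_id, document_b_id) VALUES (?, ?)" a b])
     ;; Skip the UPDATE entirely when no cluster members are affected.
     (when (seq member-ids)
       [(into [(str "UPDATE document SET cluster_id = ? WHERE id IN ("
                    (placeholders (count member-ids)) ")")
               cluster-id]
              member-ids)]))))

(defn write-results!
  "Run every statement inside one transaction. `with-tx` must open a
  transaction, call its argument with the transaction handle, commit on
  normal return, and roll back if that call throws."
  [with-tx execute! statements]
  (with-tx (fn [tx]
             (doseq [stmt statements]
               (execute! tx stmt)))))
```

Because any exception from `execute!` escapes the function passed to `with-tx`, a failure in the INSERT or UPDATE step undoes the earlier DELETE, which is the property the design relies on.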
When a document transitions to failed, a notification is dispatched to admins. Notification config lives under the matching orchestrator config in config.edn:
:com.getorcha.workers.matching.worker/orchestrator
{...
 :notifications {:slack {:webhook-url "https://hooks.slack.com/..."}
                 :email {:to ["ops@getorcha.com"]}}}
Channel presence enables that channel; both are opt-in. Notification payload includes: document ID, tenant, error message, attempt count, and a link to the admin document page.
Slack is the primary channel (HTTP POST to webhook, no infrastructure required). Email is an extension point — the dispatch function accepts additional channel implementations without design changes.
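The channel-presence rule and the extension point could be sketched with a multimethod keyed on the channel name. Everything named here is hypothetical: `failure-payload`, `send-notification!`, `notify-failure!`, the payload keys, and especially the admin URL path, which is a placeholder because the design does not give the real route. The HTTP POST itself is left out.

```clojure
(defn failure-payload
  "Build the notification payload described above."
  [{:keys [document-id tenant error attempts]} admin-base-url]
  {:document-id document-id
   :tenant      tenant
   :error       error
   :attempts    attempts
   ;; Placeholder path; the real admin route is not given in this design.
   :admin-url   (str admin-base-url "/documents/" document-id)})

(defmulti send-notification!
  "Deliver `payload` over one channel. Dispatches on the channel keyword."
  (fn [channel _channel-config _payload] channel))

(defmethod send-notification! :default [channel _ _]
  (throw (ex-info "No notifier registered for channel" {:channel channel})))

(defn notify-failure!
  "Send `payload` to every channel present under :notifications.
  A channel that is absent from the config is simply never called."
  [{:keys [notifications]} payload]
  (doseq [[channel channel-config] notifications]
    (send-notification! channel channel-config payload)))
```

Adding email later would then mean registering one more method, for example `(defmethod send-notification! :email [_ cfg payload] ...)`, without touching `notify-failure!`.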
Out of scope for this design: in-progress detection / alarming.