Matching Error Handling Design

Date: 2026-02-26
Branch: matching-candidate-retrieval

Problem

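The matching worker deletes SQS messages in a finally block, regardless of success or failure. A message whose processing failed is therefore removed from the queue: SQS never redelivers it, and it never reaches the DLQ.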

Design

1. Message Lifecycle & SQS Retry

Fix: move delete-message! out of the finally block. Messages are only deleted on definitive outcomes (success or non-transient failure). Transient failures use ChangeMessageVisibility to implement backoff before SQS redelivers the message.

Per-message flow:

Receive message (SQS tracks ApproximateReceiveCount)
  → Mark document matching_status = :in-progress
  → Run matching (with in-process LLM retries)
  → SUCCESS        → mark :succeeded, delete message
  → NON-TRANSIENT  → mark :failed, notify admins, delete message
  → TRANSIENT (in-process retries exhausted):
      receive-count = 1 → ChangeMessageVisibility to 5min
      receive-count = 2 → ChangeMessageVisibility to 30min
      receive-count = 3 → mark :failed, notify admins, do not delete → SQS sends to DLQ

maxReceiveCount remains at 3. Because the message is not deleted after the third failed receive, SQS moves it to the DLQ instead of delivering it a fourth time, preserving it for debugging. The 60s visibility timeout on the main queue only matters during active processing; for retry delays, the explicit ChangeMessageVisibility call overrides it.
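The per-message flow above can be sketched as a pure decision function. This is only an illustration: `next-action`, `backoff-seconds`, and the shape of the returned map are hypothetical names, not the worker's actual API, and the caller would still have to perform the SQS and database calls.

```clojure
;; Hypothetical sketch; names and return shape are illustrative only.

(def max-receive-count 3)

;; Redelivery delay per ApproximateReceiveCount, in seconds (5 min, 30 min).
(def backoff-seconds {1 300, 2 1800})

(defn next-action
  "Decide what to do with a message after one processing attempt.
  `outcome` is :success, :non-transient, or :transient (in-process retries
  already exhausted); `receive-count` is the SQS ApproximateReceiveCount."
  [outcome receive-count]
  (case outcome
    :success       {:status :succeeded :delete? true}
    :non-transient {:status :failed :notify? true :delete? true}
    :transient     (if (< receive-count max-receive-count)
                     ;; Keep the message and delay its redelivery.
                     {:delete? false
                      :visibility-timeout (backoff-seconds receive-count)}
                     ;; Last delivery: leave it undeleted so SQS dead-letters it.
                     {:status :failed :notify? true :delete? false})))
```

Keeping the decision separate from the side effects makes each branch of the flow easy to unit test without a queue.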

2. LLM Retry & Error Classification

In-process retry: up to 3 attempts for each LLM call, with exponential backoff: 2s → 4s → 8s. Applied to llm-match-decision only.
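A minimal sketch of such an in-process retry wrapper follows. `with-retries` and its option names are hypothetical. The sketch sleeps only between attempts, so with three attempts it uses the 2s and 4s delays; whether the 8s delay applies before the failure is handed to the SQS-level retry is not specified here and is left out.

```clojure
;; Hypothetical helper; option names are assumptions for illustration.
(defn with-retries
  "Call zero-arg fn `f` up to `max-attempts` times. After a failed attempt
  that is not the last, wait for the matching entry of `delays-ms` via
  `sleep-fn`. Rethrows the last exception once attempts are exhausted."
  [{:keys [max-attempts delays-ms sleep-fn]
    :or   {max-attempts 3
           delays-ms    [2000 4000 8000]
           sleep-fn     #(Thread/sleep (long %))}}
   f]
  (loop [attempt 1]
    ;; recur is not allowed inside try/catch, so capture the outcome first.
    (let [result (try
                   {:value (f)}
                   (catch Exception e
                     (if (< attempt max-attempts)
                       {:retry e}
                       (throw e))))]
      (if (contains? result :value)
        (:value result)
        (do (sleep-fn (nth delays-ms (dec attempt)))
            (recur (inc attempt)))))))
```

Passing `sleep-fn` as an option lets tests record the delays instead of actually sleeping.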

Silent degradation removed: llm-match-decision currently returns {:matches []} on any failure, conflating two distinct outcomes: the LLM legitimately finding no matches, and the call itself failing.

Under the new design, llm-match-decision either returns a valid response (which may legitimately contain {:matches []}) or throws. The caller propagates the exception into the retry machinery.
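The return-or-throw contract might look roughly like the sketch below. `validated-matches`, the `:kind` key in the ex-data, and the shape check are all assumptions for illustration; the real schema for the LLM response is not described in this document.

```clojure
;; Hypothetical validation step; the real response schema is not shown here.
(defn validated-matches
  "Return `response` if it has the expected shape, otherwise throw.
  An empty :matches collection is a valid answer, not an error."
  [response]
  (if (and (map? response) (sequential? (:matches response)))
    response
    (throw (ex-info "LLM response failed schema validation"
                    {:kind :schema-invalid :response response}))))
```

With this shape, `{:matches []}` can only mean "no matches found"; every failure surfaces as an exception the caller has to handle.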

Error classification:

Error                                     Classification
HTTP 429 (rate limit)                     Transient
HTTP 5xx (server error)                   Transient
Network timeout / connection refused      Transient
LLM response parse failure                Transient
LLM response schema validation failure    Transient
HTTP 400 context-window exceeded          Non-transient
HTTP 401 / 403                            Non-transient

LLM schema validation failures are transient because LLMs are non-deterministic — the same input may produce a valid response on a subsequent attempt.
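One way to encode the classification table is a single function over the thrown exception. The sketch assumes, hypothetically, that failures arrive as ex-info with `:kind` and (for HTTP errors) `:status` in their ex-data, or as the JDK's socket exceptions. The fallback for errors not listed in the table is also an assumption.

```clojure
;; Hypothetical classifier; the ex-data keys :kind and :status are assumed.
(defn classify-error
  "Return :transient or :non-transient for exception `e`."
  [e]
  (let [{:keys [kind status]} (ex-data e)]
    (cond
      ;; Network timeout / connection refused
      (or (instance? java.net.SocketTimeoutException e)
          (instance? java.net.ConnectException e))      :transient
      ;; LLM response could not be parsed or failed schema validation
      (#{:parse-failure :schema-invalid} kind)          :transient
      ;; HTTP 400 specifically for an exceeded context window
      (and (= status 400)
           (= kind :context-window-exceeded))           :non-transient
      (= status 429)                                    :transient
      (and (number? status) (<= 500 status 599))        :transient
      (#{401 403} status)                               :non-transient
      ;; Not covered by the table: treated as non-transient in this sketch
      ;; so unknown errors reach admins instead of being retried. A guess.
      :else                                             :non-transient)))
```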

3. Database Schema

One migration adds four columns to document:

ALTER TABLE document ADD COLUMN matching_status     text;
ALTER TABLE document ADD COLUMN matching_error      text;
ALTER TABLE document ADD COLUMN matching_attempts   integer NOT NULL DEFAULT 0;
ALTER TABLE document ADD COLUMN matching_failed_at  timestamptz;

State transitions:

Status        Set when
NULL          Document exists but not yet queued (non-matchable type, or pre-feature)
pending       Ingestion publishes document ID to matching queue
in-progress   Worker picks up message and begins processing
succeeded     Matching completed (with or without match edges)
failed        Permanently failed after all retries

matching_attempts increments on each SQS-level delivery (not in-process LLM retries). matching_error stores the last error message. matching_failed_at is set on transition to failed.
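The column updates described above can be sketched as pure functions over a document row represented as a map. The function names are hypothetical, and the sketch leaves open questions (for example, whether matching_error is cleared on success) exactly as unspecified as the text does.

```clojure
;; Hypothetical helpers operating on a row map keyed by column name.

(defn on-delivery
  "Worker received the message: mark in-progress and count the SQS delivery."
  [doc]
  (-> doc
      (assoc :matching_status "in-progress")
      (update :matching_attempts (fnil inc 0))))

(defn on-success
  "Matching completed, with or without match edges."
  [doc]
  (assoc doc :matching_status "succeeded"))

(defn on-permanent-failure
  "Record the last error and the failure time on transition to failed.
  `now` is supplied by the caller so the function stays deterministic."
  [doc error-message now]
  (assoc doc
         :matching_status    "failed"
         :matching_error     error-message
         :matching_failed_at now))
```

For example, two deliveries followed by a permanent failure leave matching_attempts at 2, because in-process LLM retries never call `on-delivery`.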

in-progress records that persist longer than ~10 minutes indicate a crashed worker — useful as a future monitoring alarm, not addressed in this design.

4. Cluster Merge Atomicity

The "write results" phase (clear previous matches → insert new match edges → update cluster assignments) is wrapped in a single DB transaction. A failure at any step rolls back the entire phase, leaving the previous match state intact.

Pre-processing writes (normalize counterparty, generate embedding, persist searchable_text / normalized_counterparty / normalized_references / embedding) are outside this transaction. They are safe to commit independently: if matching fails after them, a retry overwrites them with the same values.

Transaction boundary:

BEGIN
  DELETE FROM document_match WHERE document_a_id = ? OR document_b_id = ?
  INSERT INTO document_match (new edges)
  UPDATE document SET cluster_id = ? WHERE id IN (affected cluster members)
COMMIT

5. Failure Notifications

When a document transitions to failed, a notification is dispatched to admins. Notification config lives under the matching orchestrator config in config.edn:

:com.getorcha.workers.matching.worker/orchestrator
{...
 :notifications {:slack {:webhook-url "https://hooks.slack.com/..."}
                 :email {:to ["ops@getorcha.com"]}}}

Channel presence enables that channel; both are opt-in. Notification payload includes: document ID, tenant, error message, attempt count, and a link to the admin document page.

Slack is the primary channel (HTTP POST to webhook, no infrastructure required). Email is an extension point — the dispatch function accepts additional channel implementations without design changes.
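A multimethod is one plausible way to get this extension point. In the sketch below, `notify!` and `dispatch-failure-notification` are hypothetical names, and the channel implementations are stubs that return a description of what they would send rather than performing the HTTP POST or sending mail.

```clojure
;; Hypothetical dispatch sketch; the channel implementations are stubs.

(defmulti notify!
  "Send `payload` over one channel. Dispatches on the channel keyword."
  (fn [channel _channel-config _payload] channel))

(defmethod notify! :slack [_ {:keys [webhook-url]} payload]
  ;; A real implementation would POST the payload to webhook-url.
  {:channel :slack :target webhook-url :payload payload})

(defmethod notify! :email [_ {:keys [to]} payload]
  ;; A real implementation would send mail to each address in `to`.
  {:channel :email :target to :payload payload})

(defn dispatch-failure-notification
  "Send `payload` to every channel present under :notifications.
  A channel that is absent from the config is simply not used."
  [config payload]
  (mapv (fn [[channel channel-config]]
          (notify! channel channel-config payload))
        (:notifications config)))
```

Adding a channel then means adding one `defmethod` and one config key; the dispatch function itself does not change.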

Out of Scope