Matching Error Handling Design

Date: 2026-02-26
Branch: matching-candidate-retrieval

Problem

The matching worker deletes SQS messages in a finally block — regardless of success or failure. This means:

Processing failures silently lose the message; the document is never re-matched
The configured DLQ is never populated (messages are always deleted)
LLM failures degrade silently to "no matches", indistinguishable from a legitimate "no match" result
The cluster merge phase (clear matches → insert edges → update clusters) is not atomic — partial failures leave inconsistent state
There is no visibility into matching failures for admins

Design

1. Message Lifecycle & SQS Retry

Fix: move delete-message! out of the finally block. Messages are deleted only on definitive outcomes (success or non-transient failure). After a transient failure, the worker calls ChangeMessageVisibility to delay redelivery, which implements backoff between SQS-level attempts.

Per-message flow:

Receive message (SQS tracks ApproximateReceiveCount)
  → Mark document matching_status = :in-progress
  → Run matching (with in-process LLM retries)
  → SUCCESS        → mark :succeeded, delete message
  → NON-TRANSIENT  → mark :failed, notify admins, delete message
  → TRANSIENT (in-process retries exhausted):
      receive-count = 1 → ChangeMessageVisibility to 5min
      receive-count = 2 → ChangeMessageVisibility to 30min
      receive-count = 3 → mark :failed, notify admins, do not delete → SQS sends to DLQ

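maxReceiveCount remains at 3. After the third failed attempt the message is left undeleted; once its visibility timeout expires, SQS moves it to the DLQ instead of delivering it a fourth time, preserving it for debugging. The 60s visibility timeout on the main queue only applies during active processing; for retry delays, ChangeMessageVisibility replaces it with the longer timeout.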
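
The per-message flow above can be expressed as a small pure decision function, which keeps the SQS calls at the edge of the worker. This is a minimal sketch, not the worker's actual code: the function name, the outcome keywords, and the shape of the returned map are assumptions made for illustration. It also assumes the document status is left unchanged while a message waits for redelivery, which the flow does not specify.

```clojure
(def max-receive-count 3)

;; Retry delays in seconds, keyed by the receive count of the failed delivery.
(def retry-delay-seconds {1 (* 5 60), 2 (* 30 60)})

(defn next-action
  "Decides what the worker does with a message after one processing attempt.
  `outcome` is :success, :non-transient, or :transient (hypothetical keywords)."
  [outcome receive-count]
  (case outcome
    :success       {:status :succeeded, :delete? true}
    :non-transient {:status :failed, :notify? true, :delete? true}
    :transient     (if (>= receive-count max-receive-count)
                     ;; Final attempt: leave the message for the redrive policy.
                     {:status :failed, :notify? true, :delete? false}
                     ;; Earlier attempt: delay redelivery, status left unchanged.
                     {:delete? false
                      :visibility-timeout-seconds (retry-delay-seconds receive-count)})))
```

The caller would translate `:visibility-timeout-seconds` into a ChangeMessageVisibility call. ApproximateReceiveCount has to be requested explicitly when receiving and arrives as a string attribute, so it needs parsing (for example with `Long/parseLong`) before being passed in.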

2. LLM Retry & Error Classification

In-process retry: up to 3 attempts for each LLM call, with exponential backoff: 2s → 4s → 8s. Applied to llm-match-decision only.

Silent degradation removed: llm-match-decision currently returns {:matches []} on any failure, conflating two distinct outcomes:

Legitimate no-match: LLM responded successfully and determined no candidates match → matching succeeded, document has no match edges
LLM failure: API error, parse failure, schema validation failure → should retry

Under the new design, llm-match-decision either returns a valid response (which may legitimately contain {:matches []}) or throws. The caller propagates the exception into the retry machinery.
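
The return-or-throw contract makes the in-process retry straightforward to wrap around the call. The sketch below is one possible shape, with assumed names (`with-retries`, the option keys) that do not come from the codebase. It reads "3 attempts" as including the first call, so only the 2s and 4s delays are used; the 8s delay would apply only if a fourth attempt were allowed.

```clojure
(defn with-retries
  "Calls `f` until it returns. Rethrows the last exception once attempts
  run out or the error is not transient."
  [{:keys [max-attempts base-delay-ms transient? sleep-fn]
    :or   {max-attempts  3
           base-delay-ms 2000
           transient?    (constantly true)
           sleep-fn      #(Thread/sleep (long %))}}
   f]
  (loop [attempt 1]
    ;; recur is not allowed inside catch, so the attempt result is captured first.
    (let [result (try
                   {:value (f)}
                   (catch Exception e
                     (if (and (< attempt max-attempts) (transient? e))
                       {:error e}
                       (throw e))))]
      (if (contains? result :value)
        (:value result)
        (do
          ;; Exponential backoff: base delay doubled after each failed attempt.
          (sleep-fn (* base-delay-ms (bit-shift-left 1 (dec attempt))))
          (recur (inc attempt)))))))
```

A legitimate `{:matches []}` is returned like any other value, so it never triggers a retry. Injecting `sleep-fn` keeps the backoff testable without real waits.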

Error classification:

Error	Classification
HTTP 429 (rate limit)	Transient
HTTP 5xx (server error)	Transient
Network timeout / connection refused	Transient
LLM response parse failure	Transient
LLM response schema validation failure	Transient
HTTP 400 context-window exceeded	Non-transient
HTTP 401 / 403	Non-transient

LLM schema validation failures are transient because LLMs are non-deterministic — the same input may produce a valid response on a subsequent attempt.
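
The classification table could be encoded as a single function over error data. The sketch below assumes errors are described by a map with hypothetical keys `:kind`, `:status`, and `:context-window-exceeded?` (for example the `ex-data` of a thrown exception); none of these keys come from the existing code. Errors not listed in the table return `:unclassified`, since the table does not say how they should be treated.

```clojure
(def transient-kinds
  #{:network-timeout :connection-refused :parse-failure :schema-validation-failure})

(defn classify-error
  "Maps an error description to :transient, :non-transient, or :unclassified."
  [{:keys [kind status context-window-exceeded?]}]
  (cond
    (contains? transient-kinds kind)              :transient
    (= 429 status)                                :transient
    (and (integer? status) (<= 500 status 599))   :transient
    (contains? #{401 403} status)                 :non-transient
    (and (= 400 status) context-window-exceeded?) :non-transient
    ;; Not covered by the table; the caller must pick a policy.
    :else                                         :unclassified))
```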

3. Database Schema

One migration adds four columns to document:

ALTER TABLE document ADD COLUMN matching_status     text;
ALTER TABLE document ADD COLUMN matching_error      text;
ALTER TABLE document ADD COLUMN matching_attempts   integer NOT NULL DEFAULT 0;
ALTER TABLE document ADD COLUMN matching_failed_at  timestamptz;

State transitions:

Status	Set when
`NULL`	Document exists but not yet queued (non-matchable type, or pre-feature)
`pending`	Ingestion publishes document ID to matching queue
`in-progress`	Worker picks up message and begins processing
`succeeded`	Matching completed (with or without match edges)
`failed`	Permanently failed after all retries

matching_attempts increments on each SQS-level delivery (not in-process LLM retries). matching_error stores the last error message. matching_failed_at is set on transition to failed.

in-progress records that persist longer than ~10 minutes indicate a crashed worker — useful as a future monitoring alarm, not addressed in this design.
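
As an illustration of the column semantics above, the status updates might be written as parameterized SQL vectors. The function names are hypothetical, and `now()` assumes PostgreSQL, which the timestamptz column suggests but the design does not state. Incrementing matching_attempts in SQL rather than in application code avoids a read-modify-write race.

```clojure
(defn mark-in-progress-sql
  "One SQS-level delivery: set the status and count the attempt."
  [doc-id]
  [(str "UPDATE document "
        "SET matching_status = 'in-progress', "
        "matching_attempts = matching_attempts + 1 "
        "WHERE id = ?")
   doc-id])

(defn mark-succeeded-sql
  [doc-id]
  ["UPDATE document SET matching_status = 'succeeded' WHERE id = ?" doc-id])

(defn mark-failed-sql
  "Permanent failure: record the last error and the failure time."
  [doc-id error-message]
  [(str "UPDATE document "
        "SET matching_status = 'failed', "
        "matching_error = ?, "
        "matching_failed_at = now() "
        "WHERE id = ?")
   error-message doc-id])
```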

4. Cluster Merge Atomicity

The "write results" phase (clear previous matches → insert new match edges → update cluster assignments) is wrapped in a single DB transaction. A failure at any step rolls back the entire phase, leaving the previous match state intact.

Pre-processing writes (normalize counterparty, generate embedding, persist searchable_text / normalized_counterparty / normalized_references / embedding) are outside this transaction. They are safe to commit independently: if matching fails after them, a retry overwrites them with the same values.

Transaction boundary:

BEGIN
  DELETE FROM document_match WHERE document_a_id = ? OR document_b_id = ?
  INSERT INTO document_match (new edges)
  UPDATE document SET cluster_id = ? WHERE id IN (affected cluster members)
COMMIT
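
One way to keep the transaction boundary explicit is to build the ordered statements as data and execute them inside a single transaction. The sketch below is illustrative only: `merge-statements` and its arguments are assumed names, document_match may have more columns than the two shown, and the commented usage assumes next.jdbc, which the design does not name.

```clojure
(require '[clojure.string :as str])

(defn merge-statements
  "Returns the write-results statements in execution order:
  clear old edges, insert new edges, update cluster assignments."
  [doc-id edges cluster-id member-ids]
  (vec
   (concat
    [["DELETE FROM document_match WHERE document_a_id = ? OR document_b_id = ?"
      doc-id doc-id]]
    (for [[a-id b-id] edges]
      ["INSERT INTO document_match (document_a_id, document_b_id) VALUES (?, ?)"
       a-id b-id])
    ;; Skip the UPDATE when there are no members; IN () is not valid SQL.
    (when (seq member-ids)
      [(into [(str "UPDATE document SET cluster_id = ? WHERE id IN ("
                   (str/join ", " (repeat (count member-ids) "?"))
                   ")")
              cluster-id]
             member-ids)]))))

(comment
  ;; Assumed usage with next.jdbc: any exception rolls back all statements.
  (require '[next.jdbc :as jdbc])
  (jdbc/with-transaction [tx datasource]
    (doseq [stmt (merge-statements doc-id edges cluster-id member-ids)]
      (jdbc/execute! tx stmt))))
```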

5. Failure Notifications

When a document transitions to failed, a notification is dispatched to admins. Notification config lives under the matching orchestrator config in config.edn:

:com.getorcha.workers.matching.worker/orchestrator
{...
 :notifications {:slack {:webhook-url "https://hooks.slack.com/..."}
                 :email {:to ["ops@getorcha.com"]}}}

Channel presence enables that channel; both are opt-in. Notification payload includes: document ID, tenant, error message, attempt count, and a link to the admin document page.

Slack is the primary channel (HTTP POST to webhook, no infrastructure required). Email is an extension point — the dispatch function accepts additional channel implementations without design changes.
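
A dispatch function with this property could look like the sketch below. It is an assumption-laden illustration: `dispatch-failure!`, the `senders` map, and the payload keys are hypothetical names, and the real sender for Slack would perform the HTTP POST to the webhook. A channel is used only when it appears in the :notifications config and a sender is registered for it.

```clojure
(defn dispatch-failure!
  "Sends `payload` to every configured channel that has a sender.
  `notifications` is the :notifications config map; `senders` maps a
  channel keyword to (fn [channel-config payload]). Returns a map of
  channel -> sender result."
  [notifications senders payload]
  (reduce-kv
   (fn [results channel channel-config]
     (if-let [send-fn (get senders channel)]
       (assoc results channel (send-fn channel-config payload))
       ;; Configured but no implementation registered: skip.
       results))
   {}
   notifications))
```

Adding email later would then mean registering an `:email` entry in `senders`, with no change to the dispatch function itself.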

Out of Scope

Admin UI / API endpoint for manual re-trigger (can be done via SQL + direct SQS publish when needed)
Stale in-progress detection / alarming
Matching quality improvements (better models, re-extraction fallbacks)
Tenant-facing failure notifications