Document Output Dispatcher Design

Context

AP approvals currently execute DATEV export from the app service after the final approval. The export itself is implemented in com.getorcha.integrations.ap.maesn and records DATEV-specific progress in ap_datev_export_audit.

We want the app to own user-facing workflow state and output intent, while a worker owns external output execution. This first implementation is intentionally narrow: AP final approval should enqueue a document output job, and an SQS worker should execute the existing DATEV export path. The design uses generic names so AR outputs can use the same mechanism later, but AR policy resolution and multi-target output auditing are out of scope.

Goals

Move AP final-approval DATEV export execution out of the app service.
Move manual single-document DATEV re-export execution out of the app service.
Keep SQS as the operational worker mechanism, matching existing worker patterns.
Add a document_output_job table as durable workflow/audit state.
Keep the worker SQS-driven for now; do not add DB polling.
Remove AP batch DATEV export and the bulk selection UI that only exists for it.
Review existing export-related tables and fields for redundancy before adding the new schema.

Non-Goals

Build AR output dispatch.
Build tenant-configured output policies.
Add per-output target/task audit tables.
Replace ap_datev_export_audit.
Add DB polling or a repair sweeper for pending output jobs.
Rework virtual-thread concurrency boundaries. That concern is tracked in GitHub issue #372.

Data Model

Add document_output_job as the generic record of output intent and worker execution state.

Columns:

id UUID PRIMARY KEY DEFAULT gen_random_uuid()
tenant_id UUID NOT NULL REFERENCES tenant(id) ON DELETE CASCADE
document_id UUID NOT NULL REFERENCES document(id) ON DELETE CASCADE
document_domain document_output_domain NOT NULL
trigger document_output_trigger NOT NULL
status document_output_status NOT NULL DEFAULT 'pending'
document_version INTEGER
requested_by UUID REFERENCES identity(id) ON DELETE SET NULL
started_at TIMESTAMPTZ
completed_at TIMESTAMPTZ
last_error TEXT
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()

Enums:

document_output_domain: initially ap; later ar.
document_output_trigger: initially approval_completed and manual; later scheduled or AR-specific triggers as needed.
document_output_status: pending, running, dispatched, failed.

The job table must not include a target. A job means "dispatch this document's outputs." Which systems receive output is a worker/code decision in this phase and later a tenant policy decision. Auditing individual output targets is useful, but out of scope for this implementation.

Useful indexes:

(document_id, created_at DESC) for document detail/debugging.
(tenant_id, created_at DESC) for operational views.
(status, created_at) for future repair tooling.
A partial unique index preventing concurrent dispatches for the same document: CREATE UNIQUE INDEX idx_document_output_job_document_active ON document_output_job (document_id) WHERE status IN ('pending', 'running'). This rejects a second manual re-export while one is already in flight and also protects against final-approval dispatch racing another enqueue path. Later, when AR or scheduled outputs need concurrent dispatch semantics, this can be relaxed or replaced with a more specific uniqueness key.
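
The effect of the partial unique index can be sketched in a few lines. This is a minimal illustration, not the migration itself: it assumes SQLite as a stand-in for Postgres (both support partial unique indexes), trims the table to the columns the index needs, and uses a hypothetical helper named try_create_job.

```python
import sqlite3

# Assumption: SQLite stands in for Postgres; only the columns the index
# depends on are modelled here.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE document_output_job (
        id INTEGER PRIMARY KEY,
        document_id TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending'
    );
    CREATE UNIQUE INDEX idx_document_output_job_document_active
        ON document_output_job (document_id)
        WHERE status IN ('pending', 'running');
    """
)


def try_create_job(conn, document_id):
    """Hypothetical helper: return the new job id, or None if an active job exists."""
    try:
        cur = conn.execute(
            "INSERT INTO document_output_job (document_id) VALUES (?)",
            (document_id,),
        )
        return cur.lastrowid
    except sqlite3.IntegrityError:
        # The partial unique index rejected a second pending/running job.
        return None


first = try_create_job(conn, "doc-1")
second = try_create_job(conn, "doc-1")   # rejected: first is still pending
conn.execute(
    "UPDATE document_output_job SET status = 'dispatched' WHERE id = ?", (first,)
)
third = try_create_job(conn, "doc-1")    # allowed: no active job remains
```

Because the index only covers pending and running rows, completed jobs accumulate as history without blocking later dispatches for the same document.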

SQS Configuration

Add a fifth queue to resources/com/getorcha/config.edn under :com.getorcha/aws :queues:

key: :document-output
queue name: v1-orcha-global-document-output
DLQ name: v1-orcha-global-document-output-dlq

Queue attributes should match what the current local queue helper configures. Production infrastructure that defines the queue elsewhere should use the same settings:

VisibilityTimeout = 60 seconds on the queue.
Worker immediately extends visibility to the configured heartbeat-extension-seconds before starting work.
MessageRetentionPeriod = 604800 seconds, 7 days.
Redrive policy: DLQ after maxReceiveCount = 3.

Worker Integrant config should mirror the existing SQS consumers:

max-queue-messages = 10
wait-time-seconds = 20
heartbeat-extension-seconds = 300
heartbeat-rate-seconds = 60
stale-running-seconds = 600

Message body for V1 is the job id as a plain UUID string. This matches the ingestion queue's simple id message style and avoids introducing a message schema before the worker has multiple message kinds. Future message versions can switch to JSON if needed.

The stale-running threshold (600 seconds) is intentionally longer than the visibility extension window (300 seconds) and spans ten heartbeat periods (60 seconds each). It lets a replacement worker recover a job from a dead worker while avoiding duplicate export initiation during normal heartbeat jitter.
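
The timing relationships between these settings can be written down as a small check. This is a sketch under assumptions: the WorkerTiming class and timing_problems function are hypothetical, and reading "several heartbeat periods" as at least three is an illustrative choice, not something this spec fixes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WorkerTiming:
    # Defaults are the values listed in this section; the class is hypothetical.
    heartbeat_extension_seconds: int = 300
    heartbeat_rate_seconds: int = 60
    stale_running_seconds: int = 600


def timing_problems(t: WorkerTiming) -> List[str]:
    """Return violations of the timing relationships described above."""
    problems = []
    if t.heartbeat_rate_seconds >= t.heartbeat_extension_seconds:
        problems.append("heartbeat must fire before the visibility extension expires")
    if t.stale_running_seconds <= t.heartbeat_extension_seconds:
        problems.append("stale-running threshold must exceed the visibility extension")
    # Assumption: "several" heartbeat periods is read as at least three.
    if t.stale_running_seconds < 3 * t.heartbeat_rate_seconds:
        problems.append("stale-running threshold should span several heartbeats")
    return problems
```

With the values in this spec the check returns an empty list. Lowering stale-running-seconds to 120 would trip both stale-running rules.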

Flow

Final AP Approval

The approval handler continues to lock approval rows and update the final row in the existing transaction. When the approval transition makes the document fully approved, the same transaction inserts a document_output_job with:

document_domain = 'ap'
trigger = 'approval_completed'
status = 'pending'
document_id, tenant_id, document_version taken from the current document.version, and requested_by set to the approver identity

After the transaction commits, the handler sends an SQS message to the document output queue with the job-id.

If the transaction fails, no approval or job is committed. If the transaction commits but the SQS send fails, the handler updates the job to failed with last_error and surfaces that dispatch failure through the document detail UI. A stale pending job should therefore mean that the app committed the job but the process stopped before it could either send the message or record the send failure. Operators can find these rows directly, and a future sweeper can re-enqueue stale pending jobs if this proves common.

The failure update after an SQS send error must be guarded: UPDATE document_output_job SET status = 'failed', last_error = ?, updated_at = now(), completed_at = now() WHERE id = ? AND status = 'pending'. If the update affects zero rows, the app must not overwrite the current state; it should re-read the job and render the current dispatch state. This prevents an ambiguous send error from marking a job failed after a worker has already claimed it.
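
The guarded update and the re-read that follows it can be sketched as below. This is an illustration under assumptions: SQLite stands in for Postgres, the timestamp columns are omitted for brevity, and mark_send_failed is a hypothetical name.

```python
import sqlite3
from typing import Optional


def mark_send_failed(conn, job_id: str, error: str) -> Optional[str]:
    """Apply the guarded failure update, then return the job's current status."""
    conn.execute(
        "UPDATE document_output_job"
        " SET status = 'failed', last_error = ?"
        " WHERE id = ? AND status = 'pending'",
        (error, job_id),
    )
    # Whether or not the guard matched, re-read and render whatever is current.
    row = conn.execute(
        "SELECT status FROM document_output_job WHERE id = ?", (job_id,)
    ).fetchone()
    return row[0] if row else None


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE document_output_job ("
    " id TEXT PRIMARY KEY, status TEXT NOT NULL, last_error TEXT)"
)
db.executemany(
    "INSERT INTO document_output_job (id, status) VALUES (?, ?)",
    [("job-a", "pending"), ("job-b", "running")],
)
status_a = mark_send_failed(db, "job-a", "send timed out")  # guard matches
status_b = mark_send_failed(db, "job-b", "send timed out")  # guard skips it
```

Here job-b models the ambiguous case: the send reported an error, but the message was delivered and a worker had already claimed the job, so the row is left untouched.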

Manual DATEV Re-Export

The document detail re-export action should use the same dispatcher path. The handler validates user access and basic document eligibility, inserts a document_output_job with:

document_domain = 'ap'
trigger = 'manual'
status = 'pending'
document_id, tenant_id, document_version taken from the current document.version, and requested_by set to the requester identity

After creating the job, the handler sends the SQS message with the job id. If the SQS send fails, it marks the job failed with last_error, using the same guarded update as the final-approval path, and returns a UI state that reflects the dispatch failure. It should not call maesn/create-booking-proposal! directly.

Manual re-export must reject the request, or surface an "already in flight" state, when the partial unique index prevents creating a second active job for the same document. It must not enqueue another SQS message for a document that already has a pending or running dispatch job.

Document Output Worker

Add an SQS worker namespace following existing worker patterns:

Poll the configured document output queue.
Submit one virtual-thread task per SQS message, consistent with current worker style.
Parse the message body as job-id.
Extend SQS visibility immediately and then on a heartbeat schedule, following the pattern in workers/ap/ingestion.clj.
On each heartbeat, also touch document_output_job.updated_at while the job remains running. The stale-running claim uses this timestamp as the worker liveness marker.
Atomically claim the job with a single update: UPDATE document_output_job SET status = 'running', started_at = COALESCE(started_at, now()), updated_at = now() WHERE id = ? AND (status = 'pending' OR (status = 'running' AND updated_at < now() - interval '600 seconds')) RETURNING *.
If the claim returns no row, load the job to decide whether to delete or leave the message:
- dispatched: delete the message.
- failed: delete the message.
- fresh running: leave the message undeleted so that SQS redelivers it after the visibility timeout. The worker must not silently convert a potentially stuck job into a permanently stuck job.
- missing job: log and delete the message.
Load the document and tenant context.
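Resolve outputs in code. For this implementation, the only output is DATEV export, and it requires all of the following:
- document_domain = ap
- trigger = approval_completed or manual
- connected DATEV integration exists
- document is eligible according to the existing DATEV eligibility check
- when all of the above hold, run maesn/create-booking-proposal!; otherwise treat the unmet condition as a dispatch-level failure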
Mark the job dispatched once DATEV export has been handed to DATEV/Maesn far enough that ap_datev_export_audit owns the remaining async outcome tracking.
Mark the job failed with last_error when export initiation fails.
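
The claim and the per-message decision can be modelled in memory as follows. This is a sketch, not the worker: the Job class, try_claim, and handle_message are hypothetical names, and in Postgres the atomicity comes from the single UPDATE ... RETURNING statement rather than from anything shown here. The model also looks the job up before claiming, which is a simplification of the claim-then-load order described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

STALE_RUNNING = timedelta(seconds=600)  # stale-running-seconds from the config


@dataclass
class Job:
    status: str
    updated_at: datetime
    started_at: Optional[datetime] = None


def try_claim(job: Job, now: datetime) -> bool:
    """In-memory analogue of the claim UPDATE's WHERE clause and SET list."""
    stale = job.status == "running" and job.updated_at < now - STALE_RUNNING
    if job.status == "pending" or stale:
        job.status = "running"
        job.started_at = job.started_at or now  # COALESCE(started_at, now())
        job.updated_at = now
        return True
    return False


def handle_message(jobs: Dict[str, Job], job_id: str, now: datetime) -> str:
    """Return 'export', 'delete', or 'leave' for one SQS delivery."""
    job = jobs.get(job_id)
    if job is None:
        return "delete"   # missing job: log and delete
    if try_claim(job, now):
        return "export"   # claimed: the caller runs the export path
    if job.status == "running":
        return "leave"    # fresh running: let the visibility timeout redeliver
    return "delete"       # dispatched or failed: nothing left to do
```

A duplicate delivery that arrives right after a successful claim sees a fresh running job and returns 'leave', so no second export is started.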

DATEV completion remains asynchronous and continues to be represented by ap_datev_export_audit. The output job records that the output dispatcher accepted and dispatched the configured output work.

The worker should not invent a new concurrency convention in this plan. It should reuse the existing SQS worker pattern: virtual-thread-per-task executor plus SQS visibility extension during long-running work.

DATEV Audit Link

Add a nullable dispatch_job_id UUID REFERENCES document_output_job(id) column to ap_datev_export_audit.

When the output worker calls the DATEV connector, the connector should record the job id on the audit row it creates. This gives operators a direct link from the generic output dispatch attempt to the DATEV-specific async task state. Historical rows, including manual exports that ran before this change, can keep NULL.

UI and Events

Dispatch-level failures must be visible even when no DATEV audit row exists. Examples include SQS send failure after job creation, missing DATEV integration, or worker-side eligibility failure before maesn/create-booking-proposal! creates an audit row.

Add a Postgres NOTIFY trigger on document_output_job that fires on insert and on status update. The payload should include:

event/type = output-dispatch
document-output-job/id
document-output-job/status
document/id
tenant/id
organization/id
old-status
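
One possible shape for that payload is sketched below. The JSON encoding and the output_dispatch_payload function are assumptions made for illustration; this spec only fixes the field names. The size check reflects the Postgres default that a NOTIFY payload must be shorter than 8000 bytes.

```python
import json
from typing import Optional

NOTIFY_PAYLOAD_LIMIT_BYTES = 8000  # Postgres default: payload must be shorter


def output_dispatch_payload(job_id: str, status: str, old_status: Optional[str],
                            document_id: str, tenant_id: str,
                            organization_id: str) -> str:
    """Build the notification payload (assumed here to be JSON-encoded)."""
    payload = json.dumps({
        "event/type": "output-dispatch",
        "document-output-job/id": job_id,
        "document-output-job/status": status,
        "document/id": document_id,
        "tenant/id": tenant_id,
        "organization/id": organization_id,
        "old-status": old_status,  # None (JSON null) when the row was inserted
    })
    if len(payload.encode("utf-8")) >= NOTIFY_PAYLOAD_LIMIT_BYTES:
        raise ValueError("payload too large for NOTIFY")
    return payload
```

The payload carries only ids and statuses, so it stays far below the limit. Anything larger, such as last_error, is better read from the table when the SSE handler re-renders.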

The document detail SSE handler should react to :output-dispatch by re-rendering the DATEV export section using both:

latest ap_datev_export_audit
latest relevant document_output_job

The DATEV export section should surface dispatch state when it is relevant:

pending or running: show a dispatching/export-requested state and keep the SSE subscription active.
failed with no newer DATEV audit: show an inline failure using document_output_job.last_error and allow retry when the regular approval and DATEV eligibility rules allow it.
dispatched: prefer the DATEV audit state when present; otherwise show a transient dispatched/exporting state until the audit event arrives.
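
The display rules above reduce to a small decision function. This is a sketch with hypothetical names (JobView, AuditView, export_section_state) and hypothetical state labels. It assumes "newer" is decided by comparing created_at, and it follows the dispatched rule literally by preferring any audit row that is present.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class JobView:
    status: str                      # pending, running, dispatched, failed
    created_at: datetime
    last_error: Optional[str] = None


@dataclass
class AuditView:
    status: str                      # DATEV-specific; opaque to this function
    created_at: datetime


def export_section_state(job: Optional[JobView], audit: Optional[AuditView]) -> str:
    """Pick which state the DATEV export section should render."""
    if job is None:
        return "audit" if audit is not None else "none"
    if job.status in ("pending", "running"):
        return "dispatching"         # keep the SSE subscription active
    if job.status == "failed":
        # Assumption: "newer" means the audit row was created after the job.
        newer_audit = audit is not None and audit.created_at > job.created_at
        return "audit" if newer_audit else "dispatch-failed"
    # dispatched: prefer the audit state, else show a transient exporting state
    return "audit" if audit is not None else "exporting"
```

The 'dispatch-failed' branch is the one that renders document_output_job.last_error and the retry control.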

Manual re-export should also use this path. If the app creates a job but fails to send SQS, it should mark the job failed, render the DATEV export section with that dispatch failure, and not pretend that DATEV export has started.

Idempotency

The worker must tolerate duplicate SQS delivery.

Rules:

Missing job: log and delete the message.
dispatched: delete the message.
fresh running: do not start another external export.
stale running: claim atomically and retry, using the same claim function as pending jobs.
failed: delete the message. Re-dispatch of failed jobs is future manual or repair tooling, not SQS redelivery in this implementation.

Before initiating DATEV export, the worker should preserve existing DATEV eligibility behavior and avoid creating a new export when the document already has a non-retryable successful export state.

If a stale running job already has a linked DATEV audit with a task id or a terminal status, the worker should not create another DATEV export. It should complete the dispatch job according to the linked audit state, or leave it for manual repair if the audit state is ambiguous.
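
The stale-running rule can be expressed as a decision on the linked audit row. This is a sketch with hypothetical names (LinkedAudit, stale_job_action) and hypothetical action labels. It assumes that an audit row with neither a task id nor a terminal status counts as ambiguous, which this spec does not define precisely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinkedAudit:
    task_id: Optional[str]   # task id recorded on the audit row, if any
    terminal: bool           # True once the audit reached a terminal status


def stale_job_action(audit: Optional[LinkedAudit]) -> str:
    """Decide what a worker does after reclaiming a stale running job."""
    if audit is None:
        # No export was recorded for this job, so initiating one is safe.
        return "retry-export"
    if audit.task_id is not None or audit.terminal:
        # An export already exists: never create another one.
        return "complete-from-audit"
    # Assumption: a row with no task id and no terminal status is ambiguous,
    # because the previous worker may have died mid-initiation.
    return "manual-repair"
```

Only the first branch may call the DATEV connector. The other two leave external state untouched.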

Batch Export Removal

Remove AP batch DATEV export entirely:

Remove the "Export to DATEV" bulk action.
Remove AP row checkboxes, select-all checkbox, and selection count UI.
Remove selection session helpers if no other AP bulk actions use them.
Remove /toggle, /toggle-all, /deselect-all, and /export-datev routes when they only support batch export.
Remove bulk-export-datev.
Remove or rewrite tests that assert batch export behavior.

Manual single-document re-export remains available in the document detail UI, but its execution moves to the document output worker. The required behavior changes for this plan are final-approval export via worker, manual re-export via worker, and removal of batch export.

Schema Redundancy Review

Before adding the migration, review current export-related schema and code. Initial expectations:

Keep ap_datev_export_audit. It tracks DATEV-specific request payloads, task IDs, statuses, errors, payload hashes, and powers the existing UI/SSE status updates. Add dispatch_job_id rather than replacing this table.
Keep tenant_datev_integration. It stores DATEV connection state, credentials, config, and metadata.
Confirm no legacy tenant-level DATEV export columns remain active.
Confirm no batch-export-only state remains after UI/route removal.

Only remove schema that is proven unused and unrelated to DATEV task auditing or DATEV connection state.

Testing

Add or update focused tests:

Final approval inserts a document_output_job and sends an SQS message after the transaction commits.
Non-final approval does not enqueue output.
If final approval authorization fails, no job is created.
Manual single-document re-export inserts a document_output_job, sends SQS, and does not call DATEV directly from the app handler.
Manual single-document re-export rejects or surfaces an in-flight state when the document already has a pending or running output job.
SQS send failure marks the job failed only with a guarded WHERE status = 'pending' update and never overwrites running or dispatched.
SQS config adds :document-output with queue name, DLQ, visibility timeout, retention, and max receive count defined in this spec.
Worker claims jobs atomically with UPDATE ... WHERE status = 'pending' OR stale running older than 600 seconds ... RETURNING *.
Worker extends SQS visibility before and during DATEV initiation.
Worker handles a pending AP approval-completed job by invoking DATEV export and marking the job dispatched.
Worker marks failed export initiation as failed with last_error.
Duplicate message for a dispatched/fresh-running job does not create another DATEV export.
Stale running jobs can be reclaimed without permanently deleting the only SQS redelivery path.
DATEV audit rows created through the worker include dispatch_job_id.
Document detail UI surfaces document output job failures when no DATEV audit row exists.
AP batch export routes/UI/tests are removed or updated.

Open Future Work

AR output dispatch.
Tenant-configured output policy resolution.
Per-output task/audit records for cases such as DATEV plus Agicap plus email.
Repair tooling for stale pending jobs where SQS send failed.
Rich stuck-running recovery and operator controls beyond the stale-running claim guard.
Holistic virtual-thread/external-side-effect concurrency review.