Orcha Ingestion: Processing Pipeline

Worker architecture and job processing details.

Concurrency Architecture

The worker system is designed for high concurrency on I/O-bound work while properly isolating CPU-intensive tasks.

Components

SQS Poller

A single virtual thread dedicated to polling SQS.

Performs long-polling with WaitTimeSeconds=20 and MaxNumberOfMessages=10
Long-polling is efficient: the call blocks until messages arrive or the timeout expires, avoiding wasteful rapid polling
For each received message, submits a job task to the main executor
Controlled by an atomic shutdown flag; when false, the poller exits its loop cleanly
The MaxNumberOfMessages limit caps each receive call at 10 messages, which bounds the intake rate per poll cycle and acts as a natural form of backpressure
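
The poller loop described above can be sketched as follows. This is a minimal illustration, not the actual implementation: MessageSource is a hypothetical stand-in for the SQS ReceiveMessage call (it is not an AWS SDK type), and the class and parameter names are assumptions made for the example.

```java
import java.util.List;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Hypothetical abstraction over SQS ReceiveMessage; not an AWS SDK type. */
interface MessageSource {
    /** Blocks for up to waitSeconds and returns at most maxMessages message bodies. */
    List<String> receive(int waitSeconds, int maxMessages) throws InterruptedException;
}

final class SqsPollerSketch implements Runnable {
    private final MessageSource source;
    private final Executor jobExecutor;        // the main job executor in the real system
    private final Consumer<String> jobHandler; // runs the processing pipeline for one message
    private final AtomicBoolean running;       // the loop exits once this flag is false

    SqsPollerSketch(MessageSource source, Executor jobExecutor,
                    Consumer<String> jobHandler, AtomicBoolean running) {
        this.source = source;
        this.jobExecutor = jobExecutor;
        this.jobHandler = jobHandler;
        this.running = running;
    }

    @Override
    public void run() {
        while (running.get()) {
            List<String> batch;
            try {
                // Long poll: WaitTimeSeconds=20, MaxNumberOfMessages=10.
                batch = source.receive(20, 10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            // One task per message; the poller itself never runs pipeline code.
            for (String body : batch) {
                jobExecutor.execute(() -> jobHandler.accept(body));
            }
        }
    }
}
```

Because the flag is only checked between receive calls, a shutdown request takes effect after the current long poll returns, which matches the "up to 20 seconds" delay noted in the shutdown sequence.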

Main Job Executor

Uses Java 21's Executors.newVirtualThreadPerTaskExecutor().

Creates a lightweight virtual thread for each job
Virtual threads are cheap (~1KB vs ~1MB for platform threads), enabling thousands of concurrent jobs
Ideal for I/O-bound work: when a virtual thread blocks on HTTP/S3/DB calls, it yields its carrier thread automatically
Each job task executes the full processing pipeline independently

Heartbeat Scheduler

A shared ScheduledThreadPoolExecutor with 2-4 platform threads.

All jobs share this single scheduler for their heartbeat tasks
Each job schedules a recurring task that extends SQS visibility timeout via ChangeMessageVisibility
Heartbeat interval: 30 seconds
Visibility timeout extension: 5 minutes per heartbeat
When a job completes (success or failure), it cancels its scheduled heartbeat task
This prevents the proliferation of per-job schedulers that would waste OS threads
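
A minimal sketch of the shared heartbeat scheduler, under stated assumptions: VisibilityExtender is a hypothetical stand-in for the SQS ChangeMessageVisibility call, the pool size of 2 is simply the low end of the range above, and the method names are illustrative.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the SQS ChangeMessageVisibility call. */
interface VisibilityExtender {
    void extend(String receiptHandle, int visibilityTimeoutSeconds);
}

final class HeartbeatSketch {
    static final long INTERVAL_SECONDS = 30;     // heartbeat interval
    static final int EXTENSION_SECONDS = 5 * 60; // visibility extension per heartbeat

    /** One scheduler shared by all jobs. */
    static ScheduledExecutorService newSharedScheduler() {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(2);
        // Remove cancelled heartbeats from the work queue right away.
        scheduler.setRemoveOnCancelPolicy(true);
        return scheduler;
    }

    /** Schedules a recurring heartbeat for one job and returns its handle. */
    static ScheduledFuture<?> start(ScheduledExecutorService scheduler, VisibilityExtender sqs,
                                    String receiptHandle, long interval, TimeUnit unit) {
        return scheduler.scheduleAtFixedRate(() -> {
            try {
                sqs.extend(receiptHandle, EXTENSION_SECONDS);
            } catch (RuntimeException e) {
                // scheduleAtFixedRate suppresses all later runs if one run throws,
                // so a failed extension is swallowed here (a real worker would log it).
            }
        }, interval, interval, unit);
    }

    /** Called when the job completes, on success or failure. */
    static void stop(ScheduledFuture<?> heartbeat) {
        heartbeat.cancel(false);
    }
}
```

A job would call start(scheduler, sqs, receiptHandle, HeartbeatSketch.INTERVAL_SECONDS, TimeUnit.SECONDS) after claiming and call stop in a finally block, so the heartbeat never outlives the job.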

CPU Pool

A fixed-size ThreadPoolExecutor sized to Runtime.getRuntime().availableProcessors().

Dedicated to CPU-intensive OpenCV preprocessing work
When a job requires image preprocessing, it submits that work to this pool and waits for the result
The calling virtual thread blocks (yielding its carrier), so other I/O-bound jobs continue unimpeded
Keeps CPU-bound work from starving the virtual thread carrier pool
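
The hand-off to the CPU pool can be sketched like this. The class and method names are illustrative assumptions; the only pieces taken from the text are the pool sizing and the submit-then-wait pattern.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class CpuPoolSketch {
    /** Fixed pool sized to the number of available processors. */
    static ExecutorService newCpuPool() {
        return Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    }

    /**
     * Runs cpuWork on the pool and blocks the caller until it finishes.
     * When the caller is a virtual thread, blocking in Future.get() releases
     * its carrier thread for other jobs.
     */
    static <T> T runOnCpuPool(ExecutorService cpuPool, Callable<T> cpuWork) {
        Future<T> future = cpuPool.submit(cpuWork);
        try {
            return future.get();
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for CPU work", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("CPU work failed", e.getCause());
        }
    }
}
```

Since the pool is bounded, at most availableProcessors() preprocessing tasks run at once; additional submissions wait in the pool's queue while their calling jobs stay parked.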

Why This Design

Workload	~% of Jobs	Handled By
I/O-bound (API calls, S3, DB)	90%	Virtual threads in main executor
CPU-bound (OpenCV)	10%	Dedicated CPU pool

Virtual threads excel at I/O-bound concurrency but do not speed up CPU-bound work; a CPU-heavy task running on a virtual thread simply occupies a carrier thread until it finishes. By isolating CPU tasks in a bounded pool sized to the available cores, we get predictable CPU utilization without interfering with I/O-bound jobs.

Job Processing Pipeline

Each job executes the following pipeline within its virtual thread:

Step 1: Validate Message

Parse the SQS message body as a UUID. The message body contains the ingestion ID.

If invalid UUID: Log the error and delete the message immediately. This handles malformed messages that cannot be processed.

If valid UUID: Proceed to Step 2.
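
A minimal sketch of the validation step. The method name is an assumption, and the canonical-form check is an extra strictness added for this example: java.util.UUID.fromString can accept abbreviated inputs such as "1-1-1-1-1", which is probably not what a message producer intended.

```java
import java.util.Optional;
import java.util.UUID;

final class MessageValidationSketch {
    /** Returns the ingestion ID, or empty if the body is not a canonical UUID string. */
    static Optional<UUID> parseIngestionId(String messageBody) {
        if (messageBody == null) {
            return Optional.empty();
        }
        try {
            UUID id = UUID.fromString(messageBody);
            // Reject inputs that parse but are not in the canonical 36-character form.
            return id.toString().equalsIgnoreCase(messageBody)
                    ? Optional.of(id)
                    : Optional.empty();
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }
}
```

An empty result corresponds to the "log and delete" branch; a present value is the ingestion ID passed on to Step 2.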

Step 2: Claim Ingestion

Perform an atomic conditional update to claim the ingestion:

UPDATE ingestion
SET started_at = now()
WHERE id = {ingestion-id}
  AND status = 'in-progress'
  AND (started_at IS NULL OR started_at < now() - interval '5 minutes')
RETURNING *

If the update affects 0 rows, either:

Another worker already claimed this ingestion, OR
The ingestion has already completed/failed

In either case, skip processing and delete the message.

Why the stale check? The started_at timestamp serves as a lock:

If started_at IS NULL: ingestion was never claimed, safe to claim
If started_at is recent: another worker is actively processing, skip
If started_at is older than the configurable threshold (5 minutes in the query above): previous worker crashed, safe to reclaim

This single check provides both duplicate prevention and crash recovery. The threshold is configurable via [:com.getorcha.workers/orchestrator :stale-ingestion-threshold-minutes].
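
To make the claim rule concrete, the WHERE clause can be restated as a pure function. This is only an in-memory illustration with assumed names; in the real system the database evaluates the condition atomically as part of the UPDATE, which is what makes the claim safe under concurrency.

```java
import java.time.Duration;
import java.time.Instant;

final class ClaimRuleSketch {
    /**
     * Mirrors the claim query's WHERE clause:
     * status = 'in-progress' AND (started_at IS NULL OR started_at < now - threshold).
     */
    static boolean canClaim(String status, Instant startedAt, Instant now, Duration staleThreshold) {
        if (!"in-progress".equals(status)) {
            return false; // already completed or failed
        }
        if (startedAt == null) {
            return true;  // never claimed
        }
        // Claimed before: only reclaim if the previous claim has gone stale.
        return startedAt.isBefore(now.minus(staleThreshold));
    }
}
```

With a 5 minute threshold, a claim made 2 minutes ago blocks other workers, while a claim made 6 minutes ago is treated as abandoned.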

Step 3: Load Document

After successfully claiming, load the associated document from the database:

SELECT * FROM document WHERE id = {ingestion.document_id}

This returns the ingestion map structure used throughout the pipeline:

{:id         ingestion-id
 :document   {:document/id ...
              :document/file-path ...
              ...}}

Step 4: Schedule Heartbeat

Schedule the heartbeat task on the shared scheduler. This keeps the message's visibility timeout extended for as long as processing runs.

The heartbeat is only started for claimed ingestions because:

The claim is a fast operation (< 100ms), well within the visibility timeout
It avoids unnecessary scheduler work for ingestions that are skipped because another worker already claimed them
Cleanup is simpler: there is no heartbeat to cancel on the skip path

Step 5: Fetch Document from S3

Retrieve the original document from S3 using the document's file_path:

s3://{bucket}/documents/{document-id}.{ext}

The file contents and mime-type are added to the ingestion map:

{:id       ingestion-id
 :document {...}
 :file     {:contents  <byte-array>
            :mime-type "application/pdf"}}

Step 6: Transcription

The transcription phase extracts text from the document, automatically preprocessing low-quality images when needed.

The transcription provider:

Runs OCR using Google Document AI Enterprise with:
- processOptions.ocrConfig.enableImageQualityScores: true — Returns page-level quality assessment (0-1 score) and defect breakdown (blurry, dark, glare, small fonts, etc.)
- processOptions.ocrConfig.enableNativePdfParsing: true — Extracts embedded text from digital PDFs when available
Evaluates quality against a configurable threshold (e.g., 0.7)
If quality is below threshold and document hasn't been preprocessed:
- Submits to CPU pool for OpenCV preprocessing (contrast enhancement, deblurring, noise reduction, rotation correction)
- Re-runs OCR on the enhanced image
- Includes the preprocessed file in the result for storage and auditability
Returns a result containing:
- Extracted text
- Quality score (0-1)
- Page count
- Preprocessed file (if preprocessing was performed)

Note: The extraction quality gate (Step 9) is the ultimate decider of document processing success—even low-quality transcription may produce acceptable extraction results, so processing proceeds regardless of transcription quality score.
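
The control flow of the transcription phase can be sketched as below. The Ocr and Result records and the two function parameters are hypothetical shapes chosen for the example: ocr stands in for the Document AI call and preprocess for the OpenCV work that would run on the CPU pool. Page count is omitted for brevity.

```java
import java.util.function.Function;
import java.util.function.UnaryOperator;

final class TranscriptionFlowSketch {
    /** Hypothetical OCR output: extracted text plus a quality score in [0, 1]. */
    record Ocr(String text, double quality) {}

    /** Hypothetical phase result; preprocessedFile is null when no preprocessing ran. */
    record Result(String text, double quality, byte[] preprocessedFile) {}

    static Result transcribe(byte[] file, double qualityThreshold,
                             Function<byte[], Ocr> ocr, UnaryOperator<byte[]> preprocess) {
        Ocr first = ocr.apply(file);
        if (first.quality() >= qualityThreshold) {
            return new Result(first.text(), first.quality(), null);
        }
        // Below threshold: enhance once, then re-run OCR on the enhanced file.
        byte[] enhanced = preprocess.apply(file);
        Ocr second = ocr.apply(enhanced);
        // The second result is returned even if it is still below the threshold;
        // the extraction quality gate decides success later.
        return new Result(second.text(), second.quality(), enhanced);
    }
}
```

Preprocessing happens at most once per call, which corresponds to the "document hasn't been preprocessed" condition above.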

Step 7: Store Transcription Output

Upload transcription result to S3 as EDN at an ingestion-specific path:

s3://{bucket}/ingestions/{ingestion-id}/transcription-output.edn

If preprocessing was performed, also upload:

s3://{bucket}/ingestions/{ingestion-id}/preprocessed.{ext}

The EDN format preserves the full transcription response structure including extracted text, quality score, page count, per-page quality breakdowns, and any other metadata returned by the transcription provider.

Update ingestion record with transcription metadata:

UPDATE ingestion
SET transcription_file_path = 'ingestions/{id}/transcription-output.edn',
    transcription_quality_score = {quality-score},
    preprocessed_file_path = 'ingestions/{id}/preprocessed.{ext}'  -- if applicable
WHERE id = {ingestion-id}

Step 8: Extraction

Send the transcribed text to the LLM with a prompt requesting structured data extraction. The LLM returns JSON containing:

extraction_successful: boolean
confidence: "high" | "medium" | "low"
missing_fields: array of field names that could not be extracted
Extracted fields: invoice_number, invoice_date, supplier_name, supplier_address, line_items, subtotal, vat_amount, total, etc.

Step 9: Extraction Quality Gate

Evaluate the LLM response:

Success criteria:

extraction_successful: true
Critical fields present (at minimum: invoice_number, invoice_date, total)
confidence not "low"

If success: Proceed to Step 10.

If failure: Branch to failure handling (see Failure Handling section below).
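
A minimal sketch of the gate, assuming the LLM response has already been parsed into a map. The class name is illustrative, and treating a blank string as a missing field is an assumption of this example rather than something the criteria above specify.

```java
import java.util.List;
import java.util.Map;

final class ExtractionGateSketch {
    static final List<String> CRITICAL_FIELDS =
            List.of("invoice_number", "invoice_date", "total");

    /** Returns true when the parsed LLM response meets all success criteria. */
    static boolean passes(Map<String, Object> response) {
        if (!Boolean.TRUE.equals(response.get("extraction_successful"))) {
            return false;
        }
        if ("low".equals(response.get("confidence"))) {
            return false;
        }
        for (String field : CRITICAL_FIELDS) {
            Object value = response.get(field);
            // Assumption: a blank string counts as missing.
            if (value == null || (value instanceof String s && s.isBlank())) {
                return false;
            }
        }
        return true;
    }
}
```

A false result would be surfaced as a failure so that it flows into the same handling path as S3, OCR, or LLM errors.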

Step 10: Store Result

Update the ingestion record with the extraction result:

UPDATE ingestion
SET status = 'completed',
    structured_data = {json-result},
    valid_structured_data = true,
    extraction_input_tokens = {tokens},
    extraction_output_tokens = {tokens},
    extraction_model = {model},
    completed_at = now()
WHERE id = {ingestion-id}

A database trigger automatically propagates the result to the document:

-- Trigger: update_document_from_ingestion()
UPDATE document
SET structured_data = NEW.structured_data,
    needs_human_review = NOT COALESCE(NEW.valid_structured_data, false),
    updated_at = now()
WHERE id = NEW.document_id

Step 11: Cleanup (Success Path)

Cancel the heartbeat task
Delete the SQS message

The job is complete.

Failure Handling

All failures—whether soft (extraction quality gate) or hard (S3/OCR/LLM exceptions)—are handled uniformly.

Unified Error Handling

When any exception occurs after claiming, the system:

Increments attempt_count on the ingestion (always, regardless of failure type)
Checks if max attempts reached (configured as 3)
Updates ingestion status if max attempts reached

UPDATE ingestion
SET attempt_count = attempt_count + 1,
    status = CASE
      WHEN attempt_count + 1 >= 3 THEN 'failed'
      ELSE status
    END,
    completed_at = CASE
      WHEN attempt_count + 1 >= 3 THEN now()
      ELSE completed_at
    END
WHERE id = {ingestion-id}
RETURNING attempt_count, status

Failure Outcomes

Based on the returned status after the update:

Status	Action	Reason
`failed`	Delete SQS message	Max attempts reached, route to human review
`in-progress`	Let visibility expire	Will retry from beginning
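
The attempt accounting and the resulting action can be modelled in memory as follows. This only illustrates the decision logic with assumed names; in the real system the increment and status change happen in the single UPDATE shown above.

```java
final class FailureOutcomeSketch {
    static final int MAX_ATTEMPTS = 3;

    enum Action { DELETE_MESSAGE, LET_VISIBILITY_EXPIRE }

    /** Hypothetical view of the UPDATE's RETURNING row plus the action taken on the SQS message. */
    record Outcome(int attemptCount, String status, Action action) {}

    /** Applies one failure to an ingestion that had attemptCountBefore recorded attempts. */
    static Outcome recordFailure(int attemptCountBefore) {
        int attempts = attemptCountBefore + 1;
        boolean exhausted = attempts >= MAX_ATTEMPTS;
        return new Outcome(
                attempts,
                exhausted ? "failed" : "in-progress",
                exhausted ? Action.DELETE_MESSAGE : Action.LET_VISIBILITY_EXPIRE);
    }
}
```

Starting from attempt_count = 0, the first two failures leave the message in the queue for redelivery and the third marks the ingestion failed and deletes the message.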

When an ingestion fails, the database trigger updates the document:

-- needs_human_review is set to true when valid_structured_data is false/null
UPDATE document SET needs_human_review = true WHERE id = {document-id}

Transcription Caching Within Same Ingestion

When a worker crashes after transcription but before extraction (or any later stage), the SQS message will be redelivered and another attempt begins. To avoid redundant transcription calls:

Before running transcription, check if the transcription output file exists in S3 at ingestions/{id}/transcription-output.edn
If it exists, download and reuse it
If not, run transcription and store the result

This is a simple file existence check, not a status-based checkpoint system. Each ingestion stores its own transcription results, providing full auditability.
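
The existence check can be sketched as a small load-or-compute helper. ObjectStore is a hypothetical stand-in for the S3 client (a get that returns empty when the key is absent), and the transcription output is represented as a plain string for brevity.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical stand-in for the S3 client; get returns empty when the key does not exist. */
interface ObjectStore {
    Optional<String> get(String key);
    void put(String key, String body);
}

final class TranscriptionCacheSketch {
    static String keyFor(String ingestionId) {
        return "ingestions/" + ingestionId + "/transcription-output.edn";
    }

    /** Reuses a stored transcription if present; otherwise transcribes and stores the result. */
    static String loadOrTranscribe(ObjectStore store, String ingestionId, Supplier<String> transcribe) {
        String key = keyFor(ingestionId);
        Optional<String> cached = store.get(key);
        if (cached.isPresent()) {
            return cached.get(); // a previous attempt already paid for transcription
        }
        String fresh = transcribe.get();
        store.put(key, fresh);
        return fresh;
    }
}
```

Because the key is derived from the ingestion ID, a retry of the same ingestion finds the file, while a new ingestion of the same document does not.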

Design Rationale

This unified approach was chosen over separate soft/hard failure handling because:

Simpler code: Single catch block, single database query
Consistent retry tracking: attempt_count always reflects total attempts
Predictable behavior: An ingestion that keeps failing is marked failed after exactly 3 attempts
Full auditability: Each ingestion has complete transcription and extraction records

Failure Scenarios and Retry Strategies

Failure Type	Likely Cause	Retry Strategy
Transcription quality low	Blurry/dark image	Preprocess and retry
Extraction timeout	Transient API issue	Retry as-is
Extraction rate limited	Quota exceeded	Retry with backoff (SQS delay)
Extraction incomplete	Ambiguous document	Retry once, then human review
Critical fields missing	Non-standard format	Human review
S3/DB errors	Infrastructure issue	Retry as-is
Transcription API error	Transient API issue	Retry as-is

Dead Letter Queue

Configure SQS with a dead-letter queue (DLQ) as a safety net. Messages that fail repeatedly beyond SQS's own retry policy land in the DLQ for investigation. This catches edge cases not handled by application-level retry logic.

Stale Ingestion Handling

If a worker crashes after claiming but before the catch block executes:

The started_at timestamp remains set
The SQS message is redelivered after visibility timeout
Another worker can claim it once started_at is more than 5 minutes old (started_at < now() - interval '5 minutes')

This is handled directly in the claim query's stale check—no separate cleanup job required.

Graceful Shutdown

The system supports graceful shutdown to prevent work loss during deployments.

Shutdown Sequence

JVM shutdown hook triggers (SIGTERM, SIGINT, or programmatic shutdown)
Stop the poller: Set the atomic shutdown flag to false. The poller completes its current long-poll cycle (up to 20 seconds) and exits without fetching more messages.
Shutdown main executor: Call executor.shutdown(). No new tasks are accepted, but in-flight jobs continue.
Await termination: Wait for all in-flight jobs to complete with a generous timeout (e.g., 10-15 minutes). Jobs continue running, including their heartbeats maintaining SQS visibility.
Force shutdown if needed: If timeout expires and jobs are still running, call executor.shutdownNow(). Any interrupted jobs will have their messages become visible again in SQS after visibility timeout expires, enabling reprocessing by another worker.
Cleanup schedulers and pools: Shut down the heartbeat scheduler and CPU pool.
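
The sequence above can be sketched as a single method. The class name and parameter list are assumptions for illustration; the executor calls (shutdown, awaitTermination, shutdownNow) are the standard java.util.concurrent ones.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

final class GracefulShutdownSketch {
    /** Returns true if all in-flight jobs finished within the timeout. */
    static boolean shutdown(AtomicBoolean pollerRunning, ExecutorService jobExecutor,
                            ExecutorService heartbeatScheduler, ExecutorService cpuPool,
                            long timeout, TimeUnit unit) {
        // Stop the poller: it exits after its current long-poll cycle.
        pollerRunning.set(false);

        // Stop accepting new jobs; in-flight jobs keep running.
        jobExecutor.shutdown();

        // Wait for in-flight jobs. Heartbeats keep running during this wait.
        boolean drained;
        try {
            drained = jobExecutor.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            drained = false;
        }

        // Force shutdown if needed: interrupts the remaining jobs.
        if (!drained) {
            jobExecutor.shutdownNow();
        }

        // Only now stop the shared scheduler and the CPU pool.
        heartbeatScheduler.shutdown();
        cpuPool.shutdown();
        return drained;
    }
}
```

A worker would typically register this with Runtime.getRuntime().addShutdownHook(new Thread(...)). Shutting down the heartbeat scheduler last matters: stopping it earlier would let visibility timeouts lapse while jobs are still finishing.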

Deployment Considerations

Set Kubernetes/ECS termination grace period to exceed the await timeout
Monitor for jobs that consistently exceed expected duration—they may indicate issues
Unfinished jobs are safe: SQS visibility timeout ensures they'll be redelivered

Future Considerations

See Potential Improvements for detailed proposals on: