Orcha

A document processing and ERP platform for extracting structured data from business documents. Orcha combines an asynchronous document ingestion pipeline with a web-based interface for managing and viewing extraction results.

The system uses Google Document AI for OCR and Anthropic Claude for LLM-based data extraction from invoices and other business documents.

Prerequisites

Required Software

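| Tool           | Version | Purpose                  |
|----------------|---------|--------------------------|
| Java           | 21+     | JVM runtime              |
| Clojure CLI    | 1.12+   | Build and run Clojure    |
| Babashka       | Latest  | Task runner              |
| Docker         | Latest  | Container runtime        |
| Docker Compose | v2+     | Local infrastructure     |
| AWS CLI        | v2      | MiniStack initialization |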

macOS Installation

Using Homebrew:

# Java (Temurin/Eclipse Adoptium recommended)
brew install --cask temurin@21

# Clojure CLI
brew install clojure/tools/clojure

# Babashka
brew install borkdude/brew/babashka

# Docker Desktop (includes Docker Compose)
brew install --cask docker

# AWS CLI
brew install awscli

After installing Docker Desktop, launch it from Applications to start the Docker daemon.

Linux Installation (Arch)

sudo pacman -S jdk21-openjdk clojure babashka docker docker-compose aws-cli-v2

# Enable Docker and add your user to the docker group
sudo systemctl enable --now docker
sudo usermod -aG docker $USER
# Log out and back in for group membership to take effect

Verify Installation

java -version          # Should show 21+
clj --version          # Should show 1.12+
bb --version           # Any recent version
docker --version       # Any recent version
docker compose version # v2+
aws --version          # v2+
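
The presence checks above can also be scripted. The following is a minimal sketch, not part of the project: `missing_tools` is a hypothetical helper that only reports which executables are absent from PATH. It does not compare version numbers, and it does not cover Docker Compose, which ships as a `docker` subcommand rather than a separate executable.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every command in "$@"
# that is not on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# The names below are the executables invoked in the verification commands.
missing=$(missing_tools java clj bb docker aws)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All required tools are on PATH"
fi
```

Version requirements still need the manual checks listed above.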

Google Cloud Credentials (OCR)

Orcha uses Google Document AI for OCR. For local development, you need a service account credentials file.

  1. Go to Google Cloud Console
  2. Create or select a project
  3. Enable the Document AI API
  4. Create a Document AI processor (Form Parser or OCR processor)
  5. Create a service account with Document AI permissions
  6. Download the JSON key file

Copy the key file into the project's credentials directory:

mkdir -p credentials
# Copy your downloaded JSON key file
cp ~/Downloads/your-project-xxxxx.json credentials/google-docai.json

The config expects this file at credentials/google-docai.json for local development.
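
A quick sanity check on the copied file can catch a wrong download early. This is a sketch under one assumption: Google service account key files contain a `"type": "service_account"` field, which the check looks for. `check_docai_key` is a hypothetical helper, not something the project provides, and it does not validate the key against GCP.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if the file at $1 is readable and looks
# like a Google service account key, otherwise print a reason and return 1.
check_docai_key() {
  key_file=$1
  if [ ! -r "$key_file" ]; then
    echo "not found or unreadable: $key_file" >&2
    return 1
  fi
  # Service account keys carry a "type": "service_account" field.
  if ! grep -Eq '"type"[[:space:]]*:[[:space:]]*"service_account"' "$key_file"; then
    echo "not a service account key: $key_file" >&2
    return 1
  fi
  echo "ok: $key_file"
}
```

From the project root, it would be called as `check_docai_key credentials/google-docai.json`.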

Setup

1. Clone and Enter the Project

cd /path/to/orcha

2. Start Infrastructure

This starts PostgreSQL and MiniStack (AWS emulator) in Docker containers:

bb dev:up

MiniStack does not seed itself. On the first boot, and again after bb dev:reset, run bb dev:seed once bb dev:up has finished. It creates the S3 buckets, SQS queues, KMS keys, SSM parameters, and secrets Orcha needs. The seed is idempotent: re-running it is safe and creates only what is missing.

Verify containers are running:

bb dev:status

3. Run Database Migrations

bb migrate migrate

This creates all required database tables and seeds initial data for local development.
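
Steps 2 and 3 can be chained into one first-boot routine. The sketch below is illustrative only: `run`, `first_boot`, and the `DRY_RUN` variable are hypothetical names, and it assumes bb dev:up returns only once the containers are ready to be seeded. If that does not hold on your machine, wait for MiniStack to report healthy before seeding.

```shell
#!/bin/sh
# Hypothetical helper: echo a command, then execute it unless DRY_RUN
# is set to a non-empty value.
run() {
  echo "+ $*"
  [ -n "${DRY_RUN:-}" ] || "$@"
}

# First-boot sequence: start containers, seed MiniStack, apply migrations.
# The && chain stops at the first command that fails.
first_boot() {
  run bb dev:up &&
  run bb dev:seed &&
  run bb migrate migrate
}
```

Running `DRY_RUN=1 first_boot` prints the three commands without executing them, which is a cheap way to confirm the order before touching any containers.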

4. Download Dependencies

Clojure downloads dependencies on first run. Pre-fetch them:

clj -P              # Main dependencies
clj -P -A:dev       # Dev dependencies
clj -P -A:test      # Test dependencies

Running the Application

Option 1: REPL

Start a REPL with dev configuration:

clj -A:dev

In the REPL:

;; Start the system
(reset)

;; System is now running:
;; - HTTP server on http://localhost:8888
;; - Workers processing SQS queue

;; After code changes, reload:
(reset)

;; Suspend without stopping:
(ig.repl/suspend)

;; Resume:
(ig.repl/resume)

Option 2: Run Directly

clj -M -m com.getorcha.system

The application runs on http://localhost:8888.

Development Tasks

Babashka provides convenient task shortcuts:

bb dev:up        # Start MiniStack + PostgreSQL
bb dev:seed      # Seed MiniStack with Orcha's AWS resources (idempotent)
bb dev:down      # Stop containers (data persists)
bb dev:reset     # Stop containers and delete all data
bb dev:status    # Check container status
bb dev:logs      # Tail all container logs

bb db:psql       # Connect to PostgreSQL with psql
bb db:logs       # Tail PostgreSQL logs

bb migrate migrate              # Run pending migrations
bb migrate create "add-users"   # Create new migration files
bb migrate rollback             # Roll back the last migration

Database Reset

In the REPL, to completely reset the database:

(reset-db!)

Or from the command line:

bb dev:reset     # Deletes all Docker volumes
bb dev:up        # Recreates containers
bb dev:seed      # Re-seed MiniStack AWS resources
bb migrate migrate

Document Ingestion

To ingest a document for processing:

bb ingest /path/to/document.pdf
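
To ingest a whole directory, the single-file command can be wrapped in a loop. This is a sketch that assumes bb ingest takes one path per invocation, as shown above. `ingest_dir` and the `INGEST_CMD` override are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Hypothetical helper: ingest every .pdf directly inside directory $1,
# one file per call. INGEST_CMD overrides the command (useful for dry
# runs) and defaults to "bb ingest".
ingest_dir() {
  dir=$1
  failed=0
  for file in "$dir"/*.pdf; do
    [ -e "$file" ] || continue   # the glob matched nothing
    ${INGEST_CMD:-bb ingest} "$file" || { echo "failed: $file" >&2; failed=1; }
  done
  return "$failed"
}
```

Calling `INGEST_CMD=echo ingest_dir ./invoices` lists the files that would be submitted. The function keeps going after a failure and returns non-zero if any file failed.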

Running Tests

Tests use Testcontainers, which starts PostgreSQL and MiniStack automatically:

# All tests
clj -X:test

# Specific namespace
clj -X:test :nses '[com.getorcha.workers-test]'

# Single test
clj -X:test :vars '[com.getorcha.workers-test/test-process-message]'

Configuration

Configuration lives in resources/com/getorcha/config.edn using Aero profiles:

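| Profile    | Description                                                  |
|------------|--------------------------------------------------------------|
| :local-dev | MiniStack endpoints, embedded credentials, debug logging     |
| :test      | Test configuration (used by test runner)                     |
| :default   | Production AWS endpoints, environment variable credentials   |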

Key configuration values for local development are pre-configured. For production deployment, set these environment variables:

| Variable                  | Description                        |
|---------------------------|------------------------------------|
| ORCHA_DATABASE_URI        | PostgreSQL connection string       |
| ANTHROPIC_API_KEY         | Claude API key for LLM extraction  |
| GOOGLE_GENAI_API_KEY      | Gemini API key for vision model    |
| GOOGLE_CLOUD_PROJECT      | GCP project ID                     |
| GOOGLE_DOCAI_PROCESSOR_ID | Document AI processor ID           |
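
A deployment script could verify these variables before starting the application. The following is a minimal sketch: `missing_vars` is a hypothetical helper, and treating an empty value as missing is an assumption rather than documented behavior of the config.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every variable in "$@" that is
# unset or empty. eval is used for indirect lookup, so pass only fixed,
# trusted variable names.
missing_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

missing=$(missing_vars ORCHA_DATABASE_URI ANTHROPIC_API_KEY \
  GOOGLE_GENAI_API_KEY GOOGLE_CLOUD_PROJECT GOOGLE_DOCAI_PROCESSOR_ID)
if [ -n "$missing" ]; then
  # In a real deploy script, follow this message with: exit 1
  echo "Unset variables:" $missing >&2
fi
```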

Project Structure

orcha/
├── src/com/getorcha/
│   ├── system.clj          # Application entry point
│   ├── db.clj              # Database layer
│   ├── aws.clj             # AWS client utilities
│   ├── erp/                # Web UI and API
│   │   ├── http.clj        # HTTP routing (Reitit)
│   │   └── ...
│   └── workers/            # Document processing pipeline
│       ├── workers.clj     # SQS orchestrator
│       ├── transcription.clj  # OCR processing
│       └── extraction.clj  # LLM extraction
├── resources/
│   ├── com/getorcha/config.edn  # Configuration
│   └── migrations/         # Database migrations
├── dev/user.clj            # REPL development namespace
├── test/                   # Test namespaces
├── scripts/                # Babashka scripts
├── credentials/            # Local credentials (gitignored)
└── volumes/                # Docker data (gitignored)

Troubleshooting

Docker containers won't start

Ensure the Docker daemon is running:

# macOS: Launch Docker Desktop from Applications

# Linux
sudo systemctl start docker

Port already in use

Check what's using the port:

lsof -i :8888    # Application
lsof -i :5432    # PostgreSQL
lsof -i :4566    # MiniStack

MiniStack not ready

Wait for the health check to pass, or verify manually:

curl http://localhost:4566/_ministack/health
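
When scripting against MiniStack, polling the health endpoint avoids racing the container startup. The retry loop below is a generic sketch: `wait_for` is a hypothetical helper, and the attempt count and delay in the usage note are arbitrary choices rather than values taken from the project.

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds.
#   wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 on the first success, 1 if every attempt fails.
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" >/dev/null 2>&1 && return 0
    i=$((i + 1))
    # Sleep only if another attempt remains.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}
```

For example, `wait_for 30 1 curl -fsS http://localhost:4566/_ministack/health && bb dev:seed` would poll for up to roughly 30 seconds and seed only once the endpoint responds successfully.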

Database connection refused

Ensure the PostgreSQL container is healthy:

bb dev:status
docker logs orcha-postgres-1

Google Document AI errors

Verify that the credentials file exists and is readable:

ls -la credentials/google-docai.json

Ensure the service account has Document AI API permissions in GCP.

Out of memory

Increase the JVM heap size when processing large documents:

clj -J-Xmx4g -A:dev