A document processing and ERP platform for extracting structured data from business documents. Orcha combines an asynchronous document ingestion pipeline with a web-based interface for managing and viewing extraction results.
The system uses Google Document AI for OCR and Anthropic Claude for LLM-based data extraction from invoices and other business documents.
| Tool | Version | Purpose |
|---|---|---|
| Java | 21+ | JVM runtime |
| Clojure CLI | 1.12+ | Build and run Clojure |
| Babashka | Latest | Task runner |
| Docker | Latest | Container runtime |
| Docker Compose | v2+ | Local infrastructure |
| AWS CLI | v2 | MiniStack initialization |
Using Homebrew:
# Java (Temurin/Eclipse Adoptium recommended)
brew install --cask temurin@21
# Clojure CLI
brew install clojure/tools/clojure
# Babashka
brew install borkdude/brew/babashka
# Docker Desktop (includes Docker Compose)
brew install --cask docker
# AWS CLI
brew install awscli
After installing Docker Desktop, launch it from Applications to start the Docker daemon.
On Arch Linux, using pacman:
sudo pacman -S jdk21-openjdk clojure babashka docker docker-compose aws-cli-v2
# Enable Docker and add your user to the docker group
sudo systemctl enable --now docker
sudo usermod -aG docker $USER
# Log out and back in for group membership to take effect
java -version # Should show 21+
clj --version # Should show 1.12+
bb --version # Any recent version
docker --version # Any recent version
docker compose version # v2+
aws --version # v2+
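The per-tool checks above can also be run in one pass. The following is a minimal sketch, not part of the Orcha repository: `missing_tools` is a hypothetical helper that only reports which commands are absent from PATH and does not validate version numbers.

```shell
# Hypothetical helper: print the name of every listed command
# that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    # `command -v` is the POSIX way to test whether a command resolves.
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Calling `missing_tools java clj bb docker aws` prints nothing when every prerequisite is installed. Version requirements still need the individual `--version` commands.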
Orcha uses Google Document AI for OCR. For local development, you need a service account credentials file.
Place the credentials file at:
mkdir -p credentials
# Copy your downloaded JSON key file
cp ~/Downloads/your-project-xxxxx.json credentials/google-docai.json
The config expects this file at credentials/google-docai.json for local development.
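Before starting the system, it can help to confirm that the file is in place and looks like a service account key. The sketch below assumes only that Google service account key files contain a `"type": "service_account"` field. `check_docai_key` is a hypothetical helper, and it does not verify that the key is valid or has Document AI permissions.

```shell
# Hypothetical helper: sanity-check a Google service account key file.
# Returns 1 if the file is missing or unreadable, 2 if it does not look
# like a service account key, 0 otherwise.
check_docai_key() {
  key_file="${1:-credentials/google-docai.json}"
  if [ ! -r "$key_file" ]; then
    echo "missing or unreadable: $key_file" >&2
    return 1
  fi
  # Service account keys carry "type": "service_account" at the top level.
  if ! grep -q '"type"[[:space:]]*:[[:space:]]*"service_account"' "$key_file"; then
    echo "not a service account key: $key_file" >&2
    return 2
  fi
}
```

Run `check_docai_key` from the repository root to check the default path, or pass another path as the first argument.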
cd /path/to/orcha
This starts PostgreSQL and MiniStack (AWS emulator) in Docker containers:
bb dev:up
This command stores container data under the ./volumes/ directory.
MiniStack does not auto-seed. Run bb dev:seed after bb dev:up on the first boot (or after bb dev:reset) to create the S3 buckets, SQS queues, KMS keys, SSM parameters, and secrets Orcha needs. The seed is idempotent, so re-running it is safe and only creates what's missing.
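One way to confirm the seed worked is to point the AWS CLI at the emulator. This sketch rests on several assumptions: that MiniStack listens on http://localhost:4566 (the port listed in the troubleshooting commands), that it accepts placeholder credentials as AWS emulators commonly do, and that `us-east-1` is an acceptable region. `ministack_aws` is a hypothetical wrapper, not an Orcha task.

```shell
# Hypothetical wrapper: run an AWS CLI command against the local emulator.
# The endpoint, placeholder credentials, and region are assumptions;
# adjust them to match your local configuration.
ministack_aws() {
  AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID:-test}" \
  AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY:-test}" \
  AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
  aws --endpoint-url "${MINISTACK_URL:-http://localhost:4566}" "$@"
}
```

With the wrapper defined, `ministack_aws s3 ls` and `ministack_aws sqs list-queues` should list the seeded buckets and queues if the assumptions hold.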
Verify containers are running:
bb dev:status
bb migrate migrate
This creates all required database tables and seeds initial data for local development.
Clojure downloads dependencies on first run. Pre-fetch them:
clj -P # Main dependencies
clj -P -A:dev # Dev dependencies
clj -P -A:test # Test dependencies
Start a REPL with dev configuration:
clj -A:dev
In the REPL:
;; Start the system
(reset)
;; System is now running:
;; - HTTP server on http://localhost:8888
;; - Workers processing SQS queue
;; After code changes, reload:
(reset)
;; Suspend without stopping:
(ig.repl/suspend)
;; Resume:
(ig.repl/resume)
clj -M -m com.getorcha.system
The application runs on http://localhost:8888.
Babashka provides convenient task shortcuts:
bb dev:up # Start MiniStack + PostgreSQL
bb dev:seed # Seed MiniStack with Orcha's AWS resources (idempotent)
bb dev:down # Stop containers (data persists)
bb dev:reset # Stop containers and delete all data
bb dev:status # Check container status
bb dev:logs # Tail all container logs
bb db:psql # Connect to PostgreSQL with psql
bb db:logs # Tail PostgreSQL logs
bb migrate migrate # Run pending migrations
bb migrate create "add-users" # Create new migration files
bb migrate rollback # Rollback last migration
In the REPL, to completely reset the database:
(reset-db!)
Or from the command line:
bb dev:reset # Deletes all Docker volumes
bb dev:up # Recreates containers
bb dev:seed # Re-seed MiniStack AWS resources
bb migrate migrate
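The four commands above can be chained so that a failure in one step stops the rest. This is a minimal sketch: `reset_local_env` is a hypothetical function name, and it assumes bb dev:up returns only once the containers are ready to accept the seed and migrations.

```shell
# Hypothetical helper: full local reset, stopping at the first failing step.
# Uses only the bb tasks listed above.
reset_local_env() {
  bb dev:reset &&      # delete all Docker volumes
  bb dev:up &&         # recreate containers
  bb dev:seed &&       # re-seed MiniStack AWS resources
  bb migrate migrate   # recreate database tables
}
```

If the containers need extra time to become healthy, run the seed and migration steps again once bb dev:status reports them as running.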
To ingest a document for processing:
bb ingest /path/to/document.pdf
Tests use Testcontainers (auto-starts PostgreSQL and MiniStack):
# All tests
clj -X:test
# Specific namespace
clj -X:test :nses '[com.getorcha.workers-test]'
# Single test
clj -X:test :vars '[com.getorcha.workers-test/test-process-message]'
Tests automatically:
Configuration lives in resources/com/getorcha/config.edn using Aero profiles:
| Profile | Description |
|---|---|
| :local-dev | MiniStack endpoints, embedded credentials, debug logging |
| :test | Test configuration (used by test runner) |
| :default | Production AWS endpoints, environment variable credentials |
Key configuration values for local development are pre-configured. For production deployment, set these environment variables:
| Variable | Description |
|---|---|
| ORCHA_DATABASE_URI | PostgreSQL connection string |
| ANTHROPIC_API_KEY | Claude API key for LLM extraction |
| GOOGLE_GENAI_API_KEY | Gemini API key for vision model |
| GOOGLE_CLOUD_PROJECT | GCP project ID |
| GOOGLE_DOCAI_PROCESSOR_ID | Document AI processor ID |
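A deployment script might check these variables before launching the application. The following is a sketch under that assumption: `missing_env` is a hypothetical helper, not part of Orcha, and it tests only that each variable is non-empty, not that its value is valid.

```shell
# Hypothetical helper: print the names of the given environment variables
# that are unset or empty, separated by spaces.
missing_env() {
  missing=""
  for name in "$@"; do
    # Indirect lookup in POSIX sh: expands to the value of the variable
    # whose name is stored in $name, or to the empty string if unset.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  printf '%s\n' "${missing# }"
}
```

For example, `missing_env ORCHA_DATABASE_URI ANTHROPIC_API_KEY GOOGLE_GENAI_API_KEY GOOGLE_CLOUD_PROJECT GOOGLE_DOCAI_PROCESSOR_ID` prints an empty line when all five are set.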
orcha/
├── src/com/getorcha/
│   ├── system.clj              # Application entry point
│   ├── db.clj                  # Database layer
│   ├── aws.clj                 # AWS client utilities
│   ├── erp/                    # Web UI and API
│   │   ├── http.clj            # HTTP routing (Reitit)
│   │   └── ...
│   └── workers/                # Document processing pipeline
│       ├── workers.clj         # SQS orchestrator
│       ├── transcription.clj   # OCR processing
│       └── extraction.clj      # LLM extraction
├── resources/
│   ├── com/getorcha/config.edn # Configuration
│   └── migrations/             # Database migrations
├── dev/user.clj                # REPL development namespace
├── test/                       # Test namespaces
├── scripts/                    # Babashka scripts
├── credentials/                # Local credentials (gitignored)
└── volumes/                    # Docker data (gitignored)
Ensure Docker daemon is running:
# macOS: Launch Docker Desktop from Applications
# Linux
sudo systemctl start docker
Check what's using the port:
lsof -i :8888 # Application
lsof -i :5432 # PostgreSQL
lsof -i :4566 # MiniStack
Wait for health check or manually verify:
curl http://localhost:4566/_ministack/health
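If MiniStack is slow to start, the manual check above can be turned into a polling loop. This is a minimal sketch assuming the health endpoint returns a successful HTTP status once the emulator is ready. `wait_for_ministack` and its default of 30 attempts, 2 seconds apart, are illustrative choices rather than values taken from Orcha.

```shell
# Hypothetical helper: poll the health endpoint until it responds
# successfully or the attempt limit is reached.
# Usage: wait_for_ministack [url] [attempts] [delay_seconds]
wait_for_ministack() {
  url="${1:-http://localhost:4566/_ministack/health}"
  attempts="${2:-30}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP error responses.
    if curl -fsS "$url" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "MiniStack not healthy after $attempts attempts: $url" >&2
  return 1
}
```

A typical use is `wait_for_ministack && bb dev:seed`, which seeds only after the endpoint answers.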
Ensure PostgreSQL container is healthy:
bb dev:status
docker logs orcha-postgres-1
Verify credentials file exists and has correct permissions:
ls -la credentials/google-docai.json
Ensure the service account has Document AI API permissions in GCP.
Increase JVM heap for large documents:
clj -J-Xmx4g -A:dev