Google Document AI Setup

Guide for setting up Google Document AI OCR for document ingestion.

Processor Type: Document OCR

We use Document OCR (not Invoice Parser) because:

| Feature              | Document OCR | Invoice Parser |
|----------------------|--------------|----------------|
| Cost per 1,000 pages | $1.50        | $10.00         |
| Quality scores       | ✅ Yes       | ❌ No          |
| Native PDF parsing   | ✅ Yes       | ❌ No          |
| Entity extraction    | ❌ No        | ✅ Yes         |

Since we use Claude for structured data extraction, Invoice Parser's entity extraction is redundant. Document OCR gives us quality scores and native PDF parsing at about 6.7x lower cost ($1.50 vs. $10.00 per 1,000 pages).
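To make the cost difference concrete, here is a minimal sketch that applies the two per-1,000-page rates from the table to a monthly page count. The function name cost_for_pages and the 20,000-page volume are illustrative assumptions, not part of the Orcha codebase.

```shell
#!/bin/sh
# Rates per 1,000 pages, taken from the comparison table above.
OCR_RATE=1.50
INVOICE_RATE=10.00

# cost_for_pages PAGES RATE -> cost in USD with two decimal places
# (hypothetical helper, for illustration only)
cost_for_pages() {
  awk -v pages="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", pages / 1000 * rate }'
}

pages=20000   # hypothetical monthly volume
echo "Document OCR:   \$$(cost_for_pages "$pages" "$OCR_RATE")"      # $30.00
echo "Invoice Parser: \$$(cost_for_pages "$pages" "$INVOICE_RATE")"  # $200.00
```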

Setup Steps

1. Create Service Account

# Create service account
gcloud iam service-accounts create orcha-docai \
  --display-name="Orcha Document AI" \
  --project=YOUR_PROJECT_ID

# Grant Document AI API User role
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="serviceAccount:orcha-docai@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/documentai.apiUser"

# Create and download key (ensure the output directory exists; keep the key out of version control)
mkdir -p credentials
gcloud iam service-accounts keys create credentials/google-docai.json \
  --iam-account=orcha-docai@YOUR_PROJECT_ID.iam.gserviceaccount.com

2. Enable Document AI API

gcloud services enable documentai.googleapis.com --project=YOUR_PROJECT_ID

3. Create Document OCR Processor

Option A: Via REST API

# Get access token
TOKEN=$(gcloud auth print-access-token)

# Create processor (replace LOCATION with 'us' or 'eu' in both the hostname and the path)
curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "OCR_PROCESSOR",
    "displayName": "orcha-document-ocr"
  }' \
  "https://LOCATION-documentai.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/LOCATION/processors"

Response includes the processor ID:

{
  "name": "projects/123456789/locations/us/processors/abc123def456",
  "type": "OCR_PROCESSOR",
  "displayName": "orcha-document-ocr",
  "state": "ENABLED",
  ...
}

The processor ID is the last segment of name, here abc123def456.
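If you are scripting the setup, the ID can be pulled out of the name field with plain shell parameter expansion. This is a small sketch using the example name from the response above; the function name processor_id_from_name is an illustrative assumption.

```shell
#!/bin/sh
# processor_id_from_name NAME -> last path segment of a processor resource name
# (hypothetical helper, for illustration only)
processor_id_from_name() {
  # "${1##*/}" removes the longest prefix ending in "/", leaving the final segment.
  printf '%s\n' "${1##*/}"
}

name="projects/123456789/locations/us/processors/abc123def456"
processor_id=$(processor_id_from_name "$name")
echo "$processor_id"   # abc123def456
```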

Option B: Via Google Cloud Console

  1. Go to Document AI Console
  2. Click "Create Processor"
  3. Select "Document OCR" under "General" category
  4. Name it (e.g., orcha-document-ocr)
  5. Select region (us or eu)
  6. Click "Create"
  7. Copy the Processor ID from the processor details page

4. List Existing Processors

TOKEN=$(gcloud auth print-access-token)
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://LOCATION-documentai.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/LOCATION/processors" \
  | jq '.processors[] | {name, type, displayName, state}'

5. Configure Orcha

Update resources/com/getorcha/config.edn:

  :ocr-config {:provider         :google-document-ai
               :project-id       #profile {:local-dev "your-dev-project"
                                           :default   #orcha/param "GOOGLE_CLOUD_PROJECT"}
               :location         #profile {:local-dev "us"
                                           :default   "eu"}
               :processor-id     #profile {:local-dev "your-dev-processor-id"
                                           :default   #orcha/param "GOOGLE_DOCAI_PROCESSOR_ID"}
               :credentials-file #profile {:local-dev "credentials/google-docai.json"
                                           :default   nil}}

For production, set environment variables or use #orcha/param to fetch the values from SSM Parameter Store.
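One way to catch a missing value before the application starts is a small pre-flight check in the deploy script. The sketch below assumes the two parameter names from the config above (GOOGLE_CLOUD_PROJECT and GOOGLE_DOCAI_PROCESSOR_ID) are exposed as environment variables; the require_env helper and the placeholder values are hypothetical and not part of Orcha.

```shell
#!/bin/sh
# require_env NAME... : returns non-zero and names any unset or empty variables
# (hypothetical helper, for illustration only)
require_env() {
  missing=""
  for var in "$@"; do
    # Indirect lookup of the variable whose name is stored in $var.
    eval "value=\${$var:-}"
    [ -n "$value" ] || missing="$missing $var"
  done
  if [ -n "$missing" ]; then
    echo "missing required variables:$missing" >&2
    return 1
  fi
}

# Placeholder values so the example runs on its own.
GOOGLE_CLOUD_PROJECT="example-project"
GOOGLE_DOCAI_PROCESSOR_ID="abc123def456"
require_env GOOGLE_CLOUD_PROJECT GOOGLE_DOCAI_PROCESSOR_ID && echo "OCR config variables present"
```

In a real deploy script the placeholder assignments would be dropped and the call would typically be followed by an early exit on failure.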

6. Verify Setup

# Test OCR endpoint directly
# (base64 -w0 is the GNU coreutils form; on macOS use: base64 -i test.pdf)
TOKEN=$(gcloud auth print-access-token)
echo '{"rawDocument":{"content":"'$(base64 -w0 test.pdf)'","mimeType":"application/pdf"},"processOptions":{"ocrConfig":{"enableImageQualityScores":true}}}' | \
curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d @- \
  "https://LOCATION-documentai.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID:process"
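To confirm that quality scores are actually coming back, you can save the response and read the per-page score with jq. This is a sketch under two assumptions: the response was written to a file (for example by appending -o response.json to the curl command), and the scores sit at document.pages[].imageQualityScores.qualityScore as in the v1 Document schema. The function name min_quality_score is hypothetical.

```shell
#!/bin/sh
# min_quality_score FILE -> lowest per-page quality score in a saved process response
# (hypothetical helper; assumes the v1 response layout described above)
min_quality_score() {
  jq '[.document.pages[].imageQualityScores.qualityScore] | min' "$1"
}

# Example (run after saving the response from the request above):
#   min_quality_score response.json
```

Taking the minimum is a deliberate simplification: a single unreadable page is usually enough to make a document worth a second look.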

Pricing Reference

| Volume                    | Cost per 1,000 pages |
|---------------------------|----------------------|
| 1 - 5,000,000 pages/month | $1.50                |
| 5,000,001+ pages/month    | $0.60                |

See official pricing for current rates.
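For budgeting, the two tiers can be turned into a rough monthly estimate. The sketch below assumes the tiers are graduated, meaning the first 5,000,000 pages are billed at $1.50 per 1,000 and only the pages beyond that at $0.60; check this against the official pricing page before relying on it. The function name monthly_ocr_cost is hypothetical.

```shell
#!/bin/sh
# monthly_ocr_cost PAGES -> estimated USD for one month of Document OCR
# Assumption: graduated tiers (only pages above the limit get the lower rate).
monthly_ocr_cost() {
  awk -v pages="$1" 'BEGIN {
    tier_limit = 5000000                          # pages billed at the first-tier rate
    first  = (pages < tier_limit) ? pages : tier_limit
    excess = pages - first                        # pages billed at the second-tier rate
    printf "%.2f\n", first / 1000 * 1.50 + excess / 1000 * 0.60
  }'
}

monthly_ocr_cost 250000    # 375.00
monthly_ocr_cost 6000000   # 8100.00
```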

OCR Config Options

Our implementation enables image quality scores (enableImageQualityScores in ocrConfig). Documents whose quality score falls below the threshold (default 0.7) are rejected and sent for manual review.
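The rejection rule amounts to a single comparison. Here is a minimal sketch of it in shell, using the default threshold of 0.7 stated above; the function name needs_manual_review and the sample scores are illustrative assumptions and do not reflect how Orcha implements the check internally.

```shell
#!/bin/sh
QUALITY_THRESHOLD=0.7   # default threshold stated above

# needs_manual_review SCORE : exit status 0 when SCORE is strictly below the threshold
# (hypothetical helper, for illustration only)
needs_manual_review() {
  awk -v score="$1" -v threshold="$QUALITY_THRESHOLD" 'BEGIN { exit !(score < threshold) }'
}

for score in 0.92 0.65; do   # sample scores, not real results
  if needs_manual_review "$score"; then
    echo "$score: reject for manual review"
  else
    echo "$score: accept"
  fi
done
```

A score exactly equal to the threshold is accepted here, since the rule only rejects scores below it.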