Google Document AI Setup

Guide for setting up Google Document AI OCR for document ingestion.

Processor Type: Document OCR

We use Document OCR (not Invoice Parser) because:

Feature	Document OCR	Invoice Parser
Cost per 1,000 pages	$1.50	$10.00
Quality scores	✅ Yes	❌ No
Native PDF parsing	✅ Yes	❌ No
Entity extraction	❌ No	✅ Yes

Since we use Claude for structured data extraction, Invoice Parser's entity extraction is redundant. Document OCR gives us quality scores and native PDF parsing at roughly 6.7x lower cost ($1.50 vs. $10.00 per 1,000 pages).

Setup Steps

1. Create Service Account

# Create service account
gcloud iam service-accounts create orcha-docai \
  --display-name="Orcha Document AI" \
  --project=YOUR_PROJECT_ID

# Grant Document AI API User role
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="serviceAccount:orcha-docai@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/documentai.apiUser"

# Create and download key (keep this file out of version control)
mkdir -p credentials
gcloud iam service-accounts keys create credentials/google-docai.json \
  --iam-account=orcha-docai@YOUR_PROJECT_ID.iam.gserviceaccount.com

2. Enable Document AI API

gcloud services enable documentai.googleapis.com --project=YOUR_PROJECT_ID

3. Create Document OCR Processor

Option A: Via REST API

# Get access token
TOKEN=$(gcloud auth print-access-token)

# Create processor (replace both occurrences of LOCATION in the URL with 'us' or 'eu')
curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "OCR_PROCESSOR",
    "displayName": "orcha-document-ocr"
  }' \
  "https://LOCATION-documentai.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/LOCATION/processors"

Response includes the processor ID:

{
  "name": "projects/123456789/locations/us/processors/abc123def456",
  "type": "OCR_PROCESSOR",
  "displayName": "orcha-document-ocr",
  "state": "ENABLED",
  ...
}

The processor ID is the last segment of name, here abc123def456.
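
If the name value has been captured in a shell variable, the last segment can be taken with parameter expansion. A minimal sketch; the function name processor_id_from_name is hypothetical and not part of Orcha or gcloud:

```shell
# Hypothetical helper: print the last path segment of a processor resource name.
processor_id_from_name() {
  # ${1##*/} strips the longest prefix ending in "/", leaving the final segment.
  printf '%s\n' "${1##*/}"
}

processor_id_from_name "projects/123456789/locations/us/processors/abc123def456"
# prints: abc123def456
```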

Option B: Via Google Cloud Console

  1. Go to the Document AI Console
  2. Click "Create Processor"
  3. Select "Document OCR" under the "General" category
  4. Name it (e.g., orcha-document-ocr)
  5. Select a region (us or eu)
  6. Click "Create"
  7. Copy the Processor ID from the processor details page

4. List Existing Processors

TOKEN=$(gcloud auth print-access-token)
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://LOCATION-documentai.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/LOCATION/processors" \
  | jq '.processors[]? | {name, type, displayName, state}'
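
Every endpoint in this guide uses the location twice: once in the regional hostname and once in the resource path. If the two differ, the request goes to the wrong regional endpoint. A small sketch that builds the URL from a single location argument may help avoid that; the function name docai_processors_url and the example project name are hypothetical:

```shell
# Hypothetical helper: print the processors collection URL for a project and location.
docai_processors_url() {
  project=$1
  location=$2
  # The location fills both the regional hostname and the resource path.
  printf 'https://%s-documentai.googleapis.com/v1/projects/%s/locations/%s/processors\n' \
    "$location" "$project" "$location"
}

docai_processors_url my-project eu
# prints: https://eu-documentai.googleapis.com/v1/projects/my-project/locations/eu/processors
```

Appending /PROCESSOR_ID:process to this URL gives the endpoint used in step 6.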

5. Configure Orcha

Update resources/com/getorcha/config.edn:

  :ocr-config {:provider         :google-document-ai
               :project-id       #profile {:local-dev "your-dev-project"
                                           :default   #orcha/param "GOOGLE_CLOUD_PROJECT"}
               :location         #profile {:local-dev "us"
                                           :default   "eu"}
               :processor-id     #profile {:local-dev "your-dev-processor-id"
                                           :default   #orcha/param "GOOGLE_DOCAI_PROCESSOR_ID"}
               :credentials-file #profile {:local-dev "credentials/google-docai.json"
                                           :default   nil}}

For production, supply the following values as environment variables, or have #orcha/param fetch them from SSM Parameter Store:

GOOGLE_CLOUD_PROJECT - GCP project ID
GOOGLE_DOCAI_PROCESSOR_ID - Processor ID from step 3
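
When the values are supplied as environment variables, a pre-flight check can catch a missing one before the application starts. This is only a sketch: the function name missing_env_vars is hypothetical, and it does not cover values resolved from SSM Parameter Store.

```shell
# Hypothetical pre-flight check: print the names of variables that are unset or empty.
missing_env_vars() {
  missing=""
  for var in "$@"; do
    # Indirect lookup via eval keeps this POSIX-compatible.
    eval "value=\${$var:-}"
    [ -n "$value" ] || missing="$missing $var"
  done
  # Trim the leading space before printing.
  printf '%s\n' "${missing# }"
}

missing=$(missing_env_vars GOOGLE_CLOUD_PROJECT GOOGLE_DOCAI_PROCESSOR_ID)
if [ -n "$missing" ]; then
  echo "Missing: $missing" >&2
fi
```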

6. Verify Setup

# Test the OCR endpoint directly
# (base64 output is piped through tr because the -w0 flag is GNU-only and fails on macOS)
TOKEN=$(gcloud auth print-access-token)
echo '{"rawDocument":{"content":"'$(base64 < test.pdf | tr -d '\n')'","mimeType":"application/pdf"},"processOptions":{"ocrConfig":{"enableImageQualityScores":true}}}' | \
curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d @- \
  "https://LOCATION-documentai.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID:process"

Pricing Reference

Volume	Cost per 1,000 pages
1 - 5,000,000 pages/month	$1.50
5,000,001+ pages/month	$0.60

See official pricing for current rates.
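
As a rough illustration of how the two tiers combine, the sketch below estimates a monthly bill from a page count. It assumes the tiers are marginal (the first 5,000,000 pages at $1.50 per 1,000 and only the pages beyond that at $0.60 per 1,000) and uses the rates from the table above, which may be out of date. The function name estimate_ocr_cost is hypothetical.

```shell
# Hypothetical estimator: monthly Document OCR cost in USD for a page count.
# Assumes marginal tiers: first 5,000,000 pages at $1.50 per 1,000,
# pages beyond that at $0.60 per 1,000.
estimate_ocr_cost() {
  awk -v pages="$1" 'BEGIN {
    tier = 5000000
    low  = (pages < tier) ? pages : tier
    high = (pages > tier) ? pages - tier : 0
    printf "%.2f\n", low * 1.50 / 1000 + high * 0.60 / 1000
  }'
}

estimate_ocr_cost 1000000   # prints 1500.00
estimate_ocr_cost 6000000   # prints 8100.00
```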

OCR Config Options

Our implementation enables:

enableImageQualityScores - Returns quality score (0.0-1.0) per page
enableNativePdfParsing - Extracts text from digital PDFs without OCR when possible

Quality scores below the threshold (default 0.7) trigger document rejection for manual review.
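
The threshold rule can be sketched as a shell check over a list of per-page scores. This is an illustration, not the Orcha implementation: the function name passes_quality_threshold is hypothetical, and it assumes a document is rejected as soon as any single page scores below the threshold.

```shell
# Hypothetical check: succeed only if every page score meets the threshold.
# Usage: passes_quality_threshold THRESHOLD SCORE...
passes_quality_threshold() {
  threshold=$1
  shift
  for score in "$@"; do
    # awk does the floating-point comparison; a non-zero exit means "below threshold".
    awk -v s="$score" -v t="$threshold" 'BEGIN { exit !(s >= t) }' || return 1
  done
  return 0
}

if passes_quality_threshold 0.7 0.92 0.65; then
  echo "accept"
else
  echo "reject for manual review"
fi
# prints: reject for manual review
```

With a saved :process response, the scores could likely be listed with jq '.document.pages[].imageQualityScores.qualityScore' response.json; verify the field path against an actual response before relying on it.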