Guide for setting up Google Document AI OCR for document ingestion.
We use Document OCR (not Invoice Parser) because:
| Feature | Document OCR | Invoice Parser |
|---|---|---|
| Cost per 1,000 pages | $1.50 | $10.00 |
| Quality scores | ✅ Yes | ❌ No |
| Native PDF parsing | ✅ Yes | ❌ No |
| Entity extraction | ❌ No | ✅ Yes |
Since we use Claude for structured data extraction, Invoice Parser's entity extraction is redundant. Document OCR gives us quality scores at roughly one-seventh of the price ($1.50 vs. $10.00 per 1,000 pages).
# Create service account
gcloud iam service-accounts create orcha-docai \
--display-name="Orcha Document AI" \
--project=YOUR_PROJECT_ID
# Grant Document AI API User role
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
--member="serviceAccount:orcha-docai@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/documentai.apiUser"
# Create and download key
gcloud iam service-accounts keys create credentials/google-docai.json \
--iam-account=orcha-docai@YOUR_PROJECT_ID.iam.gserviceaccount.com
# Enable the Document AI API (required before creating a processor)
gcloud services enable documentai.googleapis.com --project=YOUR_PROJECT_ID
Option A: Via REST API
# Get access token
TOKEN=$(gcloud auth print-access-token)
# Create processor (change LOCATION to 'us' or 'eu')
curl -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"type": "OCR_PROCESSOR",
"displayName": "orcha-document-ocr"
}' \
"https://LOCATION-documentai.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/LOCATION/processors"
Response includes the processor ID:
{
"name": "projects/123456789/locations/us/processors/abc123def456",
"type": "OCR_PROCESSOR",
"displayName": "orcha-document-ocr",
"state": "ENABLED",
...
}
The processor ID is the last segment of name; in this example it is abc123def456.
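Pulling the ID out of the resource name can be scripted with plain shell parameter expansion. This is a minimal sketch; the function name `processor_id_from_name` is hypothetical and not part of any Google tooling.

```shell
# Hypothetical helper: print the last path segment of a processor resource name.
processor_id_from_name() {
  # ${1##*/} strips everything up to and including the final slash.
  printf '%s\n' "${1##*/}"
}

processor_id_from_name "projects/123456789/locations/us/processors/abc123def456"
# prints: abc123def456
```

The result can be stored in a variable, for example `PROCESSOR_ID=$(processor_id_from_name "$NAME")`, and reused in later requests.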
Option B: Via Google Cloud Console
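Create a Document OCR processor in the Console, name it (e.g. orcha-document-ocr), and select a region (us or eu).
# List processors to confirm the new one exists
TOKEN=$(gcloud auth print-access-token)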
curl -s -H "Authorization: Bearer $TOKEN" \
"https://LOCATION-documentai.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/LOCATION/processors" \
| jq '.processors[] | {name, type, displayName, state}'
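The create and list calls both use the same regional base URL, with LOCATION appearing twice. A small helper avoids editing it in two places. This is a sketch only; `docai_base_url` is a hypothetical name, and the URL pattern is copied from the commands in this guide.

```shell
# Hypothetical helper: build the regional Document AI base URL.
# $1 = location (us or eu), $2 = GCP project ID
docai_base_url() {
  printf 'https://%s-documentai.googleapis.com/v1/projects/%s/locations/%s\n' "$1" "$2" "$1"
}

docai_base_url eu my-project
# prints: https://eu-documentai.googleapis.com/v1/projects/my-project/locations/eu
```

The processors collection is then `"$(docai_base_url us YOUR_PROJECT_ID)/processors"`, which can be passed to curl in place of the hand-edited URL.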
Update resources/com/getorcha/config.edn:
:ocr-config {:provider :google-document-ai
:project-id #profile {:local-dev "your-dev-project"
:default #orcha/param "GOOGLE_CLOUD_PROJECT"}
:location #profile {:local-dev "us"
:default "eu"}
:processor-id #profile {:local-dev "your-dev-processor-id"
:default #orcha/param "GOOGLE_DOCAI_PROCESSOR_ID"}
:credentials-file #profile {:local-dev "credentials/google-docai.json"
:default nil}}
For production, set environment variables or use #orcha/param to fetch from SSM Parameter Store:
- GOOGLE_CLOUD_PROJECT - GCP project ID
- GOOGLE_DOCAI_PROCESSOR_ID - Processor ID from step 3

# Test OCR endpoint directly
TOKEN=$(gcloud auth print-access-token)
echo '{"rawDocument":{"content":"'$(base64 -w0 test.pdf)'","mimeType":"application/pdf"},"processOptions":{"ocrConfig":{"enableImageQualityScores":true}}}' | \
curl -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d @- \
"https://LOCATION-documentai.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID:process"
| Volume | Cost per 1,000 pages |
|---|---|
| 1 - 5,000,000 pages/month | $1.50 |
| 5,000,001+ pages/month | $0.60 |
See official pricing for current rates.
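To get a rough monthly figure from the table above, the tiers can be applied with a few lines of awk. This sketch assumes the tiers are marginal (the first 5,000,000 pages are billed at $1.50 per 1,000 and only the pages beyond that at $0.60); confirm this against the official pricing page. The function name `estimate_monthly_cost` is hypothetical.

```shell
# Hypothetical helper: estimate monthly Document OCR cost in USD.
# $1 = pages processed in the month
# Assumption: tiers are marginal, using the rates from the table in this guide.
estimate_monthly_cost() {
  awk -v pages="$1" 'BEGIN {
    tier1_limit = 5000000
    if (pages <= tier1_limit)
      cost = pages / 1000 * 1.50
    else
      cost = tier1_limit / 1000 * 1.50 + (pages - tier1_limit) / 1000 * 0.60
    printf "%.2f\n", cost
  }'
}

estimate_monthly_cost 20000     # prints: 30.00
estimate_monthly_cost 6000000   # prints: 8100.00
```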
Our implementation enables:
- enableImageQualityScores - Returns a quality score (0.0-1.0) per page
- enableNativePdfParsing - Extracts text from digital PDFs without OCR when possible

Quality scores below the threshold (default 0.7) trigger document rejection for manual review.
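The threshold rule can be illustrated with a small shell check. This is a sketch of the comparison only, not the application's actual implementation. It assumes the per-page score has already been extracted from the response (it is expected under each page's imageQualityScores.qualityScore field, which is worth verifying against the API reference). The function name `page_passes_quality` is hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) when a page's quality score meets the threshold.
# $1 = quality score (0.0-1.0), $2 = threshold (defaults to 0.7)
page_passes_quality() {
  # awk handles the floating-point comparison, which plain sh cannot do.
  awk -v score="$1" -v threshold="${2:-0.7}" \
    'BEGIN { exit !((score + 0) >= (threshold + 0)) }'
}

if page_passes_quality 0.65; then
  echo "accept"
else
  echo "reject for manual review"
fi
# prints: reject for manual review
```

A score exactly equal to the threshold passes here, since only scores below it are rejected.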