Guide for configuring Google Cloud Document AI for OCR processing.
Prerequisite: gcloud CLI installed and authenticated.

The setup script automates all configuration steps.
# Production setup
./scripts/env-setup/ingestion/doc-ai.sh \
--project-id "getorcha-prod" \
--output "credentials/google-docai-prod.json"
# Development setup
./scripts/env-setup/ingestion/doc-ai.sh \
--project-id "getorcha-dev" \
--output "credentials/google-docai-dev.json"
| Parameter | Required | Description |
|---|---|---|
| --project-id | Yes | GCP project ID |
| --output | Yes | Output path for service account key JSON |
| --location | No | Processor region (default: eu) |
| --processor-name | No | Display name (default: orcha-ocr) |
Among other steps, the script creates a service account (docai-processor@PROJECT.iam.gserviceaccount.com) and grants roles/documentai.apiUser to the service account.

If you prefer to configure manually, follow these steps.
gcloud services enable documentai.googleapis.com --project=YOUR_PROJECT_ID
Using the REST API (gcloud doesn't have documentai commands):
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-d '{"type": "OCR_PROCESSOR", "displayName": "orcha-ocr"}' \
"https://eu-documentai.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/eu/processors"
Note the processor-id from the response (e.g., 2ce14f950a811b13).
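The create call returns a processor resource whose `name` field has the form `projects/.../locations/.../processors/PROCESSOR_ID`, so the ID is the last path segment. A minimal sketch of pulling it out in the shell follows; the function name `processor_id_from_name` is hypothetical and not part of the setup script.

```shell
# Hypothetical helper: extract the processor ID from the "name" field of the
# create-processor response, which looks like
#   projects/PROJECT_NUMBER/locations/eu/processors/PROCESSOR_ID
processor_id_from_name() {
  name="$1"
  # Strip everything up to and including the last slash.
  printf '%s\n' "${name##*/}"
}

processor_id_from_name "projects/123456789/locations/eu/processors/2ce14f950a811b13"
```

The project number in the example is a placeholder; only the trailing segment matters.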
gcloud iam service-accounts create docai-processor \
--display-name="Document AI Processor" \
--description="Service account for Document AI OCR processing" \
--project=YOUR_PROJECT_ID
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
--member="serviceAccount:docai-processor@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/documentai.apiUser"
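To confirm the binding took effect, the project's IAM policy can be filtered down to the roles held by the service account. This is a sketch under the assumption that the account keeps the docai-processor name used above; `sa_member` and `list_sa_roles` are hypothetical helper names.

```shell
# Build the IAM member string for the docai-processor service account.
sa_member() {
  printf 'serviceAccount:docai-processor@%s.iam.gserviceaccount.com\n' "$1"
}

# List the roles bound to that member in the given project.
# Requires an authenticated gcloud session with permission to read the policy.
list_sa_roles() {
  project="$1"
  gcloud projects get-iam-policy "$project" \
    --flatten="bindings[].members" \
    --filter="bindings.members:$(sa_member "$project")" \
    --format="value(bindings.role)"
}

# Usage: list_sa_roles YOUR_PROJECT_ID
```

If the binding succeeded, the output should include roles/documentai.apiUser.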
If your organization blocks key creation, override the policy first:
cat > /tmp/allow-sa-keys.yaml << 'EOF'
name: projects/YOUR_PROJECT_ID/policies/iam.disableServiceAccountKeyCreation
spec:
  rules:
    - enforce: false
EOF
gcloud org-policies set-policy /tmp/allow-sa-keys.yaml --project=YOUR_PROJECT_ID
Then create the key:
gcloud iam service-accounts keys create credentials/google-docai.json \
--iam-account=docai-processor@YOUR_PROJECT_ID.iam.gserviceaccount.com
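After creating the key, a quick sanity check can catch an empty or wrong file before it reaches the application. The sketch below assumes the key was written to credentials/google-docai.json as in the command above; `is_service_account_key` is a hypothetical helper that only looks for the `"type": "service_account"` field and does not validate the key itself.

```shell
# Return 0 if the file exists and looks like a service account key, 1 otherwise.
is_service_account_key() {
  [ -f "$1" ] && grep -Eq '"type"[[:space:]]*:[[:space:]]*"service_account"' "$1"
}

key_file="credentials/google-docai.json"
if is_service_account_key "$key_file"; then
  chmod 600 "$key_file"   # restrict the key to the current user
  echo "key looks valid: $key_file"
else
  echo "missing or not a service account key: $key_file" >&2
fi
```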
:transcription {:ocr {:provider :ocr
                      :project-id #profile {:local-dev "getorcha-dev"
                                            :default #orcha/param "GOOGLE_CLOUD_PROJECT"}
                      :location "eu"
                      :processor-id #profile {:local-dev "2ce14f950a811b13"
                                              :default #orcha/param "GOOGLE_DOCAI_PROCESSOR_ID"}
                      :credentials-file #profile {:local-dev "credentials/google-docai-dev.json"
                                                  :default nil}}}
| Variable | Description |
|---|---|
| GOOGLE_CLOUD_PROJECT | GCP project ID |
| GOOGLE_DOCAI_PROCESSOR_ID | Processor ID from setup |
| GOOGLE_APPLICATION_CREDENTIALS | Path to credentials file (if using file-based auth) |
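A small preflight check can report missing variables before the service starts. The sketch below assumes GOOGLE_CLOUD_PROJECT and GOOGLE_DOCAI_PROCESSOR_ID are always required and that GOOGLE_APPLICATION_CREDENTIALS is checked only when set; `check_docai_env` is a hypothetical helper name.

```shell
# Print each unset or empty variable name; return non-zero if any check fails.
check_docai_env() {
  missing=0
  for var in GOOGLE_CLOUD_PROJECT GOOGLE_DOCAI_PROCESSOR_ID; do
    # Indirect variable lookup that also works in POSIX sh.
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  # The credentials path is optional, but if set it must point at a real file.
  if [ -n "${GOOGLE_APPLICATION_CREDENTIALS:-}" ] && [ ! -f "$GOOGLE_APPLICATION_CREDENTIALS" ]; then
    echo "credentials file not found: $GOOGLE_APPLICATION_CREDENTIALS" >&2
    missing=1
  fi
  return "$missing"
}

# Usage: check_docai_env || exit 1
```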
Option 1: Service Account Key File
Set the GOOGLE_APPLICATION_CREDENTIALS env var to the path of the key file.

Option 2: Workload Identity Federation (recommended for AWS)
| Type | Category | Use Case |
|---|---|---|
| OCR_PROCESSOR | General | Text extraction from images/PDFs |
| FORM_PARSER_PROCESSOR | General | Form field extraction |
| INVOICE_PROCESSOR | Specialized | Invoice data extraction |
| EXPENSE_PROCESSOR | Specialized | Receipt/expense extraction |
| LAYOUT_PARSER_PROCESSOR | General | Document structure analysis |
We use OCR_PROCESSOR for general text extraction. Consider INVOICE_PROCESSOR for better invoice-specific extraction.
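Switching processor types only changes the `type` field in the create request body. The sketch below builds that body and rejects anything outside the types listed in the table; `processor_payload` and the display name orcha-invoices are hypothetical.

```shell
# Emit the JSON body for a create-processor request.
# Accepts only the processor types listed in the table above.
processor_payload() {
  ptype="$1"
  display_name="$2"
  case "$ptype" in
    OCR_PROCESSOR|FORM_PARSER_PROCESSOR|INVOICE_PROCESSOR|EXPENSE_PROCESSOR|LAYOUT_PARSER_PROCESSOR) ;;
    *) echo "unknown processor type: $ptype" >&2; return 1 ;;
  esac
  printf '{"type": "%s", "displayName": "%s"}\n' "$ptype" "$display_name"
}

processor_payload INVOICE_PROCESSOR "orcha-invoices"
```

The output can be passed as the `-d` argument of the curl call used to create the processor.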
Document AI is available in these regions:
| Region | Location |
|---|---|
| us | United States |
| eu | European Union |
We use eu for GDPR compliance.
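The region appears twice in a request URL: once in the hostname and once in the resource path, and the two must agree. A minimal sketch of building the process URL from the configured values follows; `docai_process_url` is a hypothetical helper, and the project and processor IDs are the development values shown earlier.

```shell
# Build the :process URL for a processor. The location is used both in the
# regional hostname and in the resource path.
docai_process_url() {
  project="$1"
  location="$2"
  processor_id="$3"
  printf 'https://%s-documentai.googleapis.com/v1/projects/%s/locations/%s/processors/%s:process\n' \
    "$location" "$project" "$location" "$processor_id"
}

docai_process_url getorcha-dev eu 2ce14f950a811b13
```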
Your organization has the iam.disableServiceAccountKeyCreation policy enabled. Override it at the project level:
cat > /tmp/allow-sa-keys.yaml << 'EOF'
name: projects/YOUR_PROJECT_ID/policies/iam.disableServiceAccountKeyCreation
spec:
  rules:
    - enforce: false
EOF
gcloud org-policies set-policy /tmp/allow-sa-keys.yaml --project=YOUR_PROJECT_ID
Ensure you have roles/documentai.editor or roles/owner on the project.
- Check that the location matches the processor's region (eu vs us)
- Check that the service account has roles/documentai.apiUser

Default quotas (can be increased via GCP Console):
| Resource | Limit |
|---|---|
| Pages per minute | 1,000 |
| Requests per minute | 300 |
| Pages per request | 15 |
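With a per-request page limit, a long document has to be split across several requests. The arithmetic below is a sketch that assumes the 15-page limit from the table applies to every request; `requests_needed` is a hypothetical helper.

```shell
PAGES_PER_REQUEST=15

# Number of requests needed for a document, rounding up to whole requests.
requests_needed() {
  pages="$1"
  echo $(( (pages + PAGES_PER_REQUEST - 1) / PAGES_PER_REQUEST ))
}

requests_needed 40   # prints 3
```

A 40-page document therefore takes three requests, which counts against the requests-per-minute quota as well as the pages-per-minute quota.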