Document processing pipeline for extracting structured data from Gübelin historical inventory cards (Warenstammkarte).
This project processes ~600,000 historical inventory slips from scanned PDFs.
# Create virtual environment
python -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
Create a .env file:
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
DOCUMENT_AI_PROCESSOR_ID=your-processor-id
GCP_PROJECT_ID=your-project-id
# GCP_LOCATION: us or eu
GCP_LOCATION=us
# LLM_PROVIDER: gemini is the default
LLM_PROVIDER=gemini
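A minimal sketch of how a script could read these settings, assuming they have already been exported into the process environment (for example by a dotenv loader). `Settings` and `load_settings` are hypothetical names and not part of this project. Treating every variable except `LLM_PROVIDER` as required is also an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed in the .env file."""
    credentials_path: str
    processor_id: str
    project_id: str
    location: str
    llm_provider: str


def load_settings(env=None):
    """Read the pipeline settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    # Assumption: these four variables must be set for the pipeline to run.
    required = (
        "GOOGLE_APPLICATION_CREDENTIALS",
        "DOCUMENT_AI_PROCESSOR_ID",
        "GCP_PROJECT_ID",
        "GCP_LOCATION",
    )
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))

    # The .env example documents "us" or "eu" as the location values.
    location = env["GCP_LOCATION"]
    if location not in ("us", "eu"):
        raise ValueError(f"GCP_LOCATION must be 'us' or 'eu', got {location!r}")

    return Settings(
        credentials_path=env["GOOGLE_APPLICATION_CREDENTIALS"],
        processor_id=env["DOCUMENT_AI_PROCESSOR_ID"],
        project_id=env["GCP_PROJECT_ID"],
        location=location,
        # gemini is the documented default provider.
        llm_provider=env.get("LLM_PROVIDER", "gemini"),
    )
```

Failing early on a missing or misspelled variable is usually cheaper than discovering it partway through a long batch run.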
# Basic split
python split.py
# With image preprocessing (CLAHE + denoising)
python split.py --preprocess
# Auto-preprocess only blurry images
python split.py --preprocess=auto
# Parallel processing
python split.py --workers=4
# Process single slip
python parse.py invoices/WStk/WStk_0001.jpg
# Process with specific output
python parse.py invoices/WStk/WStk_0001.jpg -o result.json
# Batch processing (10 slips per LLM call - REQUIRED for rate limits)
python parse.py invoices/WStk/WStk_*.jpg --batch-size 10
# Import existing JSON files to database
python db.py --import
# View statistics
python db.py --stats
# Search by customer
python db.py --search "Müller"
# Get specific slip
python db.py --get WStk_0001
{
  "slip_number": "WStk 0001",
  "date": "1974-03-15",
  "customer": "Meier AG",
  "supplier": "Omega SA",
  "items": [
    {
      "article_number": "12345",
      "description": "Herrenuhr Gold",
      "quantity": 1,
      "unit_price": 1250.00,
      "total_price": 1250.00
    }
  ],
  "_meta": {
    "ocr_time_ms": 1523,
    "llm_time_ms": 2341,
    "input_tokens": 4521,
    "output_tokens": 892,
    "preprocessed": false
  }
}
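Because the values come from OCR and an LLM, a cheap consistency check on each result can catch misread digits. The sketch below is illustrative only: `check_slip` is a hypothetical helper, and both the set of required fields and the rule `total_price == quantity * unit_price` are assumptions inferred from the example above, not a documented schema.

```python
import json
from decimal import Decimal

# Assumption: these top-level fields are expected on every slip.
REQUIRED_FIELDS = ("slip_number", "date", "customer", "supplier", "items")


def check_slip(slip):
    """Return a list of human-readable problems found in one parsed slip."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in slip]

    for index, item in enumerate(slip.get("items", [])):
        # Convert via str() so float prices compare as exact decimals.
        expected = Decimal(str(item["quantity"])) * Decimal(str(item["unit_price"]))
        actual = Decimal(str(item["total_price"]))
        if expected != actual:
            problems.append(
                f"items[{index}]: total_price {actual} != quantity * unit_price {expected}"
            )
    return problems


sample = json.loads("""
{
  "slip_number": "WStk 0001",
  "date": "1974-03-15",
  "customer": "Meier AG",
  "supplier": "Omega SA",
  "items": [
    {"article_number": "12345", "description": "Herrenuhr Gold",
     "quantity": 1, "unit_price": 1250.00, "total_price": 1250.00}
  ]
}
""")

print(check_slip(sample))  # [] for the example slip
```

Slips that fail such a check could be flagged for manual review rather than rejected outright, since historical cards may legitimately contain discounts or corrections.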
See COST_COMPARISON_REPORT.md for detailed analysis.
| Component | Cost |
|---|---|
| Document AI OCR | ~$900 |
| Gemini 2.0 Flash | ~$119 |
| Compute (Cloud Run) | ~$50-100 |
| Total | ~$1,069-$1,119 |
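As a quick arithmetic check, the component rows can be summed and divided by the slip count. This is only a sketch that reuses the estimates from the table rows and the ~600,000 slip figure; the variable names are illustrative.

```python
SLIPS = 600_000  # approximate slip count stated for this project

# (low, high) estimates in USD, copied from the component rows of the table
components = {
    "Document AI OCR": (900, 900),
    "Gemini 2.0 Flash": (119, 119),
    "Compute (Cloud Run)": (50, 100),
}

total_low = sum(low for low, _ in components.values())     # 1069
total_high = sum(high for _, high in components.values())  # 1119


def per_slip(total_usd, slips=SLIPS):
    """Average cost of processing one slip, in USD."""
    return total_usd / slips


print(f"Total: ${total_low:,}-${total_high:,}")
print(f"Per slip: ${per_slip(total_low):.4f}-${per_slip(total_high):.4f}")
```

Under these estimates the whole run costs roughly a fifth of a cent per slip, with OCR as the dominant component.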
The Gemini API enforces strict RPD (requests per day) limits.
Batching is therefore mandatory: group 10 slips per LLM call to meet timeline requirements.
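The effect of batching on request volume can be sketched as follows. `batched` and `calls_needed` are hypothetical helpers for illustration; how `parse.py` actually groups slips is not shown here.

```python
from itertools import islice


def batched(paths, batch_size=10):
    """Yield successive lists of at most batch_size slip paths."""
    iterator = iter(paths)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def calls_needed(slip_count, batch_size=10):
    """Number of LLM requests required when slips are grouped into batches."""
    return (slip_count + batch_size - 1) // batch_size  # ceiling division


# Hypothetical file names following the WStk_0001.jpg pattern
slips = [f"invoices/WStk/WStk_{n:04d}.jpg" for n in range(1, 26)]
print([len(batch) for batch in batched(slips)])  # [10, 10, 5]
print(calls_needed(600_000))                     # 60000
```

With 10 slips per call, ~600,000 slips need about 60,000 requests instead of 600,000, which is what makes a daily request quota workable within the timeline.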
Unlike AWS, GCP does not offer a native CDK. Here are the alternatives:
Pros:
Cons:
# Example: Cloud Run service
resource "google_cloud_run_service" "processor" {
  name     = "slip-processor"
  location = "us-central1"

  template {
    spec {
      containers {
        image = "gcr.io/project/processor:latest"
      }
    }
  }
}
Pros:
Cons:
# Example: Cloud Run service
import pulumi_gcp as gcp
service = gcp.cloudrun.Service("processor",
    location="us-central1",
    template=gcp.cloudrun.ServiceTemplateArgs(
        spec=gcp.cloudrun.ServiceTemplateSpecArgs(
            containers=[gcp.cloudrun.ServiceTemplateSpecContainerArgs(
                image="gcr.io/project/processor:latest",
            )],
        ),
    ),
)
Pros:
Cons:
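# Example: Cloud Run service
# Assumes this runs inside a TerraformStack subclass, where `self` is the stack
from cdktf_cdktf_provider_google.cloud_run_service import (
    CloudRunService,
    CloudRunServiceTemplate,
    CloudRunServiceTemplateSpec,
    CloudRunServiceTemplateSpecContainers,
)

CloudRunService(self, "processor",
    name="slip-processor",
    location="us-central1",
    template=CloudRunServiceTemplate(
        spec=CloudRunServiceTemplateSpec(
            containers=[CloudRunServiceTemplateSpecContainers(
                image="gcr.io/project/processor:latest"
            )]
        )
    )
)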
Pros:
Cons:
For this project, Terraform and Pulumi are the best choices.
Start with a simple setup:
Internal project - Orcha AG