Gübelin Slip Processing Pipeline

Document processing pipeline for extracting structured data from Gübelin historical inventory cards (Warenstammkarte).

Overview

This project processes ~600,000 historical inventory slips from scanned PDFs in four steps:

1. Split large PDFs into individual slip images (front and back combined)
2. Run OCR with Google Document AI
3. Extract structured data with the Gemini 2.0 Flash LLM
4. Store the results in SQLite as normalized fields plus the full JSON
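
The storage step (normalized fields plus the full JSON) could look roughly like the sketch below. The table name `slips`, its columns, and the helpers `init_db`, `store_slip` and `search_customer` are assumptions for illustration; the actual schema in db.py may differ.

```python
import json
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create a hypothetical slips table: a few normalized columns plus the raw JSON."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS slips (
            slip_id   TEXT PRIMARY KEY,  -- assumed key, e.g. "WStk_0001"
            slip_date TEXT,
            customer  TEXT,
            supplier  TEXT,
            raw_json  TEXT NOT NULL      -- full extraction result
        )
        """
    )


def store_slip(conn: sqlite3.Connection, slip_id: str, result: dict) -> None:
    """Insert one slip; re-processing the same slip_id overwrites the old row."""
    conn.execute(
        "INSERT OR REPLACE INTO slips VALUES (?, ?, ?, ?, ?)",
        (
            slip_id,
            result.get("date"),
            result.get("customer"),
            result.get("supplier"),
            json.dumps(result, ensure_ascii=False),  # keep umlauts readable
        ),
    )
    conn.commit()


def search_customer(conn: sqlite3.Connection, name: str) -> list[str]:
    """Return the ids of all slips whose customer contains the given text."""
    rows = conn.execute(
        "SELECT slip_id FROM slips WHERE customer LIKE ? ORDER BY slip_id",
        (f"%{name}%",),
    )
    return [row[0] for row in rows]
```

Keeping the full JSON next to the normalized columns means new columns can be backfilled later without re-running OCR or the LLM.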

Setup

Prerequisites

Python 3.11+
GCP Project with Document AI and Vertex AI APIs enabled
Service account with appropriate permissions

Installation

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Configuration

Create a .env file with the following settings:

GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
DOCUMENT_AI_PROCESSOR_ID=your-processor-id
GCP_PROJECT_ID=your-project-id
GCP_LOCATION=us  # or eu
LLM_PROVIDER=gemini  # default
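
A minimal sketch of how these settings might be read and checked at startup. It assumes the .env values have already been loaded into the process environment (how the project does this is not shown here). The `Settings` class, `load_settings`, and the choice of which keys are required are hypothetical, as is the fallback of `us` for GCP_LOCATION.

```python
import os
from dataclasses import dataclass

# Assumption: these three keys have no sensible default and must be set.
REQUIRED_KEYS = (
    "GOOGLE_APPLICATION_CREDENTIALS",
    "DOCUMENT_AI_PROCESSOR_ID",
    "GCP_PROJECT_ID",
)


@dataclass(frozen=True)
class Settings:
    credentials_path: str
    processor_id: str
    project_id: str
    location: str
    llm_provider: str


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ); fail fast on gaps."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return Settings(
        credentials_path=env["GOOGLE_APPLICATION_CREDENTIALS"],
        processor_id=env["DOCUMENT_AI_PROCESSOR_ID"],
        project_id=env["GCP_PROJECT_ID"],
        location=env.get("GCP_LOCATION", "us"),        # assumed fallback
        llm_provider=env.get("LLM_PROVIDER", "gemini"),  # documented default
    )
```

Failing at startup with a list of missing keys is cheaper than discovering a misconfigured processor ID partway through a long batch run.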

Usage

1. Split PDFs into slip images

# Basic split
python split.py

# With image preprocessing (CLAHE + denoising)
python split.py --preprocess

# Auto-preprocess only blurry images
python split.py --preprocess=auto

# Parallel processing
python split.py --workers=4

2. Process slips (OCR + LLM extraction)

# Process single slip
python parse.py invoices/WStk/WStk_0001.jpg

# Process with specific output
python parse.py invoices/WStk/WStk_0001.jpg -o result.json

# Batch processing (10 slips per LLM call - REQUIRED for rate limits)
python parse.py invoices/WStk/WStk_*.jpg --batch-size 10
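
Grouping slips into batches of 10 before each LLM call can be sketched as follows. The helper name `batched` is an assumption and not necessarily how parse.py implements --batch-size; `itertools.batched` would do the same job but only exists from Python 3.12, so this version stays compatible with 3.11.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(paths: Iterable[str], batch_size: int = 10) -> Iterator[list[str]]:
    """Yield consecutive lists of at most batch_size paths; the last may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(paths)
    # islice pulls the next batch_size items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


if __name__ == "__main__":
    slips = [f"invoices/WStk/WStk_{n:04d}.jpg" for n in range(1, 26)]
    for batch in batched(slips):
        print(len(batch), batch[0])
```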

3. Database operations

# Import existing JSON files to database
python db.py --import

# View statistics
python db.py --stats

# Search by customer
python db.py --search "Müller"

# Get specific slip
python db.py --get WStk_0001

Output Schema

{
  "slip_number": "WStk 0001",
  "date": "1974-03-15",
  "customer": "Meier AG",
  "supplier": "Omega SA",
  "items": [
    {
      "article_number": "12345",
      "description": "Herrenuhr Gold",
      "quantity": 1,
      "unit_price": 1250.00,
      "total_price": 1250.00
    }
  ],
  "_meta": {
    "ocr_time_ms": 1523,
    "llm_time_ms": 2341,
    "input_tokens": 4521,
    "output_tokens": 892,
    "preprocessed": false
  }
}
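
A hedged sketch of how a result in this shape could be checked before it is stored. Which fields count as required, the rounding tolerance, and the function name `validate_slip` are assumptions; historical cards may legitimately lack some values, so the rules would need tuning against real data.

```python
from datetime import date

# Assumption: every top-level and item field from the example above is required.
REQUIRED_FIELDS = ("slip_number", "date", "customer", "supplier", "items")
ITEM_FIELDS = ("article_number", "description", "quantity", "unit_price", "total_price")


def validate_slip(slip: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the slip passed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in slip]

    if "date" in slip:
        try:
            date.fromisoformat(slip["date"])
        except (TypeError, ValueError):
            problems.append(f"date is not ISO formatted: {slip['date']!r}")

    for index, item in enumerate(slip.get("items") or []):
        missing = [name for name in ITEM_FIELDS if name not in item]
        if missing:
            problems.append(f"items[{index}]: missing {', '.join(missing)}")
            continue
        qty, unit, total = item["quantity"], item["unit_price"], item["total_price"]
        numeric = all(isinstance(value, (int, float)) for value in (qty, unit, total))
        # Allow half a cent of rounding error when comparing prices.
        if numeric and abs(qty * unit - total) > 0.005:
            problems.append(f"items[{index}]: total_price != quantity * unit_price")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to flag a slip for manual review with the full picture.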

Cost Estimate (600k slips)

See COST_COMPARISON_REPORT.md for detailed analysis.

Component	Cost
Document AI OCR	~$900
Gemini 2.0 Flash	~$119
Compute (Cloud Run)	~$50-100
Total	~$1,070-$1,120

Critical: API Rate Limits

The Gemini API enforces strict requests-per-day (RPD) limits. For 600,000 slips this means:

Tier 1: 1,000 RPD = 600 days without batching, 60 days with 10x batching
Tier 2: 10,000 RPD = 60 days without batching, 6 days with 10x batching

Batching is mandatory: group 10 slips per LLM call to meet timeline requirements.
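
The day counts follow from dividing the number of LLM requests by the daily quota. A small sketch of that arithmetic, where the function name `days_needed` is an assumption and the quota is assumed to be fully usable every day:

```python
import math


def days_needed(total_slips: int, requests_per_day: int, batch_size: int = 1) -> int:
    """Days required when each request carries batch_size slips."""
    requests = math.ceil(total_slips / batch_size)
    return math.ceil(requests / requests_per_day)


if __name__ == "__main__":
    print(days_needed(600_000, 1_000))       # 600  (Tier 1, no batching)
    print(days_needed(600_000, 10_000))      # 60   (Tier 2, no batching)
    print(days_needed(600_000, 10_000, 10))  # 6    (Tier 2, 10 slips per call)
```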

Production TODOs

Infrastructure

GCP Infrastructure as Code - Choose and implement IaC solution (see options below)
Cloud Run deployment - Containerize and deploy processing workers
Cloud Storage buckets - Input PDFs, processed images, results
Pub/Sub queues - Job orchestration and progress tracking
Secret Manager - API keys and credentials
Cloud Monitoring - Alerts for errors, rate limits, costs

Application

Batch processing CLI - Process all slips with resume capability
Progress tracking - Track which slips are processed, failed, pending
Error handling - Retry logic, dead letter queue for failures
Rate limit handling - Automatic backoff when hitting RPD limits
Validation - Verify extracted data quality
Export functionality - CSV, Excel, or other formats for downstream systems
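
For the retry and rate-limit items, one possible starting point is exponential backoff with jitter. This is only a sketch: `RateLimitError` is a hypothetical stand-in for whatever exception the LLM client raises on a quota error, and `call_with_backoff` and its parameters are illustrative names.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the client library's quota or 429 error."""


def call_with_backoff(func, *, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call func(); on RateLimitError wait 1s, 2s, 4s, ... (plus jitter) and retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller mark the batch as failed
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

Backoff of this kind helps with short-lived throttling. An exhausted daily quota will not recover within seconds, so that case is better handled by stopping the run and relying on the resume capability listed above.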

Data Quality

Sample validation - Manual review of ~100 slips for accuracy
Confidence thresholds - Flag low-confidence extractions for review
Schema validation - Ensure all required fields are present
Duplicate detection - Handle re-processing of same slips

Cost Optimization

Tier 2 qualification - Spend $250 on Gemini API to unlock higher limits
Batch size tuning - Optimize slips per LLM call (currently 10)
Image compression - Optimize JPEG quality vs file size
Caching - Cache OCR results to avoid reprocessing
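
The OCR caching item could be approached with a content-addressed file cache, sketched below. The function `cached_ocr`, the cache layout, and the `run_ocr` callable (a placeholder for the real Document AI call, assumed to return JSON-serializable data) are all assumptions.

```python
import hashlib
import json
from pathlib import Path


def cached_ocr(image_path, cache_dir, run_ocr):
    """Return the OCR result for an image, calling run_ocr only on a cache miss.

    The cache key is the SHA-256 of the image bytes, so a renamed copy of the
    same scan hits the cache and a re-split or re-preprocessed image does not.
    """
    image_bytes = Path(image_path).read_bytes()
    key = hashlib.sha256(image_bytes).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.json"

    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))

    result = run_ocr(image_bytes)  # placeholder for the paid OCR request
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    return result
```

Since OCR is the largest line in the cost estimate, a cache like this lets the LLM prompt or batch size be changed and re-run without paying for OCR again.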

GCP Infrastructure as Code Options

GCP has no native equivalent of the AWS CDK. The main alternatives are:

1. Terraform (Recommended)

Pros:

Industry standard, massive community
Excellent GCP provider (maintained by Google)
Mature ecosystem, well-documented
State management built-in

Cons:

HCL syntax (not Python)
Learning curve if unfamiliar

# Example: Cloud Run service
resource "google_cloud_run_service" "processor" {
  name     = "slip-processor"
  location = "us-central1"

  template {
    spec {
      containers {
        image = "gcr.io/project/processor:latest"
      }
    }
  }
}

2. Pulumi

Pros:

Native Python/TypeScript support
Similar feel to AWS CDK
Multi-cloud support
Strong typing and IDE support

Cons:

Smaller community than Terraform
Needs Pulumi Cloud or a self-managed backend for state

# Example: Cloud Run service
import pulumi_gcp as gcp

service = gcp.cloudrun.Service("processor",
    location="us-central1",
    template=gcp.cloudrun.ServiceTemplateArgs(
        spec=gcp.cloudrun.ServiceTemplateSpecArgs(
            containers=[gcp.cloudrun.ServiceTemplateSpecContainerArgs(
                image="gcr.io/project/processor:latest",
            )],
        ),
    ),
)

3. CDKTF (CDK for Terraform)

Pros:

AWS CDK developer experience
Uses Terraform providers under the hood
Python, TypeScript, Go support
Best of both worlds

Cons:

Additional abstraction layer
Newer, less mature

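# Example: Cloud Run service
# Runs inside a TerraformStack subclass's __init__, where self is the stack.
from cdktf_cdktf_provider_google.cloud_run_service import (
    CloudRunService,
    CloudRunServiceTemplate,
    CloudRunServiceTemplateSpec,
    CloudRunServiceTemplateSpecContainers,
)

CloudRunService(self, "processor",
    name="slip-processor",
    location="us-central1",
    template=CloudRunServiceTemplate(
        spec=CloudRunServiceTemplateSpec(
            containers=[CloudRunServiceTemplateSpecContainers(
                image="gcr.io/project/processor:latest"
            )]
        )
    )
)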

4. Google Cloud Infrastructure Manager

Pros:

Native GCP service (GA Dec 2023)
Uses Terraform configs
Integrated with GCP console
No separate state management

Cons:

Very new, limited documentation
Less flexible than standalone Terraform

Recommendation

For this project, Terraform and Pulumi are the best choices:

Terraform if the team is familiar with HCL or wants maximum ecosystem support
Pulumi if a Python-native experience is preferred (closer to the AWS CDK feel)

Start with a simple setup:

Cloud Storage buckets (input, processed, output)
Cloud Run service for processing
Pub/Sub for job queue
Cloud Scheduler for batch triggers

License

Internal project - Orcha AG