Gübelin Slip Processing Pipeline

Document processing pipeline for extracting structured data from Gübelin historical inventory cards (Warenstammkarte).

Overview

This project processes ~600,000 historical inventory slips from scanned PDFs:

  1. Split large PDFs into individual slip images (front+back combined)
  2. OCR using Google Document AI
  3. Extract structured data using Gemini 2.0 Flash LLM
  4. Store in SQLite with normalized fields + full JSON

Setup

Prerequisites

Installation

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Configuration

Create .env file:

GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
DOCUMENT_AI_PROCESSOR_ID=your-processor-id
GCP_PROJECT_ID=your-project-id
GCP_LOCATION=us  # or eu
LLM_PROVIDER=gemini  # default

Usage

1. Split PDFs into slip images

# Basic split
python split.py

# With image preprocessing (CLAHE + denoising)
python split.py --preprocess

# Auto-preprocess only blurry images
python split.py --preprocess=auto

# Parallel processing
python split.py --workers=4

2. Process slips (OCR + LLM extraction)

# Process single slip
python parse.py invoices/WStk/WStk_0001.jpg

# Process with specific output
python parse.py invoices/WStk/WStk_0001.jpg -o result.json

# Batch processing (10 slips per LLM call - REQUIRED for rate limits)
python parse.py invoices/WStk/WStk_*.jpg --batch-size 10

3. Database operations

# Import existing JSON files to database
python db.py --import

# View statistics
python db.py --stats

# Search by customer
python db.py --search "Müller"

# Get specific slip
python db.py --get WStk_0001

Output Schema

{
  "slip_number": "WStk 0001",
  "date": "1974-03-15",
  "customer": "Meier AG",
  "supplier": "Omega SA",
  "items": [
    {
      "article_number": "12345",
      "description": "Herrenuhr Gold",
      "quantity": 1,
      "unit_price": 1250.00,
      "total_price": 1250.00
    }
  ],
  "_meta": {
    "ocr_time_ms": 1523,
    "llm_time_ms": 2341,
    "input_tokens": 4521,
    "output_tokens": 892,
    "preprocessed": false
  }
}

Cost Estimate (600k slips)

See COST_COMPARISON_REPORT.md for detailed analysis.

Component Cost
Document AI OCR ~$900
Gemini 2.0 Flash ~$119
Compute (Cloud Run) ~$50-100
Total ~$1,060-$1,090

Critical: API Rate Limits

Gemini API has strict RPD (Requests Per Day) limits:

Batching is mandatory - group 10 slips per LLM call to meet timeline requirements.


Production TODOs

Infrastructure

Application

Data Quality

Cost Optimization


GCP Infrastructure as Code Options

Unlike AWS CDK, GCP doesn't have a native CDK. Here are the alternatives:

Pros:

Cons:

# Example: Cloud Run service
resource "google_cloud_run_service" "processor" {
  name     = "slip-processor"
  location = "us-central1"

  template {
    spec {
      containers {
        image = "gcr.io/project/processor:latest"
      }
    }
  }
}

2. Pulumi

Pros:

Cons:

# Example: Cloud Run service
import pulumi_gcp as gcp

service = gcp.cloudrun.Service("processor",
    location="us-central1",
    template=gcp.cloudrun.ServiceTemplateArgs(
        spec=gcp.cloudrun.ServiceTemplateSpecArgs(
            containers=[gcp.cloudrun.ServiceTemplateSpecContainerArgs(
                image="gcr.io/project/processor:latest",
            )],
        ),
    ),
)

3. CDKTF (CDK for Terraform)

Pros:

Cons:

# Example: Cloud Run service
from cdktf_cdktf_provider_google import cloud_run_service

CloudRunService(self, "processor",
    name="slip-processor",
    location="us-central1",
    template=CloudRunServiceTemplate(
        spec=CloudRunServiceTemplateSpec(
            containers=[CloudRunServiceTemplateSpecContainers(
                image="gcr.io/project/processor:latest"
            )]
        )
    )
)

4. Google Cloud Infrastructure Manager

Pros:

Cons:

Recommendation

For this project, Terraform or Pulumi are the best choices:

Start with a simple setup:

  1. Cloud Storage buckets (input, processed, output)
  2. Cloud Run service for processing
  3. Pub/Sub for job queue
  4. Cloud Scheduler for batch triggers

License

Internal project - Orcha AG