Project: Process 600,000 historical Warenstammkarte (inventory cards)
Date: December 2025
Author: Generated for cost analysis
| Provider | Total Estimated Cost | Processing Time (with batching) |
|---|---|---|
| GCP (Recommended) | $1,060 - $1,090 | 6-12 days |
| AWS | $1,058 - $1,083 | 6-12 days |
⚠️ CRITICAL: Gemini API rate limits (RPD - Requests Per Day) are the primary constraint. Without batching, processing would take 60-600 days. Batching multiple slips per LLM call is mandatory.
Recommendation: GCP offers native integration with both Document AI OCR and Gemini 2.0 Flash, which reduces setup complexity. Costs are nearly identical between providers.
    ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
    │ Source PDFs     │────▶│ Split Images    │────▶│ OCR + LLM       │
    │ (~240 GB)       │     │ (~240 GB JPG)   │     │ Processing      │
    └─────────────────┘     └─────────────────┘     └─────────────────┘
                                                             │
             ┌───────────────────────────────────────────────┘
             ▼
    ┌─────────────────┐     ┌─────────────────┐
    │ JSON Output     │────▶│ SQLite DB       │
    │ (~2 GB)         │     │ + Logs          │
    └─────────────────┘     └─────────────────┘
Processing Pipeline:
Rate limits are the primary constraint on processing time, not compute or API costs.
| Limit | Default Value | 600K Slips | Bottleneck? |
|---|---|---|---|
| Requests/minute | 120 | 83 hours (~3.5 days) | ❌ No |
| Pages/minute | 120 | 83 hours (~3.5 days) | ❌ No |
| Batch concurrent jobs | 5 | N/A | ❌ No |
Source: Document AI Quotas
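The "600K Slips" column can be reproduced by dividing the page count by the per-minute quota. A minimal sketch, assuming one page per slip and a sustained request rate at the quota ceiling (the function name is illustrative, not taken from the project code):

```python
def hours_at_per_minute_limit(total_pages: int, pages_per_minute: int) -> float:
    """Hours needed to push total_pages through a per-minute quota at full rate."""
    if pages_per_minute <= 0:
        raise ValueError("pages_per_minute must be positive")
    return total_pages / pages_per_minute / 60


hours = hours_at_per_minute_limit(600_000, 120)
print(f"{hours:.1f} hours (~{hours / 24:.1f} days)")  # 83.3 hours (~3.5 days)
```

In practice, retries and throttling responses would push the real figure somewhat above this lower bound.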
| Tier | RPM | TPM | RPD | 600K Slips (no batching) | With 10x Batching |
|---|---|---|---|---|---|
| Free | 5 | 32K | 25 | ❌ 65 years | ❌ 6.5 years |
| Tier 1 (billing enabled) | 300 | 1M | 1,000 | ❌ 600 days | ⚠️ 60 days |
| Tier 2 ($250 spend + 30 days) | 1,000 | 2M | 10,000 | ⚠️ 60 days | ✅ 6 days |
Legend: ❌ = not feasible, ⚠️ = feasible but slow, ✅ = feasible
Source: Gemini API Rate Limits
| Scenario | LLM Calls | At Tier 1 (1K RPD) | At Tier 2 (10K RPD) |
|---|---|---|---|
| No batching (1 slip/call) | 600,000 | 600 days | 60 days |
| 5 slips/call | 120,000 | 120 days | 12 days |
| 10 slips/call | 60,000 | 60 days | 6 days |
| 20 slips/call | 30,000 | 30 days | 3 days |
Recommendation: Use 10 slips per LLM call on Tier 2 (60,000 calls, about 6 days) as a balance between speed and reliability.
Source: Gemini API Rate Limits Guide
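The batching table follows from two ceiling divisions: slips into calls, then calls into days. The sketch below regenerates it, assuming the RPD quota is the only constraint and that every call carries a full batch:

```python
import math


def days_at_daily_limit(total_slips: int, slips_per_call: int, requests_per_day: int) -> int:
    """Whole days needed when the only constraint is the requests-per-day quota."""
    calls = math.ceil(total_slips / slips_per_call)
    return math.ceil(calls / requests_per_day)


TOTAL_SLIPS = 600_000
for batch_size in (1, 5, 10, 20):
    calls = math.ceil(TOTAL_SLIPS / batch_size)
    tier1 = days_at_daily_limit(TOTAL_SLIPS, batch_size, 1_000)
    tier2 = days_at_daily_limit(TOTAL_SLIPS, batch_size, 10_000)
    print(f"{batch_size:>2} slips/call: {calls:>7,} calls, {tier1:>3} days (Tier 1), {tier2:>2} days (Tier 2)")
```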
| Limit | Default Value | 600K Slips |
|---|---|---|
| Requests/second | 150 | ~1.1 hours |
| Concurrent async jobs | 200 | N/A |
Source: AWS Textract Limits
Note: Textract has no daily request limits, making it less constrained than Gemini for high-volume processing.
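To see why the LLM stage dominates, it helps to put all three limits on the same scale. The following sketch converts each quota to slips per day, assuming Gemini Tier 2 with 10 slips per call and one page per slip:

```python
# Sustained throughput of each stage, normalised to slips per day.
# Quotas come from the tables above; 10 slips per LLM call is the recommended batch size.
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60

slips_per_day = {
    "Document AI OCR (120 pages/min)": 120 * MINUTES_PER_DAY,
    "Textract (150 requests/s)": 150 * SECONDS_PER_DAY,
    "Gemini Tier 2 (10,000 RPD x 10 slips)": 10_000 * 10,
}

# The stage with the lowest daily throughput sets the overall pace.
bottleneck = min(slips_per_day, key=slips_per_day.get)
for stage, rate in slips_per_day.items():
    print(f"{stage}: {rate:,} slips/day, {600_000 / rate:.2f} days for 600K")
print("Bottleneck:", bottleneck)
```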
Both providers offer equivalent OCR capabilities with word-level confidence scores.
| Service | Price per 1,000 pages | 600K Slips | Confidence Scores |
|---|---|---|---|
| GCP Document AI OCR | $1.50 | $900 | ✅ Word-level |
| AWS Textract (Detect Text) | $1.50 | $900 | ✅ Word-level (0-100) |
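Sources: GCP Document AI pricing (cloud.google.com/document-ai/pricing), AWS Textract pricing (aws.amazon.com/textract/pricing)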
Gemini 2.0 Flash is used regardless of cloud provider. Based on measured token usage from test runs:
| Metric | Per Slip | 600K Slips | Cost |
|---|---|---|---|
| Input tokens | 980 | 588,000,000 | $58.80 |
| Output tokens | 252 | 151,200,000 | $60.48 |
| Total LLM | | | $119.28 |
Pricing (implied by the table above): $0.10 per 1M input tokens, $0.40 per 1M output tokens.
Note: Batching does not significantly change token costs (the same content is processed), but it reduces the number of API calls, which is what the RPD limit counts.
Source: Gemini API Pricing
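The LLM cost figures can be checked from the measured per-slip token counts. This sketch uses per-million-token rates inferred from the table itself ($58.80 for 588M input tokens, $60.48 for 151.2M output tokens); they are not quoted from the pricing page and should be re-checked before budgeting:

```python
from decimal import Decimal

SLIPS = 600_000
INPUT_TOKENS_PER_SLIP = 980
OUTPUT_TOKENS_PER_SLIP = 252
# Assumed rates, back-calculated from the cost table above.
INPUT_USD_PER_M = Decimal("0.10")
OUTPUT_USD_PER_M = Decimal("0.40")


def token_cost(tokens: int, usd_per_million: Decimal) -> Decimal:
    """Cost in USD for a token count at a per-million-token rate, rounded to cents."""
    return (Decimal(tokens) / Decimal(1_000_000) * usd_per_million).quantize(Decimal("0.01"))


input_cost = token_cost(SLIPS * INPUT_TOKENS_PER_SLIP, INPUT_USD_PER_M)
output_cost = token_cost(SLIPS * OUTPUT_TOKENS_PER_SLIP, OUTPUT_USD_PER_M)
print(input_cost, output_cost, input_cost + output_cost)  # 58.80 60.48 119.28
```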
Comparable instances for running the Python processing pipeline:
| Provider | Instance | vCPUs | RAM | Price/Hour | Est. Hours | Total |
|---|---|---|---|---|---|---|
| GCP | n2-standard-4 | 4 | 16 GB | $0.19 | 144-288 | $27 - $55 |
| AWS | c5.xlarge | 4 | 8 GB | $0.17 | 144-288 | $24 - $49 |
Updated Assumptions (with rate limits):
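Sources: GCP Compute pricing (cloud.google.com/compute/vm-instance-pricing), AWS EC2 on-demand pricing (aws.amazon.com/ec2/pricing/on-demand)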
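The 144-288 hour range corresponds to a single VM running around the clock for 6 to 12 days. A small sketch of that arithmetic, assuming on-demand pricing and no idle shutdown:

```python
def compute_cost(days: int, usd_per_hour: float) -> float:
    """On-demand cost of one VM running 24 hours a day for the given number of days."""
    return round(days * 24 * usd_per_hour, 2)


for name, rate in (("GCP n2-standard-4", 0.19), ("AWS c5.xlarge", 0.17)):
    print(f"{name}: ${compute_cost(6, rate):.2f} - ${compute_cost(12, rate):.2f}")
```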
| Item | Size | GCP ($/GB/mo) | GCP Total | AWS ($/GB/mo) | AWS Total |
|---|---|---|---|---|---|
| Source PDFs | 240 GB | $0.020 | $4.80 | $0.023 | $5.52 |
| Split Images (JPG) | 240 GB | $0.020 | $4.80 | $0.023 | $5.52 |
| Output (JSON, TXT, DB) | 3 GB | $0.020 | $0.06 | $0.023 | $0.07 |
| Total (1 month) | | | $9.66 | | $11.11 |
Notes:
Sources: GCP Storage pricing (cloud.google.com/storage/pricing), AWS S3 pricing (aws.amazon.com/s3/pricing)
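The monthly storage totals are the summed sizes multiplied by each provider's per-GB rate. A short sketch, assuming everything stays in standard storage for a full month:

```python
from decimal import Decimal

SIZES_GB = {"Source PDFs": 240, "Split Images (JPG)": 240, "Output (JSON, TXT, DB)": 3}
USD_PER_GB_MONTH = {"GCP": Decimal("0.020"), "AWS": Decimal("0.023")}


def monthly_storage_cost(provider: str) -> Decimal:
    """One month of storage for all artefacts, rounded to cents."""
    total_gb = sum(SIZES_GB.values())
    return (total_gb * USD_PER_GB_MONTH[provider]).quantize(Decimal("0.01"))


print(monthly_storage_cost("GCP"), monthly_storage_cost("AWS"))  # 9.66 11.11
```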
| Operation | Count | GCP ($/10K ops) | GCP Total | AWS ($/1K ops) | AWS Total |
|---|---|---|---|---|---|
| PUT (uploads) | 600,000 | $0.05 | $3.00 | $0.005 | $3.00 |
| GET (reads) | 1,200,000 | $0.004 | $0.48 | $0.0004 | $0.48 |
| Total | | | $3.48 | | $3.48 |
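The two providers quote operation prices in different units (per 10,000 versus per 1,000 operations), which makes the table easy to misread. This sketch normalises both to a per-operation price before multiplying, using only the figures from the table:

```python
from decimal import Decimal

# (price in USD, number of operations that price covers), as listed in the table above
PRICING = {
    "GCP": {"PUT": (Decimal("0.05"), 10_000), "GET": (Decimal("0.004"), 10_000)},
    "AWS": {"PUT": (Decimal("0.005"), 1_000), "GET": (Decimal("0.0004"), 1_000)},
}
OPERATIONS = {"PUT": 600_000, "GET": 1_200_000}


def operations_cost(provider: str) -> Decimal:
    """Total request charges for the planned PUT and GET counts, rounded to cents."""
    total = Decimal(0)
    for op, count in OPERATIONS.items():
        price, per = PRICING[provider][op]
        total += price * count / per
    return total.quantize(Decimal("0.01"))


print(operations_cost("GCP"), operations_cost("AWS"))  # 3.48 3.48
```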
Downloading final results (JSON files, SQLite DB, logs):
| Provider | Data Out | Free Tier | Price/GB | Billable | Total |
|---|---|---|---|---|---|
| GCP | ~5 GB | 100 GB/mo | $0.12 | 0 GB | $0.00 |
| AWS | ~5 GB | 100 GB/mo | $0.09 | 0 GB | $0.00 |
Note: Results are small (~3-5 GB). Both providers include 100 GB free egress monthly.
Sources:
Estimated log volume: ~50-100 MB (processing logs, errors, stats)
| Provider | Service | Ingestion ($/GB) | Free Tier | Storage ($/GB/mo) | Total |
|---|---|---|---|---|---|
| GCP | Cloud Logging | $0.50 | 50 GB/mo | $0.01 (after 30d) | $0.00 |
| AWS | CloudWatch Logs | $0.50 | 5 GB/mo | $0.03 | $0.00 |
Notes:
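Sources: GCP Logging pricing (cloud.google.com/stackdriver/pricing), AWS CloudWatch pricing (aws.amazon.com/cloudwatch/pricing)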
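Both the egress and logging lines come out at $0.00 because the expected usage sits well below each free allowance. The sketch below shows that calculation, assuming ~5 GB of egress and ~0.1 GB of logs as estimated above (the helper name is illustrative):

```python
def billable_cost(usage_gb: float, free_gb: float, usd_per_gb: float) -> float:
    """Cost of usage after subtracting the monthly free allowance."""
    return round(max(0.0, usage_gb - free_gb) * usd_per_gb, 2)


egress = {"GCP": billable_cost(5, 100, 0.12), "AWS": billable_cost(5, 100, 0.09)}
log_ingestion = {"GCP": billable_cost(0.1, 50, 0.50), "AWS": billable_cost(0.1, 5, 0.50)}
print(egress, log_ingestion)
```

Charges would only appear if the output grew past the allowance, for example if the split images were downloaded as well as the results.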
GCP total cost:
| Category | Low Estimate | High Estimate |
|---|---|---|
| Document AI OCR | $900.00 | $900.00 |
| Gemini 2.0 Flash | $119.28 | $119.28 |
| Compute (n2-standard-4, 6-12 days) | $27.36 | $54.72 |
| Storage (1 month) | $9.66 | $9.66 |
| Storage Operations | $3.48 | $3.48 |
| Network Egress | $0.00 | $0.00 |
| Logging | $0.00 | $0.00 |
| TOTAL | $1,059.78 | $1,087.14 |

AWS total cost:
| Category | Low Estimate | High Estimate |
|---|---|---|
| Textract (Detect Text) | $900.00 | $900.00 |
| Gemini 2.0 Flash | $119.28 | $119.28 |
| Compute (c5.xlarge, 6-12 days) | $24.48 | $48.96 |
| Storage (1 month) | $11.11 | $11.11 |
| Storage Operations | $3.48 | $3.48 |
| Network Egress | $0.00 | $0.00 |
| Logging | $0.00 | $0.00 |
| TOTAL | $1,058.35 | $1,082.83 |

Processing time by Gemini tier and batching:
| Gemini Tier | Batching | LLM Calls | Time (RPD limited) | Feasible? |
|---|---|---|---|---|
| Tier 1 | None | 600,000 | 600 days | ❌ |
| Tier 1 | 10x | 60,000 | 60 days | ⚠️ Slow |
| Tier 2 | 10x | 60,000 | 6 days | ✅ Recommended |
| Tier 2 | 20x | 30,000 | 3 days | ✅ Aggressive |

Timeline by phase:
| Phase | Constraint | Time Estimate |
|---|---|---|
| PDF Splitting | Compute (local) | ~2-4 hours |
| OCR (Document AI) | 120 RPM | ~83 hours (~3.5 days) |
| LLM Extraction | 10,000 RPD (Tier 2) | ~6 days |
| Total | LLM is bottleneck | ~6-12 days |
Note: OCR and LLM can run in parallel (OCR ahead of LLM), but LLM rate limits dominate the timeline.
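The effect of overlapping OCR and LLM extraction can be estimated with a simple model. This is a rough sketch that ignores start-up lag, retries, and quota resets; it assumes that with overlap only the slower of the two stages contributes to the total:

```python
def timeline_days(split_hours: float, ocr_days: float, llm_days: float, overlap: bool) -> float:
    """Rough end-to-end duration in days.

    With overlap, OCR runs ahead of the LLM stage, so only the slower
    of the two stages adds to the total.
    """
    split_days = split_hours / 24
    if overlap:
        return split_days + max(ocr_days, llm_days)
    return split_days + ocr_days + llm_days


print(round(timeline_days(4, 3.5, 6, overlap=True), 1))   # 6.2
print(round(timeline_days(4, 3.5, 6, overlap=False), 1))  # 9.7
```

Both results fall inside the 6-12 day range, with the upper end leaving room for upload time and reprocessing.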
| Criterion | GCP | AWS | Winner |
|---|---|---|---|
| OCR Cost | $900 | $900 | Tie |
| LLM Cost | $119 | $119 | Tie |
| Compute Cost (6-12 days) | $27-55 | $24-49 | AWS (marginal) |
| Storage Cost | $10 | $11 | GCP (marginal) |
| OCR Rate Limits | 120 RPM | 150 RPS | AWS |
| LLM Rate Limits | Same (Gemini) | Same (Gemini) | Tie |
| Native Gemini Integration | ✅ Yes | ❌ Cross-cloud | GCP |
| Native OCR Integration | ✅ Document AI | ❌ Textract | GCP |
| Setup Complexity | Lower | Higher | GCP |
| Existing Project Setup | ✅ Already configured | ❌ New setup needed | GCP |
Prerequisites:
Expected Timeline: 6-12 days
Expected Cost: ~$1,060-$1,090
Without batching: 600,000 calls → 600 days (Tier 1) or 60 days (Tier 2)
With 10x batching: 60,000 calls → 60 days (Tier 1) or 6 days (Tier 2)
The gubelin_parse.py script already includes an extract_batch_with_llm() function for this purpose.
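One way to drive batched extraction under a daily quota is sketched below. The signature of extract_batch_with_llm() is not shown in this report, so the sketch takes a generic callable as a stand-in; process_daily_quota, its parameters, and the resume offset are all hypothetical names chosen for illustration:

```python
from typing import Callable, List, Sequence, Tuple


def process_daily_quota(
    slips: Sequence[str],
    start: int,
    extract_batch: Callable[[List[str]], list],  # stand-in for the script's batch function
    slips_per_call: int = 10,
    requests_per_day: int = 10_000,
) -> Tuple[list, int]:
    """Process slips in batches until the daily request budget is spent.

    Returns the extracted results and the index to resume from on the next day.
    """
    results: list = []
    position = start
    calls = 0
    while position < len(slips) and calls < requests_per_day:
        batch = list(slips[position:position + slips_per_call])
        results.extend(extract_batch(batch))
        position += len(batch)
        calls += 1
    return results, position


# Usage with a fake extractor and a tiny quota of 2 calls per "day".
fake_extract = lambda batch: [text.upper() for text in batch]
slips = [f"slip-{i}" for i in range(25)]
day1, resume_at = process_daily_quota(slips, 0, fake_extract, slips_per_call=10, requests_per_day=2)
day2, done_at = process_daily_quota(slips, resume_at, fake_extract, slips_per_call=10, requests_per_day=2)
print(len(day1), resume_at, len(day2), done_at)  # 20 20 5 25
```

Persisting the resume index between runs (for example in the SQLite database) would let the job pick up where it stopped after each quota reset.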
| Optimization | Savings | Effort |
|---|---|---|
| Delete intermediate JPGs after processing | $4.80/month | Low |
| Use Spot/Preemptible instances | $15-30 | Medium |
| Request Gemini quota increase | Faster processing | Medium |
| Use Vertex AI instead of Gemini API | Higher limits, enterprise SLA | High |

Pricing sources:
| Item | Source | Date Accessed |
|---|---|---|
| GCP Document AI | cloud.google.com/document-ai/pricing | Dec 2025 |
| AWS Textract | aws.amazon.com/textract/pricing | Dec 2025 |
| Gemini API Pricing | ai.google.dev/gemini-api/docs/pricing | Dec 2025 |
| GCP Compute | cloud.google.com/compute/vm-instance-pricing | Dec 2025 |
| AWS EC2 | aws.amazon.com/ec2/pricing/on-demand | Dec 2025 |
| GCP Storage | cloud.google.com/storage/pricing | Dec 2025 |
| AWS S3 | aws.amazon.com/s3/pricing | Dec 2025 |
| GCP Logging | cloud.google.com/stackdriver/pricing | Dec 2025 |
| AWS CloudWatch | aws.amazon.com/cloudwatch/pricing | Dec 2025 |

Rate limit sources:
| Item | Source | Date Accessed |
|---|---|---|
| Document AI Quotas | cloud.google.com/document-ai/quotas | Dec 2025 |
| Gemini API Rate Limits | ai.google.dev/gemini-api/docs/rate-limits | Dec 2025 |
| Gemini Rate Limits Guide | blog.laozhang.ai/ai-tools/gemini-api-rate-limits-guide | Dec 2025 |
| Gemini Tier Breakdown | aifreeapi.com/en/posts/gemini-api-rate-limit | Dec 2025 |
| AWS Textract Limits | docs.aws.amazon.com/textract/latest/dg/limits.html | Dec 2025 |

Timeline summary:
| Phase | Duration | Constraint |
|---|---|---|
| Setup & Upload | 1 day | Network speed |
| PDF Splitting | 4 hours | Compute |
| OCR Processing | 3.5 days | 120 RPM |
| LLM Extraction | 6 days | 10,000 RPD |
| Total | ~6-12 days | LLM rate limits |

Cost summary:
| Category | Cost |
|---|---|
| OCR | $900 |
| LLM | $119 |
| Compute | $27-55 |
| Storage | $10 |
| Other | $3-4 |
| Total | ~$1,060-$1,090 |
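The rounded total above can be re-derived from the per-category figures earlier in the report. The sketch below sums them for both providers; only compute varies between the low and high estimates:

```python
from decimal import Decimal as D

# Costs that are the same on both providers.
SHARED = {
    "Gemini 2.0 Flash": D("119.28"),
    "Storage operations": D("3.48"),
    "Network egress": D("0.00"),
    "Logging": D("0.00"),
}
# Provider-specific costs; compute is a (low, high) pair for 6 and 12 days.
PROVIDERS = {
    "GCP": {"ocr": D("900.00"), "compute": (D("27.36"), D("54.72")), "storage": D("9.66")},
    "AWS": {"ocr": D("900.00"), "compute": (D("24.48"), D("48.96")), "storage": D("11.11")},
}


def total_range(provider: str):
    """Return the (low, high) project total for one provider."""
    p = PROVIDERS[provider]
    fixed = p["ocr"] + p["storage"] + sum(SHARED.values(), D("0"))
    low, high = p["compute"]
    return fixed + low, fixed + high


for name in PROVIDERS:
    low, high = total_range(name)
    print(f"{name}: ${low:,} - ${high:,}")
```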
Report generated for Gübelin historical document processing project