Phase 2: Embedding Generation - Research

Researched: 2026-02-20
Domain: Text embeddings, vector databases, ML evaluation methodology
Confidence: HIGH

Summary

Phase 2 generates embeddings for ~6K line items using three models: Google text-multilingual-embedding-002 (768 dimensions, via Vertex AI SDK), Jina embeddings-v3 (1024 dimensions, via REST API), and all-MiniLM-L6-v2 (384 dimensions, local via sentence-transformers). The existing database schema already has pre-allocated vector columns with correct dimensions. Before embedding, the data must be split into 80% train and 20% test sets with a persistent is_test_set column to prevent data leakage.

Synthetic query variations for the test set require German-aware augmentation. Since the data is German accounting terminology (supplier names, line item descriptions), we need strategies that handle German compound words and accounting jargon. Options include: nlpaug with back-translation, German keyboard typo simulation, and word reordering for descriptions. Synonym replacement is challenging for German accounting terms, so LLM-based paraphrasing (using Gemini) is recommended for high-quality variations.

Primary recommendation: Add is_test_set boolean column before any embedding work, use stratified split by debit_account for balanced evaluation, store query variations in a separate test_query_variation table, and create HNSW indexes only after all embeddings are populated.

<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

20% test set as specified in requirements
Pre-compute embeddings with Google text-multilingual-embedding-002
Pre-compute embeddings with Jina embeddings-v3
Pre-compute embeddings with all-MiniLM-L6-v2 (local)
Generate synthetic query variations for test set entries
Keep metadata tracking simple and generic
Store metadata at table/config level rather than per-embedding row

Claude's Discretion

Train/test split strategy (random vs stratified by supplier/account)
Which fields to combine for embedding text
Specific synthetic query generation approach
Metadata schema design

Deferred Ideas (OUT OF SCOPE)

None - discussion stayed within phase scope
</user_constraints>

<phase_requirements>

Phase Requirements

ID	Description	Research Support
INFRA-04	Pre-compute embeddings with Google text-multilingual-embedding-002	Use google-genai SDK with Vertex AI. Model outputs 768 dimensions. Batch up to 250 texts per request. See Code Examples section.
INFRA-05	Pre-compute embeddings with Jina embeddings	Use Jina REST API at api.jina.ai/v1/embeddings. embeddings-v3 outputs 1024 dimensions by default. No batch size limit but rate limited by tier.
INFRA-06	Pre-compute embeddings with all-MiniLM-L6-v2 (local)	Use sentence-transformers SentenceTransformer class. Model outputs 384 dimensions. Batch encode is native and fast (~1000 texts/sec on CPU).
INFRA-07	Store embedding model metadata alongside vectors	Create `embedding_model_config` table with model name, dimensions, distance metric. Store at config level per user decision.
EVAL-01	Train/test split (80/20) with held-out test set	Add `is_test_set` boolean column to line_item. Use stratified split by debit_account for balanced evaluation. Split BEFORE embedding.
EVAL-02	Generate synthetic query variations (synonyms, reordering, typos)	Create `test_query_variation` table. Use keyboard typos (nlpaug KeyboardAug), word reordering, and LLM paraphrasing for German accounting context.
</phase_requirements>

Standard Stack

Core

Library	Version	Purpose	Why Standard
google-genai	1.64+	Google/Vertex AI embeddings	Official unified SDK for Gemini/Vertex AI, replaces deprecated google-generativeai
sentence-transformers	5.2+	Local embedding models	Standard library for transformer embeddings, supports Jina v3 with trust_remote_code
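scikit-learn	(unspecified)	Train/test split	`train_test_split` with `stratify` parameter for balanced splits; imported as `sklearn` and installed separately from pandas
requests	(third-party)	Jina API calls	Simple HTTP client for REST API; a PyPI package rather than part of the standard library, so declare it as a dependency if it is not already present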

Supporting

Library	Version	Purpose	When to Use
numpy	2.x	Vector operations	Embedding array manipulation before DB storage
tqdm	4.x	Progress bars	Visual feedback during batch embedding (~6K items)
nlpaug	1.1+	Text augmentation	Keyboard typos, back-translation for German query variations

Alternatives Considered

Instead of	Could Use	Tradeoff
Jina REST API	sentence-transformers local	REST is simpler to set up but has rate limits; local requires downloading the 570M-parameter model
nlpaug keyboard	random character substitution	nlpaug simulates realistic QWERTZ keyboard typos
Gemini paraphrasing	nlpaug back-translation	Gemini is higher quality but costs API calls; back-translation is free

Installation:

# Already in pyproject.toml from Phase 1
uv add tqdm nlpaug  # Add for Phase 2

Architecture Patterns

Recommended Project Structure

src/
├── embeddings/
│   ├── __init__.py
│   ├── google_embed.py     # Vertex AI embedding functions
│   ├── jina_embed.py       # Jina API embedding functions
│   ├── minilm_embed.py     # Local sentence-transformers
│   └── batch_processor.py  # Common batching logic with progress
├── evaluation/
│   ├── __init__.py
│   ├── train_test_split.py # Stratified splitting
│   └── query_variations.py # Synthetic query generation
├── db.py                   # Existing - add update functions
├── normalize.py            # Existing
└── import_csv.py           # Existing

Pattern 1: Stratified Train/Test Split

What: Split data ensuring the test set has proportional representation of each debit_account
When to use: When class distribution matters for evaluation (it does for GL account prediction)
Example:

# Source: sklearn documentation
from sklearn.model_selection import train_test_split
import pandas as pd

def create_train_test_split(conn, test_size: float = 0.20, random_state: int = 42):
    """Add is_test_set column with stratified split by debit_account."""
    # Fetch all IDs and their debit_accounts
    df = pd.read_sql(
        "SELECT id, debit_account FROM line_item",
        conn
    )

    # Stratified split - ensures proportional debit_account distribution
    train_ids, test_ids = train_test_split(
        df['id'],
        test_size=test_size,
        random_state=random_state,
        stratify=df['debit_account']
    )

    # Update database
    with conn.cursor() as cur:
        cur.execute("UPDATE line_item SET is_test_set = FALSE")
        cur.execute(
            "UPDATE line_item SET is_test_set = TRUE WHERE id = ANY(%s)",
            # tolist() yields plain Python ints rather than numpy.int64 values
            (test_ids.tolist(),)
        )
    conn.commit()

    return len(train_ids), len(test_ids)
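
One caveat with `stratify`: scikit-learn's `train_test_split` raises a `ValueError` when any class has fewer than two members, which is plausible for rarely used debit accounts. The following is a minimal pre-check sketch, assuming rows arrive as `(id, debit_account)` tuples. The helper name `find_rare_accounts`, the `min_count` threshold, and the sample account numbers are illustrative and not part of the project.

```python
from collections import Counter

def find_rare_accounts(rows, min_count=2):
    """Return {debit_account: row_count} for accounts with fewer than min_count rows."""
    counts = Counter(account for _id, account in rows)
    return {account: n for account, n in counts.items() if n < min_count}

# Illustrative data: account "6815" appears only once
rows = [(1, "4400"), (2, "4400"), (3, "6815"), (4, "4400"), (5, "6300"), (6, "6300")]
rare = find_rare_accounts(rows)
print(rare)
```

Rows belonging to rare accounts could be kept in the training set or grouped under a placeholder stratum before splitting; which option fits the evaluation is a judgment call.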

Pattern 2: Batch Embedding with Progress

What: Process embeddings in batches with rate limiting and progress tracking
When to use: Any embedding operation on 1000+ texts
Example:

from tqdm import tqdm
import time

def batch_embed_with_progress(
    texts: list[str],
    embed_fn,
    batch_size: int = 100,
    delay_between_batches: float = 0.1
) -> list[list[float]]:
    """Generic batch embedding with progress bar."""
    all_embeddings = []

    for i in tqdm(range(0, len(texts), batch_size), desc="Embedding"):
        batch = texts[i:i + batch_size]
        embeddings = embed_fn(batch)
        all_embeddings.extend(embeddings)
        time.sleep(delay_between_batches)  # Rate limit respect

    return all_embeddings

Pattern 3: Embedding Text Preparation

What: Combine relevant fields into embedding-ready text
When to use: Before calling any embedding model
Recommendation: Combine supplier_name_normalized and description_normalized
Example:

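def prepare_embedding_text(supplier_name: str | None, description: str | None) -> str:
    """Combine fields for embedding. Use normalized versions."""
    # Keep it simple: concatenate with a separator;
    # the embedding models handle context internally.
    # Treat missing values as empty so the literal "None" is never embedded.
    parts = [(supplier_name or "").strip(), (description or "").strip()]
    return " | ".join(part for part in parts if part)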

Anti-Patterns to Avoid

Embedding before split: Never embed then split - test set must be isolated from training
Updating embeddings one-by-one: Use batch UPDATE with arrays, not individual UPDATEs
Creating HNSW indexes during population: Index AFTER all embeddings are inserted
Using different text preparation per model: Use same input text for fair comparison

Don't Hand-Roll

Problem	Don't Build	Use Instead	Why
Train/test split	Random selection	sklearn train_test_split with stratify	Handles edge cases, reproducible with random_state
Batch progress	Print statements	tqdm	Accurate ETA, handles terminals, minimal overhead
German keyboard typos	Character substitution	nlpaug KeyboardAug	QWERTZ layout aware, realistic typo patterns
Rate limiting	Fixed sleep() calls only	Retry with exponential backoff	Reacts to 429 responses instead of guessing a safe delay
Vector serialization	Manual string formatting	pgvector register_vector	Type safety, proper encoding

Key insight: Embedding generation involves API rate limits, batch processing, and progress tracking - all well-solved problems. Focus on the domain-specific query variation generation where custom logic adds value.

Common Pitfalls

Pitfall 1: Data Leakage in Train/Test Split

What goes wrong: Model sees test data during training, inflating accuracy metrics
Why it happens: Splitting after embedding, or using the test set for model selection
How to avoid: Add is_test_set column FIRST, before any embedding. Filter on this column for all training operations.
Warning signs: Suspiciously high accuracy (>90% on semantic search)

Pitfall 2: Rate Limit Exhaustion

What goes wrong: 429 errors from Google/Jina APIs, incomplete embedding runs
Why it happens: Sending requests too fast, no backoff strategy
How to avoid: Batch appropriately (250 for Google, no limit for Jina but respect RPM), add delays between batches, implement retry with exponential backoff
Warning signs: Intermittent failures, partial data
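
The retry-with-backoff advice can be sketched as a small wrapper around any of the embedding functions. This is a minimal sketch under stated assumptions: `RateLimitError` is a hypothetical stand-in for whatever the real client raises on HTTP 429 (for `requests`, an `HTTPError` whose response status is 429 would need to be mapped to it), and the names `embed_with_backoff`, `max_retries`, and `base_delay` are illustrative.

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 raised by an embedding client."""

def embed_with_backoff(embed_fn, batch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call embed_fn(batch), retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return embed_fn(batch)
        except RateLimitError:
            if attempt == max_retries:
                raise  # Out of retries: surface the error to the caller
            # Delay doubles each attempt (1s, 2s, 4s, ...) plus up to 10% jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter is injectable so the wrapper can be exercised in tests without actually waiting.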

Pitfall 3: Embedding Dimension Mismatch

What goes wrong: ERROR: expected N dimensions, not M when inserting
Why it happens: Model configuration differs from schema definition
How to avoid: Schema already correct from Phase 1 (768, 1024, 384). Verify model output dimensions match before batch insert.
Warning signs: First insert fails
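
A pre-insert dimension check might look like the sketch below. The column names and dimensions are taken from the metadata schema later in this document; the helper name `check_dimensions` and the module-level mapping are assumptions made for illustration.

```python
# Expected vector sizes per embedding column (from the Phase 1 schema)
EXPECTED_DIMENSIONS = {
    "embedding_google": 768,
    "embedding_jina": 1024,
    "embedding_minilm": 384,
}

def check_dimensions(column_name, embeddings):
    """Raise ValueError on the first vector whose length differs from the schema."""
    expected = EXPECTED_DIMENSIONS[column_name]
    for index, vector in enumerate(embeddings):
        if len(vector) != expected:
            raise ValueError(
                f"{column_name}: vector {index} has {len(vector)} dimensions, "
                f"expected {expected}"
            )
    return len(embeddings)
```

Running this on the first batch returned by each model, before any database write, turns a failed insert into an early and more readable error.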

Pitfall 4: Inconsistent Embedding Text

What goes wrong: Different models get different input text, invalid comparison
Why it happens: Ad-hoc text preparation in each embedding script
How to avoid: Single prepare_embedding_text() function used by all models
Warning signs: Different embedding column NULL patterns

Pitfall 5: HNSW Index Build During Population

What goes wrong: Each embedding insert triggers an index update, which makes population much slower (estimated at up to 100x)
Why it happens: Creating the index before populating data
How to avoid: Comment out index creation in init.sql (already done). Create indexes AFTER all embeddings are populated.
Warning signs: Embedding insertion takes hours instead of minutes
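
Once all three columns are populated, the index statements could be generated as in the sketch below. The `USING hnsw (... vector_cosine_ops) WITH (m = ..., ef_construction = ...)` form follows pgvector's documented syntax, and 16 and 64 are pgvector's default values. The index names and the helper `hnsw_index_statements` are hypothetical; the column names come from the metadata schema in this document.

```python
EMBEDDING_COLUMNS = ["embedding_google", "embedding_jina", "embedding_minilm"]

def hnsw_index_statements(table="line_item", m=16, ef_construction=64):
    """Build one CREATE INDEX statement per embedding column (cosine distance)."""
    statements = []
    for column in EMBEDDING_COLUMNS:
        statements.append(
            f"CREATE INDEX IF NOT EXISTS idx_{table}_{column}_hnsw "
            f"ON {table} USING hnsw ({column} vector_cosine_ops) "
            f"WITH (m = {m}, ef_construction = {ef_construction});"
        )
    return statements

for statement in hnsw_index_statements():
    print(statement)
```

The statements are built from fixed, trusted names only; they are meant to be run once as a post-population step, not assembled from user input.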

Pitfall 6: Synthetic Queries Too Clean

What goes wrong: Query variations don't represent real user typos/variations
Why it happens: Using only programmatic augmentation without domain knowledge
How to avoid: Mix approaches: keyboard typos for realism, word reordering for variation, LLM paraphrasing for semantic equivalents
Warning signs: All synthetic queries look similar, no realistic misspellings

Code Examples

Google text-multilingual-embedding-002 (Vertex AI)

# Source: Google Cloud Vertex AI documentation
import os
from google import genai
from google.genai.types import EmbedContentConfig

# Set up for Vertex AI
os.environ['GOOGLE_GENAI_USE_VERTEXAI'] = 'True'
os.environ['GOOGLE_CLOUD_PROJECT'] = 'your-project-id'
os.environ['GOOGLE_CLOUD_LOCATION'] = 'us-central1'  # or 'europe-west1'

client = genai.Client()

def embed_google(texts: list[str]) -> list[list[float]]:
    """Embed texts using text-multilingual-embedding-002 via Vertex AI."""
    response = client.models.embed_content(
        model='text-multilingual-embedding-002',
        contents=texts,
        config=EmbedContentConfig(
            task_type='RETRIEVAL_DOCUMENT',  # For stored documents
            # output_dimensionality=768,  # Default for this model
        ),
    )
    return [embedding.values for embedding in response.embeddings]

# Batch size: max 250 texts, 20,000 tokens per request

Jina embeddings-v3 (REST API)

# Source: Jina AI Embeddings API documentation
import os
import requests

JINA_API_KEY = os.environ.get('JINA_API_KEY')
JINA_API_URL = 'https://api.jina.ai/v1/embeddings'

def embed_jina(texts: list[str]) -> list[list[float]]:
    """Embed texts using Jina embeddings-v3 API."""
    response = requests.post(
        JINA_API_URL,
        headers={
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {JINA_API_KEY}',
        },
        json={
            'input': texts,
            'model': 'jina-embeddings-v3',
            'dimensions': 1024,  # Default, matches schema
            'task': 'retrieval.passage',  # For stored documents
        },
    )
    response.raise_for_status()
    return [d['embedding'] for d in response.json()['data']]

# Rate limits: Free tier 100 RPM, Paid 500 RPM
# No batch size limit - batches internally
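
How long a run takes under these limits depends heavily on batch size. The sketch below estimates a lower bound from the request limit and the token limit together. The 100 RPM and 100K TPM figures are the free-tier values quoted in this document; the average of 20 tokens per item is purely an assumption for illustration, and `estimate_minutes` is a hypothetical helper.

```python
import math

def estimate_minutes(n_items, batch_size, rpm, tokens_per_item, tpm):
    """Lower-bound wall-clock minutes given request and token rate limits."""
    n_requests = math.ceil(n_items / batch_size)
    minutes_by_requests = n_requests / rpm
    minutes_by_tokens = (n_items * tokens_per_item) / tpm
    # Whichever limit is hit first determines the pace
    return max(minutes_by_requests, minutes_by_tokens)

# 6K items, assumed 20 tokens each, free-tier limits
one_by_one = estimate_minutes(6000, 1, 100, 20, 100_000)
batched = estimate_minutes(6000, 100, 100, 20, 100_000)
print(one_by_one, batched)
```

Under these assumptions, sending one item per request is bound by the request limit (about an hour), while batches of 100 are bound by the token limit instead (on the order of minutes).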

all-MiniLM-L6-v2 (Local)

# Source: sentence-transformers documentation
from sentence_transformers import SentenceTransformer

# Load model once, reuse for all embeddings
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

def embed_minilm(texts: list[str]) -> list[list[float]]:
    """Embed texts using local MiniLM model."""
    # Returns numpy array, convert to list for DB storage
    embeddings = model.encode(
        texts,
        batch_size=128,  # Adjust based on GPU/CPU memory
        show_progress_bar=False,  # We use our own tqdm
        normalize_embeddings=True,  # For cosine similarity
    )
    return embeddings.tolist()

# Performance: roughly 1000 texts/sec on CPU, 5000/sec on GPU (hardware dependent)
# Model size: ~22M parameters, 384 dimensions

Batch Update Embeddings to Database

# Source: psycopg3 documentation + pgvector-python
import numpy as np
from pgvector.psycopg import register_vector
from psycopg import sql

def update_embeddings_batch(
    conn,
    column_name: str,
    id_embedding_pairs: list[tuple[int, list[float]]]
):
    """Batch update embedding column for multiple rows."""
    register_vector(conn)  # Lets psycopg send numpy arrays as pgvector values

    # Quote the column name as an identifier instead of formatting it into the SQL string
    query = sql.SQL("UPDATE line_item SET {} = %s WHERE id = %s").format(
        sql.Identifier(column_name)
    )
    with conn.cursor() as cur:
        # Use executemany for batch updates
        cur.executemany(
            query,
            [(np.array(emb), id_) for id_, emb in id_embedding_pairs]
        )
    conn.commit()

Synthetic Query Variation Generation

# Source: nlpaug documentation + custom
import nlpaug.augmenter.char as nac
import random

# German QWERTZ keyboard typo augmenter
typo_aug = nac.KeyboardAug(
    aug_char_p=0.1,  # 10% of characters
    aug_word_p=0.3,  # 30% of words
    include_numeric=False,
    lang='de',  # German QWERTZ layout
)

def generate_typo_variation(text: str) -> str:
    """Generate realistic keyboard typo variation."""
    return typo_aug.augment(text)[0]

def generate_word_reorder_variation(text: str) -> str:
    """Reorder words for description-like text."""
    words = text.split()
    if len(words) <= 2:
        return text
    # Shuffle middle words, keep first and last
    middle = words[1:-1]
    random.shuffle(middle)
    return ' '.join([words[0]] + middle + [words[-1]])

def generate_llm_paraphrase(text: str, client) -> str:
    """Use Gemini to paraphrase German accounting text."""
    response = client.models.generate_content(
        model='gemini-2.0-flash',
        contents=f"""Paraphrase this German invoice description in a different way
while keeping the same meaning. Only output the paraphrase, nothing else.

Original: {text}""",
    )
    return response.text.strip()

Embedding Model Metadata Schema

-- Add to init.sql or run as migration
CREATE TABLE embedding_model_config (
    id SERIAL PRIMARY KEY,
    model_name TEXT NOT NULL UNIQUE,
    column_name TEXT NOT NULL,      -- e.g., 'embedding_google'
    dimensions INT NOT NULL,
    distance_metric TEXT NOT NULL,  -- 'cosine', 'l2', 'inner_product'
    task_type TEXT,                 -- e.g., 'RETRIEVAL_DOCUMENT'
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Populate with our 3 models
INSERT INTO embedding_model_config (model_name, column_name, dimensions, distance_metric, task_type) VALUES
    ('text-multilingual-embedding-002', 'embedding_google', 768, 'cosine', 'RETRIEVAL_DOCUMENT'),
    ('jina-embeddings-v3', 'embedding_jina', 1024, 'cosine', 'retrieval.passage'),
    ('all-MiniLM-L6-v2', 'embedding_minilm', 384, 'cosine', NULL);

Test Query Variation Schema

-- Stores synthetic query variations for test set evaluation
CREATE TABLE test_query_variation (
    id BIGSERIAL PRIMARY KEY,
    line_item_id BIGINT NOT NULL REFERENCES line_item(id),
    variation_type TEXT NOT NULL,  -- 'typo', 'reorder', 'paraphrase'
    original_text TEXT NOT NULL,   -- The embedding text that was varied
    varied_text TEXT NOT NULL,     -- The synthetic query
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_test_query_variation_line_item ON test_query_variation(line_item_id);
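
Rows for this table could be assembled as in the sketch below, which also drops variations that come back empty or identical to the original (the word-reorder function, for example, returns short texts unchanged). The helper `build_variation_rows` and the toy generators are hypothetical; in practice the dictionary would map to the typo, reorder, and paraphrase functions from the previous section.

```python
def build_variation_rows(line_item_id, original_text, generators):
    """Return (line_item_id, variation_type, original_text, varied_text) tuples."""
    rows = []
    for variation_type, generate in generators.items():
        varied = generate(original_text).strip()
        # Skip variations that are empty or identical to the original
        if not varied or varied == original_text:
            continue
        rows.append((line_item_id, variation_type, original_text, varied))
    return rows

# Toy generators stand in for the real typo, reorder and paraphrase functions
generators = {
    "typo": lambda text: text.replace("u", "z", 1),
    "reorder": lambda text: " ".join(reversed(text.split())),
    "paraphrase": lambda text: text,  # Simulates a no-op result
}
rows = build_variation_rows(42, "Telekom | Mobilfunk Rechnung", generators)
for row in rows:
    print(row)
```

The tuple order matches the non-default columns of `test_query_variation`, so the rows can be passed directly to a parameterized INSERT.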

State of the Art

Old Approach	Current Approach	When Changed	Impact
google-generativeai library	google-genai SDK	2025	Unified SDK for Gemini API and Vertex AI
text-embedding-004	gemini-embedding-001	Jan 2026	text-embedding-004 deprecated, but text-multilingual-embedding-002 still supported
Fixed embedding dimensions	Matryoshka dimensions	2024-2025	Jina v3 and Gemini support flexible dimensions
IVFFlat indexes	HNSW indexes	2023	HNSW has better recall at same query speed

Deprecated/outdated:

google-generativeai package: Use google-genai instead
text-embedding-004: Deprecated Jan 2026, use gemini-embedding-001 or text-multilingual-embedding-002
IVFFlat indexes: HNSW generally preferred unless index build time is critical

Open Questions

Vertex AI vs Gemini API for embeddings
- What we know: text-multilingual-embedding-002 is available on Vertex AI but NOT on Gemini Developer API
- What's unclear: Whether project has Vertex AI access configured
- Recommendation: Use Vertex AI SDK with GOOGLE_GENAI_USE_VERTEXAI=True. Requires GCP project and service account or ADC.
Jina API tier and rate limits
- What we know: Free tier is 100 RPM, 100K TPM
- What's unclear: Whether project has API key, which tier
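- Recommendation: Get free API key from jina.ai, monitor rate limits. At 100 RPM, 6K items take about 1 hour only if sent one item per request; with batches of 100 texts that is 60 requests, so the token limit (100K TPM) rather than the request limit is likely to set the pace.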
Number of synthetic variations per test item
- What we know: Requirements say "synthetic query variations" (plural) for test set entries
- What's unclear: How many variations per item? 3? 5? 10?
- Recommendation: Start with 3 variations per test item (1 typo, 1 reorder, 1 paraphrase). Can increase later if needed.

Sources

Primary (HIGH confidence)

Google Vertex AI Embeddings Docs - text-multilingual-embedding-002 API, batch limits
google-genai Python SDK - embed_content method, authentication
Jina Embeddings API - REST API, rate limits, dimensions
sentence-transformers docs - MiniLM usage, dimensions
pgvector GitHub - HNSW index parameters, distance operators
sklearn train_test_split - stratify parameter

Secondary (MEDIUM confidence)

Jina embeddings-v3 announcement - 1024 default dimensions, Matryoshka support
nlpaug documentation - KeyboardAug with German layout
pgvector HNSW guide - m, ef_construction parameters

Tertiary (LOW confidence)

OdeNet/GermaNet for German synonyms - Not verified if compatible with nlpaug, may need custom integration

Metadata

Confidence breakdown:

Standard stack: HIGH - All libraries verified with official docs; tqdm and nlpaug still need to be added (see Installation)
Architecture: HIGH - Patterns from official examples and documentation
Pitfalls: HIGH - Based on official docs and common issues from multiple sources
Query variations: MEDIUM - German-specific augmentation approaches need validation

Research date: 2026-02-20
Valid until: 2026-03-20 (30 days - stable technologies, though Google may deprecate text-multilingual-embedding-002 soon)