Phase 2: Embedding Generation - Research

Researched: 2026-02-20
Domain: Text embeddings, vector databases, ML evaluation methodology
Confidence: HIGH

Summary

Phase 2 generates embeddings for ~6K line items using three models: Google text-multilingual-embedding-002 (768 dimensions, via Vertex AI SDK), Jina embeddings-v3 (1024 dimensions, via REST API), and all-MiniLM-L6-v2 (384 dimensions, local via sentence-transformers). The existing database schema already has pre-allocated vector columns with correct dimensions. Before embedding, the data must be split into 80% train and 20% test sets with a persistent is_test_set column to prevent data leakage.

Synthetic query variations for the test set require German-aware augmentation. Since the data is German accounting terminology (supplier names, line item descriptions), we need strategies that handle German compound words and accounting jargon. Options include: nlpaug with back-translation, German keyboard typo simulation, and word reordering for descriptions. Synonym replacement is challenging for German accounting terms, so LLM-based paraphrasing (using Gemini) is recommended for high-quality variations.

Primary recommendation: Add is_test_set boolean column before any embedding work, use stratified split by debit_account for balanced evaluation, store query variations in a separate test_query_variation table, and create HNSW indexes only after all embeddings are populated.

<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

Claude's Discretion

Deferred Ideas (OUT OF SCOPE)

None - discussion stayed within phase scope
</user_constraints>

<phase_requirements>

Phase Requirements

| ID | Description | Research Support |
|---|---|---|
| INFRA-04 | Pre-compute embeddings with Google text-multilingual-embedding-002 | Use google-genai SDK with Vertex AI. Model outputs 768 dimensions. Batch up to 250 texts per request. See Code Examples section. |
| INFRA-05 | Pre-compute embeddings with Jina embeddings | Use Jina REST API at api.jina.ai/v1/embeddings. embeddings-v3 outputs 1024 dimensions by default. No batch size limit but rate limited by tier. |
| INFRA-06 | Pre-compute embeddings with all-MiniLM-L6-v2 (local) | Use sentence-transformers SentenceTransformer class. Model outputs 384 dimensions. Batch encode is native and fast (~1000 texts/sec on CPU). |
| INFRA-07 | Store embedding model metadata alongside vectors | Create embedding_model_config table with model name, dimensions, distance metric. Store at config level per user decision. |
| EVAL-01 | Train/test split (80/20) with held-out test set | Add is_test_set boolean column to line_item. Use stratified split by debit_account for balanced evaluation. Split BEFORE embedding. |
| EVAL-02 | Generate synthetic query variations (synonyms, reordering, typos) | Create test_query_variation table. Use keyboard typos (nlpaug KeyboardAug), word reordering, and LLM paraphrasing for German accounting context. |
</phase_requirements>

Standard Stack

Core

| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| google-genai | 1.64+ | Google/Vertex AI embeddings | Official unified SDK for Gemini/Vertex AI, replaces deprecated google-generativeai |
| sentence-transformers | 5.2+ | Local embedding models | Standard library for transformer embeddings, supports Jina v3 with trust_remote_code |
| scikit-learn (imported as sklearn) | not pinned; a separate dependency, not installed by pandas | Train/test split | train_test_split with stratify parameter for balanced splits |
| requests | not pinned; third-party package, not part of the stdlib | Jina API calls | Simple HTTP client for the REST API |

Supporting

| Library | Version | Purpose | When to Use |
|---|---|---|---|
| numpy | 2.x | Vector operations | Embedding array manipulation before DB storage |
| tqdm | 4.x | Progress bars | Visual feedback during batch embedding (~6K items) |
| nlpaug | 1.1+ | Text augmentation | Keyboard typos, back-translation for German query variations |

Alternatives Considered

| Instead of | Could Use | Tradeoff |
|---|---|---|
| Jina REST API | sentence-transformers local | REST is simpler to set up but has rate limits; local requires downloading the 570M-parameter model |
| nlpaug keyboard | random character substitution | nlpaug simulates realistic QWERTZ keyboard typos |
| Gemini paraphrasing | nlpaug back-translation | Gemini is higher quality but costs API calls; back-translation is free |

Installation:

# Already in pyproject.toml from Phase 1
uv add tqdm nlpaug  # Add for Phase 2

Architecture Patterns

src/
├── embeddings/
│   ├── __init__.py
│   ├── google_embed.py     # Vertex AI embedding functions
│   ├── jina_embed.py       # Jina API embedding functions
│   ├── minilm_embed.py     # Local sentence-transformers
│   └── batch_processor.py  # Common batching logic with progress
├── evaluation/
│   ├── __init__.py
│   ├── train_test_split.py # Stratified splitting
│   └── query_variations.py # Synthetic query generation
├── db.py                   # Existing - add update functions
├── normalize.py            # Existing
└── import_csv.py           # Existing

Pattern 1: Stratified Train/Test Split

What: Split data ensuring test set has proportional representation of each debit_account
When to use: When class distribution matters for evaluation (it does for GL account prediction)
Example:

# Source: sklearn documentation
from sklearn.model_selection import train_test_split
import pandas as pd

def create_train_test_split(conn, test_size: float = 0.20, random_state: int = 42):
    """Add is_test_set column with stratified split by debit_account."""
    # Fetch all IDs and their debit_accounts
    df = pd.read_sql(
        "SELECT id, debit_account FROM line_item",
        conn
    )

    # Stratified split - ensures proportional debit_account distribution
    train_ids, test_ids = train_test_split(
        df['id'],
        test_size=test_size,
        random_state=random_state,
        stratify=df['debit_account']
    )

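    # Update database. tolist() converts numpy int64 values to plain Python
    # ints, which every psycopg version can adapt.
    with conn.cursor() as cur:
        cur.execute("UPDATE line_item SET is_test_set = FALSE")
        cur.execute(
            "UPDATE line_item SET is_test_set = TRUE WHERE id = ANY(%s)",
            (test_ids.tolist(),)
        )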
    conn.commit()

    return len(train_ids), len(test_ids)
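
One caveat with the stratified split: train_test_split raises a ValueError when the stratify column contains a class with only one member, which is plausible for rarely used debit accounts. The following is a minimal sketch of one workaround, assuming it is acceptable to keep such rows in the training set (a test item whose account never appears in training could not be matched anyway). The function name partition_for_stratify and the min_count parameter are hypothetical, and the sample accounts are made up.

```python
from collections import Counter

def partition_for_stratify(ids, labels, min_count=2):
    """Separate rows whose label is too rare to stratify on.

    Returns (split_ids, split_labels, train_only_ids). Rows listed in
    train_only_ids are assumed to stay in the training set.
    """
    counts = Counter(labels)
    split_ids, split_labels, train_only_ids = [], [], []
    for id_, label in zip(ids, labels):
        if counts[label] >= min_count:
            split_ids.append(id_)
            split_labels.append(label)
        else:
            train_only_ids.append(id_)
    return split_ids, split_labels, train_only_ids

# Illustrative data: account "6300" occurs only once
ids = [1, 2, 3, 4, 5, 6]
accounts = ["4400", "4400", "6815", "6300", "4400", "6815"]
split_ids, split_labels, train_only = partition_for_stratify(ids, accounts)
print(split_ids, train_only)
```

split_ids and split_labels would then be passed to train_test_split (with stratify=split_labels), and the IDs in train_only keep is_test_set = FALSE.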

Pattern 2: Batch Embedding with Progress

What: Process embeddings in batches with rate limiting and progress tracking
When to use: Any embedding operation on 1000+ texts
Example:

from tqdm import tqdm
import time

def batch_embed_with_progress(
    texts: list[str],
    embed_fn,
    batch_size: int = 100,
    delay_between_batches: float = 0.1
) -> list[list[float]]:
    """Generic batch embedding with progress bar."""
    all_embeddings = []

    for i in tqdm(range(0, len(texts), batch_size), desc="Embedding"):
        batch = texts[i:i + batch_size]
        embeddings = embed_fn(batch)
        all_embeddings.extend(embeddings)
        time.sleep(delay_between_batches)  # Rate limit respect

    return all_embeddings
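
To choose batch_size and delay_between_batches, it helps to work out how many requests a run needs and how far apart they must be to stay under a requests-per-minute limit. The sketch below does that arithmetic. plan_batches is a hypothetical helper, and the 100 RPM figure is the free-tier value quoted in this document, which should be verified against the provider's current limits.

```python
import math

def plan_batches(n_texts: int, batch_size: int, rpm_limit: int) -> tuple[int, float]:
    """Return (number of requests, minimum seconds between request starts)."""
    n_requests = math.ceil(n_texts / batch_size)
    min_interval = 60.0 / rpm_limit
    return n_requests, min_interval

# ~6K line items, batches of 100, assumed limit of 100 requests per minute
n_requests, min_interval = plan_batches(6000, 100, rpm_limit=100)
print(n_requests, min_interval)
```

Under these assumptions the run needs 60 requests spaced at least 0.6 seconds apart, so the default 0.1 second delay is only safe if each request itself takes about half a second or longer.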

Pattern 3: Embedding Text Preparation

What: Combine relevant fields into embedding-ready text
When to use: Before calling any embedding model
Recommendation: Combine supplier_name_normalized and description_normalized
Example:

def prepare_embedding_text(supplier_name: str, description: str) -> str:
    """Combine fields for embedding. Use normalized versions."""
    # Keep it simple: concatenate with separator
    # The embedding models handle context internally
    return f"{supplier_name} | {description}"

Anti-Patterns to Avoid

Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| Train/test split | Random selection | sklearn train_test_split with stratify | Handles edge cases, reproducible with random_state |
| Batch progress | Print statements | tqdm | Accurate ETA, handles terminals, minimal overhead |
| German keyboard typos | Character substitution | nlpaug KeyboardAug | QWERTZ layout aware, realistic typo patterns |
| Rate limiting | sleep() calls | Built-in retry with exponential backoff | Handles 429 responses properly |
| Vector serialization | Manual string formatting | pgvector register_vector | Type safety, proper encoding |

Key insight: Embedding generation involves API rate limits, batch processing, and progress tracking - all well-solved problems. Focus on the domain-specific query variation generation where custom logic adds value.

Common Pitfalls

Pitfall 1: Data Leakage in Train/Test Split

What goes wrong: Model sees test data during training, inflating accuracy metrics
Why it happens: Splitting after embedding, or using test set for model selection
How to avoid: Add is_test_set column FIRST, before any embedding. Filter on this column for all training operations.
Warning signs: Suspiciously high accuracy (>90% on semantic search)

Pitfall 2: Rate Limit Exhaustion

What goes wrong: 429 errors from Google/Jina APIs, incomplete embedding runs
Why it happens: Sending requests too fast, no backoff strategy
How to avoid: Batch appropriately (250 for Google, no limit for Jina but respect RPM), add delays between batches, implement retry with exponential backoff
Warning signs: Intermittent failures, partial data
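
A minimal sketch of retry with exponential backoff, assuming the wrapped embedding function raises an exception on a rate-limit response. call_with_backoff and its parameters are hypothetical names. In real use, retry_on would likely be narrowed to the specific error raised for HTTP 429 so that genuine client errors are not retried.

```python
import random
import time

def call_with_backoff(fn, *args, max_retries=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(*args), retrying with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn(*args)
        except retry_on:
            if attempt == max_retries:
                raise  # Out of retries: surface the last error
            # Delay doubles each attempt: base, 2*base, 4*base, ...
            delay = base_delay * (2 ** attempt)
            # Up to 10% jitter so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))

# Demo with a stand-in embed function that fails twice, then succeeds
calls = {"n": 0}

def flaky_embed(batch):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated 429")
    return [[0.0, 0.0, 0.0] for _ in batch]

delays = []  # Record delays instead of actually sleeping
result = call_with_backoff(flaky_embed, ["a", "b"], base_delay=0.5,
                           sleep=delays.append)
print(len(result), calls["n"], len(delays))
```

The sleep parameter exists only so the delay can be observed or skipped in tests; production code would leave it at time.sleep.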

Pitfall 3: Embedding Dimension Mismatch

What goes wrong: ERROR: expected N dimensions, not M when inserting
Why it happens: Model configuration differs from schema definition
How to avoid: Schema already correct from Phase 1 (768, 1024, 384). Verify model output dimensions match before batch insert.
Warning signs: First insert fails
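
The pre-insert check could look like the sketch below. The column names and dimensions are the ones this document uses for the embedding_model_config rows; check_dimensions and EXPECTED_DIMS are hypothetical names.

```python
# Dimensions per embedding column, as stated for the Phase 1 schema
EXPECTED_DIMS = {
    "embedding_google": 768,
    "embedding_jina": 1024,
    "embedding_minilm": 384,
}

def check_dimensions(column_name, embeddings):
    """Raise ValueError if any vector's length differs from the schema."""
    expected = EXPECTED_DIMS[column_name]
    for i, emb in enumerate(embeddings):
        if len(emb) != expected:
            raise ValueError(
                f"{column_name}: vector {i} has {len(emb)} dimensions, "
                f"expected {expected}"
            )
    return embeddings

# A correctly sized batch passes through unchanged
ok = check_dimensions("embedding_minilm", [[0.0] * 384, [0.1] * 384])
print(len(ok))
```

Running this on the first batch from each model, before any database write, turns a failed insert into an immediate and more descriptive error.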

Pitfall 4: Inconsistent Embedding Text

What goes wrong: Different models get different input text, invalid comparison
Why it happens: Ad-hoc text preparation in each embedding script
How to avoid: Single prepare_embedding_text() function used by all models
Warning signs: Different embedding column NULL patterns

Pitfall 5: HNSW Index Build During Population

What goes wrong: Each embedding insert triggers index update, 100x slower
Why it happens: Creating index before populating data
How to avoid: Comment out index creation in init.sql (already done). Create indexes AFTER all embeddings populated.
Warning signs: Embedding insertion takes hours instead of minutes
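
For the step that runs after population, the sketch below generates one HNSW index statement per embedding column using pgvector's cosine operator class. The index names are hypothetical and the HNSW build parameters are left at pgvector's defaults. If init.sql already holds commented-out definitions, those remain the authoritative version.

```python
def hnsw_index_statements(
    table="line_item",
    columns=("embedding_google", "embedding_jina", "embedding_minilm"),
):
    """Build CREATE INDEX statements to run once all embeddings are stored."""
    return [
        f"CREATE INDEX IF NOT EXISTS idx_{table}_{col}_hnsw "
        f"ON {table} USING hnsw ({col} vector_cosine_ops);"
        for col in columns
    ]

statements = hnsw_index_statements()
for stmt in statements:
    print(stmt)
```

vector_cosine_ops matches the 'cosine' distance metric recorded for all three models. A different metric would need the corresponding operator class.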

Pitfall 6: Synthetic Queries Too Clean

What goes wrong: Query variations don't represent real user typos/variations
Why it happens: Using only programmatic augmentation without domain knowledge
How to avoid: Mix approaches: keyboard typos for realism, word reordering for variation, LLM paraphrasing for semantic equivalents
Warning signs: All synthetic queries look similar, no realistic misspellings

Code Examples

Google text-multilingual-embedding-002 (Vertex AI)

# Source: Google Cloud Vertex AI documentation
import os
from google import genai
from google.genai.types import EmbedContentConfig

# Set up for Vertex AI
os.environ['GOOGLE_GENAI_USE_VERTEXAI'] = 'True'
os.environ['GOOGLE_CLOUD_PROJECT'] = 'your-project-id'
os.environ['GOOGLE_CLOUD_LOCATION'] = 'us-central1'  # or 'europe-west1'

client = genai.Client()

def embed_google(texts: list[str]) -> list[list[float]]:
    """Embed texts using text-multilingual-embedding-002 via Vertex AI."""
    response = client.models.embed_content(
        model='text-multilingual-embedding-002',
        contents=texts,
        config=EmbedContentConfig(
            task_type='RETRIEVAL_DOCUMENT',  # For stored documents
            # output_dimensionality=768,  # Default for this model
        ),
    )
    return [embedding.values for embedding in response.embeddings]

# Batch size: max 250 texts, 20,000 tokens per request

Jina embeddings-v3 (REST API)

# Source: Jina AI Embeddings API documentation
import os
import requests

JINA_API_KEY = os.environ.get('JINA_API_KEY')
JINA_API_URL = 'https://api.jina.ai/v1/embeddings'

def embed_jina(texts: list[str]) -> list[list[float]]:
    """Embed texts using Jina embeddings-v3 API."""
    response = requests.post(
        JINA_API_URL,
        headers={
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {JINA_API_KEY}',
        },
        json={
            'input': texts,
            'model': 'jina-embeddings-v3',
            'dimensions': 1024,  # Default, matches schema
            'task': 'retrieval.passage',  # For stored documents
        },
        timeout=30,  # Fail fast rather than hang on a stalled connection
    )
    response.raise_for_status()
    return [d['embedding'] for d in response.json()['data']]

# Rate limits: Free tier 100 RPM, Paid 500 RPM
# No batch size limit - batches internally

all-MiniLM-L6-v2 (Local)

# Source: sentence-transformers documentation
from sentence_transformers import SentenceTransformer

# Load model once, reuse for all embeddings
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

def embed_minilm(texts: list[str]) -> list[list[float]]:
    """Embed texts using local MiniLM model."""
    # Returns numpy array, convert to list for DB storage
    embeddings = model.encode(
        texts,
        batch_size=128,  # Adjust based on GPU/CPU memory
        show_progress_bar=False,  # We use our own tqdm
        normalize_embeddings=True,  # For cosine similarity
    )
    return embeddings.tolist()

# Performance: ~1000 texts/sec on CPU, ~5000/sec on GPU (hardware dependent)
# Model size: ~22M parameters (roughly 90 MB on disk), 384 dimensions

Batch Update Embeddings to Database

# Source: psycopg3 documentation + pgvector-python
import numpy as np
from pgvector.psycopg import register_vector
from psycopg import sql

def update_embeddings_batch(
    conn,
    column_name: str,
    id_embedding_pairs: list[tuple[int, list[float]]]
):
    """Batch update embedding column for multiple rows."""
    # Registers the vector type adapter on this connection
    register_vector(conn)

    # Quote the column as an identifier instead of interpolating raw text
    query = sql.SQL("UPDATE line_item SET {} = %s WHERE id = %s").format(
        sql.Identifier(column_name)
    )

    with conn.cursor() as cur:
        # Use executemany for batch updates
        cur.executemany(
            query,
            [(np.array(emb), id_) for id_, emb in id_embedding_pairs]
        )
    conn.commit()

Synthetic Query Variation Generation

# Source: nlpaug documentation + custom
import nlpaug.augmenter.char as nac
import random

# German QWERTZ keyboard typo augmenter
typo_aug = nac.KeyboardAug(
    aug_char_p=0.1,  # 10% of characters
    aug_word_p=0.3,  # 30% of words
    include_numeric=False,
    lang='de',  # German QWERTZ layout
)

def generate_typo_variation(text: str) -> str:
    """Generate realistic keyboard typo variation."""
    return typo_aug.augment(text)[0]

def generate_word_reorder_variation(text: str) -> str:
    """Reorder words for description-like text."""
    words = text.split()
    if len(words) <= 3:
        # With three words or fewer there is at most one middle word to shuffle
        return text
    # Shuffle middle words, keep first and last
    middle = words[1:-1]
    random.shuffle(middle)
    return ' '.join([words[0]] + middle + [words[-1]])

def generate_llm_paraphrase(text: str, client) -> str:
    """Use Gemini to paraphrase German accounting text."""
    response = client.models.generate_content(
        model='gemini-2.0-flash',
        contents=f"""Paraphrase this German invoice description in a different way
while keeping the same meaning. Only output the paraphrase, nothing else.

Original: {text}""",
    )
    return response.text.strip()

Embedding Model Metadata Schema

-- Add to init.sql or run as migration
CREATE TABLE embedding_model_config (
    id SERIAL PRIMARY KEY,
    model_name TEXT NOT NULL UNIQUE,
    column_name TEXT NOT NULL,      -- e.g., 'embedding_google'
    dimensions INT NOT NULL,
    distance_metric TEXT NOT NULL,  -- 'cosine', 'l2', 'inner_product'
    task_type TEXT,                 -- e.g., 'RETRIEVAL_DOCUMENT'
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Populate with our 3 models
INSERT INTO embedding_model_config (model_name, column_name, dimensions, distance_metric, task_type) VALUES
    ('text-multilingual-embedding-002', 'embedding_google', 768, 'cosine', 'RETRIEVAL_DOCUMENT'),
    ('jina-embeddings-v3', 'embedding_jina', 1024, 'cosine', 'retrieval.passage'),
    ('all-MiniLM-L6-v2', 'embedding_minilm', 384, 'cosine', NULL);
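
The distance_metric column matters because each metric corresponds to a different pgvector operator. The sketch below shows how a consumer of this table might turn a config row into a nearest-neighbour query, restricted to training rows so that test items never match themselves. build_knn_query and DISTANCE_OPERATORS are hypothetical names, and selecting debit_account is an assumption about what later evaluation needs.

```python
# pgvector distance operators for the metric names used in the config table
DISTANCE_OPERATORS = {
    "cosine": "<=>",
    "l2": "<->",
    "inner_product": "<#>",
}

def build_knn_query(column_name: str, distance_metric: str, k: int = 5) -> str:
    """Return a parameterized k-nearest-neighbour query over training rows."""
    if not column_name.isidentifier():
        raise ValueError(f"unexpected column name: {column_name!r}")
    op = DISTANCE_OPERATORS[distance_metric]
    return (
        "SELECT id, debit_account FROM line_item "
        f"WHERE is_test_set = FALSE AND {column_name} IS NOT NULL "
        f"ORDER BY {column_name} {op} %s LIMIT {int(k)}"
    )

query = build_knn_query("embedding_jina", "cosine", k=3)
print(query)
```

The single %s placeholder is where the query vector would be bound at execution time. The column name should only ever come from embedding_model_config, never from user input.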

Test Query Variation Schema

-- Stores synthetic query variations for test set evaluation
CREATE TABLE test_query_variation (
    id BIGSERIAL PRIMARY KEY,
    line_item_id BIGINT NOT NULL REFERENCES line_item(id),
    variation_type TEXT NOT NULL,  -- 'typo', 'reorder', 'paraphrase'
    original_text TEXT NOT NULL,   -- The embedding text that was varied
    varied_text TEXT NOT NULL,     -- The synthetic query
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_test_query_variation_line_item ON test_query_variation(line_item_id);
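
A sketch of how generated variations could be shaped into rows for this table. build_variation_rows is a hypothetical helper, and the two lambda generators are stand-ins for the typo, reorder, and paraphrase functions. The sample text is made up. Variations identical to the original are dropped, since they would only duplicate the unmodified query.

```python
INSERT_VARIATION_SQL = (
    "INSERT INTO test_query_variation "
    "(line_item_id, variation_type, original_text, varied_text) "
    "VALUES (%s, %s, %s, %s)"
)

def build_variation_rows(line_item_id, original_text, generators):
    """Apply each generator and return rows matching INSERT_VARIATION_SQL."""
    rows = []
    for variation_type, generate in generators.items():
        varied = generate(original_text)
        # Skip empty results and no-op variations
        if varied and varied != original_text:
            rows.append((line_item_id, variation_type, original_text, varied))
    return rows

# Stand-in generators: one changes the text, one returns it unchanged
generators = {
    "typo": lambda t: t.replace("e", "r", 1),
    "reorder": lambda t: t,
}
rows = build_variation_rows(42, "mueller gmbh | bueromaterial", generators)
print(rows)
```

The resulting rows can be passed to cursor.executemany together with INSERT_VARIATION_SQL, for test-set line items only.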

State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| google-generativeai library | google-genai SDK | 2025 | Unified SDK for Gemini API and Vertex AI |
| text-embedding-004 | gemini-embedding-001 | Jan 2026 | text-embedding-004 deprecated, but text-multilingual-embedding-002 still supported |
| Fixed embedding dimensions | Matryoshka dimensions | 2024-2025 | Jina v3 and Gemini support flexible dimensions |
| IVFFlat indexes | HNSW indexes | 2023 | HNSW has better recall at same query speed |

Deprecated/outdated:

Open Questions

  1. Vertex AI vs Gemini API for embeddings

  2. Jina API tier and rate limits

  3. Number of synthetic variations per test item

Sources

Primary (HIGH confidence)

Secondary (MEDIUM confidence)

Tertiary (LOW confidence)

Metadata

Confidence breakdown:

Research date: 2026-02-20
Valid until: 2026-03-20 (30 days - stable technologies, though Google may deprecate text-multilingual-embedding-002 soon)