Add train/test split infrastructure and embedding model metadata tracking.

Purpose: EVAL-01 requires clean train/test separation before any embedding work to prevent data leakage. INFRA-07 requires model metadata storage. Both must be done before embedding generation begins.

Output: Database schema additions (is_test_set column, embedding_model_config table), stratified split applied to existing 6078 line items, Python module for split operations.

<execution_context> @./.claude/get-shit-done/workflows/execute-plan.md @./.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @.planning/phases/02-embedding-generation/02-RESEARCH.md

Prior phase context

@.planning/phases/01-foundation/01-02-SUMMARY.md

Existing source files

@src/db.py @init.sql

Task 1: Add schema for train/test split and model metadata init.sql Update init.sql to add two new database objects:
  1. Add is_test_set column to line_item table:
ALTER TABLE line_item ADD COLUMN IF NOT EXISTS is_test_set BOOLEAN DEFAULT FALSE;
  1. Create embedding_model_config table:
CREATE TABLE IF NOT EXISTS embedding_model_config (
    id SERIAL PRIMARY KEY,
    model_name TEXT NOT NULL UNIQUE,
    column_name TEXT NOT NULL,
    dimensions INT NOT NULL,
    distance_metric TEXT NOT NULL,
    task_type TEXT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
  1. Insert the 3 model configurations:
INSERT INTO embedding_model_config (model_name, column_name, dimensions, distance_metric, task_type) VALUES
    ('text-multilingual-embedding-002', 'embedding_google', 768, 'cosine', 'RETRIEVAL_DOCUMENT'),
    ('jina-embeddings-v3', 'embedding_jina', 1024, 'cosine', 'retrieval.passage'),
    ('all-MiniLM-L6-v2', 'embedding_minilm', 384, 'cosine', NULL)
ON CONFLICT (model_name) DO NOTHING;

Place these statements AFTER the CREATE TABLE line_item statement but BEFORE any index creation comments.

Also create a migration script migrations/001_train_test_split.sql with just the ALTER TABLE and CREATE TABLE statements for running against existing database. Run docker compose exec -T postgres psql -U dev -d dev -c "\d line_item" and verify is_test_set column exists. Run docker compose exec -T postgres psql -U dev -d dev -c "SELECT * FROM embedding_model_config" and verify 3 rows.

  • is_test_set column exists in line_item table with BOOLEAN type
  • embedding_model_config table exists with 3 model rows
  • Migration script exists for applying to running database
Task 2: Create train/test split module src/evaluation/__init__.py, src/evaluation/train_test_split.py Create the evaluation package and train_test_split module:
  1. Create src/evaluation/__init__.py:
"""Evaluation utilities for semantic search comparison."""
from .train_test_split import create_train_test_split

__all__ = ["create_train_test_split"]
  1. Create src/evaluation/train_test_split.py:
"""Train/test split with stratification by debit_account."""
import pandas as pd
from sklearn.model_selection import train_test_split as sklearn_split

def create_train_test_split(
    conn,
    test_size: float = 0.20,
    random_state: int = 42
) -> tuple[int, int]:
    """
    Create stratified train/test split by debit_account.

    Marks ~20% of line items as test set with proportional
    representation of each debit account.

    Args:
        conn: psycopg connection
        test_size: Fraction of data for test set (default 0.20)
        random_state: Random seed for reproducibility (default 42)

    Returns:
        Tuple of (train_count, test_count)
    """
    # Fetch all IDs and their debit_accounts
    df = pd.read_sql(
        "SELECT id, debit_account FROM line_item",
        conn
    )

    # Stratified split - ensures proportional debit_account distribution
    train_ids, test_ids = sklearn_split(
        df['id'],
        test_size=test_size,
        random_state=random_state,
        stratify=df['debit_account']
    )

    # Update database
    with conn.cursor() as cur:
        # Reset all to train
        cur.execute("UPDATE line_item SET is_test_set = FALSE")
        # Mark test set
        cur.execute(
            "UPDATE line_item SET is_test_set = TRUE WHERE id = ANY(%s)",
            (list(test_ids),)
        )
    conn.commit()

    return len(train_ids), len(test_ids)


def get_split_stats(conn) -> dict:
    """Get current train/test split statistics."""
    with conn.cursor() as cur:
        cur.execute("""
            SELECT
                is_test_set,
                COUNT(*) as count,
                COUNT(DISTINCT debit_account) as unique_accounts
            FROM line_item
            GROUP BY is_test_set
            ORDER BY is_test_set
        """)
        rows = cur.fetchall()

    return {
        "train": {"count": rows[0][1], "unique_accounts": rows[0][2]} if rows else {},
        "test": {"count": rows[1][1], "unique_accounts": rows[1][2]} if len(rows) > 1 else {}
    }

Use sklearn's train_test_split with stratify parameter for balanced evaluation. Run uv run python -c "from src.evaluation import create_train_test_split; print('OK')" - should print OK.

  • src/evaluation/ package exists with init.py
  • train_test_split.py exports create_train_test_split function
  • Function uses sklearn stratified split by debit_account
Task 3: Apply schema migration and execute split None (execution only) Execute the migration and split:
  1. Apply migration to running database:
docker compose exec -T postgres psql -U dev -d dev < migrations/001_train_test_split.sql
  1. Run Python script to execute the split:
from src.db import get_connection
from src.evaluation import create_train_test_split, get_split_stats

conn = get_connection()
train_count, test_count = create_train_test_split(conn)
print(f"Split complete: {train_count} train, {test_count} test ({test_count/(train_count+test_count)*100:.1f}%)")

stats = get_split_stats(conn)
print(f"Train: {stats['train']['count']} items, {stats['train']['unique_accounts']} unique accounts")
print(f"Test: {stats['test']['count']} items, {stats['test']['unique_accounts']} unique accounts")
conn.close()

Expected output: ~4862 train (80%), ~1216 test (20%) Run docker compose exec -T postgres psql -U dev -d dev -c "SELECT is_test_set, COUNT(*) FROM line_item GROUP BY is_test_set"

Expected:

  • FALSE: ~4862 rows
  • TRUE: ~1216 rows
  • Migration applied to running database
  • Split executed with ~80/20 distribution
  • Both train and test sets have representation of debit accounts
After all tasks: 1. `\d line_item` shows is_test_set BOOLEAN column 2. `SELECT * FROM embedding_model_config` shows 3 model rows 3. `SELECT is_test_set, COUNT(*) FROM line_item GROUP BY is_test_set` shows ~80/20 split 4. Python import `from src.evaluation import create_train_test_split` succeeds

<success_criteria>

After completion, create `.planning/phases/02-embedding-generation/02-01-SUMMARY.md`