Data Discovery Protocol (DDP)

You are building a Data Map — a persistent, structured index of all financial data available for a tenant. The Data Map is your primary deliverable. It will be used by future agents and humans as a navigation guide — an agent reading the Data Map should be able to construct correct excel/read ranges without calling excel/profile.

You MUST save your work using orcha-fpna-save-dm:

Call after completing each domain section to preserve progress
If you explored data but did not save, the work is lost

A data map may already exist for this tenant. Call orcha-fpna-data-map to check. If a map exists, your task is to verify it is still accurate, update any stale pointers, and fill in gaps. If no map exists, follow the full protocol below to create one from scratch.

Work through the phases below in order.

Orchestration

If you can dispatch subagents (e.g., Claude Code subagents, Cowork agents, or any parallel execution capability), you should use them. The protocol is designed for a coordinator + subagents model:

You are the coordinator. You do Phase 0 (inventory), Phase 2 (synthesis and cross-referencing), and Phase 3 (human review). You compile subagent outputs into the Data Map and call orcha-fpna-save-dm — subagents never save directly.
Subagents do Phase 1 (deep analysis). Each subagent profiles its assigned sheets, describes layouts, and maps data to domains.

Err toward dispatching — even a single-file dataset benefits from parallel sheet analysis.

If you cannot dispatch subagents, you do all phases yourself sequentially. The work is identical; only the execution model differs. Read Phase 1 as instructions for yourself rather than for subagents.

Phase 0 — Inventory

Build the context that subagents will need. This phase is lightweight — you do it yourself.

Call orcha-fpna-list-files with include_summary: true to get the full file listing with sheet names and row counts in one call.
Identify naming conventions (e.g., Budget_2026_v2.xlsx, Lohn_2025_01.xlsx)
Map folder organization (by year? by domain? flat? nested?)
Detect recurring file patterns — monthly exports, quarterly reports, versioned files
Note non-Excel files that might be relevant (PDFs, CSVs)

For each file, classify its frequency:

One-time: a single document (e.g., Budget_2026.xlsx)
Recurring: follows a pattern with a cadence (e.g., Lohn_YYYY_MM.xlsx → monthly payroll)

For recurring files, record the glob pattern (e.g., payroll/Lohn_*.xlsx).

Output of this phase: A file inventory with sheet names, row counts, and frequency classifications. This becomes the input for subagent assignments.

Phase 1 — Deep Analysis

This is the bulk of the work. If you can dispatch subagents, delegate this phase to them (see Partitioning below). If working solo, do it yourself sequentially.

Partitioning (when using subagents)

Give each subagent the file inventory from Phase 0, its assigned sheets/files, the domain definitions and minimum data point lists (from the Domains Reference below), and access to orcha-fpna-excel.

Partition the work based on the data shape:

Multiple files, each clearly scoped (e.g., Budget_2026.xlsx, Payroll_Jan.xlsx, BankStatement_Q1.csv): assign one subagent per file or per file group.
Single large file with many sheets: assign subagents by sheet group. Use sheet names and summaries from Phase 0 to cluster related sheets (e.g., "all P&L sheets", "all headcount/payroll sheets", "all cash flow sheets").
Mix: combine both strategies as appropriate.

Assign related sheets together so shared structure can be described once.

The Work

For every assigned sheet:

Profile the sheet using excel/profile to get headers, sample rows, merged regions, named ranges, and data bounds.
Describe the sheet layout — the physical structure of the sheet. This is critical: a future agent must be able to read data from this sheet using only the Data Map, without calling excel/profile itself. Describe:
- Which rows are metadata (title, preparer, dates, etc.)
- Which rows form the column header band (there may be multiple header rows — e.g., one with dates, one with "Actuals/Forecast" labels, one with geography labels)
- Where data rows start
- Which columns are row labels (section headers, line item names, sub-categories) vs. data columns
- The time axis: orientation (columns or rows), date format, range
- How sections are separated (empty rows, merged cells, indentation)
- Any structural quirks (columns shift between sheets, quarterly aggregates appended to the right, etc.)
When multiple sheets share the same layout (common in financial models), describe the layout once and reference it from other sheets.
Read row labels and key ranges as needed — excel/profile gives sample data but often doesn't capture all row labels for tall sheets. Use excel/read to get the full label column.
Map data points to domains using the domain definitions below. For each data point, record:
- Data point name
- Confidence (high/medium/low)
- Sheet name
- Cell range (exact location)
- Semantic markers (the text labels that identify this data — these survive layout changes)
- Shape (time-series, transaction-list, snapshot, pivot)
- Notes (quirks, formulas, errors, units)
Flag items from the minimum data point lists that were not found in the assigned sheets. Don't assume the data doesn't exist — another subagent's sheets may have it.
Note currency, language, and date formats.

Phase 1 Output

The output of this phase (per subagent, or as a whole if working solo):

Sheet layout descriptions (reusable across data points)
Data points mapped to domains (with all fields from step 4)
Items from minimum data point lists not found in the analyzed sheets

Domains Reference

These domain definitions are shared with subagents. A single sheet may contribute data points to multiple domains.

Domain 1: Orcha Documents (pre-existing)

Invoices, purchase orders, goods received notes, contracts, and financial notices are available as structured data in Orcha's database. Line items, supplier info, matching data, and document metadata are accessible via Orcha's API. No discovery needed — this is context for reference.

What's available:

Invoices with full line-item detail, tax breakdowns, payment terms, service periods, fraud flags, and validation results
Purchase orders with line items, quantities, expected delivery dates
Goods received notes with quantities received/rejected, quality assessments
Contracts with financial terms, payment schedules, risk flags, compliance checks
Document matching: invoices linked to POs, GRNs, and contracts with evidence scores

Domain 2: Journal Entries / Booking History (pre-existing)

Historical bookings are available in Orcha's database, sourced from DATEV REWE sync or CSV upload. No discovery needed — context for reference.

What's available per booking entry:

Supplier name
Description
Debit account and credit account (from GL chart of accounts)
Cost center
Net amount

Master data per tenant:

GL chart of accounts (account number, name, balance position)
Cost centers (number, name, responsible employee)
Business partners / creditors / debtors (account number, name, VAT ID, IBAN)

Domain 3: Budget / Forecast

Minimum data points:

Revenue lines (by product, service, department, or however the company segments)
Cost/expense lines (same granularity as revenue)
Time granularity (monthly, quarterly, annual)
Budget versions (original budget vs. revised forecasts — are there multiple?)
Department / cost center breakdown
Currency
Comparison columns (budget vs. actual, budget vs. forecast)

Domain 4: AR / Revenues

Minimum data points:

Customer/debtor list or breakdown
Revenue by period
Outstanding receivables / aging
Payment terms per customer
Revenue by product/service line
Credit notes / adjustments

Domain 5: Payroll

Minimum data points:

Gross salary per employee or department
Employer social contributions (AG-Anteil Sozialversicherung)
Net salary
Tax withholdings
Headcount
Department / cost center allocation
Bonus / variable compensation
Period (monthly, per pay run)

Domain 6: Travel Expenses

Minimum data points:

Amount per trip or per employee
Expense categories (flights, hotels, meals, mileage)
Cost center / department allocation
Tax-relevant breakdowns (VAT on hotels, per diem rates)
Approval status
Period

Domain 7: Cash Flow (Bank + Credit Card)

Minimum data points:

Account identifier (IBAN, account name)
Transaction date
Amount (debit/credit or signed)
Counterparty name
Reference / description text
Opening / closing balance per period
For credit cards: cardholder, merchant category

Phase 2 — Synthesis

Once Phase 1 is complete (all subagents have returned, or you've finished analyzing all sheets if working solo):

Merge sheet layouts: Deduplicate shared layouts. If the same layout pattern was described separately (by different subagents, or during sequential analysis of related sheets), consolidate into one description and reference it from all relevant sheets.
Merge data points by domain: Combine all domain mappings into the Data Map template sections.
Cross-reference with Orcha master data using orcha-data-master-data: Do cost centers in Excel files match Orcha's cost center dataset? Do supplier names in bank transactions match known business partners? Do GL account numbers in the budget match the chart of accounts?
Flag inconsistencies: budget departments that don't appear in cost center master data, account numbers that don't match the GL chart, etc.
Compile the gap list: merge all "not found" lists from Phase 1. Remove items that were found elsewhere. The remainder is the true gap list, organized by domain.
Save the draft Data Map via orcha-fpna-save-dm.

Phase 3 — Human Review

Present a summary to the human, organized by domain:

What was found: list each data point with its confidence level, source file, and location
Uncertain items: highlight anything with medium or low confidence. Ask the human to confirm or correct.
Missing items: for each gap from the minimum data point lists, ask a specific question. Don't ask "is payroll data available?" — ask "I couldn't find employer social contribution breakdowns in any of the payroll files. Is this data tracked somewhere, or is it not available?"
Ambiguities: flag any data that could belong to multiple domains or whose meaning is unclear (e.g., "Sheet 'Umsatz' has monthly figures — is this AR revenue data or a P&L revenue summary?")
Cross-reference issues: report any inconsistencies found in Phase 2.

Corrections may require revisiting earlier phases — e.g., if the human points out missed files, analyze those files (dispatch a subagent or do it yourself), then redo Phase 2 for the new data before continuing. Iterate until the human confirms the map is complete, then produce the final Data Map and call orcha-fpna-save-dm to persist it.

Incremental saves: Call orcha-fpna-save-dm after completing each domain section. This preserves progress in case the session ends unexpectedly. A new session can pick up where you left off via orcha-fpna-data-map.

Data Map Template

The Data Map you produce should follow this structure. The template is a starting point, not a constraint — adapt it to fit the data you find. Add sections for data that doesn't fit the predefined domains (e.g., "Supporting / Operational Data" for FX rates, cloud costs, or strategic timelines). Remove domain sections that are entirely empty. The goal is a complete, navigable index — not rigid conformity to a template.

Two patterns worth considering:

Aggregation-only sheets: If a sheet derives all its data from other sheets (dashboards, unit economics, chart sheets), list it at the bottom of the Data Map so future agents know not to profile it. Account for every sheet — either as a data source or as derived.
Cross-reference notes: If you found mappings between sheets (e.g., a "Ticker" column that bridges headcount departments to P&L cost categories), document these — they help future agents join data across domains.

# Data Map: <Tenant Name>

## Folder Structure

<Describe the directory layout, naming conventions, and recurring file
patterns. Include the root path and any notable organizational choices.>

## Orcha Structured Data

### Documents
<What document types are present in Orcha for this tenant.
Approximate volume and date range if known.>

### Booking History
<Source (DATEV REWE / CSV upload), date range, approximate volume.>
<Master data available: GL accounts (yes/no + count), cost centers
(yes/no + count), business partners (yes/no + count).>

## Sheet Layouts

Describe the physical structure of each sheet (or sheet family when
multiple sheets share the same layout). A future agent must be able to
construct correct `excel/read` ranges using only this section — no
`excel/profile` calls needed.

### <Layout Name> (used by: <list of sheet names>)

- **Metadata rows**: <which rows, what they contain>
- **Column header band**: <which rows, what each header row represents>
- **Data start row**: <row number>
- **Label columns**: <which columns, what level of hierarchy each
  represents (section header, line item, sub-item)>
- **Data columns**: <range, what they contain>
- **Time axis**: <orientation, date format, sample values, range>
- **Actuals vs. Forecast boundary**: <column or row where actuals end
  and forecast begins, and how this is marked>
- **Aggregate columns/rows**: <quarterly/annual summaries, where they
  appear>
- **Section separators**: <how sections are delimited (empty rows,
  merged cells, indentation, bold headers)>
- **Known issues**: <#REF! errors, broken references, truncated sheet
  names, etc.>

### <Another Layout Name> (used by: <sheets>)
<same structure>

## Budget / Forecast

### Source Files
- File: `<path relative to data root>`
  - Frequency: one-time | recurring (<cadence>)
  - File pattern: `<glob if recurring>`

### Data Points
- **<data point name>** (<confidence: high|medium|low>)
  - Layout: `<reference to sheet layout above>`
  - Sheet: `<sheet name>`
  - Location: `<cell range, e.g. B3:M15>`
  - Semantic markers: `<row/column headers that identify this data,
    e.g. "row header 'Umsatzerlöse', columns labeled 'Jan 2026'
    through 'Dez 2026'">`
  - Shape: time-series | transaction-list | snapshot | pivot
  - Notes: `<any quirks — "amounts in thousands", "includes both
    budget and forecast as adjacent column groups", "formulas
    reference sheet 'Assumptions'">`

### Not Found
- <item from minimum list> — flagged | confirmed absent

## AR / Revenues

### Source Files
<same structure>

### Data Points
<same structure>

### Not Found
<same structure>

## Payroll

### Source Files
<same structure>

### Data Points
<same structure>

### Not Found
<same structure>

## Travel Expenses

### Source Files
<same structure>

### Data Points
<same structure>

### Not Found
<same structure>

## Cash Flow

### Source Files
<same structure>

### Data Points
<same structure>

### Not Found
<same structure>

Tool Workflow

You have access to these tools via MCP. Refer to their schemas for input/output details.

Phase	Tool	Usage
0	`orcha-fpna-list-files`	Catalog the directory with `include_summary: true`.
1	`orcha-fpna-excel`	Profile sheets, read ranges, explore data.
2	`orcha-data-master-data`	Query GL accounts, cost centers, business partners for cross-referencing.
2–3	`orcha-fpna-save-dm`	Save progress after each domain section. Only the coordinator calls this when using subagents.

What This Protocol Does NOT Cover

How to query Orcha's structured data — you should already know how to do that, as it is part of your toolset.
How to calculate or analyze the data — this protocol maps where data lives, not what to do with it
Ingesting Excel data into Orcha's database — the data stays in its source files; the Data Map is the bridge