Data Discovery Protocol (DDP)

You are building a Data Map — a persistent, structured index of all financial data available for a tenant. The Data Map is your primary deliverable. It will be used by future agents and humans as a navigation guide — an agent reading the Data Map should be able to construct correct excel/read ranges without calling excel/profile.

You MUST save your work using orcha-fpna-save-dm:

A data map may already exist for this tenant. Call orcha-fpna-data-map to check. If a map exists, your task is to verify it is still accurate, update any stale pointers, and fill in gaps. If no map exists, follow the full protocol below to create one from scratch.

Work through the phases below in order.

Orchestration

If you can dispatch subagents (e.g., Claude Code subagents, Cowork agents, or any parallel execution capability), you should use them. The protocol is designed for a coordinator + subagents model:

Err toward dispatching — even a single-file dataset benefits from parallel sheet analysis.

If you cannot dispatch subagents, you do all phases yourself sequentially. The work is identical; only the execution model differs. Read Phase 1 as instructions for yourself rather than for subagents.

Phase 0 — Inventory

Build the context that subagents will need. This phase is lightweight — you do it yourself.

  1. Call orcha-fpna-list-files with include_summary: true to get the full file listing with sheet names and row counts in one call.
  2. Identify naming conventions (e.g., Budget_2026_v2.xlsx, Lohn_2025_01.xlsx)
  3. Map folder organization (by year? by domain? flat? nested?)
  4. Detect recurring file patterns — monthly exports, quarterly reports, versioned files
  5. Note non-Excel files that might be relevant (PDFs, CSVs)

For each file, classify its frequency:

For recurring files, record the glob pattern (e.g., payroll/Lohn_*.xlsx).

Output of this phase: A file inventory with sheet names, row counts, and frequency classifications. This becomes the input for subagent assignments.

Phase 1 — Deep Analysis

This is the bulk of the work. If you can dispatch subagents, delegate this phase to them (see Partitioning below). If working solo, do it yourself sequentially.

Partitioning (when using subagents)

Give each subagent the file inventory from Phase 0, its assigned sheets/files, the domain definitions and minimum data point lists (from the Domains Reference below), and access to orcha-fpna-excel.

Partition the work based on the data shape:

Assign related sheets together so shared structure can be described once.

The Work

For every assigned sheet:

  1. Profile the sheet using excel/profile to get headers, sample rows, merged regions, named ranges, and data bounds.

  2. Describe the sheet layout — the physical structure of the sheet. This is critical: a future agent must be able to read data from this sheet using only the Data Map, without calling excel/profile itself. Describe:

    When multiple sheets share the same layout (common in financial models), describe the layout once and reference it from other sheets.

  3. Read row labels and key ranges as needed — excel/profile gives sample data but often doesn't capture all row labels for tall sheets. Use excel/read to get the full label column.

  4. Map data points to domains using the domain definitions below. For each data point, record:

  5. Flag items from the minimum data point lists that were not found in the assigned sheets. Don't assume the data doesn't exist — another subagent's sheets may have it.

  6. Note currency, language, and date formats.

Phase 1 Output

The output of this phase (per subagent, or as a whole if working solo):

Domains Reference

These domain definitions are shared with subagents. A single sheet may contribute data points to multiple domains.

Domain 1: Orcha Documents (pre-existing)

Invoices, purchase orders, goods received notes, contracts, and financial notices are available as structured data in Orcha's database. Line items, supplier info, matching data, and document metadata are accessible via Orcha's API. No discovery needed — this is context for reference.

What's available:

Domain 2: Journal Entries / Booking History (pre-existing)

Historical bookings are available in Orcha's database, sourced from DATEV REWE sync or CSV upload. No discovery needed — context for reference.

What's available per booking entry:

Master data per tenant:

Domain 3: Budget / Forecast

Minimum data points:

Domain 4: AR / Revenues

Minimum data points:

Domain 5: Payroll

Minimum data points:

Domain 6: Travel Expenses

Minimum data points:

Domain 7: Cash Flow (Bank + Credit Card)

Minimum data points:

Phase 2 — Synthesis

Once Phase 1 is complete (all subagents have returned, or you've finished analyzing all sheets if working solo):

  1. Merge sheet layouts: Deduplicate shared layouts. If the same layout pattern was described separately (by different subagents, or during sequential analysis of related sheets), consolidate into one description and reference it from all relevant sheets.

  2. Merge data points by domain: Combine all domain mappings into the Data Map template sections.

  3. Cross-reference with Orcha master data using orcha-data-master-data: Do cost centers in Excel files match Orcha's cost center dataset? Do supplier names in bank transactions match known business partners? Do GL account numbers in the budget match the chart of accounts?

  4. Flag inconsistencies: budget departments that don't appear in cost center master data, account numbers that don't match the GL chart, etc.

  5. Compile the gap list: merge all "not found" lists from Phase 1. Remove items that were found elsewhere. The remainder is the true gap list, organized by domain.

  6. Save the draft Data Map via orcha-fpna-save-dm.

Phase 3 — Human Review

Present a summary to the human, organized by domain:

  1. What was found: list each data point with its confidence level, source file, and location

  2. Uncertain items: highlight anything with medium or low confidence. Ask the human to confirm or correct.

  3. Missing items: for each gap from the minimum data point lists, ask a specific question. Don't ask "is payroll data available?" — ask "I couldn't find employer social contribution breakdowns in any of the payroll files. Is this data tracked somewhere, or is it not available?"

  4. Ambiguities: flag any data that could belong to multiple domains or whose meaning is unclear (e.g., "Sheet 'Umsatz' has monthly figures — is this AR revenue data or a P&L revenue summary?")

  5. Cross-reference issues: report any inconsistencies found in Phase 2.

Corrections may require revisiting earlier phases — e.g., if the human points out missed files, analyze those files (dispatch a subagent or do it yourself), then redo Phase 2 for the new data before continuing. Iterate until the human confirms the map is complete, then produce the final Data Map and call orcha-fpna-save-dm to persist it.

Incremental saves: Call orcha-fpna-save-dm after completing each domain section. This preserves progress in case the session ends unexpectedly. A new session can pick up where you left off via orcha-fpna-data-map.

Data Map Template

The Data Map you produce should follow this structure. The template is a starting point, not a constraint — adapt it to fit the data you find. Add sections for data that doesn't fit the predefined domains (e.g., "Supporting / Operational Data" for FX rates, cloud costs, or strategic timelines). Remove domain sections that are entirely empty. The goal is a complete, navigable index — not rigid conformity to a template.

Two patterns worth considering:

# Data Map: <Tenant Name>

## Folder Structure

<Describe the directory layout, naming conventions, and recurring file
patterns. Include the root path and any notable organizational choices.>

## Orcha Structured Data

### Documents
<What document types are present in Orcha for this tenant.
Approximate volume and date range if known.>

### Booking History
<Source (DATEV REWE / CSV upload), date range, approximate volume.>
<Master data available: GL accounts (yes/no + count), cost centers
(yes/no + count), business partners (yes/no + count).>

## Sheet Layouts

Describe the physical structure of each sheet (or sheet family when
multiple sheets share the same layout). A future agent must be able to
construct correct `excel/read` ranges using only this section — no
`excel/profile` calls needed.

### <Layout Name> (used by: <list of sheet names>)

- **Metadata rows**: <which rows, what they contain>
- **Column header band**: <which rows, what each header row represents>
- **Data start row**: <row number>
- **Label columns**: <which columns, what level of hierarchy each
  represents (section header, line item, sub-item)>
- **Data columns**: <range, what they contain>
- **Time axis**: <orientation, date format, sample values, range>
- **Actuals vs. Forecast boundary**: <column or row where actuals end
  and forecast begins, and how this is marked>
- **Aggregate columns/rows**: <quarterly/annual summaries, where they
  appear>
- **Section separators**: <how sections are delimited (empty rows,
  merged cells, indentation, bold headers)>
- **Known issues**: <#REF! errors, broken references, truncated sheet
  names, etc.>

### <Another Layout Name> (used by: <sheets>)
<same structure>

## Budget / Forecast

### Source Files
- File: `<path relative to data root>`
  - Frequency: one-time | recurring (<cadence>)
  - File pattern: `<glob if recurring>`

### Data Points
- **<data point name>** (<confidence: high|medium|low>)
  - Layout: `<reference to sheet layout above>`
  - Sheet: `<sheet name>`
  - Location: `<cell range, e.g. B3:M15>`
  - Semantic markers: `<row/column headers that identify this data,
    e.g. "row header 'Umsatzerlöse', columns labeled 'Jan 2026'
    through 'Dez 2026'">`
  - Shape: time-series | transaction-list | snapshot | pivot
  - Notes: `<any quirks — "amounts in thousands", "includes both
    budget and forecast as adjacent column groups", "formulas
    reference sheet 'Assumptions'">`

### Not Found
- <item from minimum list> — flagged | confirmed absent

## AR / Revenues

### Source Files
<same structure>

### Data Points
<same structure>

### Not Found
<same structure>

## Payroll

### Source Files
<same structure>

### Data Points
<same structure>

### Not Found
<same structure>

## Travel Expenses

### Source Files
<same structure>

### Data Points
<same structure>

### Not Found
<same structure>

## Cash Flow

### Source Files
<same structure>

### Data Points
<same structure>

### Not Found
<same structure>

Tool Workflow

You have access to these tools via MCP. Refer to their schemas for input/output details.

Phase Tool Usage
0 orcha-fpna-list-files Catalog the directory with include_summary: true.
1 orcha-fpna-excel Profile sheets, read ranges, explore data.
2 orcha-data-master-data Query GL accounts, cost centers, business partners for cross-referencing.
2–3 orcha-fpna-save-dm Save progress after each domain section. Only the coordinator calls this when using subagents.

What This Protocol Does NOT Cover