Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

orcha-fpna-excel Tool Design

MCP tool for programmatic Excel file analysis via a sandboxed Clojure DSL. Part of the Data Discovery Protocol (Phase 1 — File-by-File Analysis).

Problem

The DDP agent needs to inspect Excel files deeply: read cells/ranges, understand sheet structure, detect merged cells, find named ranges, check formulas and number formats. orcha-fpna-list-files with include_summary gives a surface overview (sheet names, headers, row counts), but the agent needs full programmatic access to handle the variety of financial spreadsheet layouts it will encounter.

Architecture

Execution Boundary

evaluate-excel(code: String, file-bytes: byte[]) -> String

The boundary takes a Clojure code string and raw file bytes, and returns the evaluation result as an EDN string. This signature works identically whether the implementation is in-process or out-of-process.

Serialization

Results are serialized to EDN via pr-str. EDN is Clojure's native data format — vectors, maps, keywords, nil all round-trip faithfully. The agent receives a string that directly represents the Clojure data structure returned by its code. For v2, the native binary does the same: pr-str to stdout.
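
To make the round-trip claim concrete, here is a minimal sketch using only clojure.core and clojure.edn. The result map is made-up data standing in for whatever the agent's code returns.

```clojure
(require '[clojure.edn :as edn])

;; Illustrative result, shaped like something agent code might return.
(def result {:sheets ["Sheet1" "Sheet2"] :total 42 :missing nil})

;; Serialize the way the boundary is described to: pr-str.
;; Produces: {:sheets ["Sheet1" "Sheet2"], :total 42, :missing nil}
(def edn-string (pr-str result))

;; A consumer can parse the string back into an equal data structure.
(def round-tripped (edn/read-string edn-string))
```

Values that have no EDN representation (for example arbitrary Java objects) would not round-trip this way, so agent code is best kept to plain vectors, maps, keywords, strings, numbers, and nil.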

Error Handling

The boundary always returns an EDN string — either the serialized result or a serialized error map. Two error categories:

  1. Evaluation errors (syntax error, undefined symbol, type error): SCI throws with location info. Returned as:

    {:error {:type :eval :message "..." :line 3 :column 12}}
    
  2. Infrastructure errors (timeout, unparseable file, OOM in v2): caught by the host. Returned as:

    {:error {:type :timeout}}
    {:error {:type :parse :message "Not a valid Excel file"}}
    

The MCP handler wraps the EDN string in the standard {:content [{:type "text" :text <edn-string>}]} response.
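
The "always returns an EDN string" contract can be sketched as follows. This is an illustration, not the actual handler: eval->edn and mcp-response are hypothetical names, thunk stands in for the SCI evaluation step, and the sketch assumes the evaluator attaches :line and :column to the exception's ex-data.

```clojure
(defn eval->edn
  "Runs thunk and returns an EDN string: the printed result on success,
  or a printed error map if it throws. thunk is a stand-in for the
  sandboxed evaluation."
  [thunk]
  (try
    (pr-str (thunk))
    (catch Exception e
      ;; Assumption: location info, when present, lives in ex-data.
      (pr-str {:error (merge {:type :eval :message (ex-message e)}
                             (select-keys (ex-data e) [:line :column]))}))))

(defn mcp-response
  "Wraps an EDN string in the text-content envelope described above."
  [edn-string]
  {:content [{:type "text" :text edn-string}]})
```

For example, (mcp-response (eval->edn #(+ 1 2))) yields {:content [{:type "text" :text "3"}]}, and a thunk that throws produces the same envelope with an {:error ...} map as its text.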

V1: In-Process (SCI on the MCP Server JVM)

MCP Server (JVM)
  orcha-fpna-excel handler
    1. Resolve legal entity, get FileStore
    2. Read file via FileStore -> byte[]
    3. Pass (code, file-bytes) to evaluate-excel
       a. Load workbook from bytes via POI
       b. Build SCI context (workbook bound to custom fns)
       c. Evaluate code with 30s thread deadline
       d. pr-str the result (or error map) to EDN string
       e. Close workbook
    4. Return EDN string to agent

SCI runs on a dedicated thread with a deadline. If the thread exceeds the timeout, it is interrupted and the result is an error map, also serialized to EDN.
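
One way to sketch the deadline in plain Clojure is a future plus a timed deref. This is an assumption about the shape of the mechanism, not the actual implementation: eval-with-deadline is a hypothetical name, thunk stands in for the SCI evaluation, and future runs on Clojure's shared thread pool rather than a truly dedicated thread.

```clojure
(defn eval-with-deadline
  "Runs thunk on another thread and waits at most timeout-ms for its value.
  On timeout, cancels the future (which interrupts its thread) and returns
  the timeout error map."
  [thunk timeout-ms]
  (let [fut    (future (thunk))
        result (deref fut timeout-ms ::timed-out)]
    (if (= result ::timed-out)
      (do (future-cancel fut)
          {:error {:type :timeout}})
      result)))
```

Thread interruption on the JVM is cooperative: blocking calls notice it, but a purely CPU-bound computation can keep running after the caller has already received the timeout error. That limitation is one reason a per-process boundary with a hard kill is attractive.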

V2: Out-of-Process (GraalVM Native Binary Per Execution)

MCP Server (JVM)
  orcha-fpna-excel handler
    1. Resolve legal entity, get FileStore
    2. Read file via FileStore -> byte[]
    3. Spawn native binary, pipe file-bytes via stdin, code via CLI arg
    4. Read EDN string from stdout, errors from stderr
    5. SIGKILL after timeout if process hasn't exited
    6. Return EDN string to agent

Native binary
    1. Read file-bytes from stdin
    2. Load workbook from bytes via POI
    3. Build SCI context (workbook bound to custom fns)
    4. Evaluate code
    5. pr-str the result (or error map) to stdout
    6. Close workbook, exit

Each invocation is a fresh process with no shared state.
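
The host-side steps can be sketched with java.lang.ProcessBuilder. This is a hedged illustration: run-evaluator is a hypothetical name, the command vector is supplied by the caller (the real binary name and argument layout are not specified here), and stderr handling is omitted.

```clojure
(import '(java.util.concurrent TimeUnit))

(defn run-evaluator
  "Starts cmd (a vector of strings), feeds input-bytes to its stdin, and
  waits up to timeout-ms for it to exit. Returns the process's stdout as a
  string, or a printed timeout error map after force-killing it."
  [cmd ^bytes input-bytes timeout-ms]
  (let [^Process proc (.start (ProcessBuilder. ^java.util.List cmd))
        ;; Write stdin and drain stdout on separate threads so that neither
        ;; pipe can fill up and stall the child while we wait on it.
        writer (future (with-open [out (.getOutputStream proc)]
                         (.write out input-bytes)))
        stdout (future (slurp (.getInputStream proc)))]
    (if (.waitFor proc (long timeout-ms) TimeUnit/MILLISECONDS)
      @stdout
      (do (.destroyForcibly proc)   ; forcible kill once the deadline passes
          (future-cancel writer)
          (pr-str {:error {:type :timeout}})))))
```

On a Unix-like system, (run-evaluator ["cat"] (.getBytes "hello") 5000) returns "hello", which makes cat a convenient stand-in while the native binary does not exist yet. A production version would also drain or redirect stderr so a chatty child cannot block on it.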

POI has been confirmed to work inside GraalVM native-image with appropriate reflection configuration.

Security Model

V1 (In-Process)

V2 (Out-of-Process) — Adds:

DDoS Mitigation (V2 Only)

In v2, each tool call spawns a native process that could run for up to the timeout duration. An attacker with a valid token (or an agent manipulated via prompt injection through crafted Excel content) could attempt to exhaust resources by firing many concurrent evaluations.

Mitigations:

  1. Global concurrency cap: semaphore limiting concurrent evaluations (e.g., 4-8). Requests beyond the cap queue briefly or reject with a retry-later error.
  2. Per-tenant rate limit: e.g., 10 eval calls per identity per minute. The DDP workflow is sequential, so this is generous for legitimate use.
  3. Per-execution resource limits: bounded memory and CPU per process.
  4. Timeout: hard deadline per evaluation, non-negotiable.
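
The global concurrency cap (mitigation 1) could look roughly like this, using java.util.concurrent.Semaphore in its reject-immediately variant. The names with-eval-permit and the :busy error type are hypothetical, and the cap of 4 is just the low end of the range mentioned above.

```clojure
(import '(java.util.concurrent Semaphore))

(def max-concurrent 4)                        ; illustrative value
(def eval-permits (Semaphore. max-concurrent))

(defn with-eval-permit
  "Runs thunk if a permit is free; otherwise returns a retry-later error
  map without running it. The permit is always released afterwards."
  [^Semaphore permits thunk]
  (if (.tryAcquire permits)
    (try
      (thunk)
      (finally (.release permits)))
    {:error {:type :busy
             :message "Too many concurrent evaluations, retry later"}}))
```

To queue briefly instead of rejecting, the timed overload of tryAcquire (timeout plus TimeUnit) could replace the no-argument call.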

Excel Library

Apache POI directly. No docjure — it doesn't expose formulas, merged cells, named ranges, or cell format strings, all of which the DDP requires.

Custom functions use POI's ss.usermodel interfaces:

Custom Function API (excel namespace)

All functions operate implicitly on the workbook loaded from the provided file bytes. The SCI code never sees the workbook object directly.

excel/summary

(excel/summary)

Returns a map from each sheet name to that sheet's metadata: headers, row count, and column count.

{"Sheet1" {:headers ["Col A" "Col B" ...] :row-count 150 :column-count 8}
 "Sheet2" {:headers [...] :row-count 42 :column-count 5}}

excel/sheets

(excel/sheets)

Returns a vector of sheet names.

["Sheet1" "Sheet2" "Assumptions"]

excel/read

(excel/read range)
(excel/read range opts)

Reads cells. Range uses standard Excel notation: "A1", "A1:D10", "Sheet1!A1:D10". Always returns a 2D vector (vector of row vectors), even for a single cell.

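Options: pass a map as the second argument. With {:metadata? true}, each cell is returned as a map with :value, :formula, and :format instead of a bare value.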

(excel/read "A1")                       ;=> [[42]]
(excel/read "A1:C2")                    ;=> [[1 2 3] [4 5 6]]
(excel/read "Sheet2!B3:D5")             ;=> [[...] [...] [...]]
(excel/read "A1" {:metadata? true})
;=> [[{:value 42 :formula "SUM(B1:B10)" :format "#,##0.00"}]]
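
Because excel/read always returns a vector of row vectors, agent code will often want to reshape it. The sketch below turns a header row plus data rows into maps, using only functions from the allowlist. The rows value is made-up data standing in for the result of an excel/read call, and representing blank cells as nil is an assumption.

```clojure
;; Stand-in for something like (excel/read "Sheet1!A1:C4"); values are invented.
(def rows
  [["Month" "Revenue" "Costs"]
   ["Jan"   1200.0    800.0]
   ["Feb"   1350.0    790.0]
   [nil     nil       nil]])

(defn rows->records
  "Treats the first row as headers and returns one map per non-blank data row."
  [[headers & data]]
  (->> data
       (filter #(not (every? nil? %)))   ; drop rows where every cell is nil
       (mapv #(zipmap headers %))))
```

Here (rows->records rows) yields two maps, one for Jan and one for Feb, each keyed by the header strings.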

excel/merged-regions

(excel/merged-regions sheet-name)

Returns merged cell ranges for a sheet. Ranges only, no values — use excel/read to get values if needed.

(excel/merged-regions "Sheet1")
;=> [{:range "B1:F1"} {:range "A3:A8"}]
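
As an illustration of what agent code might do with these ranges, the sketch below classifies each merged region by orientation using only allowlisted functions. merge-orientation is a hypothetical helper; the idea that horizontal merges tend to be banner headers and vertical merges tend to be row-group labels is a heuristic, not something the tool guarantees.

```clojure
(defn merge-orientation
  "Classifies a merged region map such as {:range \"B1:F1\"} as :horizontal
  (one row, several columns), :vertical (one column, several rows), or
  :block (anything else, including unparseable ranges)."
  [{rng :range}]
  (let [[_ c1 r1 c2 r2] (re-matches #"([A-Z]+)(\d+):([A-Z]+)(\d+)" rng)]
    (cond
      (and (= r1 r2) (not= c1 c2)) :horizontal
      (and (= c1 c2) (not= r1 r2)) :vertical
      :else                        :block)))

;; Group the example output above by orientation.
(def by-orientation
  (group-by merge-orientation [{:range "B1:F1"} {:range "A3:A8"}]))
```

With the example data, by-orientation maps :horizontal to the B1:F1 region and :vertical to the A3:A8 region.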

excel/named-ranges

(excel/named-ranges)

Returns all named ranges in the workbook. Hidden/internal Excel names are filtered out.

(excel/named-ranges)
;=> [{:name "Revenue"    :refers-to "Sheet1!$B$2:$B$50" :scope :workbook}
;    {:name "Dept_Costs" :refers-to "Sheet2!$C$3:$C$20" :scope "Sheet2"}]
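
A named range's :refers-to string can be turned into a plain range string for a follow-up read. The sketch below is an assumption-laden illustration: it strips the absolute-reference $ markers on the assumption that excel/read expects ranges without them, and the named value is a stand-in for the result of (excel/named-ranges). Inside the sandbox the require line would not be needed or allowed; it is here only so the snippet runs in a plain Clojure REPL.

```clojure
(require 'clojure.string)

;; Stand-in for (excel/named-ranges), copied from the example above.
(def named
  [{:name "Revenue"    :refers-to "Sheet1!$B$2:$B$50" :scope :workbook}
   {:name "Dept_Costs" :refers-to "Sheet2!$C$3:$C$20" :scope "Sheet2"}])

(defn readable-range
  "Drops the $ markers from a named range's :refers-to string."
  [named-range]
  (clojure.string/replace (:refers-to named-range) "$" ""))

;; Lookup table from range name to a range string usable in a later read.
(def range-by-name
  (zipmap (map :name named) (map readable-range named)))
```

The lookup for "Revenue" is then "Sheet1!B2:B50", which agent code could pass to excel/read.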

SCI Allowlist

Clojure Core Subset

Data: map, filter, reduce, mapv, filterv, into, get, get-in, assoc, dissoc, update, select-keys, keys, vals, merge, zipmap, group-by, sort-by, frequencies, first, second, last, rest, next, nth, take, drop, take-while, drop-while, concat, cons, conj, distinct, flatten, reverse, partition, partition-by, interleave, interpose, count, empty?, not-empty, contains?, some, every?, vector, hash-map, hash-set, set, list, vec, seq, range, repeat, repeatedly

Arithmetic: + - * / inc dec mod rem quot max min abs

Comparison: < > <= >= = not= compare

Logic: and, or, not, if, when, when-let, if-let, cond, condp, case

Strings: str, subs, clojure.string/split, clojure.string/join, clojure.string/replace, clojure.string/trim, clojure.string/lower-case, clojure.string/upper-case, clojure.string/includes?, clojure.string/starts-with?, clojure.string/ends-with?, re-find, re-matches, re-seq

Type predicates: nil?, string?, number?, integer?, double?, keyword?, map?, vector?, set?, seq?, coll?, boolean?, true?, false?, zero?, pos?, neg?, even?, odd?

Binding & control: let, fn, def, defn, do, -> ->> as-> cond-> cond->> some-> some->>

Math: Math/floor, Math/ceil, Math/round, Math/pow (via :classes allowlist)

Not Available

No IO, no network, no atoms/refs/agents, no loop/recur/trampoline, no Java interop beyond Math, no require/import/eval, no side effects.

range, repeat, and repeatedly produce infinite lazy sequences when called without bounds. Combined with eager functions like mapv or vec, they will run until the timeout kills execution. The timeout is the safety net — removing these functions would cripple legitimate use (e.g., (range 12) for months).
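
A short sketch of the difference between bounded and unbounded use, with only allowlisted functions:

```clojure
;; Bounded: (range 12) has exactly 12 elements, so mapv terminates.
(def months (mapv inc (range 12)))

;; Also bounded: take limits an infinite sequence before it is realized.
(def zero-row (vec (take 4 (repeat 0.0))))

;; Unbounded, do not evaluate: (vec (range)) tries to realize an infinite
;; sequence and would only stop when the timeout fires.
```

months is the vector 1 through 12 and zero-row is [0.0 0.0 0.0 0.0].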

MCP Tool Registration

Registered as a defmethod for ::excel on the tools/-tool multimethod in com.getorcha.link.mcp.tools.fpna. Scope: "fpna:read".

Input schema:

{
  "type": "object",
  "properties": {
    "legal_entity_id": {
      "type": "string",
      "description": "UUID of the legal entity. Optional if identity has access to exactly one.",
      "format": "uuid"
    },
    "file": {
      "type": "string",
      "description": "Relative file path within the legal entity's data directory."
    },
    "code": {
      "type": "string",
      "description": "Clojure code to evaluate. See tool description for available functions."
    }
  },
  "required": ["file", "code"]
}
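
For illustration, a handler-side check of these arguments might look like the sketch below. It is not part of the design: validate-args and uuid-pattern are hypothetical names, and the sketch assumes the arguments arrive as a map with string keys.

```clojure
(def uuid-pattern
  #"(?i)[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")

(defn validate-args
  "Returns a vector of problem strings; empty when args satisfy the schema."
  [args]
  (let [id (get args "legal_entity_id")]
    (cond-> (->> ["file" "code"]
                 (filterv #(not (string? (get args %))))
                 (mapv #(str % " is required and must be a string")))
      ;; legal_entity_id is optional, but must be a UUID string when present
      (and (some? id)
           (not (and (string? id) (re-matches uuid-pattern id))))
      (conj "legal_entity_id must be a UUID string"))))
```

A call such as (validate-args {"file" "budget.xlsx" "code" "(excel/sheets)"}) returns an empty vector, while omitting "code" returns a single problem string naming it.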

The tool description will contain the full DSL reference: all custom functions with signatures, return shapes, and examples, plus the list of available clojure.core functions.

Documentation Updates

Tool Description

Full DSL reference embedded in the MCP tool description. The agent sees this when it discovers the tool. Covers: all 5 custom functions with signatures, return shapes, and examples; the available clojure.core subset; what's not available; timeout behavior.

DDP (data-discovery-protocol.md)

Add operational guidance to Phase 1:

Dependencies

New:

Existing (already in project):

Note: docjure remains as a dependency for list-files's extract-excel-summary. No need to remove it — it's already there and working for that use case.