Excel Native-Image Prototype Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Prove that FastExcel + supplemental StAX can read xlsx files (values, formulas, formats, merged regions, named ranges) inside a GraalVM native-image binary.

Architecture: Standalone Clojure CLI with two parsing layers — FastExcel reader for cell data and supplemental javax.xml.stream StAX parsing for merged regions/named ranges. Both access the same xlsx zip file. Compiles to native-image via clj-easy/graal-build-time.

Tech Stack: Clojure 1.12, FastExcel reader 0.19.0 (aalto-xml + commons-compress), GraalVM CE 21.0.2, tools.build for uberjar.

Design doc: docs/plans/2026-03-05-excel-native-image-design.md


Reference: Key APIs

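FastExcel reader classes (source at ~/code/oss/fastexcel/fastexcel-reader/): ReadableWorkbook, ReadingOptions, Sheet, Row, Cell, and CellType, all in the org.dhatim.fastexcel.reader package.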

FastExcel directly instantiates com.fasterxml.aalto.stax.InputFactoryImpl (no ServiceLoader) in DefaultXMLInputFactory.java — this is good for native-image.

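XLSX XML locations for supplemental parsing: merged regions are <mergeCell ref="..."> elements under <mergeCells> in each xl/worksheets/sheetN.xml entry; named ranges are <definedName> elements under <definedNames> in xl/workbook.xml.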
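To confirm which of these parts a given workbook actually contains, the zip listing can be filtered before any parsing code is written. A minimal sketch, assuming a POSIX shell and Info-ZIP unzip; filter_xlsx_parts is a hypothetical helper name, not something the plan defines:

```shell
# Hypothetical helper: keeps only the zip entries the supplemental parser reads.
# Expects one entry name per line on stdin.
filter_xlsx_parts() {
  grep -E '^xl/(workbook\.xml|worksheets/sheet[0-9]+\.xml)$'
}

# Usage (assumes Info-ZIP unzip; -Z1 prints one entry name per line):
#   unzip -Z1 dump/duo-maesn.xlsx | filter_xlsx_parts
```

If a sheet part does not show up under a sheetN.xml name, the entry names would have to be resolved some other way.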


Task 1: deps.edn alias + build script

Files: deps.edn (modify), scripts/ci/build_native_excel.clj (create)

Step 1: Add aliases to deps.edn

Add two aliases after the existing :antq alias:

:native-excel       {:replace-paths ["src"]
                     :replace-deps  {org.clojure/clojure                        {:mvn/version "1.12.0"}
                                     org.dhatim/fastexcel-reader                {:mvn/version "0.19.0"}
                                     com.github.clj-easy/graal-build-time      {:mvn/version "1.0.5"}}}
:build-native-excel {:replace-paths ["scripts/ci"]
                     :replace-deps  {io.github.clojure/tools.build {:git/tag "v0.10.12" :git/sha "97c5562"}}
                     :ns-default    build-native-excel}

Step 2: Create the build script

Create scripts/ci/build_native_excel.clj:

(ns build-native-excel
  (:require [clojure.tools.build.api :as b]))


(defn uber [_]
  (let [class-dir "target/native-excel"
        uber-file "target/excel-sandbox.jar"
        basis     (b/create-basis {:aliases [:native-excel]})]
    (b/delete {:path class-dir})
    (b/copy-dir {:target-dir class-dir
                 :src-dirs   ["src"]})
    (b/compile-clj {:basis     basis
                    :class-dir class-dir
                    :src-dirs  ["src"]
                    :ns-compile '[com.getorcha.link.excel-sandbox]})
    (b/uber {:basis     basis
             :class-dir class-dir
             :uber-file uber-file
             :main      'com.getorcha.link.excel-sandbox})))

Step 3: Verify the alias resolves

Run: clj -A:native-excel -Stree 2>&1 | head -20

Expected: Dependency tree showing clojure, fastexcel-reader, aalto-xml, commons-compress, graal-build-time. No POI.
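The "No POI" expectation can be checked mechanically instead of by eye. A minimal sketch, assuming a POSIX shell; assert_no_poi is a hypothetical helper, and the pattern assumes POI artifacts would appear under the org.apache.poi group:

```shell
# Hypothetical helper: reads a `clj -Stree` listing on stdin and fails
# if any Apache POI artifact appears in it.
assert_no_poi() {
  if grep -qi 'org\.apache\.poi'; then
    echo "POI found in dependency tree" >&2
    return 1
  fi
  echo "no POI artifacts"
}

# Usage (assumes the :native-excel alias from Step 1):
#   clj -A:native-excel -Stree 2>&1 | assert_no_poi
```

Piping the full tree rather than the first 20 lines avoids missing a transitive dependency further down the listing.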

Step 4: Commit

git add deps.edn scripts/ci/build_native_excel.clj
git commit -m "feat: add native-excel build alias and build script"

Task 2: FastExcel reading functions

Files: src/com/getorcha/link/excel_sandbox.clj (create)

Step 1: Create the namespace with imports and FastExcel reading functions

Create src/com/getorcha/link/excel_sandbox.clj:

(ns com.getorcha.link.excel-sandbox
  "Prototype: read Excel files using FastExcel + supplemental StAX parsing.
  Designed to compile to GraalVM native-image."
  (:gen-class)
  (:import
   [java.io File]
   [javax.xml.stream XMLInputFactory XMLStreamConstants]
   [org.apache.commons.compress.archivers.zip ZipFile]
   [org.dhatim.fastexcel.reader
    Cell CellType ReadableWorkbook ReadingOptions Row Sheet]))


(defn- cell->value
  "Extract the display value from a Cell."
  [^Cell cell]
  (when cell
    (case (.name (.getType cell))
      "NUMBER"  (.getValue cell)
      "STRING"  (.getText cell)
      "BOOLEAN" (.asBoolean cell)
      "FORMULA" (.getText cell)
      "ERROR"   (str "ERROR:" (.getRawValue cell))
      "EMPTY"   nil
      nil)))


(defn- cell->metadata
  "Extract value + formula + format from a Cell."
  [^Cell cell]
  (when cell
    {:value   (cell->value cell)
     :formula (.getFormula cell)
     :format  (.getDataFormatString cell)}))


(defn ^:private sheets
  "Returns a vector of sheet names."
  [^ReadableWorkbook wb]
  (mapv #(.getName ^Sheet %) (iterator-seq (.iterator (.getSheets wb)))))


(defn ^:private summary
  "Returns a map from sheet names to metadata (headers, row-count, column-count)."
  [^ReadableWorkbook wb]
  (into {}
        (map (fn [^Sheet sheet]
               (let [rows    (with-open [stream (.openStream sheet)]
                               (vec (iterator-seq (.iterator stream))))
                     headers (when (seq rows)
                               (mapv (fn [^Cell c] (when c (.getText c)))
                                     (first rows)))]
                 [(.getName sheet)
                  {:headers      headers
                   :row-count    (count rows)
                   :column-count (if (seq rows)
                                   (.getCellCount ^Row (first rows))
                                   0)}])))
        (iterator-seq (.iterator (.getSheets wb)))))


(defn ^:private read-sheet-rows
  "Read all rows from a sheet. When metadata? is true, includes formula and format."
  [^ReadableWorkbook wb ^String sheet-name metadata?]
  ;; Optional.orElse returns Object; hint the local so .openStream is not a reflective call.
  (let [^Sheet sheet (.orElse (.findSheet wb sheet-name) nil)]
    (when sheet
      (let [extract (if metadata? cell->metadata cell->value)]
        (with-open [stream (.openStream sheet)]
          (mapv (fn [^Row row] (mapv extract row))
                (iterator-seq (.iterator stream))))))))

Step 2: Verify it compiles on the JVM

Run: clj -M:native-excel -e "(require 'com.getorcha.link.excel-sandbox) (println :ok)"

Expected: :ok printed, no errors.
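Reflective interop calls load fine on the JVM but need explicit configuration under native-image, so it can help to surface them at this stage. A minimal sketch, assuming a POSIX shell; count_reflection_warnings is a hypothetical helper, not part of the plan's build:

```shell
# Hypothetical helper: counts Clojure reflection warnings in compiler
# output read from stdin. Prints 0 when there are none.
count_reflection_warnings() {
  grep -c '^Reflection warning' || true
}

# Usage (enables warnings before the namespace is loaded):
#   clj -M:native-excel -e "(set! *warn-on-reflection* true) (require 'com.getorcha.link.excel-sandbox)" 2>&1 \
#     | count_reflection_warnings
```

A non-zero count points at call sites that likely need a type hint before Task 6.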

Step 3: Commit

git add src/com/getorcha/link/excel_sandbox.clj
git commit -m "feat: add FastExcel-based excel reading functions"

Task 3: Supplemental StAX parsing (merged regions + named ranges)

Files: src/com/getorcha/link/excel_sandbox.clj (modify)

Step 1: Add merged regions parser

Add after the read-sheet-rows function:

(defn- xml-input-factory ^XMLInputFactory []
  (doto (XMLInputFactory/newFactory)
    (.setProperty XMLInputFactory/IS_NAMESPACE_AWARE false)
    (.setProperty XMLInputFactory/SUPPORT_DTD false)))


(defn ^:private parse-merged-regions
  "Parse <mergeCells> from a sheet's XML entry in the xlsx zip."
  [^ZipFile zip-file ^String sheet-entry-name]
  (let [entry (.getEntry zip-file sheet-entry-name)]
    (when entry
      (with-open [is (.getInputStream zip-file entry)]
        (let [reader (.createXMLStreamReader (xml-input-factory) is)]
          (try
            (loop [regions (transient [])]
              (if (.hasNext reader)
                (do (.next reader)
                    (if (and (= (.getEventType reader) XMLStreamConstants/START_ELEMENT)
                             (= (.getLocalName reader) "mergeCell"))
                      (recur (conj! regions {:range (.getAttributeValue reader nil "ref")}))
                      (recur regions)))
                (persistent! regions)))
            (finally
              (.close reader))))))))

Step 2: Add named ranges parser

Add after parse-merged-regions:

(defn ^:private parse-named-ranges
  "Parse <definedNames> from workbook.xml in the xlsx zip."
  [^ZipFile zip-file]
  (let [entry (or (.getEntry zip-file "xl/workbook.xml")
                  (.getEntry zip-file "xl/Workbook.xml"))]
    (when entry
      (with-open [is (.getInputStream zip-file entry)]
        (let [reader (.createXMLStreamReader (xml-input-factory) is)]
          (try
            (loop [ranges   (transient [])
                   in-names false]
              (if (.hasNext reader)
                (let [_    (.next reader)
                      type (.getEventType reader)]
                  (cond
                    (and (= type XMLStreamConstants/START_ELEMENT)
                         (= (.getLocalName reader) "definedNames"))
                    (recur ranges true)

                    (and (= type XMLStreamConstants/END_ELEMENT)
                         (= (.getLocalName reader) "definedNames"))
                    (persistent! ranges)

                    (and in-names
                         (= type XMLStreamConstants/START_ELEMENT)
                         (= (.getLocalName reader) "definedName"))
                    ;; Read attributes first: getElementText advances the reader to </definedName>.
                    (let [range-name  (.getAttributeValue reader nil "name")
                          local-sheet (.getAttributeValue reader nil "localSheetId")
                          ;; getElementText joins text that the parser may split across
                          ;; several CHARACTERS events.
                          refers-to   (.getElementText reader)]
                      (recur (conj! ranges {:name      range-name
                                            :refers-to refers-to
                                            :scope     (or local-sheet :workbook)})
                             true))

                    :else (recur ranges in-names)))
                (persistent! ranges)))
            (finally
              (.close reader))))))))

Step 3: Verify it still compiles

Run: clj -M:native-excel -e "(require 'com.getorcha.link.excel-sandbox) (println :ok)"

Expected: :ok

Step 4: Commit

git add src/com/getorcha/link/excel_sandbox.clj
git commit -m "feat: add supplemental StAX parsing for merged regions and named ranges"

Task 4: Main entry point + JVM smoke test

Files: src/com/getorcha/link/excel_sandbox.clj (modify)

Step 1: Add the -main function

Add at the end of the file:

(defn- run-all-features
  "Exercise all Excel reading features on the given file and print results."
  [^String path]
  (let [file (File. path)
        opts (ReadingOptions. true false)
        zip  (ZipFile. file)]
    (try
      (with-open [wb (ReadableWorkbook. file opts)]
        (println "=== SHEETS ===")
        (let [sheet-names (sheets wb)]
          (prn sheet-names)

          (println "\n=== SUMMARY ===")
          (prn (summary wb))

          (println "\n=== FIRST 5 ROWS (values) ===")
          (when-let [first-name (first sheet-names)]
            (let [rows (read-sheet-rows wb first-name false)]
              (doseq [row (take 5 rows)]
                (prn row))))

          (println "\n=== FIRST 3 ROWS (metadata) ===")
          (when-let [first-name (first sheet-names)]
            (let [rows (read-sheet-rows wb first-name true)]
              (doseq [row (take 3 rows)]
                (prn row))))

          (println "\n=== NAMED RANGES ===")
          (prn (parse-named-ranges zip))

          (println "\n=== MERGED REGIONS ===")
          (doseq [[idx sheet-name] (map-indexed vector sheet-names)]
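            ;; Assumes sheets are stored as sheet1.xml, sheet2.xml, ... in workbook order;
            ;; the authoritative mapping lives in xl/_rels/workbook.xml.rels.
            (let [entry-name (str "xl/worksheets/sheet" (inc idx) ".xml")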
                  regions    (parse-merged-regions zip entry-name)]
              (when (seq regions)
                (println (str "  " sheet-name ":"))
                (prn regions))))))
      (finally
        (.close zip)))))


(defn -main [& args]
  (if-let [path (first args)]
    (run-all-features path)
    (do (println "Usage: excel-sandbox <path-to-xlsx>")
        (System/exit 1))))

Step 2: Run on JVM with the test file

Run: clj -M:native-excel -m com.getorcha.link.excel-sandbox dump/duo-maesn.xlsx

Expected: Output showing sheets, summary, rows (values and metadata), named ranges, merged regions. No exceptions. This validates all FastExcel + supplemental parsing works on JVM before attempting native-image.

Step 3: Commit

git add src/com/getorcha/link/excel_sandbox.clj
git commit -m "feat: add -main entry point with full feature exercise"

Task 5: Build uberjar

Step 1: Build the uberjar

Run: clj -T:build-native-excel uber

Expected: target/excel-sandbox.jar created. No errors. The uberjar includes AOT-compiled classes for com.getorcha.link.excel-sandbox plus all dependencies.

Step 2: Verify uberjar runs on JVM

Run: java -jar target/excel-sandbox.jar dump/duo-maesn.xlsx

Expected: Same output as Task 4 Step 2. This confirms the uberjar is correctly assembled with the right main class and all deps.
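If java -jar fails to find the main class, the manifest is the first thing to inspect. A minimal sketch, assuming a POSIX shell and Info-ZIP unzip; manifest_main_class is a hypothetical helper. The expected value is the munged class name, com.getorcha.link.excel_sandbox (underscore, not hyphen), since that is how the gen-class'd namespace is named on disk:

```shell
# Hypothetical helper: prints the Main-Class value from a jar manifest
# read on stdin. Manifests use CRLF line endings, so strip \r first.
manifest_main_class() {
  tr -d '\r' | sed -n 's/^Main-Class: *//p'
}

# Usage:
#   unzip -p target/excel-sandbox.jar META-INF/MANIFEST.MF | manifest_main_class
```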

Step 3: Commit

No code changed — this is a build verification step.


Task 6: Compile native-image

Step 1: Run native-image compilation

Run:

/usr/lib/jvm/java-21-graalvm/bin/native-image \
  --features=clj_easy.graal_build_time.InitClojureClasses \
  --no-fallback \
  --report-unsupported-elements-at-runtime \
  -H:+ReportExceptionStackTraces \
  -jar target/excel-sandbox.jar \
  -o target/excel-sandbox

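Flags explained:

--features=clj_easy.graal_build_time.InitClojureClasses: registers the graal-build-time feature, which initializes Clojure classes at build time.
--no-fallback: fail the build rather than produce a fallback image that needs a JVM at run time.
--report-unsupported-elements-at-runtime: defer errors for unsupported elements until the code path is actually executed.
-H:+ReportExceptionStackTraces: print full stack traces for exceptions raised during the build.
-jar / -o: the input uberjar and the output binary path.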

Expected: Compilation succeeds (may take 1-3 minutes). Binary at target/excel-sandbox.

If it fails with ServiceLoader/reflection errors:

The most likely issue is javax.xml.stream.XMLInputFactory ServiceLoader lookup in the supplemental parser. FastExcel avoids this by directly instantiating com.fasterxml.aalto.stax.InputFactoryImpl, but our XMLInputFactory/newFactory call uses ServiceLoader.

Fix: Replace (XMLInputFactory/newFactory) in xml-input-factory with a direct instantiation:

(defn- xml-input-factory ^XMLInputFactory []
  (doto (InputFactoryImpl.)
    (.setProperty XMLInputFactory/IS_NAMESPACE_AWARE false)
    (.setProperty XMLInputFactory/SUPPORT_DTD false)))

This requires adding the import [com.fasterxml.aalto.stax InputFactoryImpl] to the ns form; aalto-xml is already on the classpath as a FastExcel dependency. Then rebuild uberjar and retry native-image.

If it fails with other reflection errors:

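Create resources/META-INF/native-image/reflect-config.json with the classes mentioned in the error. Add resources to the :replace-paths in the :native-excel alias, and also to :src-dirs in the b/copy-dir call in scripts/ci/build_native_excel.clj: b/uber packages only class-dir and the dependency jars, so the config must be copied into class-dir to end up in the uberjar. Rebuild uberjar and retry.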
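The reflection config is a JSON array with one object per class. A minimal sketch of writing a single-entry file, assuming a POSIX shell; write_reflect_config is a hypothetical helper, com.example.SomeClass stands in for whatever class the error names, and the two "all..." flags are a deliberately broad starting point rather than a tuned configuration:

```shell
# Hypothetical helper: writes a one-entry reflect-config.json under
# <root>/META-INF/native-image for the class named in the error.
#   $1 = resource root (e.g. resources), $2 = fully qualified class name
write_reflect_config() {
  dir="$1/META-INF/native-image"
  mkdir -p "$dir"
  cat > "$dir/reflect-config.json" <<EOF
[
  {
    "name": "$2",
    "allDeclaredConstructors": true,
    "allPublicMethods": true
  }
]
EOF
}

# Usage (placeholder class name):
#   write_reflect_config resources com.example.SomeClass
```

As an alternative to hand-writing entries, running the uberjar on a GraalVM JDK with -agentlib:native-image-agent=config-output-dir=resources/META-INF/native-image should record the reflective accesses the test file actually triggers.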

Step 2: Commit any fixes

git add -p  # stage only the files you changed
git commit -m "fix: resolve native-image compilation issues"

Task 7: Test native binary

Step 1: Run the native binary

Run: ./target/excel-sandbox dump/duo-maesn.xlsx

Expected: Same output as the JVM run (Task 4 Step 2 / Task 5 Step 2). All features work: sheets, summary, cell values, cell metadata (formulas + formats), named ranges, merged regions.

Step 2: Compare JVM vs native output

Run:

java -jar target/excel-sandbox.jar dump/duo-maesn.xlsx > /tmp/jvm-output.txt 2>&1
./target/excel-sandbox dump/duo-maesn.xlsx > /tmp/native-output.txt 2>&1
diff /tmp/jvm-output.txt /tmp/native-output.txt

Expected: No differences (or only minor formatting differences in BigDecimal rendering).
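To make this comparison a pass/fail check, the diff can be wrapped so that the exit status carries the result. A minimal sketch, assuming a POSIX shell; compare_outputs is a hypothetical helper:

```shell
# Hypothetical helper: prints a unified diff and returns non-zero
# when the two captured outputs differ.
compare_outputs() {
  if diff -u "$1" "$2"; then
    echo "outputs identical"
  else
    echo "outputs differ" >&2
    return 1
  fi
}

# Usage (files produced by the two runs above):
#   compare_outputs /tmp/jvm-output.txt /tmp/native-output.txt
```

If the only differences turn out to be JVM warnings on stderr, capturing stdout alone (dropping 2>&1) may give a cleaner comparison.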

Step 3: Check binary size and startup time

Run:

ls -lh target/excel-sandbox
time ./target/excel-sandbox dump/duo-maesn.xlsx > /dev/null

Expected: Binary roughly 20-50MB. Startup+execution under 1 second (vs several seconds on JVM).
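A single time invocation is noisy, so averaging a few runs gives a steadier figure for the results write-up. A minimal sketch, assuming a POSIX shell and GNU date (for %N nanoseconds); time_runs is a hypothetical helper:

```shell
# Hypothetical helper: runs a command N times with stdout discarded
# and prints the mean wall-clock time per run in milliseconds.
#   $1 = number of runs, remaining args = command to run
time_runs() {
  n="$1"; shift
  i=0
  start=$(date +%s%N)   # nanoseconds since epoch (GNU date)
  while [ "$i" -lt "$n" ]; do
    "$@" > /dev/null
    i=$((i + 1))
  done
  end=$(date +%s%N)
  echo "$(( (end - start) / n / 1000000 )) ms per run"
}

# Usage:
#   time_runs 10 ./target/excel-sandbox dump/duo-maesn.xlsx
#   time_runs 10 java -jar target/excel-sandbox.jar dump/duo-maesn.xlsx
```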

Step 4: Final commit

If any fixes were needed during testing:

git add -p
git commit -m "fix: resolve native-image runtime issues"

Task 8: Document results

Files: docs/plans/2026-03-05-excel-native-image-design.md (modify)

Step 1: Add results section to design doc

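Append a ## Results section to the design doc with: whether native-image compilation succeeded, any fixes that were required (Task 6), the outcome of the JVM vs native diff, and the measured binary size and startup time (Task 7).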

Step 2: Commit

git add docs/plans/2026-03-05-excel-native-image-design.md
git commit -m "docs: add native-image prototype results"