2026-03-12: SmolVLA FP8/INT8 Global-Optimization Workstream

0. New-User Bootstrap (Verified 2026-03-13)

This section is the end-to-end onboarding path for users starting from a fresh Merlin checkout.

0.1 Clone third-party repos used by SmolVLA export

From repository root:

mkdir -p third_party
cd third_party

git clone -b mlir-smolvla https://github.com/ucb-bar/Understanding-PI0.git
git clone https://github.com/huggingface/lerobot.git

# Return to repository root; the following steps assume it.
cd ..

0.2 Set up Understanding-PI0 Python environment

cd third_party/Understanding-PI0
uv python pin 3.12
uv sync --extra export_iree
cd ../..

0.3 Build Merlin tools with NPU plugin enabled

conda run -n merlin-dev uv run tools/build.py \
  --profile npu \
  --config release \
  --no-build-python-bindings \
  --no-enable-libbacktrace

Sanity check:

build/host-merlin-release/install/bin/iree-compile --iree-list-plugins

Expected plugin list includes npu.

0.4 Branch linkage for patched third-party trees

For this workstream, behavior depends on patched trees in:

third_party/Understanding-PI0
third_party/iree-turbine
third_party/iree_bar
third_party/torch-mlir

The behavior documented here applies only when these third_party/* checkouts point at the patched branch or commit. To print the current branch and short commit of each tree, use this check:

for d in third_party/Understanding-PI0 third_party/iree-turbine third_party/iree_bar third_party/torch-mlir; do
  echo "== $d =="
  git -C "$d" rev-parse --abbrev-ref HEAD
  git -C "$d" rev-parse --short HEAD
done

When upstreaming, push each repository's branch independently and open separate PRs per repository.

0.5 Export SmolVLA MLIRs through `models/smolVLA`

The wrapper script models/smolVLA/export_smolvla.py drives third_party/Understanding-PI0 and produces all three required MLIR variants:

models/smolVLA/smolVLA.mlir (fp32 path)
models/smolVLA/smolVLA.q.int8.mlir (forced int8 path)
models/smolVLA/smolVLA.q.fp8.mlir (mixed fp8/int8 path)

These files are generated artifacts, not committed inputs. If they are missing, export has not been run yet in this workspace.
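A small pre-flight check can make the "export has not been run yet" case explicit before compiling. The sketch below is not part of the repository: the helper name `missing_exports` is hypothetical, and it assumes only the three file names listed above.

```python
from pathlib import Path

# File names taken from the list above; the helper itself is hypothetical.
EXPECTED_EXPORTS = (
    "smolVLA.mlir",         # fp32 path
    "smolVLA.q.int8.mlir",  # forced int8 path
    "smolVLA.q.fp8.mlir",   # mixed fp8/int8 path
)


def missing_exports(export_dir: str = "models/smolVLA") -> list[str]:
    """Return the expected MLIR files that are absent from export_dir."""
    root = Path(export_dir)
    return [name for name in EXPECTED_EXPORTS if not (root / name).is_file()]
```

An empty list means all three variants exist; anything else suggests rerunning the export with --mode all.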

Run:

conda run -n merlin-dev uv run models/smolVLA/export_smolvla.py \
  --mode all \
  --device cuda

Preview the commands without executing them:

conda run -n merlin-dev uv run models/smolVLA/export_smolvla.py \
  --mode all \
  --device cuda \
  --dry-run

Single-mode exports are also supported:

conda run -n merlin-dev uv run models/smolVLA/export_smolvla.py --mode fp32 --device cuda
conda run -n merlin-dev uv run models/smolVLA/export_smolvla.py --mode int8 --device cuda
conda run -n merlin-dev uv run models/smolVLA/export_smolvla.py --mode fp8 --device cuda

0.6 Direct Understanding-PI0 workflow (optional)

If you want the original direct flow inside third_party/Understanding-PI0, use:

cd third_party/Understanding-PI0

uv run python scripts/quantizing_smolvla/inspect_fqns.py \
  --model-id lerobot/smolvla_base \
  --device cuda

uv run python scripts/quantizing_smolvla/quantize_mx.py \
  --model-id lerobot/smolvla_base \
  --device cuda \
  --out reports/smolvla_mx/smolvla_mx_quantized.pt \
  --report-json reports/smolvla_mx/quant_report.json

uv run python scripts/quantizing_smolvla/validate_one_step.py \
  --model-id lerobot/smolvla_base \
  --device cuda

uv run python scripts/export_iree.py \
  --model-id lerobot/smolvla_base \
  --device cuda \
  --print-readable \
  --out reports/smolvla_mx/smolvla_one_step.mlir

0.7 Minimum Reproducible Flow

Use this exact sequence from repo root:

# 1) Export mixed fp8/int8 SmolVLA MLIR
conda run -n merlin-dev uv run models/smolVLA/export_smolvla.py --mode fp8 --device cuda

# 2) Compile to global-opt with NPU target (short command)
conda run -n merlin-dev uv run tools/compile.py \
  models/smolVLA/smolVLA.q.fp8.mlir \
  --target npu_ucb \
  --quantized \
  --compile-to global-optimization \
  --dump-phases

Core output directory from step 2:

build/compiled_models/smolVLA/npu_ucb_RVV_smolVLA.q.fp8/phases/

Required files to inspect every run:

module.1.input.mlir
module.4.global-optimization.mlir

Then translate to NPU text ISA, check the ISA contract, and run the simulator smoke test:

BIN=build/host-merlin-release/tools
DUMP=build/compiled_models/smolVLA/npu_ucb_RVV_smolVLA.q.fp8/phases

"$BIN/iree-opt" "$DUMP/module.4.global-optimization.mlir" \
  --iree-plugin=npu \
  --mlir-print-op-generic \
  > "$DUMP/module.4.global-optimization.generic.mlir"

"$BIN/npu-translate" \
  --allow-unregistered-dialect \
  --mlir-to-npu-text-isa \
  "$DUMP/module.4.global-optimization.generic.mlir" \
  > "$DUMP/smolvla.program.isa.txt"

conda run -n merlin-dev uv run python \
  compiler/src/merlin/Dialect/NPU/scripts/check_isa_contract.py \
  "$DUMP/smolvla.program.isa.txt"

conda run -n merlin-dev uv run python \
  third_party/npu_model/compiler/scripts/run_simulator_smoke.py \
  "$DUMP/smolvla.program.isa.txt"

Where generated files live:

exported MLIRs: models/smolVLA/ (generated, gitignored)
compiled artifacts/phases: build/compiled_models/... (build outputs, not committed)

1. Context and Goal

Goal for this workstream:

compile the real SmolVLA graph (not a PoC) with --compile-to=global-optimization
inspect both dumps on every run:
  module.1.input.mlir
  module.4.global-optimization.mlir
push computation toward low precision (fp8/int8) and avoid unnecessary QDQ-style expansion in runtime regions

Primary compile flow used for the latest rerun:

conda run -n merlin-dev uv run tools/compile.py \
  third_party/Understanding-PI0/reports/smolvla_mx/smolvla_one_step.mlir \
  --target spacemit_x60 \
  --quantized \
  --compile-to global-optimization \
  --dump-compilation-phases-to tmp/smolvla_global_opt_phases_quantfix12_softmax_fastpath_rerun \
  --iree-compile-arg=--iree-input-type=torch \
  --iree-compile-arg=--iree-opt-const-eval=false \
  --iree-compile-arg=--iree-global-opt-enable-demote-contraction-inputs-to-bf16=matmul

2. Why Vanilla IREE Was Not Enough For This Case

Vanilla IREE is semantically correct, but this exported graph already contains explicit decode/dequant math for MX payloads (ui8 -> i32 -> bitcast f32 -> clamp/where -> bf16).

That means global optimization can hoist and fuse parts, but it will not "re-infer" native FP8 attention kernels from an already expanded decode + mixed-precision graph.

Original exported graph snippet (decode path)

%124 = torch.prims.convert_element_type %123, %int3_215
  : !torch.vtensor<[768,24],ui8>, !torch.int -> !torch.vtensor<[768,24],si32>
%125 = torch.operator "torch.aten.bitwise_left_shift.Tensor_Scalar"(%124, %int23)
  : (!torch.vtensor<[768,24],si32>, !torch.int) -> !torch.vtensor<[768,24],si32>
%126 = torch.aten.view.dtype %125, %int6_216
  : !torch.vtensor<[768,24],si32>, !torch.int -> !torch.vtensor<[768,24],f32>
%128 = torch.aten.clamp %127, %float1.175490e-38, %none_217
  : !torch.vtensor<[768,24],f32>, !torch.float, !torch.none -> !torch.vtensor<[768,24],f32>
%131 = torch.aten.where.self %129, %128, %130
  : !torch.vtensor<[768,24],i1>, !torch.vtensor<[768,24],f32>, !torch.vtensor<[768,24],f32>
    -> !torch.vtensor<[768,24],f32>
%132 = torch.prims.convert_element_type %131, %int15_228
  : !torch.vtensor<[768,24],f32>, !torch.int -> !torch.vtensor<[768,24],bf16>
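To make the decode path concrete, here is a minimal sketch of what the shift, bitcast, and clamp steps compute for a single ui8 value. It reads the byte as a biased power-of-two exponent, which is what the shift by 23 into the f32 exponent field implies. Two assumptions are mine: the function name `decode_scale_byte` is hypothetical, and the clamp bound is taken to be the smallest normal f32 (2^-126, about 1.17549e-38). The `where` step is left out because its operands are not shown in the snippet.

```python
import struct

# Assumption: the clamp bound in the snippet is the smallest normal f32.
F32_MIN_NORMAL = 2.0 ** -126


def decode_scale_byte(e: int) -> float:
    """Emulate convert-to-i32, shift-left-23, view-as-f32, clamp for one byte."""
    if not 0 <= e <= 0xFF:
        raise ValueError("scale byte must fit in ui8")
    bits = e << 23  # place the byte in the f32 exponent field
    # Reinterpret the 32-bit pattern as f32 (the view.dtype step).
    value = struct.unpack("<f", struct.pack("<I", bits))[0]
    # Lower-bound clamp only, matching the clamp with a 'none' upper bound.
    return max(value, F32_MIN_NORMAL)
```

Under these assumptions a byte of 127 decodes to 1.0 and each increment doubles the scale, while a byte of 0 (all-zero bits, i.e. 0.0) is lifted to the clamp bound.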

3. Compiler Modifications

3.1 Preserve attention region compute type

File:

third_party/iree_bar/compiler/plugins/input/Torch/InputConversion/ConvertTMTensorToLinalgExt.cpp

Change:

removed forced f32 region argument for iree_linalg_ext.attention
region arg now uses frontend-derived targetType

// Preserve the frontend attention compute type instead of forcing f32.
block->addArgument(targetType, loc);

3.2 Keep attention decomposition in region score type

File:

third_party/iree_bar/compiler/src/iree/compiler/Dialect/LinalgExt/IR/AggregatedOpInterfaceImpl.cpp

Change:

removed hardcoded f32 in attention decomposition
uses scoreType from attention region argument

Type scoreType = getRegion().front().getArgument(0).getType();
Value s = computeQKAndElementwise(..., scoreType, ...);

3.3 Keep online-attention conversion in region score type

File:

third_party/iree_bar/compiler/src/iree/compiler/Dialect/LinalgExt/Transforms/TileAttention.cpp

Change:

removed hardcoded f32 temp buffers/inits
uses attention region score type end-to-end in this conversion

3.4 Torch softmax low-precision folding/decomposition

Files (both trees patched):

third_party/iree_bar/third_party/torch-mlir/lib/Dialect/Torch/Transforms/DecomposeComplexOps.cpp
third_party/torch-mlir/lib/Dialect/Torch/Transforms/DecomposeComplexOps.cpp

Changes:

added FoldSoftmaxIntConvertElementTypeOp
AtenSoftmaxInt fast-path: decompose directly in bf16/f16 when softmax is immediately cast down
for bf16/f16 result types, force softmax accumulator type to bf16/f16

Result:

global linalg.softmax moved from f32 to bf16 for the 23 softmax islands
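For a rough numerical feel of what a bf16 softmax accumulator means, the sketch below emulates bf16 rounding in pure Python and rounds every intermediate, including the running sum. This is an illustration only, not the torch-mlir decomposition: the helper names are hypothetical, and the reference softmax runs in Python double precision rather than f32.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bf16 value (ties to even), returned as a float."""
    if math.isnan(x):
        return x
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest even on the low 16 bits
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def softmax_bf16(row: list[float]) -> list[float]:
    """Softmax where every intermediate, including the running sum, is bf16."""
    row = [to_bf16(v) for v in row]
    m = max(row)
    exps = [to_bf16(math.exp(to_bf16(v - m))) for v in row]
    total = 0.0
    for e in exps:
        total = to_bf16(total + e)  # bf16 accumulator
    return [to_bf16(e / total) for e in exps]


def softmax_ref(row: list[float]) -> list[float]:
    """Higher-precision reference for comparison."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]
```

Comparing `softmax_bf16(row)` with `softmax_ref(row)` on short rows shows small per-element differences. Long rows (the score tensors here have 291 columns) put more pressure on the bf16 running sum, which is worth keeping in mind when validating accuracy.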

3.5 Added quant IR analyzer

File:

tools/analyze_quant_ir.py

Capabilities:

compares module.1.input.mlir and module.4.global-optimization.mlir
typed softmax counters (linalg.softmax f32 vs linalg.softmax bf16)
matmul signature counters and enforcement thresholds
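The typed counters can be approximated with a few regular expressions over the dumped MLIR text. The sketch below is not the actual tools/analyze_quant_ir.py; the function name and counter keys are hypothetical, and the patterns only cover the textual forms of linalg.batch_matmul and linalg.softmax that appear in this post.

```python
import re
from collections import Counter

# Captures the element type of a ranked tensor, e.g. tensor<15x291x64xbf16> -> bf16.
_ELT = r"tensor<[^>]*x(\w+)>"

_BATCH_MATMUL = re.compile(
    r"linalg\.batch_matmul\s+ins\([^:]*:\s*" + _ELT + r",\s*" + _ELT + r"\)"
    r"\s*outs\([^:]*:\s*" + _ELT + r"\)"
)
_SOFTMAX = re.compile(
    r"linalg\.softmax\s+dimension\(\d+\)\s*ins\([^:]*:\s*" + _ELT + r"\)"
)


def count_typed_ops(mlir_text: str) -> Counter:
    """Count batch_matmul signatures and softmax element types in MLIR text."""
    counts = Counter()
    for lhs, rhs, acc in _BATCH_MATMUL.findall(mlir_text):
        counts[f"batch_matmul {lhs}*{rhs}->{acc}"] += 1
    for elt in _SOFTMAX.findall(mlir_text):
        counts[f"linalg.softmax {elt}"] += 1
    return counts


# Example input: the quantfix12 global-opt forms quoted in this post.
SAMPLE = """
%1827 = linalg.batch_matmul
  ins(%collapsed_1078, %1826
      : tensor<15x291x64xbf16>, tensor<15x64x291xbf16>)
  outs(%1814 : tensor<15x291x291xf32>) -> tensor<15x291x291xf32>

%1832 = linalg.softmax dimension(2)
  ins(%1831 : tensor<15x291x291xbf16>)
  outs(%1830 : tensor<15x291x291xbf16>) -> tensor<15x291x291xbf16>
"""
```

Regex counting is brittle (for example, operand names containing a colon would defeat the `[^:]*` part), so treat it as a quick cross-check rather than a replacement for the analyzer.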

4. Stage-by-Stage Results

Metrics from tools/analyze_quant_ir.py.

| Stage | input `attention region f32` | global `attention region f32` | global `linalg.softmax f32` | global `linalg.softmax bf16` | global `batch_matmul f32*bf16->f32` | global `batch_matmul bf16*bf16->f32` | global `truncf f32->bf16 outside initializer` |
| --- | --- | --- | --- | --- | --- | --- | --- |
| quantfix3 | 36 | 36 | 23 | 0 | 23 | 0 | 213 |
| quantfix6_attn_online_dtype | 0 | 0 | 23 | 0 | 23 | 0 | 213 |
| quantfix8_demote_contract_bf16 | 0 | 0 | 23 | 0 | 0 | 23 | 259 |
| quantfix11_softmax_fastpath | 0 | 0 | 0 | 23 | 0 | 23 | 259 |
| quantfix12_softmax_fastpath_rerun | 0 | 0 | 0 | 23 | 0 | 23 | 259 |

Interpretation:

The attention region type-forcing bug is fixed: no f32 attention regions remain in input or global-opt.
The softmax precision blocker is resolved in global-opt: all 23 linalg.softmax ops are now bf16.
The remaining primary blocker is the score-path matmul accumulation pattern: linalg.batch_matmul bf16*bf16 -> f32 (23 instances).

5. Expandable MLIR Evidence

Input (`quantfix12`) still has f32 score matmul islands

%2462 = linalg.batch_matmul
  ins(%collapsed_1831, %collapsed_1832
      : tensor<15x291x64xf32>, tensor<15x64x291xf32>)
  outs(%2461 : tensor<15x291x291xf32>) -> tensor<15x291x291xf32>

%2467 = linalg.generic ... ins(%2465 : tensor<1x15x291x291xf32>)
  outs(%2466 : tensor<1x15x291x291xbf16>) {
^bb0(%in: f32, %out: bf16):
  %5333 = arith.truncf %in : f32 to bf16
  linalg.yield %5333 : bf16
}

Global-opt (`quantfix12`) now has bf16 softmax, but the score matmul still produces f32 output

%1827 = linalg.batch_matmul
  ins(%collapsed_1078, %1826
      : tensor<15x291x64xbf16>, tensor<15x64x291xbf16>)
  outs(%1814 : tensor<15x291x291xf32>) -> tensor<15x291x291xf32>

%1832 = linalg.softmax dimension(2)
  ins(%1831 : tensor<15x291x291xbf16>)
  outs(%1830 : tensor<15x291x291xbf16>) -> tensor<15x291x291xbf16>

Input (`quantfix12`) attention op region is bf16 (fixed)

%129 = iree_linalg_ext.attention ...
  ins(... : tensor<12x1024x64xbf16>, tensor<12x1024x64xbf16>,
            tensor<12x1024x64xbf16>, bf16, tensor<12x1024x1024xi1>)
  outs(%128 : tensor<12x1024x64xbf16>) {
^bb0(%arg13: bf16):
  iree_linalg_ext.yield %arg13 : bf16
} -> tensor<12x1024x64xbf16>

6. How IREE Handles FP8 vs Attention in This Tree

6.1 What exists already

FP8 arithmetic coverage exists in e2e tests:
  third_party/iree_bar/tests/e2e/linalg/small_float_arith.mlir
ROCm FP8 matmul ukernel path exists:
  third_party/iree_bar/compiler/plugins/target/ROCM/builtins/mlir_ukernel/iree_uk_amdgpu_matmul_f8E4M3FN.mlir
Global-opt can keep many model matmuls in mixed low precision with explicit f8E4M3FN -> bf16 scaling semantics.
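As a reminder of what an f8E4M3FN payload encodes, here is a minimal decoder sketch: 1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, and a single NaN pattern per sign. The function name is hypothetical, and the sketch covers only the element decode, not the block scaling applied on top.

```python
import math


def decode_f8e4m3fn(byte: int) -> float:
    """Decode one f8E4M3FN byte to a Python float."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0xF and mant == 0x7:
        return math.nan  # the only NaN encoding; the format has no inf
    if exp == 0:
        return sign * (mant / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)
```

Every finite f8E4M3FN value fits exactly in bf16 (3 mantissa bits against 7, and a much narrower exponent range), so the widening itself is lossless; precision is decided by what happens after it.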

6.2 What is still f32-centric

Attention and softmax e2e references are mainly f32-typed:
  third_party/iree_bar/tests/e2e/linalg_ext_ops/attention.mlir
  third_party/iree_bar/tests/e2e/linalg_ext_ops/attention_i1_mask.mlir
  third_party/iree_bar/tests/e2e/linalg/softmax.mlir
LLVMGPU attention scheduling code currently assumes an f32 accumulator (C operand) type for attention matmuls during schedule construction (KernelConfig.cpp).

Practical consequence for our pipeline:

IREE does have FP8 matmul support.
It does not automatically turn a decomposed score/softmax/value path into fully FP8 attention math when the frontend graph already carries mixed/f32 score semantics.

7. Current Status for NPU/Gemmini Matching

Implemented and validated now:

NPU conversion pass now rewrites:
  linalg.batch_matmul -> npu_kernel.matmul
  linalg.softmax -> npu_schedule.softmax_fragment
  iree_linalg_ext.attention -> npu_kernel.ukernel_generic (npu_uk_gemma_attention_*)
NPU verifier now accepts batched matmul-like ranks and has explicit attention-family checks.
Gemmini conversion now handles named linalg.matmul in addition to existing generic recovery.
Gemmini plugin include/wiring blocker fixed (GemminiOptions.cpp include path).

8. Real-Model Coverage Check (Global-Opt Dump)

Validation was run on:

tmp/smolvla_global_opt_phases_quantfix12_softmax_fastpath_rerun/module.4.global-optimization.mlir

8.1 NPU conversion coverage

Before convert-linalg-to-npu-kernel:

linalg.batch_matmul: 46
linalg.softmax: 23
iree_linalg_ext.attention: 36

After convert-linalg-to-npu-kernel:

linalg.batch_matmul: 0
linalg.softmax: 0
iree_linalg_ext.attention: 0
npu_kernel.matmul: 113
npu_kernel.ukernel_generic: 36
npu_schedule.softmax_fragment: 23

After full NPU pipeline (... -> convert-npu-schedule-to-isa):

npu_kernel.*: 0
npu_schedule.*: 0
npu_isa.matmul_mxu0: 185
npu_isa.vexp: 59
npu_isa.vmul: 118

NPU-lowered snippet (real global-opt file)

%0 = npu_kernel.ukernel_generic "npu_uk_gemma_attention_bf16_bf16"(...)
...
%1 = npu_schedule.ukernel_launch "npu_uk_gemma_attention_bf16_bf16"(...)
...
%2 = npu_isa.matmul_mxu0 ...

9. Remaining Precision Blocker

We have removed the worst QDQ runtime behavior (bitcast/trunc now hoisted to initializers in this run), and softmax is bf16 in global-opt.

Remaining blocker before "near-zero fp32":

23 score-path linalg.batch_matmul islands in global-opt remain bf16*bf16 -> f32.
In NPU-kernel-lowered form this appears as 24 npu_kernel.matmul bf16*bf16->f32 (one additional non-batch contraction path).

10. Next Minimal Steps (Not Overdoing)

Keep current matcher path as baseline: it is now complete for both NPU and Gemmini on real global-opt forms.
If we decide to reduce fp32 further, add one targeted rewrite for score-matmul accumulators only (instead of broad precision changes).
Keep tools/analyze_quant_ir.py as a gate on module.1.input.mlir and module.4.global-optimization.mlir.
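A gate in this sense can be as simple as comparing counters against per-metric ceilings. The sketch below is an assumption about shape, not the analyzer's real interface: the function name, the metric keys, and the thresholds are hypothetical, and the example numbers are the quantfix12 row from the results table.

```python
def check_gate(metrics: dict[str, int], max_allowed: dict[str, int]) -> list[str]:
    """Return one message per metric that exceeds its allowed ceiling."""
    failures = []
    for name, limit in max_allowed.items():
        value = metrics.get(name, 0)  # a missing metric counts as zero
        if value > limit:
            failures.append(f"{name}: {value} > {limit}")
    return failures


# Example data: quantfix12 global-opt counters from the results table.
QUANTFIX12 = {
    "global linalg.softmax f32": 0,
    "global batch_matmul f32*bf16->f32": 0,
    "global batch_matmul bf16*bf16->f32": 23,
}

# Hypothetical ceilings: freeze the current baseline so regressions fail.
BASELINE_CEILINGS = {
    "global linalg.softmax f32": 0,
    "global batch_matmul f32*bf16->f32": 0,
    "global batch_matmul bf16*bf16->f32": 23,
}
```

Freezing the current counts as ceilings keeps the gate quiet today and makes it fail if, for example, f32 softmax islands reappear. Lowering the bf16*bf16->f32 ceiling later would track progress on the remaining blocker.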

11. 2026-03-13 Continuation

11.1 Quantfix3 input/global-opt analysis (the two files we now always check)

Files:

tmp/smolvla_global_opt_phases_quantfix3/module.1.input.mlir
tmp/smolvla_global_opt_phases_quantfix3/module.4.global-optimization.mlir

Current counts in these dumps:

input linalg.batch_matmul ... -> tensor<...xf32>: 23
global-opt linalg.batch_matmul ... -> tensor<...xf32>: 23
global-opt linalg.softmax ... -> tensor<...xf32>: 23

Global-opt f32 island example (`quantfix3`)

%1820 = linalg.fill ins(%cst_16 : f32) outs(%1819 : tensor<15x291x291xf32>) -> tensor<15x291x291xf32>
%1821 = linalg.batch_matmul ins(%collapsed_1078, %collapsed_1079
  : tensor<15x291x64xf32>, tensor<15x64x291xbf16>)
  outs(%1820 : tensor<15x291x291xf32>) -> tensor<15x291x291xf32>

11.2 NPU island handling (implemented)

File updated:

compiler/src/merlin/Dialect/NPU/Transforms/ConvertLinalgToNPUKernel.cpp

What changed:

generalized demotion rewrite from only bf16 x bf16 -> f32 to float score paths including f32 x bf16 -> f32
inserted explicit tensor elementwise casts:
  cast float operands to bf16 when needed (truncf)
  run npu_kernel.matmul in bf16
  widen back to f32 with extf only where IR still expects f32

Result on real quantfix3 global-opt after convert-linalg-to-npu-kernel:

npu_kernel.matmul ... -> tensor<...xf32>: 0
npu_kernel.matmul ... -> tensor<...xbf16>: 47
linalg.batch_matmul ... -> tensor<...xf32>: 0

NPU-demoted snippet (`quantfix3` global-opt lowered form)

%1688 = npu_kernel.matmul %expanded, %__hoisted_tensor_960x32xbf16
  : tensor<1x32xbf16>, tensor<960x32xbf16> -> tensor<1x960xbf16>
%1689 = linalg.generic ... ins(%1688 : tensor<1x960xbf16>)
  outs(%1687 : tensor<1x960xf32>) {
^bb0(%in: bf16, %out: f32):
  %3561 = arith.extf %in : bf16 to f32
  linalg.yield %3561 : f32
}

Full NPU lowering still completes on the same file:

npu_kernel.*: 0 after final pass
npu_schedule.*: 0 after final pass
npu_isa.matmul_mxu0: 185
npu_isa.vexp: 59

11.3 NPU model execution validation (implemented)

Scripts validated:

compiler/src/merlin/Dialect/NPU/scripts/run_npu_model_smoke.sh
compiler/src/merlin/Dialect/NPU/scripts/run_frontend_to_npu_model_e2e.sh

Validated symbols/flows:

npu_uk_matmul_bf16_bf16_bf16
npu_uk_gemma_attention_bf16_bf16
frontend hook e2e from post-global-opt-hook.mlir

Support updates made for this:

compiler/src/merlin/Dialect/NPU/scripts/emit_symbol_ukernel_isa.sh
  parses symbol-suffixed types for bf16/fp8/f32/i8/i32 families
compiler/src/merlin/Dialect/NPU/scripts/check_isa_contract.py
  handles current third_party/npu_model/npu_model/configs/isa_definition.py path

11.4 Gemmini FP8 route (implemented)

Files updated:

compiler/plugins/target/Gemmini/GemminiOptions.h
compiler/plugins/target/Gemmini/GemminiOptions.cpp
compiler/plugins/target/Gemmini/PluginRegistration.cpp
compiler/src/merlin/Dialect/Gemmini/Transforms/Passes.h
compiler/src/merlin/Dialect/Gemmini/Transforms/ConvertToGemmini.cpp
compiler/src/merlin/Dialect/Gemmini/IR/GemminiOps.cpp
compiler/src/merlin/Dialect/Gemmini/Transforms/LowerGemminiToIREE.cpp

New option:

--iree-gemmini-enable-fp8-matmul

Behavior:

recovers FP8 matmul forms (f8E4M3FN/f8E4M3FN -> bf16|f32) into gemmini.matmul
keeps default behavior conservative (FP8 recovery off unless flag is enabled)
lower-to-ISA and lower-back-to-IREE paths now handle FP8 matmul variants

Real-model (quantfix3 global-opt) util.func coverage with current default path:

gemmini.matmul_tile: 66
gemmini.matmul: 0 (all staged)
remaining linalg.matmul: 1

Tests added/updated:

compiler/src/merlin/Dialect/Gemmini/Transforms/tests/post-global-opt-hook.mlir
  includes FP8 function and checks with --iree-gemmini-enable-fp8-matmul
compiler/src/merlin/Dialect/Gemmini/Transforms/tests/matmul-lower-to-isa.mlir
  FP8 gemmini.matmul -> gemmini.matmul_tile
compiler/src/merlin/Dialect/Gemmini/Transforms/tests/lower-gemmini-to-iree.mlir
  FP8 tile lowering includes arith.extf path

12. Attention Flags Experiment (`2026-03-13`)

Tested IREE flags suggested for attention:

--iree-global-opt-enable-attention-v-transpose
--iree-linalgext-attention-softmax-max=65504

Compile commands were run both with and without those flags using:

tools/compile.py ... --compile-to global-optimization
phase dumps:
  tmp/smolvla_global_opt_phases_attnflags0 (without)
  tmp/smolvla_global_opt_phases_attnflags1 (with)

Result:

module.1.input.mlir hashes are identical
module.4.global-optimization.mlir hashes are identical
key metrics are unchanged (23 f32 score matmul islands in global-opt)

Conclusion for current tree + model path:

These two flags are accepted by the compiler but currently have no effect on this specific SmolVLA input/lowering path.
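The hash comparison is easy to repeat for any pair of phase-dump directories. A minimal sketch, assuming both directories contain the two phase files named in this post; the function names are hypothetical.

```python
import hashlib
from pathlib import Path

PHASE_FILES = ("module.1.input.mlir", "module.4.global-optimization.mlir")


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large MLIR dumps are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dumps_identical(dir_a: str, dir_b: str) -> dict[str, bool]:
    """Map each phase file name to whether its hash matches across both dirs."""
    return {
        name: sha256_of(Path(dir_a) / name) == sha256_of(Path(dir_b) / name)
        for name in PHASE_FILES
    }
```

For the experiment above this would be called with tmp/smolvla_global_opt_phases_attnflags0 and tmp/smolvla_global_opt_phases_attnflags1; both entries coming back true corresponds to the "hashes are identical" result.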

13. Reproducible Export -> Global-Opt -> NPU ISA -> `npu_model`

This is the concrete path from exported SmolVLA MLIR to NPU simulator input.

13.1 Compile exported SmolVLA to global-opt with phase dumps

# Baseline target bundle
conda run -n merlin-dev uv run tools/compile.py \
  models/smolVLA/smolVLA.q.fp8.mlir \
  --target spacemit_x60 \
  --quantized \
  --compile-to global-optimization \
  --dump-phases

# NPU plugin path (from models/npu_ucb.yaml)
conda run -n merlin-dev uv run tools/compile.py \
  models/smolVLA/smolVLA.q.fp8.mlir \
  --target npu_ucb \
  --quantized \
  --compile-to global-optimization \
  --dump-phases

# Gemmini plugin path (from models/gemmini_mx.yaml)
conda run -n merlin-dev uv run tools/compile.py \
  models/smolVLA/smolVLA.q.fp8.mlir \
  --target gemmini_mx \
  --quantized \
  --compile-to global-optimization \
  --dump-phases

SmolVLA-specific compile knobs (--iree-input-type=torch, --iree-opt-const-eval=false, demote-contraction flag) are now encoded in models/*.yaml under models: smolVLA, so users do not pass those manually.

13.2 Always inspect the two key phase outputs

# Example for the npu_ucb fp8 build output directory:
ls -lh \
  build/compiled_models/smolVLA/npu_ucb_RVV_smolVLA.q.fp8/phases/module.1.input.mlir \
  build/compiled_models/smolVLA/npu_ucb_RVV_smolVLA.q.fp8/phases/module.4.global-optimization.mlir

conda run -n merlin-dev uv run python tools/analyze_quant_ir.py \
  build/compiled_models/smolVLA/npu_ucb_RVV_smolVLA.q.fp8/phases

The phase dumps are intentionally under build/compiled_models/... so they are treated as generated build artifacts, not source-controlled inputs.

These are the two files we now treat as the ground truth for matching and lowering:

module.1.input.mlir
module.4.global-optimization.mlir

13.3 Lower global-opt IR to NPU dialect/ISA

BIN=build/host-merlin-release/tools
DUMP=build/compiled_models/smolVLA/npu_ucb_RVV_smolVLA.q.fp8/phases

"${BIN}/iree-opt" "${DUMP}/module.4.global-optimization.mlir" \
  --iree-plugin=npu \
  --mlir-print-op-generic \
  --pass-pipeline='builtin.module(convert-linalg-to-npu-kernel,npu-verify-ukernel-symbols,convert-npu-kernel-to-schedule,npu-verify-ukernel-symbols,convert-npu-schedule-to-isa)' \
  > "${DUMP}/module.4.global-optimization.npu-isa.mlir"

Note: --mlir-print-op-generic is required here so non-NPU attrs/ops from global-opt remain parseable by npu-translate.

13.4 Translate NPU ISA MLIR to text ISA

"${BIN}/npu-translate" \
  --allow-unregistered-dialect \
  --mlir-to-npu-text-isa \
  "${DUMP}/module.4.global-optimization.npu-isa.mlir" \
  > "${DUMP}/smolvla.program.isa.txt"

13.5 Validate ISA contract and run `npu_model` smoke

conda run -n merlin-dev uv run python \
  compiler/src/merlin/Dialect/NPU/scripts/check_isa_contract.py \
  "${DUMP}/smolvla.program.isa.txt"

conda run -n merlin-dev uv run python \
  third_party/npu_model/compiler/scripts/run_simulator_smoke.py \
  "${DUMP}/smolvla.program.isa.txt"

The above contract-check + smoke path was validated using:

real compile dump: tmp/smolvla_global_opt_phases_verify_npu_ucb_real
ISA MLIR: tmp/smolvla_global_opt_phases_quantfix12_softmax_fastpath_rerun/module.4.global-optimization.npu-isa.blog-generic.mlir
ISA text: tmp/smolvla_global_opt_phases_quantfix12_softmax_fastpath_rerun/smolvla.blog-generic.isa.txt

13.6 Quick matcher checks (exact commands)

# The character class includes digits so op names such as npu_isa.matmul_mxu0 are not truncated.
rg -o "gemmini\\.[a-zA-Z0-9_]+" -N \
  tmp/smolvla_global_opt_phases_verify_gemmini_mx_real2/module.4.global-optimization.mlir \
  | sort | uniq -c

rg -o "npu_isa\\.[a-zA-Z0-9_]+" -N \
  tmp/smolvla_global_opt_phases_verify_npu_ucb_real/module.4.global-optimization.mlir \
  | sort | uniq -c
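Where rg is not available, the same per-op histogram can be produced in a few lines of Python. This is a sketch with a hypothetical function name; its pattern admits digits so that names such as npu_isa.matmul_mxu0 are kept whole.

```python
import re
from collections import Counter


def dialect_op_histogram(mlir_text: str, dialect: str) -> Counter:
    """Count occurrences of '<dialect>.<op>' names in MLIR text."""
    pattern = re.compile(re.escape(dialect) + r"\.[A-Za-z0-9_]+")
    return Counter(pattern.findall(mlir_text))
```

Typical use is `dialect_op_histogram(Path(dump).read_text(), "npu_isa")` with `Path` from pathlib, then printing `most_common()` to get roughly what `sort | uniq -c` shows.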

Dev-blog written by: Agustin Coppari Hollmann

Project Members: Yufeng Chi and Agustin Coppari Hollmann
