2026-03-12: SmolVLA FP8/INT8 Global-Optimization Workstream
Repro pin: merlin@cf8c3ac1 · iree_bar@68acd99c74
Status: Active
At a glance
This entry covers three distinct workstreams in the order the underlying work landed. Jump to whichever you need:
- §0–§13 — SmolVLA compilation & precision (export, global-opt, FP8/INT8, NPU/Gemmini matching). Original 2026-03-12 → 2026-03-13 work.
- §14–§16 — AutoComp kernel library, tile loops, cross-invocation orchestration. 2026-04-11 → 2026-04-13 follow-up.
- §17 — Golden-data testing ladder. 2026-04-13 (later) infrastructure work.
Related entries:
- 2026-04-06 OPU VOPACC coverage — narrow-M matmul findings cross-cut this work.
0. SmolVLA-specific Bootstrap (Verified 2026-03-13)
This section adds the SmolVLA-specific repos and exports on top of the generic Merlin bootstrap. For the generic conda env / submodule sync / host build steps, follow getting_started.md first.
0.1 Clone third-party repos used by SmolVLA export
From repository root:
git clone -b mlir-smolvla \
https://github.com/ucb-bar/Understanding-PI0.git \
third_party/Understanding-PI0
git clone https://github.com/huggingface/lerobot.git \
third_party/lerobot
Both clones are required: Understanding-PI0/pyproject.toml declares
lerobot as an editable path dependency (../lerobot), so lerobot
must be present before running uv sync in the next step.
0.2 Set up Understanding-PI0 Python environment
From repository root:
Note — evdev is excluded by default. The lerobot → pynput → evdev chain is only needed for physical input-device handling, not MLIR export. evdev fails to build when the conda cross-compiler's sysroot headers are older than the system kernel headers (the build reads /usr/include/linux/input-event-codes.h to generate code, then compiles against the conda sysroot, which lacks newer constants like KEY_ACCESSIBILITY). The pyproject.toml overrides evdev's marker so it is never resolved.
0.3 Build Merlin tools with NPU plugin enabled
Sanity check:
Expected plugin list includes npu.
0.4 Branch linkage for patched third-party trees
For this workstream, behavior depends on patched trees in:
- third_party/Understanding-PI0
- third_party/iree-turbine
- third_party/iree_bar
- third_party/torch-mlir
Important:
- the Torch-input legalization used by the compile flow also depends on third_party/iree_bar/third_party/torch-mlir as checked out by third_party/iree_bar
- after syncing any of the compiler-side trees above, rebuild build/host-merlin-release/ before retrying the SmolVLA compile flow
As long as your third_party/* checkouts point at the patched branch/commit,
the behavior documented here applies. Use this check:
for d in third_party/Understanding-PI0 third_party/iree-turbine third_party/iree_bar third_party/torch-mlir; do
echo "== $d =="
git -C "$d" rev-parse --abbrev-ref HEAD
git -C "$d" rev-parse --short HEAD
done
When upstreaming, push each repository's branch independently and open separate PRs per repository.
0.5 Export SmolVLA MLIRs through models/smolVLA
The new wrapper script interfaces with third_party/Understanding-PI0 and
produces all three required MLIR variants:
- models/smolVLA/smolVLA.mlir (fp32 path)
- models/smolVLA/smolVLA.q.int8.mlir (forced int8 path)
- models/smolVLA/smolVLA.q.fp8.mlir (mixed fp8/int8 path)
If these files are not present, that means export has not been run yet in this workspace (they are generated artifacts, not static committed inputs).
Run:
Quick command preview without execution:
conda run -n merlin-dev uv run models/smolVLA/export_smolvla.py \
--mode all \
--device cuda \
--dry-run
Single-mode exports are also supported:
conda run -n merlin-dev uv run models/smolVLA/export_smolvla.py --mode fp32 --device cuda
conda run -n merlin-dev uv run models/smolVLA/export_smolvla.py --mode int8 --device cuda
conda run -n merlin-dev uv run models/smolVLA/export_smolvla.py --mode fp8 --device cuda
0.6 Direct Understanding-PI0 workflow (optional)
If you want the original direct flow inside third_party/Understanding-PI0,
use:
cd third_party/Understanding-PI0
uv run python scripts/quantizing_smolvla/inspect_fqns.py \
--model-id lerobot/smolvla_base \
--device cuda
uv run python scripts/quantizing_smolvla/quantize_mx.py \
--model-id lerobot/smolvla_base \
--device cuda \
--out reports/smolvla_mx/smolvla_mx_quantized.pt \
--report-json reports/smolvla_mx/quant_report.json
uv run python scripts/quantizing_smolvla/validate_one_step.py \
--model-id lerobot/smolvla_base \
--device cuda
uv run python scripts/export_iree.py \
--model-id lerobot/smolvla_base \
--device cuda \
--print-readable \
--out reports/smolvla_mx/smolvla_one_step.mlir
0.7 Minimum Reproducible Flow
Use this exact sequence from repo root:
# 1) Export mixed fp8/int8 SmolVLA MLIR
conda run -n merlin-dev uv run models/smolVLA/export_smolvla.py --mode fp8 --device cuda
# 2) Compile to global-opt with NPU target (short command)
conda run -n merlin-dev uv run tools/merlin.py compile \
models/smolVLA/smolVLA.q.fp8.mlir \
--target npu_ucb \
--quantized \
--compile-to global-optimization \
--dump-phases
If step 2 fails with:
error: failed to legalize operation 'torch.operator' that was explicitly marked illegal
... torch.prims.device_put ...
that is a compiler-stack mismatch, not an export-format requirement.
The CUDA export is valid for this flow. The failing compiler build is missing the
custom Torch-input lowering that forwards torch.prims.device_put during
Torch-to-IREE legalization.
Use this quick check before retrying:
rg -n "torch.prims.device_put|createConvertCustomQuantOpPass" \
third_party/iree_bar/third_party/torch-mlir/lib/Dialect/TorchConversion/Transforms/ConvertCustomQuantOp.cpp \
third_party/iree_bar/compiler/plugins/input/Torch/InputConversion/Passes.cpp
Then rebuild the compiler and rerun the compile step:
conda run -n merlin-dev uv run tools/merlin.py setup submodules \
--submodules-profile core \
--submodule-sync
conda run -n merlin-dev uv run tools/merlin.py build \
--profile npu \
--config release \
--no-build-python-bindings \
--no-enable-libbacktrace
Exporting with --device cpu may avoid the device_put op, but that should be
treated as a temporary workaround for an out-of-date compiler tree, not the
expected setup for this workstream.
Core output directory from step 2: build/compiled_models/smolVLA/npu_ucb_RVV_smolVLA.q.fp8/phases
Required files to inspect every run:
- module.1.input.mlir
- module.4.global-optimization.mlir
Then run NPU text ISA + simulator:
BIN=build/host-merlin-release/tools
DUMP=build/compiled_models/smolVLA/npu_ucb_RVV_smolVLA.q.fp8/phases
"$BIN/iree-opt" "$DUMP/module.4.global-optimization.mlir" \
--iree-plugin=npu \
--mlir-print-op-generic \
> "$DUMP/module.4.global-optimization.generic.mlir"
"$BIN/npu-translate" \
--allow-unregistered-dialect \
--mlir-to-npu-text-isa \
"$DUMP/module.4.global-optimization.generic.mlir" \
> "$DUMP/smolvla.program.isa.txt"
conda run -n merlin-dev uv run python \
compiler/src/merlin/Dialect/NPU/scripts/check_isa_contract.py \
"$DUMP/smolvla.program.isa.txt"
conda run -n merlin-dev uv run python \
third_party/npu_model/compiler/scripts/run_simulator_smoke.py \
"$DUMP/smolvla.program.isa.txt"
Where generated files live:
- exported MLIRs: models/smolVLA/ (generated, gitignored)
- compiled artifacts/phases: build/compiled_models/... (build outputs, not committed)
1. Context and Goal
Goal for this workstream:
- compile the real SmolVLA graph (not a PoC) to --compile-to=global-optimization
- inspect both dumps every run: module.1.input.mlir and module.4.global-optimization.mlir
- push computation toward low precision (fp8/int8) and avoid unnecessary QDQ-style expansion in runtime regions
Primary compile flow used for the latest rerun:
conda run -n merlin-dev uv run tools/compile.py \
third_party/Understanding-PI0/reports/smolvla_mx/smolvla_one_step.mlir \
--target spacemit_x60 \
--quantized \
--compile-to global-optimization \
--dump-compilation-phases-to tmp/smolvla_global_opt_phases_quantfix12_softmax_fastpath_rerun \
--iree-compile-arg=--iree-input-type=torch \
--iree-compile-arg=--iree-opt-const-eval=false \
--iree-compile-arg=--iree-global-opt-enable-demote-contraction-inputs-to-bf16=matmul
2. Why Vanilla IREE Was Not Enough For This Case
Vanilla IREE is semantically correct, but this exported graph already contains explicit decode/dequant math for MX payloads (ui8 -> i32 -> bitcast f32 -> clamp/where -> bf16).
That means global optimization can hoist and fuse parts, but it will not "re-infer" native FP8 attention kernels from an already expanded decode + mixed-precision graph.
Original exported graph snippet (decode path)
%124 = torch.prims.convert_element_type %123, %int3_215
: !torch.vtensor<[768,24],ui8>, !torch.int -> !torch.vtensor<[768,24],si32>
%125 = torch.operator "torch.aten.bitwise_left_shift.Tensor_Scalar"(%124, %int23)
: (!torch.vtensor<[768,24],si32>, !torch.int) -> !torch.vtensor<[768,24],si32>
%126 = torch.aten.view.dtype %125, %int6_216
: !torch.vtensor<[768,24],si32>, !torch.int -> !torch.vtensor<[768,24],f32>
%128 = torch.aten.clamp %127, %float1.175490e-38, %none_217
: !torch.vtensor<[768,24],f32>, !torch.float, !torch.none -> !torch.vtensor<[768,24],f32>
%131 = torch.aten.where.self %129, %128, %130
: !torch.vtensor<[768,24],i1>, !torch.vtensor<[768,24],f32>, !torch.vtensor<[768,24],f32>
-> !torch.vtensor<[768,24],f32>
%132 = torch.prims.convert_element_type %131, %int15_228
: !torch.vtensor<[768,24],f32>, !torch.int -> !torch.vtensor<[768,24],bf16>
3. Compiler Modifications
3.1 Preserve attention region compute type
File:
third_party/iree_bar/compiler/plugins/input/Torch/InputConversion/ConvertTMTensorToLinalgExt.cpp
Change:
- removed forced f32 region argument for iree_linalg_ext.attention
- region arg now uses frontend-derived targetType
// Preserve the frontend attention compute type instead of forcing f32.
block->addArgument(targetType, loc);
3.2 Keep attention decomposition in region score type
File:
third_party/iree_bar/compiler/src/iree/compiler/Dialect/LinalgExt/IR/AggregatedOpInterfaceImpl.cpp
Change:
- removed hardcoded f32 in attention decomposition
- uses scoreType from attention region argument
Type scoreType = getRegion().front().getArgument(0).getType();
Value s = computeQKAndElementwise(..., scoreType, ...);
3.3 Keep online-attention conversion in region score type
File:
third_party/iree_bar/compiler/src/iree/compiler/Dialect/LinalgExt/Transforms/TileAttention.cpp
Change:
- removed hardcoded f32 temp buffers/inits
- uses attention region score type end-to-end in this conversion
3.4 Torch softmax low-precision folding/decomposition
Files (both trees patched):
- third_party/iree_bar/third_party/torch-mlir/lib/Dialect/Torch/Transforms/DecomposeComplexOps.cpp
- third_party/torch-mlir/lib/Dialect/Torch/Transforms/DecomposeComplexOps.cpp
Changes:
- added FoldSoftmaxIntConvertElementTypeOp
- AtenSoftmaxInt fast-path: decompose directly in bf16/f16 when softmax is immediately cast down
- for bf16/f16 result types, force softmax accumulator type to bf16/f16
Result:
global linalg.softmax moved from f32 to bf16 for the 23 softmax islands
3.5 Added quant IR analyzer
File:
tools/analyze_quant_ir.py
Capabilities:
- compares module.1.input.mlir and module.4.global-optimization.mlir
- typed softmax counters (linalg.softmax f32 vs linalg.softmax bf16)
- matmul signature counters and enforcement thresholds
4. Stage-by-Stage Results
Metrics from tools/analyze_quant_ir.py.
| Stage | input attention region f32 | global attention region f32 | global linalg.softmax f32 | global linalg.softmax bf16 | global batch_matmul f32*bf16->f32 | global batch_matmul bf16*bf16->f32 | global truncf f32->bf16 outside initializer |
|---|---|---|---|---|---|---|---|
| quantfix3 | 36 | 36 | 23 | 0 | 23 | 0 | 213 |
| quantfix6_attn_online_dtype | 0 | 0 | 23 | 0 | 23 | 0 | 213 |
| quantfix8_demote_contract_bf16 | 0 | 0 | 23 | 0 | 0 | 23 | 259 |
| quantfix11_softmax_fastpath | 0 | 0 | 0 | 23 | 0 | 23 | 259 |
| quantfix12_softmax_fastpath_rerun | 0 | 0 | 0 | 23 | 0 | 23 | 259 |
Interpretation:
- Attention region type forcing bug is fixed.
- Softmax precision blocker is improved: softmax now bf16 in global-opt.
- Remaining primary blocker is the score-path matmul accumulation pattern: linalg.batch_matmul bf16*bf16 -> f32 (23 instances).
5. Expandable MLIR Evidence
Input (`quantfix12`) still has f32 score matmul islands
%2462 = linalg.batch_matmul
ins(%collapsed_1831, %collapsed_1832
: tensor<15x291x64xf32>, tensor<15x64x291xf32>)
outs(%2461 : tensor<15x291x291xf32>) -> tensor<15x291x291xf32>
%2467 = linalg.generic ... ins(%2465 : tensor<1x15x291x291xf32>)
outs(%2466 : tensor<1x15x291x291xbf16>) {
^bb0(%in: f32, %out: bf16):
%5333 = arith.truncf %in : f32 to bf16
linalg.yield %5333 : bf16
}
Global-opt (`quantfix12`) now has bf16 softmax but score matmul still f32 output
%1827 = linalg.batch_matmul
ins(%collapsed_1078, %1826
: tensor<15x291x64xbf16>, tensor<15x64x291xbf16>)
outs(%1814 : tensor<15x291x291xf32>) -> tensor<15x291x291xf32>
%1832 = linalg.softmax dimension(2)
ins(%1831 : tensor<15x291x291xbf16>)
outs(%1830 : tensor<15x291x291xbf16>) -> tensor<15x291x291xbf16>
Input (`quantfix12`) attention op region is bf16 (fixed)
6. How IREE Handles FP8 vs Attention in This Tree
6.1 What exists already
- FP8 arithmetic coverage exists in e2e tests: third_party/iree_bar/tests/e2e/linalg/small_float_arith.mlir
- ROCm FP8 matmul ukernel path exists: third_party/iree_bar/compiler/plugins/target/ROCM/builtins/mlir_ukernel/iree_uk_amdgpu_matmul_f8E4M3FN.mlir
- Global-opt can keep many model matmuls in mixed low precision with explicit f8E4M3FN -> bf16 scaling semantics.
6.2 What is still f32-centric
- Attention and softmax e2e references are mainly f32-typed: third_party/iree_bar/tests/e2e/linalg_ext_ops/attention.mlir, third_party/iree_bar/tests/e2e/linalg_ext_ops/attention_i1_mask.mlir, third_party/iree_bar/tests/e2e/linalg/softmax.mlir
- LLVMGPU attention scheduling code currently models attention matmuls with f32 ctype assumptions in schedule construction (KernelConfig.cpp).
Practical consequence for our pipeline:
- IREE does have FP8 matmul support.
- It does not automatically turn a decomposed score/softmax/value path into fully FP8 attention math when the frontend graph already carries mixed/f32 score semantics.
7. Current Status for NPU/Gemmini Matching
Implemented and validated now:
- NPU conversion pass now rewrites: linalg.batch_matmul -> npu_kernel.matmul, linalg.softmax -> npu_schedule.softmax_fragment, iree_linalg_ext.attention -> npu_kernel.ukernel_generic (npu_uk_gemma_attention_*)
- NPU verifier now accepts batched matmul-like ranks and has explicit attention-family checks.
- Gemmini conversion now handles named linalg.matmul in addition to existing generic recovery.
- Gemmini plugin include/wiring blocker fixed (GemminiOptions.cpp include path).
8. Real-Model Coverage Check (Global-Opt Dump)
Validation was run on:
tmp/smolvla_global_opt_phases_quantfix12_softmax_fastpath_rerun/module.4.global-optimization.mlir
8.1 NPU conversion coverage
Before convert-linalg-to-npu-kernel:
- linalg.batch_matmul: 46
- linalg.softmax: 23
- iree_linalg_ext.attention: 36
After convert-linalg-to-npu-kernel:
- linalg.batch_matmul: 0
- linalg.softmax: 0
- iree_linalg_ext.attention: 0
- npu_kernel.matmul: 113
- npu_kernel.ukernel_generic: 36
- npu_schedule.softmax_fragment: 23
After full NPU pipeline (... -> convert-npu-schedule-to-isa):
- npu_kernel.*: 0
- npu_schedule.*: 0
- npu_isa.matmul_mxu0: 185
- npu_isa.vexp: 59
- npu_isa.vmul: 118
NPU-lowered snippet (real global-opt file)
9. Remaining Precision Blocker
We have removed the worst QDQ runtime behavior (bitcast/trunc now hoisted to initializers in this run), and softmax is bf16 in global-opt.
Remaining blocker before "near-zero fp32":
- 23 score-path linalg.batch_matmul islands in global-opt remain bf16*bf16 -> f32.
- In NPU-kernel-lowered form this appears as 24 npu_kernel.matmul bf16*bf16 -> f32 (one additional non-batch contraction path).
10. Next Minimal Steps (Not Overdoing)
- Keep current matcher path as baseline: it is now complete for both NPU and Gemmini on real global-opt forms.
- If we decide to reduce fp32 further, add one targeted rewrite for score-matmul accumulators only (instead of broad precision changes).
- Keep tools/analyze_quant_ir.py as a gate on module.1.input.mlir and module.4.global-optimization.mlir.
11. 2026-03-13 Continuation
11.1 Quantfix3 input/global-opt analysis (the two files we now always check)
Files:
- tmp/smolvla_global_opt_phases_quantfix3/module.1.input.mlir
- tmp/smolvla_global_opt_phases_quantfix3/module.4.global-optimization.mlir
Current counts in these dumps:
- input linalg.batch_matmul ... -> tensor<...xf32>: 23
- global-opt linalg.batch_matmul ... -> tensor<...xf32>: 23
- global-opt linalg.softmax ... -> tensor<...xf32>: 23
Global-opt f32 island example (`quantfix3`)
11.2 NPU island handling (implemented)
File updated:
compiler/src/merlin/Dialect/NPU/Transforms/ConvertLinalgToNPUKernel.cpp
What changed:
- generalized demotion rewrite from only bf16 x bf16 -> f32 to float score paths including f32 x bf16 -> f32
- inserted explicit tensor elementwise casts:
  - cast float operands to bf16 when needed (truncf)
  - run npu_kernel.matmul in bf16
  - widen back to f32 with extf only where IR still expects f32
Result on real quantfix3 global-opt after convert-linalg-to-npu-kernel:
- npu_kernel.matmul ... -> tensor<...xf32>: 0
- npu_kernel.matmul ... -> tensor<...xbf16>: 47
- linalg.batch_matmul ... -> tensor<...xf32>: 0
NPU-demoted snippet (`quantfix3` global-opt lowered form)
%1688 = npu_kernel.matmul %expanded, %__hoisted_tensor_960x32xbf16
: tensor<1x32xbf16>, tensor<960x32xbf16> -> tensor<1x960xbf16>
%1689 = linalg.generic ... ins(%1688 : tensor<1x960xbf16>)
outs(%1687 : tensor<1x960xf32>) {
^bb0(%in: bf16, %out: f32):
%3561 = arith.extf %in : bf16 to f32
linalg.yield %3561 : f32
}
Full NPU lowering still completes on the same file:
- npu_kernel.*: 0 after final pass
- npu_schedule.*: 0 after final pass
- npu_isa.matmul_mxu0: 185
- npu_isa.vexp: 59
11.3 NPU model execution validation (implemented)
Scripts validated:
- compiler/src/merlin/Dialect/NPU/scripts/run_npu_model_smoke.sh
- compiler/src/merlin/Dialect/NPU/scripts/run_frontend_to_npu_model_e2e.sh
Validated symbols/flows:
- npu_uk_matmul_bf16_bf16_bf16
- npu_uk_gemma_attention_bf16_bf16
- frontend hook e2e from post-global-opt-hook.mlir
Support updates made for this:
- compiler/src/merlin/Dialect/NPU/scripts/emit_symbol_ukernel_isa.sh: parses symbol-suffixed types for bf16/fp8/f32/i8/i32 families
- compiler/src/merlin/Dialect/NPU/scripts/check_isa_contract.py: handles the current third_party/npu_model/npu_model/configs/isa_definition.py path
11.4 Gemmini FP8 route (implemented)
Files updated:
- compiler/plugins/target/Gemmini/GemminiOptions.h
- compiler/plugins/target/Gemmini/GemminiOptions.cpp
- compiler/plugins/target/Gemmini/PluginRegistration.cpp
- compiler/src/merlin/Dialect/Gemmini/Transforms/Passes.h
- compiler/src/merlin/Dialect/Gemmini/Transforms/ConvertToGemmini.cpp
- compiler/src/merlin/Dialect/Gemmini/IR/GemminiOps.cpp
- compiler/src/merlin/Dialect/Gemmini/Transforms/LowerGemminiToIREE.cpp
New option:
--iree-gemmini-enable-fp8-matmul
Behavior:
- recovers FP8 matmul forms (f8E4M3FN/f8E4M3FN -> bf16|f32) into gemmini.matmul
- keeps default behavior conservative (FP8 recovery off unless the flag is enabled)
- lower-to-ISA and lower-back-to-IREE paths now handle FP8 matmul variants
Real-model (quantfix3 global-opt) util.func coverage with current default path:
- gemmini.matmul_tile: 66
- gemmini.matmul: 0 (all staged)
- remaining linalg.matmul: 1
Tests added/updated:
- compiler/src/merlin/Dialect/Gemmini/Transforms/tests/post-global-opt-hook.mlir: includes FP8 function and checks with --iree-gemmini-enable-fp8-matmul
- compiler/src/merlin/Dialect/Gemmini/Transforms/tests/matmul-lower-to-isa.mlir: FP8 gemmini.matmul -> gemmini.matmul_tile
- compiler/src/merlin/Dialect/Gemmini/Transforms/tests/lower-gemmini-to-iree.mlir: FP8 tile lowering includes arith.extf path
12. Attention Flags Experiment (2026-03-13)
Tested IREE flags suggested for attention:
- --iree-global-opt-enable-attention-v-transpose
- --iree-linalgext-attention-softmax-max=65504
Compile commands were run both with and without those flags using:
- tools/compile.py ... --compile-to global-optimization
- phase dumps: tmp/smolvla_global_opt_phases_attnflags0 (without), tmp/smolvla_global_opt_phases_attnflags1 (with)
Result:
- module.1.input.mlir hashes are identical
- module.4.global-optimization.mlir hashes are identical
- key metrics are unchanged (23 f32 score matmul islands in global-opt)
Conclusion for current tree + model path:
- these two flags are accepted by the compiler but are currently a no-op for this specific SmolVLA input/lowering path.
13. Reproducible Export -> Global-Opt -> NPU ISA -> npu_model
This is the concrete path from exported SmolVLA MLIR to NPU simulator input.
13.1 Compile exported SmolVLA to global-opt with phase dumps
# Baseline target bundle
conda run -n merlin-dev uv run tools/compile.py \
models/smolVLA/smolVLA.q.fp8.mlir \
--target spacemit_x60 \
--quantized \
--compile-to global-optimization \
--dump-phases
# NPU plugin path (from models/npu_ucb.yaml)
conda run -n merlin-dev uv run tools/compile.py \
models/smolVLA/smolVLA.q.fp8.mlir \
--target npu_ucb \
--quantized \
--compile-to global-optimization \
--dump-phases
# Gemmini plugin path (from models/gemmini_mx.yaml)
conda run -n merlin-dev uv run tools/compile.py \
models/smolVLA/smolVLA.q.fp8.mlir \
--target gemmini_mx \
--quantized \
--compile-to global-optimization \
--dump-phases
SmolVLA-specific compile knobs (--iree-input-type=torch,
--iree-opt-const-eval=false, demote-contraction flag) are now encoded in
models/*.yaml under models: smolVLA, so users do not pass those manually.
13.2 Always inspect the two key phase outputs
# Example for the npu_ucb fp8 build output directory:
ls -lh \
build/compiled_models/smolVLA/npu_ucb_RVV_smolVLA.q.fp8/phases/module.1.input.mlir \
build/compiled_models/smolVLA/npu_ucb_RVV_smolVLA.q.fp8/phases/module.4.global-optimization.mlir
conda run -n merlin-dev uv run python tools/analyze_quant_ir.py \
build/compiled_models/smolVLA/npu_ucb_RVV_smolVLA.q.fp8/phases
The phase dumps are intentionally under build/compiled_models/... so they are
treated as generated build artifacts, not source-controlled inputs.
These are the two files we now treat as the ground truth for matching and lowering:
- module.1.input.mlir
- module.4.global-optimization.mlir
13.3 Lower global-opt IR to NPU dialect/ISA
BIN=build/host-merlin-release/tools
DUMP=build/compiled_models/smolVLA/npu_ucb_RVV_smolVLA.q.fp8/phases
"${BIN}/iree-opt" "${DUMP}/module.4.global-optimization.mlir" \
--iree-plugin=npu \
--mlir-print-op-generic \
--pass-pipeline='builtin.module(convert-linalg-to-npu-kernel,npu-verify-ukernel-symbols,convert-npu-kernel-to-schedule,npu-verify-ukernel-symbols,convert-npu-schedule-to-isa)' \
> "${DUMP}/module.4.global-optimization.npu-isa.mlir"
Note: --mlir-print-op-generic is required here so non-NPU attrs/ops from
global-opt remain parseable by npu-translate.
13.4 Translate NPU ISA MLIR to text ISA
"${BIN}/npu-translate" \
--allow-unregistered-dialect \
--mlir-to-npu-text-isa \
"${DUMP}/module.4.global-optimization.npu-isa.mlir" \
> "${DUMP}/smolvla.program.isa.txt"
13.5 Validate ISA contract and run npu_model smoke
conda run -n merlin-dev uv run python \
compiler/src/merlin/Dialect/NPU/scripts/check_isa_contract.py \
"${DUMP}/smolvla.program.isa.txt"
conda run -n merlin-dev uv run python \
third_party/npu_model/compiler/scripts/run_simulator_smoke.py \
"${DUMP}/smolvla.program.isa.txt"
The above contract-check + smoke path was validated using:
- real compile dump: tmp/smolvla_global_opt_phases_verify_npu_ucb_real
- ISA MLIR: tmp/smolvla_global_opt_phases_quantfix12_softmax_fastpath_rerun/module.4.global-optimization.npu-isa.blog-generic.mlir
- ISA text: tmp/smolvla_global_opt_phases_quantfix12_softmax_fastpath_rerun/smolvla.blog-generic.isa.txt
13.7 Quick matcher checks (exact commands)
rg -o "gemmini\\.[a-zA-Z_]+" -N \
tmp/smolvla_global_opt_phases_verify_gemmini_mx_real2/module.4.global-optimization.mlir \
| sort | uniq -c
rg -o "npu_isa\\.[a-zA-Z_]+" -N \
tmp/smolvla_global_opt_phases_verify_npu_ucb_real/module.4.global-optimization.mlir \
| sort | uniq -c
14. AutoComp Kernel Library (2026-04-11)
Per-tile NPU kernels generated by AutoComp (third_party/autocomp) with
LLM-driven beam search and cycle-accurate feedback from npu_model.
14.1 Results
| Kernel | Cycles | Method |
|---|---|---|
| matmul (fp8×fp8→bf16) | 198 | autocomp |
| silu | 178 | autocomp |
| elementwise_add | 168 | autocomp |
| elementwise_mul | 200 | autocomp |
| elementwise_div | 220 | hand-written replacement |
| elementwise_sub | 199 | autocomp |
| rope_frequency | 137 | autocomp |
| gelu_tanh | 327 | hand-written replacement |
| softmax | 196 | hand-written |
| rms_norm | 258 | hand-written |
| reduction_sum | 110 | hand-written |
| requant (bf16→fp8) | 107 | hand-written |
| attention (SDPA) | 463 | hand-written materialized QK/softmax/V tile |
| tiled_matmul_2x1 | 297 | autocomp (2 tiles, ~148 cy/tile amortized) |
| fast_gelu (preloaded consts) | 356 | autocomp (slower than standalone 337 cy; constant-DMA cost outweighed savings) |
Kernel sources: benchmarks/SaturnNPU/kernel_library/*.py. Each file exports
test() -> List[Instruction]. The executable manifest is
benchmarks/SaturnNPU/kernel_library/manifest.json and is copied into
third_party/npu_model/npu_model/configs/programs/smolvla_kernel_manifest.json
for representative-program pushes.
14.2 Why kernels were hand-written or replaced
Several kernels (softmax, rms_norm, reduction_sum, requant, attention, and later the replacement div/GELU variants) were faster to write directly than to keep driving autocomp search. Root-causing the failures revealed issues that are structural, not search-quality problems:
- Hardware layout convention. vpack.bf16.fp8 consumes two bf16 inputs as left/right column halves (cat([m_lo, m_hi], dim=1)) — the same layout the MXU produces natively. A row-major 32×32 bf16 tile in DRAM cannot be vloaded into the right shape with a single DMA. The harness fix: present inputs already split at the DRAM level, matching the production flow.
- Default ERF = 0. vpack/vunpack multiply by an ERF scale; without an explicit seli rd=k, imm=1 the output is zeroed. Autocomp missed this hint across 8+ iterations.
- Golden precision. The simulator's reductions promote to fp32 then cast to bf16. Goldens computed entirely in fp32 produced spurious mismatches at the ~0.03 level. Compute goldens stepwise in bf16 to match hardware (see the sketch after this list).
- DMA flag discipline. dma.config and dma.load on the same channel without a dma.wait between them trip a "Flag N already set" assertion.
- addi 12-bit signed immediate. Addresses ≥ 2048 need a lui + addi pair or chained addis through a base register. Silently wrapping to negative is a subtle bug.
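As a concrete illustration of the golden-precision point above, a small torch sketch (assuming a 32-wide row reduction) comparing an all-fp32 golden against a hardware-style stepwise bf16 accumulation:

```python
import torch

x = torch.randn(32, 32, dtype=torch.bfloat16)

# Naive golden: promote once and reduce entirely in fp32.
golden_fp32 = x.float().sum(dim=1)

# Hardware-style golden: each step adds in fp32 but stores back to bf16.
acc = torch.zeros(32, dtype=torch.bfloat16)
for j in range(x.shape[1]):
    acc = (acc.float() + x[:, j].float()).to(torch.bfloat16)

# The difference is pure rounding (a few bf16 ULPs) -- the same ~0.03-level
# "mismatches" the all-fp32 goldens produced.
print((golden_fp32 - acc.float()).abs().max())
```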
14.3 npu_model simulator bugs fixed
Two latent bugs in third_party/npu_model were silently breaking every
autocomp run that exercised vpack.bf16.fp8:
- vpack.bf16.fp8 missing .view(torch.uint8) (configs/isa_definition.py:556). Passed a fp8 tensor to write_mrf_fp8, which asserts uint8. The companion vmatpop_fp8_acc_mxu1 does the view correctly at line 944.
- write_mrf_fp8 value-casts instead of byte-copying (hardware/arch_state.py:131). Did self.mrf[vd].view(float8_e4m3fn)[:] = u8_value, which makes PyTorch value-cast uint8 byte 173 → fp8 +176.0 (stored as byte 115) instead of reinterpreting the bits. Fix: view the target as uint8.
Both patches are in the local npu_model submodule and are prerequisites
for requant / anything downstream of MXU→fp8.
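For reference, a minimal torch sketch of the byte-copy vs value-cast distinction behind the second bug (a standalone illustration, not the simulator code):

```python
import torch

u8_value = torch.tensor([173], dtype=torch.uint8)                  # raw byte produced by the MXU path
mrf = torch.zeros(1, dtype=torch.uint8).view(torch.float8_e4m3fn)  # stand-in for one MRF slot

# Buggy path: assigning through an fp8 view value-casts, so 173 becomes +176.0
# and the stored byte is 115 (0x73) instead of 173 (0xAD).
mrf.view(torch.float8_e4m3fn)[:] = u8_value
print(mrf.view(torch.uint8))   # tensor([115], dtype=torch.uint8)

# Fixed path: view the destination as uint8 and byte-copy, preserving the bits.
mrf.view(torch.uint8)[:] = u8_value
print(mrf.view(torch.uint8))   # tensor([173], dtype=torch.uint8)
print(mrf.float())             # the fp8 value encoded by 0xAD (about -0.40625)
```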
14.4 Harness conventions (production-aligned)
To match MXU's natural layout, kernel harnesses present bf16 tiles pre-split into 32×16 left / right halves:
Reductions (row-sum, row-max, rms) produce a broadcast output:
(32, 16) bf16 where each row is the per-row scalar repeated 16 times.
This is exactly what downstream broadcast ops consume — no reshape needed
between kernels.
Constants (1/N, eps, GELU coefficients) are DMA-preloaded into DRAM
as filled 32×16 bf16 tiles. This avoids scalar-store constant construction
that originally made gelu 796 cy; the preloaded-constant variant landed at
337 cy.
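A minimal torch sketch of these harness conventions (shapes assumed to be the 32×32 tile / 32×16 half layout described above):

```python
import torch

tile = torch.randn(32, 32, dtype=torch.bfloat16)

# Present the tile to the kernel as left/right column halves, as DRAM expects.
left_half, right_half = tile[:, :16].contiguous(), tile[:, 16:].contiguous()

# Row-sum result in the broadcast convention: each row repeats its scalar 16 times.
row_sum = (left_half + right_half).sum(dim=1, keepdim=True)
row_sum_tile = row_sum.expand(32, 16).contiguous()

# DMA-preloaded constant, e.g. 1/N for a mean, as a filled 32x16 bf16 tile.
inv_n_tile = torch.full((32, 16), 1.0 / 32, dtype=torch.bfloat16)
```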
14.5 Reusable kernel template
The hand-written VPU kernels follow the same 30–80-line template:
1. Scalar address setup (x1..x11): VMEM in/out, DRAM in/out, size
2. dma.config ch0 / ch1 → dma.wait ch0 / ch1 (clear flags)
3. dma.load halves + constants in parallel on ch0 / ch1
4. dma.wait ch0 / ch1
5. vload into m0, m1 (+ m2, m3 for constants)
6. VPU compute on bf16, pairwise over halves
7. (If reducing) vredsum.row / vredmax.row per half, then combine via vadd / vmaximum
8. (If requanting) seli ERF=1; vpack.bf16.fp8
9. vstore + dma.store ch0 / ch1
10. dma.wait ch0 / ch1
Cycles = sum of functional-unit latencies + DMA transfer time (1024 B at
64 B/cycle = 16 cycles per DMA). The VPU's non-pipelineable ops (vexp,
vrecip, vsqrt, vtanh) are 8-cycle; simple ops (vadd, vmul,
vsub) are 2-cycle; delay imm=N stalls for N cycles to respect
data-hazard windows (the NPU is in-order with no dependency tracking).
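A rough back-of-envelope cycle model following the figures above (a hypothetical helper, not a repo script):

```python
# 1024 B per DMA at 64 B/cycle = 16 cycles per transfer.
DMA_CYCLES_PER_TRANSFER = 1024 // 64

OP_LATENCY = {
    "dma": DMA_CYCLES_PER_TRANSFER,
    "vexp": 8, "vrecip": 8, "vsqrt": 8, "vtanh": 8,   # non-pipelineable VPU ops
    "vadd": 2, "vmul": 2, "vsub": 2,                   # simple VPU ops
}

def estimate_cycles(ops, delays=0):
    """Sum functional-unit latencies plus explicit delay-barrier stalls."""
    return sum(OP_LATENCY[op] for op in ops) + delays

# Toy elementwise kernel: 2 DMA loads, 2 vadds, 1 DMA store, 4 delay cycles -> 56
print(estimate_cycles(["dma", "dma", "vadd", "vadd", "dma"], delays=4))
```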
14.6 Are we ready to stitch?
Yes, for all 13 kernel families. The missing elementwise-div, GELU tanh,
and attention families now have handwritten kernels checked against
npu_model. The first stitching bridge is implemented as a manifest-backed
native ISA path:
- benchmarks/SaturnNPU/kernel_library/stitch.py concatenates kernel invocations and rewrites only the DRAM address registers used by dma.load.ch<N> / dma.store.ch<N>, preserving per-kernel VMEM scratch.
- compiler/src/merlin/Dialect/NPU/scripts/emit_kernel_manifest_isa.py emits native npu_model text ISA from a kernel sequence or the full manifest.
- compiler/src/merlin/Dialect/NPU/scripts/run_kernel_manifest_smoke.py executes a stitched manifest program directly in npu_model.
- npu_isa.native_inst carries manifest-native instructions through MLIR.
- convert-npu-schedule-to-isa can now lower scheduled matmul, softmax, and ukernel launches by looking them up in the manifest instead of expanding the older abstract npu_isa sequence.
Validation on 2026-04-12:
uv run python compiler/src/merlin/Dialect/NPU/scripts/emit_kernel_manifest_isa.py \
--all --connect-linear --output /tmp/smolvla_manifest_all.isa.txt
uv run python compiler/src/merlin/Dialect/NPU/scripts/check_isa_contract.py \
/tmp/smolvla_manifest_all.isa.txt
# ISA contract check PASSED: 610 instruction(s), 28 mnemonic(s)
uv run python compiler/src/merlin/Dialect/NPU/scripts/run_kernel_manifest_smoke.py \
--all --connect-linear --max-cycles 50000
# stitched manifest smoke PASSED: 13 kernel(s), 610 instruction(s)
uv run tools/merlin.py build --profile npu --cmake-target iree-opt
printf "<npu_schedule.matmul_tile module>" | \
build/host-merlin-debug/tools/iree-opt --iree-plugin=npu \
--iree-npu-native-kernel-lowering \
--iree-npu-kernel-manifest=benchmarks/SaturnNPU/kernel_library/manifest.json \
--pass-pipeline="builtin.module(convert-npu-schedule-to-isa)" - | \
build/host-merlin-debug/tools/npu-translate --mlir-to-npu-text-isa
# emits manifest-native matmul instructions
The demoted SmolVLA artifact was checked with both textual scans and MLIR Python bindings. After stripping the 845 MB dense-resource payload block for parsing, the MLIR binding walk sees 0 f32 compute ops; the only f32 mentions in the raw file are three ABI imports immediately converted to bf16. Dense resources are also recoverable for the actual model weights: 496 references, 494 unique payloads, all with byte counts matching the tensor type after the MLIR four-byte blob prefix.
Remaining work before numerical E2E parity is graph binding and allocation, not
kernel availability: lower every real SmolVLA linalg/schedule op into one of
the manifest families, allocate DRAM regions for activations/constants/weights,
rewrite per-invocation addresses, and compare the resulting full program in
npu_model against the PyTorch reference.
Nice-to-have, non-blocking:
- DMA cross-kernel pipelining (prob 30, tiled_matmul_2x1) — halves DRAM-stall cycles by overlapping kernel N+1's weight load with kernel N's compute.
- Fast gelu (prob 31) — skip for now: it landed at 356 cycles, slower than the current 337-cycle GELU tanh kernel.
15. Graph binding: linalg.generic → NPU kernels (2026-04-12)
Extended ConvertLinalgToNPUKernel.cpp with 5 new pattern matchers that
recognize elementwise and reduction linalg.generic ops produced by the
global-opt phase of the SmolVLA pipeline.
New patterns (file compiler/src/merlin/Dialect/NPU/Transforms/ConvertLinalgToNPUKernel.cpp):
| Pattern | Body shape | Bound kernel |
|---|---|---|
| LowerElementwiseGenericToUKernelPattern("arith.addf") | single addf, all-parallel iterators, identity input maps | npu_uk_elementwise_add |
| LowerElementwiseGenericToUKernelPattern("arith.mulf") | single mulf, all-parallel | npu_uk_elementwise_mul |
| LowerElementwiseGenericToUKernelPattern("arith.subf") | single subf, all-parallel, preserves operand order | npu_uk_elementwise_sub |
| LowerElementwiseGenericToUKernelPattern("arith.divf") | single divf, all-parallel, preserves operand order | npu_uk_elementwise_div |
| LowerRowReductionSumToUKernelPattern | single reduction iterator + addf body combining %in with %out | npu_uk_reduction_sum |
Coverage on real module.5.demoted.mlir:
Unbound linalg.generic 2972 → 2406 (566 ops bound, 19% of gap closed)
elementwise_add 177
elementwise_mul 138
elementwise_sub 114
reduction_sum 214
(matmul 334, attention 36 — already handled by existing patterns)
elementwise_div 0 — real SmolVLA divf cases are mixed with arith.index_cast / sitofp (scalar index math), correctly rejected by the strict matcher
Why the remaining 2406: most are multi-op chains (silu/gelu/rms_norm
decomposed by global-opt into sequences of single-op linalg.generic), type
conversions (truncf alone, 392), fills/copies (empty bodies, 267), and
index math. The chain cases can be covered either by a follow-up fusion pass
(coalesce sequences into a single ukernel_generic) or by letting the
per-op elementwise binding stand — the individual addf/mulf pieces of
the chain already bind; what's missing is the semantic labeling.
Verification:
build/host-merlin-debug/tools/iree-opt --iree-plugin=npu \
--pass-pipeline="builtin.module(convert-linalg-to-npu-kernel)" \
build/compiled_models/smolVLA/npu_ucb_RVV_smolVLA.q.fp8po2/smolVLA.q.fp8po2_global_optimization.mlir \
| rg -c "linalg.generic" # → 2406 (was 2972 before these patterns)
Manifest smoke still passes (13 kernels, 610 instructions) — no regression on the verified kernel library.
15.1 Tile loop generation: TileNPUKernelToSchedule pass (2026-04-13)
New pass at compiler/src/merlin/Dialect/NPU/Transforms/TileNPUKernelToSchedule.cpp.
Reads tile_shape from the kernel manifest and wraps each whole-tensor
npu_kernel.ukernel_generic op in an scf.for tile loop nest, emitting
npu_schedule.ukernel_launch at tile granularity with
tensor.extract_slice / tensor.insert_slice for input/output plumbing.
Option: --kernel-manifest=<path> (when absent, defaults to [32, 32]).
Behavior:
- Whole-tensor binary elementwise on (M, N) bf16 where both dims > tile_shape
→ emits nested scf.for + per-tile ukernel_launch.
- Whole-tensor ops already at tile shape → pass skips them (returns failure());
they flow through the existing ConvertNPUKernelToSchedule 1:1 pass.
Coverage on module.5.demoted.mlir (combined pipeline):
iree-opt --iree-plugin=npu \
--pass-pipeline="builtin.module(convert-linalg-to-npu-kernel, \
tile-npu-kernel-to-schedule{kernel-manifest=benchmarks/SaturnNPU/kernel_library/manifest.json}, \
convert-npu-kernel-to-schedule)" \
module.5.demoted.mlir
Yields 1013 scheduled ukernel_launch ops + 23 softmax_fragment ops +
450 scf.for tile loops, zero remaining npu_kernel.ukernel_generic.
Distribution:
| Symbol | Count |
|---|---|
| npu_uk_matmul_f8E4M3FN_f8E4M3FN_bf16 | 334 |
| npu_uk_reduction_sum | 214 |
| npu_uk_elementwise_add | 177 |
| npu_uk_elementwise_mul | 138 |
| npu_uk_elementwise_sub | 114 |
| npu_uk_gemma_attention_bf16_bf16 | 36 |
Verification — test MLIR with a 64×64 elementwise_add:
// Input: linalg.generic addf on tensor<64x64xbf16>
// After pipeline:
%0 = tensor.empty() : tensor<64x64xbf16>
%1 = scf.for %i = %c0 to %c64 step %c32 iter_args(%acc1 = %0) {
%2 = scf.for %j = %c0 to %c64 step %c32 iter_args(%acc2 = %acc1) {
%al = tensor.extract_slice %a[%i, %j] [32, 32] [1, 1] ...
%bl = tensor.extract_slice %b[%i, %j] [32, 32] [1, 1] ...
%r = npu_schedule.ukernel_launch "npu_uk_elementwise_add"(%al, %bl) ...
%out = tensor.insert_slice %r into %acc2[%i, %j] [32, 32] [1, 1] ...
scf.yield %out : tensor<64x64xbf16>
}
scf.yield %2 : tensor<64x64xbf16>
}
4 ukernel_launch invocations, one per 32×32 tile, correctly stitched.
15.2 Patch points + per-invocation address rewriting (2026-04-13)
Manifest annotation script at
compiler/src/merlin/Dialect/NPU/scripts/annotate_kernel_patch_points.py
walks each kernel's instruction stream and records:
- register — which scalar xN holds a DRAM address for a given DMA role.
- role — dram_in_K, dram_out_K, or transfer_size.
- original_value — the kernel's hardcoded address (for sanity checks).
- instructions — indices of the addi/lui chain that computes the register.
Populated manifest.json in place. Example for matmul: three patch points
(dram_in_0=0x0, dram_in_1=0x500, dram_out_0=0xb00) each anchored to a
single-addi chain. For elementwise_add, interleaved scalar/DMA layout ends
up with a 2-step chain (lui+addi) for the larger offsets.
Compiler-side address patching landed in
compiler/src/merlin/Dialect/NPU/Transforms/ConvertNPUScheduleToISA.cpp.
The lowering pass now:
- Loads patch_points alongside instructions from the manifest.
- Maintains a shared invocationCounter across all LowerSchedule* patterns.
- For each emitted kernel, clones the instruction list and rewrites the addi/lui chains for dram_in_* / dram_out_* roles. Allocation strategy (first cut): inputBase=0, outputBase=0x400, dramStride=0x100, per-role sub-stride stride/4 (sketched after the verification output below). All offsets fit in a 12-bit signed immediate, so most single-addi chains patch cleanly.
End-to-end verification: compiled a 2-distinct-matmul test through the
full plugin pipeline with --iree-npu-native-kernel-lowering and observed:
invocation 0 (first matmul): rd=1 imm=0, rd=2 imm=64, rd=3 imm=1024
invocation 1 (second matmul): rd=1 imm=256, rd=2 imm=320, rd=3 imm=1280
Distinct DRAM regions per invocation, as expected.
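A small Python sketch of that first-cut allocation rule (the constants mirror inputBase / outputBase / dramStride above; the helper name is hypothetical), checked against the two-matmul verification output:

```python
INPUT_BASE, OUTPUT_BASE, DRAM_STRIDE = 0x0, 0x400, 0x100
ROLE_STRIDE = DRAM_STRIDE // 4   # per-role sub-stride

def patched_address(invocation: int, role: str, index: int) -> int:
    base = INPUT_BASE if role == "dram_in" else OUTPUT_BASE
    return base + invocation * DRAM_STRIDE + index * ROLE_STRIDE

# Reproduces the 2-matmul verification above:
assert patched_address(0, "dram_in", 0) == 0       # rd=1 imm=0
assert patched_address(0, "dram_in", 1) == 64      # rd=2 imm=64
assert patched_address(0, "dram_out", 0) == 1024   # rd=3 imm=1024
assert patched_address(1, "dram_in", 0) == 256     # rd=1 imm=256
assert patched_address(1, "dram_in", 1) == 320     # rd=2 imm=320
assert patched_address(1, "dram_out", 0) == 1280   # rd=3 imm=1280
```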
Known limitation: kernels whose patch chain is 1 instruction cannot represent addresses outside 12-bit signed range without instruction insertion (which would shift all subsequent indices in the stream). Current allocator stays within 2 KB to sidestep this; real SmolVLA needs either (a) manifest prep that pads every 1-addi chain to 2 instructions (lui+addi pair), or (b) a compiler pass that inserts lui instructions and re-computes patch_point indices. Tracked as follow-up.
15.3 npu_model Program scaffolds: SmolVLATransformerBlock & SmolVLAFullProgram (2026-04-13)
New files at third_party/npu_model/npu_model/configs/programs/:
| File | Purpose | State |
|---|---|---|
| smolvla_transformer_block.py | Single transformer block stitched from 12 manifest kernels (rms_norm → Q/K/V matmul → attention → O matmul → residual add → rms_norm → MLP → residual add) | Imports + stitches 518 instructions; simulator run needs stitch.py stride tuning |
| smolvla_full_program.py | Full SmolVLA program scaffold with weight-loader plumbing for the 494 dense_resource payloads in module.5.demoted.mlir | Imports + stitches 12 kernels as placeholder; real-model weights/schedule are follow-up work |
Both import via kernel_library.stitch.stitch_kernels after adding
benchmarks/SaturnNPU/ to sys.path so the package-relative imports
inside stitch.py resolve. Follow the gemma_attention.py precedent.
15.4 Kernel tiling direction (from Slack 2026-04-12)
Aligned on: kernels are the tiling unit — each kernel may consume multi-tile inputs and handle internal accumulation (K-reduction, running max/sum for attention). Sync between consecutive kernels is treated as "another kernel" to profile independently.
This doesn't change the first-cut compiler pipeline (we have only
single-tile kernels in the manifest today — tile_shape: [32, 32]). It
does shape the contract: TileNPUKernelToSchedule reads tile_shape from
the manifest rather than hardcoding 32, so when multi-tile kernels
(tiled_matmul_2x1: [64, 32], etc.) land later, no code change is needed.
A scaffold InsertSyncKernels pass will be wired into the pipeline at that
point so future sync-family manifest entries have a dedicated insertion
point.
Autocomp / manifest cross-check:
Cross-checking manifest.json against autocomp run_metrics.json best
scores (autocomp candidate sources have been pruned; only run metrics
remain):
| Kernel | Manifest cy | Autocomp best cy | Delta |
|---|---|---|---|
| matmul, silu, add/mul/sub, rope | matches | matches | 0 |
| gelu_tanh | 327 | 337 | manifest is 10cy better (post-autocomp tuning) |
| elementwise_div | 220 | 212 | manifest is 8cy stale — autocomp source gone, can't regenerate |
| softmax, rms_norm, reduction_sum, requant, attention | N/A (hand-written) | N/A (autocomp never converged) | — |
| tiled_matmul_2x1 (prob 30) | absent | 297 | 25% per-tile improvement, worth adding when DMA pipelining pass lands |
| fast_gelu (prob 31) | absent | 356 | slower than 327cy gelu_tanh — skip |
16. Cross-Invocation Orchestration: Matmul K-Accumulation + Flash Attention Plumbing (2026-04-13)
Triggered by a Slack question from Huijae An: "if the kernels spat out from merlin are tiled already...does it also orchestrate the results from the tiled computes? for matmul, does it automatically accumulate the partial results, and for attention, keep the running sum?"
The compiler had only single-tile kernels in the manifest (M=K=N=32 for
matmul, seq=32 for attention) — sufficient for the placeholder
SmolVLATransformerBlock smoke but not for SmolVLA's real shapes (matmuls
with K up to 2048, attention at (12, 1024, 64)). This section closes the
matmul gap end-to-end and lands the compiler-side plumbing for flash
attention.
16.1 Pre-existing manifest bug surfaced and fixed
While building the matmul K-accumulator smoke test, the simulator
returned all-zero bytes at the expected output address. Tracing the patched
DMA store revealed every matmul/attention/elementwise manifest kernel
encoded addi rd, x0, K for K outside the signed 12-bit range
([-2048, 2047]) — for example addi rd=8, rs1=0, imm=2048 for a
2048-byte transfer size. RISC-V (and the npu_model simulator) sign-extends
the 12-bit immediate, so 2048 silently became -2048 and the DMA store
copied zero bytes.
Fix in /tmp/fix_manifest_addi.py (idempotent rewrite tool, can live in
compiler/src/merlin/Dialect/NPU/scripts/ next iteration):
- addi rd, x0, K → lui rd, hi(K); addi rd, rd, lo(K) (rs1 == 0 case; immediate split sketched below)
- addi rd, rd, K → lui x31, hi(K); addi x31, x31, lo(K); add rd, rd, x31 (cumulative-bump case in elementwise_add; uses x31 as scratch since no kernel touches regs above x21)
13 overflowing addi ops rewritten across matmul, silu,
elementwise_add, and the new matmul_acc_* variants.
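A minimal sketch of the immediate split that the rewrite applies (standard RV32I lui/addi semantics, where addi sign-extends its 12-bit immediate):

```python
def split_imm(k: int) -> tuple[int, int]:
    """Return (hi, lo) such that (hi << 12) + lo == k and lo fits in [-2048, 2047]."""
    hi = (k + 0x800) >> 12
    lo = k - (hi << 12)
    return hi, lo

# The failing case from the manifest: a 2048-byte transfer size.
hi, lo = split_imm(2048)
assert (hi, lo) == (1, -2048)      # lui rd, 1 ; addi rd, rd, -2048
assert (hi << 12) + lo == 2048     # no silent wrap to -2048 anymore
```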
The annotator (annotate_kernel_patch_points.py) is then re-run to
repopulate patch_points with corrected indices. Without this fix, none
of the manifest kernels produce correct numerics — every prior matmul
output went into negative DRAM addresses (which Python-list-style indexing
in arch_state.write_dram silently translated into writes near the end of
the DRAM tensor).
16.2 Matmul K-tiling: 3 manifest variants + compiler pass
Manifest (benchmarks/SaturnNPU/kernel_library/manifest.json):
- matmul_acc_first — issues vmatmul.mxu0 (overwrite accumulator); skips the pop / vstore / DMA store tail. 21 instructions.
- matmul_acc_mid — same prologue, issues vmatmul.acc.mxu0 (accumulator-add); also skips the drain. 21 instructions.
- matmul_acc_last — vmatmul.acc.mxu0 followed by the original drain (vmatpop.bf16.acc.mxu0 + 2 vstores + DMA store). 28 instructions.
All three reuse the existing matmul kernel's scalar/DMA setup and
vmatpush.weight.mxu0. Synthesized via
/tmp/gen_matmul_acc.py; patch_points populated by the existing
annotator and validated end-to-end.
Compiler pass (TileMatmulOpPattern /
TileMatmulUKernelPattern in
compiler/src/merlin/Dialect/NPU/Transforms/TileNPUKernelToSchedule.cpp):
- Match npu_kernel.matmul and any npu_kernel.ukernel_generic "npu_uk_matmul...".
- Compute tile counts M / mTile × N / nTile × K / kTile (defaults to 32 each, overridable via manifest tile_shape).
- Skip if K_tiles ≤ 1 — the existing 1:1 ConvertNPUKernelToSchedule path handles those.
- Emit nested scf.for over (M, N) with the K-loop unrolled inside the body. K-iteration k → variant symbol selected by position in the chain (sketched below): _first at k=0, _mid for the middle, _last at k=K_tiles-1. Only _last's tensor result is read (the others' MXU state is implicit hardware state and the SSA results are dead).
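A small sketch of the per-K-iteration variant selection described in the last bullet (the helper name is hypothetical):

```python
def matmul_variant(k: int, k_tiles: int) -> str:
    if k == 0:
        return "matmul_acc_first"
    if k == k_tiles - 1:
        return "matmul_acc_last"
    return "matmul_acc_mid"

# K=2 has no _mid; K=3 uses the full first/mid/last triple.
assert [matmul_variant(k, 2) for k in range(2)] == ["matmul_acc_first", "matmul_acc_last"]
assert [matmul_variant(k, 3) for k in range(3)] == [
    "matmul_acc_first", "matmul_acc_mid", "matmul_acc_last",
]
```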
Validation:
- IR shape: tile_matmul_k.mlir (FileCheck-passing) covers K=2 (no _mid) and K=3 (full first/mid/last triple).
- Numerical: MatmulKTileSmokeProgram (in third_party/npu_model/npu_model/configs/programs/matmul_k_tile_smoke.py) runs a 32×64 ⊗ 64×32 matmul as stitch_kernels(["matmul_acc_first", "matmul_acc_last"]) and torch.allcloses against the bf16 reference within rtol=1e-2, atol=1e-2. Max abs diff: 2.0 over outputs ranging up to ±584. Total cycles: 256 for two K-tile invocations.
16.3 Flash attention: compiler plumbing in place; ISA bodies stubbed
The same three-variant approach was registered for attention
(attention_acc_first/mid/last) plus a TileAttentionUKernelPattern
that emits an scf.for nest over (batch, q-block) with the K/V dimension
unrolled inside.
Compiler pass behavior:
- Matches any ukernel_generic whose symbol starts with npu_uk_attention (or npu_uk_gemma_attention) and excludes npu_uk_attention_acc (avoid recursion).
- Skips when K/V seq ≤ 32 (single-tile case → 1:1 path).
- Loop nest:
  - Outer batch loop (only emitted for rank-3 inputs).
  - Inner Q row-block loop in 32-row strides.
  - K/V tile chain unrolled inside the Q body, picking _first / _mid / _last by position.
- Lit test: tile_attention_kv.mlir covers seq_kv=64 (first/last pair) and the rank-3 batched seq_kv=96 (first/mid/last triple).
Stub status of the kernel bodies: the manifest entries
attention_acc_first/mid/last currently clone the existing single-tile
attention kernel ISA (79 instructions each). They have the right
DMA in/out layout for compiler-side address patching, but the actual
online-softmax recurrence (running max + denom + output rescaling
across K/V tiles) is not implemented — each variant just runs a
single-tile attention and writes the same bf16 output. Replacing
the bodies with the real recurrence is the remaining work for
numerical correctness.
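For reference, the online-softmax recurrence those bodies need to implement, written as a torch sketch in fp32 for clarity (the running max, denominator, and unnormalized output correspond to the DRAM-spilled state; this is a reference for the math, not the ISA implementation):

```python
import torch

def flash_attention_reference(q, k, v, tile=32):
    scale = q.shape[-1] ** -0.5
    m = torch.full((q.shape[0], 1), float("-inf"))   # running row max
    denom = torch.zeros(q.shape[0], 1)               # running softmax denominator
    out = torch.zeros(q.shape[0], v.shape[1])        # running unnormalized output
    for start in range(0, k.shape[0], tile):         # one K/V tile per kernel invocation
        s = (q @ k[start:start + tile].T) * scale
        m_new = torch.maximum(m, s.max(dim=1, keepdim=True).values)
        rescale = torch.exp(m - m_new)                # rescale previous state
        p = torch.exp(s - m_new)
        denom = denom * rescale + p.sum(dim=1, keepdim=True)
        out = out * rescale + p @ v[start:start + tile]
        m = m_new
    return out / denom                                # the _last variant's normalize epilogue

q, k, v = (torch.randn(32, 64) for _ in range(3))
assert torch.allclose(
    flash_attention_reference(q, k, v),
    torch.softmax((q @ k.T) * q.shape[-1] ** -0.5, dim=-1) @ v,
    atol=1e-5,
)
```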
16.4 SmolVLAFullProgram: weights wired, PyTorch ref hooked
compiler/src/merlin/Dialect/NPU/scripts/extract_mlir_dense_resources.py
gained a load_dense_resources(mlir_path, base_addr) library function
that returns [(dram_address, torch.Tensor)]. Validated against the
demoted MLIR for SmolVLA: 494 payloads, 421 MB total, mix of bf16 and
f8E4M3FN, packed contiguously starting at the configured base.
SmolVLAFullProgram (in
third_party/npu_model/npu_model/configs/programs/smolvla_full_program.py)
now:
- Calls load_dense_resources at class construction so memory_regions is populated with all 494 SmolVLA weights.
- Exposes compute_pytorch_reference(seed=42) — a callable that loads the bf16 baseline SmolVLA via Understanding-PI0's load_smolvla_policy + one_step_no_cache, with the same dummy inputs the simulator should be given. Heavy imports (lerobot, mx) live inside the function so the Program class stays cheap to construct.
16.5 0-byte DMA simulator assertion
third_party/npu_model/npu_model/hardware/arch_state.py:256's
assert address < offset + data.numel() <= dram_size triggered on
zero-byte DMA writes (the inequality is false when data.numel() == 0).
Replaced with assert data.numel() == 0 or address + data.numel() <= dram_size,
which preserves the bounds check for non-empty writes and treats the
empty case as a no-op.
16.6 What's left for a real E2E numerical SmolVLA run
- Hand-coding (or autocomp-synthesizing) the online-softmax recurrence inside attention_acc_first/mid/last. The math is documented in the plan; the implementation needs roughly 200-300 ISA ops per variant and careful staging of the prev_m / prev_denom / prev_out DRAM state across invocations.
- Running the SmolVLA forward through the full compile pipeline with TileMatmulPattern + TileAttentionPattern enabled and dumping the stitched text ISA, so SmolVLAFullProgram can swap its placeholder 12-kernel sequence for the real schedule.
- Comparing the simulator's final DRAM tile against compute_pytorch_reference() within bf16 tolerance.
17. Golden-Data Testing Ladder (2026-04-13, later)
Pivoted from "finish flash attention then E2E" to "build the testing ladder from single kernels to the full model, reveal hidden manifest bugs along the way." Several real manifest / annotator bugs surfaced and got fixed as a side-effect.
17.1 Per-kernel golden suite (Piece A)
New artifacts:
- benchmarks/SaturnNPU/kernel_library/kernel_golden_fixtures.py — 11 kernel fixtures, each pairing a canonical torch input with a torch reference and expected output. Torch refs reuse generate_npu_golden_tests.py (rms_norm, silu, softmax, gelu, rope) and add ref_reduction_sum, ref_requant, ref_single_tile_attention.
- third_party/npu_model/npu_model/configs/programs/kernel_golden_suite.py — dynamic Program factory: one KernelGolden_<kernel> class per fixture, auto-discovered by test_programs.py.
- compiler/src/merlin/Dialect/NPU/scripts/run_kernel_golden_tests.py — CLI runner with PASS | FAIL | XFAIL per kernel.
Result: 8 of 11 kernels pass torch.allclose: matmul, silu, softmax, requant, rope_frequency, single-tile attention, elementwise_add, elementwise_sub. The 3 xfails are tagged in the fixture with specific manifest bug descriptions:
- elementwise_mul: DMA writes B to VMEM x2=0x900 but the subsequent vload reads from VMEM x5=0x800. Manifest-level register mismatch. Also had 3 trailing junk ops (beq + 2 addi) causing an infinite loop; trimmed.
- reduction_sum: output row-ordering doesn't match the per-row column sum. Likely a vector-width / row-stride interaction needing an ISA re-trace.
- rms_norm: ~6× scale mismatch vs torch; eps / weight-broadcast semantics unclear from the current ISA.
17.2 Annotator + fix-script refinements (prerequisite for 17.1)
Fixing the kernel golden suite surfaced that the addi-overflow fix
from section 16.1 had a subtle asymmetry: the autocomp-generator relied
on RISC-V sign-extension for addi rd=X, rs1=X, imm>2047 (cumulative
offset bumps in elementwise_add), where the sign-extended interpretation
was the intended semantic. My original rewrite turned those into
literal adds, breaking the addresses. Fixes:
- fix_manifest_addi_overflow.py now skips rs1 == rd cases — only the rs1 == 0 shape (matmul's initial DRAM-address setup) gets the lui + addi split.
- revert_rs1eqrd_addi_overflow.py — new script that undoes the lui + addi + add triplet my earlier fix produced, returning the original single addi with its out-of-range immediate. Idempotent.
- annotate_kernel_patch_points.py — now tracks add/sub and sign-extends addi imm. This makes the annotator's register-value estimate match what the simulator actually computes; without this the manifest's dram_in_* / dram_out_* were reported at the wrong offsets for several kernels.
17.3 Matmul K-tile chain harness (Piece B.1)
matmul_k_tile_smoke.py parameterized into a factory that emits one
Program per K_tiles ∈ {2, 3, 4}. All three pass torch.allclose at
rtol=1e-2, atol=2.0 (atol covers one bf16 ULP at K=128 accumulation,
where output magnitudes reach ±~800). Max diffs 2-4. Per-run cycles:
K=2 → 282, K=3 → 364, K=4 → 446.
Evidence the MXU accumulator survives the full K chain and the compiler
pattern (TileMatmulPattern) lowers to the right sequence.
17.4 Text-ISA parser + round-trip harness (Pieces D.1, E lite)
third_party/npu_model/npu_model/software/text_isa_loader.py parses the
stitch.py-format text ISA (<mnemonic> key=value, key=value, ...) back
into an Instruction[Args] stream. Round-trip across all 19 manifest
kernels is lossless: 823 instructions parsed back identically to the
originals.
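A minimal standalone sketch of that line format (the real loader returns the simulator's Instruction/Args objects rather than plain dicts):

```python
def parse_text_isa_line(line: str) -> tuple[str, dict[str, int]]:
    """Parse one '<mnemonic> key=value, key=value, ...' text-ISA line."""
    mnemonic, _, rest = line.strip().partition(" ")
    args = {}
    for field in filter(None, (f.strip() for f in rest.split(","))):
        key, _, value = field.partition("=")
        args[key.strip()] = int(value, 0)   # accepts decimal and 0x... immediates
    return mnemonic, args

assert parse_text_isa_line("addi rd=8, rs1=0, imm=2048") == (
    "addi", {"rd": 8, "rs1": 0, "imm": 2048},
)
```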
run_text_isa_roundtrip_parity.py exercises the full loop: fixture →
stitch_kernels → instruction_to_text → parse_text_isa → Simulation →
torch.allclose. Same 8/11 pass + 3 xfail result as the in-memory suite,
confirming the parser is behavior-preserving.
Gap: the compiler's npu-translate --mlir-to-npu-text-isa emits a
different simplified format (vmul instead of vmul.bf16, dma.load
instead of dma.load.ch<N>, integer-indexed register fields). Bridging
the two formats is the remaining Piece D.2 work: either teach the
parser aliases, or extend the emitter to produce the manifest dialect.
17.6 Real flash-attention kernels + composition E2E (2026-04-13, later)
Using the verified reference SmolVLAFusedAttentionProgram at
benchmarks/SaturnNPU/kernel_library/smolvla_fused_attention.py as the
template, split its MRF-resident state (m, l, O_col0..3) into DRAM-spilled
state for three manifest variants:
- benchmarks/SaturnNPU/kernel_library/attention_acc_kernels.py — factored prologue + KV iteration body + state init / load / store / normalize epilogue. Generators first_body(), mid_body(), last_body() compose these into the three instruction streams.
- compiler/src/merlin/Dialect/NPU/scripts/gen_attention_acc_variants.py — writes the generated bodies into manifest.json, replacing the stub entries from section 16.3.
- benchmarks/SaturnNPU/kernel_library/stitch.py:stitch_attention_chain — new helper that threads running-state DRAM addresses across invocations via ping-pong slots + explicit load/store overrides (the stitcher's existing load_overrides / store_overrides per-invocation mechanism already supports this without core changes).
- third_party/npu_model/npu_model/configs/programs/attention_kv_chain_smoke.py — AttentionKVChain{2,3}Program validate the chain end-to-end against the ISA-exact torch reference. Results: max_diff = 0.0034 (atol=0.2), 3077 cycles at KV=2, 4726 cycles at KV=3.
Per-variant instruction counts after annotation:
attention_acc_first: 120, attention_acc_mid: 122, attention_acc_last: 125
— each has the four per-tile DMA loads (Q, K, V, scale), the full online-
softmax recurrence body, and either state-store or normalize+store.
17.7 Cross-kernel composition smokes (Piece F.A)
third_party/npu_model/npu_model/configs/programs/composition_smoke.py:
- CompositionMatmulOnly — single matmul invocation driven by explicit address overrides via the stitcher (199 cyc, max_diff = 0).
- CompositionTwoMatmulChains — two independent K=2 matmul_acc chains running back to back with non-overlapping DRAM regions (561 cyc, max_diff = 2, atol=2.0).
Together with the matmul K-tile (B.1) and attention K/V chain (B.2) suites,
this proves the end-to-end stitching model: kernel outputs flow to
subsequent kernel inputs through explicit DRAM addresses with the
patcher-driven load_overrides / store_overrides.
17.8 Resolved kernel bugs + all 11 goldens green
Re-traced and fixed three manifest kernel issues surfaced by the golden suite:
- rms_norm — the kernel computes y = x * rsqrt(sum(x²) * inv_dim + eps); dram_in_2 is a broadcast inv_dim constant, dram_in_3 is a broadcast eps. The earlier torch reference was passing weight tiles in those slots (textbook RMS-norm with learnable weight); the kernel has no weight multiplication. Updated the fixture to match the exact ISA semantic (reference sketched after the result line below) — now passes at max_diff=0.015.
- elementwise_mul — VMEM register mismatch: the DMA load wrote B to VMEM x2 (0x900), but the subsequent vload read from VMEM x5 (0x800, the DRAM register). Patched the two offending vloads' rs1 from 5 → 2 via compiler/src/merlin/Dialect/NPU/scripts/fix_elementwise_mul_vload_bug.py. Now passes at max_diff=0.
- reduction_sum — the torch reference used float32 x.sum(dim=-1) but the kernel adds the two bf16 halves first and then row-reduces; the order of accumulation differs. Aligned the reference math to the kernel path; now max_diff=0.
Result: run_kernel_golden_tests.py reports 11/11 PASS, 0 xfail.
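For the rms_norm case, a torch sketch of the ISA-exact reference described above (shapes and argument names are assumptions; no learnable weight, inv_dim and eps arrive as broadcast constants):

```python
import torch

def ref_rms_norm(x_h0: torch.Tensor, x_h1: torch.Tensor,
                 inv_dim: float, eps: float) -> torch.Tensor:
    """ISA-exact reference: y = x * rsqrt(sum(x^2) * inv_dim + eps), no weight."""
    x = torch.cat([x_h0, x_h1], dim=1).float()   # reassemble the 32x32 tile from halves
    ssq = (x * x).sum(dim=1, keepdim=True)       # per-row sum of squares
    y = x * torch.rsqrt(ssq * inv_dim + eps)
    return y.to(torch.bfloat16)

x = torch.randn(32, 32, dtype=torch.bfloat16)
golden = ref_rms_norm(x[:, :16], x[:, 16:], inv_dim=1.0 / 32, eps=1e-6)
```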
17.9 Layout model — same 2048-byte footprint across kernels
Matmul writes its 32×32 bf16 output as bf16_split_halves: 1024 B of
cols 0-15 packed as [32, 16], then 1024 B of cols 16-31. Elementwise
ops read the same 2048 B with vload imm12=0 + vload imm12=32
(each pulls one 32×16 register-width chunk from VMEM).
Keeping every activation in this split-halves layout through the block
lets arithmetic kernels compose directly: matmul's out_h0 +
elementwise's A_h0 maps cleanly, and reading the final output as
torch.cat([h0, h1], dim=0) (shape [64, 16]) recovers the logical
tile. No repack kernel needed — only the fp8→bf16 and bf16→fp8
boundaries (matmul inputs, requant outputs) require layout awareness.
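A small torch sketch of that split-halves round trip (tile size per the 32×32 bf16 description above):

```python
import torch

tile = torch.randn(32, 32, dtype=torch.bfloat16)   # logical 32x32 output tile

# DRAM image: 1024 B of cols 0-15 packed as [32, 16], then 1024 B of cols 16-31.
dram_image = torch.cat([tile[:, :16], tile[:, 16:]], dim=0).contiguous()  # [64, 16]

# Elementwise kernels read the same two 32x16 chunks back with two vloads;
# re-joining the halves along columns recovers the logical tile.
assert dram_image.numel() * dram_image.element_size() == 2048
assert torch.equal(torch.cat([dram_image[:32], dram_image[32:]], dim=1), tile)
```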
17.10 Mini-block composition chain end-to-end (Piece F.A proper)
composition_mini_block.py::CompositionMiniBlockRmsRequantMatmulAdd
chains four kernels:
x bf16 split halves
├── rms_norm → x_norm (split halves)
├── requant → x_fp8 (contiguous fp8 32×32)
├── matmul(x_fp8, W_fp8) → proj (split halves)
└── elementwise_add(proj, x) → res (split halves, [64,16] view)
159 total instructions. Simulator runs in 741 cycles. The torch
reference mirrors the ISA's exact arithmetic (fp8 roundtrip,
bf16 accumulation, split-halves layout). Result:
allclose=True, max_diff=1.125 at magnitudes up to 13.5. This is the
merge-gate proof that multi-kernel composition + DRAM threading works
end to end.
17.11 Scheduling primitives audit
Counted the mnemonic distribution in three representative schedules to confirm every level of the tile-to-tile orchestration is represented:
| Primitive | Mini-block | Attention KV=3 | Matmul K=4 |
|---|---|---|---|
| DMA loads / stores / waits / config | 10 / 5 / 17 / 6 | 14 / 3 / 29 / 12 | 8 / 1 / 9 / 4 |
| Vector load/store | 19 | 67 | 10 |
| MXU weight-push | 1 | 12 | 4 |
| MXU matmul (overwrite + accum) | 1 + 0 | 9 + 3 | 1 + 3 |
| MXU push-bf16 / pop-fp8 (roundtrip) | 0 | 21 / 21 | 0 |
| Vector arith (add/mul/sub) | 9 | 52 | 0 |
| Softmax primitives (exp + redmax + redsum) | 2 | 21 | 0 |
| Delay barriers | 10 | 27 | 1 |
| Scalar setup (addi/lui/add) | 73 | 68 | 58 |
Confirms: (a) cross-invocation DMA sync happens via dma.wait, (b) MXU accumulator state survives across invocations in the matmul chain, (c) the attention chain exchanges running state through DRAM via per- invocation load/store overrides, (d) dtype bridging (bf16⇄fp8) happens inline via acc-roundtrips, (e) pipeline hazards are handled via delay barriers where needed.
17.12 Remaining for full SmolVLA E2E
What's still left for SmolVLAFullProgram to torch.allclose vs
compute_pytorch_reference():
- Block-composition helper — a Python function that emits the stitched instruction stream for one full transformer block (Q/K/V projections → attention → O projection → residual → rms_norm → MLP up/silu/down → residual). The primitives all exist; the helper iterates the matmul N/K tile grid and calls stitch_attention_chain for the attention step.
- Weight-layout packer — turns the 494 demoted-MLIR weight tiles into the fp8 32×32-tile-per-(m_tile, k_tile) layout the matmul chains expect. Purely a DRAM marshaling pass.
- Sim-time gating — at SmolVLA shapes (seq=1024, hidden=768), one matmul expands to ~18k kernel invocations; six matmuls per block × 16 blocks ≈ 1.7M invocations ≈ tens of CPU-hours in the current simulator (see the arithmetic sketch after this list). A --fast-sim path (skip per-cycle trace writing) is the obvious lever; otherwise Piece G is a CI batched run.
- RoPE in attention — SmolVLA applies rotary position embeddings to Q, K before attention. The current flash kernel doesn't. Either precompute RoPE'd Q, K in torch and feed them as inputs, or extend the kernel.
17.13 Open follow-ups (compiler-side alignment)
What's left (gated on Piece C + D.2)
- Flash-attention ISA bodies (Piece C). The compiler-side plumbing (TileAttentionPattern, manifest stubs) is in place. The real online-softmax recurrence is the multi-hour hand-coding effort.
- Compiler dump + alias layer (Piece D.2). Adds a --iree-npu-native-kernel-lowering dump path in tools/compile.py producing module.6.isa.txt in the stitch-dialect format that our parser already handles.
- Attention chain tests (Piece B.2). Structurally identical to MatmulKTile{2,3,4}Program; waits on real Piece C kernels.
- One-transformer-block E2E (Piece F). The merge gate — extract one block from module.5.demoted.mlir, compile it, swap the placeholder schedule in SmolVLATransformerBlock, and assert torch.allclose vs a PyTorch block forward. Requires C + D.2.
- Full SmolVLA forward (Piece G). Stretch goal; same pattern as F scaled up to ~10³-10⁴ kernel invocations.
Dev-blog written by: Agustin Coppari Hollmann
Project Members: Yufeng Chi and Agustin Coppari Hollmann