# 2026-04-13: Saturn OPU vfredusum.vs hang — opcode survey + MLIR scalarization fix

Repro pin: `merlin@320fbf06` · `iree_bar@68acd99c74`
Status: Active

Related entries:
- 2026-04-14 f32-reduction lowering hang — sibling reduction-hang on the same hardware; the present entry fixes the `vfredusum.vs` opcode case via MLIR scalarization, while the 2026-04-14 entry handles the broader f32-reduction tree-reduction codegen via per-function `target-features="-v"`.
## Context and Goal
ViT-small inference on the Saturn OPU FireSim build hangs at HAL dispatch ordinal 9 (the first LayerNorm); output stops after exactly one workgroup print.
Comprehensive vsetvli fusing in the OPU matmul ukernel did not fix it (Phase
1 of the prior diagnostic plan ruled the matmul out: ord=9 is not the
matmul, it's the LayerNorm reduction). The goal of this work is to (a) get
direct evidence of which RVV instruction hangs the Saturn vector unit, and
(b) produce a structural codegen fix that prevents emission of the offending
opcode rather than chasing the symptom in one model at a time.
## Implementation Changes
1. Isolated RVV self-test on FireSim
samples/SaturnOPU/simple_embedding_ukernel/model_benchmark.c now has a
#ifdef SATURN_RVV_SELFTEST block (run before any IREE init in main()) that
probes each suspected-hang opcode in its own inline-asm checkpoint with an
fprintf(stderr,...) + fflush(stderr) immediately after. A hang inside the
asm block leaves the previous [rvv] cp=N line as the last visible UART
output, so the exact hung opcode is identified by the surviving prefix.
Each checkpoint is independently skippable via a compile-time bitmask
(SATURN_RVV_SELFTEST_SKIP=0x18 skips cp=3 and cp=4), so we can step past
a confirmed hang to expose the next probe across multiple FireSim runs:
| cp | opcode | role |
|---|---|---|
| 1 | `csrr vlenb` | baseline sanity |
| 2 | `vadd.vv` | control — plain vector arithmetic |
| 3 | `vfredusum.vs` | LayerNorm sum reduction (PRIMARY suspect) |
| 4 | `vfsqrt.v` | LayerNorm inv_sqrt(var+eps) |
| 5 | `vrgather.vi` | LayerNorm broadcast / gather |
| 6 | `vfredmin.vs` | softmax / argmin |
| 7 | `vfredmax.vs` | softmax max-subtract |
| 8 | `vfwredusum.vs` | widening (f32→f64) reduction |
| 9 | `vfslide1down.vf` | tree-reduction primitive (LLVM fallback) |
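The skip-mask gating is easiest to see as executable pseudocode. Below is a host-side Python sketch (function names `cp_enabled`/`run_selftest` are illustrative, not the C symbols, and the probe bodies are stubbed; the real checkpoints are RVV inline asm): bit `cp` of the mask disables checkpoint `cp`, which is why 0x08 hides cp=3 and 0x18 hides cp=3 and cp=4.

```python
SKIP_DEFAULT = 0x18  # example: skip cp=3 and cp=4


def cp_enabled(cp: int, skip_mask: int) -> bool:
    """Bit cp of the skip mask disables checkpoint cp (0x08 -> cp=3; 0x18 -> cp=3,4)."""
    return ((skip_mask >> cp) & 1) == 0


def run_selftest(skip_mask: int) -> list[str]:
    lines = [f"[rvv] SELFTEST START skip={skip_mask:#x} (cps 1..9)"]
    for cp in range(1, 10):
        if not cp_enabled(cp, skip_mask):
            lines.append(f"[rvv] cp={cp} SKIPPED")
            continue
        # The real probe runs the opcode inside an .option push/arch,+v asm
        # block and prints+flushes immediately after; a hang leaves the
        # previous line as the last visible UART output.
        lines.append(f"[rvv] cp={cp} ok")
    return lines


if __name__ == "__main__":
    print("\n".join(run_selftest(SKIP_DEFAULT)))
```

The bit-per-checkpoint convention means a single hex argument to the FireSim wrapper selects any subset of probes without recompiling the survey logic itself.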
2. FireSim host-level timeout wrapper
A hang inside the running RISC-V ELF freezes the simulated core (the vector
unit never retires the bad opcode), so a software timeout inside the binary
cannot rescue it. build_tools/firesim/run_rvv_selftest.sh now wraps
firesim runworkload with a host-side timeout --signal=TERM
--kill-after=30s ${TIMEOUT}s ... (default 5 min) and follows up with
firesim kill so the FPGA is clean for the next survey iteration. The script
also accepts a skip-mask argument and rebuilds the rvvtest binary with the
matching -DSATURN_RVV_SELFTEST_SKIP=... cmake flag, so the full Phase A
survey is just four invocations:
```shell
bash build_tools/firesim/run_rvv_selftest.sh 0x00   # baseline (cp=3 hangs)
bash build_tools/firesim/run_rvv_selftest.sh 0x08   # skip cp=3 → expose cp=4
bash build_tools/firesim/run_rvv_selftest.sh 0x18   # skip cp=3,4 → expose cp=5
bash build_tools/firesim/run_rvv_selftest.sh 0x38   # skip cp=3,4,5 → expose cp=6..9
```
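The host-side timeout discipline generalizes: bound the run from outside, then clean up unconditionally. A Python stand-in for the `timeout ... firesim runworkload` plus `firesim kill` pair (a sketch of the control flow, not the actual script):

```python
import subprocess


def run_bounded(cmd, timeout_s, cleanup=None):
    """Run cmd under a hard host-side timeout, then always run cleanup.

    Mirrors the shape of `timeout --signal=TERM ... firesim runworkload`
    followed by an unconditional `firesim kill`: a hung simulated core can
    never time itself out, so the bound has to live on the host.
    """
    try:
        subprocess.run(cmd, timeout=timeout_s, check=False)
        verdict = "completed"
    except subprocess.TimeoutExpired:
        verdict = "timed out"
    finally:
        if cleanup is not None:
            cleanup()  # e.g. lambda: subprocess.run(["firesim", "kill"])
    return verdict


if __name__ == "__main__":
    # A long sleep stands in for a hung FireSim run.
    print(run_bounded(["sleep", "30"], timeout_s=0.2))  # prints "timed out"
```

The `--kill-after=30s` escalation in the real wrapper handles the case where SIGTERM itself is ignored; `subprocess.run` does the equivalent by killing the child when the timeout expires.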
3. +xopu-gated MLIR scalarization pattern
third_party/iree_bar/compiler/src/iree/compiler/Codegen/LLVMCPU/ConvertToLLVM.cpp
has a new ScalarizeXopuFloatReductionPattern that rewrites every
vector.reduction <add|mul|minimumf|maximumf|minnumf|maxnumf> %v :
vector<Nxf*> into f* into a chain of scalar vector.extract + scalar
arith.{add,mul,...}f ops, preserving the optional accumulator and
fastmath flags. The pattern runs inside ConvertToLLVMPass::runOnOperation()
before the existing Vector→Vector and Vector→SCF lowerings, gated strictly
on hasFeature(targetConfig, "+xopu") so non-Saturn RISC-V (and non-RISC-V)
backends are byte-identical before and after. Integer reductions are
intentionally not scalarized — Phase A has not implicated them.
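The rewrite's semantics can be pinned down in a few lines of Python: an ordered, left-to-right chain of per-lane extracts combined by the scalar op, seeded by the accumulator when one is present. This is a behavioral model with illustrative names, not the MLIR pattern code (and `min`/`max` here gloss over the NaN subtleties of the four min/max kinds):

```python
import operator


def scalarized_reduction(kind, v, acc=None):
    """Left-to-right scalar chain equivalent of `vector.reduction <kind>`.

    Models what the scalarization pattern emits: one extract per lane plus a
    chain of scalar arith ops, seeded with the accumulator when present — no
    vfred* instruction is ever generated.
    """
    ops = {"add": operator.add, "mul": operator.mul,
           "minnumf": min, "maxnumf": max}  # NaN handling simplified
    lanes = iter(v)
    total = acc if acc is not None else next(lanes)
    for lane in lanes:
        total = ops[kind](total, lane)
    return total


# The cp=3 probe case: reduce {1,2,3,4} into a 0.0 accumulator.
print(scalarized_reduction("add", [1.0, 2.0, 3.0, 4.0], acc=0.0))  # 10.0
```

The left-to-right order matters: it matches the ordered-reduction semantics the accumulator form implies, so the scalarized chain is bit-compatible with what `vfredusum.vs` should have produced.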
## What Worked

### Phase 1 — ord=9 ≠ matmul
Dumping the vmfb dispatch table with iree-dump-module and cross-referencing
against the post-vectorization MLIR confirmed dispatch ordinal 9 in vit_small
is a linalg.generic-only LayerNorm reduction (input 64×128xi8, output
64×128xi8), not an OPU matmul. This pivoted the investigation away
from opu_matmul_riscv_64.c entirely.
### Phase 2 — assembly review of dispatch_1
llc -march=riscv64 -mattr=+v -mabi=lp64d -O3 on the dispatch_1 bitcode
emitted 706 lines of RV64+V assembly with 58 vsetvli/vsetivli and 388 RVV
instructions in the inner K-loop — all vsetvlis correctly placed, no
missing prologue or stripped fence. The vector instructions themselves are
the problem, not their preamble.
### Phase 3 — selftest pinpoints vfredusum.vs
First FireSim run with SATURN_RVV_SELFTEST=1 (skip mask 0):
```
[rvv] SELFTEST START skip=0x0 (cps 1..9)
[rvv] cp=1 vlenb=16
[rvv] cp=2 vadd c=[11,22,33,44]
          ← cp=3 NEVER PRINTS — hang
```
vfredusum.vs on vector<4xf32> reducing {1,2,3,4} into a scalar with a
vfmv.s.f-initialised 0.0 accumulator never returns. This is the
smoking gun.
### Phase B — opcode emission inventory (host-side, no FireSim)
llvm-objdump on the linked dispatch ELF for each model gives a definitive
count of each suspected opcode:
| model | vfredusum.vs | vfredmax.vs | vfsqrt.v | vrgather.vi | vfslide1down.vf | vfredmin.vs | vfwredusum.vs |
|---|---|---|---|---|---|---|---|
| vit_small | 54 | 4 | 4 | 80 | 18 | 0 | 0 |
| large_mlp | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| smolvla | deferred | deferred | deferred | deferred | deferred | deferred | deferred |
Reproduce with:
```shell
conda run -n merlin-dev uv run tools/merlin.py compile \
  models/opu_bench_suite/opu_bench_vit_small.q.int8.mlir \
  --target saturn_opu --hw OPU --dump-artifacts \
  --output-dir /tmp/vit_small_phaseB
build/host-merlin-release/llvm-project/bin/llvm-objdump -d --mattr=+v \
  /tmp/vit_small_phaseB/binaries/*.so | \
  grep -oE '(vfred[a-z]+\.vs|vfwred[a-z]+\.vs|vfsqrt\.v|vrgather\.v[ix]|vfslide1down\.vf|vcompress\.vm)' | \
  sort | uniq -c
```
Two important takeaways:
- `large_mlp` emits zero suspect opcodes. It's a pure GEMM model — every reduction lives inside the `iree_uk_opu_matmul` ukernel (custom OPU instructions, not RVV reductions). This makes it the perfect regression baseline: our `+xopu`-gated pattern must produce a byte-identical ELF for large_mlp before and after the change.
- vit_small uses `vfslide1down.vf` 18× already. This is LLVM's tree-reduction fallback, almost certainly co-emitted alongside the 54 `vfredusum.vs` for the same `vector.reduction add` ops. After our pattern scalarises those reductions, we expect the slide count to drop sharply.
vfredmin, vfwredusum, and vcompress don't appear in any model we've
inventoried, so we can defer probing them on FireSim until a model needs
them. The pattern still covers <minimumf|maximumf|minnumf|maxnumf> just in
case future codegen needs it.
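For reference, the four min/max combining kinds differ only in NaN handling: `minnumf`/`maxnumf` ignore a quiet-NaN operand (IEEE-754 minNum semantics), while `minimumf`/`maximumf` propagate it. A scalar Python model of the two min flavors (our reading of the arith-op semantics; function names are ours):

```python
import math


def minnumf(a, b):
    """minnumf: a NaN operand is ignored, the other operand wins (minNum)."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return min(a, b)


def minimumf(a, b):
    """minimumf: any NaN operand propagates to the result (minimum)."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return min(a, b)


print(minnumf(math.nan, 1.0))   # 1.0
print(minimumf(math.nan, 1.0))  # nan
```

Covering all four kinds in the pattern costs nothing now and avoids a silent fall-through to a hardware reduction if future codegen picks a different combining kind for the same LayerNorm/softmax math.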
## What Did Not Work

- "Comprehensive `vsetvli` fusing in `opu_matmul_riscv_64.c`" (twice). The matmul wasn't even involved — Phase 1 ruled it out.
- Trying to add `printf` checkpoints inside `iree_uk_opu_matmul_loop`. That ukernel is compiled as LLVM bitcode with `-nostdinc -ffreestanding`, so libc isn't linked into the embedded ELF; an unresolved `printf` symbol breaks dispatch loading. The IREE-blessed escape hatch is the `iree_hal_executable_environment_v0_t::import_funcs[]` callback (used by `iree_h2f_ieee`) — but Phase 3 made that unnecessary.
- Initial attempt to `llvm-objdump --mattr=+xopu`. `+xopu` isn't a known LLVM feature; Saturn OPU custom opcodes are emitted by source-level `.insn r 0x57, ...` directives, not by the codegen, so `--mattr=+v` alone is enough for static analysis.
- `llc` → `.s` analysis as a hang predictor. The assembly looked correct. The bug is in the hardware's execution of `vfredusum.vs`, not in the instruction encoding — only direct hardware probing exposed it.
## Debugging Notes

- A hang on the simulated core means the host wrapper must impose the timeout. No software-side guard inside the running ELF can break a stuck vector instruction; the core never retires and never returns to the scheduler. The host shell `timeout` + `firesim kill` pair is the only reliable way to bound a survey run.
- Each new `[rvv] cp=N` probe must use `.option push / .option arch, +v / ... / .option pop`. The bare-metal toolchain compiles the rest of the C with `-march=rv64imafdc` (no `+v`), so the inline-asm assembler refuses vector mnemonics unless we locally enable `+v` for the block.
- `vfwredusum.vs` requires a two-stage `vsetvli` (e64 to seat the f64 scalar accumulator in `v9[0]`, then e32 for the source vector and the reduction itself, then e64 again to extract the scalar back out). Vector register contents survive the `vsetvli` change, so this is safe.
- `large_mlp` already exercises the +xopu codegen pipeline end-to-end with zero suspect opcodes. Use it as the regression oracle for any future +xopu-gated change in `ConvertToLLVM.cpp`.
- The pattern is intentionally narrow. It only matches floating-point reductions, only the six combining kinds we expect. If a future model trips integer reductions (`vredsum.vs` etc.) and Phase A confirms those also hang, broaden the match — don't widen prophylactically.
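A numeric aside on why `vfwredusum.vs` seats an f64 accumulator at all: keeping partial sums in f32 re-rounds at every step, while an f64 accumulator only carries the one-time f32 quantization of the inputs. A host-side model (the `struct` round-trip emulates f32 rounding; helper names are ours):

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python double to the nearest binary32, like an f32 register."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32_acc(xs):
    """Partial sums kept in f32 — every addition re-rounds to binary32."""
    acc = 0.0
    for x in xs:
        acc = to_f32(acc + to_f32(x))
    return acc


def sum_f64_acc(xs):
    """f32 sources, f64 accumulator — the widening-reduction contract."""
    acc = 0.0
    for x in xs:
        acc += to_f32(x)
    return acc


xs = [0.1] * 10000
print(sum_f32_acc(xs), sum_f64_acc(xs))
```

The f64-accumulator sum lands within the inputs' own f32 quantization error of 1000.0; the f32-accumulator chain drifts further because each of the 10000 additions rounds again at the accumulator's magnitude.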
## Test Coverage and Commands

### Reproduce the hang (pre-fix)

```shell
bash build_tools/firesim/run_rvv_selftest.sh 0x00 300
# Expected: cp=1 prints, cp=2 prints, cp=3 never prints, host timeout fires.
```
### Run the full Phase A opcode survey

```shell
for mask in 0x00 0x08 0x18 0x38; do
  bash build_tools/firesim/run_rvv_selftest.sh "$mask" 300
done
# Each run rebuilds the rvvtest binary with the new skip mask, runs under a
# 5-min host timeout, and prints a "stopped after cp=N" verdict.
```
### Verify the fix statically (post-Phase C)

```shell
conda run -n merlin-dev uv run tools/merlin.py compile \
  models/opu_bench_suite/opu_bench_vit_small.q.int8.mlir \
  --target saturn_opu --hw OPU --dump-artifacts \
  --output-dir /tmp/vit_small_post_fix
build/host-merlin-release/llvm-project/bin/llvm-objdump -d --mattr=+v \
  /tmp/vit_small_post_fix/binaries/*.so | \
  grep -cE '(vfred[a-z]+\.vs|vfwred[a-z]+\.vs)'   # expect 0
```
### Verify regression-free for non-LayerNorm models

```shell
conda run -n merlin-dev uv run tools/merlin.py compile \
  models/opu_bench_suite/opu_bench_large_mlp.q.int8.mlir \
  --target saturn_opu --hw OPU --dump-artifacts \
  --output-dir /tmp/large_mlp_post_fix
diff <(llvm-objdump -d /tmp/large_mlp_phaseB/binaries/*.so) \
     <(llvm-objdump -d /tmp/large_mlp_post_fix/binaries/*.so)
# expect: no diff (large_mlp emits zero vector.reduction ops)
```
### Verify the fix dynamically

```
# Stage vit_small under FireSim and run end-to-end inference.
# Expected log:
# [apply] #15 ord=9 bindings=3
# [d] #1073 o=9 wg=8 ...
# [d] #1074 o=9 wg=9 ...   ← second workgroup now appears
# ... rest of inference ...
# DONE
```
## Addendum (2026-04-14) — residual narrow-M hang in iree_uk_opu_matmul
The vfredusum scalarization fixed every LayerNorm/softmax dispatch in
vit_small. Post-fix, the model runs cleanly through dispatches 0–8, enters
dispatch 9 (matmul_like_64x128x128), and the simulation stalls at the
first workgroup of that dispatch ([d] o=9 wg=0,0,0 of 8,1,1). This is a
second, independent bug inside the tier-6 OPU matmul ukernel
(iree_uk_opu_matmul, compiled as LLVM bitcode under -ffreestanding
-nostdinc). Static disassembly of the post-fix ELF still shows zero
vfred* opcodes — the scalarization fix is doing its job, but the
ukernel hangs for a different reason.
Characterization:
- The ukernel works for `large_mlp` (128×2048×2048, M-tiles=8) and every transformer batch_matmul in ViT (those take the sibling `iree_uk_mmt4d` ukernel, not this one).
- The ukernel fails for `vit_small` dispatch 9 (64×128×128, M-tiles=4) and would fail for any narrow-M 3D matmul passed through the encoding resolver. By inspection, the likely regime is M-tiles < 8.
- The hang is bitcode-internal (no libc), so standard `printf` debugging requires plumbing an IREE-style ukernel-import "escape hatch" (analogous to `iree_h2f_ieee`), or a higher-level compile-time workaround (pad M to a multiple of 8 before the encoding resolver, or downgrade narrow matmuls to the `mmt4d` path instead of `opu_matmul`).
Runtime engineering improvements landed alongside this investigation so the next debug iteration is tractable:
- Warmup-gated debug prints. A new `iree_merlin_dispatch_debug_enabled` flag (defined in `iree/hal/utils/deferred_command_buffer.c`) suppresses `[apply]`/`[d]`/`[vm_invoke]` prints until the benchmark calls `iree_merlin_enable_dispatch_debug(1)` right before Warmup START. Previously, ~1070 init dispatches spammed the UART before reaching the hang, turning every FireSim iteration into a 10-minute affair. After gating, the hang reproduces in seconds.
- Host-level FireSim timeout + signal-clean cleanup. `build_tools/firesim/run_phase_d.sh` now takes `timeout=0` for unbounded runs, traps SIGINT so `firesim kill` always runs at teardown, and falls back to `$FIRESIM_RUNS_DIR/sim_slot_0/uartlog` when the kill path interrupts artifact collection.
The narrow-M ukernel bug is out of scope for the vfredusum scalarization change. Tracked as a follow-up below; the vfredusum fix and the paper figures stand on their own.
## Addendum 2 (2026-04-14) — narrow-M bypass in the OPU encoding resolver
Resolved the narrow-M hang with a targeted bypass inside the OPU encoding
resolver. The fix is in
third_party/iree_bar/compiler/src/iree/compiler/Codegen/ExternalInterfaces/CPUEncodingExternalModels.cpp
in lowerOPUContractionToUkernel: for 2D-output matmul dispatches with
static M-tile-count < 8, the helper now emits linalg.mmt4d (into a
fresh 4D packed output) plus a tensor.unpack back to the 2D identity
shape, rather than calling the tier-6 iree_uk_opu_matmul ukernel. The
compute still runs on OPU hardware because iree_uk_mmt4d has a runtime
+xopu dispatch into iree_uk_mmt4d_opu_full_loop (the tier 2-4
VOPACC path).
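The bypass condition itself is tiny. Sketching it in Python, with the M-tile size inferred from the two data points in this entry (64 rows/4 tiles and 128 rows/8 tiles both imply 16-row M-tiles; that constant is an assumption here, not something read out of the resolver):

```python
from math import ceil

M_TILE_ROWS = 16     # assumed from 64/4 and 128/8 M-tile counts in this entry
MIN_OPU_M_TILES = 8  # below this, the tier-6 iree_uk_opu_matmul hangs


def choose_matmul_path(m: int) -> str:
    """Sketch of the per-op decision made inside lowerOPUContractionToUkernel.

    Narrow-M 2D matmuls are routed to linalg.mmt4d (plus a tensor.unpack back
    to the 2D identity shape); wide ones keep the tier-6 iree_uk_opu_matmul
    fast path. Either way the compute stays on OPU hardware.
    """
    m_tiles = ceil(m / M_TILE_ROWS)
    if m_tiles < MIN_OPU_M_TILES:
        return "mmt4d (tier 2-4 VOPACC)"
    return "opu_matmul (tier 6)"


print(choose_matmul_path(64))   # vit_small dispatch 9 -> mmt4d path
print(choose_matmul_path(128))  # large_mlp -> tier-6 fast path
```

Because the check is per-op, a single model can mix both paths, which is exactly what the vit_small ukernel counts below show.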
Why this works where the earlier attempts failed:
- The earlier global `iree-opu-disable-encoding-resolver` flag broke batch_matmul dispatches (ViT attention) because the default CPU encoding resolver doesn't handle their non-standard indexing maps. Keeping the OPU resolver attached means batch_matmul dispatches still take the OPU custom path; only the narrow-M 2D case is overridden.
- The preprocessing-pass attempts (rewriting to `linalg.batch_matmul` with batch=1, or padding M to 128) were undone by downstream canonicalization / dispatch hoisting. The per-op bypass inside the encoding resolver runs at the right layer.
Verification:
| Model | `iree_uk_opu_matmul` (pre-fix) | post-fix | `iree_uk_mmt4d` (pre-fix) | post-fix |
|---|---|---|---|---|
| vit_small | 6 | 0 | 5 | 11 |
| large_mlp | 4 | 4 | 0 | 0 |
| vit (full) | (hangs) | completes | — | — |
| tinyllama | (hangs) | completes | — | — |
large_mlp is untouched (its M-tile count is 8 ≥ threshold) and keeps
the tier-6 fast path. vit_small, vit, tinyllama all drop to
tier 2-4 perf (~35 Ops/cyc analytical, ~26× the RVV baseline) but are
now runnable end-to-end on FireSim.
Paper figures (benchmarks/SaturnOPU/make_paper_figures.sh): for the
per-model decomposition plot, vit_small / vit / tinyllama now show their
matmul compute in the opu_mmt4d green segment instead of the
encoding_resolver blue segment. Overall OPU % per model is unchanged
(both are OPU-accelerated paths). large_mlp is unchanged.
## Follow-Up Tasks

- [ ] Residual narrow-M hang in `iree_uk_opu_matmul`. Candidates: (a) ukernel-internal debug via the IREE ukernel-import escape hatch, (b) compile-time pad-to-8 narrow-M workaround, (c) route narrow 3D matmuls through the `iree_uk_mmt4d` ukernel, which handles them correctly (requires output-shape plumbing, since mmt4d produces 4D while the encoding resolver produces 2D identity).
- [ ] Run the Phase A survey end-to-end on FireSim (4 runs × 5-min timeout) to get works/hangs status for `vfsqrt.v`, `vrgather.vi`, `vfredmax.vs`, and `vfslide1down.vf`. If `vfsqrt` or `vrgather` also hang, extend the MLIR pattern to scalarize `math.sqrt` over vectors and lower `vector.broadcast`/`vector.shuffle` to `vector.insert` chains respectively.
- [ ] Re-run vit_small full inference on FireSim post-fix to confirm LayerNorm now returns and the model completes.
- [ ] Run the smolvla compile (skipped here — 909 MB MLIR, not host-compile friendly without batching) and inventory its opcode emission to confirm the same pattern set covers it.
- [ ] File an upstream issue against the `saturn-vectors` RTL referencing the works/hangs table once Phase A is complete.
- [ ] Once stable, gate the rvvtest binary behind a default-OFF `MERLIN_BUILD_OPU_RVV_SELFTEST` cmake option in `samples/SaturnOPU/simple_embedding_ukernel/CMakeLists.txt` so it stays in tree as a hardware-bringup probe without bloating default builds.
- [ ] Restore `chipyard/sims/firesim/deploy/config_runtime.yaml` from the `.bak_rvvtest` backup once the rvvtest workload is no longer the active one.