2026-03-13: RISC-V MMT4D Ukernel Workstream
Context and Goal
This workstream focused on moving the RISC-V matmul effort onto the mmt4d
ukernel path instead of the vector-contract custom-kernel path.
The concrete goals were:
- keep regular RVV int8 on an efficient
mmt4dpath - enable SpacemiT
+xsmtvdotint8 throughmmt4d - enable Saturn OPU
+xopuint8 throughmmt4dwith high-K tiling - make FP8 compilation work for the same lowering flow
- validate the generated assembly with
--dump-artifacts, not just the IR
The main design constraint was to follow IREE's existing mmt4d structure and
tile-selection conventions as closely as possible, while still exposing target-
specific tile families where the hardware clearly wants them.
Implementation Changes
1. RISC-V ukernel-side mmt4d work
The i8 path was moved onto explicit RISC-V ukernel implementations in:
third_party/iree_bar/runtime/src/iree/builtins/ukernel/arch/riscv_64/mmt4d_riscv_64_v_i8.cthird_party/iree_bar/runtime/src/iree/builtins/ukernel/arch/riscv_64/mmt4d_riscv_64_tiles.inlthird_party/iree_bar/runtime/src/iree/builtins/ukernel/arch/riscv_64/query_tile_sizes_riscv_64_entry_point.c
Implemented tile families:
- SpacemiT
+xsmtvdot:4x4x8 - OPU
+xopuint8:16x16x128with narrow-M truncations
The OPU int8 path keeps the high-K tile because that is where the hardware extension actually makes sense. This matches the intent from the Saturn sample kernels and avoids falling back to a low-K generic RVV-style shape.
2. Compiler tile selection and lowering configuration
Compiler-side tile selection was updated so encoding/materialization and lowering strategy agree on the same tile families:
third_party/iree_bar/compiler/src/iree/compiler/Codegen/ExternalInterfaces/CPUEncodingExternalModels.cppthird_party/iree_bar/compiler/src/iree/compiler/Codegen/LLVMCPU/KernelDispatch.cpp
Changes made:
- added
+xsmtvdotFP8/int8 tile families using4x4x8 - added
+xopuint8 tile family using16x16x128 - added
+xopuFP8 tile family using16x16x8
The OPU FP8 case intentionally does not use K=128 yet. That tile shape
caused oversized vector contracts during lowering, so for now the compiler uses
a smaller K=8 shape to keep the path compiling while native FP8 OPU lowering
is still missing.
3. FP8 legalization fix
The first FP8 blocker was not OPU- or SpacemiT-specific. The CPU backend left
arith.extf illegal for cases like:
The fix was made in:
third_party/iree_bar/compiler/src/iree/compiler/Codegen/Common/ConvertUnsupportedFloatArithPass.cpp
The pass already knew how to emulate small-float extf to f32. It now
reconstructs f32 first and then emits a final cast to the requested wider
destination type such as f16.
Regression coverage was added in:
third_party/iree_bar/compiler/src/iree/compiler/Codegen/Common/test/convert_unsupported_float_arith.mlir
4. Lowering-strategy tests
Test coverage was extended so target-specific lowering selection is visible in MLIR before looking at assembly:
third_party/iree_bar/compiler/src/iree/compiler/Codegen/LLVMCPU/test/select_riscv_spacemit_lowering_strategy.mlirthird_party/iree_bar/compiler/src/iree/compiler/Codegen/LLVMCPU/test/select_riscv_opu_lowering_strategy.mlir
The new OPU test now covers both:
- int8 -> i32
- fp8 -> f16
5. Benchmark / artifact scripts
Artifact-driven validation scripts were added or updated in Merlin:
benchmarks/SpacemiTX60/compile_matmul_xsmt_i8_ukernel_all.shbenchmarks/SpacemiTX60/compile_matmul_xsmt_fp8.shbenchmarks/SaturnOPU/compile_matmul_opu_i8_ukernel_all.shbenchmarks/SaturnOPU/compile_matmul_opu_fp8_ukernel_all.sh
These scripts compile with:
--iree-llvmcpu-enable-ukernels=all--iree-llvmcpu-link-ukernel-bitcode=true--iree-opt-data-tiling=true--iree-dispatch-creation-data-tiling=true--dump-artifacts
For i8, the scripts also verify the hot loop by checking the dumped .s file
for the expected target instructions.
What Worked
1. XSMT int8 mmt4d path
The generated hot loop for SpacemiT int8 is structurally good:
- two vector loads
- one
vmadotencoded as.insn - pointer increments
- loop branch
Relevant assembly:
/scratch2/agustin/merlin/build/compiled_models/SpacemiT/spacemit_x60_RVV_matmul_i8_2048/files/module_matmul_i8_2048_linked_embedded_elf_riscv_64.s
The hot loop is:
.LBB0_3:
vle8.v v0, (a1)
vle8.v v1, (a2)
.insn r 43, 3, 113, t3, zero, ra
addi a2, a2, 32
addi a0, a0, 1
addi a1, a1, 32
bnez a0, .LBB0_3
This is the right kind of loop to optimize: no obvious in-loop spills, no extra scalar unpacking, and the target instruction is in the inner loop.
2. OPU int8 mmt4d path
The OPU int8 loop is also structurally good:
- strided vector load for A
- strided vector load for B
- one
vopaccencoded as.insn - pointer increments
- loop branch
Relevant assembly:
/scratch2/agustin/merlin/build/compiled_models/SpacemiT/saturn_opu_OPU_matmul_i8_2048/files/module_matmul_i8_2048_linked_embedded_elf_riscv_64.s
Hot loop:
.LBB0_4:
vlse8.v v16, (a4), a0
vlse8.v v18, (a1), a0
.insn r 87, 2, 81, zero, s2, a6
addi a1, a1, 1
addi a3, a3, 1
addi a4, a4, 1
bnez a3, .LBB0_4
This is the expected shape for the OPU int8 path and is much closer to the hardware samples than a generic vector-contract lowering.
3. FP8 compilation
After the arith.extf legalization fix, both FP8 targets compile:
- SpacemiT FP8 lowers to
linalg.mmt4dwith4x4x8 - OPU FP8 lowers to
linalg.mmt4dwith16x16x8
Configured dispatch artifacts:
/scratch2/agustin/merlin/build/compiled_models/SpacemiT/spacemit_x60_RVV_matmul_fp8_2048/configs/configured_module_matmul_fp8_2048_dispatch_0.mlir/scratch2/agustin/merlin/build/compiled_models/SpacemiT/saturn_opu_OPU_matmul_fp8_2048/configs/configured_module_matmul_fp8_2048_dispatch_0.mlir
Both now show linalg.mmt4d instead of failing during LLVMCPU lowering.
What Did Not Work (and Why)
1. FP8 is not efficient yet
FP8 now compiles, but the assembly is not good enough yet.
The SpacemiT FP8 assembly contains:
- repeated temporary vector stores/loads to stack
- scalar extraction patterns
- many calls to
__truncsfhf2 - no
vmadotFP8-like inner-product instruction in the hot loop
Relevant assembly:
/scratch2/agustin/merlin/build/compiled_models/SpacemiT/spacemit_x60_RVV_matmul_fp8_2048/files/module_matmul_fp8_2048_linked_embedded_elf_riscv_64.s
This means the path is compiling through generic legalized vector/scalar code, not a target-native FP8 kernel.
The same issue exists for OPU FP8:
- it lowers through
mmt4d - but the dumped
.sdoes not showvopacc-style FP8 hardware usage - it contains extensive software conversion/truncation traffic
So the current FP8 result is:
- compiler path fixed
- code generation path still not hardware-accelerated
2. OPU FP8 high-K tile was too aggressive
Trying to force OPU FP8 into 16x16x128 immediately failed legality checks due
to enormous intermediate vector contracts.
That was reduced to K=8 as a temporary compiler-side compromise. The int8 OPU
path keeps K=128; the FP8 OPU path does not yet have the dedicated lowering
needed to support that shape efficiently.
Debugging Notes
The debugging loop that worked best here was:
- confirm the selected lowering config in
configured_module_*.mlir - confirm that the op is really
linalg.mmt4d - compile with
--dump-artifacts - inspect the first hot loop in the dumped
.s
This was important because simply seeing the right tile sizes in MLIR was not enough. FP8 is the main example: the IR shape was acceptable, but the final assembly made it obvious that the path was still software-heavy.
The key compiler bug found during this workstream was the illegal
arith.extf from FP8 vectors to f16, which was only visible once FP8 matmul
started reaching the LLVMCPU lowering pipeline.
Test Coverage and Exact Commands
Build:
conda run -n merlin-dev uv run tools/merlin.py build \
--profile full-plugin \
--config release \
--cmake-target iree-compile
Unsupported-float regression:
cd third_party/iree_bar
/scratch2/agustin/merlin/build/host-merlin-release/tools/iree-opt \
--split-input-file \
--pass-pipeline="builtin.module(func.func(iree-convert-unsupported-float-arith))" \
compiler/src/iree/compiler/Codegen/Common/test/convert_unsupported_float_arith.mlir \
| /scratch2/agustin/merlin/build/host-merlin-release/llvm-project/bin/FileCheck \
compiler/src/iree/compiler/Codegen/Common/test/convert_unsupported_float_arith.mlir
Lowering-strategy tests:
cd third_party/iree_bar
/scratch2/agustin/merlin/build/host-merlin-release/tools/iree-opt \
--pass-pipeline='builtin.module(iree-llvmcpu-select-lowering-strategy)' \
--split-input-file \
compiler/src/iree/compiler/Codegen/LLVMCPU/test/select_riscv_spacemit_lowering_strategy.mlir \
| /scratch2/agustin/merlin/build/host-merlin-release/llvm-project/bin/FileCheck \
compiler/src/iree/compiler/Codegen/LLVMCPU/test/select_riscv_spacemit_lowering_strategy.mlir
/scratch2/agustin/merlin/build/host-merlin-release/tools/iree-opt \
--pass-pipeline='builtin.module(iree-llvmcpu-select-lowering-strategy)' \
--split-input-file \
compiler/src/iree/compiler/Codegen/LLVMCPU/test/select_riscv_opu_lowering_strategy.mlir \
| /scratch2/agustin/merlin/build/host-merlin-release/llvm-project/bin/FileCheck \
compiler/src/iree/compiler/Codegen/LLVMCPU/test/select_riscv_opu_lowering_strategy.mlir
Artifact-driven compiles:
benchmarks/SpacemiTX60/compile_matmul_xsmt_i8_ukernel_all.sh
benchmarks/SpacemiTX60/compile_matmul_xsmt_fp8.sh
benchmarks/SaturnOPU/compile_matmul_opu_i8_ukernel_all.sh
benchmarks/SaturnOPU/compile_matmul_opu_fp8_ukernel_all.sh
Critical Bug Fix: vmadot VL Not Set (2026-03-18)
Root cause
The vmadot kernel (iree_uk_mmt4d_tile_s8s8s32_4x4x8_riscv_64_xsmtvdot_native
in mmt4d_riscv_64_v_i8.c) triggered SIGILL on larger matrices (256x256+)
while working on small ones (32x64). Two bugs:
Bug 1: VL=16 instead of VL=32. The accumulator init used
vsetvli zero, zero, e32, m2 which set VL=16 (for 16 x i32). When switching
to e8, m1 for the input loads, the old code used vsetvli zero, zero which
keeps VL=16. But vmadot requires VL=32 (4x4x8 = 32 bytes per operand).
With VL=16, only half the data was loaded and the hardware trapped.
Bug 2: LLVM bitcode pipeline strips vsetvli. Even after fixing the source
to use vsetvli zero, %0, e8, m1, ta, ma with %0=32, LLVM's RISC-V backend
(in the bitcode link path) merges consecutive vsetvli instructions with the
same type config, converting our explicit VL=32 back to vsetvli zero, zero.
Fix
Use .word 0x0c02f057 (raw encoding of vsetvli zero, t0, e8, m1, ta, ma)
inside the inline asm block, preceded by li t0, 32. This encoding is opaque
to LLVM's bitcode optimizer, so it survives the bitcode pipeline intact.
The fixed hot loop (mmt4d_riscv_64_v_i8.c):
for (int k = 0; k < params->K; ++k) {
asm volatile(
"li t0, 32\n\t"
".word 0x0c02f057\n\t" // vsetvli zero, t0, e8, m1, ta, ma
"vle8.v v0, (%0)\n\t" // Load LHS (32 bytes)
"vle8.v v4, (%1)\n\t" // Load RHS (32 bytes)
".insn r 0x2b, 3, 0x71, v8, v0, v4\n\t" // vmadot v8, v0, v4
:
: "r"(lhs_ptr), "r"(rhs_ptr)
: "memory", "t0");
lhs_ptr += 32;
rhs_ptr += 32;
}
Key register assignments:
- v8 accumulator (VRM2, even-aligned: v8-v9, holds 16 x i32 = 4x4 tile)
- v0 LHS input (32 bytes = 4 rows x 8 cols of i8)
- v4 RHS input (32 bytes, does NOT overlap with acc v8-v9)
Critical build step: after changing the kernel, BOTH the host compiler
(iree-compile) AND the cross-compiled runtime must be rebuilt:
# 1. Rebuild host ukernel bitcode + iree-compile
conda run -n merlin-dev uv run tools/merlin.py build \
--profile full-plugin --config release \
--cmake-target iree-compile
# 2. Rebuild cross-compiled runtime
conda run -n merlin-dev uv run tools/merlin.py build \
--profile spacemit \
--cmake-target iree-run-module
The host iree-compile embeds the ukernel bitcode at link time. If only the
spacemit runtime is rebuilt, the old bitcode remains embedded in iree-compile
and the VMFB still contains the broken kernel.
Verified assembly output
After the fix, the VMFB assembly shows:
.LBB0_3: ; hot loop
li t0, 32 ; AVL = 32
.word 201519191 ; vsetvli zero, t0, e8, m1, ta, ma
vle8.v v0, (a0) ; load LHS
vle8.v v4, (a2) ; load RHS
.insn r 43, 3, 113, v8, v0, v4 ; vmadot v8, v0, v4
addi a0, a0, 32
addi a2, a2, 32
...
bnez ..., .LBB0_3
On-Board Benchmark Results (2026-03-18)
Tested on SpacemiT X60 (8-core RISC-V, VLEN=256, Linux 6.1.15).
Board: root@10.44.86.251.
Correctness
Both paths produce correct results for matmul_q_i8_256.mlir
(256x256 quantized matmul, input=1):
- RVV i8:
256x256xi32= all 256. Correct. - xsmtvdot i8:
256x256xi32= all 256. Correct.
Benchmark: 1024x1024 quantized matmul (i8xi8->i32)
Using iree-benchmark-module with multiple iterations:
| Path | Tile | Median | Speedup |
|---|---|---|---|
RVV i8 (vwmul.vx + vwadd.wv) |
8x16x1 | 312 ms | 1.0x |
xsmtvdot i8 (vmadot NPU) |
4x4x8 | 34.8 ms | 9.0x |
# RVV i8
iree-benchmark-module --device=local-task \
--module=bench_1024_rvv.vmfb \
--function=matmul_i8_quantized \
--input="1024x1024xi8=1" --input="1024x1024xi8=1" \
--benchmark_repetitions=5
# xsmtvdot (NPU)
iree-benchmark-module --device=local-task \
--module=bench_1024_xsmtvdot.vmfb \
--function=matmul_i8_quantized \
--input="1024x1024xi8=1" --input="1024x1024xi8=1" \
--benchmark_repetitions=5
The vmadot IME instruction provides a 9x speedup over standard RVV widening multiply-accumulate for int8 quantized matmul.
RVV i8 assembly (reference hot loop)
From tests/e2e/SpacemiT/tmp/matmul_q_i8_rvv/:
.LBB0_1:
vsetvli zero, zero, e8, mf2, ta, ma
vle8.v v24, (t1) ; load RHS (16 i8 elements)
lbu a1, -3(a3) ; load 8 scalar LHS bytes
lbu a0, -2(a3)
...
vwmul.vx v25, v24, a1 ; widening mul: i8*i8 -> i16
vwmul.vx v26, v24, a0
...
vsetvli zero, zero, e16, m1, tu, ma
vwadd.wv v8, v8, v25 ; widening add: i16+i32 -> i32
vwadd.wv v22, v22, v26
...
bnez a5, .LBB0_1
Upstream patch
Patches stored at patches/upstream/:
- riscv64-mmt4d-i8-rvv-kernel-only.patch (606 lines) — just the RVV kernel
- riscv64-mmt4d-i8-full.patch (865 lines) — all files
- README.md — extraction guide
The RVV i8 kernel uses only <riscv_vector.h> intrinsics (no inline assembly,
no vendor extensions). Fully extractable from mmt4d_riscv_64_v_i8.c lines
1-201.
FP8 vfmadot Progress (2026-03-18)
Runtime side: complete
All runtime infrastructure for FP8 f8E4M3FN x f8E4M3FN -> f16 is in place:
| File | Change |
|---|---|
exported_bits.h |
Added IREE_UK_FLAG_MMT4D_TYPE_F8E4M3F8E4M3F16 = 0x0B |
common.h |
Added IREE_UK_TYPE_FLOAT_8 = FLOAT_IEEE \| 3 |
mmt4d_internal.h |
Added iree_uk_mmt4d_type_f8e4m3f8e4m3f16 enum + routing |
exported_bits.h |
Added QUERY_TILE_SIZES_OPERATION_MATMUL_F8E4M3F8E4M3F16 = 0x0700 |
query_tile_sizes_riscv_64_entry_point.c |
Returns {M=4, K=8, N=4} for xsmtvdot |
mmt4d_riscv_64_tiles.inl |
Registered f8e4m3, f8e4m3, f16, 4, 8, _xsmtvdot |
mmt4d_riscv_64_v_i8.c |
iree_uk_mmt4d_tile_f8e4m3f8e4m3f16_4x4x8_riscv_64_xsmtvdot_native() |
The FP8 vfmadot kernel:
// vfmadot: Opcode 0x2b, Funct3 0, Funct7 0x75 (OPFMMA).
// Accumulator: VR (single register, 16 x fp16 = 256 bits).
// Inputs: 32 bytes of f8E4M3FN packed as i8.
asm volatile(
"li t0, 32\n\t"
".word 0x0c02f057\n\t" // vsetvli zero, t0, e8, m1, ta, ma
"vle8.v v0, (%0)\n\t" // Load LHS (32 bytes f8)
"vle8.v v4, (%1)\n\t" // Load RHS (32 bytes f8)
".insn r 0x2b, 0, 0x75, v8, v0, v4\n\t" // vfmadot v8, v0, v4
:
: "r"(lhs_ptr), "r"(rhs_ptr)
: "memory", "t0");
Key differences from int8 vmadot:
- Accumulator: e16, m1 (fp16) not e32, m2 (i32)
- Load/store: vle16.v/vse16.v for accumulator
- Encoding: funct7=0x75 (OPFMMA) and funct3=0 (standard FP)
Compiler side: blocking
Compiler side: complete (2026-03-18)
Added FP8 ukernel routing in CPULowerToUKernels.cpp:
// In the mmt4d type matching (line ~215):
} else if (isa<FloatType>(lhsElemType) &&
lhsElemType.getIntOrFloatBitWidth() == 8 &&
isa<FloatType>(rhsElemType) &&
rhsElemType.getIntOrFloatBitWidth() == 8 &&
outElemType.isF16()) {
flags = IREE_UK_FLAG_MMT4D_TYPE_F8E4M3F8E4M3F16;
// In the query tile sizes (line ~507):
} else if (isa<FloatType>(lhs) && lhs.getIntOrFloatBitWidth() == 8 &&
isa<FloatType>(rhs) && rhs.getIntOrFloatBitWidth() == 8 &&
out.isF16()) {
return IREE_UK_FLAG_QUERY_TILE_SIZES_OPERATION_MATMUL_F8E4M3F8E4M3F16;
Assembly verification: vfmadot present
After rebuilding iree-compile, the FP8 matmul hot loop now contains:
.LBB0_2:
li t0, 32 ; AVL = 32
.word 201519191 ; vsetvli zero, t0, e8, m1, ta, ma
vle8.v v0, (a0) ; load LHS (32 bytes f8E4M3FN)
vle8.v v4, (a1) ; load RHS (32 bytes f8E4M3FN)
.insn r 43, 0, 117, v8, v0, v4 ; vfmadot v8, v0, v4 (fp8->fp16)
addi a0, a0, 32
addi a1, a1, 32
Only 12 remaining software conversions (for pack/unpack boundary ops), down from 652 in the baseline.
Board test: vfmadot SIGILL
The SpacemiT X60 board has vmadot (OPMMA, integer) but not vfmadot
(OPFMMA, floating-point). The FP8 VMFB correctly generates vfmadot
instructions but they trap with SIGILL on the current hardware revision.
The int8 vmadot works and gives 9x speedup. The FP8 vfmadot will work on future hardware that implements the full OPFMMA instruction set.
Saturn OPU Per-Operand Encoding Optimization (2026-03-18)
Problem: strided loads dominate OPU kernel throughput
The original OPU kernel used vlse8.v (strided vector loads) because IREE's
mmt4d packs tiles as [M0, K0] row-major. To extract a column of M0=16
elements for one k0 value, the kernel needs stride=K0 access:
; BEFORE: strided loads (1 element/cycle on Saturn VLSU)
vlse8.v v16, (lhs_ptr + k0), stride ; 16 cycles
vlse8.v v18, (rhs_ptr + k0), stride ; 16 cycles
.insn r 87, 2, 81, zero, v16, v18 ; VOPACC ~4 cycles
; Total per k0: ~36 cycles (load-bound, 16x overhead)
Saturn's documentation (Section 4.6) confirms: "Saturn's VLSU is designed towards deployment as a DSP system, and thus fundamentally has limited performance on indexed or strided accesses, as it can only generate one element's address per cycle." Contiguous loads run at full dLen bandwidth (16 elements/cycle for dLen=128).
The Saturn OPU benchmarks (third_party/saturn-vectors/benchmarks/opu-gemm/)
avoid this entirely by pre-transposing the A matrix so M is innermost,
enabling vle8.v.
Solution: per-operand encoding for xopu targets only
IREE's getEncodingInfoImpl() is called once per operand (LHS, RHS,
result), and the encoding attribute carries operandIdx. We exploit this to
swap the inner dimension order for LHS/RHS when +xopu is present, without
affecting the result operand or any other target.
Compiler change (CPUEncodingExternalModels.cpp, ~10 lines):
// In getEncodingInfoImpl(), after getEncodingInfoForMatmul():
if (hasFeature(layoutAttr.getConfiguration(), "+xopu")) {
int64_t operandIdx = encoding.getOperandIndex().getInt();
if (operandIdx != IREE::Encoding::MATMUL_RESULT &&
info.innerDimsPos.size() >= 2) {
size_t sz = info.innerDimsPos.size();
std::swap(info.innerDimsPos[sz - 2], info.innerDimsPos[sz - 1]);
std::swap(info.innerTileSizes[sz - 2], info.innerTileSizes[sz - 1]);
}
}
This changes:
- LHS tile: [M0, K0] → [K0, M0] (M innermost = contiguous)
- RHS tile: [N0, K0] → [K0, N0] (N innermost = contiguous)
- Result: unchanged ([M0, N0], has no K dim, skipped by condition)
- outerDimsPerm: unchanged (mmt4d outer iteration order preserved)
What's NOT affected:
- RVV _v path (K0=1, stride=1, already contiguous)
- SpacemiT _xsmtvdot path (no +xopu feature)
- ARM, x86, or any other architecture
- The result operand encoding
Kernel change (mmt4d_riscv_64_v_i8.c): replace vlse8.v with vle8.v:
; AFTER: contiguous loads (16 elements/cycle on Saturn VLSU)
vle8.v v16, (lhs_ptr + k0*16) ; 1 cycle
vle8.v v18, (rhs_ptr + k0*16) ; 1 cycle
.insn r 87, 2, 81, zero, v16, v18 ; VOPACC ~4 cycles
; Total per k0: ~6 cycles (compute-bound, optimal)
The transpose cost is absorbed into the linalg.pack operation that runs
once before the mmt4d kernel. The pack already copies and tiles the data;
with the encoding swap it simply writes [K0, M0] order instead of [M0, K0].
This is a one-time cost amortized over all K iterations.
Verified assembly
Compiled matmul_i8.mlir with +xopu:
.LBB0_2: ; hot loop (k0 unrolled by 2)
vle8.v v16, (a1) ; contiguous LHS load
vle8.v v18, (a0) ; contiguous RHS load
.insn r 87, 2, 81, zero, a6, s2 ; VOPACC m0, v16, v18
vle8.v v20, (a5) ; LHS (k0+1)
vle8.v v22, (a3) ; RHS (k0+1)
.insn r 87, 2, 81, zero, s4, s6 ; VOPACC m0, v20, v22
addi a0, a0, 32
addi a1, a1, 32
bltu a2, a4, .LBB0_2
No vlse8.v (strided) in the hot loop. All loads are vle8.v (contiguous).
No stack spills. 6x improvement on the inner loop (36 → 6 cycles per k0).
Additional fixes in this session
-
VOPACC/OPFMACC operand order bug: rs1 and rs2 were swapped, producing transposed output. Cross-referenced with
bme.h: rs1=LHS(rows), rs2=RHS(cols). Fixed for both int8 VOPACC and fp8 OPFMACC. -
OPMVINBCAST register: changed from mc0 (x16, column-broadcast) to m0 (x0, row-broadcast) to match the Saturn benchmark pattern.
-
Removed duplicate function definitions: cleaned up broken sed-edit remnants (K0=1 fallback functions conflicting with K0=16 tile functions).
Why K0=16 (and the case for K0=128)
With contiguous loads, the inner compute per k0 is identical regardless of
K0. The total VOPACC calls and loads are always 2 * total_K. However, K0
controls the split between the inner k0 loop and the outer K-tile loop, and
the outer loop has real overhead per iteration:
Per K-tile overhead (~5 cycles):
- 2x pointer arithmetic (lhs_k_ptr, rhs_k_ptr)
- 1x vsetvli e8,m1 (may stall Shuttle pipeline on vtype transition)
- 1x loop branch + counter increment
- potential branch misprediction on small trip counts
For total K=128:
| K0 | K-tiles | Overhead (cycles) | Compute (cycles) | Overhead % |
|---|---|---|---|---|
| 16 | 8 | 40 | 768 | 5.2% |
| 32 | 4 | 20 | 768 | 2.6% |
| 64 | 2 | 10 | 768 | 1.3% |
| 128 | 1 | 5 | 768 | 0.7% |
The overhead scales linearly with K-tiles, so the ~5% penalty is consistent regardless of total K. For large matrices (K=1024): K0=16 → 64 K-tiles × 5 = 320 overhead cycles vs K0=128 → 8 K-tiles × 5 = 40 cycles.
Additional costs beyond raw cycle count: - vsetvli pipeline stall: each K-tile iteration transitions from e32 (accumulator) to e8 (loads). Shuttle may bubble on this vtype change. With K0=16, this happens 8x per tile; with K0=128, only 1x. - Branch prediction: the K-tile loop has a small trip count that may not predict well on Shuttle's in-order pipeline. - Icache pressure: more outer iterations = more fetch cycles for the loop preamble.
Current choice: K0=16. This was originally motivated by strided loads
(stride=K0=16 = 1 cache line). With contiguous loads that constraint is
gone, but K0=16 still has practical advantages:
- Divides all common K dimensions (64, 128, 256, 512) cleanly
- Pack tiles are small (256B), good for L1 locality
- The kernel code already uses params->K0 dynamically
TODO: evaluate K0=128. See follow-up task below for what this requires.
Upstream RVV i8 patch
Updated clean patch at patches/upstream/riscv64-mmt4d-i8-rvv-only.patch.
Contains only standard RVV code (no xsmtvdot, no xopu, no fp8). 6 file diffs:
- New
mmt4d_riscv_64_v_i8.c(201 lines, pure RVV intrinsics) mmt4d_riscv_64_tiles.inl(5_vs8s8s32 entries)query_tile_sizes_riscv_64_entry_point.c(i8i8i32 query with_vbranch)CPUEncodingExternalModels.cpp(standard RVV i8 tile enumeration)CMakeLists.txt(add source to bitcode library)BUILD.bazel(add source to srcs)
Follow-Up Tasks
- Saturn FPGA/simulator test: verify OPU int8 and fp8 correctness
- Upstream PR: submit the RVV i8 patch to IREE upstream
- FP8 OPU on-board: test
OPFMACCassembly on Saturn hardware - E5M2 altfmt: add
VSETVLI_ALTFMTsupport (vtypei bit 8 = 1)
TODO: Evaluate K0=128 for OPU
Increasing K0 from 16 to 128 eliminates ~5% outer-loop overhead per tile.
The kernel code already uses params->K0 dynamically, so no kernel changes
are needed. The changes required are compiler-side only:
Files to modify (3 files, ~6 lines total):
CPUEncodingExternalModels.cpp— change xopu tile from{16, 16, 16}to{16, 16, 128}inenumerateMatmulTileRiscv64()for both i8 and fp8query_tile_sizes_riscv_64_entry_point.c— change xopu return from{.M=16, .K=16, .N=16}to{.M=16, .K=128, .N=16}mmt4d_riscv_64_tiles.inl— change K0 from 16 to 128 in all xopu entries (10 lines: 5 for s8s8s32, 5 for f8e4m3f8e4m3f32)
Implications:
| Aspect | K0=16 | K0=128 |
|---|---|---|
| Pack tile size (LHS) | 256B | 2KB |
| Pack tile size (RHS) | 256B | 2KB |
| Outer loop overhead | ~5% | ~0.7% |
| K divisibility | K%16==0 | K%128==0 |
| L1 cache pressure | minimal | still fits (32KB L1) |
Risks:
- K must be divisible by 128 (or IREE pads, adding waste). Common NN
dimensions (128, 256, 512, 1024) divide cleanly. Odd dimensions
(e.g. K=192) would need 128+64 split with K0=128 and padding/fallback
for the remainder.
- Larger pack tiles may delay pipeline start: the first mmt4d invocation
can't begin until a full 2KB tile is packed, vs 256B with K0=16.
- No impact on correctness — the kernel uses params->K0 dynamically.
Validation plan:
1. Change the 3 files above
2. Rebuild host (--config release --with-plugin)
3. Compile matmul_i8.mlir with +xopu and verify .s shows same
vle8.v + VOPACC hot loop (just more inner iterations)
4. Benchmark K0=16 vs K0=128 on Saturn simulator for 1024x1024 matmul
5. If K0=128 shows measurable improvement, adopt it as default
Dev-blog written by: Agustin Coppari Hollmann