Guide: A/B Benchmarking IREE Ukernels for Saturn OPU on FireSim
This guide details the complete workflow to perform an A/B comparison benchmark between:
- Baseline: The default, generic linalg.generic implementation of a matrix multiplication, as compiled by IREE.
- Optimized: The new linalg.mmt4d implementation that is lowered to your custom-patched Saturn OPU microkernel.
To integrate the OPU instructions, we modified a few IREE compiler and runtime files. In particular:
- third_party/iree_bar/compiler/src/iree/compiler/Codegen/ExternalInterfaces/CPUEncodingExternalModels.cpp
- third_party/iree_bar/runtime/src/iree/builtins/ukernel/arch/riscv_64/mmt4d_riscv_64_tiles.inl
- third_party/iree_bar/runtime/src/iree/builtins/ukernel/arch/riscv_64/mmt4d_riscv_64_v.c
Have a look at those files if you want to understand how we integrated the outer product as a replacement for the regular mmt4d matrix multiplication ukernel.
Part 1: Build the IREE Toolchain
You should follow the first 5 steps in the iree_setup.md documentation.
Key steps are:
- Set up the Conda environment.
- Set the WORKSPACE_DIR, IREE_SRC, BUILD_HOST_DIR, etc.
- Build the host tools (like iree-compile).
- Build the RISC-V tools (like iree-benchmark-module).
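Before moving on, it can help to confirm that the environment from the setup steps is actually in place. The following is a minimal sketch, assuming the variables are the ones named in this guide (WORKSPACE_DIR, IREE_SRC, BUILD_HOST_DIR, BUILD_RISCV_DIR); the function name check_iree_env is illustrative, and the exact list in iree_setup.md may differ.

```shell
# check_iree_env: report which of the expected variables are unset or empty.
# The variable list is an assumption taken from the names used in this guide.
check_iree_env() {
    missing=0
    for name in WORKSPACE_DIR IREE_SRC BUILD_HOST_DIR BUILD_RISCV_DIR; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check_iree_env && echo "environment looks complete"
```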
Part 2: Generate and Compile the Model (A/B Test)
This is the most critical stage. We will compile the same model twice: once with our ukernels enabled (Optimized) and once with them disabled (Baseline).
Step 2.1: Generate ONNX quantized model
From your samples/custom_dispatch_ukernels_saturn directory, run the export script.
cd samples/custom_dispatch_ukernels_saturn
# Export a simple MLP (fully connected) model; it uses a batch size of 16 so that the OPU instruction is triggered
python export_models_onnx.py --model fc
Step 2.2: Convert ONNX to MLIR
Convert the new, batched ONNX model to an MLIR file.
# This uses the compiler you built in Part 1
${BUILD_HOST_DIR}/bin/iree-import-onnx \
compilation_phases_fc/model_quantized_ort.onnx \
--opset-version 20 \
-o model_quantized_ort.mlir
Step 2.3: Compile A/B Benchmark Artifacts
Now we compile model_quantized_ort.mlir twice to generate the self-contained benchmark .vmfb files.
We will use the riscv64 target triple and the +zvl128b feature, which is the VLEN we are targeting.
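For intuition about what +zvl128b implies: the Zvl128b extension guarantees a VLEN of at least 128 bits, so one vector register at LMUL=1 holds VLEN divided by the element width. The sketch below only illustrates that arithmetic; the function name is hypothetical and nothing here is read from the compiler.

```shell
# lanes_per_register <vlen_bits> <element_bits>
# Number of elements that fit in one vector register at LMUL=1.
lanes_per_register() {
    vlen_bits="$1"
    sew_bits="$2"
    echo $(( vlen_bits / sew_bits ))
}

lanes_per_register 128 8     # i8 operands: prints 16
lanes_per_register 128 32    # i32 accumulators: prints 4
```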
- Compile the Optimized (_s) Kernels
# Compile with ukernels Enabled
${BUILD_HOST_DIR}/tools/iree-compile \
model_quantized_ort.mlir \
-o /dev/null \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-triple=riscv64-unknown-linux-gnu \
--iree-llvmcpu-target-cpu-features="+m,+a,+f,+d,+v,+zvl128b,+zvfh,+zvbb" \
--iree-llvmcpu-target-abi=lp64d \
--iree-dispatch-creation-data-tiling \
--iree-llvmcpu-enable-ukernels="all" \
--iree-flow-export-benchmark-funcs \
--iree-opt-level=O3 \
--iree-hal-dump-executable-files-to=/scratch2/agustin/merlin/samples/custom_dispatch_ukernels_saturn/compilation_phases_fc/riscv/executables_opu
# --- This creates the self-contained benchmark .mlir files ---
# We now compile those .mlir files into the final .vmfb binaries
${BUILD_HOST_DIR}/tools/iree-compile \
riscv/executables_opu/module_main_graph\$async_dispatch_1_embedded_elf_riscv_64_benchmark.mlir \
-o ukernel_1_s.vmfb \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-triple=riscv64-unknown-linux-gnu \
--iree-llvmcpu-enable-ukernels="all" \
--iree-llvmcpu-target-cpu-features="+m,+a,+f,+d,+v,+zvl128b,+zvfh,+zvbb" \
--iree-llvmcpu-target-abi=lp64d \
--iree-opt-level=O3
${BUILD_HOST_DIR}/tools/iree-compile \
riscv/executables_opu/module_main_graph\$async_dispatch_2_embedded_elf_riscv_64_benchmark.mlir \
-o ukernel_2_s.vmfb \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-triple=riscv64-unknown-linux-gnu \
--iree-llvmcpu-enable-ukernels="all" \
--iree-llvmcpu-target-cpu-features="+m,+a,+f,+d,+v,+zvl128b,+zvfh,+zvbb" \
--iree-llvmcpu-target-abi=lp64d \
--iree-opt-level=O3
- Compile the Baseline (normal) Kernels
This command disables ukernels, forcing the compiler to use the generic CPUDoubleTilingExpert pipeline.
# Compile with ukernels Disabled
${BUILD_HOST_DIR}/tools/iree-compile \
model_quantized_ort.mlir \
-o /dev/null \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-triple=riscv64-unknown-linux-gnu \
--iree-llvmcpu-target-cpu-features="+m,+a,+f,+d,+v,+zvl128b,+zvfh,+zvbb" \
--iree-llvmcpu-target-abi=lp64d \
--iree-dispatch-creation-data-tiling \
--iree-llvmcpu-enable-ukernels="none" \
--iree-flow-export-benchmark-funcs \
--iree-opt-level=O3 \
--iree-hal-dump-executable-files-to=/scratch2/agustin/merlin/samples/custom_dispatch_ukernels_saturn/compilation_phases_fc/riscv/executables_baseline
# --- Compile the baseline .mlir benchmark files ---
${BUILD_HOST_DIR}/tools/iree-compile \
riscv/executables_baseline/module_main_graph\$async_dispatch_1_embedded_elf_riscv_64_benchmark.mlir \
-o ukernel_1.vmfb \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-triple=riscv64-unknown-linux-gnu \
--iree-llvmcpu-enable-ukernels="none" \
--iree-llvmcpu-target-cpu-features="+m,+a,+f,+d,+v,+zvl128b,+zvfh,+zvbb" \
--iree-llvmcpu-target-abi=lp64d \
--iree-opt-level=O3
${BUILD_HOST_DIR}/tools/iree-compile \
riscv/executables_baseline/module_main_graph\$async_dispatch_2_embedded_elf_riscv_64_benchmark.mlir \
-o ukernel_2.vmfb \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-triple=riscv64-unknown-linux-gnu \
--iree-llvmcpu-enable-ukernels="none" \
--iree-llvmcpu-target-cpu-features="+m,+a,+f,+d,+v,+zvl128b,+zvfh,+zvbb" \
--iree-llvmcpu-target-abi=lp64d \
--iree-opt-level=O3
You now have your four target files: ukernel_1.vmfb, ukernel_1_s.vmfb, ukernel_2.vmfb, and ukernel_2_s.vmfb.
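The four per-dispatch commands above differ only in the ukernel mode, the input file and the output name. A minimal sketch of a helper that factors out the shared flags is shown here; the function names and the DRY_RUN switch are illustrative additions, while the flags themselves are copied from the commands in this part.

```shell
# common_flags <mode>: flags shared by the per-dispatch iree-compile calls.
# <mode> is "all" for the optimized build and "none" for the baseline.
common_flags() {
    mode="$1"
    printf '%s\n' \
        "--iree-hal-target-backends=llvm-cpu" \
        "--iree-llvmcpu-target-triple=riscv64-unknown-linux-gnu" \
        "--iree-llvmcpu-target-cpu-features=+m,+a,+f,+d,+v,+zvl128b,+zvfh,+zvbb" \
        "--iree-llvmcpu-target-abi=lp64d" \
        "--iree-llvmcpu-enable-ukernels=${mode}" \
        "--iree-opt-level=O3"
}

# compile_benchmark <mode> <input.mlir> <output.vmfb>
# With DRY_RUN=1 the command line is printed instead of executed.
compile_benchmark() {
    mode="$1"; input="$2"; output="$3"
    set -- "${BUILD_HOST_DIR:-.}/tools/iree-compile" "$input" -o "$output"
    for flag in $(common_flags "$mode"); do
        set -- "$@" "$flag"
    done
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s ' "$@"
        printf '\n'
    else
        "$@"
    fi
}

# Usage (single quotes keep the '$' in the file name literal):
#   compile_benchmark all 'riscv/executables_opu/module_main_graph$async_dispatch_1_embedded_elf_riscv_64_benchmark.mlir' ukernel_1_s.vmfb
```

Keeping the flags in one place makes it harder for the baseline and optimized builds to drift apart in anything other than the ukernel setting, which is the point of an A/B test.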
Part 3: Prepare the FireSim Workload
- Copy binaries from ${BUILD_RISCV_DIR}/tools/ into your overlay folder. Specifically, copy iree-benchmark-executable, iree-benchmark-module and iree-run-module.
- Copy the generated .vmfb files for each ukernel or model you want to test into that same folder.
- Cross-compile a small program to measure cycles, or use your preferred method. Mine is:
#include <stdio.h>
int main() {
unsigned long cycles;
// rdcycle reads the user-level 'cycle' CSR (a read-only shadow of 'mcycle')
asm volatile ("rdcycle %0" : "=r"(cycles));
printf("%lu\n", cycles);
return 0;
}
- Create a run_iree.sh script to run the benchmarks:
#!/bin/bash
cd /
echo "--- Running IREE Microbenchmark Tests ---"
# --- Test Definitions ---
FUNC_1='main_graph$async_dispatch_1_embedded_elf_riscv_64_main_graph$async_dispatch_1_matmul_like_16x128x1024_i8xi8xi32'
FUNC_2='main_graph$async_dispatch_2_embedded_elf_riscv_64_main_graph$async_dispatch_2_matmul_like_16x10x128_i8xi8xi32'
# --- Array of modules to test ---
MODULES_TO_TEST=(
"ukernel_1.vmfb"
"ukernel_1_s.vmfb"
"ukernel_2.vmfb"
"ukernel_2_s.vmfb"
)
# --- Array of corresponding functions ---
FUNCTIONS_TO_CALL=(
"$FUNC_1"
"$FUNC_1"
"$FUNC_2"
"$FUNC_2"
)
# --- Run all 4 tests ---
for i in {0..3}; do
MODULE_FILE=${MODULES_TO_TEST[$i]}
FUNCTION_NAME=${FUNCTIONS_TO_CALL[$i]} # Not passed below; add --function="$FUNCTION_NAME" to benchmark only that function
TEST_NUM=$((i + 1))
echo "--- Test $TEST_NUM: Benchmarking $MODULE_FILE ---"
echo "--- Capturing Start Cycle ---"
./get_cycle > /start_cycle_$TEST_NUM.txt
./iree-benchmark-module \
--device=local-sync \
--benchmark_report_aggregates_only=true \
--benchmark_display_aggregates_only=true \
--benchmark_time_unit=ns \
--benchmark_min_warmup_time=1 \
--benchmark_repetitions=10 \
--module=$MODULE_FILE > /output_$TEST_NUM.txt
echo "--- Capturing End Cycle ---"
./get_cycle > /end_cycle_$TEST_NUM.txt
done
echo "--- All Benchmarks Finished ---"
echo
# --- Calculate and print all results ---
for i in {0..3}; do
TEST_NUM=$((i + 1))
MODULE_FILE=${MODULES_TO_TEST[$i]}
START_CYCLE=$(cat /start_cycle_$TEST_NUM.txt)
END_CYCLE=$(cat /end_cycle_$TEST_NUM.txt)
TOTAL_CYCLES=$((END_CYCLE - START_CYCLE))
echo "========================================="
echo "Results for: $MODULE_FILE"
echo "========================================="
echo "TOTAL SIMULATION CYCLES (from ./get_cycle): $TOTAL_CYCLES"
echo "--- iree-benchmark-module Output (use 'Time' for exec cycles) ---"
cat /output_$TEST_NUM.txt
echo
done
poweroff
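Once the script has printed the total cycles for a baseline/optimized pair, the A/B comparison reduces to a ratio. The following is a minimal sketch, assuming you copy the two cycle counts from the output by hand; the function name and the sample numbers are illustrative, not measurements.

```shell
# speedup <baseline_cycles> <optimized_cycles>
# Prints baseline/optimized with two decimals; fails if the divisor is not positive.
speedup() {
    awk -v b="$1" -v o="$2" 'BEGIN {
        if (o <= 0) { print "n/a"; exit 1 }
        printf "%.2f\n", b / o
    }'
}

speedup 1000 250    # hypothetical counts: prints 4.00
```

Note that the totals captured by get_cycle include process start-up and warm-up, so this ratio is a coarse end-to-end figure rather than a pure kernel speedup.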
Part 4: Benchmark the executable instead of the module (WiP)
After copying the iree-benchmark-executable and our .vmfb files, we must now extract the .so files out of them.
# Do this on your host machine before building the workload
# IMPORTANT: Rename the extracted .so files so that they don't overwrite each other
cd /path/to/your/workload/overlay/
unzip ukernel_1.vmfb # Extracts ukernel_1.so (placeholder name)
unzip ukernel_1_s.vmfb # Extracts ukernel_1_s.so (placeholder name)
unzip ukernel_2.vmfb # ...
unzip ukernel_2_s.vmfb # ...
# IMPORTANT: Ensure all files are readable
chmod u+r *.so *.vmfb get_cycle iree-benchmark-executable
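To avoid the overwrite problem mentioned in the comments above, each archive can be unpacked into its own scratch directory and the result renamed after the .vmfb. This is a minimal sketch, assuming each .vmfb contains exactly one file ending in .so; the function names are illustrative.

```shell
# so_name_for <file.vmfb>: derive a unique .so name from the .vmfb name.
so_name_for() {
    base="$(basename "$1" .vmfb)"
    printf '%s.so\n' "$base"
}

# extract_so <file.vmfb>: unpack into a scratch directory, then move the
# first .so found next to the .vmfb under its unique name.
extract_so() {
    vmfb="$1"
    scratch="$(mktemp -d)" || return 1
    # unzip may report warnings for this kind of archive, so rely on the
    # presence of a .so file rather than on the exit status alone.
    unzip -o -q "$vmfb" -d "$scratch" || echo "unzip reported a problem for $vmfb" >&2
    found="$(find "$scratch" -type f -name '*.so' | head -n 1)"
    if [ -z "$found" ]; then
        echo "no .so inside $vmfb" >&2
        rm -rf "$scratch"
        return 1
    fi
    mv "$found" "$(dirname "$vmfb")/$(so_name_for "$vmfb")"
    rm -rf "$scratch"
}

# Usage: for f in ukernel_1.vmfb ukernel_1_s.vmfb ukernel_2.vmfb ukernel_2_s.vmfb; do extract_so "$f"; done
```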
We now create a new run_iree.sh that can execute the iree-benchmark-executable:
#!/bin/bash
cd /
echo "--- Running IREE Microbenchmark Tests (Kernel Computation Only) ---"
# --- Tool Definitions ---
BENCH_TOOL="./iree-benchmark-executable"
CYCLE_TOOL="./get_cycle"
# --- Kernel .so Files (Extracted from VMFBs) ---
MODULES_TO_TEST=(
"ukernel_1.so" # Kernel 1: Generic 16x128x1024
"ukernel_2.so" # Kernel 2: Generic 16x10x128
"ukernel_1_s.so" # Kernel 3: Ukernel 16x128x1024
"ukernel_2_s.so" # Kernel 4: Ukernel 16x10x128
)
# --- Parameters for EACH Kernel (Derived from MLIR) ---
# Kernel 1: Generic 16x128x1024 (CPUDoubleTilingExpert)
PARAMS_1="--workgroup_count=4,4,1 --binding=18432xi8 --binding=132864xi8 --binding=18432xi8"
# Kernel 2: Generic 16x10x128 (CPUDoubleTilingExpert)
PARAMS_2="--workgroup_count=2,8,1 --binding=18432xi8 --binding=132864xi8 --binding=18432xi8"
# Kernel 3: Microkernel 16x128x1024 (Mmt4dTilingExpert)
PARAMS_3="--workgroup_count=4,1,1 --binding=20480xi8 --binding=133696xi8 --binding=20480xi8"
# Kernel 4: Microkernel 16x10x128 (Mmt4dTilingExpert)
PARAMS_4="--workgroup_count=1,1,1 --binding=20480xi8 --binding=133696xi8 --binding=20480xi8"
PARAMS_TO_USE=(
"$PARAMS_1"
"$PARAMS_2"
"$PARAMS_3"
"$PARAMS_4"
)
# --- Benchmark Settings ---
# Run 1000 dispatches per measurement (amortization)
BATCH_SIZE=1000
# Run the whole benchmark 10 times (statistical stability)
REPETITIONS=10
TOTAL_DISPATCHES=$((BATCH_SIZE * REPETITIONS))
# --- Run all 4 tests ---
for i in {0..3}; do
SO_FILE=${MODULES_TO_TEST[$i]}
PARAMS=${PARAMS_TO_USE[$i]}
TEST_NUM=$((i + 1))
echo "--- Test $TEST_NUM: Benchmarking $SO_FILE ---"
echo "--- Capturing Start Cycle ---"
$CYCLE_TOOL > /start_cycle_$TEST_NUM.txt
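# Run the benchmark. Each benchmark iteration issues BATCH_SIZE dispatches, and the measurement is repeated REPETITIONS times.
# Note: the benchmark framework may run more than one iteration per repetition, so TOTAL_DISPATCHES is a lower bound on the dispatches actually executed.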
$BENCH_TOOL \
--device=local-sync \
--executable_file=/$SO_FILE \
--executable_format=embedded-elf-riscv_64 \
--entry_point=0 \
$PARAMS \
--batch_size=$BATCH_SIZE \
--benchmark_repetitions=$REPETITIONS \
--benchmark_out=/output_$TEST_NUM.json \
--benchmark_out_format=json
echo "--- Capturing End Cycle ---"
$CYCLE_TOOL > /end_cycle_$TEST_NUM.txt
done
echo "--- All Benchmarks Finished ---"
echo
# --- Calculate and print all results ---
echo "--- Benchmark Results (Cycles per Dispatch) ---"
echo "Total Dispatches per Test: $TOTAL_DISPATCHES (Batch=$BATCH_SIZE, Reps=$REPETITIONS)"
echo
for i in {0..3}; do
TEST_NUM=$((i + 1))
MODULE_FILE=${MODULES_TO_TEST[$i]}
START_CYCLE=$(cat /start_cycle_$TEST_NUM.txt)
END_CYCLE=$(cat /end_cycle_$TEST_NUM.txt)
TOTAL_CYCLES=$((END_CYCLE - START_CYCLE))
# This is your final number:
AVG_CYCLES=$((TOTAL_CYCLES / TOTAL_DISPATCHES))
echo "========================================="
echo "Results for: $MODULE_FILE"
echo "========================================="
echo "TOTAL SIMULATION CYCLES (from ./get_cycle): $TOTAL_CYCLES"
echo "AVERAGE CYCLES PER DISPATCH: $AVG_CYCLES"
echo "--- (Sanity Check: Mean Time from JSON) ---"
grep "real_time_mean" /output_$TEST_NUM.json || echo "JSON output not found."
echo
done
poweroff
Tips
- unzip: short read: This error means your .vmfb file was corrupted or truncated when you copied it into the FireSim workload. Re-build the workload image.
- Illegal instruction (SIGILL):
  - If running on spike: This is expected. spike is a generic RISC-V emulator and does not implement your custom VOPACC instruction.
  - If running on RTL (FireSim): This means your VOPACC implementation in the processor RTL has a bug, or the opcode bits in your IREE mmt4d C-file (.insn r ...) do not match the decoder in your hardware.
- Segmentation fault (SIGSEGV): This almost always means your --binding=... or --workgroup_count=... parameters are wrong. You are telling the kernel to access memory that wasn't allocated. Re-check the MLIR files to derive the correct parameters.
- FAILED_PRECONDITION (Version Mismatch): The iree-compile you used to build the .vmfb is from a different commit than the iree-benchmark-executable you are using to run it. Rebuild both from the same source.
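One way to catch the truncation case early is to record checksums on the host and verify them inside the target image before running anything. This is a minimal sketch, assuming sha256sum is available on both sides; the manifest file name MANIFEST.sha256 and the function names are hypothetical.

```shell
# write_manifest <dir>: record checksums of every .vmfb in <dir> (run on the host).
write_manifest() {
    ( cd "$1" && sha256sum *.vmfb > MANIFEST.sha256 )
}

# verify_manifest <dir>: re-check the files against the manifest
# (run inside the target image, e.g. at the top of run_iree.sh).
verify_manifest() {
    ( cd "$1" && sha256sum -c MANIFEST.sha256 >/dev/null 2>&1 )
}

# Usage on the host:    write_manifest /path/to/your/workload/overlay
# Usage on the target:  verify_manifest / || echo "a .vmfb was corrupted in transit"
```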
Debugging
I recommend looking at the whole compilation process by running:
iree-compile model_quantized_ort.mlir -o model_quantized_ort_riscv.vmfb \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-triple=riscv64-unknown-linux-gnu \
--iree-llvmcpu-target-cpu-features="+m,+a,+f,+d,+v,+zvl512b,+zvfh,+zvbb" \
--iree-llvmcpu-target-abi=lp64d \
--dump-compilation-phases-to=riscv \
--iree-dispatch-creation-data-tiling \
--iree-llvmcpu-enable-ukernels="all" \
--iree-opt-level=O3 \
-mlir-disable-threading \
-mlir-print-ir-after-all 2>log.mlir
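The resulting log.mlir is large, so it helps to search it for the point where a given op first shows up, for example linalg.mmt4d. The following is a minimal sketch, assuming each dump in the log is preceded by a header line containing "IR Dump After <PassName>"; the function name is illustrative.

```shell
# first_pass_with <log> <text>: print the name of the first pass whose IR dump
# contains <text> (literal match). Exits non-zero if <text> never appears.
first_pass_with() {
    awk -v pat="$2" '
        /IR Dump After/ {
            pass = $0
            sub(/^.*IR Dump After /, "", pass)   # drop everything up to the pass name
            sub(/ .*$/, "", pass)                # drop everything after the pass name
            next
        }
        index($0, pat) { print pass; found = 1; exit }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Usage: first_pass_with log.mlir linalg.mmt4d
```

Running it once with ukernels set to "all" and once with "none" gives a quick indication of where the two pipelines start to diverge.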
DUMP
# All commands below were run from /scratch2/agustin/merlin/samples/custom_dispatch_ukernels_saturn/compilation_phases_fc
# Dump the baseline executables with debug info
${BUILD_HOST_DIR}-deb-tracy/tools/iree-compile \
model_quantized_ort.mlir \
-o /dev/null \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-triple=riscv64-unknown-linux-gnu \
--iree-llvmcpu-target-cpu-features="+m,+a,+f,+d,+v,+zvl128b,+zvfh,+zvbb" \
--iree-llvmcpu-target-abi=lp64d \
--iree-dispatch-creation-data-tiling \
--iree-llvmcpu-enable-ukernels="none" \
--iree-opt-level=O3 \
--iree-hal-dump-executable-files-to=/scratch2/agustin/merlin/samples/custom_dispatch_ukernels_saturn/compilation_phases_fc/riscv/executables \
--iree-hal-executable-debug-level=3 \
--iree-llvmcpu-debug-symbols=true \
--iree-llvmcpu-link-embedded=false \
--iree-vm-bytecode-module-strip-source-map=false
# Optimized dispatch 1
${BUILD_HOST_DIR}-deb-tracy/tools/iree-compile \
riscv/executables_opu/module_main_graph\$async_dispatch_1_system_elf_riscv_64_benchmark.mlir \
-o ukernel_1_s.vmfb \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-triple=riscv64-unknown-linux-gnu \
--iree-llvmcpu-enable-ukernels="all" \
--iree-llvmcpu-target-cpu-features="+m,+a,+f,+d,+v,+zvl128b,+zvfh,+zvbb" \
--iree-llvmcpu-target-abi=lp64d \
--iree-opt-level=O3 \
--iree-hal-executable-debug-level=3 \
--iree-llvmcpu-debug-symbols=true \
--iree-llvmcpu-link-embedded=false \
--iree-vm-bytecode-module-strip-source-map=false
# Optimized dispatch 2
${BUILD_HOST_DIR}-deb-tracy/tools/iree-compile \
riscv/executables_opu/module_main_graph\$async_dispatch_2_system_elf_riscv_64_benchmark.mlir \
-o ukernel_2_s.vmfb \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-triple=riscv64-unknown-linux-gnu \
--iree-llvmcpu-enable-ukernels="all" \
--iree-llvmcpu-target-cpu-features="+m,+a,+f,+d,+v,+zvl128b,+zvfh,+zvbb" \
--iree-llvmcpu-target-abi=lp64d \
--iree-opt-level=O3 \
--iree-hal-executable-debug-level=3 \
--iree-llvmcpu-debug-symbols=true \
--iree-llvmcpu-link-embedded=false \
--iree-vm-bytecode-module-strip-source-map=false
# Baseline dispatch 2
${BUILD_HOST_DIR}-deb-tracy/tools/iree-compile \
riscv/executables/module_main_graph\$async_dispatch_2_system_elf_riscv_64_benchmark.mlir \
-o ukernel_2.vmfb \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-triple=riscv64-unknown-linux-gnu \
--iree-llvmcpu-enable-ukernels="none" \
--iree-llvmcpu-target-cpu-features="+m,+a,+f,+d,+v,+zvl128b,+zvfh,+zvbb" \
--iree-llvmcpu-target-abi=lp64d \
--iree-opt-level=O3 \
--iree-hal-executable-debug-level=3 \
--iree-llvmcpu-debug-symbols=true \
--iree-llvmcpu-link-embedded=false \
--iree-vm-bytecode-module-strip-source-map=false
# Baseline dispatch 1
${BUILD_HOST_DIR}-deb-tracy/tools/iree-compile \
riscv/executables/module_main_graph\$async_dispatch_1_system_elf_riscv_64_benchmark.mlir \
-o ukernel_1.vmfb \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-triple=riscv64-unknown-linux-gnu \
--iree-llvmcpu-enable-ukernels="none" \
--iree-llvmcpu-target-cpu-features="+m,+a,+f,+d,+v,+zvl128b,+zvfh,+zvbb" \
--iree-llvmcpu-target-abi=lp64d \
--iree-opt-level=O3 \
--iree-hal-executable-debug-level=3 \
--iree-llvmcpu-debug-symbols=true \
--iree-llvmcpu-link-embedded=false \
--iree-vm-bytecode-module-strip-source-map=false
Placeholder: the last configuration compiled for FireSim
# 2. Compile
${BUILD_HOST_DIR}/tools/iree-compile \
model_quantized_ort.mlir \
-o model_quantized_ort.vmfb \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-triple=riscv64-pc-linux-elf \
--iree-llvmcpu-target-abi=lp64d \
--iree-opt-level=O3 \
\
--iree-llvmcpu-enable-ukernels="all" \
--iree-opt-data-tiling \
--iree-dispatch-creation-data-tiling \
\
--iree-llvmcpu-target-cpu-features="+m,+a,+f,+d,+c,+v,+zvl128b,+zvfh,+zvbb" \
--iree-llvmcpu-target-vector-width-in-bytes=16 \
--riscv-v-fixed-length-vector-lmul-max=2 \
\
--iree-hal-dump-executable-files-to="$DUMP_DIR" \
--iree-llvmcpu-debug-symbols=false