`benchmarks.SaturnNPU.kernel_library.matmul_acc`

Source: benchmarks/SaturnNPU/kernel_library/matmul_acc.py

K-tiled matmul kernel variants that share a single MXU accumulator.

The base matmul kernel computes C = A @ B for one (M=32, K=32, N=32) tile and immediately drains the accumulator. To support matmuls whose K dimension exceeds 32, the compiler emits a sequence of the following three variants along a K-loop, one per K-tile, all targeting the same accumulator state:

matmul_acc_first  — vmatmul.mxu0       (overwrite accumulator, no drain)
matmul_acc_mid    — vmatmul.acc.mxu0   (add to accumulator, no drain)
matmul_acc_last   — vmatmul.acc.mxu0   (add then drain + DMA store)

The scalar/DMA prefix and the vmatpush.weight setup are identical across variants — only the multiply mnemonic and the trailing pop/store block differ.
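
The K-loop described above can be modelled in plain Python to show how the three variants cooperate on one accumulator. This is a minimal sketch, not the compiler's actual logic: `variant_schedule`, `tile_product`, and `emulate` are hypothetical helpers, and the ordering (one `matmul_acc_first`, zero or more `matmul_acc_mid`, one `matmul_acc_last`) is an assumption inferred from the variant descriptions. It also assumes K is a multiple of the tile size and spans at least two tiles.

```python
TILE = 32  # tile edge taken from the docstring (M = K = N = 32)


def variant_schedule(k, tile_k=TILE):
    """Return the assumed kernel variant for each K-tile, in loop order.

    Hypothetical helper: the real selection logic is not shown in the source.
    """
    if k % tile_k != 0:
        raise ValueError("k must be a multiple of tile_k")
    steps = k // tile_k
    if steps < 2:
        raise ValueError("a single K-tile is handled by the base matmul kernel")
    return ["matmul_acc_first"] + ["matmul_acc_mid"] * (steps - 2) + ["matmul_acc_last"]


def tile_product(a, b, k0, tile_k):
    """Partial product of A and B over the K range [k0, k0 + tile_k)."""
    rows, cols = len(a), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(k0, k0 + tile_k)) for j in range(cols)]
        for i in range(rows)
    ]


def emulate(a, b, tile_k=TILE):
    """Emulate the shared accumulator across the K-loop and return C = A @ B."""
    # Stale contents left over from earlier work; the first variant must ignore them.
    acc = [[-1] * len(b[0]) for _ in a]
    out = None
    for step, variant in enumerate(variant_schedule(len(b), tile_k)):
        partial = tile_product(a, b, step * tile_k, tile_k)
        if variant == "matmul_acc_first":
            acc = partial  # overwrite the accumulator, no drain
        else:
            # mid and last both add into the existing accumulator state
            acc = [[x + y for x, y in zip(ra, rp)] for ra, rp in zip(acc, partial)]
        if variant == "matmul_acc_last":
            out = acc  # drain; stands in for the pop + DMA store block
    return out
```

For example, `variant_schedule(128)` yields one first, two mid, and one last step, and `emulate` with a small `tile_k` reproduces an ordinary matrix product regardless of what the accumulator held beforehand.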