
benchmarks.SaturnNPU.kernel_library.attention_acc

Source: benchmarks/SaturnNPU/kernel_library/attention_acc.py

Multi-tile flash-attention kernel variants (compiler-side plumbing).

The compiler emits a sequence of three attention variants per query row-block, chained by an outer K/V tile loop. The variants share an online-softmax recurrence: _first initializes the running (max, denom, partial output), _mid advances them, and _last divides by the final denominator and emits the result tile.
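The three-variant recurrence can be sketched in plain NumPy. This is a minimal illustration of the online-softmax scheme described above, not the kernel's actual ISA body; the function names mirror the _first/_mid/_last variants, but the signatures and tile shapes are assumptions.

```python
import numpy as np

def attn_first(q, k, v):
    # _first: initialize the running (max, denom, partial output)
    # from the first K/V tile of this query row-block.
    s = q @ k.T                                  # scores for this tile
    m = s.max(axis=-1, keepdims=True)            # running row max
    p = np.exp(s - m)
    denom = p.sum(axis=-1, keepdims=True)        # running softmax denominator
    o = p @ v                                    # unnormalized partial output
    return m, denom, o

def attn_mid(q, k, v, m, denom, o):
    # _mid: advance the recurrence with the next K/V tile,
    # rescaling the old state when the running max grows.
    s = q @ k.T
    m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
    scale = np.exp(m - m_new)                    # correction for the old max
    p = np.exp(s - m_new)
    denom = denom * scale + p.sum(axis=-1, keepdims=True)
    o = o * scale + p @ v
    return m_new, denom, o

def attn_last(q, k, v, m, denom, o):
    # _last: fold in the final tile, then divide by the final
    # denominator and emit the result tile.
    m, denom, o = attn_mid(q, k, v, m, denom, o)
    return o / denom
```

Chained over an outer K/V tile loop (first tile through attn_first, middle tiles through attn_mid, final tile through attn_last), this reproduces exact softmax attention for the row-block without ever materializing the full score matrix.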

Status: the manifest-backed ISA bodies are currently stubs that clone the single-tile attention kernel (npu_uk_attention). They produce the same DMA in/out layout that the eventual flash-attention kernels will use, so the compiler's loop-emission, address-patching, and stitching paths can be exercised end-to-end. Replacing the stub bodies with the real online-softmax recurrence is tracked as a follow-up.