benchmarks.SaturnNPU.kernel_library.rope_frequency
Source: benchmarks/SaturnNPU/kernel_library/rope_frequency.py
Per-element cosine on a 32x32 bf16 tile.
Despite the name, this kernel does not compose the full rotary embedding — it
computes y = cos(x) on a 32x32 bf16 tile (stored in VMEM as two 32x16
halves per the bf16_split_halves layout). Pair it with a sibling sin
kernel (or torch-side pre-computation) to form the full rotary transform.
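
As a host-side illustration of the behaviour described above, here is a minimal NumPy reference sketch, not the kernel's actual implementation. It makes several assumptions. The two 32x16 halves are taken to be the left and right 16 columns of the tile, which may differ from what bf16_split_halves actually specifies. bf16 is emulated by rounding float32 values, because NumPy has no native bf16 dtype. The helper names (`to_bf16`, `cos_tile_reference`, `join_halves`, `rotate_pairs`) are hypothetical and do not come from the kernel library, and the pairing convention in `rotate_pairs` is the generic 2D rotation rather than anything this module confirms.

```python
import numpy as np

TILE, HALF = 32, 16  # one 32x32 tile, held as two 32x16 halves


def to_bf16(x):
    """Round finite float32 values to bf16 precision (nearest-even), kept in float32 storage."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    # Add 0x7FFF plus the lowest kept bit, then drop the low 16 bits.
    rounding = ((bits >> 16) & 1) + np.uint32(0x7FFF)
    return ((bits + rounding) & np.uint32(0xFFFF0000)).view(np.float32)


def cos_tile_reference(x):
    """Return (lo, hi): y = cos(x) for a 32x32 tile, as two 32x16 halves.

    Assumption: the halves are columns [0, 16) and [16, 32).
    """
    x = to_bf16(x)
    assert x.shape == (TILE, TILE)
    lo, hi = x[:, :HALF], x[:, HALF:]
    return to_bf16(np.cos(lo)), to_bf16(np.cos(hi))


def join_halves(lo, hi):
    """Reassemble the two 32x16 halves into one 32x32 tile (same column assumption)."""
    return np.concatenate([lo, hi], axis=1)


def rotate_pairs(x1, x2, cos_t, sin_t):
    """Generic 2D rotation of (x1, x2) by the angle whose cos/sin tiles are given.

    Shown only to illustrate why a sin counterpart is needed; the pairing
    convention used by the real rotary pipeline is not specified here.
    """
    return x1 * cos_t - x2 * sin_t, x1 * sin_t + x2 * cos_t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    angles = rng.uniform(-np.pi, np.pi, size=(TILE, TILE)).astype(np.float32)
    cos_t = join_halves(*cos_tile_reference(angles))
    sin_t = to_bf16(np.sin(to_bf16(angles)))  # stand-in for the sibling sin kernel
    a = rng.standard_normal((TILE, TILE)).astype(np.float32)
    b = rng.standard_normal((TILE, TILE)).astype(np.float32)
    r1, r2 = rotate_pairs(a, b, cos_t, sin_t)
    print(cos_t.shape, r1.shape, r2.shape)
```

A reference like this can serve as a tolerance check for the kernel's output. Because both the input angles and the cosine results are rounded to bf16, agreement with a float32 `np.cos` is only expected to within roughly two or three decimal digits.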