models.opu_bench_suite.opu_bench_models
Source: models/opu_bench_suite/opu_bench_models.py
OPU Benchmark Model Suite — Realistic architectures for Saturn OPU evaluation.
Four models designed to stress-test the OPU 16×16 outer-product accelerator while using real architectural patterns from production neural networks. All dimensions are multiples of 16 for clean OPU tile alignment.
Models
- vit_block — Vision Transformer encoder block (attention + FFN)
- convnet — CNN backbone (3×3 + 1×1 convolutions, ResNet-style)
- hybrid — Conv stem → Transformer blocks (MobileViT-style)
- large_mlp — Dense GEMM stress test (large hidden dims)
Usage
```
python opu_bench_models.py --all         # Export all 4
python opu_bench_models.py --model vit   # Export one
```
ConvNet
Bases: Module
Small ResNet-style CNN: 3 stages × 2 blocks, channels 32→64→128.
HybridModel
Bases: Module
Conv stem (64→dim) → reshape to sequence → transformer blocks → head.
LargeMLP
Bases: Module
Deep MLP sized to keep the matmul K dimension small enough to fit in L1 on the OPU (working set per 32×32 output tile ≈ K·64 B; K=512 → 32 KB, which fits comfortably). 6 layers × 128×512×512 ≈ 200 M i8 FMAs total.
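The working-set arithmetic above can be checked with a few lines of Python; the 64 KB L1 capacity used here is an assumed figure for illustration, not a documented OPU parameter:

```python
def tile_working_set_bytes(k: int, tile: int = 32, bytes_per_elem: int = 1) -> int:
    """Bytes of operand panels resident while computing one tile x tile
    output tile: a tile x K row panel of A plus a K x tile column panel
    of B, i.e. 2 * tile * K bytes for int8 operands."""
    return 2 * tile * k * bytes_per_elem

L1_BYTES = 64 * 1024  # assumed L1 capacity, for illustration only

ws = tile_working_set_bytes(512)
print(ws, ws <= L1_BYTES)  # 32768 True -> K=512 fits comfortably

# Total i8 FMAs for the 6-layer 128x512x512 stack:
total_fmas = 6 * 128 * 512 * 512
print(total_fmas)  # 201326592 ~= 200 M
```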
MLPFast
Bases: Module
MLP designed to maximize OPU matmul dominance. Modeled on the same principles as TinyLlama: all matmul shapes are TinyLlama-scale (K ≥ 1024), no per-layer non-matmul overhead beyond ReLU, so the matmul-% of total compute approaches 99%. Expected speedup > 2× (and in practice closer to 5× because the OPU 32×32 tile amortizes memory bandwidth well at K=1024).
Matmul shapes produced
- Layer 0: 128 × 1024 × 128 (input 128 → hidden 1024)
- Layer i: 128 × 1024 × 1024 (depth-2 copies)
- Last: 128 × 16 × 1024 (hidden 1024 → 16 classes; N=16 < 32, so the encoding resolver picks 16×16)
All dims except the 16-wide classifier head are multiples of 32. Summing M·N·K over the shapes above gives ≈ 287 M MACs for depth=4, dominated by matmul.
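A minimal sketch of that shape list and the MAC total, assuming "depth" counts all layers including the input projection and the head (the function name and defaults are illustrative, not the module's API):

```python
def mlpfast_matmul_shapes(depth: int = 4, batch: int = 128,
                          in_dim: int = 128, hidden: int = 1024,
                          num_classes: int = 16):
    """(M, N, K) GEMM shapes for the MLPFast stack described above."""
    shapes = [(batch, hidden, in_dim)]                 # layer 0
    shapes += [(batch, hidden, hidden)] * (depth - 2)  # middle layers
    shapes += [(batch, num_classes, hidden)]           # classifier head
    return shapes

shapes = mlpfast_matmul_shapes()
total_macs = sum(m * n * k for m, n, k in shapes)
print(shapes)
print(total_macs)  # 287309824 ~= 287 M MACs for depth=4
```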
ViTAllTokens
Bases: Module
ViT with conv stem + transformer blocks + all-token head.
The conv stem (3→64→dim, two 3×3 stride-2 convs) produces "direct conv" dispatches in the decomposition, similar to DroNet. The transformer blocks produce OPU 32×32 matmul dispatches. The head Linear applied to ALL tokens (not CLS-only) produces an encoding 16×16 dispatch (N=output_dim < 32). This gives a rich mixed decomposition: direct conv + OPU 32×32 + encoding 16×16 + reduction/softmax + elementwise.
Input: a spatial image [B, 3, H, H] where H = sqrt(seq_len) × 4. The conv stem (3→64→dim, two stride-2 convs) reduces the spatial extent to sqrt(seq_len) × sqrt(seq_len), which is then reshaped to [B, seq_len, dim].
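The spatial bookkeeping can be traced in a few lines (seq_len=64 and dim=128 are assumed example values, not the module's actual defaults):

```python
import math

def vit_all_tokens_shapes(seq_len: int = 64, batch: int = 1, dim: int = 128):
    """Trace shapes through the conv stem: image -> feature map -> sequence."""
    side = math.isqrt(seq_len)
    assert side * side == seq_len, "seq_len must be a perfect square"
    h = side * 4      # input image is [B, 3, H, H] with H = sqrt(seq_len) * 4
    h1 = h // 2       # first 3x3 stride-2 conv: 3 -> 64 channels
    h2 = h1 // 2      # second 3x3 stride-2 conv: 64 -> dim channels
    assert h2 == side  # spatial is now sqrt(seq_len) x sqrt(seq_len)
    return (batch, 3, h, h), (batch, dim, h2, h2), (batch, seq_len, dim)

img, feat, seq = vit_all_tokens_shapes()
print(img, feat, seq)  # (1, 3, 32, 32) (1, 128, 8, 8) (1, 64, 128)
```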
ViTBlockExplicit
Bases: Module
ViT block with EXPLICIT Q/K/V linear layers (instead of nn.MultiheadAttention's combined in_proj_weight). This exports to ONNX as three separate Gemm nodes, avoiding the B=3 batch_matmul pattern that the IREE encoding resolver does not accelerate. Each QKV projection becomes a clean 2D matmul → OPU 32×32 tile.
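A numpy sketch (standing in for the exported Gemm semantics, with assumed sizes) showing why the two forms are interchangeable: three explicit projections compute exactly the same values as one fused in_proj followed by a split, so splitting them only changes the export pattern, not the math:

```python
import numpy as np

rng = np.random.default_rng(0)
seq, dim = 64, 128  # assumed example sizes, both multiples of 32

x = rng.standard_normal((seq, dim))
wq, wk, wv = (rng.standard_normal((dim, dim)) for _ in range(3))

# Explicit form: three separate 2-D GEMMs -> three separate Gemm nodes.
q, k, v = x @ wq, x @ wk, x @ wv

# Fused form (as in nn.MultiheadAttention's in_proj_weight): one wide
# GEMM then a split -> the B=3 batch_matmul-like pattern after export.
w_fused = np.concatenate([wq, wk, wv], axis=1)  # [dim, 3*dim]
q2, k2, v2 = np.split(x @ w_fused, 3, axis=1)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```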
ViTModel
Bases: Module
2-block ViT encoder (enough to show OPU utilization, small enough to run).