models.opu_bench_suite.opu_bench_models
Source: models/opu_bench_suite/opu_bench_models.py
OPU Benchmark Model Suite — Realistic architectures for Saturn OPU evaluation.
Four models designed to stress-test the OPU 16×16 outer-product accelerator while using real architectural patterns from production neural networks. All dimensions are multiples of 16 for clean OPU tile alignment.
Models
- vit_block — Vision Transformer encoder block (attention + FFN)
- convnet — CNN backbone (3×3 + 1×1 convolutions, ResNet-style)
- hybrid — Conv stem → Transformer blocks (MobileViT-style)
- large_mlp — Dense GEMM stress test (large hidden dims)
Usage
```
python opu_bench_models.py --all         # Export all 4
python opu_bench_models.py --model vit   # Export one
```
ConvNet
Bases: Module
Small ResNet-style CNN: 3 stages × 2 blocks, channels 32→64→128.
HybridModel
Bases: Module
Conv stem (64→dim) → reshape to sequence → transformer blocks → head.
LargeMLP
Bases: Module
Deep MLP sized to keep the matmul K dimension small enough to fit in L1 on the OPU (working set per 32×32 output tile ≈ K·64 B; K=512 → 32 KB, which fits comfortably). 6 layers × 128×512×512 ≈ 200 M i8 FMAs total.
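The working-set arithmetic above can be checked with a few lines of Python; the 64 KB L1 capacity used here is an assumed figure for illustration, not a documented OPU parameter:

```python
def tile_working_set_bytes(k: int, tile: int = 32, bytes_per_elem: int = 1) -> int:
    """Bytes of operand panels resident while computing one tile x tile
    output tile: a tile x K row panel of A plus a K x tile column panel
    of B, i.e. 2 * tile * K bytes for int8 operands."""
    return 2 * tile * k * bytes_per_elem

L1_BYTES = 64 * 1024  # assumed L1 capacity, for illustration only

ws = tile_working_set_bytes(512)
print(ws, ws <= L1_BYTES)  # 32768 True -> K=512 fits comfortably

# Total i8 FMAs for the 6-layer 128x512x512 stack:
total_fmas = 6 * 128 * 512 * 512
print(total_fmas)  # 201326592 ~= 200 M
```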
MLPFast
Bases: Module
MLP designed to maximize OPU matmul dominance. Modeled on the same principles as TinyLlama: all matmul shapes are TinyLlama-scale (K ≥ 1024), no per-layer non-matmul overhead beyond ReLU, so the matmul-% of total compute approaches 99%. Expected speedup > 2× (and in practice closer to 5× because the OPU 32×32 tile amortizes memory bandwidth well at K=1024).
Matmul shapes produced
- Layer 0: 128 × 1024 × 128 (input 128 → hidden 1024)
- Layer i: 128 × 1024 × 1024 (depth-2 copies)
- Last: 128 × 16 × 1024 (hidden 1024 → 16 classes; N=16 < 32, so the encoding resolver picks 16×16)
All dims except the 16-wide classifier head are multiples of 32. Summing M·N·K over the shapes above gives ≈ 287 M MACs for depth=4, dominated by matmul.
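A minimal sketch of that shape list and the MAC total, assuming "depth" counts all layers including the input projection and the head (the function name and defaults are illustrative, not the module's API):

```python
def mlpfast_matmul_shapes(depth: int = 4, batch: int = 128,
                          in_dim: int = 128, hidden: int = 1024,
                          num_classes: int = 16):
    """(M, N, K) GEMM shapes for the MLPFast stack described above."""
    shapes = [(batch, hidden, in_dim)]                 # layer 0
    shapes += [(batch, hidden, hidden)] * (depth - 2)  # middle layers
    shapes += [(batch, num_classes, hidden)]           # classifier head
    return shapes

shapes = mlpfast_matmul_shapes()
total_macs = sum(m * n * k for m, n, k in shapes)
print(shapes)
print(total_macs)  # 287309824 ~= 287 M MACs for depth=4
```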
ViTAllTokens
Bases: Module
ViT with conv stem + transformer blocks + all-token head.
The conv stem (3→64→dim, two 3×3 stride-2 convs) produces "direct conv" dispatches in the decomposition, similar to DroNet. The transformer blocks produce OPU 32×32 matmul dispatches. The head Linear applied to ALL tokens (not CLS-only) produces an encoding 16×16 dispatch (N=output_dim < 32). This gives a rich mixed decomposition: direct conv + OPU 32×32 + encoding 16×16 + reduction/softmax + elementwise.
Input: a spatial image [B, 3, H, H] where H = sqrt(seq_len) × 4. The conv stem (3→64→dim, two stride-2 convs) reduces the spatial extent to sqrt(seq_len) × sqrt(seq_len), which is then reshaped to [B, seq_len, dim].
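The spatial bookkeeping can be traced in a few lines (seq_len=64 and dim=128 are assumed example values, not the module's actual defaults):

```python
import math

def vit_all_tokens_shapes(seq_len: int = 64, batch: int = 1, dim: int = 128):
    """Trace shapes through the conv stem: image -> feature map -> sequence."""
    side = math.isqrt(seq_len)
    assert side * side == seq_len, "seq_len must be a perfect square"
    h = side * 4      # input image is [B, 3, H, H] with H = sqrt(seq_len) * 4
    h1 = h // 2       # first 3x3 stride-2 conv: 3 -> 64 channels
    h2 = h1 // 2      # second 3x3 stride-2 conv: 64 -> dim channels
    assert h2 == side  # spatial is now sqrt(seq_len) x sqrt(seq_len)
    return (batch, 3, h, h), (batch, dim, h2, h2), (batch, seq_len, dim)

img, feat, seq = vit_all_tokens_shapes()
print(img, feat, seq)  # (1, 3, 32, 32) (1, 128, 8, 8) (1, 64, 128)
```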
ViTBlockExplicit
Bases: Module
ViT block with EXPLICIT Q/K/V linear layers (instead of nn.MultiheadAttention's combined in_proj_weight). This exports to ONNX as three separate Gemm nodes, avoiding the B=3 batch_matmul pattern that the IREE encoding resolver does not accelerate. Each QKV projection becomes a clean 2D matmul → OPU 32×32 tile.
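A numpy sketch (standing in for the exported Gemm semantics, with assumed sizes) showing why the two forms are interchangeable: three explicit projections compute exactly the same values as one fused in_proj followed by a split, so splitting them only changes the export pattern, not the math:

```python
import numpy as np

rng = np.random.default_rng(0)
seq, dim = 64, 128  # assumed example sizes, both multiples of 32

x = rng.standard_normal((seq, dim))
wq, wk, wv = (rng.standard_normal((dim, dim)) for _ in range(3))

# Explicit form: three separate 2-D GEMMs -> three separate Gemm nodes.
q, k, v = x @ wq, x @ wk, x @ wv

# Fused form (as in nn.MultiheadAttention's in_proj_weight): one wide
# GEMM then a split -> the B=3 batch_matmul-like pattern after export.
w_fused = np.concatenate([wq, wk, wv], axis=1)  # [dim, 3*dim]
q2, k2, v2 = np.split(x @ w_fused, 3, axis=1)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```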
ViTModel
Bases: Module
2-block ViT encoder (enough to show OPU utilization, small enough to run).