Writing quick GPU code is likely one of the most grueling specializations in machine studying engineering. Researchers from RightNow AI wish to automate it fully.
The RightNow AI analysis staff has launched AutoKernel, an open-source framework that applies an autonomous LLM agent loop to GPU kernel optimization for arbitrary PyTorch fashions. The strategy is simple: give it any mannequin earlier than you go to mattress, and get up to quicker Triton kernels — no GPU experience required.
https://arxiv.org/pdf/2603.21331
Why GPU Kernels Are So Exhausting to Optimize
A GPU kernel is a perform that runs in parallel throughout 1000’s of GPU cores. Whenever you run a transformer mannequin like LLaMA or GPT-2, the majority of compute time is spent inside kernels for operations like matrix multiplication (matmul), softmax, layer normalization, and a focus. These kernels stay in libraries like cuBLAS and cuDNN, or get generated mechanically by PyTorch’s compilation pipeline.
The issue is that squeezing most efficiency out of those kernels requires reasoning concurrently about arithmetic depth, reminiscence coalescing, register strain, tile sizes, warp-level synchronization, and tensor core instruction choice — a mix of abilities that takes years to develop. A single high-performance matmul kernel could contain 200+ strains of CUDA or Triton code with dozens of interdependent parameters. This experience is scarce, and the guide tuning course of scales poorly as mannequin architectures evolve.
The benchmark suite KernelBench, which evaluates frontier LLMs on 250 GPU kernel issues, discovered that even the most effective fashions matched PyTorch baseline efficiency in fewer than 20% of circumstances utilizing one-shot technology. AutoKernel was constructed instantly in response to that hole.
The Loop: Edit, Benchmark, Hold or Revert
AutoKernel’s core perception is that an skilled kernel engineer’s workflow is itself a easy loop: write a candidate, benchmark it, hold enhancements, discard regressions, repeat. The framework mechanizes this loop. An LLM agent modifies a single file — kernel.py — a set benchmark harness verifies correctness and measures throughput, and the consequence determines whether or not the change persists. Crucially, each experiment maps to a git commit. Stored experiments advance the department; reverted experiments are erased cleanly with git reset. All the historical past is browsable with normal git instruments, and experiment outcomes are logged to a plain tab-separated outcomes.tsv file — dependency-free, human-readable, and trivially parseable by the agent.
Every iteration takes roughly 90 seconds — 30 seconds for correctness checking, 30 seconds for efficiency benchmarking through Triton’s do_bench, and 30 seconds for agent reasoning and code modification. At roughly 40 experiments per hour, an in a single day 10-hour run yields 300 to 400 experiments throughout a number of kernels.
This design attracts instantly from Andrej Karpathy’s autoresearch mission, which demonstrated that an AI agent operating a hold/revert loop on LLM coaching code may uncover 20 optimizations throughout 700 experiments in two days on a single GPU. AutoKernel transplants this loop to kernel code, with a special search house and a correctness-gated benchmark because the analysis perform as an alternative of validation loss.
The agent reads a 909-line instruction doc referred to as program.md, which encodes skilled data right into a six-tier optimization playbook. The tiers progress from block measurement tuning (sweeping tile dimensions by means of powers of two, adjusting num_warps and num_stages) by means of reminiscence entry patterns (coalesced hundreds, software program prefetching, L2 swizzling), compute optimizations (TF32 accumulation, epilogue fusion), superior strategies (split-Okay, persistent kernels, Triton autotune, warp specialization), architecture-specific methods (TMA on Hopper, cp.async on Ampere, adjusted sizes for L4/RTX), and eventually kernel-specific algorithms like on-line softmax for consideration and Welford’s algorithm for normalization. The instruction doc is deliberately complete so the agent can run 10+ hours with out getting caught.
https://arxiv.org/pdf/2603.21331
Profiling First, Optimizing The place It Issues
Not like prior work that treats kernel issues in isolation, AutoKernel begins from an entire PyTorch mannequin. It makes use of torch.profiler with form recording to seize per-kernel GPU time, then ranks optimization targets utilizing Amdahl’s regulation — the mathematical precept that the general speedup you may obtain is bounded by how a lot of the overall runtime that part represents. A 1.5× speedup on a kernel consuming 60% of whole runtime yields a 1.25× end-to-end acquire. The identical speedup on a kernel consuming 5% of runtime yields only one.03×.
The profiler detects GPU {hardware} from a database of recognized specs protecting each NVIDIA (H100, A100, L40S, L4, A10, RTX 4090/4080/3090/3080) and AMD (MI300X, MI325X, MI350X, MI355X) accelerators. For unknown GPUs, it estimates peak FP16 throughput from SM rely, clock charge, and compute functionality — making the system usable throughout a wider vary of {hardware} than simply the most recent NVIDIA choices.
The orchestrator (orchestrate.py) transitions from one kernel to the subsequent when any of 4 situations are met: 5 consecutive reverts, 90% of GPU peak utilization reached, a two-hour elapsed time price range, or a 2× speedup already achieved on that kernel. This prevents the agent from spending extreme time on kernels with diminishing returns whereas higher-impact targets wait.
5-Stage Correctness Harness
Efficiency with out correctness is ineffective, and AutoKernel is especially thorough on this entrance. Each candidate kernel passes by means of 5 validation phases earlier than any speedup is recorded. Stage 1 runs a smoke check on a small enter to catch compilation errors and form mismatches in below a second. Stage 2 sweeps throughout 8 to 10 enter configurations and three knowledge sorts — FP16, BF16, and FP32 — to catch size-dependent bugs like boundary dealing with and tile the rest logic. Stage 3 checks numerical stability below adversarial inputs: for softmax, rows of huge an identical values; for matmul, excessive dynamic vary; for normalization, near-zero variance. Stage 4 verifies determinism by operating the identical enter 3 times and requiring bitwise an identical outputs, which catches race situations in parallel reductions and non-deterministic atomics. Stage 5 checks non-power-of-two dimensions like 1023, 4097, and 1537 to show masking bugs and tile the rest errors.
Tolerances are dtype-specific: FP16 makes use of atol = 10⁻², BF16 makes use of 2 × 10⁻², and FP32 makes use of 10⁻⁴. Within the paper’s full analysis throughout 34 configurations on an NVIDIA H100, all 34 handed correctness with zero failures throughout keen, compiled, and customized kernel outputs.
Twin Backend: Triton and CUDA C++
AutoKernel helps each Triton and CUDA C++ backends inside the identical framework. Triton is a Python-like domain-specific language that compiles JIT in 1 to five seconds, making it ideally suited for speedy iteration — the agent can modify block sizes, warp counts, pipeline phases, accumulator precision, and loop construction. Triton routinely reaches 80 to 95% of cuBLAS throughput for matmul. CUDA C++ is included for circumstances requiring direct entry to warp-level primitives, WMMA tensor core directions (utilizing 16×16×16 fragments), vectorized hundreds through float4 and half2, bank-conflict-free shared reminiscence layouts, and double buffering. Each backends expose the identical kernel_fn() interface, so the benchmark infrastructure runs identically no matter backend.
The system covers 9 kernel sorts spanning the dominant operations in fashionable transformer architectures: matmul, flash_attention, fused_mlp, softmax, layernorm, rmsnorm, cross_entropy, rotary_embedding, and cut back. Every has a PyTorch reference implementation in reference.py serving because the correctness oracle, and the benchmark computes throughput in TFLOPS or GB/s alongside roofline utilization in opposition to detected GPU peak.
Benchmark Outcomes on H100
Measured on an NVIDIA H100 80GB HBM3 GPU (132 SMs, compute functionality 9.0, CUDA 12.8) in opposition to PyTorch keen and torch.compile with max-autotune, the outcomes for memory-bound kernels are important. RMSNorm achieves 5.29× over keen and a couple of.83× over torch.compile on the largest examined measurement, reaching 2,788 GB/s — 83% of H100’s 3,352 GB/s peak bandwidth. Softmax reaches 2,800 GB/s with a 2.82× speedup over keen and three.44× over torch.compile. Cross-entropy achieves 2.21× over keen and a couple of.94× over torch.compile, reaching 2,070 GB/s. The features on these kernels come from fusing multi-operation ATen decompositions into single-pass Triton kernels that reduce HBM (Excessive Bandwidth Reminiscence) visitors.
AutoKernel outperforms torch.compile on 12 of the 16 consultant configurations benchmarked within the paper, regardless of torch.compile with max-autotune operating its personal Triton autotuning. TorchInductor’s generic fusion and autotuning doesn’t all the time discover the specialised tiling and discount methods that kernel-specific implementations exploit.
Matmul is notably more durable — PyTorch’s cuBLAS backend is extensively tuned per GPU structure. The Triton starter reaches 278 TFLOPS, properly under cuBLAS. Nevertheless, on the 2048³ measurement, AutoKernel beats torch.compile by 1.55×, demonstrating that TorchInductor’s matmul autotuning shouldn’t be all the time optimum both. Closing the cuBLAS hole stays the first goal for continued agent iteration.
In neighborhood deployment, an AutoKernel-optimized kernel took first place on the vectorsum_v2 B200 leaderboard with a latency of 44.086µs, outperforming the second-place entry at 44.249µs and third place at 46.553µs. A neighborhood consumer additionally reported {that a} single AutoKernel immediate — requiring roughly three minutes of agent interplay — produced a Triton FP4 matrix multiplication kernel that outperforms CUTLASS by 1.63× to 2.15× throughout a number of shapes on H100. CUTLASS represents hand-optimized C++ template code particularly designed for NVIDIA tensor cores, making this consequence significantly notable.
Key Takeaways
- AutoKernel turns weeks of skilled GPU tuning into an in a single day autonomous course of. By mechanizing the write-benchmark-keep/revert loop that skilled kernel engineers already comply with, the system runs 300 to 400 experiments per in a single day session on a single GPU with none human intervention.
- Correctness is non-negotiable earlier than any speedup is recorded. Each candidate kernel should go a five-stage harness protecting smoke checks, form sweeps throughout 10+ configurations, numerical stability below adversarial inputs, determinism verification, and non-power-of-two edge circumstances — eliminating the danger of the agent “optimizing” its solution to incorrect outputs.
- Reminiscence-bound kernels see the largest features over each PyTorch keen and torch.compile. On an NVIDIA H100, AutoKernel’s Triton kernels obtain 5.29× over keen on RMSNorm, 2.82× on softmax, and a couple of.21× on cross-entropy — with the features coming from fusing multi-operation ATen decompositions into single-pass kernels that reduce HBM visitors.
- Amdahl’s regulation drives the place the agent spends its time. Somewhat than optimizing kernels in isolation, AutoKernel profiles your complete PyTorch mannequin and allocates effort proportionally to every kernel’s share of whole GPU runtime — making certain that enhancements compound on the mannequin stage, not simply the kernel stage.
Take a look at the Paper and Repo. Additionally, be happy to comply with us on Twitter and don’t neglect to affix our 120k+ ML SubReddit and Subscribe to our Publication. Wait! are you on telegram? now you may be a part of us on telegram as properly.
Must associate with us for selling your GitHub Repo OR Hugging Face Web page OR Product Launch OR Webinar and many others.? Join with us

