The race to make large language models faster and cheaper to run has largely been fought at two levels: the model architecture and the hardware. But there is a third, often underappreciated frontier: the GPU kernel. A kernel is the low-level computational routine that actually executes a mathematical operation on the GPU. Writing a good one requires understanding not just the math, but the exact memory layout, instruction scheduling, and hardware quirks of the chip you are targeting. Most ML practitioners never write kernels directly; they rely on libraries like FlashAttention or Triton to do it for them.
Meet FlashQLA: QwenLM's contribution to this layer. Released under the MIT License and built on the TileLang compiler framework, it is a high-performance linear attention kernel library specifically optimized for the Gated Delta Network (GDN) attention mechanism, the linear attention architecture that powers the Qwen3.5 and Qwen3.6 model families.
What Is Linear Attention and Why Does It Matter?
To understand what FlashQLA solves, it helps to know what standard softmax attention costs. In a conventional Transformer, the attention mechanism has O(n²) complexity, meaning that doubling the sequence length quadruples the computation. That is the fundamental bottleneck that makes processing long documents, long code files, or long conversations expensive.
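As a refresher, standard scaled dot-product attention computes

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V,
\qquad Q, K, V \in \mathbb{R}^{n \times d},
\]

where n is the sequence length and d is the head dimension. The QKᵀ product is an n × n matrix, which is where the quadratic cost in both compute and memory comes from.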
Linear attention replaces the softmax with a formulation that reduces this to O(n) complexity, making it scale far more favorably with sequence length. The Gated Delta Network (GDN) is one such linear attention mechanism, and it has been integrated into Qwen's hybrid model architecture, where GDN layers alternate with standard full attention layers. This hybrid design tries to get the best of both worlds: the expressiveness of full attention where it is most needed, and the efficiency of linear attention everywhere else.
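In its generic (ungated) form, which is a common way to write the idea rather than GDN's specific formulation, linear attention applies a feature map φ to queries and keys so the key-value summary can be accumulated as a fixed-size running state:

\[
S_t = S_{t-1} + \phi(k_t)\, v_t^\top, \qquad o_t = S_t^\top \phi(q_t), \qquad S_t \in \mathbb{R}^{d \times d_v}.
\]

Each token updates and reads this d × d_v state at constant cost (normalization terms omitted here), so the total work grows linearly with sequence length instead of quadratically.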
GDN uses what is called a "gated" formulation: it applies an exponentially decaying gate to control how much past context is carried forward. This gate is central to how FlashQLA achieves its performance gains.
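Schematically, and hedging on the exact parameterization used in Qwen's GDN layers, the gated delta rule augments the recurrence above with a per-token decay gate α_t ∈ (0, 1) and a delta-style correction:

\[
S_t = \alpha_t \bigl(I - \beta_t\, k_t k_t^\top\bigr)\, S_{t-1} + \beta_t\, k_t v_t^\top, \qquad o_t = S_t^\top q_t,
\]

where α_t controls how much of the old state survives each step and β_t is a writing strength; the (I − β_t k_t k_tᵀ) term overwrites what was previously stored at key k_t rather than simply adding to it. The exact placement of the gate and normalization may differ from this sketch, but the exponential decay through α_t is the property FlashQLA leans on.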
The Problem with Existing Kernels
Before FlashQLA, the standard implementation for GDN operations came from the Flash Linear Attention (FLA) library, which uses Triton kernels (Triton being OpenAI's Python-based GPU programming language). While Triton makes kernel authoring more accessible, it comes with trade-offs: the kernels it produces are not always optimally scheduled for specific hardware, particularly on NVIDIA's Hopper architecture (the H100 and H200 GPU generation).
The Hopper architecture introduced features like warpgroup-level Tensor Core operations and asynchronous data pipelines that Triton cannot always exploit to their full potential. That is the gap FlashQLA is designed to fill.
What FlashQLA Does Differently
FlashQLA applies operator fusion and performance optimization to both the forward pass (used during inference and training) and the backward pass (used during training for gradient computation) of GDN Chunked Prefill. The result is a 2–3× speedup on forward passes and a 2× speedup on backward passes compared to the FLA Triton kernel across multiple scenarios on NVIDIA Hopper GPUs.
Three technical innovations drive these gains:
1. Gate-driven automatic intra-card context parallelism: Context parallelism (CP) refers to splitting a long sequence across multiple processing units so they can work on different parts simultaneously. FlashQLA exploits the exponential decay property of the GDN gate to make this split mathematically valid, because the decay means that tokens far apart in a sequence have diminishing influence on one another. This lets FlashQLA automatically enable intra-card CP under tensor parallelism (TP), long-sequence, and small-head-count settings, improving GPU Streaming Multiprocessor (SM) utilization without manual configuration (a simplified sketch of this decay-gated, chunked computation follows the list).
2. Hardware-friendly algebraic reformulation: FlashQLA reformulates, to an extent, the mathematical computation of GDN Chunked Prefill's forward and backward flows to reduce overhead on three types of GPU hardware units: Tensor Cores (which handle matrix multiplications), CUDA Cores (which handle scalar and vector operations), and the Special Function Unit (SFU, which handles operations like exponentials and square roots). Critically, this is done without sacrificing numerical precision, an important guarantee when the reformulation is used for model training.
3. TileLang fused warp-specialized kernels: Rather than decomposing the computation into independent sequential kernels (too slow) or fusing everything into a single monolithic kernel (too rigid to optimize), FlashQLA takes a middle path. It uses TileLang to build several key fused kernels and manually implements warpgroup specialization, a technique that assigns different warpgroups (groups of 128 threads on Hopper) to specialized roles, such as one warpgroup moving data from global memory to shared memory while another concurrently runs Tensor Core matrix multiplications. This overlap of data movement, Tensor Core computation, and CUDA Core computation is what lets FlashQLA approach the theoretical peak throughput of the hardware.
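To make the "chunked prefill" and gate-driven parallelism ideas concrete, here is a minimal PyTorch sketch of chunked gated linear attention. It is an assumption-laden illustration: scalar decay, no delta-rule correction, no warp specialization, and none of FlashQLA's actual kernel code. What it does show is that each chunk's work reduces to a pair of small matmuls plus a fixed-size state carried across chunk boundaries, and that the exponentially decaying gate is what keeps that cross-chunk dependence compact enough to split a sequence across processing units.

```python
# Illustrative PyTorch sketch of chunked, decay-gated linear attention (prefill).
# Not FlashQLA's kernels: scalar gate, no delta-rule term, pure reference code.
import torch

def gated_linear_attention_chunked(q, k, v, log_decay, chunk=64):
    # q, k: [T, d_k], v: [T, d_v], log_decay: [T] with values <= 0 (gate in log space)
    T, d_k = q.shape
    d_v = v.shape[-1]
    state = torch.zeros(d_k, d_v, dtype=q.dtype, device=q.device)  # cross-chunk recurrent state
    out = torch.empty(T, d_v, dtype=q.dtype, device=q.device)
    for start in range(0, T, chunk):
        end = min(start + chunk, T)
        qc, kc, vc = q[start:end], k[start:end], v[start:end]
        g = torch.cumsum(log_decay[start:end], dim=0)   # cumulative log-gate inside the chunk
        # Intra-chunk: causal attention with pairwise decay exp(g_i - g_j) for j <= i.
        decay = (g[:, None] - g[None, :]).tril().exp().tril()
        intra = ((qc @ kc.T) * decay) @ vc
        # Inter-chunk: each token reads the decayed state carried in from earlier chunks.
        inter = (qc * g.exp()[:, None]) @ state
        out[start:end] = intra + inter
        # Update the state for the next chunk: decay the old state, absorb this chunk.
        state = g[-1].exp() * state + ((g[-1] - g).exp()[:, None] * kc).T @ vc
    return out
```

Because only the small decayed state links one chunk to the next, chunks can be scheduled largely independently, which is the intuition behind FlashQLA's gate-driven intra-card context parallelism.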
Benchmarks
FlashQLA was benchmarked against two baselines: the FLA Triton kernel (version 0.5.0, Triton 3.5.1) and FlashInfer (version 0.6.9), using TileLang 0.1.8, on NVIDIA H200 GPUs. The benchmarks used the head configurations of the Qwen3.5 and Qwen3.6 model families, with head dimensions hv ∈ {64, 48, 32, 24, 16, 8}, corresponding to tensor parallelism settings from TP1 through TP8.
The forward (FWD) benchmarks measure single-kernel latency for different models and TP settings under varying batch lengths. The backward (BWD) benchmarks examine the relationship between the total token count within a batch and the latency of a single update step.
https://qwen.ai/blog?id=flashqla
Key Takeaways
- FlashQLA is a high-performance linear attention kernel library built by the Qwen team on TileLang, specifically optimized for the Gated Delta Network (GDN) Chunked Prefill forward and backward passes.
- It achieves a 2–3× forward speedup and a 2× backward speedup over the FLA Triton kernel across multiple scenarios on NVIDIA Hopper GPUs (SM90+), with efficiency gains most pronounced in pretraining and edge-side agentic inference.
- Three core innovations drive the performance gains: gate-driven automatic intra-card context parallelism; hardware-friendly algebraic reformulation that reduces Tensor Core, CUDA Core, and SFU overhead without losing numerical precision; and TileLang fused warp-specialized kernels that overlap data movement, Tensor Core computation, and CUDA Core computation.
- GDN is a linear attention mechanism with O(n) complexity, used in Qwen's hybrid model architecture alongside standard full attention layers, making efficient GDN kernels important for both training and long-context inference at scale.
- FlashQLA is open source under the MIT License and requires SM90 or above, CUDA 12.8+, and PyTorch 2.8+, with a simple pip install and both high-level and low-level Python APIs available for integration.
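As a small, non-authoritative sanity check before installing, the stated requirements can be verified with standard PyTorch introspection. The thresholds below are simply the ones listed above; the actual package name and import path for FlashQLA are not shown here and should be taken from the repo.

```python
# Check the environment against FlashQLA's stated requirements:
# an SM90+ (Hopper-class) GPU, CUDA 12.8+, and PyTorch 2.8+.
import torch

assert torch.cuda.is_available(), "A CUDA-capable GPU is required"
major, minor = torch.cuda.get_device_capability()
print(f"GPU compute capability: sm_{major}{minor} (FlashQLA targets sm_90 or newer)")
print(f"CUDA version seen by PyTorch: {torch.version.cuda} (needs 12.8+)")
print(f"PyTorch version: {torch.__version__} (needs 2.8+)")
assert (major, minor) >= (9, 0), "FlashQLA requires a Hopper-class (SM90+) GPU"
```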
Check out the GitHub repo and technical details.

