The team behind Kimi.ai (Moonshot AI) has just made a significant contribution to the open-source AI infrastructure space. They released FlashKDA (Flash Kimi Delta Attention), a high-performance CUTLASS-based kernel implementation of the Kimi Delta Attention (KDA) mechanism. The FlashKDA library is available on GitHub under an MIT license. It delivers prefill speedups of 1.72× to 2.22× over the flash-linear-attention baseline on NVIDIA H20 GPUs, and works as a drop-in backend for the popular flash-linear-attention library.
What Is Kimi Delta Attention, and Why Does It Matter?
To understand FlashKDA, it helps to first see where it sits in the LLM attention landscape.
Standard softmax attention has quadratic complexity with respect to sequence length, meaning that as you feed longer context into a model, compute costs grow extremely fast. This has driven a wave of research into linear attention mechanisms, which approximate or replace the softmax operation to achieve linear scaling. Kimi Delta Attention (KDA) is Moonshot AI's contribution to this space: a linear attention mechanism that refines the Gated DeltaNet with a finer-grained, channel-wise gating mechanism, enabling more effective use of limited finite-state RNN memory.
KDA is not just a research prototype. It is the core attention mechanism in Kimi Linear, Moonshot AI's open-source hybrid model with 48B total parameters and 3B activated parameters. Kimi Linear uses a 3:1 KDA-to-MLA (Multi-Head Latent Attention) ratio, three KDA layers for every one global attention layer, which reduces KV cache usage by up to 75% during long-sequence generation while achieving up to 6× higher decoding throughput at 1 million context length compared to full attention. FlashKDA is the production-grade CUDA kernel that makes that architecture fast during prefill.
Concretely, the KDA forward pass takes in queries (q), keys (k), values (v), a gate before activation (g), and beta logits (beta), together with a scale factor, an output tensor (out), and gate parameters: A_log (log-gate parameter per head), dt_bias (gate bias), and lower_bound (gate lower bound, ranging from -5.0 to 0). The sigmoid activation on beta is applied internally by the kernel. The mechanism also supports optional initial and final recurrent states, useful for multi-turn inference where you want to carry state across requests.
The recurrent formulation means the model can efficiently process long sequences during generation. But efficient prefill of these architectures still requires highly optimized GPU kernels, which is exactly what FlashKDA delivers.
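For intuition, here is a minimal single-head PyTorch sketch of a gated delta-rule recurrence with channel-wise decay, the family of mechanisms KDA belongs to. It is a conceptual reference only: the tensor shapes, the placement of the decay, and the handling of beta are illustrative assumptions, not the exact KDA formulation and not anything taken from the FlashKDA kernel.

```python
import torch

def gated_delta_rule_reference(q, k, v, beta, alpha, scale):
    """Schematic single-head gated delta rule with channel-wise decay.

    Assumed shapes (illustrative): q, k: [T, K]; v: [T, V];
    beta: [T] in [0, 1]; alpha: [T, K] per-channel decay in [0, 1].
    """
    T, K = k.shape
    V = v.shape[-1]
    S = torch.zeros(K, V, dtype=torch.float32, device=k.device)  # finite-state RNN memory
    out = torch.zeros(T, V, dtype=torch.float32, device=k.device)
    for t in range(T):
        S = alpha[t].float().unsqueeze(-1) * S                   # channel-wise decay of the state
        kt, vt = k[t].float(), v[t].float()
        # Delta rule: correct what the state currently retrieves for key kt.
        S = S + beta[t].float() * torch.outer(kt, vt - kt @ S)
        out[t] = (scale * q[t].float()) @ S                      # read out with the query
    return out
```

The point of the channel-wise decay is that alpha varies per key channel rather than being a single scalar gate per head, which is the refinement over a plain Gated DeltaNet-style gate described above.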
Under the Hood: CUTLASS on Hopper
FlashKDA is built on CUTLASS, NVIDIA's open-source library of CUDA C++ template abstractions for high-performance linear algebra and custom kernel development. CUTLASS lets developers write kernels that take full advantage of NVIDIA's Tensor Core architecture, and it is the same foundation used by libraries like FlashAttention-3.
The library targets SM90 and above, meaning NVIDIA's Hopper architecture (H100, H20) and newer. The minimum requirements are CUDA 12.9 and PyTorch 2.4. The codebase is predominantly CUDA (56.4%), with Python (36.2%) bindings and C++ (6.7%) glue code.
The core API is flash_kda.fwd, which takes the following inputs:
- q, k, v, g: all in bf16 with shape [B, T, H, K] or [B, T, H, V] (where g is the gate before activation)
- beta: bf16 beta logits with shape [B, T, H] (sigmoid applied internally)
- scale: fp32 scalar scaling factor
- out: bf16 output tensor with shape [B, T, H, V]
- A_log, dt_bias, lower_bound: fp32 gate parameters
- initial_state, final_state: optional bf16 or fp32 recurrent states
- cu_seqlens: optional int64 cumulative sequence lengths for variable-length batching
One current constraint: the kernel requires K = V = 128 for the head dimension.
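Based on that parameter list, a call might look like the sketch below. The function name and parameter names come from the description above, but the keyword names, the argument order, the per-head shapes assumed for the gate parameters, and the 1/sqrt(K) scale are assumptions to verify against the repo before use.

```python
import torch
import flash_kda  # assumption: the package is importable under this name after installation

B, T, H, K, V = 1, 8192, 96, 128, 128            # K = V = 128 is currently required
dev, dt = "cuda", torch.bfloat16

q = torch.randn(B, T, H, K, device=dev, dtype=dt)
k = torch.randn(B, T, H, K, device=dev, dtype=dt)
v = torch.randn(B, T, H, V, device=dev, dtype=dt)
g = torch.randn(B, T, H, K, device=dev, dtype=dt)      # gate before activation
beta = torch.randn(B, T, H, device=dev, dtype=dt)      # beta logits; sigmoid applied in-kernel
out = torch.empty(B, T, H, V, device=dev, dtype=dt)    # pre-allocated output tensor

# Gate parameters in fp32; per-head shapes here are an assumption for illustration.
A_log = torch.zeros(H, device=dev, dtype=torch.float32)
dt_bias = torch.zeros(H, device=dev, dtype=torch.float32)
lower_bound = torch.full((H,), -5.0, device=dev, dtype=torch.float32)

flash_kda.fwd(q=q, k=k, v=v, g=g, beta=beta, scale=K ** -0.5, out=out,
              A_log=A_log, dt_bias=dt_bias, lower_bound=lower_bound,
              initial_state=None, final_state=None, cu_seqlens=None)
```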
The variable-length batching support via cu_seqlens is particularly notable for production use. In real inference serving, requests in a batch rarely share the same sequence length. Being able to pack multiple sequences of different lengths into a single kernel call is a key requirement for high-throughput serving systems.
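As a sketch of how such packing typically works, sequences are concatenated along the time axis and described by cumulative offsets. The exact convention FlashKDA expects (for example, whether cu_seqlens begins with 0 and whether the batch dimension collapses to 1) is an assumption here; the repo's tests are the authoritative reference.

```python
import torch

# Hypothetical per-request lengths, mirroring the varlen benchmark case below.
seq_lens = [1300, 547, 2048, 963, 271, 3063]

lens = torch.tensor(seq_lens, dtype=torch.int64)
# Cumulative boundaries: [0, 1300, 1847, 3895, 4858, 5129, 8192]
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.int64), lens.cumsum(0)]).to("cuda")

# All requests packed into one concatenated token stream of length sum(seq_lens).
total_T, H, K = int(lens.sum()), 96, 128
q_packed = torch.randn(1, total_T, H, K, device="cuda", dtype=torch.bfloat16)
```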
Benchmark Results: 1.72× to 2.22× on H20
The benchmark results (as of April 20, 2026) compare flash_kda against fla_chunk_kda (the current flash-linear-attention implementation) at a sequence length of T=8192, head dimension D=128, and two head-count configurations: H=96 and H=64. Each benchmark ran with 30 warmup iterations, 200 measurement iterations, and 5 repeats.
For H=96:
| Case | flash_kda (ms) | fla_chunk_kda (ms) | Speedup |
| --- | --- | --- | --- |
| Fixed | 2.6219 | 4.5052 | 1.72× |
| Varlen, seq_lens=[1300, 547, 2048, 963, 271, 3063] | 2.3420 | 4.5717 | 1.95× |
| Varlen, seq_lens=1024 × 8 | 2.0100 | 4.4668 | 2.22× |
For H=64:
| Case | flash_kda (ms) | fla_chunk_kda (ms) | Speedup |
| --- | --- | --- | --- |
| Fixed | 1.6199 | 2.9587 | 1.83× |
| Varlen, seq_lens=[1300, 547, 2048, 963, 271, 3063] | 1.7027 | 3.0595 | 1.80× |
| Varlen, seq_lens=1024 × 8 | 1.3930 | 3.0412 | 2.18× |
The peak speedup of 2.22× appears in the uniform variable-length case (seq_lens=1024 × 8, eight sequences of length 1024 summing to T=8192). The fixed-length case sits at the floor of the range, 1.72×. Across both head configurations and all three sequence scenarios, FlashKDA consistently outperforms the flash-linear-attention baseline by a significant margin.
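The numbers above come from the repo's own benchmark scripts; purely as an illustration of the warmup-then-measure methodology described earlier (30 warmup iterations, 200 measured iterations, 5 repeats), a CUDA-event timing loop of that shape might look like the following. This is not the repo's benchmark code.

```python
import torch

def time_kernel_ms(fn, warmup=30, iters=200, repeats=5):
    """Median-over-repeats average time (ms) per call of a CUDA kernel launch `fn`."""
    per_repeat = []
    for _ in range(repeats):
        for _ in range(warmup):                      # warmup iterations
            fn()
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):                       # measurement iterations
            fn()
        end.record()
        torch.cuda.synchronize()
        per_repeat.append(start.elapsed_time(end) / iters)
    return sorted(per_repeat)[len(per_repeat) // 2]  # median across repeats
```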
Integration with flash-linear-attention
One of the most practical aspects of FlashKDA is its integration story. Once installed, FlashKDA is auto-dispatched from flash-linear-attention's chunk_kda, which means existing codebases using flash-linear-attention need no manual wiring to benefit from the faster kernel. The integration is tracked in flash-linear-attention PR #852.
Installation is straightforward:
git clone https://github.com/MoonshotAI/FlashKDA.git flash-kda
cd flash-kda
git submodule update --init --recursive
pip install -v .
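After installation, a quick sanity check that the module imports and the GPU meets the SM90+ requirement might look like this (the module name flash_kda follows the API naming above; the check itself is just a suggestion):

```python
import torch
import flash_kda  # assumption: the installed package is importable under this name

major, minor = torch.cuda.get_device_capability()
assert major >= 9, f"FlashKDA targets SM90+ (Hopper); this GPU reports SM{major}{minor}"
print(f"flash_kda imported OK on SM{major}{minor}")
```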
The correctness test suite (tests/test_fwd.py) runs exact-match verification against a PyTorch reference implementation and cross-validates against flash-linear-attention. This gives AI developers a reliable baseline for auditing kernel behavior before deploying to production.
Key Takeaways
- FlashKDA is Moonshot AI's open-source CUTLASS-based CUDA kernel for Kimi Delta Attention (KDA), delivering a 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on NVIDIA H20 GPUs.
- KDA extends Gated DeltaNet with fine-grained, channel-wise gating. It is the core attention mechanism behind Kimi Linear, a 48B-total / 3B-active-parameter hybrid model that reduces KV cache usage by up to 75% and achieves up to 6× higher decoding throughput at 1M context length.
- The kernel targets SM90+ hardware (NVIDIA Hopper: H100, H20 and above), requires CUDA 12.9+ and PyTorch 2.4+, and currently supports a fixed head dimension of K = V = 128.
- Variable-length batching is natively supported via the cu_seqlens parameter, allowing multiple sequences of different lengths to be packed into a single kernel call, a critical feature for high-throughput inference serving.
- Once installed, FlashKDA is auto-dispatched from flash-linear-attention's chunk_kda, making it a drop-in performance upgrade for any existing codebase already using the flash-linear-attention library, with no architecture changes required.

