For years, the way large language models handle inference has been stuck inside a box, quite literally. The high-bandwidth RDMA networks that make modern LLM serving work have confined both prefill and decode to the same datacenter, often even the same rack. A team of researchers at Moonshot AI and Tsinghua University is making the case that this constraint is about to break down, and that the right architecture can already exploit that shift.
The research team introduces Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the resulting KVCache over commodity Ethernet to local PD clusters for decode. The result, in a case study using an internal 1T-parameter hybrid model, is 54% higher serving throughput than a homogeneous PD baseline and 32% higher than a naive heterogeneous setup, while consuming only a fraction of the available cross-datacenter bandwidth. The research team notes that when compared at equal hardware cost, the throughput gain is roughly 15%, reflecting that the full 54% advantage comes partly from pairing higher-compute H200 GPUs for prefill with H20 GPUs for decode.
https://arxiv.org/pdf/2604.15039v1
Why the Current Architecture Has Hit a Wall
To understand what PrfaaS solves, it helps to know why LLM serving is split into two phases in the first place. Prefill is the step where the model processes all the input tokens and generates the KVCache; it is compute-intensive. Decode is where the model generates output tokens one at a time; it is memory-bandwidth-intensive. Prefill-decode (PD) disaggregation separates these two phases onto different hardware, which improves utilization and allows each phase to be independently optimized.
The problem is that separating prefill from decode creates a transport problem. Once prefill runs on one set of machines and decode runs on another, the KVCache produced by prefill must be transferred to the decode side before output generation can begin. In conventional dense-attention models, those using Grouped Query Attention (GQA), this KVCache is enormous. The research team benchmarks MiniMax-M2.5, a representative dense model with GQA, producing KVCache at roughly 60 Gbps for a 32K-token request on a single 8×H200 instance. That volume of data requires RDMA-class interconnects to transfer without stalling compute, which is why conventional PD disaggregation is tightly bound to a single datacenter-scale network fabric. Moving prefill and decode to separate clusters, let alone across datacenters, has simply not been feasible.
Hybrid Attention Changes the Math
What makes PrfaaS timely is an architectural shift happening at the model level. A growing class of models, including Kimi Linear, MiMo-V2-Flash, Qwen3.5-397B, and Ring-2.5-1T, adopt hybrid attention stacks that interleave a small number of full-attention layers with a larger number of linear-complexity or bounded-state layers such as Kimi Delta Attention (KDA), Multi-head Latent Attention (MLA), and Sliding Window Attention (SWA). In these architectures, only the full-attention layers produce KVCache that scales with sequence length. The linear-complexity layers maintain fixed-size recurrent states whose footprint is negligible at long context.
The KV throughput numbers, defined as KVCache size divided by prefill latency, tell the story clearly. At 32K tokens, MiMo-V2-Flash produces KVCache at 4.66 Gbps versus 59.93 Gbps for MiniMax-M2.5, a 13× reduction. Qwen3.5-397B reaches 8.25 Gbps versus 33.35 Gbps for Qwen3-235B, a 4× reduction. For Ring-2.5-1T specifically, the paper decomposes the savings: MLA contributes roughly a 4.5× compression over GQA, and the 7:1 hybrid ratio contributes another roughly 8× reduction, yielding an overall KV memory saving of roughly 36×. For the internal 1T model used in the case study, KV throughput at 32K tokens is only 3.19 Gbps, a level that modern inter-datacenter Ethernet links can actually sustain.
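The decomposition above is simple arithmetic, and it is worth seeing it spelled out. The sketch below is our own illustration: the 4.5× and 8× factors come from the paper, while the hypothetical GQA baseline figure and the helper function are assumptions for the example.

```python
# Illustrative sketch of the KV-throughput decomposition reported for
# Ring-2.5-1T. Only the two compression factors are from the paper; the
# baseline number and function are illustrative assumptions.

def kv_throughput_gbps(kv_cache_bytes: float, prefill_latency_s: float) -> float:
    """KV throughput = KVCache size / prefill latency, expressed in Gbps."""
    return kv_cache_bytes * 8 / prefill_latency_s / 1e9

MLA_COMPRESSION = 4.5     # MLA vs GQA, per full-attention layer
HYBRID_RATIO_SAVING = 8.0 # 7:1 linear:full-attention layer mix

total_saving = MLA_COMPRESSION * HYBRID_RATIO_SAVING  # ~36x overall

# A hypothetical GQA-equivalent baseline of ~114.8 Gbps would shrink to
# roughly the 3.19 Gbps the case-study model actually produces at 32K:
hybrid_gbps = 114.8 / total_saving
print(f"{total_saving:.0f}x saving -> {hybrid_gbps:.2f} Gbps")  # 36x saving -> 3.19 Gbps
```

The point of the exercise: each factor alone is insufficient, but multiplied together they move KVCache transport from RDMA territory into commodity-Ethernet territory.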
But the research team is careful to make a distinction that matters for AI devs building real systems: a smaller KVCache is necessary but not sufficient to make cross-datacenter PD disaggregation practical. Real workloads are bursty, request lengths are skewed, prefix caches are distributed unevenly across nodes, and inter-cluster bandwidth fluctuates. A naive design that routes every prefill to a remote cluster still runs into congestion and unstable queuing.
What PrfaaS Actually Does
The PrfaaS-PD architecture sits on top of three subsystems: compute, network, and storage. The compute subsystem separates clusters into two kinds: local PD clusters that handle end-to-end inference for short requests, and PrfaaS clusters with high-compute-throughput accelerators dedicated to long-context prefill. The network subsystem uses intra-cluster RDMA for fast local transfers and commodity Ethernet for cross-cluster KVCache transport. The storage subsystem builds a distributed hybrid prefix cache pool that handles linear-attention recurrent states (request-level, fixed-size, exact-match only) and full-attention KVCache blocks (block-level, growing linearly with input length, supporting partial prefix matching) in separate groups backed by a unified block pool.
The key routing mechanism is length-based threshold routing. Let l denote the incremental prefill length of a request after subtracting any cached prefix, and t a routing threshold. If l > t, the request goes to the PrfaaS cluster and its KVCache is shipped over Ethernet to a decode node. If l ≤ t, it stays on the local PD path. In the case study, the optimal threshold is t = 19.4K tokens, which routes roughly 50% of all requests, the longer ones, to the PrfaaS cluster.
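The routing rule itself is compact enough to sketch. The following is a minimal illustration under the definitions above, not the paper's implementation; the function and variable names are ours.

```python
# Minimal sketch of length-based threshold routing. l is the incremental
# prefill length after subtracting the cached prefix; t is the threshold
# (19.4K tokens in the paper's case study). Names are illustrative.

ROUTING_THRESHOLD = 19_400  # t, in tokens

def route(request_tokens: int, cached_prefix_tokens: int) -> str:
    incremental = request_tokens - cached_prefix_tokens  # l
    if incremental > ROUTING_THRESHOLD:
        return "prfaas"    # remote prefill; KVCache shipped over Ethernet
    return "local_pd"      # short enough to stay on the local PD path

print(route(32_000, 0))        # prfaas: long, cold request
print(route(32_000, 20_000))   # local_pd: large prefix-cache hit
```

Note that the decision is made on the incremental length, so a long request with a big cached prefix can still stay local, which is exactly why the prefix cache pool and the router have to cooperate.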
Making the Ethernet path reliable in practice requires more than just low KV throughput. The research team specifies three concrete transport mechanisms: layer-wise prefill pipelining to overlap KVCache generation with transmission, multi-connection TCP transport to fully utilize the available bandwidth, and congestion monitoring integrated with the scheduler to detect loss and retransmission signals early and prevent congestion from accumulating.
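The first of these mechanisms, layer-wise pipelining, can be illustrated with a toy producer/consumer: each full-attention layer's KVCache block is handed to a sender thread as soon as it is computed, so transmission overlaps with the remaining prefill compute. Everything here (names, the simulated link) is an illustrative assumption, not the paper's code.

```python
# Toy sketch of layer-wise prefill pipelining: the prefill loop enqueues
# each layer's KVCache block the moment it is ready, while a sender thread
# drains the queue concurrently (standing in for multi-connection TCP).

import queue
import threading

def prefill(num_layers: int, out: queue.Queue) -> None:
    for layer in range(num_layers):
        kv_block = f"kv-layer-{layer}"   # stand-in for the real tensor
        out.put(kv_block)                # hand off immediately, don't wait
    out.put(None)                        # end-of-stream sentinel

def sender(inp: queue.Queue, sent: list) -> None:
    while (block := inp.get()) is not None:
        sent.append(block)               # real system: write to the socket(s)

q: queue.Queue = queue.Queue()
sent: list = []
tx = threading.Thread(target=sender, args=(q, sent))
tx.start()
prefill(num_layers=4, out=q)             # compute overlaps with transmission
tx.join()
print(sent)
```

In a real deployment the queue hand-off hides most of the transfer latency behind compute, which is what keeps the 100 Gbps VPC link from ever sitting on the critical path.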
On top of this, the research team introduces a dual-timescale scheduler. At short timescales, it monitors PrfaaS egress utilization and queue depth, adjusting routing when the link approaches its bandwidth ceiling. It also handles cache-affine routing: when bandwidth is scarce, each cluster's prefix cache is evaluated independently; when bandwidth is abundant, the scheduler considers the best cached prefix across all clusters and performs a cross-cluster cache transfer if that reduces redundant computation. At longer timescales, the scheduler rebalances prefill and decode node counts within the local PD cluster as traffic patterns shift, keeping the system near the throughput-optimal operating point.
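The cache-affine rule reduces to a small decision function. The sketch below is our own simplification of the behavior described above; the function signature and cluster names are assumptions for illustration.

```python
# Simplified sketch of cache-affine routing: under scarce bandwidth only
# the local cluster's prefix cache counts; under abundant bandwidth the
# best prefix hit anywhere may be used, paying for a cross-cluster cache
# transfer only when it beats the local hit. Names are illustrative.

def best_prefix(local_hit_tokens: int,
                remote_hits: dict,
                bandwidth_abundant: bool) -> tuple:
    """Return (cluster, prefix_hit_tokens) to reuse for this request."""
    if not bandwidth_abundant:
        return ("local", local_hit_tokens)      # evaluate caches independently
    cluster, hit = max(remote_hits.items(),
                       key=lambda kv: kv[1],
                       default=("local", 0))
    if hit > local_hit_tokens:
        return (cluster, hit)                   # worth a cross-cluster transfer
    return ("local", local_hit_tokens)

print(best_prefix(4_096, {"cluster-b": 16_384}, bandwidth_abundant=True))
print(best_prefix(4_096, {"cluster-b": 16_384}, bandwidth_abundant=False))
```

The same request thus routes differently depending on instantaneous link headroom, which is why the short-timescale congestion signals feed directly into this decision.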
The Numbers
In the case study, a PrfaaS cluster of 32 H200 GPUs is paired with a local PD cluster of 64 H20 GPUs, connected by a VPC network providing roughly 100 Gbps of cross-cluster bandwidth. The aggregate PrfaaS egress load under the optimal configuration is roughly 13 Gbps, just 13% of the available Ethernet capacity, and the paper notes that the PrfaaS cluster remains compute-bound with substantial bandwidth headroom to spare. The research also projects this to larger deployments: even at the scale of a 10,000-GPU datacenter, the aggregate egress bandwidth required for KVCache transfer totals only about 1.8 Tbps, well within the capacity of modern inter-datacenter links.
Mean Time to First Token (TTFT) drops by 50% and P90 TTFT drops by 64% compared to the homogeneous baseline. The naive heterogeneous configuration, all prefill on H200 and all decode on H20 with no routing or scheduling logic, achieves only 1.16× throughput over the homogeneous baseline, compared to 1.54× for the full PrfaaS-PD system. The gap between 1.16× and 1.54× isolates the contribution of the scheduling layer and shows it accounts for most of the practical gain.
The research team positions PrfaaS not as a near-future concept but as a design that is viable today for hybrid-architecture models, and argues that as context windows grow, KVCache compression techniques mature, and phase-specialized hardware such as NVIDIA's Rubin CPX for prefill and LPU-style chips for decode becomes more widely available, the case for cross-datacenter PD disaggregation will only strengthen.
Check out the Paper here.

