We thank Greg Pereira and Robert Shaw from the llm-d team for their help in bringing llm-d to AWS.
In the agentic and reasoning era, large language models (LLMs) generate 10x more tokens and compute through complex reasoning chains compared to single-shot replies. Agentic AI workflows also create highly variable demands and another exponential increase in processing, bogging down the inference process and degrading the user experience. As the world transitions from prototyping AI solutions to deploying AI at scale, efficient inference is becoming the gating factor.
LLM inference consists of two distinct phases: prefill and decode. The prefill phase is compute bound. It processes the entire input prompt in parallel to generate the initial set of key-value (KV) cache entries. The decode phase is memory bound. It autoregressively generates one token at a time while requiring substantial memory bandwidth to access model weights and the ever-growing KV cache. Adding to this complexity, inference requests vary widely in computational requirements based on input and output length, making efficient resource utilization particularly challenging.
Traditional approaches often involve deploying models on predetermined infrastructure and topology, or using basic distributed strategies that don't account for these distinct phases of LLM inference. This leads to suboptimal resource utilization, with GPUs either underutilized or overloaded during different inference phases. While vLLM has emerged as a popular open source inference engine that improves efficiency through continuous batching and PagedAttention, organizations deploying at scale still face challenges in orchestrating deployments and optimizing routing decisions across multiple nodes.
We're announcing a joint effort with the llm-d team to bring powerful disaggregated inference capabilities to AWS so that customers can improve performance, maximize GPU utilization, and optimize costs for serving large-scale inference workloads. This launch is the result of several months of close collaboration with the llm-d community to deliver a new container, ghcr.io/llm-d/llm-d-aws, that includes AWS-specific libraries such as Elastic Fabric Adapter (EFA) and libfabric, along with integration of llm-d with the NIXL library to support critical features such as multi-node disaggregated inference and expert parallelism. We've also performed extensive benchmarking through several iterations to arrive at a stable release that lets customers access these powerful capabilities out of the box on AWS Kubernetes offerings such as Amazon SageMaker HyperPod and Amazon Elastic Kubernetes Service (Amazon EKS).
Throughout this blog post, we introduce the concepts behind next-generation inference capabilities, including disaggregated serving, intelligent request scheduling, and expert parallelism. We discuss their benefits and walk through how you can implement them on Amazon SageMaker HyperPod EKS to achieve significant improvements in inference performance, resource utilization, and operational efficiency.
What’s llm-d?
llm-d is an open source, Kubernetes-native framework for distributed large language model (LLM) serving. Built on top of vLLM, llm-d extends the core inference engine with production-grade orchestration, advanced scheduling, and high-performance interconnect support to enable scalable, multi-node model serving.
Rather than treating inference as a single-node execution problem, llm-d introduces architectural patterns for disaggregated serving, separating and optimizing stages such as prefill, decode, and KV cache management across distributed GPU resources. This lets operators efficiently use high-speed fabrics such as AWS Elastic Fabric Adapter (EFA), while maintaining compatibility with Kubernetes-native deployment workflows.
To make these capabilities accessible, llm-d provides a set of well-lit paths: reference serving architectures that package proven optimization strategies for different performance, scalability, and workload goals.
Intelligent inference scheduling
While the intelligent scheduling example makes routing decisions based on factors such as queue depth, its distinctive approach to routing is that it attempts to estimate the KV cache locality of requests without requiring visibility into the state of the KV cache itself. In a single-instance environment, engines like vLLM use Automatic Prefix Caching to reduce redundant computation by reusing prior KV cache entries, driving faster and more efficient performance. However, the moment you scale to a distributed, multi-replica environment, assumptions about which KV blocks exist on which GPUs no longer hold. Without awareness of where requests' intermediate states live, requests can be routed to instances that lack relevant cached context, negating the benefits of prefix caching entirely.
The llm-d scheduler addresses this by maintaining visibility into the cache state across the serving replicas and routing requests accordingly. For workloads with high prefix reuse, such as multi-turn conversations or agentic workflows, this cache-aware routing can lead to significant improvements in throughput and latency by making sure that requests are directed to servers that already hold relevant KV cache entries.
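To make the idea concrete, here is a minimal Python sketch of cache-aware routing. It is an illustration of the technique, not llm-d's actual scorer: the names `block_hashes` and `pick_replica` and the block size are all hypothetical, and the chained block hashing only mirrors the spirit of vLLM-style prefix-cache keys.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (illustrative value)

def block_hashes(tokens):
    # Chain-hash fixed-size token blocks: each block's hash folds in the
    # previous one, so a hit implies the whole prefix up to it matches.
    hashes, prev = [], b""
    usable = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, usable, BLOCK_SIZE):
        h = hashlib.sha256(prev + str(tokens[i:i + BLOCK_SIZE]).encode()).digest()
        hashes.append(h)
        prev = h
    return hashes

def pick_replica(tokens, replicas):
    # replicas: name -> {"blocks": set of cached block hashes, "queue": depth}.
    # Prefer the longest contiguous cached prefix; break ties on queue depth.
    prefix = block_hashes(tokens)
    def score(name):
        info = replicas[name]
        hits = 0
        for h in prefix:
            if h not in info["blocks"]:
                break  # prefix reuse stops at the first missing block
            hits += 1
        return (hits, -info["queue"])
    return max(replicas, key=score)
```

A replica that holds part of the request's prefix wins even if it is busier, because recomputing prefill is usually costlier than queueing; with no cache overlap anywhere, the choice falls back to the shortest queue.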
Prefill and Decode disaggregation
As described earlier, the prefill and decode phases of LLM inference have fundamentally different resource profiles, with prefill being compute-intensive and decode being memory-bandwidth-intensive. In a traditional deployment, both phases share the same hardware, meaning neither can be independently optimized. Separating these two phases unlocks several optimization opportunities. For example, if your output context length is greater than your input length, you can assign more GPUs to decode than to prefill. You can also place these two phases on different types of hardware, each tuned for its respective workload characteristics.
In llm-d, prefill servers are optimized for processing input prompts efficiently, while decode servers focus on generating output tokens with low latency. The intelligent scheduler decides which instances should receive a given request, and the transfer is coordinated using a sidecar running alongside the decode instances. The sidecar instructs vLLM to perform point-to-point KV cache transfers over fast interconnects to make sure that the decode server receives the required cached context from the prefill server with minimal overhead. This disaggregation significantly improves both time to first token (TTFT) and overall throughput, particularly for workloads with long prompts or when serving large models.
Wide expert parallelism
For Mixture-of-Experts (MoE) models such as DeepSeek-R1, Qwen3.5, Minimax, and Kimi K2.5, llm-d provides optimized deployment patterns that use data parallelism and expert parallelism. This approach enables efficient deployment of large MoE models by distributing experts horizontally across multiple nodes while maintaining performance. By spreading model experts across accelerators and using optimized communication patterns, llm-d can significantly reduce end-to-end latency and improve throughput for these complex architectures. However, scaling MoE models introduces more complex parallelism, communication, and scheduling requirements that must be carefully tuned for each deployment scenario.
Tiered prefix caching
Prefix caching avoids performing repetitive and costly KV cache computations, improving metrics such as TTFT and overall throughput. While inference engines like vLLM have native prefix caching built in, they are constrained by the amount of GPU memory available on a given instance. To expand the effective size of the KV cache beyond GPU memory limits, llm-d offers a tiered caching path that offloads KV cache entries from GPU memory to other storage tiers such as CPU memory or local disk.
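The mechanics of offloading can be sketched as a toy two-tier LRU cache in Python, where a small "GPU" tier spills its least-recently-used entries into a larger "CPU" tier instead of discarding them. This is an illustration of the tiering idea under simplified assumptions, not llm-d's or LMCache's implementation, and the class name is hypothetical.

```python
from collections import OrderedDict

class TieredKVCache:
    # Two-tier LRU cache: evictions from the fast tier are demoted to the
    # slow tier; hits in the slow tier are promoted back to the fast tier.
    def __init__(self, gpu_capacity, cpu_capacity):
        self.gpu, self.cpu = OrderedDict(), OrderedDict()
        self.gpu_capacity, self.cpu_capacity = gpu_capacity, cpu_capacity

    def put(self, key, value):
        self.gpu[key] = value
        self.gpu.move_to_end(key)
        while len(self.gpu) > self.gpu_capacity:
            k, v = self.gpu.popitem(last=False)  # evict LRU from "GPU" tier
            self.cpu[k] = v                      # offload rather than drop
            self.cpu.move_to_end(k)
        while len(self.cpu) > self.cpu_capacity:
            self.cpu.popitem(last=False)         # slow tier finally discards

    def get(self, key):
        if key in self.gpu:
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.cpu:
            value = self.cpu.pop(key)
            self.put(key, value)                 # promote on slow-tier hit
            return value
        return None                              # miss: prefill must recompute
```

The payoff is that a request whose prefix was evicted from GPU memory still avoids recomputation, paying only a memory-copy cost instead of a full prefill.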
These well-lit paths are offered as starting points for the configuration and deployment of model servers. They are designed as composable building blocks for vLLM deployments and inference scheduler configuration, meaning features across multiple paths can be combined and configured together to suit specific workload requirements.
Running llm-d on AWS
Amazon SageMaker HyperPod EKS
Amazon SageMaker HyperPod offers resilient, high-performance Kubernetes infrastructure optimized for large-scale model training and inference. It provides persistent, high-performance clusters that handle many of the infrastructure challenges organizations face when deploying large models. Health monitoring is built into the system, with proactive detection and remediation of hardware failures to maintain high availability for production workloads. Native Kubernetes support simplifies container orchestration, making it an ideal foundation for llm-d's Kubernetes-native architecture.
Reference Architecture
To understand how llm-d operates efficiently on AWS infrastructure, it is important to understand the communication layers that enable high-performance distributed inference. For GPU-to-GPU communication on a single node, NVLink and NVSwitch are used for high-bandwidth transfers between prefill and decode workers. The following sections describe the key components and how they work together.
NIXL for Point-to-Point Inference Transfers
While NCCL, which is widely used in LLM training, excels at collective communication patterns, disaggregated inference architectures require efficient point-to-point data transfers, for example, moving KV cache data from a prefill node to a decode node. NVIDIA Inference Xfer Library (NIXL) is purpose-built for this scenario. NIXL provides a memory abstraction layer that spans CPU memory, GPU memory, and storage backends including file, block, and object stores such as Amazon S3. It functions as an abstraction layer over different transfer methods, including libfabric for EFA interfaces, UCCL, and GPUDirect Storage.
Through NIXL, instances transfer KV cache data between prefill and decode servers using RDMA. RDMA allows GPUs to bypass the operating system and read peer device memory directly, which is critical for inference workloads where TTFT is a key performance metric. In the llm-d architecture, vLLM servers are deployed in InferencePools for routing, and prefill/decode disaggregation is configured using NIXL as the connector for KV cache sharing. NIXL leverages the EFA interfaces attached to instances for high-bandwidth communication, making sure that the overhead of transferring cached context between disaggregated phases stays minimal.
UCX and the Transport Layer
Unified Communication X (UCX) is a lower-level communication framework that provides the transport layer NIXL can use for inter-node communication. UCX supports RDMA operations that enable zero-copy, kernel-bypass networking, which is critical for minimizing latency and maximizing bandwidth in distributed workloads. Importantly, UCX has native support for AWS Elastic Fabric Adapter (EFA) through the libfabric interface, providing the high-performance plumbing that NCCL relies on when GPUs need to communicate across nodes.
Elastic Fabric Adapter (EFA)
EFA provides a high-performance networking interface on AWS, which is essential for scaling distributed inference across multiple nodes. EFA uses libfabric as its userspace interface, and UCX includes a libfabric transport layer that can leverage EFA directly. This integration means that when llm-d deploys vLLM across multiple nodes, the underlying communication stack can take full advantage of EFA's low-latency, high-bandwidth networking without requiring changes at the application level.
We can configure the AWS Load Balancer Controller to provision load balancers for connecting to the Inference Gateway. The Inference Gateway (IGW) sits in front of vLLM instances, providing intelligent request scheduling and routing based on various factors including cache locality and server load. The KV Cache Manager enables cache-aware routing and distributed cache management, tracking which KV cache blocks reside on which nodes. These components work together to create a flexible, extensible system for LLM inference that addresses the unique challenges of serving large models at scale.
With SageMaker HyperPod's observability dashboards, you can monitor key metrics during inference, such as GPU utilization, EFA metrics, and error counts, to proactively monitor and optimize your inference workloads.
Best Practices
Disaggregated inference lets you scale your prefill nodes separately from your decode nodes, allowing you to tune performance for your workloads. For example, large input sequence lengths combined with short output sequence lengths make for a prefill-heavy workload. Disaggregated inference lets you scale your prefill pods to handle more requests efficiently without an increase in cost. It is not for all workloads, however. It is worth trying with larger models, longer input sequences, and sparse MoE architectures.
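As a back-of-envelope illustration of that sizing intuition, you could weight the prefill pool by input tokens and the decode pool by output tokens. This is a hypothetical heuristic, not an llm-d feature, and real deployments should set the ratio by benchmarking:

```python
def suggest_pd_split(input_len, output_len, total_pods, min_each=1):
    # Naive split: prefill work scales with input tokens, decode work
    # with output tokens; clamp so each pool keeps at least one pod.
    prefill_share = input_len / (input_len + output_len)
    prefill = round(total_pods * prefill_share)
    prefill = max(min_each, min(total_pods - min_each, prefill))
    return prefill, total_pods - prefill
```

A prefill-heavy workload (long prompts, short answers) skews the split toward prefill pods, while a generation-heavy workload skews it toward decode; balanced input and output lengths split the pool evenly.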
llm-d also provides paths for intelligently routing traffic to specific pods based on metrics such as request queues and KV cache events via the inference gateway. This improves performance and KV cache hit rates for LLM inference workloads, increasing throughput. The project is still growing and adding more paths and improvements for hosting LLM workloads.
Deployment Overview
Prerequisites
Before we proceed with deploying either pattern, you need the following components set up locally on your system:
llm-d Setup
llm-d uses the Gateway API Inference Extension, which requires the installation of the CRDs and an implementation such as Istio. Clone the llm-d repository and navigate to the installation helper:
git clone https://github.com/llm-d/llm-d.git
cd llm-d/guides/prereq/gateway-provider
Install the provider and implementation:
./install-gateway-provider-dependencies.sh
helmfile apply -f istio.helmfile.yaml # or the kgateway helmfile if using kgateway
Once they are installed, you can start deploying the guides.
Model Deployment
The llm-d repository provides various well-lit paths for inference on Kubernetes, located on their GitHub. Each guide is configured using a helmfile and is split across two folders: one for the Gateway API Inference Extension, which configures the Kubernetes Gateway, and one for the model service, which configures the model hosting.
Docker image with AWS libraries: ghcr.io/llm-d/llm-d-aws:v0.5.1
To expose a Gateway with an AWS Load Balancer, you can configure the required type and annotations under ./guides/prereq/gateway-provider/common-configurations.
For example, we configured ./guides/prereq/gateway-provider/common-configurations/istio.yaml as:
# Infra values
gateway:
  gatewayClassName: istio
  gatewayParameters:
    accessLogging: false
    logLevel: error
  resources:
    limits:
      cpu: "16"
      memory: 16Gi
    requests:
      cpu: "4"
      memory: 4Gi
  service:
    type: LoadBalancer
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
      service.beta.kubernetes.io/aws-load-balancer-scheme: internal
      service.beta.kubernetes.io/aws-load-balancer-type: external
# GAIE values
inferenceExtension:
  flags:
    v: 1
  provider:
    name: istio
  istio:
    destinationRule:
      trafficPolicy:
        connectionPool:
          http:
            http1MaxPendingRequests: 256000
            maxRequestsPerConnection: 256000
            http2MaxRequests: 256000
            idleTimeout: "900s"
          tcp:
            maxConnections: 256000
            maxConnectionDuration: "1800s"
            connectTimeout: "900s"
# MS values
routing:
  proxy:
    zapLogLevel: error
When the Istio Gateway is created, it will provision a Network Load Balancer in your VPC for use. From here, you can configure the example as per the instructions in the README file to deploy the stack. To get started running the inference-scheduling example, from the llm-d directory run:
cd guides/inference-scheduling
Here you will see a structure like the following:
❯ tree
.
├── gaie-inference-scheduling
│ └── values.yaml
├── helmfile.yaml.gotmpl
├── httproute.gke.yaml
├── httproute.yaml
├── ms-inference-scheduling
│ ├── digitalocean-values.yaml
│ ├── values_amd.yaml
│ ├── values_cpu.yaml
│ ├── values_tpu.yaml
│ ├── values_xpu.yaml
│ ├── values-hpu.yaml
│ └── values.yaml
└── README.md
The ms-inference-scheduling folder contains the configuration values for running vLLM replicas on your nodes. gaie-inference-scheduling configures the inference gateway using the provider you chose previously.
Once you are ready to deploy, run helmfile apply to deploy the guide on your cluster.
Deploying with Prefill-Decode Disaggregation
The guide to deploy with prefill/decode disaggregation is located in guides/pd-disaggregation. For running within an environment such as a SageMaker HyperPod cluster, you need to configure the replicas to run using an EFA-enabled image and make sure to allocate EFA interfaces to the pods.
Within ms-pd/values.yaml, you configure it similar to the following:
containers:
  - name: "vllm"
    image: ghcr.io/llm-d/llm-d-aws
    modelCommand: vllmServe
    args:
      - "--block-size"
      - "128"
      - "--kv-transfer-config"
      - '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_connector_extra_config":{"backends":["LIBFABRIC"]}}'
      - "--disable-uvicorn-access-log"
      - "--max-model-len"
      - "32000"
    env:
      - name: VLLM_NIXL_SIDE_CHANNEL_HOST
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
      - name: HF_HOME
        value: "/model-cache"
    ports:
      - containerPort: 8000
        name: vllm
        protocol: TCP
      - containerPort: 5600
        name: nixl
        protocol: TCP
    resources:
      limits:
        memory: 64Gi
        cpu: "8"
        # note: GPU resources get managed by parallelism + accelerators above
        vpc.amazonaws.com/efa: 4
      requests:
        memory: 64Gi
        cpu: "8"
        # note: GPU resources get managed by parallelism + accelerators above
        vpc.amazonaws.com/efa: 4
The image needs to use llm-d's AWS-compatible container. vLLM is configured so that NIXL will use the libfabric backend to maximize network bandwidth. For configuring the number of EFA interfaces, you should allocate based on the number of GPUs each pod is running with and the number of EFA interfaces available on the instance. For example, a p5.48xlarge instance has 8 H100 GPUs with 32 Elastic Fabric Adapter interfaces, so you should configure each replica to have 4 EFA interfaces per GPU.
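That allocation rule is simple arithmetic, sketched below; the defaults encode the p5.48xlarge figures from the text (8 GPUs, 32 EFA interfaces), and the function name is ours:

```python
def efa_per_pod(gpus_per_pod, node_gpus=8, node_efa=32):
    # Allocate EFA interfaces proportionally to the GPUs a pod uses.
    # Defaults reflect a p5.48xlarge: 32 EFA interfaces / 8 GPUs = 4 per GPU.
    per_gpu, remainder = divmod(node_efa, node_gpus)
    assert remainder == 0, "EFA interfaces should divide evenly across GPUs"
    return gpus_per_pod * per_gpu
```

So a prefill pod with tensor parallel degree 1 would request 4 EFA interfaces, and a decode pod with tensor parallel degree 4 would request 16.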
Optionally, you can also configure "enable_cross_layers_blocks": "True" in the kv_connector_extra_config to reduce the amount of data that vLLM transfers.
Running Inference
Once deployed, EKS will have created an AWS Network Load Balancer for the deployment. To get the Load Balancer DNS name, run kubectl get gateways. You can then invoke it with curl:
export OPENAI_API_BASE=
curl $OPENAI_API_BASE/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic",
    "messages": [
      {
        "role": "user",
        "content": "Hello! Who are you?"
      }
    ],
    "max_tokens": 256
  }' | jq
Disaggregated Inference
Benchmarking
We deployed OpenAI's GPT-OSS on vLLM with a tensor parallel degree of 4 on an ml.p6-b200.48xlarge. We compared it against llm-d's path for prefill/decode disaggregation with 4 prefill pods, each with a tensor parallel degree of 1, and 1 decode pod with a tensor parallel degree of 4. The pods were connected using NIXL with libfabric as the underlying transport backend to use Elastic Fabric Adapter networking on the instances.
In our testing, we found that using llm-d's prefill/decode disaggregation path increases tokens per second by up to 70% as concurrency increases, compared to a standard vLLM deployment, when load testing with an input sequence of 1024 tokens and receiving 1024 output tokens at up to a concurrency of 128. This performance profile varies based on your vLLM configuration and workload. Tuning your prefill/decode ratio and other parameters available from the vLLM server can potentially deliver more performance.
Conclusion
llm-d provides paths for deployment techniques such as prefill/decode disaggregation, precise KV-aware routing, and tiered KV caching. These provide further ways to improve performance for hosting at scale. You can tune the vLLM settings as required to improve metrics such as TTFT, ITL, or cache hits. You can also use frameworks such as LMCache for KV offloading. Check out llm-d at https://llm-d.ai/docs/architecture
About the authors
Vivek Gangasani
Vivek Gangasani is a Worldwide Leader for Solutions Architecture, SageMaker Inference. He leads Solutions Architecture, Technical Go-to-Market (GTM), and Outbound Product strategy for SageMaker Inference. He also helps enterprises and startups deploy and optimize GenAI models and build AI workflows with SageMaker and GPUs. Currently, he is focused on developing strategies and content for optimizing inference performance and use cases such as agentic workflows and RAG. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Andrew Smith
Andrew Smith is a Sr. Cloud Support Engineer in the SageMaker, Vision & Other team at AWS, based in Sydney, Australia. He supports customers using many AI/ML services on AWS, with expertise in working with Amazon SageMaker. Outside of work, he enjoys spending time with friends and family as well as learning about different technologies.
Goutham Annem
Goutham Annem is a Senior Technical Account Manager at AWS, based in the Bay Area, California. He partners with customers to design and optimize cloud infrastructure with a focus on scalability, reliability, and performance, supporting the implementation of containerized workloads, GenAI solutions, MLOps pipelines, and technical strategies that drive business outcomes. He is a sports enthusiast with a particular fondness for badminton and cricket, and regularly indulges in hikes in the Bay Area to connect with nature.

