Organizations are racing to deploy generative AI models into production to power intelligent assistants, code generation tools, content engines, and customer-facing applications. But deploying these models to production remains a weeks-long process of navigating GPU configurations, optimization techniques, and manual benchmarking, delaying the value these models are built to deliver.
Today, Amazon SageMaker AI supports optimized generative AI inference recommendations. By delivering validated, optimal deployment configurations with performance metrics, Amazon SageMaker AI keeps your model builders focused on building accurate models, not managing infrastructure.
We evaluated several benchmarking tools and chose NVIDIA AIPerf, a modular component of NVIDIA Dynamo, because it exposes detailed, consistent metrics and supports diverse workloads out of the box. Its CLI, concurrency controls, and dataset options give us the flexibility to iterate quickly and test across different scenarios with minimal setup.
“With the integration of modular components of the open source NVIDIA Dynamo distributed inference framework directly into Amazon SageMaker AI, AWS is making it easier for enterprises to deploy generative AI models with confidence. AWS has been instrumental in advancing AIPerf through deep collaboration and technical contributions. The integration of NVIDIA AIPerf demonstrates how standardized benchmarking can eliminate weeks of manual testing and deliver validated, deployment-ready configurations to end users.”
– Eliuth Triana, Developer Relations Manager, NVIDIA
The challenge: From model to production takes weeks
Deploying models at scale requires production inference endpoints that satisfy clear performance goals, whether that is a latency service level agreement (SLA), a throughput target, or a cost ceiling. Achieving that requires finding the right combination of GPU instance type, serving container, parallelism strategy, and optimization techniques, all tuned to the specific model and traffic patterns.
Figure 1: The three core challenges teams face when deploying generative AI models to production
The decision space is impossibly large. A single deployment involves choosing from over a dozen GPU instance types, multiple serving containers, various parallelism degrees, and a growing set of optimization techniques such as speculative decoding. These all interact with one another, and there is no validated guidance to narrow the search. The only way to find the right configuration is to test, and that is where the real cost begins. Teams provision instances, deploy the model, run load tests, analyze results, and repeat. This cycle takes two to three weeks per model and requires expertise in GPU infrastructure, serving frameworks, and performance optimization that most teams don't have in-house.
Many teams start manually: they pick a few instance types, deploy the model, run load tests, compare latency, throughput, and cost, then repeat. More mature teams often script parts of the process using benchmarking tools, deployment templates, or continuous integration and continuous delivery (CI/CD) pipelines. Even when workloads are scripted, teams still face significant work. They need to test and validate their scripts, choose which configurations to benchmark, set up the benchmarking environment, interpret the results, and balance trade-offs between latency, throughput, and cost.
Teams are often left making high-stakes infrastructure decisions without knowing whether a better, cheaper option exists. They default to over-provisioning, choosing more expensive GPU infrastructure than they need and running configurations that don't fully use the compute resources they're paying for. The risk of under-performing in production is far worse than overspending on compute. The result is wasted GPU spend that compounds with every model deployed and every month the endpoint runs.
How optimized generative AI inference recommendations work
You bring your own generative AI model, define your expected traffic patterns, and specify a single performance goal: optimize for cost, minimize latency, or maximize throughput. From there, SageMaker AI takes over in three stages.
Stage 1: Narrow the configuration space
SageMaker AI analyzes the model's architecture, size, and memory requirements to identify the instance types and parallelism strategies that can realistically meet your goal. Instead of testing every possible combination, it narrows the search to the configurations worth evaluating, across the instance types you select (up to three).
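As a rough illustration of that first-stage filtering, you can do back-of-the-envelope memory math: weights at a given precision, plus headroom for KV cache and activations, must fit across the instance's GPUs at some parallelism degree. A minimal sketch (the 1.2× headroom factor is an assumption for illustration; SageMaker AI's actual analysis is more sophisticated):

```python
def fits_on_instance(params_billion: float, bytes_per_param: float,
                     gpu_mem_gib: float, num_gpus: int,
                     headroom: float = 1.2) -> bool:
    """Rough check: do the model weights (plus headroom for KV cache and
    activations) fit across the instance's GPUs?"""
    weights_gib = params_billion * bytes_per_param  # 1B params ~ 1 GiB per byte/param
    return weights_gib * headroom <= gpu_mem_gib * num_gpus

# A 20B-parameter model in 16-bit precision needs ~40 GiB of weights:
print(fits_on_instance(20, 2, 24, 1))  # one 24 GiB GPU -> False
print(fits_on_instance(20, 2, 80, 1))  # one 80 GiB H100-class GPU -> True
```

Filtering candidate instance types with a check like this, before anything is deployed, is what keeps the benchmarking stage tractable.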
Stage 2: Apply goal-aligned optimizations
Based on your chosen performance goal, SageMaker AI applies optimization techniques to each candidate configuration, such as:
- For throughput goals, it trains speculative decoding models (such as EAGLE 3.0) that allow the model to generate multiple tokens per forward pass, significantly increasing tokens per second.
- For latency goals, it tunes compute kernels to reduce per-token processing time, lowering time to first token.
- Tensor parallelism is applied based on model size and instance capability, distributing the model across available GPUs to handle models that exceed single-GPU memory.
You don't need to know which technique is right for your goal. SageMaker AI selects and applies the optimizations automatically.
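To make the speculative decoding idea concrete, here is a toy sketch of the draft-and-verify loop with stand-in models. Real systems such as EAGLE verify all draft positions in one batched target forward pass and use probabilistic acceptance; this greedy version only illustrates the control flow:

```python
def speculative_step(draft_model, target_model, prefix, k=4):
    """One decoding step: draft k cheap tokens, keep the prefix the target
    model agrees with, and always emit at least one target token."""
    ctx = list(prefix)
    proposed = []
    for _ in range(k):                     # cheap autoregressive draft
        tok = draft_model(ctx)
        proposed.append(tok)
        ctx.append(tok)

    accepted = list(prefix)
    for tok in proposed:                   # verify against the target model
        if target_model(accepted) == tok:
            accepted.append(tok)           # draft guess accepted
        else:
            accepted.append(target_model(accepted))  # correction token, stop
            break
    else:
        accepted.append(target_model(accepted))      # all accepted: bonus token
    return accepted

# Stand-in "models" that greedily continue a fixed string:
text = "abcdef"
model = lambda ctx: text[len(ctx)]
print("".join(speculative_step(model, model, ["a"])))  # perfect draft: 5 new tokens
```

When the draft model agrees with the target, each step emits up to k+1 tokens for roughly one target-model pass, which is where the throughput gain comes from.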
Stage 3: Benchmark and return ranked recommendations
SageMaker AI benchmarks each optimized configuration on real GPU infrastructure using NVIDIA AIPerf, measuring time to first token, inter-token latency, P50/P90/P99 request latency, throughput, and cost. The result is a set of ranked, deployment-ready recommendations with validated metrics for each configuration and instance type. Here is what the workflow looks like from your perspective using the SageMaker AI APIs.
Figure 2: Generative AI inference recommendations workflow
- Prepare your model. Bring your generative AI model from Amazon Simple Storage Service (Amazon S3) or the SageMaker Model Registry, including Hugging Face checkpoint formats with SafeTensors weights, base models, and custom or fine-tuned models trained on your own data.
- Define your workload (optional). Describe expected traffic patterns, including input and output token distributions and concurrency levels. You can provide these inline or use a representative dataset from Amazon S3.
- Set your optimization goal. Choose a single objective: optimize for cost, minimize latency, or maximize throughput. Select up to three instance types to test.
- Review ranked recommendations. SageMaker AI returns deployment-ready configurations with validated metrics such as time to first token, inter-token latency, P50/P90/P99 request latency, throughput, and cost projections. Compare the recommendations and select the best fit.
- Deploy the chosen configuration. Deploy the selected configuration to a SageMaker inference endpoint programmatically through the API.
Additional options: You can also benchmark existing production endpoints to validate current performance or compare them against new configurations. SageMaker AI can use existing machine learning (ML) Reservations (Flexible Training Plans) at no additional compute cost, or use on-demand compute provisioned automatically.
Pricing
There are no additional costs for generating optimized generative AI inference recommendations. Customers incur standard compute costs for the optimization jobs that generate optimized configurations and for the endpoints provisioned during benchmarking. Customers with existing ML Reservations (Flexible Training Plans) can run benchmarking on their reserved capacity at no additional cost, meaning the only cost is the optimization job itself.
Getting started with optimized generative AI inference recommendations requires only a few API calls with SageMaker AI.
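For instance, once a recommendation job finishes, picking the top result out of the describe call's response takes just a few lines. A minimal sketch using a stubbed response (the shape mirrors the deployment snippet later in this post; treat the exact field names as illustrative and check the SageMaker AI API reference):

```python
def best_recommendation(response: dict) -> dict:
    """Return the top-ranked entry; recommendations are assumed ranked best-first."""
    recs = response.get("Recommendations", [])
    if not recs:
        raise ValueError("recommendation job returned no results")
    return recs[0]

# Stubbed DescribeAIRecommendationJob-style response for illustration:
stub = {
    "Recommendations": [
        {"ModelDetails": {
            "ModelPackageArn": "arn:aws:sagemaker:us-east-1:123456789012:model-package/demo/1",
            "InferenceSpecificationName": "ml-p5en-48xlarge-spec",
        }},
    ]
}
top = best_recommendation(stub)
print(top["ModelDetails"]["InferenceSpecificationName"])
```

Against a live job you would pass the response of `describe_ai_recommendation_job` instead of the stub.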
For detailed API walkthroughs, code examples, and sample notebooks, see the SageMaker AI documentation and the sample notebooks on GitHub.
Benchmarking rigor built in
Every recommendation from SageMaker AI is grounded in real measurements, not estimates or simulations. Under the hood, SageMaker AI benchmarks every configuration on real GPU infrastructure using NVIDIA AIPerf, an open-source benchmarking tool that measures key inference metrics including time to first token, inter-token latency, throughput, and requests per second.
AWS has contributed to AIPerf to strengthen the statistical foundation of benchmarking results. These contributions include multi-run confidence reporting, enabling you to measure variance across repeated benchmark trials and quantify result quality with statistically grounded confidence intervals. This moves you beyond fragile single-run numbers toward benchmark results you can trust when making decisions about model selection, infrastructure sizing, and performance regressions. AWS also contributed adaptive convergence and early stopping, allowing benchmarks to stop once metrics have stabilized instead of always running a fixed number of trials. This means lower benchmarking cost and faster time to results without sacrificing rigor. For the broader inference community, it raises the quality of benchmarking methodology by focusing on repeatability, statistical confidence, and distribution-aware analysis rather than headline numbers from a single trial.
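The idea behind those two contributions can be sketched in a few lines: repeat trials, compute a confidence interval on the mean, and stop once the interval is tight relative to the mean. This is a simplified stand-in for AIPerf's actual logic; the 1.96 z-value (normal approximation) and the 5% relative-width threshold are illustrative choices:

```python
import statistics

def run_until_converged(run_trial, min_runs=3, max_runs=10, rel_width=0.05):
    """Repeat a benchmark trial until the ~95% CI half-width on the mean
    drops below rel_width * mean, or max_runs is reached."""
    samples, half_width = [], float("inf")
    for _ in range(max_runs):
        samples.append(run_trial())
        if len(samples) >= min_runs:
            mean = statistics.mean(samples)
            sem = statistics.stdev(samples) / len(samples) ** 0.5
            half_width = 1.96 * sem        # normal approximation
            if half_width <= rel_width * mean:
                break                      # metric has stabilized: stop early
    return statistics.mean(samples), half_width, len(samples)

# Deterministic stand-in for a repeated latency measurement (ms):
readings = iter([101.0, 99.0, 100.0, 100.5, 99.5])
mean_ms, hw, n_runs = run_until_converged(lambda: next(readings))
print(mean_ms, n_runs)  # converges after 3 of a possible 10 runs
```

The early-stop branch is what trades a fixed trial count for "run until the numbers are trustworthy," which is where the cost and time savings come from.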
Optimizations in action
To see what these goal-aligned optimizations look like in practice, consider a real example. A customer deploying GPT-OSS-20B on a single ml.p5en.48xlarge (H100) instance selects maximize throughput as their performance goal. SageMaker AI identifies speculative decoding as the most effective optimization for this goal, trains an EAGLE 3.0 draft model, applies it to the serving configuration, and benchmarks both the baseline and the optimized configuration on real GPU infrastructure.
Figure 3: GPT-OSS-20B (mxfp4) on 1x H100 (p5en.48xlarge) (3,500 input / 200 output tokens)
The graph shows that after throughput optimization, the same instance can serve 2x more tokens at the same request latency. Delivering 2x more tokens/s at 1,000 ms latency means you can serve twice as many users on the same hardware, effectively cutting inference cost per token in half. This is exactly the kind of optimization that SageMaker AI applies automatically when you select a throughput goal. You don't need to know that speculative decoding is the right technique, or how to train a draft model, or how to configure it for your specific model and hardware. SageMaker AI handles it end to end and returns the validated results as part of the ranked recommendations.
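The cost claim follows from simple arithmetic: the instance's hourly price is fixed, so doubling tokens per second halves the cost per token. A quick sketch (the $60/hour figure is hypothetical, not an actual AWS rate):

```python
def usd_per_million_tokens(hourly_price_usd: float, tokens_per_sec: float) -> float:
    """Cost of generating one million tokens at a given sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000

baseline = usd_per_million_tokens(60.0, 5_000)    # before optimization
optimized = usd_per_million_tokens(60.0, 10_000)  # 2x throughput after optimization
print(f"${baseline:.2f} -> ${optimized:.2f} per million tokens")  # optimized is half
```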
Customer value
Cost efficiency and transparency: Clear price-performance comparisons across instance types of your choice enable right-sizing instead of defaulting to the most expensive option. Rather than over-provisioning because you cannot afford to risk under-performing, you can select the configuration that delivers the performance you need at the right price. Savings compound with every model deployed and every month the endpoint runs.
Speed to production: Teams iterate faster, test more configurations, and get to production sooner. Every day saved in deployment is a day your generative AI investment is delivering value to customers.
Confidence in production: Every recommendation is backed by real measurements on real GPU infrastructure using NVIDIA AIPerf, not estimates or simulations. Deploy knowing your configuration has been validated against your specific model and workload, at percentile-level precision that matches production conditions.
Use cases
- Pre-deployment validation: Optimize and benchmark a new model before committing to a production deployment. Know exactly how it will perform before you invest in scaling it.
- Regression testing after updates: Validate performance after a container update, framework upgrade, or serving library release. Confirm that your configuration is still optimal before pushing to production.
- Right-sizing when conditions change: When traffic patterns shift or new instance types become available, re-run optimized generative AI inference recommendations in hours rather than restarting a weeks-long manual process.
- Model comparison: Compare the performance and cost of different model variants across instance types to make an informed decision before production deployment.
- Cost optimization: Benchmark existing production endpoints to identify over-provisioned infrastructure. Use the results to right-size and reduce recurring inference spend.
Benchmark inference endpoints
An AI benchmark job runs performance benchmarks against your SageMaker AI inference endpoints using a predefined workload configuration. Use benchmark jobs to measure the performance of your generative AI inference infrastructure before and after optimization. When the benchmark job completes, all results are written to the Amazon S3 output location you specified.
Once you download and extract the zipped output, you get the following files:
output/
├── profile_export_aiperf.json # aggregated metrics
├── profile_export_aiperf.csv # same metrics in CSV
├── profile_export.jsonl # raw per-request data
├── inputs.json # prompts sent during the run
├── benchmark_summary.txt # completion summary
├── MANIFEST.txt # index of all files with sizes
├── plot_generation.log # plot generation log
├── plots/
│ ├── ttft_timeline.png # TTFT per request over time
│ ├── ttft_over_time.png # TTFT aggregated over the run duration
│ ├── summary.txt # list of generated plots
│ └── aiperf_plot.log # plot generation trace
└── logs/
└── aiperf.log # full AIPerf execution log
The main outputs are profile_export_aiperf.json and its CSV counterpart profile_export_aiperf.csv; both contain the same aggregated metrics: latency percentiles (p50, p90, p99), output token throughput, time to first token (TTFT), and inter-token latency (ITL). These are the numbers you would use to evaluate how the model performed under the simulated load.
Alongside that, profile_export.jsonl gives you the raw per-request data: each individual request logged with its own latency, token counts, and timestamp. This is useful if you want to do your own analysis or spot outliers that the aggregated stats might hide.
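For example, you can recompute latency percentiles from the per-request records yourself. A minimal sketch (the request_latency_ms field name is an assumption for illustration; inspect one line of your own export for the actual keys):

```python
import io
import json
import statistics

def latency_percentiles(jsonl_file, field="request_latency_ms"):
    """Compute p50/p90/p99 from per-request JSONL records."""
    latencies = [json.loads(line)[field] for line in jsonl_file if line.strip()]
    cuts = statistics.quantiles(latencies, n=100)  # 99 cut points
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}

# Works the same on a real file opened with open("profile_export.jsonl");
# here we use synthetic records with latencies 1..100 ms:
fake = io.StringIO("\n".join(
    json.dumps({"request_latency_ms": float(i)}) for i in range(1, 101)
))
print(latency_percentiles(fake))
```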
We have created a sample notebook on GitHub that benchmarks openai/gpt-oss-20b deployed on an ml.g6.12xlarge instance (4× NVIDIA L40S GPUs), served via the vLLM container as an Inference Component. It simulates a realistic workload using synthetic prompts: 300 requests at 10 concurrent users, with ~500 input and ~150 output tokens per request, to measure how the model performs under that load.
Deploying a model from recommendations
After the AI Recommendation Job completes, the output is a SageMaker Model Package, a versioned resource that bundles all instance-specific deployment configurations into a single artifact.
To deploy, you first convert the Model Package into a deployable Model by calling CreateModel with the ModelPackageName and the InferenceSpecificationName for the instance you want to target, then create an endpoint configuration and deploy it as a standard SageMaker real-time endpoint or Inference Component.
- Pick the recommendation you want to deploy

import boto3

sm = boto3.client("sagemaker")

resp = sm.describe_ai_recommendation_job(
    AIRecommendationJobName="my-recommendation-job"
)
rec = resp["Recommendations"][0]

model_package_arn = rec["ModelDetails"]["ModelPackageArn"]
inference_spec_name = rec["ModelDetails"]["InferenceSpecificationName"]
instance_type = rec["InstanceDetails"][0]["InstanceType"]

print(f"Model Package  : {model_package_arn}")
print(f"Inference Spec : {inference_spec_name}")
print(f"Instance Type  : {instance_type}")

- Convert the Model Package into a deployable Model

sm.create_model(
    ModelName="oss20b-deployable-model",
    ModelPackageName=model_package_arn,
    InferenceSpecificationName=inference_spec_name,
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
)

- Create the endpoint configuration

sm.create_endpoint_config(
    EndpointConfigName="oss20b-endpoint-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "oss20b-deployable-model",
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
        }
    ],
)

- Deploy and wait

sm.create_endpoint(
    EndpointName="oss20b-endpoint",
    EndpointConfigName="oss20b-endpoint-config",
)
Alternatively, if you want to use Inference Components instead of a single-model endpoint, you can follow the notebook for details. This design means a single Recommendation Job produces one Model Package with multiple InferenceSpecifications, one per evaluated instance type, so you can pick the configuration that fits your latency, throughput, or cost objective and deploy it immediately without re-running the job.
Getting started
This capability is available today in seven AWS Regions: US East (N. Virginia), US West (Oregon), US East (Ohio), Asia Pacific (Tokyo), Europe (Ireland), Asia Pacific (Singapore), and Europe (Frankfurt). Access it through the SageMaker AI APIs.
Conclusion
In this post, we showed how optimized generative AI inference recommendations in Amazon SageMaker AI reduce deployment time from weeks to hours. With this capability, you can focus on building accurate models and the products that matter to your customers, not on infrastructure tuning. Every configuration is validated on real GPU infrastructure against your specific model and workload, so you can deploy with confidence and right-size with clarity.
To learn more, visit the SageMaker AI documentation and check out the sample notebooks on GitHub.
About the authors
Mona Mona
Mona Mona currently works as a Sr. AI/ML Specialist Solutions Architect at Amazon. She previously worked at Google as a Lead Generative AI Specialist. She is a published author of two books: Natural Language Processing with AWS AI Services: Derive strategic insights from unstructured data with Amazon Textract and Amazon Comprehend, and Google Cloud Certified Professional Machine Learning Study Guide. She has authored 19 blogs on AI/ML and cloud technology and co-authored a research paper on CORD-19 Neural Search that won the Best Research Paper award at the prestigious AAAI (Association for the Advancement of Artificial Intelligence) conference.
Vinay Arora
Vinay is a Specialist Solutions Architect for Generative AI at AWS, where he collaborates with customers to design cutting-edge AI solutions leveraging AWS technologies. Prior to AWS, Vinay spent over two decades in finance, including roles at banks and hedge funds, where he built risk models, trading systems, and market data platforms. Vinay holds a master's degree in computer science and business administration.
Lokeshwaran Ravi
Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.
Dmitry Soldatkin
Dmitry Soldatkin is a Worldwide Leader for Specialist Solutions Architecture, SageMaker Inference at AWS. He leads efforts to help customers design, build, and optimize GenAI and AI/ML solutions across the enterprise. His work spans a wide range of ML use cases, with a primary focus on generative AI, deep learning, and deploying ML at scale. He has partnered with companies across industries including financial services, insurance, and telecommunications. You can connect with Dmitry on LinkedIn.
Kareem Syed-Mohammed
Kareem Syed-Mohammed is a Product Manager at AWS. He focuses on enabling Gen AI model development and governance on SageMaker HyperPod. Prior to this, at Amazon QuickSight, he led embedded analytics and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, worked on Local Expert and Ads at Expedia, and was a management consultant at McKinsey.

