Agentic tool calling is what makes AI agents useful in production. It's how they query databases, trigger workflows, retrieve real-time data, and act on a user's behalf. But base models frequently hallucinate tools, pass bad parameters, and attempt actions when they should ask for clarification. These failures erode trust and block production deployment.
You can use serverless model customization in Amazon SageMaker AI to fix these problems without managing infrastructure. With Reinforcement Learning with Verifiable Rewards (RLVR), the model generates its own candidate responses, receives a reward signal indicating quality, and updates its behavior to favor what works. You pick a model, configure a technique, point to your data and reward function, and SageMaker AI handles the rest. In this post, we walk through how we fine-tuned Qwen 2.5 7B Instruct for tool calling using RLVR. We cover dataset preparation across three distinct agent behaviors, reward function design with tiered scoring, training configuration and results interpretation, evaluation on held-out data with unseen tools, and deployment. By the end, our fine-tuned model improved tool call reward by 57% over the base model on scenarios that it didn't see during training.
Because tool calling has a naturally verifiable objective (whether the model called the right function with the right parameters), it maps well to RLVR. The challenge with self-managed reinforcement learning (RL) is the operational overhead. GPU procurement, memory orchestration between rollout and training phases, reward infrastructure, and checkpointing add up quickly. Hyperparameter sensitivity adds another layer of complexity. SageMaker AI takes on that work so you can focus on your model, your data, and your reward function.
SageMaker AI supports model families including Amazon Nova, GPT-OSS, Llama, Qwen, and DeepSeek, with techniques including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), RLVR, and Reinforcement Learning from AI Feedback (RLAIF). Training and validation metrics are tracked through built-in MLflow.
Why RLVR for tool calling
SFT requires labeled examples of each behavior that you want the model to learn. For tool calling, that means examples of calling a tool, asking for clarification, and refusing. But tool calling also requires the model to decide between these behaviors, and SFT can struggle to generalize that decision-making beyond the specific patterns in its training data.
RLVR works differently. For each prompt, the model generates multiple candidate responses (we use eight). A reward function verifies which of them are correct. The model then updates its policy to favor what worked, using Group Relative Policy Optimization (GRPO). GRPO compares each candidate's reward score against the mean score of the group and reinforces responses that score above average. Over time, the model learns the format of a tool call and when to call versus when to ask.
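The group-relative scoring at the heart of GRPO can be illustrated with a short sketch. This is illustrative only; the actual training loop inside SageMaker AI is managed for you:

```python
def group_relative_advantages(rewards):
    """Compute each candidate's advantage as its reward minus the group mean.

    Candidates above the mean get a positive advantage and are reinforced;
    candidates below the mean get a negative advantage and are discouraged.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Eight candidates for one prompt: two perfect tool calls (1.0),
# three with a wrong parameter (0.5), and three wrong answers (0.0)
rewards = [1.0, 1.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)  # group mean is 0.4375
```

The perfect calls receive the largest positive advantage, the partial calls a small positive one, and the failures a negative one, which is why the tiered scoring described later gives GRPO a richer signal than a binary pass/fail.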
Prerequisites
To use serverless model customization in SageMaker AI, you must have the following prerequisites:
Fine-tune Qwen 2.5 7B Instruct in SageMaker AI
To get started, we open Amazon SageMaker AI Studio and choose Models in the left navigation pane to browse the foundation models (FMs) that are available for customization.
On the Customize model menu, select Qwen 2.5 7B Instruct, and choose Customize with UI. This opens the customization configuration page where you pick your technique, point to your training data and reward function, and configure hyperparameters. We selected Reinforcement Learning with Verifiable Rewards (RLVR) as our customization technique.
Prepare your training data
A tool calling dataset needs to teach more than correct API invocations. Production agents face three distinct situations:
- The user provides enough information, and the model should call a tool.
- The user's request is missing required parameters, and the model should ask for clarification.
- The request is harmful or out of scope, and the model should refuse.
We generated 1,500 synthetic training examples from our tool schemas (weather, flights, translation, currency conversion, statistics) using Kiro, the Amazon AI-powered IDE, to produce prompts with realistic variation in phrasing and specificity across the three behaviors. Here's an example of the prompt we used:
Generate 1,500 JSONL training examples for RLVR tool-calling
fine-tuning across five tool schemas: get_weather_forecast,
search_flights, translate_text, currency_convert, and
get_statistics.
Each line must follow this format:
{"prompt": [{"role": "system", "content": "…"}, {"role": "user", "content": "…"}], "reward_model": {"ground_truth": "…"}}
Distribute examples across three behaviors:
1. Execute (60%): User provides all required params → ground_truth is the tool call JSON
2. Clarify (25%): User is missing required params → ground_truth is a clarifying question
3. Refuse (15%): Request is harmful or out of scope → ground_truth is a polite refusal
Vary phrasing between formal, casual, and terse.
Output valid JSONL only, no commentary.
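Because the generated file must be strictly valid JSONL, it's worth validating each line before uploading. A minimal check against the schema above (the helper name is ours, not part of the SageMaker AI workflow):

```python
import json

def validate_example(line):
    """Validate one JSONL line against the expected RLVR training schema."""
    example = json.loads(line)  # raises ValueError if the line is not valid JSON
    assert set(example) == {"prompt", "reward_model"}, "unexpected top-level keys"
    roles = [message["role"] for message in example["prompt"]]
    assert "system" in roles and "user" in roles, "prompt needs system and user turns"
    assert "ground_truth" in example["reward_model"], "missing ground_truth"
    return example

line = ('{"prompt": [{"role": "system", "content": "You are a helpful assistant."}, '
        '{"role": "user", "content": "Get weather for San Francisco"}], '
        '"reward_model": {"ground_truth": "..."}}')
example = validate_example(line)
```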
This is a practical path for teams that don't yet have production logs to draw from. For organizations already running agentic workflows, real user prompts and tool calls from production will yield even higher-quality training data.
Each training example contains a prompt (a system instruction and user request) and a ground truth in the reward_model field that the reward function scores against. Here are examples of each behavior.
Execute when the user provides everything the tool needs:
{
  "prompt": [
    {"role": "system", "content": "You are a helpful assistant. When using tools, respond with: […]"},
    {"role": "user", "content": "Get weather for San Francisco"}
  ],
  "reward_model": {
    "ground_truth": "[{\"name\": \"get_weather_forecast\", \"arguments\": {\"city\": \"San Francisco\"}}]"
  }
}
Clarify when a required parameter is missing:
{
  "prompt": [
    {"role": "system", "content": "You are a helpful assistant. When using tools, respond with: […]"},
    {"role": "user", "content": "Get the weather"}
  ],
  "reward_model": {
    "ground_truth": "To provide you with the weather information, could you please specify the location?"
  }
}
Execute with multiple parameters:
{
  "prompt": [
    {"role": "system", "content": "You are a helpful assistant. When using tools, respond with: […]"},
    {"role": "user", "content": "Convert 50 EUR to USD"}
  ],
  "reward_model": {
    "ground_truth": "[{\"name\": \"currency_convert\", \"arguments\": {\"amount\": 50, \"from\": \"EUR\", \"to\": \"USD\"}}]"
  }
}
Notice the difference between "Get weather for San Francisco" (tool call) and "Get the weather" (clarification). That's the kind of distinction GRPO learns well. For each prompt, the model generates eight candidates, the reward function scores them, and the scores are averaged across the group. Candidates above the mean get reinforced, and over time the model picks up when to call and when to ask.
Define your reward function
The reward function defines what correct means for our use case. We write it as a Python function that receives the model's response and the ground truth from the training data and returns a numerical score. Ours extracts tool calls from the model's response, parses them as JSON, and compares against the ground truth.
The full function handles response extraction, flexible parsing for different formats during early training, and edge cases around JSON type mismatches. Here is the core scoring logic:
# After extracting and parsing tool calls from model response and ground truth:
# Compare tool names
pred_names = {tool.get('name', '') for tool in pred_tools}
gt_names = {tool.get('name', '') for tool in gt_tools}
if pred_names == gt_names:
    # Right function(s) - check whether arguments also match
    perfect_match = True
    for pred_tool in pred_tools:
        for gt_tool in gt_tools:
            if pred_tool.get('name') == gt_tool.get('name'):
                if pred_tool.get('arguments') != gt_tool.get('arguments'):
                    perfect_match = False
    score = 1.0 if perfect_match else 0.5
elif pred_names & gt_names:
    # Partial overlap in function names
    score = 0.5
else:
    # Wrong function entirely
    score = 0.0
The three tiers (1.0, 0.5, and 0.0) give GRPO a richer learning signal. If some of the eight candidates get the function right but miss a parameter, the 0.5 score distinguishes them from completely wrong answers. This helps the model recognize that it's on the right track.
For clarification and refusal cases where the ground truth is natural language (no TOOLCALL tags), the reward function checks whether the model also avoided calling a tool. An unnecessary API call when the model should have asked a question earns 0.0.
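Putting the pieces together, a simplified end-to-end version of the reward function might look like the following. The TOOLCALL extraction regex and helper names are our assumptions; the post's full function also handles additional parsing edge cases:

```python
import json
import re

# Assumed tag format: the model wraps tool calls as "TOOLCALL: [...]"
TOOLCALL_RE = re.compile(r"TOOLCALL:\s*(\[.*\])", re.DOTALL)

def extract_tool_calls(text):
    """Return the parsed tool-call list from a response, or None if absent."""
    match = TOOLCALL_RE.search(text)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None

def compute_reward(response, ground_truth):
    """Tiered score: 1.0 perfect, 0.5 partial, 0.0 wrong."""
    try:
        gt_tools = json.loads(ground_truth)
        if not isinstance(gt_tools, list):
            gt_tools = None
    except json.JSONDecodeError:
        gt_tools = None  # natural-language ground truth (clarify/refuse case)

    pred_tools = extract_tool_calls(response)
    if gt_tools is None:
        # Clarification/refusal: reward avoiding an unnecessary tool call
        return 1.0 if pred_tools is None else 0.0
    if not pred_tools:
        return 0.0

    pred_names = {tool.get("name", "") for tool in pred_tools}
    gt_names = {tool.get("name", "") for tool in gt_tools}
    if pred_names == gt_names:
        perfect = all(
            pred.get("arguments") == gt.get("arguments")
            for pred in pred_tools
            for gt in gt_tools
            if pred.get("name") == gt.get("name")
        )
        return 1.0 if perfect else 0.5
    return 0.5 if pred_names & gt_names else 0.0
```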
Configure and launch training
On the customization configuration page, we point to our training dataset and reward function, then set our hyperparameters. We use a batch size of 128, a learning rate of 5e-6, 3 epochs, and eight rollouts per prompt.
The rollouts setting is the core GRPO mechanism. For each training prompt, the model generates eight different responses, the reward function scores each one, and responses that score above the group average get reinforced. Training and validation metrics are logged to MLflow. In this example, training takes roughly 40 minutes.
Training results
Train Reward Statistics (top left) is the chart to focus on. The mean reward across the rollouts started around 0.28 and climbed to 0.65–0.68 over 30 steps, more than doubling. The steepest gains happen in the first 10 steps as the model learns the basic tool calling format and decision structure. The curve then flattens after step 20 as the model converges.
The other charts confirm healthy training:
- Policy Entropy decreases, meaning the model is getting more confident rather than guessing.
- Gradient Norm stabilizes, meaning updates are getting smaller and more refined.
- Mean Advantage Estimate converges toward zero, indicating that the model's policy is stabilizing and the average response quality is aligning with the reward baseline.
Evaluate the fine-tuned model
After the training job is complete, you can see the models that you created on the My Models tab. To expand the details, choose View details on one of your models.
You can choose Continue customization to iterate further by adjusting hyperparameters or training with a different technique. Choose Evaluate to compare your customized model against the base model.
We evaluate on a separate test set of 300 examples that were excluded from training. The evaluation dataset covers the same three behaviors but includes tools, phrasings, and scenarios that the model hasn't seen. It tests search_restaurants, get_stock_price, and calculate_standard_deviation, none of which appeared during training. It also includes refusal cases for harmful requests like generating violent content or creating malware, testing whether the model generalizes safe behavior to new threats.
The evaluation runs standard NLP metrics alongside our custom reward function against the held-out set.
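To build intuition for how these metrics differ, exact match and a token-level F1 can be sketched as follows. This is an illustrative approximation, not the exact implementation the evaluation uses:

```python
def exact_match(prediction, reference):
    """Score 1.0 only when the strings match character for character (after trimming)."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall; forgives formatting differences."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    remaining = list(ref_tokens)
    common = 0
    for token in pred_tokens:
        if token in remaining:  # count each reference token at most once
            remaining.remove(token)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A response with the right function name and values but different whitespace scores 0.0 on exact match while still scoring high on token F1, which is the gap the results below reflect.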
Tool Call Reward is our custom metric and the most direct measure of what we trained for. It jumped from 0.35 to 0.55, a 57% improvement. In practical terms, that means the fine-tuned model makes the correct tool calling decision significantly more often. It calls the right function with the right parameters when it should, asks for clarification when information is missing, and refuses when appropriate.
F1 Score Quasi, Rouge1, and RougeL all improved by 14–19 percentage points, reflecting better generation of correct function names, parameter keys, and values across the board.
Exact Match doubled from 11% to 21%. This metric requires character-for-character output matching, so even small formatting variations count as a miss. The 21% exact match alongside 64.6% F1 suggests that the model often gets the semantics right even when surface formatting differs slightly.
These results are all on held-out data with tools and scenarios that the model didn't see during training. The improvements carry over to new tools, which suggests the model picked up general tool calling patterns rather than memorizing the training set.
Here are responses from the fine-tuned model on prompts that it didn't see during training:
Tool execution:
> Prompt: "What's the stock price of AMZN?"
>
> Model response: [{"name": "get_stock_price", "arguments": {"ticker": "AMZN"}}]
Clarification:
> Prompt: "Find me a restaurant nearby"
>
> Model response: "I'd be happy to help you find a restaurant! Could you please provide the cuisine type or location you're interested in?"
Near-miss:
> Prompt: "What's the standard deviation of 10, 20, 30, 40, 50?"
>
> Model response: [{"name": "calculate_standard_deviation", "arguments": {"numbers": "10, 20, 30, 40, 50"}}]
In the near-miss case, the model selected the correct tool but passed the numbers as a string instead of an array. This earns a 0.5 reward score (right function, wrong parameter format) and represents the kind of error you'd target in the next iteration through additional training data or reward function refinement.
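One possible refinement (our suggestion, not part of the trained reward function) is to score argument type mismatches below other partial matches, so candidates like the string-instead-of-array response above get a distinct, lower signal:

```python
def argument_score(pred_args, gt_args):
    """Score the arguments of a matched function name.

    1.0:  values match exactly
    0.25: a value has the wrong type, e.g. "10, 20, 30" where the
          ground truth has [10, 20, 30]
    0.5:  other mismatches (wrong value, missing key)
    """
    if pred_args == gt_args:
        return 1.0
    for key, gt_value in gt_args.items():
        if key in pred_args and type(pred_args[key]) is not type(gt_value):
            return 0.25
    return 0.5
```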
Deploy the fine-tuned model
With evaluation confirming improvement, deploy the fine-tuned model directly from the model details page. Choose Deploy, and select your deployment target: either a SageMaker AI endpoint or Amazon Bedrock. You can also download the model weights from Amazon S3 for self-managed deployment.
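If you deploy to a SageMaker AI endpoint, you can invoke it with the SageMaker runtime client. A minimal sketch, where the endpoint name and request schema are assumptions; check the payload format your deployed container expects:

```python
import json

def build_chat_payload(user_message, system_prompt="You are a helpful assistant."):
    """Build a chat-style request body (this schema is an assumption)."""
    return json.dumps({
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": 256,
    })

def invoke_endpoint(endpoint_name, payload, region="us-east-1"):
    """Call a deployed SageMaker AI endpoint (requires AWS credentials)."""
    import boto3

    client = boto3.client("sagemaker-runtime", region_name=region)
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=payload,
    )
    return response["Body"].read().decode("utf-8")

payload = build_chat_payload("What's the stock price of AMZN?")
# invoke_endpoint("my-qwen-tool-calling-endpoint", payload)  # hypothetical endpoint name
```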
Conclusion
In this post, we fine-tuned Qwen 2.5 7B Instruct for agentic tool calling using RLVR and GRPO through serverless model customization in Amazon SageMaker AI. We prepared a dataset spanning three tool-calling behaviors (execute, clarify, refuse), defined a tiered reward function, trained the model in about 40 minutes, evaluated on held-out data with unseen tools and scenarios, and deployed. The fine-tuned model improved tool call reward by 57% over the base model.
To push accuracy further, you can expand your training data with additional tools, edge cases, and multi-turn conversations to cover more of the scenarios that your agents encounter in production. You can also refine your reward function to penalize specific failure modes, like the string-vs-array parameter issue shown in the previous section, or add partial credit for other near-miss patterns. If you're running agentic workflows, your production logs are a high-quality source of training data that can make the model even more effective for your specific use case. Beyond tool calling, RLVR applies to other reasoning tasks where correctness is verifiable, such as multi-step planning, structured data extraction, or code generation.
While this post walks through the UI workflow, an SDK for programmatic access is also available. To learn more, see the SageMaker AI model customization documentation.
To get started, try serverless model customization in Amazon SageMaker AI with your own use cases.
About the authors
Lauren Mullennex
Lauren is a Senior GenAI/ML Specialist Solutions Architect at AWS. She has over a decade of experience in ML, DevOps, and infrastructure. She is a published author of a book on computer vision. Outside of work, you can find her traveling and hiking with her two dogs.
Eric Saleh
Eric is a Senior GenAI Specialist at AWS, focusing on foundation model training and inference. He is partnering with top foundation model builders and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions with strategic customers. Before joining AWS, Eric led product teams building enterprise AI/ML solutions, which included frontier GenAI services for fine-tuning, RAG, and managed inference. He holds a master's degree in Business Analytics from UCLA Anderson.
Surya Kari
Surya is a Senior Generative AI Data Scientist at AWS, specializing in developing solutions leveraging state-of-the-art foundation models. He has extensive experience working with advanced language models including DeepSeek-R1, the Llama family, and Qwen, focusing on their fine-tuning and optimization for specific scientific applications. His expertise extends to implementing efficient training pipelines and deployment strategies using AWS SageMaker, enabling the scaling of foundation models from development to production. He collaborates with customers to design and implement generative AI solutions, helping them navigate model selection, fine-tuning approaches, and deployment strategies to achieve optimal performance for their specific use cases.

