Training large language models requires accurate feedback signals, but traditional reinforcement learning (RL) often struggles with reward signal reliability. The quality of these signals directly influences how models learn and make decisions. However, creating robust feedback mechanisms can be complex and error prone. Real-world training scenarios often introduce hidden biases, unintended incentives, and ambiguous success criteria that can derail the learning process, leading to models that behave unpredictably or fail to meet desired objectives.
In this post, you'll learn how to implement reinforcement learning with verifiable rewards (RLVR) to introduce verification and transparency into reward signals and improve training performance. This approach works best when outputs can be objectively verified for correctness, such as in mathematical reasoning, code generation, or symbolic manipulation tasks. You will also learn how to layer techniques like Group Relative Policy Optimization (GRPO) and few-shot examples to further improve results. You'll use the GSM8K dataset (Grade School Math 8K: a set of grade school math problems) to improve math problem-solving accuracy, but the techniques used here can be adapted to a wide variety of other use cases.
Technical overview
Before diving into implementation, it's helpful to understand the RL concepts that underpin this approach. RL addresses challenges in model training by establishing a structured feedback system through reward signals. This paradigm allows models to learn through interaction, receiving feedback that guides them toward optimal behavior. RL provides a framework for models to iteratively improve their responses based on clearly defined signals about the quality of their outputs, making it highly effective for training models that interact with users and must adapt their behavior based on outcomes. Traditional RL has highlighted an important consideration: the quality of the reward signal matters significantly. When reward functions are imprecise or incomplete, models can engage in "reward hacking," finding unintended ways to maximize scores without achieving the desired behavior. Recognizing this limitation has led to the development of more rigorous approaches that focus on creating reliable, well-defined reward functions.
RLVR addresses reward hacking through rule-based feedback defined by the model tuner. It uses programmatic reward functions that automatically score outputs against specific criteria, enabling rapid iteration without the bottleneck of collecting human ratings. These "verifiable" rewards come from objective, reproducible rules, making RLVR ideal for evolving requirements because it learns general optimization strategies and adapts quickly to new scenarios. GRPO is a reinforcement learning algorithm that improves AI model learning by comparing performance within groups rather than across all data at once. It organizes training data into meaningful groups and optimizes performance relative to each group's baseline, giving appropriate attention to each category. This group-aware optimization reduces training variance, accelerates convergence, and can produce models that perform consistently across various categories. Combining RLVR with GRPO creates a framework where automated rewards guide learning while group-relative optimization helps drive balanced performance.
You define reward functions for different task aspects, and GRPO treats these as distinct groups during training, facilitating simultaneous improvement across dimensions. This combination delivers rapid adaptation and robust performance, ideal for dynamic environments requiring generalization beyond the training distribution. Adding few-shot learning enhances this framework in three ways. First, few-shot examples provide templates that show the model what good outputs look like, narrowing the search space for exploration. Second, GRPO leverages these examples by generating multiple candidate responses per prompt and learning from their relative performance within each group. Third, verifiable rewards immediately confirm which approaches succeed. This combination accelerates learning: the model starts with concrete examples of the desired format, explores variations efficiently through group-based comparison, and receives definitive feedback on correctness.
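To make the group-based comparison concrete, the following is a minimal sketch of how the rewards of several completions sampled for one prompt can be turned into group-relative advantages by standardizing against the group's mean and standard deviation, as GRPO is commonly described. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are illustrative assumptions, not the training library's exact implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Standardize the rewards of completions sampled for ONE prompt.

    Completions that beat the group average get a positive advantage,
    completions below it get a negative one. `eps` (an assumption) avoids
    division by zero when every completion earns the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical completions for the same prompt, scored on a 0.0-1.5 scale
rewards = [1.5, 0.5, 0.5, 0.0]
advantages = group_relative_advantages(rewards)

# A group where every completion scores the same carries no learning signal
flat = group_relative_advantages([1.0, 1.0])
```

Because the baseline is computed per group, no separate value model is needed to estimate it in this sketch.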
Solution overview
In this section, you'll walk through how to fine-tune a Qwen2.5-0.5B model on SageMaker AI using Amazon SageMaker Training Jobs. Amazon SageMaker Training jobs support distributed multi-GPU and multi-node configurations, so you can spin up high-performance clusters on demand, train billion-parameter models faster, and automatically shut down resources when the job finishes.
Note: While Qwen2.5-0.5B was chosen for this use case, others like code generation would require a larger model (for example, Qwen2.5-Coder-7B) and therefore larger training instances.
Prerequisites
To run the example from this post on Amazon SageMaker AI, you must fulfill the following prerequisites:
Environment setup
You can use your preferred IDE, such as VS Code or PyCharm, but make sure your local environment is configured to work with AWS, as discussed in the prerequisites.
To use SageMaker Studio JupyterLab spaces, complete the following steps:
- On the Amazon SageMaker AI console, choose Domains in the navigation pane, then open your domain.
- In the navigation pane under Applications and IDEs, choose Studio.
- On the User profiles tab, locate your user profile, then choose Launch and Studio.
- In Amazon SageMaker Studio, launch an ml.t3.medium JupyterLab notebook instance with at least 50 GB of storage.
A large notebook instance isn't required, because the fine-tuning job will run on a separate ephemeral training instance with GPU acceleration.
- To begin fine-tuning, start by cloning the GitHub repo and navigating to the 3_distributed_training/reinforcement-learning/grpo-with-verifiable-reward directory, then launch the model-finetuning-grpo-rlvr.ipynb notebook with a Python 3.12 or higher kernel.
Prepare the dataset for fine-tuning
Running GRPO with RLVR requires you to have the final answer to each question to calculate the reward. First, prepare the data by extracting the final answer for each question.
dataset = GSM8K(split="train", include_answer=False, include_reasoning=True, few_shot=True, num_shots=8, seed=None, cot=True).dataset.shuffle(seed=42)
Dataset({
    features: ['question', 'answer', 'prompt', 'final_answer'],
    num_rows: 7473
})
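The extraction step can be sketched as follows, assuming the raw GSM8K `answer` field ends with a line of the form `#### <number>` (the convention used by the public dataset). The helper name `extract_final_answer` is hypothetical, and the example string is only illustrative; the GSM8K wrapper class in the repo may implement this differently.

```python
import re


def extract_final_answer(answer_text):
    """Return the number after the '####' marker as a string, or None."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", answer_text)
    if match is None:
        return None
    # Drop thousands separators so the value can be parsed with float()
    return match.group(1).replace(",", "")


# Illustrative answer text in the GSM8K style: reasoning, then '#### <number>'
example = (
    "Natalia sold 48/2 = 24 clips in May.\n"
    "Natalia sold 48+24 = 72 clips altogether.\n"
    "#### 72"
)
final_answer = extract_final_answer(example)
```

Storing the result in a `final_answer` column keeps the ground truth next to each prompt, which is what the correctness reward needs at training time.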
In addition, this example uses few-shot examples (8 shots) to improve model training performance. For more information on few-shot examples in reinforcement learning, refer to the paper "Reinforcement Learning for Reasoning in Large Language Models with One Training Example". While the research paper focuses on single-shot examples, this post will show you both single and multi-shot performance.
Each input will contain 8 examples, followed by the problem to be solved:
"Question: Mark has $50 and buys a toy that costs $35. How much money does he have left?
Solution: Let's think step-by-step. To find out how much money Mark has left, subtract the cost of the toy from the total amount of money Mark has. So, $50 - $35 = $15.
#### The final answer is 15
Question: Emily has 3 times as many pencils as Alice. If Alice has 15 pencils, how many pencils does Emily have?
Solution: Let's think step-by-step. To find out how many pencils Emily has, we multiply the number of pencils Alice has by 3. Alice has 15 pencils, so Emily has 15 * 3 = 45 pencils.
#### The final answer is 45
Question: Jack has collected 12 more marbles than Kevin. If Kevin has 27 marbles, how many marbles does Jack have?
Solution: Let's think step-by-step. To find how many marbles Jack has, we add 12 to the number of marbles Kevin has. So, Jack has 27 + 12 = 39 marbles.
#### The final answer is 39
Question: There are 24 students in a classroom. If each group must have 4 students, how many groups can be formed?
Solution: Let's think step-by-step. To find how many groups can be formed, we divide the number of students by the number of students per group. So, 24 / 4 = 6 groups can be formed.
#### The final answer is 6
Question: Samantha baked 40 cookies and wants to divide them equally into bags, with each bag containing 5 cookies. How many bags will Samantha need?
Solution: Let's think step-by-step. To find the number of bags needed, divide the total number of cookies by the number of cookies per bag. Thus, 40 divided by 5 equals 8.
#### The final answer is 8
Question: A pack of pencils costs $4. If you buy 7 packs, how much will you spend in total?
Solution: Let's think step-by-step. The total cost is found by multiplying the cost per pack by the number of packs. Hence, you spend 7 * $4 = $28.
#### The final answer is 28
Question: A book has 240 pages, and Sarah reads 20 pages each day. How many days will it take her to finish the book?
Solution: Let's think step-by-step. Sarah reads 20 pages per day, so we divide the total pages by the number of pages she reads per day. Therefore, it takes her 240 / 20 = 12 days to finish the book.
#### The final answer is 12
Question: A farmer has a total of 80 apples and oranges. If he has 30 apples, how many oranges does he have?
Solution: Let's think step-by-step. To determine the number of oranges, we subtract the number of apples from the total number of fruits. So, the number of oranges is 80 - 30 = 50.
#### The final answer is 50
Question: Mimi picked up 2 dozen seashells on the beach. Kyle found twice as many shells as Mimi and put them in his pocket. Leigh grabbed one-third of the shells that Kyle found. How many seashells did Leigh have?
Solution: Let's think step-by-step."
After the data has been prepared, keep 10 percent of the data as a validation set and push both the training and validation sets to S3.
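A 90/10 split of this kind can be sketched with the standard library alone. The helper name `split_train_val`, the seed, and the toy records below are assumptions for illustration; the notebook may instead rely on the dataset library's own splitting utilities, and the S3 upload is deliberately left out of this sketch.

```python
import json
import random


def split_train_val(records, val_fraction=0.1, seed=42):
    """Shuffle deterministically, then hold out `val_fraction` for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Toy stand-in for the prepared dataset rows (hypothetical content)
records = [{"prompt": f"Question {i}", "final_answer": str(i)} for i in range(100)]
train_records, val_records = split_train_val(records)

# Serialize each split; these strings could be written to dataset.json files
train_json = json.dumps(train_records)
val_json = json.dumps(val_records)
```

Fixing the seed keeps the validation set stable across runs, so accuracy numbers from different experiments remain comparable.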
The Verifiable Reward Function
This GRPO implementation for mathematical reasoning employs a dual-reward system that provides objective, verifiable feedback during training. This approach leverages the inherent verifiability of mathematical problems to create reliable training signals without requiring human annotation or subjective evaluation. You'll implement two complementary reward functions that work together to guide the model toward both correct response formatting and mathematical accuracy of the result:
Format Reward Function
This function helps verify that the model learns to structure its responses correctly by:
- Pattern Matching: Searches for the exact format #### The final answer is [number]
- Consistent Scoring: Awards 0.5 points for correct formatting, 0.0 for incorrect format
- Training Signal: Encourages the model to follow the expected answer structure
# Format reward function
import re

def format_reward_func_qa(completions, **kwargs):
    # The marker must start on a new line and be followed by at least one digit
    pattern = r"\n#### The final answer is \d+"
    completion_contents = [completion for completion in completions]
    matches = [re.search(pattern, content) for content in completion_contents]
    return [0.5 if match else 0.0 for match in matches]
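As a quick illustration of the scoring behavior, the sketch below restates the format reward in condensed form (with the regex backslashes as reconstructed here) and applies it to three hypothetical completions. Note that the pattern requires a newline before the `####` marker, so a completion that starts directly with the marker earns no format reward.

```python
import re


def format_reward_func_qa(completions, **kwargs):
    # Newline, the fixed marker text, then at least one digit
    pattern = r"\n#### The final answer is \d+"
    return [0.5 if re.search(pattern, c) else 0.0 for c in completions]


completions = [
    "So, 24 / 4 = 6 groups can be formed.\n#### The final answer is 6",  # expected format
    "The answer is 6",                                                    # marker missing
    "#### The final answer is 6",                                         # no preceding newline
]
rewards = format_reward_func_qa(completions)
```

Only the first completion matches, which shows how strictly the pattern ties the reward to the few-shot template.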
Correctness Reward Function
This function provides the core mathematical verification by:
- Answer Extraction: Uses regex to extract numerical answers from formatted responses
- Normalization: Removes common formatting characters (commas, currency symbols, units)
- Precision Comparison: Uses a tolerance of 1e-3 to handle floating-point precision
- Binary Scoring: Awards 1.0 for correct answers, 0.0 for incorrect ones
# Correctness reward function
import re

def correctness_reward_func_qa(completions, final_answer, **kwargs):
    rewards = []
    for completion, ground_truth in zip(completions, final_answer):
        try:
            # Capture the first number (with optional thousands separators
            # and decimal part) that follows the '####' marker
            match = re.search(r'####.*?([\d,]+(?:\.\d+)?)', completion)
            if match:
                answer = match.group(1)
                for remove_char in [',', '$', '%', 'g']:
                    answer = answer.replace(remove_char, '')
                if abs(float(answer) - float(ground_truth)) < 1e-3:
                    rewards.append(1.0)
                else:
                    rewards.append(0.0)
            else:
                rewards.append(0.0)
        except ValueError:
            # The extracted text could not be parsed as a number
            rewards.append(0.0)
    return rewards
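The following sketch exercises the correctness reward on a few hypothetical completions. It is a condensed restatement of the function above (same regex and tolerance, shorter control flow), not a replacement for it, and the sample strings and ground-truth values are made up for illustration.

```python
import re


def correctness_reward_func_qa(completions, final_answer, **kwargs):
    rewards = []
    for completion, ground_truth in zip(completions, final_answer):
        match = re.search(r"####.*?([\d,]+(?:\.\d+)?)", completion)
        try:
            answer = match.group(1).replace(",", "") if match else None
            correct = (
                answer is not None
                and abs(float(answer) - float(ground_truth)) < 1e-3
            )
        except ValueError:
            correct = False  # extracted text was not a parseable number
        rewards.append(1.0 if correct else 0.0)
    return rewards


completions = [
    "...\n#### The final answer is 1,250",  # thousands separator is normalized
    "...\n#### The final answer is 39.0",   # equal within the 1e-3 tolerance
    "...\n#### The final answer is 40",     # wrong value
    "I think it is 39",                     # right value, but no '####' marker
]
final_answer = ["1250", "39", "39", "39"]
rewards = correctness_reward_func_qa(completions, final_answer)
```

The last case shows that the two rewards are coupled: a correct number without the `####` marker earns neither the format reward nor the correctness reward.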
Integrating RLVR with GRPO
The reward functions are integrated into the GRPO training pipeline through the GRPOTrainer:
rewards_funcs = [format_reward_func_qa, correctness_reward_func_qa]
trainer = GRPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
    reward_funcs=rewards_funcs,
)
During training, GRPO uses these reward functions to compute policy gradients. First, the model generates multiple completions for each mathematical problem. Next, the reward for each response is computed with both reward functions. The format reward function grants up to 0.5 for correct response structure, and the correctness reward function grants up to 1.0 for the mathematical accuracy of the answer, for a maximum combined reward of 1.5 per completion. Then GRPO compares the completions within groups to identify the best responses. Finally, in the policy update step, the loss function uses reward differences to update model parameters. Higher-rewarded completions increase their probability, while lower-rewarded completions decrease their probability. This relative ranking drives the optimization process.
The following example demonstrates how to fine-tune Qwen2.5-0.5B. The recipe is available in the scripts folder, allowing you to customize it or change the base model. Here you'll use GRPO with verifiable rewards using Quantized Low-Rank Adaptation (QLoRA). QLoRA is used here as a way to reduce training resource requirements and speed up the training process, with a small trade-off in accuracy.
# Model arguments
model_name_or_path: Qwen/Qwen2.5-0.5B
tokenizer_name_or_path: Qwen/Qwen2.5-0.5B
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2
bf16: true
tf32: true
output_dir: /opt/ml/model/Qwen2.5-0.5B-RL-VR-GRPO
# Dataset arguments
train_dataset_id_or_path: /opt/ml/input/data/train/dataset.json
test_dataset_id_or_path: /opt/ml/input/data/val/dataset.json
dataset_splits: 'train'
max_seq_length: 2048
packing: true
# LoRA arguments
use_peft: true
load_in_4bit: true
lora_target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj", "gate_proj"]
lora_modules_to_save: ["lm_head", "embed_tokens"]
lora_r: 16
lora_alpha: 16
# Training arguments
num_train_epochs: 2
per_device_train_batch_size: 16
gradient_accumulation_steps: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: True
learning_rate: 1.84e-4
lr_scheduler_type: cosine
warmup_ratio: 0.1
# Logging arguments
logging_strategy: steps
logging_steps: 5
report_to:
  - mlflow
save_strategy: "no"
seed: 42
Recipe overview
This recipe implements Group Relative Policy Optimization (GRPO) with verifiable rewards for fine-tuning the Qwen2.5-0.5B model on mathematical reasoning tasks. The recipe uses a dual-reward system that objectively evaluates both answer formatting and mathematical correctness without requiring human annotation.
Key hyperparameters:
- learning_rate: 1.84e-4 – Learning rate optimized for GRPO training
- num_train_epochs: 2 – Training epochs to avoid overfitting
- per_device_train_batch_size: 16 with gradient_accumulation_steps: 2 – Effective batch size of 32 per device
- max_seq_length: 2048 – Context window for 8-shot prompting
- lora_r: 16 and lora_alpha: 16 – LoRA rank and scaling parameters
- warmup_ratio: 0.1 with cosine scheduler – Learning rate scheduling
- lora_target_modules – Targets attention and MLP layers for adaptation
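To illustrate how the batch-size and scheduling settings interact, here is a minimal sketch of a linear-warmup, cosine-decay learning rate schedule using the recipe's values. The function `lr_at_step` and the step count of 100 are assumptions for illustration; the training library's own scheduler may round the warmup steps differently.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1.84e-4, warmup_ratio=0.1):
    """Linear warmup to `peak_lr`, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Samples contributing to one optimizer update on a single device
per_device_train_batch_size = 16
gradient_accumulation_steps = 2
effective_batch_per_device = per_device_train_batch_size * gradient_accumulation_steps

total_steps = 100  # hypothetical number of optimizer steps
schedule = [lr_at_step(s, total_steps) for s in range(total_steps + 1)]
```

In this sketch the learning rate climbs for the first 10 percent of steps, reaches 1.84e-4, and then decays smoothly, which tends to keep early policy updates small while the reward statistics are still noisy.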
As a next step, you'll use a SageMaker AI training job to spin up a training cluster and run the model fine-tuning. The SageMaker AI ModelTrainer runs training jobs on fully managed infrastructure, handling environment setup, scaling, and artifact management. It also allows you to specify training scripts, input data, and compute resources without manually provisioning servers. Library dependencies can be managed through the requirements.txt file in the scripts folder. ModelTrainer will automatically detect this file and install the listed dependencies at runtime.
First, set up your environment. Here you'll specify the instance type and number of instances for training and the location of the training container.
from sagemaker.core import image_uris
from sagemaker.core.helper.session_helper import Session

sagemaker_session = Session()
bucket_name = sagemaker_session.default_bucket()
default_prefix = sagemaker_session.default_bucket_prefix
configs = load_sagemaker_config()

instance_type = "ml.g6.48xlarge"
instance_count = 1
config_filename = "Qwen2.5-0.5B.yaml"

# Look up the managed PyTorch training container for the current Region
image_uri = image_uris.retrieve(
    framework="pytorch",
    region=sagemaker_session.boto_session.region_name,
    version="2.7.1",
    instance_type=instance_type,
    image_scope="training"
)
Next, configure the environment variables, code locations, and data paths:
import os

from sagemaker.train.configs import (
    CheckpointConfig,
    Compute,
    OutputDataConfig,
    SourceCode,
    StoppingCondition,
)
from sagemaker.train.distributed import Torchrun
from sagemaker.train.model_trainer import ModelTrainer

env = {}
env["FI_PROVIDER"] = "efa"
env["NCCL_PROTO"] = "simple"
env["NCCL_SOCKET_IFNAME"] = "eth0"
env["NCCL_IB_DISABLE"] = "1"
env["NCCL_DEBUG"] = "WARN"
env["HF_token"] = os.environ['hf_token']
env["CONFIG_PATH"] = f"recipes/{config_filename}"
env["MLFLOW_EXPERIMENT_NAME"] = "grpo-rlvr"
env["MLFLOW_TAGS"] = '{"source.job": "sm-training-jobs", "source.type": "grpo-rlvr", "source.framework": "pytorch"}'
env["MLFLOW_TRACKING_URI"] = MLFLOW_TRACKING_SERVER_ARN

# Define the script to be run
source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    entry_script="run_finetuning.sh",
)

# Define the compute
compute_configs = Compute(
    instance_type=instance_type,
    instance_count=instance_count,
    keep_alive_period_in_seconds=3600,
)

# Define the training job name
job_name = f"train-{config_filename.split('/')[-1].replace('.', '-').replace('yaml', 'rlvr')}"

# Define the OutputDataConfig path
output_path = f"s3://{bucket_name}/{job_name}"

# Define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=image_uri,
    environment=env,
    source_code=source_code,
    base_job_name=job_name,
    compute=compute_configs,
    stopping_condition=StoppingCondition(max_runtime_in_seconds=18000),
    output_data_config=OutputDataConfig(s3_output_path=output_path),
    checkpoint_config=CheckpointConfig(
        s3_uri=output_path + "/checkpoint", local_path="/opt/ml/checkpoints"
    ),
)
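The job name expression in the configuration above is dense, so the following sketch evaluates it step by step for the recipe file used in this post. The bucket name is a hypothetical placeholder, and the regular expression reflects the naming rule for SageMaker training jobs as understood here (alphanumeric characters and hyphens, up to 63 characters).

```python
import re

config_filename = "Qwen2.5-0.5B.yaml"

# Dots are not allowed in training job names, so they become hyphens,
# and the 'yaml' extension is swapped for an 'rlvr' suffix
job_name = f"train-{config_filename.split('/')[-1].replace('.', '-').replace('yaml', 'rlvr')}"

bucket_name = "my-example-bucket"  # hypothetical placeholder
output_path = f"s3://{bucket_name}/{job_name}"
checkpoint_uri = output_path + "/checkpoint"

# Sanity check against the assumed job-name pattern
is_valid_name = re.fullmatch(r"[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}", job_name) is not None
```

Deriving the name from the recipe file keeps the S3 output and checkpoint prefixes aligned with the configuration that produced them.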
Set up the channels for training and validation data:
from sagemaker.train.configs import InputData

# Pass the input data
train_input = InputData(
    channel_name="train",
    data_source=train_dataset_s3_path,  # S3 path where training data is stored
)
val_input = InputData(
    channel_name="val",
    data_source=val_dataset_s3_path,  # S3 path where validation data is stored
)

# Check input channels configured
data = [train_input, val_input]
Then begin training:
model_trainer.train(input_data_config=data)
The following is the directory structure for the source code of this example:
scripts/
├── accelerate_configs/   # Accelerate configuration files
├── run_finetuning.sh     # Launch script for distributed training with Accelerate on SageMaker training jobs
├── run_grpo.py           # Main training script for GRPO
├── utils/                # Utilities to load data and create prompts
├── recipes/              # Predefined training configuration recipes (YAML)
└── requirements.txt      # Python dependencies installed at runtime
To fine-tune across multiple GPUs, the example training script uses Hugging Face Accelerate and DeepSpeed ZeRO-3, which work together to train large models more efficiently. Hugging Face Accelerate simplifies launching distributed training by automatically handling device placement, process management, and mixed precision settings. DeepSpeed ZeRO-3 reduces memory usage by partitioning optimizer states, gradients, and parameters across GPUs, allowing billion-parameter models to fit and train faster.
You can run your GRPO trainer script with Hugging Face Accelerate using a simple command like the following:
NUM_GPUS=$(nvidia-smi --list-gpus | wc -l)
echo "Detected ${NUM_GPUS} GPUs on the machine"
# Launch fine-tuning with Accelerate + DeepSpeed (ZeRO-3)
accelerate launch \
    --config_file accelerate_configs/deepspeed_zero3.yaml \
    --num_processes ${NUM_GPUS} \
    run_grpo.py \
    --config $CONFIG_PATH
Results
After evaluating the models on 100 test samples, the 8-shot GRPO-trained model achieved 41% accuracy compared to the base model's 11%, demonstrating a 3.7x improvement in chain-of-thought mathematical reasoning.
The following chart shows a distinct threshold related to context length, revealing an optimal range of samples for reasoning activation. While the 0-shot (6%) and 2-shot (3%) configurations performed poorly, even worse than the base model, performance dramatically improved at 4-shot prompting (33%), then peaked at 8-shot context (41%). This non-linear scaling pattern suggests that GRPO training creates reasoning patterns that require a certain number of examples to activate effectively. The model appears to have learned to leverage group comparisons from multiple examples, consistent with GRPO's group-based policy optimization approach where the model learns to compare and select optimal reasoning paths from multiple generated solutions.
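An accuracy figure like the ones reported here can be computed with a small helper such as the sketch below. The function names `extract_number` and `accuracy`, the sample predictions, and the reuse of the reward function's regex and tolerance are assumptions for illustration; the notebook's evaluation code may differ.

```python
import re


def extract_number(text):
    """Parse the first number after the '####' marker, or return None."""
    match = re.search(r"####.*?([\d,]+(?:\.\d+)?)", text)
    if match is None:
        return None
    try:
        return float(match.group(1).replace(",", ""))
    except ValueError:
        return None


def accuracy(predictions, references, tol=1e-3):
    """Fraction of predictions whose extracted number matches the reference."""
    correct = 0
    for pred, ref in zip(predictions, references):
        value = extract_number(pred)
        if value is not None and abs(value - float(ref)) < tol:
            correct += 1
    return correct / len(references)


# Hypothetical model outputs and ground-truth answers
predictions = ["\n#### The final answer is 5", "no answer", "\n#### The final answer is 7"]
references = ["5", "3", "8"]
score = accuracy(predictions, references)

# Relative improvement implied by the reported figures (41% vs. 11%)
improvement = 0.41 / 0.11
```

With 100 test samples, each sample moves accuracy by one percentage point, so small differences between configurations should be read with that granularity in mind.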
Extending RLVR to other domains
While this post focused on mathematical reasoning with GSM8K, the RLVR approach generalizes to domains with objectively verifiable outputs. Two promising directions demonstrate this versatility:
Code generation with execution-based rewards
Code generation provides natural verification through execution. Partial rewards can be awarded when code compiles and runs without errors, while full rewards are achieved when outputs pass comprehensive unit tests. Domain experts specify requirements using natural language prompts, while the reward function automatically evaluates correctness through code execution, reducing the need for subjective human evaluation.
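The partial and full reward idea can be sketched as a reward function with the same call shape as the math rewards in this post. Everything specific here is an assumption for illustration: the function name `code_execution_reward`, the 0.5 and 1.0 reward values, the convention that generated code defines a function called `solve`, and the use of plain `exec`, which is not sandboxed and should not be used on untrusted model output outside an isolated environment.

```python
def code_execution_reward(completions, test_cases, func_name="solve", **kwargs):
    """0.0 if the code fails to run, 0.5 if it runs, 1.0 if all tests pass."""
    rewards = []
    for code in completions:
        namespace = {}
        try:
            exec(code, namespace)  # NOT sandboxed: illustration only
            results = [namespace[func_name](*args) for args, _ in test_cases]
        except Exception:
            rewards.append(0.0)  # syntax error, missing function, or crash
            continue
        passed = all(r == expected for r, (_, expected) in zip(results, test_cases))
        rewards.append(1.0 if passed else 0.5)
    return rewards


# Each test case is (arguments, expected result) for a hypothetical add task
test_cases = [((2, 3), 5), ((-1, 1), 0)]
completions = [
    "def solve(a, b):\n    return a + b",  # correct
    "def solve(a, b):\n    return a - b",  # runs, but wrong output
    "def solve(a, b) return a + b",        # does not compile
]
rewards = code_execution_reward(completions, test_cases)
```

In practice the execution step would run in an isolated process with time and memory limits, but the reward structure stays the same.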
Domain-specific text generation with semantic validation
For specialized domains like medical or technical writing, keyword-based rewards can guide models toward appropriate terminology. Partial rewards encourage inclusion of required terms, while full rewards require complete keyword sets in semantically appropriate contexts. For instance, medical text generation can reward outputs that combine diagnostic keywords ("symptoms," "diagnosis") with treatment keywords ("medication," "therapy") in clinically valid patterns, teaching domain vocabulary through measurable targets.
These examples illustrate how verifiable rewards extend beyond mathematical reasoning to tasks where correctness can be programmatically validated, establishing the foundation for broader applications of this training approach.
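A keyword-coverage reward of this kind might look like the following sketch. The function name `keyword_reward`, the grouping of keywords, and the proportional scoring are assumptions for illustration. The sketch checks only whether terms are present; validating that they appear in clinically appropriate contexts would need an additional, domain-specific check that is not shown.

```python
import re


def keyword_reward(completions, required_groups, **kwargs):
    """Reward the fraction of keyword groups with at least one term present."""
    rewards = []
    for text in completions:
        words = set(re.findall(r"[a-z]+", text.lower()))
        covered = sum(1 for group in required_groups if words & set(group))
        rewards.append(covered / len(required_groups))
    return rewards


required_groups = [
    {"symptoms", "diagnosis"},   # diagnostic vocabulary
    {"medication", "therapy"},   # treatment vocabulary
]
completions = [
    "The patient's symptoms were reviewed and medication was prescribed.",
    "Symptoms include fever.",
    "The weather is nice.",
]
rewards = keyword_reward(completions, required_groups)
```

Because term matching alone is easy to game, a reward like this is best paired with a second verifiable check, in the same way the math example pairs format and correctness.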
Cleaning up
To clean up your resources and avoid incurring additional charges, follow these steps:
- Delete any unused SageMaker Studio resources.
- Optionally, delete the SageMaker Studio domain.
- Delete any S3 buckets created.
- Verify that your training job isn't running anymore. To do so, on your SageMaker console, choose Training and check Training jobs.
To learn more about cleaning up your provisioned resources, check out Clean up.
Conclusion
In this example, you trained a Qwen2.5-0.5B model using GRPO (Group Relative Policy Optimization) on GSM8K: a dataset of 8,500 grade school math word problems that require multi-step arithmetic reasoning and natural language understanding. Each problem includes a question like "Janet's ducks lay 16 eggs per day…" with step-by-step solutions ending in numerical answers, making it ideal for verifiable reward training.
This implementation demonstrates the effectiveness of Reinforcement Learning with Verifiable Rewards (RLVR) for mathematical reasoning tasks. The GRPO-trained Qwen2.5-0.5B model achieved a 3.7x improvement over the base model, reaching 41% accuracy on GSM8K compared to the baseline 11%. The evaluation results validate RLVR as a promising approach for domains with objectively verifiable outcomes, offering an alternative to preference-based training methods. The threshold behavior suggests GRPO learns to leverage group comparisons from multiple examples, consistent with its group-based optimization approach. This work establishes a foundation for applying verifiable reward systems to other domains requiring logical rigor and mathematical accuracy.
For more information on Amazon SageMaker AI fully managed training, refer to the training section of the SageMaker AI documentation. The supporting code for this post can be found in GitHub.
About the authors
Surya Kari is a Senior Generative AI Data Scientist at AWS, specializing in developing solutions leveraging state-of-the-art foundation models. He has extensive experience working with advanced language models including DeepSeek-R1, the Llama family, and Qwen, focusing on their fine-tuning and optimization for specific scientific applications. His expertise extends to implementing efficient training pipelines and deployment strategies using AWS SageMaker, enabling the scaling of foundation models from development to production. He collaborates with customers to design and implement generative AI solutions, helping them navigate model selection, fine-tuning approaches, and deployment strategies to achieve optimal performance for their specific use cases.
Giuseppe Zappia is a Principal AI/ML Specialist Solutions Architect at AWS, focused on helping large enterprises design and deploy ML solutions on AWS. He has over 20 years of experience as a full stack software engineer, and has spent the past 6 years at AWS focused on the field of machine learning.
Amin Dashti is a Senior Data Scientist and researcher at AWS who bridges deep theoretical insight with practical machine learning expertise. With a background in theoretical physics and over seven years of experience, he has designed and deployed scalable models across domains, from predictive analytics and statistical inference in financial systems to cutting-edge applications in computer vision (CV) and natural language processing (NLP).

