In December 2025, we introduced the availability of reinforcement fine-tuning (RFT) on Amazon Bedrock, beginning with support for Nova models. This was followed by extended support for open-weight models such as OpenAI GPT OSS 20B and Qwen 3 32B in February 2026. RFT in Amazon Bedrock automates the end-to-end customization workflow. This allows models to learn from feedback on multiple possible responses using a small set of prompts, rather than traditional large training datasets.
In this post, we walk through the end-to-end workflow of using RFT on Amazon Bedrock with OpenAI-compatible APIs: from setting up authentication, to deploying a Lambda-based reward function, to kicking off a training job and running on-demand inference on your fine-tuned model. Here, we use the GSM8K math dataset as our working example and target OpenAI's gpt-oss-20b model hosted on Bedrock.
How reinforcement fine-tuning works
Reinforcement fine-tuning (RFT) represents a shift in how we customize large language models (LLMs). Unlike traditional supervised fine-tuning (SFT), which requires models to learn from static input/output pairs, RFT enables models to learn through an iterative feedback loop where they generate responses, receive evaluations, and continuously improve their decision-making capabilities.
The core concept: learning from feedback
At its heart, reinforcement learning is about teaching an agent (in this case, an LLM) to make better decisions by providing feedback on its actions. Think of it like training a chess player. Instead of showing them every possible move in every possible situation (which is impossible), you let them play and tell them which moves led to winning positions. Over time, the player learns to recognize patterns and make strategic decisions that lead to success. For LLMs, the model generates multiple possible responses to a given prompt, receives scores (rewards) for each response based on how well they meet your criteria, and learns to favor the patterns and strategies that produce higher-scoring outputs.
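To make this loop concrete, here is an illustrative sketch with toy stand-ins for the model and the reward function; every name in it is hypothetical and the "model" just hard-codes two candidate responses:

```python
def generate_candidates(prompt):
    # A real actor model would sample several responses; we hard-code two.
    return ["The answer is 72", "The answer is 48"]

def reward(response, ground_truth):
    # Verifiable reward: 1.0 if the expected answer appears, else 0.0.
    return 1.0 if ground_truth in response else 0.0

prompt, ground_truth = "How many clips did Natalia sell in total?", "72"
candidates = generate_candidates(prompt)
scores = [reward(c, ground_truth) for c in candidates]

# The training algorithm shifts the policy toward higher-scoring responses.
best = candidates[scores.index(max(scores))]
print(best)  # The answer is 72
```

In real RFT the "shift toward the best response" is a gradient update to the model's weights, but the generate, score, prefer cycle is the same.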
Key components of RFT
Key RFT components include the agent/actor (policy) model, input states to the model, output actions from the model, and the reward function, as shown in the following diagram:
The actor model is the foundation model (FM) that you're customizing. In Amazon Bedrock RFT, this could be Amazon Nova, Llama, Qwen, or other supported models. The state is the current context, including the prompt, conversation history (for multi-turn interactions), and the relevant metadata. The action is the model's response to a prompt. The reward function assigns a numerical score to a (state, action) pair, evaluating the goodness of a model response for a given state. In doing so, the reward function can use additional information like ground-truth responses or unit tests for code generation. This is the critical feedback signal that drives learning. Higher rewards indicate better responses.
One of RFT's key advantages is that the model learns from responses it generates during training, not only from pre-collected examples. This approach unlocks several compounding benefits. Because the model actively explores novel approaches and learns from the outcomes, it can adapt in real time: as it improves, it naturally encounters new scenarios that push it further. This also makes the process much more efficient, removing the need to pre-generate and label thousands of examples upfront. The result is a system capable of continuous improvement, growing stronger as it encounters an ever-more-diverse range of situations. This online learning capability is what enables RFT to achieve superior performance on complex tasks like code generation, mathematical reasoning, and multi-turn conversations. For verifiable tasks like math, this is especially effective because correctness checking is fully automated, avoiding the need for human labeling.
How Amazon Bedrock RFT works
Amazon Bedrock RFT is built to make reinforcement fine-tuning practical at the enterprise level. It handles the heavy lifting, so teams can focus on the problem they're solving rather than the infrastructure beneath it. The full RFT pipeline runs automatically. For each prompt in your training dataset, Amazon Bedrock generates multiple candidate responses from your actor model, managing batching, parallelization, and resource allocation behind the scenes. Reward computation scales just as seamlessly. Whether you're using verifiable rewards or an LLM-as-judge setup, Amazon Bedrock orchestrates evaluation across thousands of prompt-response pairs while handling concurrency and error recovery without manual intervention. Policy optimization runs on GRPO, a state-of-the-art reinforcement learning algorithm, with built-in convergence detection so training stops when it should. Throughout the process, Amazon CloudWatch metrics and the Amazon Bedrock console give you real-time visibility into reward trends, policy updates, and overall model performance, so you know where training stands.
The workflow begins from your development environment (VS Code, terminal, Jupyter, or SageMaker AI notebook) using the standard OpenAI SDK pointed at Bedrock's Mantle endpoint. From there:
- Upload training data through the Files API (.jsonl format with messages and reference answers)
- Deploy a reward function as an AWS Lambda function that scores model-generated responses
- Create the fine-tuning job: Bedrock's GRPO engine generates responses, sends them to your Lambda grader, and updates weights based on reward scores
- Monitor training through events and checkpoints
- Invoke your fine-tuned model on demand, with no endpoint provisioning and no hosting
Your data doesn't leave the secure environment of AWS during the process and isn't used to train models provided by Amazon Bedrock. Here, we walk you through a specific use case of training an OpenAI GPT-OSS model with the GSM8K dataset. For more details, see the Bedrock RFT User Guide.
Prerequisites
Before you can get started, you need:
- An AWS account with Amazon Bedrock access in a supported AWS Region
- A Bedrock API key (short-term or long-term). You can also authenticate using AWS SigV4 credentials, but in this walkthrough we use an Amazon Bedrock API key. For more information, see Access and security for open-weight models in the Amazon Bedrock User Guide.
- IAM roles for Lambda execution and Amazon Bedrock fine-tuning
- Python with openai, boto3, and aws-bedrock-token-generator installed. If you're working in a shell inside a venv, or with a Jupyter notebook, you can run:
pip install openai boto3 aws-bedrock-token-generator
Step 1: Configure the OpenAI client
Point the standard OpenAI SDK at your Amazon Bedrock Mantle endpoint. Authentication uses an Amazon Bedrock API key generated through the aws-bedrock-token-generator library:
from openai import OpenAI
from aws_bedrock_token_generator import provide_token

AWS_REGION = "us-west-2"
MANTLE_ENDPOINT = f"https://bedrock-mantle.{AWS_REGION}.api.aws"

client = OpenAI(
    base_url=f"{MANTLE_ENDPOINT}/v1",
    api_key=provide_token(region=AWS_REGION),
)
That's it. Every subsequent call uses the standard OpenAI SDK interface. Note: We recommend using and refreshing short-term Amazon Bedrock API keys as needed rather than setting and using long-term ones that don't expire.
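If you want that refresh handled automatically, a small wrapper along the following lines can cache a short-term token and re-issue it before expiry. This helper is not part of any SDK; in real use you would pass provide_token (bound to your Region) as token_provider, and the TTL shown is an assumption you should match to your key's actual lifetime:

```python
import time

class RefreshingToken:
    """Caches a short-term token and re-issues it shortly before expiry."""

    def __init__(self, token_provider, ttl_seconds=600):
        self._provider = token_provider   # e.g. lambda: provide_token(region=AWS_REGION)
        self._ttl = ttl_seconds
        self._token = None
        self._expires_at = 0.0

    def get(self):
        # Refresh if we have no token yet or it expires within a minute.
        if self._token is None or time.time() >= self._expires_at - 60:
            self._token = self._provider()
            self._expires_at = time.time() + self._ttl
        return self._token

# Demo with a fake provider that counts how often it is actually called:
calls = []
token = RefreshingToken(lambda: calls.append(1) or f"tok-{len(calls)}", ttl_seconds=600)
print(token.get(), token.get())  # tok-1 tok-1 (provider called only once)
```

Because the OpenAI client takes a static api_key string, you would call token.get() when constructing the client (or reconstruct the client periodically) rather than expecting the SDK to re-read it.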
Step 2: Prepare and upload training data
Each record in the dataset requires a messages field and can optionally include a reference_answer field. The messages field contains the prompt provided to the model, formatted using the OpenAI message standard where each message specifies a role (such as "user") and corresponding content. The optional reference_answer field provides supplementary context for reward computation, such as a ground-truth answer, evaluation rule, or scoring dimensions used by the reward function.
For GSM8K, each training sample contains a mathematical word problem in the user message and a reference answer containing the correct numerical solution. The prompt instructs the model to provide its reasoning within structured tags and present the final answer in a boxed{} format that the reward function can reliably extract, as in the following example:
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "A chat between a curious User and an artificial intelligence Bot. The Bot gives helpful, detailed, and polite answers to the User's questions. The Bot first thinks about the reasoning process and then provides the User with the answer. The reasoning process and answer are enclosed within <|begin_internal_thought|> <|end_internal_thought|> and <|begin_of_solution|> <|end_of_solution|> respectively. The final answer must be enclosed in boxed{} within the solution block.\n\nNatalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"
        }
      ]
    }
  ],
  "reference_answer": {
    "answer": "72"
  },
  "data_source": "gsm8k_nova"
}
We provide a helper function to convert the raw GSM8K records to JSONL format compatible with Amazon Bedrock RFT in this GitHub repository.
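For a rough idea of what such a conversion involves, here is a minimal sketch. It assumes raw GSM8K records carry question and answer fields (GSM8K answers end in "#### <number>") and that the grader expects that same "####" convention; the helper in the sample repository is the authoritative version:

```python
import json

# Hypothetical prompt suffix matching the simplified GSM8K format shown below.
PROMPT_SUFFIX = " Let's think step by step and output the final answer after '####'."

def to_rft_record(raw):
    """Convert one raw GSM8K record into an RFT training record."""
    final = raw["answer"].split("####")[-1].strip()  # e.g. "... #### 72" -> "72"
    return {
        "messages": [{"role": "user", "content": raw["question"] + PROMPT_SUFFIX}],
        "reference_answer": f"#### {final}",
        "data_source": "gsm8k",
    }

raw = {
    "question": "Natalia sold 48 clips in April and half as many in May. How many in total?",
    "answer": "48 / 2 = 24. 48 + 24 = 72.\n#### 72",
}
print(json.dumps(to_rft_record(raw)))
```

Writing one json.dumps line per record produces the .jsonl file the Files API expects.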
Note that the data_source field makes sure that the appropriate reward function is applied during training, while the structured prompt formats align the outputs with the reward function's extraction logic.
As previously mentioned, the training data is a JSONL file where each line contains a conversation with messages and a reference answer. For GSM8K, this looks like:
{
  "messages": [
    {"role": "user", "content": "Janet's ducks lay 16 eggs per day. She eats three for breakfast and bakes muffins with four. She sells the rest at $2 each. How much does she make daily? Let's think step by step and output the final answer after '####'."}
  ],
  "reference_answer": "#### 18"
}
You can use additional fields here that may be useful for your grader Lambda function in a later step, but note that the messages structure and reference_answer are mandatory.
We can then upload our prepared dataset through the Files API:
with open("rft_train_data.jsonl", "rb") as f:
    file_response = client.files.create(file=f, purpose="fine-tune")

training_file_id = file_response.id
print(f"Training file uploaded: {training_file_id}")
Step 3: Deploy a Lambda reward function
The reward function is the core of RFT. It receives model-generated responses and returns a score. For math problems, this is straightforward: extract the answer and compare it to the ground truth.
Here is the reward function used in this walkthrough (from the sample repository):
import re
from dataclasses import asdict

def lambda_handler(event, context):
    trajectories = event if isinstance(event, list) else event.get("trajectories", [])
    scores = []
    for trajectory in trajectories:
        trajectory_id = trajectory.get("id", "no-id")

        # Get the model's response from the last assistant message
        response = ""
        for msg in reversed(trajectory.get("messages", [])):
            if msg.get("role") == "assistant":
                response = msg.get("content", "")
                break

        # Extract the ground truth from the reference answer
        reference_answer = trajectory.get("reference_answer", {})
        reference_text = reference_answer.get("text", "")
        gt_match = re.findall(r"#### (-?[0-9.,]+)", reference_text)
        ground_truth = gt_match[-1].replace(",", "") if gt_match else ""

        # Score: 1.0 if correct, 0.0 otherwise
        result = compute_score(
            trajectory_id=trajectory_id,
            solution_str=response,
            ground_truth=ground_truth,
        )
        scores.append(asdict(result))
    return scores
The function returns a list of RewardOutput objects, each containing an aggregate_reward_score between 0 and 1. Deploy this as an AWS Lambda function with a 5-minute timeout and 512 MB of memory. Note that you can completely customize what happens inside this reward Lambda function to suit your use case. Amazon Bedrock also supports model-as-a-judge graders for subjective tasks where automated verification isn't possible. For more details about setting up reward functions, see Setting up reward functions for open-weight models.
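The handler above delegates to compute_score, which lives in the sample repository. As an illustration of what such a scorer could look like for the "#### <number>" convention, here is a sketch; the aggregate_reward_score field name matches the description above, but the rest of the RewardOutput shape is an assumption:

```python
import re
from dataclasses import dataclass

@dataclass
class RewardOutput:
    # Illustrative shape; the repository's dataclass may carry more fields.
    trajectory_id: str
    aggregate_reward_score: float

def compute_score(trajectory_id, solution_str, ground_truth):
    """Score 1.0 if the last '#### <number>' in the response matches the ground truth."""
    matches = re.findall(r"#### (-?[0-9.,]+)", solution_str)
    predicted = matches[-1].replace(",", "") if matches else None
    score = 1.0 if predicted == ground_truth else 0.0
    return RewardOutput(trajectory_id=trajectory_id, aggregate_reward_score=score)

print(compute_score("t-1", "16 - 7 = 9; 9 * 2 = 18\n#### 18", "18").aggregate_reward_score)  # 1.0
```

Binary verifiable rewards like this work well for GSM8K; partial-credit or format-bonus terms can be added the same way.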
Step 4: Create the fine-tuning job
Now we use the following single API call to start the job:
job_response = client.fine_tuning.jobs.create(
    model="openai.gpt-oss-20b",
    training_file=training_file_id,
    extra_body={
        "method": {
            "type": "reinforcement",
            "reinforcement": {
                "grader": {
                    "type": "lambda",
                    "lambda": {
                        "function": lambda_arn  # Replace with your reward function ARN
                    }
                },
                "hyperparameters": {
                    "n_epochs": 1,
                    "batch_size": 4,
                    "learning_rate_multiplier": 1.0
                }
            }
        }
    }
)

job_id = job_response.id
Notice that the create call for the preceding fine-tuning job uses the following hyperparameters:
- n_epochs: Number of full passes through the training data. Start with 1.
- batch_size: Number of prompts per training step. Larger batches give more stable updates.
- learning_rate_multiplier: We recommend using a value below 1.0 for stability.
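Training jobs can run for a while, so you may want to block until the job reaches a terminal state. The following polling sketch is a generic helper, not an SDK feature; in practice, fetch_status could be lambda: client.fine_tuning.jobs.retrieve(job_id).status, which returns values like "running", "succeeded", or "failed":

```python
import time

def wait_for_job(fetch_status, poll_seconds=60,
                 terminal=("succeeded", "failed", "cancelled")):
    """Poll fetch_status() until it returns a terminal value, then return it."""
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        time.sleep(poll_seconds)

# Demo with a fake status sequence instead of a live job:
statuses = iter(["validating_files", "running", "running", "succeeded"])
print(wait_for_job(lambda: next(statuses), poll_seconds=0))  # succeeded
```

A generous poll interval (a minute or more) is plenty here; RFT jobs are long-running and the retrieve call is rate-limited like any other API call.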
Step 5: Monitor training
To track the progress of the job, we use the list events API as follows:
events = client.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id, limit=100)
For a GPT-OSS example job that uses the GSM8K data subset, the training runs for a total of 67 steps, with various events being emitted as the training job progresses. Here's a timeline of those steps:
Now let's dissect one of these events emitted during training:
{
  "id": "ftevent-c3c14785-4a3b-4dab-99a5-a15aeb6c0742",
  "created_at": 1771442218,
  "level": "info",
  "message": "Step 4/67: training metrics",
  "object": "fine_tuning.job.event",
  "data": {
    "total_steps": 67,
    "actor_grad_norm": 0.0008667297661304474,
    "response_length_mean": 519.09375,
    "step": 4,
    "actor_pg_loss": 0.10153239965438844,
    "critic_rewards_mean": 0.4375,
    "actor_entropy": 0.6235736012458801,
    "critic_advantages_mean": 0.013622610829770563
  },
  "type": "metrics"
}
Let's discuss what these mean:
- step / total_steps: Current training step out of the total.
- critic_rewards_mean: Average reward score across the batch (0.4375 means ~44% of responses got correct answers from your grader). This is the primary metric to watch; you want it trending up.
- actor_pg_loss: Policy gradient loss. This is the objective being optimized, measuring how much the model's policy is being pushed toward higher-reward responses. It fluctuates naturally; there is no single "good" value.
- actor_entropy: How spread out the model's token probability distribution is. Higher means more exploratory, diverse outputs. If it collapses toward 0, the model is becoming too deterministic (mode collapse). You want it to decrease gradually, not crash.
- actor_grad_norm: Magnitude of the gradient update to the actor (the model). Large spikes can indicate training instability. Here it is very small (0.0009), which suggests stable, conservative updates.
- critic_advantages_mean: Average advantage estimate, that is, how much better or worse a response was compared to the critic's baseline prediction. Near-zero (0.014) means the critic is well calibrated. Large positive values mean the model is doing much better than expected; large negative values mean worse.
- response_length_mean: Average token length of generated responses (519). Worth monitoring: if it grows unboundedly, the model may be gaming length for reward.
What to watch for during training:
- critic_rewards_mean trending upward: the model is learning
- actor_entropy collapsing to 0: mode collapse (bad)
- actor_grad_norm spiking: instability
- response_length_mean exploding: possible reward hacking
The sample code also provides an example of how to plot these metrics.
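As a rough sketch of what that plotting code starts from, the following pulls (step, critic_rewards_mean) pairs out of the metrics events, assuming the event shape shown in the example above:

```python
def reward_curve(events):
    """Collect (step, critic_rewards_mean) pairs from metrics events, sorted by step."""
    points = []
    for ev in events:
        data = ev.get("data", {})
        if ev.get("type") == "metrics" and "critic_rewards_mean" in data:
            points.append((data["step"], data["critic_rewards_mean"]))
    return sorted(points)

# Sample events in the shape shown above (list_events returns newest first):
sample_events = [
    {"type": "metrics", "data": {"step": 8, "critic_rewards_mean": 0.5625}},
    {"type": "message", "data": {}},
    {"type": "metrics", "data": {"step": 4, "critic_rewards_mean": 0.4375}},
]
print(reward_curve(sample_events))  # [(4, 0.4375), (8, 0.5625)]
```

From here, the pairs can be fed straight into any plotting library, or simply compared head-to-tail for a quick trend check.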
The reward curve shows the model improving from ~0.56 to consistently 0.85–0.97 by mid-training. Response lengths also trend shorter over time, suggesting the model learned to be more concise while solving GSM8K problems correctly. Here's how to list checkpoints as they're saved:
checkpoints = client.fine_tuning.jobs.checkpoints.list(fine_tuning_job_id=job_id)
Step 6: Run on-demand inference
After the job succeeds, invoke your fine-tuned model directly. No endpoint provisioning, no hosting:
job_details = client.fine_tuning.jobs.retrieve(job_id)
fine_tuned_model = job_details.fine_tuned_model

response = client.chat.completions.create(
    model=fine_tuned_model,
    messages=[
        {"role": "user", "content": "If a train travels 120 miles in 2 hours, what is its speed in miles per hour?"}
    ],
)
print(response.choices[0].message.content)
You can also use the Responses API to stream responses from the fine-tuned model:
stream = client.responses.create(
    model=fine_tuned_model,
    input=[{"role": "user", "content": "Your prompt here"}],
    stream=True,
    reasoning={"effort": "low"}
)

for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
Conclusion
Reinforcement fine-tuning on Amazon Bedrock brings together three things that make the end-to-end workflow practical:
- OpenAI SDK compatibility: no new SDK to learn. Point OPENAI_BASE_URL and OPENAI_API_KEY at Bedrock and use the same client.fine_tuning.jobs.create() calls.
- Lambda-based reward functions: write your scoring logic in Python, deploy it as a Lambda function, and Amazon Bedrock handles the training loop (GRPO) for you.
- On-demand inference: no endpoint management. Call client.chat.completions.create() with your fine-tuned model ID and pay per token.
The full notebook with end-to-end code for both GPT-OSS 20B and Qwen3 32B is available on GitHub:
github.com/aws-samples/amazon-bedrock-samples/tree/main/custom-models/bedrock-reinforcement-fine-tuning
For more details, see the Amazon Bedrock Reinforcement Fine-Tuning documentation.
About the authors
Shreyas Subramanian
Shreyas Subramanian is a Principal Data Scientist who helps customers use generative AI and deep learning to solve their business challenges with AWS services like Amazon Bedrock and AgentCore. Dr. Subramanian contributes to cutting-edge research in deep learning, agentic AI, foundation models, and optimization techniques, with several books, papers, and patents to his name. In his current role at Amazon, Dr. Subramanian works with various science leaders and research teams within and outside Amazon, helping guide customers to best leverage state-of-the-art algorithms and techniques to solve business-critical problems. Outside AWS, Dr. Subramanian is a reviewer for AI papers and funding through organizations like NeurIPS, ICML, ICLR, NASA, and NSF.
Nick McCarthy
Nick McCarthy is a Senior Generative AI Specialist Solutions Architect on the Amazon Bedrock team, based out of the AWS New York office. He helps customers customize their generative AI models on AWS. He has worked with clients across a wide range of industries, including healthcare, finance, sports, telecommunications, and energy, helping them accelerate business outcomes through the use of AI and machine learning. He holds a Bachelor's degree in Physics and a Master's degree in Machine Learning from UCL, London.
Shreeya Sharma
Shreeya Sharma is a Senior Technical Product Manager at AWS, where she has been working on leveraging the power of generative AI to deliver innovative and customer-centric products. Shreeya holds a master's degree from Duke University. Outside of work, she loves traveling, dancing, and singing.
Shalendra Chhabra
Shalendra Chhabra is currently Head of Product Management for Amazon SageMaker Human-in-the-Loop (HIL) Services. Previously, Shalendra incubated and led Language and Conversational Intelligence for Microsoft Teams Meetings, was EIR at Amazon Alexa Techstars Startup Accelerator, VP of Product and Marketing at Discuss.io, Head of Product and Marketing at Clipboard (acquired by Salesforce), and Lead Product Manager at Swype (acquired by Nuance). In total, Shalendra has helped build, ship, and market products that have touched more than a billion lives.

