You can use Reinforcement Fine-Tuning (RFT) in Amazon Bedrock to customize Amazon Nova and supported open-source models by defining what "good" looks like, with no large labeled datasets required. By learning from reward signals rather than static examples, RFT delivers up to 66% accuracy gains over base models at reduced customization cost and complexity. This post covers best practices for RFT on Amazon Bedrock, from dataset design, reward function strategy, and hyperparameter tuning to use cases like code generation, structured extraction, and content moderation.
In this post, we explore where RFT is most effective, using the GSM8K mathematical reasoning dataset as a concrete example. We then walk through best practices for dataset preparation and reward function design, show how to monitor training progress using Amazon Bedrock metrics, and conclude with practical hyperparameter tuning guidelines informed by experiments across multiple models and use cases.
RFT use cases: Where can RFT shine?
Reinforcement Fine-Tuning (RFT) is a model customization technique that improves foundation model (FM) behavior using reward signals. Unlike supervised fine-tuning (SFT), it doesn't train directly on correct responses (labeled input/output pairs). Instead, RFT uses a dataset of inputs and a reward function. The reward function can be rule-based, another trained grader model, or a large language model (LLM) acting as a judge. During training, the model generates candidate responses and the reward function scores each response. Based on the reward, the model weights are updated to increase the likelihood of generating responses that achieve a high reward. This iterative cycle of sampling responses, scoring responses, and updating weights steers the model to learn which behaviors lead to better outcomes. RFT is particularly valuable when the desired behavior can be evaluated but is difficult to demonstrate, whether because labeled data is impractical to curate or because static examples alone can't capture the reasoning a task demands. It excels in two primary areas:
- Tasks where a rule or test can verify correctness automatically
- Subjective tasks where another model can effectively evaluate response quality
Tasks in the first category include code generation that must pass tests, math reasoning with verifiable answers, structured data extraction that must match strict schemas, or API/tool calls that must parse and execute correctly. Because success criteria can be translated directly into reward signals, the model can discover stronger strategies than a small set of labeled examples could teach. This pattern is known as Reinforcement Learning with Verifiable Rewards (RLVR).
In addition, RFT suits subjective tasks such as content moderation, chatbots, creative writing, or summarization that lack easily quantifiable correctness. A judge model, guided by a detailed evaluation rubric, can serve as the reward function. It scores outputs against criteria that would be impractical to encode as static training pairs. This approach is known as Reinforcement Learning with AI Feedback (RLAIF).
For RFT in Amazon Bedrock, you can implement both rule-based and model-based approaches as a custom AWS Lambda function, which is the reward function that Amazon Bedrock calls during the training loop.
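A minimal sketch of what such a Lambda reward function might look like, assuming an illustrative event payload: the `completions`, `completion`, `reference_answer`, and `scores` field names below are placeholders, so consult the Amazon Bedrock RFT documentation for the exact schema your job will use.

```python
def lambda_handler(event, context):
    """Score each candidate completion against its ground-truth answer.

    The event/response field names here are illustrative, not the
    official Bedrock payload schema.
    """
    results = []
    for item in event.get("completions", []):
        model_answer = item.get("completion", "").strip()
        reference = item.get("reference_answer", "").strip()
        # Rule-based check: exact match earns full reward, otherwise zero.
        score = 1.0 if model_answer == reference else 0.0
        results.append({"score": score})
    return {"scores": results}
```

The same handler shape works for model-based grading: instead of the exact-match check, the function would call a judge model and parse its score.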
A comparison of these two approaches is depicted in the following diagram:
The following are several common use cases that can be tackled through RLVR, RLAIF, or a combination of both.
| Use Case | Reward Signal |
| --- | --- |
| Code generation for production services | Unit-test pass rates, linting, and runtime checks |
| Tool and API orchestration | Successful end-to-end task completion (for example, booking flows, data retrieval pipelines) |
| Complex math and algorithmic reasoning | Correct final answers and/or intermediate verification steps |
| Structured data extraction and transformation | Schema validation, exact matches, penalties for malformed outputs |
| SQL/query synthesis over databases | Query results matching expected answers or satisfying runtime properties |
| Agentic workflows | Combination of RLVR and RLAIF: RLVR for tool-calling correctness; RLAIF for final task completion, for example, measured as usefulness, correctness, or robustness |
GSM8K: Using RFT to improve solutions to mathematical problems
To illustrate how Reinforcement Fine-Tuning works in practice, we can examine a concrete example: improving a model's ability to solve mathematical reasoning problems. RFT is useful for mathematical problems because solutions can often be objectively verified, making it possible to design clear reward signals that guide the model toward correct reasoning and structured outputs. Let's look at an example from the GSM8K (Grade School Math 8K) dataset:
Tina makes $18.00 an hour. If she works more than 8 hours per shift, she is eligible for overtime, which is paid by your hourly wage + 1/2 your hourly wage. If she works 10 hours every day for 5 days, how much money does she make?
Let's look at what a good response might look like:
I need to find the total pay for 5 days of 10-hour shifts. Because she works over 8 hours each day, I'll need to split each day into regular and overtime hours, calculate the overtime rate (1.5x regular), then multiply by 5 days.
Overtime rate: $18.00 + (1/2 × $18.00) = $27.00/hour
Daily earnings (10 hours):
Regular (8 hours): 8 × $18 = $144
Overtime (2 hours): 2 × $27 = $54
Daily total: $198
Total for 5 days: 5 × $198 = $990
boxed{990}
Here, we see that the problem is broken down into logical steps and shows clear reasoning paths, not only final answers. Additionally, we want the model to respond in this specific format and have the answer exactly match the ground-truth solution. Other fine-tuning methods like SFT struggle with mathematical reasoning because they primarily learn to pattern-match training data rather than truly reason. These models can memorize solution templates but often fail when presented with novel variations of a problem.
Because we can use RFT to define reward functions, exact answers like the $990 above can be objectively evaluated while also assigning partial credit for correct intermediate reasoning steps. This lets the model discover valid solution approaches while learning to follow the required structure, and in many cases it achieves strong performance with relatively small datasets (around 100–1,000 examples).
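A toy reward function along these lines, assuming the boxed{} answer convention shown above. This is a simplified sketch, not the Bedrock sample implementation, and the 0.3 partial-credit value is an arbitrary illustration:

```python
import re

def gsm8k_reward(response: str, ground_truth: str) -> float:
    """Full credit for a correct boxed answer; partial credit when the
    correct number appears in the reasoning but not in the box."""
    match = re.search(r"boxed\{([^}]*)\}", response)
    boxed = match.group(1).strip() if match else None
    if boxed == ground_truth:
        return 1.0   # correct final answer in the required format
    if ground_truth in response:
        return 0.3   # right number appears, but wrong or missing box
    return 0.0
```

Splitting the reward this way encourages both correctness and the output format that downstream parsing depends on.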
Best practices for preparing your dataset
RFT requires carefully prepared datasets to achieve effective results. On Amazon Bedrock, RFT training data is provided as a JSONL file, with each record following the OpenAI chat completion format.
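For illustration, a single record could be built and written like this; the `reference_answer` field is a hypothetical addition for reward computation, so check the Bedrock dataset documentation for the exact record schema your job expects.

```python
import json

# One training record in OpenAI chat completion style. The system and
# user content are examples from this post; reference_answer is an
# assumed field consumed by the reward function, not a required key.
record = {
    "messages": [
        {"role": "system",
         "content": "Solve the problem. Put the final answer in boxed{}."},
        {"role": "user",
         "content": "Tina makes $18.00 an hour ... how much money does she make?"},
    ],
    "reference_answer": "990",
}

# JSONL means one JSON object per line.
with open("train.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```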
Dataset size guidelines
RFT supports dataset sizes between 100–10,000 training samples, though requirements vary depending on task complexity and reward function design. Tasks involving complex reasoning, specialized domains, or broad application scopes often benefit from larger datasets and a sophisticated reward function. For initial experimentation, start with a small dataset (100–200 examples) to validate that your prompts and reward function produce meaningful learning signals and that the base model can achieve measurable reward improvements. Note that for certain domains, customizing only on small datasets can yield limited generalization and show inconsistent results across prompt variations. Typical implementations using 200–5,000 examples show stronger generalization and more consistent performance across prompt variations. For more complex reasoning tasks, specialized domains, or sophisticated reward functions, 5,000–10,000 examples can improve robustness across diverse inputs.
For more information about the dataset requirements, see the Amazon Bedrock documentation.
Dataset quality tips
The quality of your training data fundamentally determines RFT outcomes. Consider the following tips when preparing your dataset:
1. Prompt distribution
Make sure the dataset reflects the full range of prompts that the model will encounter in production. A skewed dataset can lead to poor generalization or unstable training behavior.
2. Base model capability
RFT assumes that the base model demonstrates basic task understanding. If the model can't achieve a non-zero reward on your prompts, the learning signal will be too weak for effective training. A simple validation step is generating several responses from the base model (for example, at temperature ≈ 0.6) and confirming that the outputs produce meaningful reward signals.
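This validation step can be scripted generically. In the sketch below, `generate` stands in for whatever inference wrapper you use (for example, around a Bedrock invocation); the function name and sampling defaults are illustrative.

```python
def validate_base_model(prompts, generate, reward_fn,
                        n_samples=4, temperature=0.6):
    """Return the fraction of prompts on which the base model earns a
    non-zero reward in at least one of n_samples attempts. A value near
    zero suggests the learning signal will be too weak for RFT."""
    hits = 0
    for prompt in prompts:
        scores = [reward_fn(generate(prompt, temperature))
                  for _ in range(n_samples)]
        if max(scores) > 0:
            hits += 1
    return hits / len(prompts)
```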
3. Clear prompt design
Prompts should clearly communicate expectations and constraints. Ambiguous instructions lead to inconsistent reward signals and degraded learning. Prompt structure should also align with reward function parsing, for example by requiring final answers after a specific marker or enforcing code blocks for programming tasks, as well as by using prompt structures the base model knows from pre-training.
4. Reliable reference answers
When possible, include a reference answer that represents the desired output pattern, formatting, and correctness criteria. Reference answers anchor reward computation and reduce noise in the learning signal. For example, mathematical tasks might include a correct numerical answer, while coding tasks might include unit tests or input-output pairs.
It's also good practice to validate reference answers by confirming that a response aligned with the ground truth receives the maximum reward score.
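One way to sketch that validation, assuming records carry a `reference_answer` field and `reward_fn(response, ground_truth)` returns a score (both are illustrative names, not a fixed Bedrock interface):

```python
def check_reference_answers(dataset, reward_fn, max_reward=1.0):
    """Sanity check: a response equal to the ground truth should score
    the maximum reward. Returns the indices of records whose reference
    answer fails its own reward function (e.g., formatting the reward
    parser cannot handle)."""
    bad = []
    for i, record in enumerate(dataset):
        ref = record["reference_answer"]
        if reward_fn(ref, ref) < max_reward:
            bad.append(i)
    return bad
```

Any flagged records point at a mismatch between how references are written and what the reward function expects, which is worth fixing before training.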
5. Consistent reward signals within the data
Because RFT relies entirely on reward signals to guide learning, the quality of those signals is critical. Your dataset and reward function should work together to produce consistent, well-differentiated scores. This means that strong responses reliably score higher than weak ones across similar inputs. If the reward function can't clearly distinguish between good and poor responses, or if similar outputs receive widely varying scores, the model might learn the wrong patterns or fail to improve altogether.
In the next section, you'll learn what to keep in mind when writing your reward function.
Preparing your reward function
Reward functions are central to RFT because they evaluate and score model responses, assigning higher rewards to preferred outputs and lower rewards to less desirable ones. This feedback guides the model toward improved behavior during training. For objective tasks like mathematical reasoning, a candidate response that produces the correct answer might receive a reward of 1, while an incorrect answer receives 0. A response with a partially correct reasoning trace and an incorrect final answer might get a reward of 0.8 (depending on how much you want to penalize an incorrect final response). For subjective tasks, the reward function encodes desired qualities. For example, in summarization it might capture faithfulness, coverage, and readability. For more information about setting up your reward function, see setting up reward functions for Amazon Nova models.
Reward design for verifiable tasks
For tasks that can be deterministically verified, like math reasoning or coding, the simplest approach is to programmatically check correctness. Effective reward functions typically evaluate both format constraints and performance targets. Format checks make sure that responses can be reliably parsed and evaluated. Performance metrics determine whether the result is correct. Rewards can be implemented using binary indicators (correct versus incorrect) or continuous scoring, depending on the task.
For GSM8K-style mathematical reasoning tasks, reward functions must also account for how models express numerical answers. Models can format numbers with commas, currency symbols, or percentages, or embed answers within explanatory text. To handle this, answers should be normalized by stripping formatting characters and applying flexible extraction that prioritizes structured formats before falling back to pattern matching. This approach makes sure that models are rewarded for correct reasoning rather than penalized for stylistic formatting choices. You can find the full reward function implementation for GSM8K in the amazon-bedrock-samples GitHub repository.
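A normalization helper in that spirit might look like the following. This is a simplified sketch of the idea, not the repository implementation:

```python
import re

def normalize_answer(text: str) -> str:
    """Extract and normalize a numeric answer: prefer a boxed{} answer
    (structured format), otherwise fall back to the last number in the
    text, then strip commas, currency symbols, and percent signs."""
    boxed = re.search(r"boxed\{([^}]*)\}", text)
    candidate = boxed.group(1) if boxed else None
    if candidate is None:
        numbers = re.findall(r"-?[\d,]*\.?\d+", text)
        candidate = numbers[-1] if numbers else ""
    return candidate.replace(",", "").replace("$", "").replace("%", "").strip()
```

Comparing `normalize_answer(response)` against a normalized ground truth rewards the reasoning rather than the formatting.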
Reward design for non-verifiable tasks
Tasks like summarization, creative writing, or semantic alignment require an LLM-based judge to approximate subjective preferences. In this setting, the judge prompt effectively acts as the reward function, defining which behaviors are rewarded and how responses are scored. A practical judge prompt should clearly define the evaluation goal and include a concise scoring rubric with numeric scales reflecting the qualities the model should improve on.
Judge prompts should also return structured outputs, for example JSON or tagged formats containing the final score and optional reasoning, so reward values can be reliably extracted during training while maintaining observability into how each response was evaluated. An example of a reward function that uses AI feedback can be seen in this PandaLM reward function script on GitHub.
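Extraction on the training side can then be defensive. Here is a sketch assuming the judge was asked for JSON with a `score` field (an assumed format, not a Bedrock requirement), with a regex fallback for judges that wrap the JSON in extra text:

```python
import json
import re

def extract_judge_score(judge_output: str, default: float = 0.0) -> float:
    """Pull the numeric score out of a judge model's structured output.
    Falls back to a regex, then to a default, so one malformed judge
    response cannot crash the training loop."""
    try:
        return float(json.loads(judge_output)["score"])
    except (json.JSONDecodeError, KeyError, TypeError):
        match = re.search(r'"score"\s*:\s*(-?\d+(?:\.\d+)?)', judge_output)
        return float(match.group(1)) if match else default
```

Logging the judge's optional reasoning alongside the extracted score preserves the observability mentioned above.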
Combining verifiable rewards with AI feedback
Reward functions for verifiable tasks can also be augmented with AI feedback to evaluate solution quality beyond numerical correctness. For example, an LLM-as-a-judge can assess the reasoning chain, verify intermediate calculations, or evaluate the clarity of explanations, providing a reward signal that captures both correctness and reasoning quality.
Iterating on reward design
Reward functions often require iteration. Early versions might produce noisy signals, or during the training loop the model might learn to exploit the reward function and earn high rewards without learning the desired behavior. Refining the reward logic based on observed training behavior is essential. Before launching full training jobs, it's also good practice to test reward functions independently using sample prompts and known outputs to make sure the scoring logic produces stable and meaningful reward signals.
Evaluating training progress: signals that the model is learning
After your dataset and reward function are ready, you can launch RFT training using either the Amazon Bedrock API or the console. The exact workflow depends on your preferred development environment. The Create and manage fine-tuning jobs for Amazon Nova models topic in the Amazon Bedrock User Guide provides step-by-step instructions for both approaches. After training begins, monitoring the training metrics is crucial. These signals indicate whether the reward function is meaningful and whether the model is learning useful behaviors rather than overfitting or collapsing to trivial strategies. The following image shows the training metrics of one of our GSM8K training runs, exhibiting healthy training dynamics.
Training rewards plots the average reward score at each training step. Variance is expected because the input prompts in a batch are sampled randomly, so difficulty varies across batches. In addition, the model is exploring different strategies, which adds variance. What matters is the overall trend: rewards improve from roughly 0.5 to around 0.8–0.9, indicating that the model is converging on receiving higher rewards. Validation rewards provide a clearer signal because they're computed on a held-out dataset. Here we see a steep improvement during the first ~40 steps followed by a plateau around 0.88, suggesting the model is generalizing rather than memorizing training examples. Validation rewards that track closely with training rewards are often a sign that overfitting isn't occurring.
Training episode length measures the average response length. The drop from roughly 625 tokens to ~400 tokens suggests that the model is learning to reach correct answers more efficiently, producing less redundant reasoning as training progresses. Policy entropy measures how much the model is exploring different response strategies during training. Values in the 0.8–1.1 range indicate healthy exploration. If entropy collapsed toward zero, it would suggest the model had prematurely converged, but sustained entropy implies the model is still exploring and improving.
Hyperparameter tuning guidelines
In this section, we cover practical hyperparameter tuning guidelines for Amazon Bedrock RFT. These recommendations are informed by a series of internal experiments that we ran across multiple models and use cases, including reasoning tasks like GSM8K and other structured and generative workloads. While effective values will vary by task, the patterns observed across these experiments provide useful starting points when configuring RFT jobs. For more information about the hyperparameters you can configure before launching an RFT customization job, see the official boto3 docs.
EpochCount
Training duration and epochCount require adjustment based on dataset size and model behavior. Smaller datasets often show continued improvement through 6–12 epochs, while larger datasets may reach optimal performance in 3–6 epochs. This relationship isn't linear, and careful monitoring of validation metrics remains essential to prevent overfitting while ensuring sufficient model adaptation.
BatchSize
This parameter controls how many prompts are processed before the updated model generates a new round of candidate responses (rollouts). For example, with a batchSize of 128, the model processes, updates, and generates new rollouts for 128 prompts at a time until it has worked through the full dataset. The total number of rollout rounds equals the (filtered) dataset size divided by batchSize.
A batchSize of 128 works well for most use cases and models. Increase it if the loss is erratic or the reward isn't improving. Decrease it if iterations take too long.
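The arithmetic can be sketched as follows; rounding a final partial batch up with a ceiling is an assumption here, not documented service behavior:

```python
import math

def rollout_rounds(filtered_dataset_size: int, batch_size: int,
                   epochs: int = 1) -> int:
    """Rollout rounds per run: the filtered dataset is consumed
    batch_size prompts at a time, once per epoch."""
    return math.ceil(filtered_dataset_size / batch_size) * epochs
```

For example, 1,000 prompts with a batchSize of 128 yields 8 rollout rounds per epoch.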
LearningRate
In Amazon Bedrock RFT, we perform parameter-efficient RFT using Low-Rank Adaptation (LoRA) adapters with a rank of 32. Across a range of use cases, a learning rate of 1e-4 has consistently produced strong results. In the following experiment, we swept learning rates across seven orders of magnitude on Qwen3-1.7B using the GSM8K dataset (1K training samples, 256 test samples), running a single epoch with batch size 64, group size 16, and LoRA rank 1. As shown in the following figure, LoRA's optimal learning rate peaks around 1e-4 to 1e-3, roughly one order of magnitude higher than full fine-tuning (FFT). Even with a rank of 1, LoRA achieves within ~5.5% of FFT's best validation reward at roughly the same wall-clock time. In practice, LoRA-based RFT tends to be more forgiving and performs well across a wider range of learning rates than FFT, though both approaches can collapse outside their optimal ranges. We recommend monitoring reward curves closely and lowering the learning rate if they begin to oscillate or collapse.
Prompt length and response length
The maxPromptLength parameter defines the maximum allowed length for input prompts in the dataset. Prompts exceeding this limit are filtered out during training. If your dataset contains unusually long prompts or other outliers, set an appropriate value that excludes the outliers while retaining most samples. Otherwise, you can set it to the length of the longest prompt in your dataset. In contrast, inferenceMaxTokens defines the maximum response length for any rollout generated during RL training. You can use this argument to control whether the resulting model generates detailed outputs or concise answers. We recommend choosing a value based on the requirements of your task: an excessively large value can increase training time, while too small a value can degrade model performance. For tasks that don't require complex reasoning, setting the maximum response length to 1,024 is usually sufficient. For challenging tasks like coding or long-form generation, a larger upper bound (more than 4,096) is preferable.
Early stopping and evaluation interval
The RFT service provides two features that optimize training efficiency and model quality. EarlyStopping (enabled by default) automatically stops training when performance improvements plateau, preventing overfitting and reducing unnecessary computation costs. The system continuously monitors validation metrics and terminates training after it detects that further iterations are unlikely to yield meaningful improvements. Meanwhile, evalInterval determines how frequently the model evaluates its performance on the validation dataset during training. This hyperparameter is automatically calculated as min(10, data_size/batch_size), guaranteeing at least one evaluation per epoch while maintaining a reasonable frequency. For datasets where data_size significantly exceeds 10×batch_size, evaluations typically occur every 10 steps, providing sufficient monitoring granularity without excessive overhead.
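The default can be reproduced in a couple of lines; the integer rounding shown is an assumption, and the service documentation has the authoritative formula:

```python
def eval_interval(data_size: int, batch_size: int) -> int:
    """Default evaluation interval as described in this post:
    min(10, data_size / batch_size), so there is at least one
    evaluation per epoch and at most one every 10 steps."""
    return min(10, max(1, data_size // batch_size))
```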
RFT metrics and their meaning
Amazon Bedrock exposes several training metrics through Amazon CloudWatch and the Amazon Bedrock console that give you a clear picture of whether your RFT job is progressing as expected. Understanding what each metric represents and what anomalies to watch for makes the difference between catching a problem early and waiting hours for a failed run to finish.
Training and validation rewards
The training reward is the average reward on the episodes that you're training on. The validation reward is the same metric on a held-out set of prompts that don't contribute gradients. In a healthy run, the training reward should climb steadily early on, with the validation reward rising more slowly but in the same general direction.
Train and validation episode lengths
These track the average number of tokens generated per response. Use them to detect verbosity hacking: if lengths explode while rewards improve, the model has discovered that longer = better regardless of quality. In reasoning tasks (like chain-of-thought (CoT)), a gradual increase is healthy (learning to think), but a sudden vertical spike usually indicates a loop or failure. In some cases, you will see a gradual decrease, and that's fine too; it could mean the model was initially exploring more to reach the answer but later figured out shorter yet rewarding trajectories.
Policy entropy
Policy entropy measures how confident the model is in its outputs. High entropy means the model is uncertain and still exploring, while low entropy means it's converging on consistent responses. Over a healthy training run, you'd expect a gentle decline from the initial baseline to a stable plateau as the model learns. A sharp drop to near zero is a warning sign: it typically indicates that the model has collapsed into repeating a single response rather than reasoning through problems. On the other end, a flat line at a consistently high value suggests the model is ignoring the reward signal entirely and not learning from feedback.
Gradient norm
This is the magnitude (L2 norm) of the gradients applied to the model at each update. In a stable run it fluctuates within a reasonable band, with occasional spikes; sustained growth or extreme spikes can indicate issues with the learning rate, reward scaling, or numeric stability.
Common pitfalls
Even well-configured RFT jobs can run into failure modes that aren't always obvious from the metrics alone. The two most common are reward hacking, where the model learns to game the reward function rather than improve on the actual task, and reward instability, where high variance in the reward signal undermines the learning process. Both are recoverable, but easier to address if you know what to look for.
Reward hacking
This occurs when the policy learns to exploit weaknesses in the reward function to maximize scores without improving quality. You will see training rewards climb steadily while human evaluation scores degrade or plateau. To mitigate this, make sure that the reward function captures all aspects of the behavior you want encoded through fine-tuning. If not, observe the model generations and iterate on the reward function. Use strict length penalties in the reward function if needed.
Reward variance and instability
Even with a good average reward, high fluctuation in scores for similar inputs creates a noisy signal that destabilizes training. This manifests as jittery reward curves and wildly oscillating loss metrics. The first line of defense is rigorous normalization: standardize rewards (zero mean, unit variance) within every batch, clip extreme outliers, and make sure your reward inference is deterministic (no dropout), so the optimizer receives a consistent and stable learning signal.
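A batch-level standardization sketch of that idea (the clip threshold of 3 standard deviations is an illustrative choice):

```python
import statistics

def standardize_rewards(rewards, clip=3.0):
    """Normalize a batch of rewards to zero mean and unit variance,
    then clip extreme outliers so a single odd score cannot dominate
    the policy update."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std < 1e-8:
        # Constant rewards carry no learning signal for this batch.
        return [0.0] * len(rewards)
    return [max(-clip, min(clip, (r - mean) / std)) for r in rewards]
```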
Conclusion
In this post, we demonstrated how to apply Reinforcement Fine-Tuning (RFT) in Amazon Bedrock to improve model performance using feedback-driven training. Using the GSM8K mathematical reasoning dataset as a concrete example, we showed where RFT is most effective, how to structure training datasets, and how to design reward functions that reliably evaluate model outputs. We also explored how to monitor training progress using Bedrock's training metrics and provided practical hyperparameter tuning guidelines informed by experiments across multiple models and use cases. Together, these elements form the core foundation for running successful RFT workflows: when datasets are well structured, reward functions capture the right notion of quality, and training metrics are monitored carefully, RFT can significantly improve model performance across both verifiable tasks (such as reasoning, coding, and structured extraction) and subjective tasks using AI feedback.
Next steps
Ready to start customizing with RFT in Amazon Bedrock? Log in to the Amazon Bedrock console or review the official AWS API docs and create your first RFT training job using the open-source models supported for this use case.
To begin:
- Explore the documentation: Visit the comprehensive guides and tutorials: Create a reinforcement fine-tuning job
- Try the sample notebooks: Access ready-to-run examples in the AWS Samples GitHub repository
- Experiment with your own workloads: Apply the dataset preparation, reward design, and hyperparameter tuning practices covered in this post to your own use cases.
Acknowledgement
Thank you to the contributors from the Amazon Bedrock Applied Scientist team, Zhe Wang and Wei Zhu, whose experimental work served as the foundation for many of the best practices listed in this blog post.
About the authors
Nick McCarthy
Nick McCarthy is a Senior Generative AI Specialist Solutions Architect on the Amazon Bedrock team, based out of the AWS New York office. He helps customers customize their generative AI models on AWS. He has worked with clients across a wide range of industries, including healthcare, finance, sports, telecommunications, and energy, helping them accelerate business outcomes through the use of AI and machine learning. He holds a Bachelor's degree in Physics and a Master's degree in Machine Learning from UCL, London.
Shreyas Subramanian
Shreyas Subramanian is a Principal Data Scientist who helps customers use generative AI and deep learning to solve their business challenges with AWS services like Amazon Bedrock and AgentCore. Dr. Subramanian contributes to cutting-edge research in deep learning, agentic AI, foundation models, and optimization techniques, with several books, papers, and patents to his name. In his current role at Amazon, Dr. Subramanian works with various science leaders and research teams inside and outside Amazon, helping to guide customers to best leverage state-of-the-art algorithms and techniques to solve business-critical problems. Outside AWS, Dr. Subramanian is a reviewer for AI papers and funding through organizations like NeurIPS, ICML, ICLR, NASA, and NSF.
Sapana Chaudhary
Sapana Chaudhary is an Applied Scientist II at Amazon Web Services (AWS), where she works on reinforcement learning post-training of large language models. Her research sits at the intersection of reinforcement learning, robustness, and language models, with the goal of making AI systems more reliable and trustworthy for downstream tasks, whether through constrained optimization, risk-aware fine-tuning, or verifiable reasoning. Sapana holds a PhD from Texas A&M University (TAMU). Outside of work, she likes to hike, cook, paint, and photograph.
Jennifer Zhu
Jennifer Zhu is an Applied Science Manager at AWS, where she leads the model customization services, including Reinforcement Fine-Tuning on Amazon Bedrock. At AWS, Jennifer works on LLM fine-tuning and distillation, with a focus on building production-grade infrastructure for model post-training at scale. Jennifer holds a PhD from Cornell University and a master's degree from the University of San Francisco. Outside of work, she enjoys reading books and watching tennis matches.

