This hands-on guide walks through each step of fine-tuning an Amazon Nova model with the Amazon Nova Forge SDK, from data preparation to training with data mixing to evaluation, giving you a repeatable playbook you can adapt to your own use case. This is the second part in our Nova Forge SDK series, building on the SDK introduction and first part, which covered kicking off customization experiments.
The focus of this post is data mixing: the technique that lets you fine-tune on domain-specific data without sacrificing a model's general capabilities. In the previous post, we made the case for why this matters: mixing customer data with Amazon-curated datasets preserved near-baseline Massive Multitask Language Understanding (MMLU) scores while delivering a 12-point F1 improvement on a Voice of Customer classification task spanning 1,420 leaf categories. In contrast, fine-tuning an open-source model on customer data alone caused a near-total loss of general capabilities. Now we show you how to do it yourself.
Solution overview
The workflow consists of five stages:
- Environment setup – Install the Nova Forge SDK and configure AWS resources
- Data preparation – Load, sanitize, transform, validate, and split your training data
- Training configuration – Configure the Amazon SageMaker HyperPod runtime, MLflow tracking, and data mixing ratios
- Model training – Launch and monitor a supervised fine-tuning job with Low-Rank Adaptation (LoRA)
- Model evaluation – Run public benchmarks and domain-specific evaluations against the fine-tuned checkpoint
Prerequisites
Before you begin, make sure you have the following:
- An AWS account with access to Amazon Nova Forge
- A SageMaker HyperPod cluster provisioned with GPU instances. This walkthrough uses `ml.p5.48xlarge` instances. Setting up a HyperPod cluster involves configuring an Amazon Elastic Kubernetes Service (Amazon EKS) cluster, provisioning compute nodes, and creating execution roles. For detailed instructions, see Getting started with SageMaker HyperPod.
- An Amazon SageMaker MLflow application for experiment tracking
- An IAM role with permissions for SageMaker, Amazon Simple Storage Service (Amazon S3), and Amazon CloudWatch
- A SageMaker Studio notebook or similar Jupyter environment
Cost consideration: This walkthrough uses four `ml.p5.48xlarge` instances for training and one for evaluation. These are high-end GPU instances. We recommend starting with a short test run (max_steps=5) to validate your configuration before committing to a full training run. For current rates, see the Amazon SageMaker pricing page.
Step 1: Install the Nova Forge SDK and dependencies
The SDK requires the SageMaker HyperPod CLI tooling. Download and install it from the Nova Forge S3 distribution bucket (provided during your Nova Forge onboarding), or use the following installer script, which installs the dependencies from the private S3 bucket and sets up a virtual environment.
# Download the HyperPod CLI installer from GitHub (only applicable for Forge)
curl -O https://github.com/aws-samples/amazon-nova-samples/blob/main/customization/nova-forge-hyperpod-cli-installation/install_hp_cli.sh
# Run the installer
bash install_hp_cli.sh
Next, inside the same virtual environment, also install the Nova Forge SDK (nova-forge-sdk), which provides the high-level APIs for data preparation, training, and evaluation.
pip install --upgrade botocore awscli
pip install amzn-nova-forge
pip install datasets huggingface_hub pandas pyarrow
After all dependencies are installed, activate the virtual environment and register it as a kernel for use inside a Jupyter notebook environment.
source ~/hyperpod-cli-venv/bin/activate
pip install ipykernel
python -m ipykernel install --user --name=hyperpod-cli-venv --display-name="Forge (hyperpod-cli-venv)"
jupyter kernelspec list
Verify the installation:
from amzn_nova_forge import *
print("SDK imported successfully")
Step 2: Configure AWS resources
Create an S3 bucket for your training data and model outputs. Then, grant your HyperPod execution role access to it.
import boto3
import time
import json

TIMESTAMP = int(time.time())
S3_BUCKET = f"nova-forge-customisation-{TIMESTAMP}"
S3_DATA_PATH = f"s3://{S3_BUCKET}/demo/input"
S3_OUTPUT_PATH = f"s3://{S3_BUCKET}/demo/output"

sts = boto3.client("sts")
s3 = boto3.client("s3")
ACCOUNT_ID = sts.get_caller_identity()["Account"]
REGION = boto3.session.Session().region_name

# Create the S3 bucket
if REGION == "us-east-1":
    s3.create_bucket(Bucket=S3_BUCKET)
else:
    s3.create_bucket(
        Bucket=S3_BUCKET,
        CreateBucketConfiguration={"LocationConstraint": REGION}
    )

# Grant the HyperPod execution role access
# Replace the placeholder with your HyperPod execution role name
HYPERPOD_ROLE_ARN = f"arn:aws:iam::{ACCOUNT_ID}:role/<YOUR_HYPERPOD_EXECUTION_ROLE>"

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowHyperPodAccess",
        "Effect": "Allow",
        "Principal": {"AWS": HYPERPOD_ROLE_ARN},
        "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
        "Resource": [
            f"arn:aws:s3:::{S3_BUCKET}",
            f"arn:aws:s3:::{S3_BUCKET}/*"
        ]
    }]
}
s3.put_bucket_policy(Bucket=S3_BUCKET, Policy=json.dumps(bucket_policy))
Step 3: Prepare your training dataset
The Nova Forge SDK supports JSONL, JSON, and CSV input formats. In this walkthrough, we use the publicly available MedReason dataset from Hugging Face. The dataset contains medical reasoning data with roughly 32,700 question-answer pairs, which we use to demonstrate fine-tuning for a domain-specific use case.
Download and sanitize the data
The Nova Forge SDK enforces token-level validation on training data. Certain tokens conflict with the model's internal chat template, specifically the special delimiters Nova uses to separate system, user, and assistant turns during training. If your data contains literal strings like `System:` or `Assistant:`, the model may misinterpret them as turn boundaries, corrupting the training signal. The sanitization step below inserts a space before the colon (for example, `System:` → `System :`) to break the pattern match while preserving readability, and strips special tokens like `[EOS]` that have reserved meaning in the model's vocabulary.
from huggingface_hub import hf_hub_download
import pandas as pd
import json
import re

# Download the dataset
jsonl_path = hf_hub_download(
    repo_id="UCSC-VLAA/MedReason",
    filename="ours_quality_33000.jsonl",
    repo_type="dataset",
    local_dir="."
)
df = pd.read_json(jsonl_path, lines=True)

# Tokens that conflict with the model's chat template
# (the full list also includes model-reserved special tokens, such as
# angle-bracket delimiters, that did not survive into this draft)
INVALID_TOKENS = [
    "System:", "SYSTEM:", "User:", "USER:", "Bot:", "BOT:",
    "Assistant:", "ASSISTANT:", "Thought:", "[EOS]",
]

def sanitize_text(text):
    for token in INVALID_TOKENS:
        if ":" in token:
            word = token[:-1]
            text = re.sub(rf"\b{word}:", f"{word} :", text, flags=re.IGNORECASE)
        else:
            text = text.replace(token, "")
    return text.strip()

# Write sanitized JSONL
with open("training_data.jsonl", "w") as f:
    for _, row in df.iterrows():
        f.write(json.dumps({
            "question": sanitize_text(row["question"]),
            "answer": sanitize_text(row["answer"]),
        }) + "\n")

print(f"Dataset saved: training_data.jsonl ({len(df)} examples)")
Before moving on, it's worth double-checking that the sanitized file no longer contains any of the reserved keywords.
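A quick check along those lines might look like the following; this is a minimal sketch (not part of the SDK) that reuses the same token list as the sanitization step above:

```python
import json

RESERVED_TOKENS = [
    "System:", "SYSTEM:", "User:", "USER:", "Bot:", "BOT:",
    "Assistant:", "ASSISTANT:", "Thought:", "[EOS]",
]

def find_reserved(path):
    """Return (line_number, token) pairs for any reserved keyword still present."""
    hits = []
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            record = json.loads(line)
            text = " ".join(str(v) for v in record.values())
            for token in RESERVED_TOKENS:
                if token in text:
                    hits.append((i, token))
    return hits

# hits = find_reserved("training_data.jsonl")
# print("Clean" if not hits else f"Found reserved tokens: {hits[:5]}")
```

If this reports any hits, rerun the sanitization step before uploading the data.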
Load, transform, and validate with the SDK
The SDK provides a JSONLDatasetLoader that handles the conversion from your raw data format into the structure expected by Nova models. When you call transform(), the SDK wraps each question-answer pair into the Nova chat template format, which is the structured turn-based format that Nova models expect during training. Your raw data goes from simple Q&A pairs to fully formatted multi-turn conversations with the appropriate role tags and delimiters.
Before transform (your raw JSONL):
{
    "question": "What are the causes of chest pain in a 45-year-old patient?",
    "answer": "Chest pain in a 45-year-old can result from cardiac causes such as…"
}
After transform (Nova chat template format):
{
    "messages": [
        {"role": "user", "content": "What are the causes of chest pain in a 45-year-old patient?"},
        {"role": "assistant", "content": "Chest pain in a 45-year-old can result from cardiac causes such as…"}
    ]
}
The validate() method then checks the transformed data for issues, verifying that the chat template structure is correct, that no invalid tokens remain, and that the data conforms to the requirements for your chosen model and training method.
# Initialize the loader, mapping your column names
loader = JSONLDatasetLoader(
    question="question",
    answer="answer",
)
loader.load("training_data.jsonl")

# Preview raw data
loader.show(n=3)

# Transform into Nova's expected chat template format
loader.transform(method=TrainingMethod.SFT_LORA, model=Model.NOVA_LITE_2)

# Preview transformed data to verify the structure
loader.show(n=3)

# Validate: prints "Validation completed" if successful
loader.validate(method=TrainingMethod.SFT_LORA, model=Model.NOVA_LITE_2)

train_path = loader.save(f"{S3_DATA_PATH}/train.jsonl")
print(f"Training data: {train_path}")
Step 4: Configure and launch training with data mixing
When you enable data mixing, Nova Forge automatically blends your domain-specific training data with Amazon-curated datasets during fine-tuning. This prevents the model from forgetting its general capabilities while it learns your domain.
A note on training methods: LoRA vs. full-rank SFT
Nova Forge supports multiple fine-tuning approaches. In this walkthrough, we use supervised fine-tuning (SFT) with LoRA (TrainingMethod.SFT_LORA), a parameter-efficient method that updates only a small set of low-rank adapter weights rather than all model parameters. LoRA offers faster training and lower compute costs, and is the recommended starting point for most use cases.
Nova Forge also supports full-rank SFT, which updates all model parameters and can incorporate more domain knowledge. However, it requires more compute and is more prone to catastrophic forgetting (making data mixing even more important). The previous post in this series demonstrates results using full-rank SFT. Choose full-rank when LoRA doesn't achieve sufficient domain performance, or when you need deeper model adaptation.
Configure the runtime and MLflow
from amzn_nova_customization_sdk.model.model_enums import Platform

cluster_name = "nova-forge-hyperpod"
instance_type = "ml.p5.48xlarge"
instance_count = 4
namespace = "kubeflow"

runtime = SMHPRuntimeManager(
    instance_type=instance_type,
    instance_count=instance_count,
    cluster_name=cluster_name,
    namespace=namespace,
)

MLFLOW_APP_ID = ""  # e.g., "app-XXXXXXXXXXXX"
mlflow_app_arn = f"arn:aws:sagemaker:{REGION}:{ACCOUNT_ID}:mlflow-app/{MLFLOW_APP_ID}"
mlflow_monitor = MLflowMonitor(
    tracking_uri=mlflow_app_arn,
    experiment_name="nova-sft-datamix",
)
Create the customizer with data mixing enabled
Pass data_mixing_enabled=True when setting up the NovaModelCustomizer:
customizer = NovaModelCustomizer(
    model=Model.NOVA_LITE_2,
    method=TrainingMethod.SFT_LORA,
    infra=runtime,
    data_s3_path=f"{S3_DATA_PATH}/train.jsonl",
    output_s3_path=f"{S3_OUTPUT_PATH}/",
    mlflow_monitor=mlflow_monitor,
    data_mixing_enabled=True,
)
Understand and tune the data mixing configuration
Data mixing controls how training batches are composed. The customer_data_percent parameter determines what fraction of each batch comes from your domain data. The remaining fraction is filled by Nova-curated datasets, with each nova_*_percent parameter controlling the relative weight of that capability category within the Nova portion.
For example, with the configuration below:
- 50% of each training batch consists of your domain data
- 50% consists of Nova-curated data, distributed across capability categories according to their relative weights
The Nova-side percentages must sum to 100. Each value represents that category's share of the Nova-curated portion of the batch.
# View the default mixing ratios
customizer.get_data_mixing_config()
You can override these ratios based on your priorities:
customizer.set_data_mixing_config({
    "customer_data_percent": 50,
    "nova_agents_percent": 1,
    "nova_baseline_percent": 10,
    "nova_chat_percent": 0.5,
    "nova_factuality_percent": 0.1,
    "nova_identity_percent": 1,
    "nova_long-context_percent": 1,
    "nova_math_percent": 2,
    "nova_rai_percent": 1,
    "nova_instruction-following_percent": 13,
    "nova_stem_percent": 10.5,
    "nova_planning_percent": 10,
    "nova_reasoning-chat_percent": 0.5,
    "nova_reasoning-code_percent": 0.5,
    "nova_reasoning-factuality_percent": 0.5,
    "nova_reasoning-instruction-following_percent": 45,
    "nova_reasoning-math_percent": 0.5,
    "nova_reasoning-planning_percent": 0.5,
    "nova_reasoning-rag_percent": 0.4,
    "nova_reasoning-rai_percent": 0.5,
    "nova_reasoning-stem_percent": 0.4,
    "nova_rag_percent": 1,
    "nova_translation_percent": 0.1,
})
How to think about tuning the mix

| Parameter | What it controls | Guidance |
| --- | --- | --- |
| customer_data_percent | Share of your domain data in each training batch | Higher values drive stronger domain specialization but increase forgetting risk. 50% is a balanced starting point. |
| nova_instruction-following_percent | Weight of instruction-following examples in the Nova portion | Keep this high if your model needs to follow structured prompts or output formats in production. |
| nova_reasoning-*_percent | Weights for various reasoning capabilities (math, code, planning, and so on) | Increase these if your downstream tasks require multi-step reasoning. |
| nova_rai_percent | Responsible AI alignment data | Always keep this non-zero to preserve safety behaviors. |
| nova_baseline_percent | Core factual knowledge | Helps retain broad world knowledge. |

Tip: Start with the defaults, run a training job, evaluate on both your domain task and MMLU, then iterate. The Building specialized AI without sacrificing intelligence post shows that even a 75/25 customer-to-Nova split preserves near-baseline MMLU (0.74 vs. 0.75 baseline) while delivering a 12-point F1 improvement on a complex classification task.
Launch the training job
The overrides parameter lets you control key training hyperparameters:

| Parameter | Description | Guidance |
| --- | --- | --- |
| lr | Learning rate | 1e-5 is a reasonable default for LoRA fine-tuning. |
| warmup_steps | Steps to linearly ramp up the learning rate from 0 | Typically 5–10% of total steps. Set proportionally to max_steps. |
| global_batch_size | Number of examples per gradient update across all GPUs | Larger batches give more stable gradients but use more memory. |
| max_length | Maximum sequence length in tokens | Set based on your data. 65536 supports long-context use cases; reduce for shorter data to save memory and speed up training. |
| max_steps | Total training steps | Start small (5–10) to validate your setup, then increase. For ~23k training examples with batch size 32, one full epoch ≈ 720 steps. |
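The epoch estimate in the table comes from simple arithmetic you can reuse to size max_steps and warmup_steps for your own dataset (the ~23,000 figure is the approximate training-set size assumed here):

```python
import math

num_examples = 23_000        # approximate training examples after preparation
global_batch_size = 32

# Steps needed to see every example once
steps_per_epoch = math.ceil(num_examples / global_batch_size)

# Warmup sized at ~5% of one epoch, per the guidance above
warmup = math.ceil(0.05 * steps_per_epoch)

print(steps_per_epoch)  # 719, i.e., roughly 720 steps per epoch
print(warmup)           # 36
```

For the short validation run below, max_steps is deliberately far smaller than one epoch.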
training_config = {
    "lr": 1e-5,
    "warmup_steps": 2,
    "global_batch_size": 32,
    "max_length": 65536,
    "max_steps": 5,  # Start small to validate; increase for production runs
}

training_result = customizer.train(
    job_name="nova-forge-sft-datamix",
    overrides=training_config,
)
training_result.dump("training_result.json")
print("Training result saved")
Monitor training progress
You can monitor the job through the SDK or CloudWatch:
# Check job status
print(training_result.get_job_status())

# Stream recent logs
customizer.get_logs(limit=50, start_from_head=False)

# Or use the CloudWatch monitor
monitor = CloudWatchLogMonitor.from_job_result(training_result)
monitor.show_logs(limit=10)

# Poll until completion
import time
while training_result.get_job_status()[1] == "Running":
    time.sleep(60)
Training metrics (loss curves, learning rate schedule) are also available in your MLflow experiment for visualization and comparison across runs.
Step 5: Evaluate the fine-tuned model
Evaluation is critical when you use data mixing because you need to measure two things simultaneously: whether your model improved on your domain task, and whether it retained its general capabilities. If you measure only one axis, you can't tell if the mix is working. After training completes, retrieve the model checkpoint location from the output manifest:
from amzn_nova_forge.util.checkpoint_util import extract_checkpoint_path_from_job_output
checkpoint_path = extract_checkpoint_path_from_job_output(
output_s3_path=training_result.model_artifacts.output_s3_path,
job_result=training_result,
)
Configure the evaluation infrastructure
Evaluation requires only a single GPU instance (compared to four for training):
eval_infra = SMHPRuntimeManager(
    instance_type=instance_type,
    instance_count=1,
    cluster_name=cluster_name,
    namespace=namespace,
)

eval_mlflow = MLflowMonitor(
    tracking_uri=mlflow_app_arn,
    experiment_name="nova-forge-eval",
)

evaluator = NovaModelCustomizer(
    model=Model.NOVA_LITE_2,
    method=TrainingMethod.EVALUATION,
    infra=eval_infra,
    output_s3_path=f"s3://{S3_BUCKET}/demo/eval-outputs/",
    mlflow_monitor=eval_mlflow,
)
Run evaluations
Nova Forge supports three complementary evaluation approaches:
1. Public benchmarks (used to measure general capability retention)
These tell you whether data mixing is doing its job. If MMLU drops significantly from the baseline, your mix needs more Nova data. If IFEval drops, increase the instruction-following weight.
# MMLU: broad knowledge and reasoning across 57 subjects
mmlu_result = evaluator.evaluate(
    job_name="eval-mmlu",
    eval_task=EvaluationTask.MMLU,
    model_path=checkpoint_path,
)

# IFEval: ability to follow structured instructions
ifeval_result = evaluator.evaluate(
    job_name="eval-ifeval",
    eval_task=EvaluationTask.IFEVAL,
    model_path=checkpoint_path,
)
2. Bring your own data (measure domain-specific performance)
Use your held-out test set to measure whether fine-tuning improved performance on your actual task:
byod_result = evaluator.evaluate(
    job_name="eval-byod",
    eval_task=EvaluationTask.GEN_QA,
    data_s3_path=f"{S3_DATA_PATH}/eval/gen_qa.jsonl",  # S3_DATA_PATH already includes the s3:// prefix
    model_path=checkpoint_path,
    overrides={"max_new_tokens": 2048},
)
3. Large language model (LLM) as judge (for domains where automated metrics fall short, you can use another LLM to assess response quality)
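The post doesn't include judge code, but the idea can be sketched independently of the SDK: prompt a strong model with a grading rubric, then parse a numeric score from its reply. Everything below (the rubric wording, the call_judge stub, the SCORE format) is illustrative, not a Nova Forge API:

```python
import re

# Illustrative rubric; tailor the criteria to your domain
JUDGE_TEMPLATE = """You are grading a medical QA response.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Rate the model answer for correctness and completeness.
Reply with a line of the form: SCORE: <1-5>"""

def parse_score(judge_reply):
    """Extract the 1-5 score from the judge's reply; None if absent."""
    m = re.search(r"SCORE:\s*([1-5])", judge_reply)
    return int(m.group(1)) if m else None

def judge(question, reference, candidate, call_judge):
    # call_judge is whatever inference client you use (for example, an
    # Amazon Bedrock invoke wrapper); it takes a prompt and returns text
    prompt = JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )
    return parse_score(call_judge(prompt))

# Example with a stubbed judge model:
print(judge("Q", "ref", "cand", lambda p: "Reasoning...\nSCORE: 4"))  # 4
```

Averaging these scores over your held-out set gives a quality signal for open-ended answers that exact-match metrics miss.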
Check results and retrieve outputs
# Check job status
print(mmlu_result.get_job_status())
print(ifeval_result.get_job_status())
print(byod_result.get_job_status())

# Retrieve the S3 path containing detailed evaluation results
print(mmlu_result.eval_output_path)
The evaluation output path contains the detailed results as JSON. Download and inspect them to get the exact scores.
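Pulling the result files down can be sketched with boto3; the assumption here is simply that the results are JSON objects somewhere under eval_output_path, since the exact key layout may vary:

```python
import json

def split_s3_uri(uri):
    """Split "s3://bucket/prefix" into (bucket, prefix)."""
    bucket, _, prefix = uri.removeprefix("s3://").partition("/")
    return bucket, prefix

def fetch_eval_results(eval_output_path):
    """List JSON result files under the S3 output path and load them."""
    import boto3  # imported here so split_s3_uri stays dependency-free
    bucket, prefix = split_s3_uri(eval_output_path)
    s3 = boto3.client("s3")
    results = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".json"):
                body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
                results[obj["Key"]] = json.loads(body)
    return results

# Example (requires AWS credentials and a completed evaluation job):
# for key, metrics in fetch_eval_results(mmlu_result.eval_output_path).items():
#     print(key, metrics)
```

This is handy for diffing scores across runs outside the console.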
Additionally, metrics can be published to MLflow tracking servers by supplying the tracking server URI at job creation. With this approach, you can record and store your metrics for comparing experiments.
Interpreting your results
Use the following decision framework to guide your next iteration:

| Observation | What it means | What to adjust |
| --- | --- | --- |
| MMLU near baseline (e.g., within 0.01–0.02) | Data mixing is successfully preventing catastrophic forgetting | Your mix is working; focus on domain performance |
| MMLU significantly degraded | The model is forgetting general capabilities | Decrease customer_data_percent or increase Nova data weights |
| Domain task performance below expectations | The model isn't learning enough from your data | Increase customer_data_percent, add more training data, or increase max_steps |
| IFEval degraded | The model is losing instruction-following ability | Increase nova_instruction-following_percent |
| Both MMLU and domain task improved | Ideal outcome | Document your configuration and promote to production |
As a reference point, the previous post reports results for Amazon Nova 2 Lite on a VOC classification task.
The key takeaway is that fine-tuning with only customer data boosts domain F1 but significantly reduces general intelligence (MMLU drops from 0.75 to 0.47), whereas the blended approach (75% customer + 25% Nova data) recovers nearly all of the MMLU accuracy while still improving domain performance.
Best practices
- Start with the default mixing ratios. The defaults are tuned for a balanced trade-off. Only customize after you have baseline evaluation results to compare against.
- Always evaluate on both axes. Run at least one public benchmark (MMLU) alongside your domain-specific evaluation. Without both, you can't tell if the mix is working.
- Use MLflow to compare experiments. When iterating on mixing ratios and hyperparameters, MLflow makes it easy to compare runs side by side and identify the best configuration.
- Iterate on the mix, not just hyperparameters. If your model is forgetting general capabilities, adjusting the data mix is often more effective than tuning the learning rate or batch size.
- Start with LoRA, move to full-rank if needed. LoRA is faster and cheaper. Only move to full-rank SFT if LoRA doesn't achieve sufficient domain adaptation for your use case.
Cleaning up
To avoid ongoing charges, clean up the resources created during this walkthrough:
- Delete the S3 bucket and its contents.
- Stop or delete the SageMaker HyperPod cluster if it was created for this exercise.
- Delete the MLflow application if no longer needed.
Conclusion
In this post, we walked through the end-to-end workflow for fine-tuning Amazon Nova models using the Nova Forge SDK with data mixing enabled. The SDK handles data preparation, training orchestration on SageMaker HyperPod, and multi-dimensional evaluation, so you can focus on your data and your domain. Data mixing is what makes fine-tuning practical for production. Rather than choosing between domain expertise and general intelligence, you get both. The key is to treat it as an iterative process: train, evaluate on both axes, adjust the mix, and repeat until you find the right balance for your use case.
To get started, see the Nova Forge Developer Guide for detailed documentation, and explore the Nova Forge SDK for the full API reference.
About the authors
Gideon Teo is an FSI Solutions Architect at AWS in Melbourne, specialising in Amazon SageMaker AI and Amazon Bedrock. Passionate about both traditional AI/ML and generative AI, he helps financial institutions solve complex business challenges with cutting-edge technologies. Outside work, he enjoys time with family and friends, and exploring various technology domains.
Andrew Smith is a Sr. Cloud Support Engineer at AWS, based in Sydney, Australia. He specialises in helping customers with AI/ML workloads on AWS, with expertise in Amazon SageMaker AI, Amazon Bedrock, and LLM inference.
Timothy Downs is a Startup Solutions Architect at AWS in Melbourne who enjoys working on the bleeding edge of tech, often before it's fully baked.
Krishna Neupane is an Applied Scientist in Amazon's AGI Customization group, specializing in Nova model customization and data mixing.

