Enterprises building AI agents often require more than what managed foundation model (FM) services can provide. They need precise control over performance tuning, cost optimization at scale, compliance and data residency, model selection, and networking configurations that integrate with existing security architectures. Amazon SageMaker AI endpoints align with these requirements by giving organizations control over compute resources, scaling behavior, and infrastructure placement, while benefiting from the managed operational layer of AWS. Models deployed through SageMaker AI can power AI agents, handle conversational workloads, and integrate with orchestration frameworks just like the FMs available on Amazon Bedrock. The difference is that the organization retains architectural control over how and where inference happens.
In this post, we demonstrate how to build AI agents using the Strands Agents SDK with models deployed on SageMaker AI endpoints. You will learn how to deploy foundation models from SageMaker JumpStart, integrate them with Strands Agents, and establish production-grade observability using SageMaker Serverless MLflow for agent tracing. We also cover how to implement A/B testing across multiple model variants and evaluate agent performance using MLflow metrics, and show how you can build, deploy, and continuously improve AI agents on infrastructure you control.
Strands Agents SDK is an open source SDK that takes a model-driven approach to building and running AI agents in only a few lines of code. Strands scales from simple to complex agent use cases, and from local development to deployment in production.
Amazon SageMaker JumpStart is a machine learning (ML) hub that can help you accelerate your ML journey. With SageMaker JumpStart, you can evaluate, compare, and select FMs quickly based on predefined quality and responsibility metrics to perform tasks like article summarization and image generation.
SageMaker AI MLflow is a managed capability that streamlines the machine learning lifecycle through experiment tracking, model versioning, and deployment management.
In this post, we:
- Deploy models on SageMaker AI – Deploy foundation models from SageMaker JumpStart.
- Integrate Strands with SageMaker AI – Use deployed SageMaker AI models with Strands Agents.
- Set up agent observability – Configure the SageMaker AI MLflow App for agent tracing.
- Implement A/B testing with evaluation – Deploy multiple model variants and evaluate the agents with MLflow metrics.
A Jupyter notebook with the complete code for this post can be found in the GitHub repo.
Building your first Strands agent
A Strands agent puts together a model, a system prompt, and a set of tools to build a simple AI agent. Strands offers many model providers, including Amazon SageMaker AI. It also provides many commonly used tools as part of the strands-agents-tools package, so organizations can quickly build AI agents for their business needs.
The following code snippet shows how to create your first agent using the Strands Agents SDK. Detailed samples of agents built using the Strands Agents SDK can be found in the GitHub repo.
from strands import Agent
from strands.models import BedrockModel
from strands_tools import http_request

model = BedrockModel(
    model_id="us.anthropic.claude-sonnet-4-5-20250929-v1:0"
)
agent = Agent(model=model, tools=[http_request])
agent("Where is the international space station now?")
This agent uses the Claude Sonnet 4.5 cross-Region inference model on Amazon Bedrock. Amazon Bedrock gives you a choice of models; the model IDs and the list of available inference profiles can be found in the Amazon Bedrock User Guide.
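If you prefer to discover the available cross-Region inference profiles programmatically rather than in the documentation, the Amazon Bedrock control-plane API exposes them. The following sketch uses the boto3 bedrock client's list_inference_profiles call; the specific filter value and response fields shown are assumptions you should verify against the current Boto3 documentation.

import boto3

# List system-defined cross-Region inference profiles in the current Region
bedrock = boto3.client("bedrock")
profiles = bedrock.list_inference_profiles(typeEquals="SYSTEM_DEFINED")
for p in profiles["inferenceProfileSummaries"]:
    print(p["inferenceProfileId"])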
Why use models deployed on SageMaker AI?
Organizations can consider deploying foundation models on SageMaker AI for the following reasons:
- Infrastructure control – SageMaker AI provides control over compute instances, networking configurations, and scaling policies. This is crucial for organizations with strict latency SLAs or specific hardware requirements.
- Model flexibility – SageMaker AI allows deployment of diverse models, whether it's a custom architecture, a fine-tuned variant, or an open source alternative like Llama or Mistral.
- Cost predictability – SageMaker AI dedicated endpoints enable precise cost forecasting and optimization through reserved instances, spot pricing, and right-sized compute resources. This can be especially useful for high-volume workloads.
- Advanced MLOps – The SageMaker AI integration with the MLflow model registry and A/B testing capabilities provides the enterprise-grade model governance that many organizations require for production AI systems.
Building a Strands agent with SageMaker AI models
The Strands Agents SDK implements a SageMaker AI provider, so you can run agents against models deployed on SageMaker AI inference endpoints. This includes both pre-trained models from SageMaker JumpStart and custom fine-tuned models. The model that you use with the Strands Agents SDK should support an OpenAI-compatible chat completions API. In this post, we show how the Qwen3 4B and Qwen3 8B models available on SageMaker JumpStart are used with Strands Agents.
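If you are unsure whether a deployed model container exposes the expected interface, you can send it an OpenAI-style chat completions payload directly. The following sketch is a quick compatibility check under that assumption; the endpoint name is the one we create in Step 2 below, and the exact request and response schema depend on the serving container.

import json
import boto3

smr = boto3.client("sagemaker-runtime")

# OpenAI-style chat completions payload; compatible containers accept a
# "messages" list and return a "choices" array
payload = {
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 32,
}

response = smr.invoke_endpoint(
    EndpointName="llm-qwen-endpoint-sagemaker",  # endpoint created in Step 2
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))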
Prerequisites
To run the code from this post, you must have the following:
- An AWS account with access to Amazon Bedrock and Amazon SageMaker AI.
- A role with access to SageMaker AI, Amazon Bedrock models, SageMaker AI Serverless MLflow, Amazon Simple Storage Service (Amazon S3), and Amazon SageMaker JumpStart. You can use a trust policy to allow assuming that role.
- A Jupyter notebook running locally on your desktop or on SageMaker AI Studio. When running locally, make sure that you authenticate to your AWS account and assume the role that has the required permissions, as sketched after this list.
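When working locally, one way to assume the execution role is through AWS STS. The following sketch is illustrative; the role ARN and session name are placeholders you would replace with your own, and the resulting role variable is reused later when creating the MLflow App.

import boto3

# Assume the execution role that has the permissions listed above
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/YourSageMakerExecutionRole",  # placeholder
    RoleSessionName="strands-agents-demo",
)["Credentials"]

# Build a session with the temporary credentials
session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# The same role ARN is used later as RoleArn for the MLflow App
role = "arn:aws:iam::123456789012:role/YourSageMakerExecutionRole"  # placeholder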
Step 1: Install required packages
First, we install the required Python packages into the environment.
%%writefile requirements.txt
strands-agents>=1.9.1
strands-agents-tools>=0.2.8
mlflow>=3.4.0
strands-agents[sagemaker]
mlflow-sagemaker>=1.5.11

pip install -r requirements.txt
Step 2: Deploy a model as a SageMaker AI endpoint
Now that the packages are available, we use the SageMaker JumpStart API to deploy the Qwen3-4B model as a SageMaker AI endpoint.
# Deploy initial endpoint with Qwen3-4B
import sagemaker
import boto3
from boto3.session import Session
from sagemaker.jumpstart.model import JumpStartModel

boto_session = Session()
sts = boto3.client('sts')
account_id = sts.get_caller_identity().get("Account")
region = boto_session.region_name

ENDPOINT_NAME = INITIAL_CONFIG_NAME = "llm-qwen-endpoint-sagemaker"
# We will keep using this endpoint name

model_a = JumpStartModel(
    model_id="huggingface-reasoning-qwen3-4b",
    model_version="1.0.0",
    name="qwen3-4b-model"
)

# Deploy the model to an endpoint
predictor_a = model_a.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name=ENDPOINT_NAME
)
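Deployment takes several minutes. If you want to confirm the endpoint is in service before wiring it into an agent (for example, when running notebook cells out of order), you can poll its status with the standard SageMaker waiter, which we also use later in this post:

# Wait until the endpoint is InService before invoking it
waiter = boto3.client("sagemaker").get_waiter("endpoint_in_service")
waiter.wait(EndpointName=ENDPOINT_NAME)
print(f"Endpoint {ENDPOINT_NAME} is in service")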
Step 3: Use the model with a Strands agent
With the model deployed, we create a SageMakerAIModel and use it with Strands Agents.
from strands.models.sagemaker import SageMakerAIModel
from strands import Agent, tool
from strands_tools import http_request, calculator

model_sagemaker = SageMakerAIModel(
    endpoint_config={
        "endpoint_name": ENDPOINT_NAME,
        "region_name": region
    },
    payload_config={
        "max_tokens": 2048,
        "temperature": 0.2,
        "stream": True,
    }
)

# Test the agent
agent = Agent(model=model_sagemaker, tools=[http_request])
agent("Where is the international space station now? (Use: http://api.open-notify.org/iss-now.json)")
For more information about SageMaker AI as a model provider for Strands Agents, see Amazon SageMaker on the Strands Agents website.
Using the SageMaker AI Serverless MLflow App for agent observability
SageMaker AI Serverless MLflow helps provide comprehensive observability for AI agents by automatically capturing execution traces, tool usage patterns, and decision-making workflows without requiring custom instrumentation. The managed service helps reduce operational overhead while offering native integration with the Strands Agents SDK, which enables tracking of agent conversation flows. With this centralized observability service, teams can monitor agent behavior across multiple deployments, identify performance bottlenecks, and maintain audit trails for compliance requirements.
Step 1: Setting up the SageMaker AI Serverless MLflow App
The first step of setting up observability for your AI agent is to create an MLflow App. You have two primary options for deployment:
- Using the intuitive SageMaker AI Studio UI for quick setup with guided configuration
- Using Boto3 for programmatic deployment that enables automation and infrastructure-as-code practices
Both approaches create a Serverless MLflow App, so you can focus on building and monitoring your Strands Agents rather than managing the underlying observability infrastructure. In this post, we use the Boto3 SDK to deploy an MLflow App.
# Create S3 bucket for MLflow artifacts
s3_client = boto3.client('s3', region_name=region)
bucket_name = f'{account_id}-mlflow-bucket'

if region == 'us-east-1':
    s3_client.create_bucket(Bucket=bucket_name)
else:
    s3_client.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={'LocationConstraint': region}
    )

# Create SageMaker client
sagemaker_client = boto3.client('sagemaker')

# Create MLflow App
mlflow_app_details = sagemaker_client.create_mlflow_app(
    Name="strands-mlflow-app",
    ArtifactStoreUri=f's3://{bucket_name}/artifacts',
    RoleArn=role,  # IAM role ARN with the permissions listed in the prerequisites
)
print(f"MLflow app creation initiated: {mlflow_app_details['Arn']}")
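App creation is asynchronous, so the tracking ARN may not be usable immediately. The following polling sketch assumes a describe_mlflow_app call that returns a Status field; verify both names against the current Boto3 documentation before relying on them.

import time

# Poll until the MLflow App finishes provisioning (API call and field names assumed)
while True:
    app = sagemaker_client.describe_mlflow_app(Arn=mlflow_app_details["Arn"])
    status = app["Status"]
    print(f"MLflow App status: {status}")
    if status != "Creating":
        break
    time.sleep(30)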
Step 2: Set up the MLflow App for Strands Agent tracing
Now that you've created an MLflow App, you enable automatic logging for Strands Agents so that agent interactions, tool usage, and performance metrics are automatically captured and sent to the MLflow App without requiring manual instrumentation.
import os
import mlflow

tracking_uri = mlflow_app_details['Arn']
print(f"MLflow App URL: {tracking_uri}")

# Set the MLflow tracking URI
os.environ["MLFLOW_TRACKING_URI"] = tracking_uri
# Or you can set the tracking server as below:
# mlflow.set_tracking_uri(tracking_uri)

mlflow.set_experiment("Strands-MLflow")  # This experiment name will be used in the UI
mlflow.strands.autolog()
Step 3: Run the agent
With the MLflow App set up and autologging enabled, you're now ready to invoke your Strands agent exactly as you did at the beginning.
def capitalize(response):
    return response.upper()

agent = Agent(model=model_sagemaker, tools=[http_request])
response = agent("Where is the international space station now?")
capitalize(response.message['content'][0]['text'])
Traces and metrics will be available in your MLflow App, which can be accessed using a presigned URL built from the mlflow_app_details and AWS Region variables.
# Get a presigned URL for the MLflow App
presigned_response = sagemaker_client.create_presigned_mlflow_app_url(
    Arn=mlflow_app_details['Arn']
)
mlflow_ui_url = presigned_response['AuthorizedUrl']
print(f"MLflow UI URL: {mlflow_ui_url}")
Step 4: Review traces in MLflow
After the agent run, your agent trace, tool calls, and other metrics will be available through the Traces section of the MLflow App UI.
The traces in an experiment are available in a list view for further inspection.
When you select a specific trace, you have the option to see the details as an Execution Timeline or as a Span Tree. In both views, you can see the agent loop, tool calls, the input/output of each step, and other information.
Manual tracing
While the previous code block called capitalize(response.message['content'][0]['text']), the capitalize function isn't visible in the MLflow trace. MLflow's automatic Strands tracing logs the agent invocation (and its tool and FM invocations); other function calls aren't logged. If you need a trace for an entire block of code, you can use MLflow's manual tracing capability, as shown in the following code block.
@mlflow.trace(span_type="func", attributes={"operation": "capitalize"})
def capitalize(response):
    return response.upper()

@mlflow.trace
def run_agent():
    agent = Agent(model=model_sagemaker, tools=[http_request])
    mlflow.update_current_trace(request_preview="Run Strands Agent")
    response = agent("Where is the international space station now? (Use: http://api.open-notify.org/iss-now.json)")
    capitalized_response = capitalize(response.message['content'][0]['text'])
    return capitalized_response

# Execute the traced function
capitalized_response = run_agent()
print(capitalized_response)
In this code block, the capitalize function is decorated with @mlflow.trace to make sure that its execution, input, and output are visible in MLflow. As a result, you will see a trace in MLflow that includes both the agent invocation and the capitalize function call.
Deploying a new LLM for A/B testing
With Amazon SageMaker AI, you can optimize large language models (LLMs) for your agent applications. For example, to enhance your application's performance by upgrading from the smaller Qwen3-4B model to the larger Qwen3-8B model, you don't have to perform a complete migration immediately. Because your current agent with Qwen3-4B is functioning well, you can deploy the new Qwen3-8B model alongside it and distribute traffic between both LLM variants. With this approach, you can conduct A/B testing to measure the impact and effectiveness of the larger model before fully committing to the upgrade. First, deploy a new model behind the same endpoint as a second variant:
# Step 1: Create a model from JumpStart
model_b_name = "sagemaker-strands-demo-qwen3-8b"
model_b_id, model_b_version = "huggingface-reasoning-qwen3-8b", "1.0.0"

model_b = JumpStartModel(
    model_id=model_b_id,
    model_version=model_b_version,
    name=model_b_name
)
model_b.create(instance_type="ml.g5.2xlarge")

# Step 2: Create production variants for A/B testing
production_variants = [
    # The original model (champion)
    {
        "VariantName": "qwen-4b-variant",
        "ModelName": "qwen3-4b-model",
        "InitialInstanceCount": 1,
        "InstanceType": "ml.g5.2xlarge",
        "InitialVariantWeight": 0.5  # This will take 50% of the traffic
    },
    # The new model (challenger)
    {
        "VariantName": "qwen-8b-variant",
        "ModelName": model_b_name,
        "InitialInstanceCount": 1,
        "InstanceType": "ml.g5.2xlarge",
        "InitialVariantWeight": 0.5  # This will take 50% of the traffic
    }
]

# Step 3: Create a new endpoint configuration
sagemaker_client = boto3.client('sagemaker')
ENDPOINT_CONFIG_AB_TESTING = "llm-endpoint-config-ab"
sagemaker_client.create_endpoint_config(
    EndpointConfigName=ENDPOINT_CONFIG_AB_TESTING,
    ProductionVariants=production_variants
)

# Step 4: Update the endpoint with the new A/B testing configuration
sagemaker_client.update_endpoint(
    EndpointName=ENDPOINT_NAME,  # Remember, the endpoint name stays the same
    EndpointConfigName=ENDPOINT_CONFIG_AB_TESTING
)

# Wait until the update is completed
waiter = boto3.client('sagemaker').get_waiter('endpoint_in_service')
waiter.wait(EndpointName=ENDPOINT_NAME)
After the update finishes, the agent that you created with this endpoint will use both LLMs with a 50/50 split. For a controlled experiment, you can create two new agents that point to specific variants.
from strands.models.sagemaker import SageMakerAIModel
from strands import Agent, tool
from strands_tools import http_request, calculator

model_sagemaker_a = SageMakerAIModel(
    endpoint_config={
        "endpoint_name": ENDPOINT_NAME,
        "region_name": region,
        "target_variant": "qwen-4b-variant"
    },
    payload_config={
        "max_tokens": 2048,
        "temperature": 0.2,
        "stream": True,
    }
)
model_sagemaker_b = SageMakerAIModel(
    endpoint_config={
        "endpoint_name": ENDPOINT_NAME,
        "region_name": region,
        "target_variant": "qwen-8b-variant"
    },
    payload_config={
        "max_tokens": 2048,
        "temperature": 0.2,
        "stream": True,
    }
)
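As a quick sanity check before running the full evaluation, you can send the same query to an agent backed by each variant and compare the responses informally. This brief sketch reuses the imports and models from the block above; the query is illustrative.

# Send the same query to both variants for an informal comparison
for label, model in [("4B", model_sagemaker_a), ("8B", model_sagemaker_b)]:
    agent = Agent(model=model, tools=[calculator])
    response = agent("What is 15% of 240? Use the calculator tool.")
    print(f"--- {label} variant ---\n{response}\n")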
Evaluating agents using model variants with MLflow GenAI
With both model variants deployed behind your A/B testing endpoint, the next step is to use them in Strands agents that are identical in every respect except the model itself. We then systematically evaluate these agents using MLflow's GenAI evaluation framework, which provides a structured approach using both custom metrics and LLM-based judges.
Create an evaluation dataset
To measure agent performance objectively, you need a consistent set of test cases that represent your real-world workloads. Each entry in the evaluation dataset consists of the input query along with expectations, the ground truth values that scorers will use to assess whether the agent responded correctly. This structure facilitates reproducible evaluations across model variants.
eval_dataset = [
    {
        "inputs": {"query": "Calculate 15% tip on a $85.50 bill. Use calculator tool"},
        "expectations": {
            "expected_tool": "calculator",
            "expected_facts": ["The tip amount is approximately $12.83"]
        }
    },
    {
        "inputs": {"query": "What is 2048 divided by 64? Use calculator tool"},
        "expectations": {
            "expected_tool": "calculator",
            "expected_facts": ["The answer is 32"]
        }
    }
    # Add more test cases...
]
Define evaluation scorers
Scorers determine how agent responses are assessed against your expectations. MLflow supports both custom scorers for deterministic checks (like verifying that the correct tool was used) and built-in LLM judges that evaluate subjective qualities like factual correctness and relevance. Combining these scorer types helps provide a comprehensive view of agent performance, from basic capability verification to nuanced response quality.
from mlflow.genai.scorers import scorer, Correctness, RelevanceToQuery
from mlflow.entities import Feedback

@scorer
def tool_selection_scorer(inputs, outputs, expectations):
    expected_tool = expectations.get("expected_tool", "")
    tool_used = expected_tool in outputs.get("tools", [])
    return Feedback(name="tool_selection", value=1.0 if tool_used else 0.0)
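Because the custom scorer is a plain function under the hood, you can exercise it on a hand-built example before running the full evaluation (MLflow's scorer objects can generally be invoked directly for testing); the sample values below are illustrative.

# Quick sanity check of the custom scorer with hand-built sample data
sample_feedback = tool_selection_scorer(
    inputs={"query": "What is 2048 divided by 64? Use calculator tool"},
    outputs={"outputs": "The answer is 32", "tools": ["calculator"]},
    expectations={"expected_tool": "calculator"},
)
print(sample_feedback.value)  # Expected: 1.0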
Run the evaluation
With your dataset and scorers defined, you can now run the evaluation against each model variant. The mlflow.genai.evaluate() function runs your agent on every test case, applies the scorers to assess the responses, and logs the results to MLflow for analysis. Running separate evaluations for each variant makes sure that you can directly compare their performance under identical conditions.
import mlflow
from strands import Agent
from strands_tools import calculator

mlflow.set_experiment("Strands_Agents_AB_Evaluation")

def predict_4b(query):
    agent = Agent(model=model_sagemaker_a, tools=[calculator])
    response = agent(query)
    return {"outputs": str(response), "tools": list(response.metrics.tool_metrics.keys())}

def predict_8b(query):
    agent = Agent(model=model_sagemaker_b, tools=[calculator])
    response = agent(query)
    return {"outputs": str(response), "tools": list(response.metrics.tool_metrics.keys())}

scorers = [
    tool_selection_scorer,
    Correctness(model="bedrock:/us.amazon.nova-pro-v1:0"),
    RelevanceToQuery(model="bedrock:/us.amazon.nova-pro-v1:0")
]

eval_results_4b = mlflow.genai.evaluate(data=eval_dataset, predict_fn=predict_4b, scorers=scorers)
eval_results_8b = mlflow.genai.evaluate(data=eval_dataset, predict_fn=predict_8b, scorers=scorers)
Compare results
After both evaluations complete, you can compare the aggregated metrics to determine which model variant performs better for your specific workloads. This comparison provides the data-driven evidence that you need to make informed decisions about model selection, rather than relying on assumptions or general benchmarks.
metrics_4b = eval_results_4b.metrics
metrics_8b = eval_results_8b.metrics
for metric in metrics_4b:
    print(f"{metric}: 4B={metrics_4b[metric]:.3f}, 8B={metrics_8b[metric]:.3f}")
You can also view detailed comparisons in the MLflow UI by navigating to the Evaluations tab and selecting both runs for side-by-side analysis.
Transition to the new model
If the new model proves to be better, you can transition to it by adjusting the variant weights:
production_variants = [
    {
        "VariantName": "qwen-8b-variant",
        "ModelName": model_b_name,
        "InitialInstanceCount": 1,
        "InstanceType": "ml.g5.2xlarge",
        "InitialVariantWeight": 1
    }
]
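The variant list on its own doesn't change the endpoint; you apply it the same way as during A/B setup, by creating a new endpoint configuration and updating the endpoint in place. The configuration name below is a placeholder:

# Create a final endpoint configuration and switch the endpoint to it
FINAL_CONFIG_NAME = "llm-endpoint-config-final"  # placeholder name
sagemaker_client.create_endpoint_config(
    EndpointConfigName=FINAL_CONFIG_NAME,
    ProductionVariants=production_variants
)
sagemaker_client.update_endpoint(
    EndpointName=ENDPOINT_NAME,  # Same endpoint name; agents keep working
    EndpointConfigName=FINAL_CONFIG_NAME
)
waiter = sagemaker_client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=ENDPOINT_NAME)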
Debugging issues with MLflow tracing
If you encounter "ImportError: cannot import name 'TokenUsageKey' from 'mlflow.tracing.constant' (/opt/conda/lib/python3.12/site-packages/mlflow/tracing/constant.py)" or other issues with tracing Strands Agents in MLflow, check the following:
- Check the version of MLflow installed. It should be 3.4.0 or greater; a quick version check is shown after this list.
- Make sure that the role you're using to execute the Strands agent has permission to:
  - read, write, and list the S3 bucket used as the artifact location for MLflow tracking
  - access the MLflow App
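A quick way to verify the installed MLflow version from within the notebook:

import mlflow

# MLflow 3.4.0 or greater is required for Strands autologging
print(mlflow.__version__)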
Clean up created resources
The following code deletes the SageMaker AI endpoint and the MLflow App that you created.
# Delete the endpoint
sagemaker_client.delete_endpoint(EndpointName=ENDPOINT_NAME)
sagemaker_client.delete_endpoint_config(EndpointConfigName=INITIAL_CONFIG_NAME)
sagemaker_client.delete_endpoint_config(EndpointConfigName=ENDPOINT_CONFIG_AB_TESTING)

# Delete MLflow App
sagemaker_client.delete_mlflow_app(
    Arn=mlflow_app_details["Arn"]
)
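To avoid residual charges, you may also want to remove the SageMaker models created from JumpStart and empty and delete the MLflow artifacts bucket. A sketch, assuming the names used earlier in this post:

# Delete the SageMaker models created from JumpStart
sagemaker_client.delete_model(ModelName="qwen3-4b-model")
sagemaker_client.delete_model(ModelName=model_b_name)

# Empty and delete the MLflow artifacts bucket
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket_name)
bucket.objects.all().delete()
bucket.delete()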
Conclusion
In this post, we explored how to build AI agents using the Strands Agents SDK with models deployed on Amazon SageMaker AI endpoints, while using SageMaker AI Serverless MLflow for comprehensive agent observability. This approach helps provide organizations with greater flexibility and control over their AI infrastructure.

Deploying models on SageMaker AI gives you precise control over compute resources, networking configurations, and scaling policies, which is especially valuable for organizations with specific performance, cost, or compliance requirements. The integration with MLflow provides robust observability capabilities, so you can track agent behavior, monitor performance metrics, and maintain audit trails.

The combination of the Strands Agents SDK, SageMaker AI, and MLflow creates a powerful framework for building, deploying, and monitoring AI agents that you can customize to meet your specific business needs.
Next steps
To get started with building your own AI agents using this approach, we recommend the following resources:
By following the steps outlined in this post, you can build sophisticated AI agents that leverage the full power of SageMaker AI's flexible infrastructure and MLflow's comprehensive observability capabilities. We're excited to see what you will build!
About the authors
Dheeraj Hegde
Dheeraj Hegde is a Sr. Specialist Solutions Architect at Amazon Web Services, focused on generative AI and machine learning. He helps customers design and build AI agents and agentic architectures, leveraging his deep machine learning background to deliver production-ready generative AI solutions.
Gi Kim
Gi Kim is a Sr. Specialist Solutions Architect at Amazon Web Services, specializing in generative AI and machine learning. He partners with customers to architect intelligent agent-based systems on AWS, combining his machine learning expertise with a passion for pushing the boundaries of what autonomous AI agents can accomplish.

