Moving AI agents from prototypes to production surfaces a problem that traditional testing cannot handle. Agents are flexible, adaptive, and context-aware by design, but the same qualities that make them powerful also make them difficult to evaluate systematically.
Traditional software testing relies on deterministic outputs: same input, same expected output, every time. AI agents break this assumption. They generate natural language, make context-dependent decisions, and produce varied outputs even from identical inputs. How do you systematically evaluate something that isn't deterministic?
In this post, we show how to evaluate AI agents systematically using Strands Evals. We walk through the core concepts, built-in evaluators, multi-turn simulation capabilities, and practical patterns for integration. Strands Evals provides a structured framework for evaluating AI agents built with the Strands Agents SDK, offering evaluators, simulation tools, and reporting capabilities. Whether you need to verify that your agent uses the right tools, produces helpful responses, or guides users toward their goals, the framework provides the infrastructure to measure and track these qualities systematically.
Why evaluating AI agents is different
When you ask an agent "What's the weather like in Tokyo?", many valid responses exist, and no single answer is definitively correct. The agent might report temperature in Celsius or Fahrenheit, include humidity and wind, or focus only on temperature. These variations could all be correct and helpful, which is exactly why traditional assertion-based testing falls short. Beyond text generation, agents also take action. A well-designed agent calls tools, retrieves information, and makes decisions throughout a conversation. Evaluating the final response alone misses whether the agent took appropriate steps to reach it.
Even correct responses can fall short. A response might be factually accurate but unhelpful, or helpful but unfaithful to source materials. No single metric captures these different quality dimensions. Conversations add another layer of complexity because they unfold over time. In multi-turn interactions, earlier responses affect later ones. An agent might handle individual queries well but fail to maintain coherent context across a conversation. Testing single turns in isolation misses these interaction patterns.
These characteristics demand evaluation that requires judgment rather than keyword comparison. Large language model (LLM)-based evaluation addresses this need. By using language models as evaluators, we can assess qualities like helpfulness, coherence, and faithfulness that resist mechanical checking. Strands Evals embraces this flexibility while still providing rigorous, repeatable quality assessments.
Core concepts of Strands Evals
Strands Evals follows a pattern that should feel familiar to anyone who has written unit tests, but adapts it for the judgment-based evaluation that AI agents require. The framework introduces three foundational concepts that work together: Cases, Experiments, and Evaluators.
Figure: High-Level Architecture
A Case represents a single test scenario. It contains the input you want to test, perhaps a user's query like "What's the weather in Paris?", along with optional expected outputs, expected tool sequences called trajectories, and metadata. Cases are the atomic unit of evaluation. Each one defines a scenario you want your agent to handle correctly.
from strands_evals import Case

case = Case(
    name="Weather Query",
    input="What is the weather like in Tokyo?",
    expected_output="Should include temperature and conditions",
    expected_trajectory=["weather_api"]
)
An Experiment bundles multiple Cases together with one or more evaluators. Think of it as a test suite in traditional testing. The Experiment orchestrates the evaluation process: it takes each Case, runs your agent on it, and applies the configured evaluators to score the results.
Evaluators are the judges. They examine what your agent produced (the actual output and trajectory) and compare it against what was expected. Unlike simple assertion checks, evaluators in Strands Evals are primarily LLM-based. They use language models to make nuanced judgments about quality, relevance, helpfulness, and other qualities that cannot be reduced to string comparison.
Separating these concerns keeps the framework flexible. You define what to test with Cases and how to test it with evaluators, while the framework handles orchestration and reporting through Experiments. Each piece can be configured independently, so you can build evaluation suites tailored to your specific needs.
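As a mental model, the relationship between the three concepts can be sketched in a few lines of plain Python. The toy names here (MiniCase, run_experiment, the echo task, and the scoring function) are ours for illustration, not the framework's API:

```python
from dataclasses import dataclass

# A miniature of the pattern: a case holds one scenario, an evaluator
# judges one output, and the experiment loop orchestrates both.
@dataclass
class MiniCase:
    name: str
    input: str
    expected_output: str = ""

def run_experiment(cases, task, evaluators):
    """For each case: run the task, then apply every evaluator to the result."""
    results = []
    for case in cases:
        output = task(case)
        scores = {ev.__name__: ev(case, output) for ev in evaluators}
        results.append({"case": case.name, "output": output, "scores": scores})
    return results

# Toy stand-ins for the agent and the LLM judge.
def echo_task(case):
    return f"Answer to: {case.input}"

def contains_input(case, output):
    return 1.0 if case.input in output else 0.0

results = run_experiment(
    [MiniCase(name="demo", input="weather in Tokyo")],
    echo_task,
    [contains_input],
)
```

The real framework replaces the toy judge with LLM-based evaluators, but the division of labor is the same.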
The task function: connecting agents to evaluation
Cases define your scenarios, and evaluators provide judgment. But how does your agent actually connect to this evaluation system? That's where the Task Function comes in.
A Task Function is a callable that you provide to the Experiment. It receives a Case and returns the results of running that case through your system. This interface enables two fundamentally different evaluation patterns.
Figure: Task Function Patterns
Online evaluation involves invoking your agent live during the evaluation run. Your Task Function creates an agent, sends it the case input, captures the response and execution trace, and returns them for evaluation. This pattern is recommended during development when you want to test changes immediately, or in continuous integration and delivery (CI/CD) pipelines where you need to verify agent behavior before deployment.
from strands import Agent

def online_task(case):
    agent = Agent(tools=[search_tool, calculator_tool])
    result = agent(case.input)
    return {
        "output": str(result),
        "trajectory": agent.session
    }
Offline evaluation works with historical data. Instead of invoking an agent, your Task Function retrieves previously recorded traces from logs, databases, or observability systems. It parses those traces into the format that evaluators expect and returns them for judgment. This pattern works well when you need to evaluate production traffic, perform historical analysis, or compare agent versions against the same set of real user interactions.
def offline_task(case):
    trace = load_trace_from_database(case.session_id)
    session = session_mapper.map_to_session(trace)
    return {
        "output": extract_final_response(trace),
        "trajectory": session
    }
Whether you are testing a new agent implementation or analyzing months of production data, the same evaluators and reporting infrastructure apply. The Task Function adapts your data source to the evaluation system.
Built-in evaluators for comprehensive assessment
With your Task Function connecting agent output to the evaluation system, you can now decide which aspects of quality to measure. Strands Evals ships with ten built-in evaluators, each designed to assess a different dimension of agent quality.
Figure: Evaluator Types
Rubric-based evaluators
The most flexible evaluators let you define custom criteria through natural-language rubrics.
- OutputEvaluator judges the final response that your agent produces. You provide a rubric, a description of what good looks like, and the evaluator uses an LLM to score the output against those criteria. This works well for general quality checks where you want to define specific standards for your use case.
from strands_evals.evaluators import OutputEvaluator

output_evaluator = OutputEvaluator(
    rubric="Score 1.0 if the response correctly answers the question and is well-structured. "
           "Score 0.5 if partially correct. Score 0.0 if incorrect or irrelevant."
)
- TrajectoryEvaluator extends this to examine the sequence of actions (typically tool calls) your agent took. Beyond just looking at the final answer, you can verify that the agent used appropriate tools in a logical order. The evaluator includes three built-in scoring functions for comparing actual versus expected trajectories: exact match, in-order match, and any-order match. These scorers are provided as tools to the evaluation LLM, which chooses the most appropriate one based on your rubric.
- InteractionsEvaluator handles multi-agent systems where multiple components communicate. It evaluates sequences of interactions between agents or system components. This is useful when your architecture involves orchestrators, sub-agents, or complex tool chains.
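To make the three trajectory-scoring strategies concrete, here is a plain-Python sketch of how they might behave. These are hypothetical re-implementations for intuition; the library's actual scorers, which are exposed as tools to the judging LLM, may differ in detail:

```python
from collections import Counter

# Illustrative versions of the three trajectory scorers (not the library's code).
def exact_match(actual, expected):
    """Same tools, same order, nothing extra."""
    return actual == expected

def in_order_match(actual, expected):
    """Expected tools appear in order; other calls may be interleaved."""
    remaining = iter(actual)  # membership tests consume the iterator, enforcing order
    return all(tool in remaining for tool in expected)

def any_order_match(actual, expected):
    """Every expected tool appears somewhere; order is ignored."""
    return not (Counter(expected) - Counter(actual))

actual = ["search", "weather_api", "calculator"]
assert in_order_match(actual, ["weather_api", "calculator"])   # interleaving allowed
assert any_order_match(actual, ["calculator", "search"])       # order ignored
assert not exact_match(actual, ["weather_api", "calculator"])  # extra call fails exact
```

The progression from exact to any-order match trades strictness for flexibility, which is why the rubric guides which scorer the evaluation LLM applies.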
Semantic evaluators
Some quality dimensions are common enough that Strands Evals provides pre-built evaluators with carefully designed prompts and scoring scales.
- HelpfulnessEvaluator assesses responses from the user's perspective using a seven-point scale ranging from "Not helpful at all" to "Above and beyond." It evaluates whether the response actually addresses the user's needs, not only whether it is technically correct.
- FaithfulnessEvaluator checks whether the response is grounded in the conversation history. This is particularly important for Retrieval Augmented Generation (RAG) systems where you need to make sure the agent doesn't hallucinate information. The five-point scale ranges from "Not at all" faithful to "Completely yes."
- HarmfulnessEvaluator performs safety checks, helping determine whether responses contain harmful, inappropriate, or dangerous content. It provides binary yes/no judgments for clear decision-making.
Tool-level evaluators
When your agent uses tools, you often need to evaluate not only the final outcome but also the quality of individual tool invocations.
- ToolSelectionAccuracyEvaluator examines each tool call in context and judges whether selecting that particular tool was justified given the conversation state. It answers: "At this point in the conversation, was it reasonable to call this tool?"
- ToolParameterAccuracyEvaluator goes deeper, checking whether the parameters passed to each tool were correct and appropriate. It helps catch subtle errors where the right tool was selected but called with wrong or incomplete arguments.
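For intuition, the mechanical half of a parameter check can be sketched as an ordinary schema comparison. This is a hypothetical helper, not the evaluator's implementation; the real evaluator uses an LLM precisely because "appropriate" depends on conversational context, which a static check cannot judge:

```python
# Hypothetical sketch: catch structurally wrong tool arguments before
# (or alongside) LLM-based judgment of whether the values made sense.
def check_tool_parameters(call, schema):
    """Return a list of problems with one tool call's arguments."""
    problems = []
    spec = schema.get(call["name"])
    if spec is None:
        return [f"unknown tool: {call['name']}"]
    for param, expected_type in spec.items():
        if param not in call["args"]:
            problems.append(f"missing parameter: {param}")
        elif not isinstance(call["args"][param], expected_type):
            problems.append(f"wrong type for {param}")
    return problems

schema = {"weather_api": {"city": str, "units": str}}
call = {"name": "weather_api", "args": {"city": "Tokyo"}}
issues = check_tool_parameters(call, schema)  # ["missing parameter: units"]
```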
Session-level evaluators
- GoalSuccessRateEvaluator takes the broadest view, evaluating entire conversation sessions to determine whether the user ultimately achieved their goal. For task-oriented agents, success is defined by outcomes rather than a single response.
Choosing the right evaluators
The choice depends on what matters most for your application. A customer service agent might prioritize helpfulness and goal success. A research assistant might emphasize faithfulness. Start with a small set of evaluators that cover your core quality dimensions, then add more as you learn how your agent fails.
Simulating users for multi-turn testing
The evaluators above work well for single-turn interactions where you provide an input, get an output, and evaluate it. Multi-turn conversations present a harder challenge. Real users don't follow scripts. They ask follow-up questions, change direction, and express confusion. How do you test this? Strands Evals includes an ActorSimulator that creates AI-powered simulated users to drive multi-turn conversations with your agent.
Figure: User Simulator Flow
ActorSimulator starts with a test case that defines what the user wants to achieve. From this, it generates a realistic user profile using an LLM, including personality traits, expertise level, communication style, and a specific goal. This profile shapes how the simulated user behaves throughout the conversation.
from strands_evals import Case, ActorSimulator
from strands import Agent

case = Case(
    input="I need help setting up a new checking account",
    metadata={"task_description": "Successfully open a checking account"}
)

user_sim = ActorSimulator.from_case_for_user_simulator(
    case=case,
    max_turns=10
)
During the interaction, the simulated user sends messages to your agent, receives responses, and decides what to say next. This loop continues until either the goal is achieved, indicated by emitting a special stop token, or the maximum turn count is reached.
agent = Agent(system_prompt="You are a helpful banking assistant.")

user_message = case.input
while user_sim.has_next():
    agent_response = agent(user_message)
    user_result = user_sim.act(str(agent_response))
    user_message = str(user_result.structured_output.message)
You can then pass the resulting conversation transcript to session-level evaluators like GoalSuccessRateEvaluator to assess whether your agent successfully helped the simulated user achieve their goal. Instead of manually writing multi-turn scripts, you define goals and let the simulator create realistic interaction patterns. It might ask unexpected follow-up questions, express confusion, or take the conversation in directions you didn't anticipate, catching edge cases that scripted tests can miss.
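The loop's termination logic can be illustrated with scripted stand-ins. Everything below (the STOP_TOKEN marker and both callables) is a toy assumption for illustration, not the ActorSimulator API:

```python
# Toy stand-ins showing the control flow: stop when the goal marker appears
# in the simulated user's message, or when the turn budget runs out.
STOP_TOKEN = "<GOAL_ACHIEVED>"  # hypothetical marker, not the real token

def run_simulation(agent, user, first_message, max_turns=10):
    transcript, user_message = [], first_message
    for _ in range(max_turns):
        agent_reply = agent(user_message)
        transcript.append(("user", user_message))
        transcript.append(("agent", agent_reply))
        user_message = user(agent_reply)
        if STOP_TOKEN in user_message:
            break
    return transcript

# Scripted "user" that declares the goal achieved after two agent replies.
replies = iter(["What documents do I need?", f"Thanks! {STOP_TOKEN}"])
transcript = run_simulation(
    agent=lambda msg: "Here is how to open an account.",
    user=lambda reply: next(replies),
    first_message="I need help opening an account",
)
```

The real simulator replaces the scripted replies with an LLM-driven persona, but the same two exit conditions govern the conversation.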
Evaluation levels: understanding the hierarchy
Whether using simulated or real conversations, different evaluators operate at different granularities. Strands Evals uses a TraceExtractor to parse session data into the format that each evaluator needs.
Session-level evaluation looks at the full conversation from beginning to end. The evaluator receives the complete history and the tool executions, and understands the entire context. GoalSuccessRateEvaluator works at this level because determining goal achievement requires understanding the whole interaction.
Trace-level evaluation focuses on individual turns, each user prompt and agent response pair. Evaluators at this level receive the conversation history up to that point and judge the specific response. The Helpfulness, Faithfulness, and Harmfulness evaluators work here because these qualities can be assessed turn by turn.
Tool-level evaluation drills down to individual tool invocations. Each tool call is evaluated in context, with access to the available tools, the conversation so far, and the specific arguments passed. The Tool Selection and Tool Parameter evaluators operate at this granularity.
You can use this hierarchical design to compose evaluation suites that check quality at multiple levels simultaneously. Within a single evaluation run, you can verify that individual tool calls are sensible, responses are helpful, and overall goals are achieved.
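A rough sketch of that slicing, with a hand-written session and hypothetical helper names (the real TraceExtractor's data model will differ):

```python
# One session, represented here as a simple list of role-tagged messages.
session = [
    {"role": "user", "content": "Weather in Tokyo?"},
    {"role": "tool", "name": "weather_api", "args": {"city": "Tokyo"}},
    {"role": "agent", "content": "It is 18C and cloudy."},
    {"role": "user", "content": "And tomorrow?"},
    {"role": "tool", "name": "weather_api", "args": {"city": "Tokyo"}},
    {"role": "agent", "content": "Tomorrow looks sunny."},
]

def trace_level(session):
    """Each (user prompt, agent response) turn."""
    users = [m for m in session if m["role"] == "user"]
    agents = [m for m in session if m["role"] == "agent"]
    return list(zip(users, agents))

def tool_level(session):
    """Each tool call, paired with the conversation up to that point."""
    return [(m, session[:i]) for i, m in enumerate(session) if m["role"] == "tool"]

# Session level keeps the whole list; the narrower levels slice it.
turns = trace_level(session)      # two user/agent turns
tool_calls = tool_level(session)  # two tool calls, each with its context
```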
Ground truth and expected behaviors
Across these evaluation levels, evaluators can benefit from having reference points for comparison. Strands Evals provides first-class support for ground truth through three expected fields on Cases.
The expected_output field specifies what the agent should say. This is useful when there are correct answers or standard response formats. The expected_trajectory field defines the sequence of tools or actions the agent should take. You might require that a customer service agent checks account status before making changes, or that a research agent queries multiple sources before synthesizing. Not every Case needs every field. You define expectations based on what matters for your evaluation goals. When expected values are provided, evaluators receive both expected and actual results, enabling comparison-based scoring alongside standalone quality assessment.
Putting it all together
Let's walk through a typical evaluation workflow to see how these concepts come together.
Figure: Evaluation Flow
First, you define your test cases, meaning the scenarios you want your agent to handle well. They might come from real user queries, synthetic generation, or edge cases you have identified.
from strands_evals import Experiment, Case
from strands_evals.evaluators import OutputEvaluator, TrajectoryEvaluator
from strands_evals.extractors import tools_use_extractor

cases = [
    Case(
        name="Weather Query",
        input="What is the weather like in Tokyo?",
        expected_output="Should include temperature and conditions",
        expected_trajectory=["weather_api"]
    ),
    Case(
        name="Calculator Usage",
        input="What is 15% of 847?",
        expected_output="127.05",
        expected_trajectory=["calculator"]
    )
]
Next, you configure evaluators with appropriate rubrics or settings.

output_evaluator = OutputEvaluator(
    rubric="Score 1.0 if the response is accurate and directly answers the question. "
           "Score 0.5 if partially correct. Score 0.0 if incorrect or irrelevant."
)

trajectory_evaluator = TrajectoryEvaluator(
    rubric="Verify the agent used appropriate tools for the task."
)
Then, you create an experiment bundling cases and evaluators.

experiment = Experiment(
    cases=cases,
    evaluators=[output_evaluator, trajectory_evaluator]
)
Finally, you run the evaluation with your Task Function and examine the results.

def my_task(case):
    agent = Agent(tools=[weather_tool, calculator_tool])
    result = agent(case.input)
    return {
        "output": str(result),
        "trajectory": tools_use_extractor.extract_agent_tools_used(agent.messages)
    }

reports = experiment.run_evaluations(my_task)
for report in reports:
    report.display()
The EvaluationReport provides overall scores, per-case breakdowns, pass/fail status, and detailed reasoning from each evaluator. You can display results interactively in the console, export them to JSON for further analysis, or integrate them into CI/CD pipelines. For larger test suites, Strands Evals supports asynchronous evaluation with configurable parallelism:

reports = await experiment.run_evaluations_async(my_task, max_workers=10)
Generating test cases at scale
The previous workflow assumes you already have test cases. Creating comprehensive test suites by hand becomes tedious as your agent's capabilities grow. Strands Evals includes an ExperimentGenerator that uses LLMs to create test cases and evaluation rubrics from high-level descriptions.
from strands_evals.generators import ExperimentGenerator
from strands_evals.evaluators import OutputEvaluator

generator = ExperimentGenerator(
    input_type=str,
    output_type=str,
    include_expected_output=True
)

experiment = await generator.from_context_async(
    context="A customer service agent for an e-commerce platform",
    task_description="Handle customer inquiries about orders, returns, and products",
    num_cases=20,
    evaluator=OutputEvaluator
)
The generator creates diverse test cases covering different aspects of the specified context, with appropriate difficulty levels. It can also generate evaluation rubrics tailored to the task. Generated cases are particularly valuable during early development, when you want broad coverage but haven't yet identified specific failure patterns. As your evaluation practice matures, supplement generated Cases with hand-crafted scenarios targeting known edge cases.
Integrating evaluation into your workflow
Evaluation delivers the most value as part of your regular development workflow. During development, run evaluations frequently as you make changes. Fast feedback helps you catch regressions early and understand how changes affect different quality dimensions.
In CI/CD pipelines, include evaluation as a quality gate before deployment. Set score thresholds that must be met for a build to pass. This helps prevent quality regressions from reaching production. For production monitoring, use offline evaluation to assess real user interactions periodically. This reveals patterns that development testing might miss: unusual queries, edge cases you didn't anticipate, or gradual drift in agent behavior. Track evaluation results over time. Trending metrics help you understand whether quality is improving or degrading.
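A threshold gate of this kind can be as simple as a few lines in your pipeline script. The metric names and threshold values below are illustrative assumptions, not framework APIs:

```python
# Illustrative CI quality gate: average each metric across cases and
# report any metric whose average falls below its threshold.
THRESHOLDS = {"output_score": 0.8, "trajectory_score": 0.9}  # example values

def quality_gate(case_results, thresholds):
    """Return {metric: average} for every metric that failed its threshold."""
    failures = {}
    for metric, minimum in thresholds.items():
        scores = [r[metric] for r in case_results]
        average = sum(scores) / len(scores)
        if average < minimum:
            failures[metric] = average
    return failures

case_results = [
    {"output_score": 1.0, "trajectory_score": 1.0},
    {"output_score": 0.5, "trajectory_score": 1.0},
]
failed = quality_gate(case_results, THRESHOLDS)  # {"output_score": 0.75}
# In a pipeline you would exit nonzero here when `failed` is non-empty.
```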
Best practices for agent evaluation
- Start small and iterate: Begin with a handful of test cases covering your most critical user scenarios. As you observe how your agent fails in practice, add targeted cases that address those specific failure modes. A focused test suite that catches real problems is more valuable than a large suite with poor coverage.
- Match evaluators to your quality goals: Choose evaluators that directly measure what matters for your use case. A customer-facing agent might prioritize HelpfulnessEvaluator and GoalSuccessRateEvaluator, while a research assistant might weigh FaithfulnessEvaluator more heavily. Avoid the temptation to add every available evaluator, as this increases cost and can dilute focus.
- Write clear, specific rubrics: Rubric-based evaluators are only as good as the rubrics you provide. Avoid vague criteria like "good response" in favor of specific, measurable standards. Include examples of what constitutes high, medium, and low scores. Test your rubrics on sample outputs before running full evaluations.
- Combine online and offline evaluation: Use online evaluation during development for fast feedback on code changes. Supplement this with offline evaluation of production traces to catch issues that only appear with real user behavior. The two approaches reveal different types of problems.
- Set meaningful thresholds: Define pass/fail thresholds based on your actual quality requirements, not arbitrary numbers. A 0.8 threshold means nothing if your users need 0.95 accuracy. Analyze evaluation results to understand which scores correlate with good user outcomes, then set thresholds accordingly.
- Track trends over time: Individual evaluation runs provide snapshots, but trends reveal the trajectory. Store evaluation results and track key metrics across releases. Gradual degradation can be harder to notice than sudden failures, but equally damaging.
- Invest in test case diversity: Cover the full range of inputs your agent will encounter: common queries, edge cases, adversarial inputs, and multi-turn conversations. Use the ExperimentGenerator for broad coverage, then supplement with hand-crafted Cases targeting known weaknesses.
- Evaluate at multiple levels: Session-level success can mask tool-level problems, and the reverse. An agent might achieve user goals through inefficient or incorrect intermediate steps. Compose evaluation suites that check quality at the session, trace, and tool levels to get a complete picture.
Conclusion
Building reliable AI agents requires more than intuition and spot checks. It requires systematic evaluation that tracks quality across multiple dimensions over time. Strands Evals helps provide this foundation through a framework designed specifically for the unique challenges of agent evaluation.
Task Functions separate agent invocation from evaluation logic, enabling both online testing during development and offline analysis of production traces. LLM-based evaluators provide the judgment that quality assessment requires. Hierarchical evaluation levels allow assessment at multiple granularities, from individual tool calls to entire conversation sessions. And the user simulator transforms multi-turn testing from a scripting exercise into realistic user behavior simulation.
These capabilities help you build confidence in your AI agents through evidence rather than assumptions. You can measure whether changes improve or degrade quality, catch regressions before they reach production, and demonstrate to stakeholders that your agents meet defined quality standards.
We encourage you to explore Strands Evals for your agent evaluation needs. The samples repository contains practical examples you can adapt to your own use cases. Start with a few test cases representing your most important user scenarios, add evaluators that match your quality criteria, and run evaluations as part of your development workflow. Over time, expand your test suite to cover more scenarios. Systematic evaluation is the foundation that helps you ship AI agents with confidence.
About the Authors
Ishan Singh
Ishan Singh is a Sr. Applied Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Akarsha Sehwag
Akarsha Sehwag is a Generative AI Data Scientist for the Amazon Bedrock AgentCore team. With over 6 years of expertise in AI/ML, she has built production-ready enterprise solutions across diverse customer segments in generative AI, deep learning, and computer vision domains. Outside of work, she likes to hike, bike, and play badminton.
Po-Shin Chen
Po-Shin Chen is a Software Developer specializing in agentic AI development and evaluations at Amazon Web Services. With a background in engineering and science, his work focuses on building core capabilities for the agentic framework (Strands SDK) and on leading and developing the agent evaluation framework (Strands Evals).
Jonathan Buck
Jonathan Buck is a Senior Software Engineer at Amazon Web Services. His work focuses on building agent environments, evaluation, and post-training infrastructure to support the productization of agentic systems.
Smeet Dhakecha
Smeet Dhakecha is a Research Engineer at Amazon, working within the Agentic AI Science team. His work spans agent simulation and evaluation systems, as well as the design and deployment of data transformation pipeline infrastructure to support fast-moving scientific research.

