Your AI agent worked in the demo, impressed stakeholders, handled test scenarios, and seemed ready for production. Then you deployed it, and the picture changed. Real users experienced wrong tool calls, inconsistent responses, and failure modes nobody anticipated during testing.
The result is a gap between expected agent behavior and actual user experience in production. Agent evaluation introduces challenges that traditional software testing wasn't designed to handle. Because large language models (LLMs) are non-deterministic, the same user query can produce different tool selections, reasoning paths, and outputs across multiple runs. That means you need to test each scenario repeatedly to understand your agent's actual behavior patterns. A single test pass tells you what can happen, not what typically happens. Without systematic measurement across these variations, teams are trapped in cycles of manual testing and reactive debugging. This burns through API costs without clear insight into whether changes improve agent performance. The uncertainty makes every prompt modification risky and leaves a fundamental question unanswered: "Is this agent actually better now?"
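Because a single pass only shows one possible behavior, teams typically score the same scenario across repeated trials and look at the distribution. A minimal sketch of that idea in Python (the `run_agent` stub below is a stand-in for a real, non-deterministic agent call, not part of any AWS API):

```python
import random
from collections import Counter

def run_agent(query: str, seed: int) -> str:
    """Stand-in for a real agent invocation; real agents are non-deterministic."""
    random.seed(seed)
    return random.choice(["search_flights", "search_flights", "search_hotels"])

def trial_pass_rate(query: str, expected_tool: str, trials: int = 10) -> float:
    """Run the same query repeatedly and report how often the agent
    selects the expected tool."""
    outcomes = Counter(run_agent(query, seed=i) for i in range(trials))
    return outcomes[expected_tool] / trials

rate = trial_pass_rate("Book me a flight to Boston", "search_flights", trials=10)
```

Reporting a pass rate over many trials, rather than a single pass/fail result, is what distinguishes agent evaluation from traditional unit testing.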
In this post, we introduce Amazon Bedrock AgentCore Evaluations, a fully managed service for assessing AI agent performance across the development lifecycle. We walk through how the service measures agent accuracy across multiple quality dimensions, explain the two evaluation approaches for development and production, and share practical guidance for building agents you can deploy with confidence.
Why agent evaluation requires a new approach
When a user sends a request to an agent, several decisions happen in sequence. The agent determines which tools (if any) to call, executes those calls, and generates a response based on the results. Each step introduces potential failure points: selecting the wrong tool, calling the right tool with incorrect parameters, or synthesizing tool outputs into an inaccurate final answer. Unlike traditional applications where you test a single function's output, agent evaluation requires measuring quality across this entire interaction flow.
This creates specific challenges for agent builders that can be addressed by doing the following:
- Define evaluation criteria for what constitutes a correct tool selection, valid tool parameters, an accurate response, and a helpful user experience.
- Build test datasets that represent real user requests and expected behaviors.
- Choose scoring methods that can assess quality consistently across repeated runs.
Each of these definitions directly determines what your evaluation system measures, and getting them wrong means optimizing for the wrong outcomes. Without this foundational work, the gap between what teams hope their agents do and what they can prove their agents do becomes a real business risk. Bridging this gap requires a continuous evaluation cycle, as shown in Figure 1. Teams build test cases, run them against the agent, score the results, analyze failures, and implement improvements. Each failure becomes a new test case, and the cycle continues through every iteration of the agent.
Figure 1: The agent evaluation process follows a continuous cycle of test cases, agent execution, scoring, analysis, and improvements. Failures become new test cases.
Running this cycle end to end, however, requires significant infrastructure beyond the evaluation logic itself. Teams must curate datasets, select and host scoring models, manage inference capacity and API rate limits, build data pipelines that transform agent traces into evaluation-ready formats, and create dashboards to visualize trends. For organizations operating multiple agents, this overhead multiplies with each one. The result is that agent development teams end up spending more time maintaining evaluation tooling than acting on what it tells them. This is the problem Amazon Bedrock AgentCore Evaluations was built to address.
Introducing Amazon Bedrock AgentCore Evaluations
First launched in public preview at AWS re:Invent 2025, the service is now generally available. It handles the evaluation models, inference infrastructure, data pipelines, and scaling so teams can focus on improving agent quality rather than building and maintaining evaluation systems. For built-in evaluators, model quota and inference capacity are fully managed, which means organizations evaluating many agents aren't consuming their own quotas or provisioning separate infrastructure for evaluation workloads.
AgentCore Evaluations examines agent behavior end-to-end using OpenTelemetry (OTEL) traces with generative AI semantic conventions. OTEL is an open source observability standard for collecting distributed traces from applications. The generative AI semantic conventions extend it with fields specific to language model interactions, including prompts, completions, tool calls, and model parameters. By building on this standard, the service works consistently across agents built with frameworks such as Strands Agents or LangGraph and instrumented with OpenTelemetry or OpenInference, capturing the full context needed for meaningful evaluation.
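As an illustration, a tool-call span following the GenAI semantic conventions carries attributes like the following. The attribute keys shown are from the OpenTelemetry GenAI conventions; the call ID and model ID values are placeholders, and the exact attribute set depends on how your framework is instrumented:

```python
# Illustrative span data following the OpenTelemetry GenAI semantic
# conventions. The exact attributes emitted depend on your instrumentation
# (Strands Agents, LangGraph, OpenInference, and so on).
tool_call_span = {
    "name": "execute_tool get_weather",
    "attributes": {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": "get_weather",
        "gen_ai.tool.call.id": "call_abc123",  # placeholder call ID
        "gen_ai.request.model": "example-model-id",  # placeholder model ID
    },
}
```

Because evaluators read these standardized fields rather than framework-specific formats, the same evaluator can score traces from any compliant agent.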
Evaluations can be configured with different approaches:
- LLM-as-a-judge, where an LLM evaluates each agent interaction against structured rubrics with clearly defined criteria.
- Ground truth-based evaluation, which compares agent responses against predefined or simulated datasets.
- Custom code evaluators, where you bring in an AWS Lambda function with your own custom scoring logic.
In the LLM-as-a-judge approach, the judge model examines the full interaction context, including conversation history, available tools, tools used, parameters passed, and system instructions, then provides detailed reasoning before assigning a score. Every score comes with an explanation. Teams can use these explanations to verify judgments, understand exactly why an interaction received a particular rating, and identify what should have happened differently. This approach goes beyond simple pass/fail judgments, providing the structured analysis and transparent reasoning that enable quality assessment at a scale manual review cannot match.
Three principles guide how the service approaches evaluation. Evidence-driven development replaces intuition with quantitative metrics, so teams can measure the actual impact of changes rather than debating whether a prompt modification "feels better." Multi-dimensional analysis evaluates different aspects of agent behavior independently, making it possible to pinpoint exactly where improvements are needed rather than relying on a single aggregate score. Continuous measurement connects the performance baselines established during development directly to production monitoring, making sure that quality holds up as real-world conditions evolve. These principles apply throughout the agent lifecycle, from the first round of development testing through ongoing production monitoring.
Evaluation across the agent lifecycle
An agent's journey from prototype to production creates two distinct evaluation needs. During development, teams need controlled environments where they can compare alternatives, test the agent on curated datasets, reproduce results, and validate changes before they reach users. After the agent is live, the challenge shifts to monitoring real-world interactions at scale, where users encounter edge cases and interaction patterns that no amount of pre-deployment testing anticipated. Figure 2 illustrates how evaluation supports each stage of this journey, from initial proof of concept through shadow testing, A/B testing, and continuous production monitoring.
Figure 2: From POC to production, evaluation validates agents before deployment. As agents mature, evaluation supports shadow testing, A/B testing, and continuous monitoring at scale.
AgentCore Evaluations maps two complementary approaches to these lifecycle stages, as shown in Figure 3. Online evaluation handles continuous production monitoring, while on-demand evaluation supports controlled testing during development and continuous integration and continuous delivery (CI/CD) workflows, including evaluations against ground truth.
| | On-demand evaluation | Online evaluation |
| --- | --- | --- |
| Advantages | Turn-by-turn debugging considering session-level information, component validation, CI/CD integration | Conversation quality, monitoring of live agent interactions |
| Use cases | Benchmarking, stability validation, component monitoring, pre-release checks | Continuous sampling, live dashboards |
Figure 3: Online evaluation monitors production traffic continuously, while on-demand evaluation supports controlled testing during development.
Online evaluation for production monitoring
Online evaluation monitors live agent interactions by continuously sampling a configurable percentage of traces and scoring them against your chosen evaluators. You define which evaluators to apply, set sampling rules that control what fraction of production traffic gets evaluated, and set up appropriate filters. The service handles reading traces, running evaluations, and surfacing results in the AgentCore Observability dashboard powered by Amazon CloudWatch. If you're already collecting traces for observability, online evaluation adds quality scores with explanations alongside your existing operational metrics, without requiring code changes or redeployments. Figure 4 shows how this process works.
Quality issues in production often surface in ways that traditional monitoring misses. Operational dashboards may show green across latency and error rates while user experience quietly degrades because the agent starts selecting wrong tools or providing less helpful responses. Continuous quality scoring catches these silent failures by tracking evaluation metrics alongside operational ones. Because AgentCore Observability runs on CloudWatch, you can create custom dashboards and set alarms to get alerted the moment scores drop below your thresholds.
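A threshold alarm on an evaluation score might be parameterized as follows. This is a sketch only: the namespace and metric name below are assumptions, so substitute the names your AgentCore Observability setup actually publishes to CloudWatch.

```python
# Sketch of CloudWatch alarm parameters for an evaluation-score threshold.
# Namespace and MetricName are placeholders, not documented values.
alarm_params = {
    "AlarmName": "agent-helpfulness-below-threshold",
    "Namespace": "AgentCoreEvaluations",   # assumed namespace
    "MetricName": "Helpfulness",           # assumed metric name
    "Statistic": "Average",
    "Period": 3600,                        # evaluate hourly aggregates
    "EvaluationPeriods": 2,                # two consecutive breaches before alarming
    "Threshold": 0.7,
    "ComparisonOperator": "LessThanThreshold",
    "TreatMissingData": "notBreaching",    # quiet periods should not page anyone
}

# With boto3, these parameters would be passed to:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```

Requiring two consecutive breaching periods avoids paging on a single noisy sample while still catching sustained quality drops within a couple of hours.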
On-demand evaluation for development
On-demand evaluation is a real-time API designed for development and CI/CD workflows. Teams use it to test changes before deployment, run evaluation suites as part of CI/CD pipelines, perform regression testing across builds, and gate deployments on quality thresholds. Developers select a full session, or specify exact spans (individual operations within a trace) or traces by providing their IDs. The service considers the full session conversation and scores individual spans or traces against the same evaluators used in production. Common use cases include validating prompt changes, comparing model performance across alternatives, and preventing quality regressions.
Figure 5: On-demand evaluation lets developers prepare trace datasets, invoke evaluations through a CI/CD pipeline or development environment, and receive scores using built-in or custom evaluators powered by Amazon Bedrock foundation models.
Because both modes use the same evaluators, what you test in CI/CD is what you monitor in production, giving you consistent quality standards across the entire development lifecycle. On-demand evaluation provides the controlled environment needed for architecture decisions and systematic improvement, while online evaluation makes sure quality monitoring continues after the agent is live. Together, the two modes form a continuous feedback loop between development and production, and both draw from the same set of evaluators and scoring infrastructure.
How AgentCore evaluates your agent
AgentCore Evaluations organizes agent interactions into a three-level hierarchy that determines what can be evaluated and at what granularity. A session represents a complete conversation between a user and your agent, grouping all related interactions from a single user or workflow. Within each session, a trace captures everything that happens during a single exchange. When a user sends a message and receives a response, that round trip produces one trace containing every step the agent took to generate its answer. Each trace in turn contains individual operations called spans, representing specific actions your agent performed, such as invoking a tool, retrieving information from a knowledge base, or generating text.
Different evaluators operate at different levels of this hierarchy, and problems at one level can look very different from problems at another. The service provides 13 preconfigured built-in evaluators organized across these three levels, each measuring a distinct aspect of agent behavior (Figure 6). You can also define custom evaluators, using either LLM-as-a-judge or custom code, that work at the session, trace, and span levels.
| Level | Evaluators | Purpose | Ground truth use |
| --- | --- | --- | --- |
| Session | Goal Success Rate | Assesses whether all user goals were completed within a conversation | Users provide free-form textual assertions of goal completion, which are compared against system behavior and measured through Goal Success Rate |
| Trace | Helpfulness, Correctness, Coherence, Conciseness, Faithfulness, Harmfulness, Instruction Following, Response Relevance, Context Relevance, Refusal, Stereotyping | Evaluates response quality, accuracy, safety, and communication effectiveness | Turn-level ground truth (for example, an expected answer or attributes per turn) supports evaluation of Correctness |
| Tool | Tool Selection Accuracy, Tool Parameter Accuracy | Assesses tool selection decisions and parameter extraction precision | Tool call ground truth specifies the correct tool sequence, enabling Trajectory Exact Order Match, Trajectory In-Order Match, and Trajectory Any Order Match |
Figure 6: Built-in evaluators operate at the session, trace, and tool levels, with each level measuring different aspects of agent behavior. Ground truth can be provided as assertions, expected responses, and expected trajectories for evaluation at the session, trace, and tool levels.
Evaluating each level independently helps teams diagnose whether a problem originates in tool selection, response generation, or session-level planning. An agent might choose the right tool with accurate parameters but then synthesize the tool's output poorly in its final response, a pattern that only becomes visible when each level is assessed on its own. Your agent's primary purpose guides which evaluators to prioritize. Customer service agents should focus on Helpfulness, Goal Success Rate, and Instruction Following, since resolving user issues within defined guardrails directly impacts satisfaction. Agents with Retrieval Augmented Generation (RAG) components benefit most from Correctness and Faithfulness, to make sure responses are grounded in the provided context. Tool-heavy agents need strong Tool Selection Accuracy and Tool Parameter Accuracy scores. We recommend starting with three or four evaluators that align with your agent's purpose and expanding coverage as your understanding matures.
Understanding evaluator distinctions
Some evaluators naturally interact with one another, so scores should be read together rather than in isolation. Evaluators that sound similar often measure fundamentally different things, and understanding these distinctions is key for diagnosis.
- Correctness checks whether the response is factually accurate, while Faithfulness checks whether it is consistent with the conversation history. An agent can be faithful to flawed source material but still wrong.
- Helpfulness asks whether the response advances the user toward their goal, while Response Relevance asks whether it addresses what was actually asked. An agent can answer the wrong question perfectly.
- Coherence checks for internal contradictions in reasoning, while Context Relevance checks whether the agent had the right information available. One indicates a generation problem, the other a retrieval problem.
Some evaluators also depend on or trade off against one another. For instance:
- Tool Parameter Accuracy is meaningful only when the agent has selected the correct tool, so low Tool Selection Accuracy should be addressed first.
- Correctness often depends on Context Relevance, because an agent cannot generate accurate answers without the right information.
- Conciseness and Helpfulness often conflict, because brief responses can omit context that users need.
Built-in evaluators ship with predefined prompt templates, selected evaluator models, and standardized scoring criteria, with configurations fixed to preserve consistency across evaluations. They use cross-Region inference to automatically select compute from AWS Regions within your geography, improving model availability and throughput while keeping data stored in the originating Region. Custom evaluators extend this foundation with support for your own evaluator model, evaluation instructions, criteria, and scoring schema. They are particularly valuable for industry-specific assessments such as compliance checking in healthcare or financial services, brand voice consistency verification, or enforcing organizational quality standards. Custom code evaluators let you bring in an AWS Lambda function to perform the evaluations, which also enables deterministic scoring of your agents.
For use cases requiring all processing within a single Region, custom evaluators also provide full control over inference configuration. When building a custom evaluator, you define instructions with placeholders that get replaced with actual trace information before being sent to the judge model. The scope of information available depends on the evaluator's level: a session-level evaluator can access the full conversation context and available tools, a trace-level evaluator sees previous turns plus the current assistant response, and a tool-level evaluator focuses on specific tool calls within their surrounding context. The AWS console provides the option to load the prompt template of any existing built-in evaluator as a starting point, making it easy to create custom variants (Figure 7).
Figure 7: The AgentCore Evaluations console provides the option to load any built-in evaluator's prompt template as a starting point when creating a custom evaluator.
When building multiple custom evaluators, use the MECE (mutually exclusive, collectively exhaustive) principle to design your evaluation suite. Each evaluator should have a distinct, non-overlapping scope while collectively covering all quality dimensions you care about. For example, rather than creating two evaluators that both partially assess "response quality," separate them into one that evaluates factual grounding and another that evaluates communication clarity. When writing evaluator instructions, establish the judge model's role as a performance evaluator to prevent confusion between evaluation and task execution. Use clear, sequential instructions with precise language, and consider including one to three relevant examples with matching input/output pairs that represent your expected standards. For scoring, choose between binary scales (0/1) for pass/fail scenarios or ordinal scales (such as 1–5) for more nuanced assessments, and start with binary scoring when unsure. The service standardizes output to include a reason field followed by a score field, so the judge model always presents its reasoning before assigning a number. Avoid including your own output formatting instructions, as they can confuse the judge model.
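A custom evaluator instruction with placeholders might look like the sketch below. The placeholder names (`{conversation}`, `{response}`) and the policy scenario are illustrative assumptions, not the service's documented template variables:

```python
# Illustrative custom evaluator instruction. Placeholder names here are
# assumptions; use the variables the AgentCore Evaluations console exposes.
INSTRUCTIONS = """You are a performance evaluator for a customer service agent.
Assess whether the assistant's response stays within company policy.

Conversation so far:
{conversation}

Assistant response to evaluate:
{response}

Score 1 if the response complies with policy, 0 otherwise."""

# Before judging, the service substitutes actual trace information into
# the placeholders, conceptually like this:
filled = INSTRUCTIONS.format(
    conversation="User: Can I get a refund after 60 days?",
    response="Refunds are available within 30 days of purchase.",
)
```

Note that the template asks only for a judgment; the reason-then-score output shape is handled by the service, so no formatting instructions are included.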
Custom code-based evaluators
Built-in and custom evaluators both use an LLM-as-a-judge. AgentCore Evaluations also supports a third approach: code-based evaluators, where an AWS Lambda function containing your custom logic serves as the evaluator.
Code-based evaluators are ideal when you have heuristic scoring methods that don't require language understanding to verify. An LLM evaluator can judge whether a response "sounds correct," but it cannot reliably confirm that a specific pay stub figure of $8,333.33 appears verbatim in a response, or that a generated request ID follows the format PTO-2026-NNN. For these deterministic checks, custom code is faster, cheaper, and more reliable. There are four situations where code-based evaluators are particularly helpful:
- Exact data validation: The agent is expected to return specific values from a data source, such as account balances, transaction IDs, or prices.
- Format compliance: Responses must conform to structural constraints, such as length limits, required phrases, or output schemas.
- Business rule enforcement: Policies that require precise interpretation, such as whether a response correctly applies a tiered discount rule or cites the right regulatory clause.
- High-volume production monitoring: Lambda invocations cost a fraction of LLM inference, making code-based evaluators the right choice when every production session needs to be scored continuously at scale.
Creating a code-based evaluator
A code-based evaluator is configured as an AWS Lambda function containing your custom logic. AgentCore passes the agent's OTEL spans to your function as a structured event and expects a result in return. Your function extracts whatever information it needs from the spans and returns a score, a label, and an explanation.
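The following sketch shows what such a handler could look like, checking the PTO-2026-NNN format mentioned earlier. The event and response shapes here are illustrative assumptions; consult the AgentCore Evaluations documentation for the exact schemas your function receives and must return.

```python
import re

# Assumed span attribute key; real events may structure completions differently.
REQUEST_ID_PATTERN = re.compile(r"PTO-2026-\d{3}")

def lambda_handler(event, context):
    """Deterministically verify that the agent's output contains a
    well-formed request ID, something an LLM judge cannot confirm reliably."""
    # Gather completion text from the supplied spans (event shape is assumed).
    output_text = " ".join(
        span.get("attributes", {}).get("gen_ai.completion", "")
        for span in event.get("spans", [])
    )
    passed = bool(REQUEST_ID_PATTERN.search(output_text))
    return {
        "score": 1.0 if passed else 0.0,
        "label": "PASS" if passed else "FAIL",
        "explanation": (
            "Found a well-formed PTO-2026-NNN request ID."
            if passed else "No well-formed request ID in the response."
        ),
    }

# Local smoke test with a synthetic event:
event = {"spans": [{"attributes": {
    "gen_ai.completion": "Your request PTO-2026-042 was filed."}}]}
result = lambda_handler(event, None)
```

Because the check is a regular expression rather than a model call, it costs almost nothing per invocation and never produces an inconsistent verdict across runs.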
Once your Lambda function is deployed and granted permission to be invoked by the AgentCore service principal, you register it as an evaluator with AgentCore. After registration, the evaluator ID can be used for on-demand evaluation.
Setting up AgentCore Evaluations
Configuring the service involves three steps: select your agent, choose your evaluators, and set your sampling rules. Before you begin, deploy your agent using AgentCore Runtime and set up observability through OpenTelemetry or OpenInference instrumentation. The AgentCore samples repository on GitHub provides working examples.
Configuring online evaluation
Create a new online evaluation configuration through the AgentCore Evaluations console. Here, you specify which evaluators to apply, which data source to monitor, and what sampling parameters to use. For the data source, select either an existing AgentCore Runtime endpoint or a CloudWatch log group for agents not hosted on AgentCore Runtime. Then choose your evaluators and define your sampling rules.
Figure 8: The AgentCore Evaluations console for creating an online evaluation configuration, including data source selection, evaluator assignment, and sampling rules.
You can also create configurations programmatically using the CreateOnlineEvaluationConfig API with a unique configuration name, data source, list of evaluators (up to 10), and IAM service role. The enableOnCreate parameter controls whether evaluation begins immediately or remains paused, and executionStatus determines whether the configuration actively processes traces once enabled. While a configuration is running, any custom evaluators it references become locked and cannot be modified or deleted; if you need to change an evaluator, clone it and create a new version. Online evaluation results are stored in a dedicated CloudWatch log group in JSON format.
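Conceptually, a CreateOnlineEvaluationConfig request carries the fields described above. The sketch below only assembles an illustrative payload; field casing, nesting, the sampling field, and the ARNs are assumptions, so verify them against the current API reference before use.

```python
# Illustrative CreateOnlineEvaluationConfig payload. Field names and
# structure are assumptions based on the parameters described in the text.
request = {
    "name": "prod-support-agent-eval",          # unique configuration name
    "dataSource": {
        "runtimeEndpointArn": "arn:aws:bedrock-agentcore:..."  # placeholder ARN
    },
    "evaluators": [                             # up to 10 evaluators
        "Builtin.Helpfulness",
        "Builtin.ToolSelectionAccuracy",
    ],
    "roleArn": "arn:aws:iam::123456789012:role/AgentCoreEvalRole",  # placeholder
    "enableOnCreate": True,                     # start evaluating immediately
    "samplingRate": 0.1,                        # assumed field: score 10% of traffic
}
# A control-plane client (for example, via boto3) would submit this payload.
```

Keeping the evaluator list small at first makes the resulting CloudWatch log volume and inference cost easier to reason about before you expand coverage.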
Monitoring results
After enabling your configuration, monitor results through the AgentCore Observability dashboard in Amazon CloudWatch. Agent-level views display aggregated evaluation metrics and trends, and you can drill into specific sessions and traces to see individual scores and the reasoning behind each one.
Figure 9: The AgentCore Observability dashboard displays evaluation metrics and trends at the agent level, with drill-down into individual sessions, traces, scores, and judge reasoning.
Drilling into an individual trace reveals the evaluation scores and detailed explanations for that specific interaction, so teams can verify judge reasoning and understand why the agent received a particular rating.
Figure 10: The trace-level view displays evaluation scores and explanations directly on individual traces, showing the judge model's reasoning for each metric.
Using on-demand evaluation
For development and testing, you can use on-demand evaluation to analyze specific interactions by selecting the traces or spans you want to examine, applying your chosen evaluators, and receiving detailed scores with explanations. Results return immediately in the API response, limited to 10 evaluations per call, with each result containing the span context, score, and reasoning. If an evaluation partially fails, the response includes both successful and failed results with error codes and messages. On-demand evaluation works well for testing custom evaluators, investigating specific quality issues, and validating fixes before deployment.
Evaluating agents with ground truth
LLM-as-a-judge scoring tells you whether responses seem correct and helpful by the standards of a general-purpose language model. Ground truth evaluation takes this further by letting you specify the answer, the tools that should have been called, and the outcomes the session should have achieved. This helps you measure how closely the agent's actual behavior matches your reference inputs. It is particularly valuable during development, when you have domain knowledge about what the correct behavior is and want to test for specific scenarios.
AgentCore Evaluations supports three types of ground truth reference inputs, each consumed by a specific set of evaluators:
| Reference input | Evaluators | What it measures |
| --- | --- | --- |
| expected_response | Builtin.Correctness | Similarity between the agent's response and the known-correct answer |
| expected_trajectory | Builtin.TrajectoryExactOrderMatch, Builtin.TrajectoryInOrderMatch, Builtin.TrajectoryAnyOrderMatch | Whether the agent called the right tools in the right sequence |
| assertions | Builtin.GoalSuccessRate | Whether the session satisfied a set of natural-language statements about expected outcomes |
These inputs are optional and independent. Evaluators that don't require ground truth, such as Builtin.Helpfulness and Builtin.ResponseRelevance, can be included in the same call as ground-truth evaluators, and each evaluator reads only the fields it needs. You can supply all three reference inputs simultaneously for a comprehensive evaluation, or supply only the subset relevant to a given scenario.
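Combining all three ground truth types for one scenario might look like the following. The field names mirror the table above, while the PTO scenario values are illustrative; confirm the exact schema in the bedrock-agentcore SDK documentation.

```python
# Example reference inputs combining all three ground truth types.
# Scenario values are illustrative.
reference_inputs = {
    # Used by Builtin.Correctness:
    "expected_response": "Your PTO request for March 3-5 has been submitted.",
    # Used by the trajectory matchers:
    "expected_trajectory": ["check_pto_balance", "submit_pto_request"],
    # Used by Builtin.GoalSuccessRate:
    "assertions": [
        "The agent confirmed the remaining PTO balance before submitting.",
        "The request was filed for the dates the user asked for.",
    ],
}
```

Because each evaluator reads only its own field, you could drop `expected_response` from this dictionary and still run the trajectory and goal-success evaluators unchanged.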
The bedrock-agentcore Python SDK provides two interfaces for ground truth evaluation: EvaluationClient for assessing existing sessions and OnDemandEvaluationRunner for automated dataset evaluation.
EvaluationClient: Evaluating existing sessions
EvaluationClient is the right choice when you already have agent sessions recorded in CloudWatch and want to evaluate specific interactions. You provide the session ID, the agent ID, your chosen evaluators, a look-back window for CloudWatch span retrieval, and optional reference inputs. The client fetches the session's spans and submits them for evaluation. This is well suited for development analysis, debugging specific agent failures, and validating known interactions after prompt or model changes.
EvaluationClient works equally well for multi-turn sessions. When you pass a session ID from a multi-turn conversation, the client fetches all spans for that session and evaluates the complete dialogue. Trajectory evaluators verify tool usage across all turns, goal success assertions apply to the session as a whole, and correctness evaluators score each individual response against its corresponding expected answer.
OnDemandEvaluationRunner: Automated dataset evaluation
OnDemandEvaluationRunner is the right choice when you want to evaluate your agent systematically across a curated dataset by invoking the agent for every scenario, collecting CloudWatch spans, and scoring results in a single automated workflow. You define a dataset containing multi-turn scenarios with per-turn and per-scenario ground truth, and provide an agent_invoker function that the runner calls for each turn. The runner manages session IDs and handles all coordination between invocation, span collection, and evaluation.
OnDemandEvaluationRunner is well suited for CI/CD pipelines where the same dataset runs against every build, regression testing after prompt or model changes, and batch evaluation across a large corpus of test cases before a release.
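The shape of such a dataset and the invoker contract can be sketched as follows. The runner itself manages sessions and span collection; the toy loop below only illustrates the control flow, and every name here (dataset keys, the invoker signature) is an assumption rather than the SDK's exact schema.

```python
# Illustrative dataset of multi-turn scenarios with per-turn ground truth.
# Keys and values are assumptions, not the SDK's documented schema.
dataset = [
    {
        "scenario_id": "pto-happy-path",
        "turns": [
            {"user_input": "How many PTO days do I have left?",
             "expected_trajectory": ["check_pto_balance"]},
            {"user_input": "Book March 3-5 off.",
             "expected_trajectory": ["submit_pto_request"]},
        ],
    },
]

def agent_invoker(session_id: str, user_input: str) -> str:
    """Stand-in for the function you supply; it should call your real agent."""
    return f"[{session_id}] agent response to: {user_input}"

# What the runner does conceptually: one session per scenario, one
# invocation per turn, then evaluation of the collected spans.
responses = []
for scenario in dataset:
    session_id = scenario["scenario_id"]  # the real runner generates its own IDs
    for turn in scenario["turns"]:
        responses.append(agent_invoker(session_id, turn["user_input"]))
```

Keeping ground truth inside each turn record means the same dataset file can drive both invocation and scoring, which is what makes it reusable across builds in CI/CD.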
The two interfaces share the same evaluators and reference inputs schema, so you can develop and validate ground truth test cases interactively with EvaluationClient against existing production sessions, then promote those same scenarios into your OnDemandEvaluationRunner dataset for systematic regression testing. The hands-on tutorial in the AgentCore samples repository demonstrates both interfaces end-to-end using an example agent across single-turn and multi-turn scenarios with all three types of ground truth reference inputs.
Finest practices
Success standards in your agent usually mix three dimensions: the standard of responses, the latency at which customers obtain them, and the price of inference. AgentCore Evaluations focuses on the standard dimension, whereas operational metrics like latency and price can be found by means of AgentCore Observability in CloudWatch. The next finest practices are organized across the three analysis ideas described earlier, and replicate patterns that emerge from working with agent analysis at scale.
Evidence-driven development
- Baseline your agent's performance with both synthetic and real-world data, and experiment rigorously. Measure before and after every change so that improvements are grounded in evidence, not intuition. Start testing early with the test cases you have, and build your corpus continuously. The evaluation loop described in Figure 1 ensures that failures become new test cases over time.
- Run A/B tests with statistical rigor for every change. Whether you're updating a system prompt, swapping a model, or adding a tool, compare performance across the same evaluator set before and after deployment.
- Run repeated trials (at least 10 per question) organized by category to benchmark reliability and identify specialization opportunities. Variance across repeated runs reveals where your agent is consistent and where it needs work.
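As a minimal sketch of the repeated-trials practice, the snippet below aggregates toy scores from repeated runs by category using only the Python standard library; the same aggregation supports before/after comparisons when you change a prompt or model. The scores and category names are made up for illustration.

```python
import statistics
from collections import defaultdict

# Toy scores (0 to 1) from repeated runs of the same questions, grouped by
# category; in practice these come from your evaluator results.
runs = [
    ("billing", 0.9), ("billing", 0.7), ("billing", 0.8),
    ("returns", 0.95), ("returns", 0.9), ("returns", 0.92),
]

by_category = defaultdict(list)
for category, score in runs:
    by_category[category].append(score)

# High variance within a category flags inconsistent behavior worth fixing.
for category, scores in sorted(by_category.items()):
    print(f"{category}: mean={statistics.mean(scores):.2f} "
          f"stdev={statistics.stdev(scores):.2f}")
```

Here the billing category shows roughly three times the spread of the returns category, which is exactly the kind of signal that points to where the agent needs work.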
Multi-dimensional evaluation
- Define what success looks like early, using multi-dimensional criteria that reflect your agent's actual purpose. Consider which evaluation levels matter most (session, trace, or tool) and select evaluators that map to your business goals.
- Evaluate every step in the agent's workflow, not just final outcomes. Measuring tool selection, parameter accuracy, and response quality independently gives you the diagnostic precision to fix problems where they actually occur.
- Involve subject matter experts in designing your metrics, defining task coverage, and conducting human-in-the-loop reviews for quality assurance. SME input keeps your evaluators grounded in real-world expectations and catches blind spots that automated scoring alone can miss.
- Start with built-in evaluators to establish baseline measurements, then create custom evaluators as your needs mature. Calibrate custom evaluator scoring with SMEs so that automated judgments align with human expectations in your domain.
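The advice on calibrating custom evaluators can be made concrete with a small validation helper. The rubric structure below is hypothetical (the actual custom evaluator configuration format is defined by the service); the check enforces that every score level has a distinct, non-empty definition, which is what makes automated judgments distinguishable and easier to align with SME reviews.

```python
# Hypothetical rubric structure for calibrating a custom evaluator with SMEs;
# the actual configuration format is defined by AgentCore Evaluations.
helpfulness_rubric = {
    "name": "helpfulness",
    "levels": {
        1: "Response ignores the user's question or is factually wrong.",
        2: "Response is partially relevant but omits key information.",
        3: "Response answers the question but explains it unclearly.",
        4: "Response is correct, complete, and clearly explained.",
    },
}

def check_rubric(rubric):
    """Verify every score level has a distinct, non-empty definition."""
    definitions = list(rubric["levels"].values())
    assert all(d.strip() for d in definitions), "empty level definition"
    assert len(set(definitions)) == len(definitions), "duplicate definitions"
    return True

check_rubric(helpfulness_rubric)
```

A check like this is cheap to run whenever SMEs revise the rubric, so scoring drift from ambiguous level definitions is caught before evaluation runs, not after.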
Continuous measurement
- Detect drift by comparing production behavior to your test baselines. Set up CloudWatch alarms on key metrics so you catch regressions before they reach a broad set of users.
- Remember that your test dataset evolves along with your agent, your users, and the adversarial scenarios you encounter. Update it regularly as edge cases emerge in production and requirements shift.
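The comparison underlying such a drift alarm can be sketched in a few lines. The threshold and scores below are illustrative assumptions; in practice, the equivalent check would run as a CloudWatch alarm on evaluator metrics emitted by the service.

```python
import statistics

def drift_alert(baseline_scores, production_scores, threshold=0.1):
    """Return True when the production mean falls more than `threshold`
    below the baseline mean."""
    gap = statistics.mean(baseline_scores) - statistics.mean(production_scores)
    return gap > threshold

baseline = [0.9, 0.85, 0.92, 0.88]    # scores from your test baseline
production = [0.7, 0.65, 0.72, 0.68]  # recent production evaluator scores
print(drift_alert(baseline, production))  # a 0.2 gap exceeds the 0.1 threshold
```

A fixed threshold is the simplest policy; teams with enough traffic often replace it with a statistical test over rolling windows to reduce false alarms.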
Troubleshooting common evaluation patterns
The evaluator relationships described earlier help you interpret scores diagnostically. The following patterns describe specific situations you may encounter as you scale your application, along with steps to resolve them.
- If you notice low scores across all evaluators, the issue is often foundational. Start by reviewing Context Relevance scores to determine whether your agent has access to the information it needs. Check your agent's system prompt for clarity and completeness; vague or contradictory instructions affect all downstream behavior. Verify that tool descriptions accurately explain when and how to use each tool.
- If you notice inconsistent scores for similar interactions, it usually points to evaluation configuration issues rather than agent problems. If you are using custom evaluators, check whether your instructions are specific enough and whether each score level has clear, distinguishable definitions. Consider lowering the temperature parameter in your custom evaluator's model configuration to produce more deterministic scoring.
- If you see high Tool Selection Accuracy but low Goal Success Rate, your agent selects appropriate tools but fails to complete user goals. This pattern suggests that you may need additional tools to handle certain user requests, or that your agent struggles with tasks requiring multiple sequential tool calls. Check Helpfulness scores as well; the agent may use tools correctly but explain results poorly.
- If evaluations are slow or failing due to throttling, lower your sampling rate to evaluate a smaller percentage of sessions, or reduce your evaluator count. For custom evaluators, request quota increases for your chosen model, or switch to a model with higher default quotas.
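The score patterns above can be encoded as a simple diagnostic helper. The metric names and the 0.5 cutoff here are illustrative assumptions, not service-defined values; the point is that pattern-based interpretation of evaluator scores is mechanical enough to automate.

```python
def diagnose(scores):
    """Map evaluator score patterns to likely causes; the 0.5 threshold is
    illustrative and should be calibrated to your own evaluators."""
    def low(name):
        return scores.get(name, 1.0) < 0.5

    if scores and all(low(name) for name in scores):
        return ("Foundational issue: review Context Relevance, the system "
                "prompt, and tool descriptions.")
    if not low("tool_selection_accuracy") and low("goal_success_rate"):
        return ("Right tools, unmet goals: check tool coverage, multi-step "
                "sequences, and Helpfulness scores.")
    return "No known pattern: inspect individual traces."

print(diagnose({"tool_selection_accuracy": 0.9, "goal_success_rate": 0.3}))
```

Extending the rule table as new failure patterns emerge keeps triage knowledge in one reviewable place instead of in individual engineers' heads.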
Conclusion
In this post, we showed how Amazon Bedrock AgentCore Evaluations helps teams move from reactive debugging to systematic quality management for AI agents. As a fully managed service, it handles the evaluation models, inference infrastructure, and data pipelines that teams would otherwise need to build and maintain for each agent. With on-demand evaluation anchoring the development workflow and online evaluation providing continuous production insight, quality becomes a measurable and improvable property throughout the agent lifecycle. The evaluator relationships and diagnostic patterns provide a framework not just for scoring agents, but for understanding where and why quality issues occur and where to focus improvement efforts.
To explore AgentCore Evaluations in detail, watch the public preview launch session from AWS re:Invent 2025 for a walkthrough with live demos. Visit the Amazon Bedrock AgentCore samples repository on GitHub for hands-on tutorials. For technical details on configuration and API usage, see the AgentCore Evaluations documentation. You can also review service limits and pricing.
About the authors
Akarsha Sehwag
Akarsha Sehwag is a WW Generative AI Data Scientist for the Amazon Bedrock AgentCore GTM team. With over seven years of expertise in AI/ML product development, she has built enterprise solutions across diverse customer segments. Outside of work, she enjoys learning something new, mentoring, speaking at conferences, or being outdoors in nature.
Ishan Singh
Ishan Singh is a Sr. Applied Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan focuses on building generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Bharathi Srinivasan
Bharathi Srinivasan is a Generative AI Data Scientist at AWS. She is passionate about Responsible AI and increasing the reliability of AI agents in real-world scenarios. Bharathi guides internal teams and AWS customers on their responsible AI journey. She has presented her work at various machine learning conferences.
Jack Gordley
Jack Gordley contributed to AgentCore Evaluations and focused on delivering products that help companies monitor and deploy production-ready agents at scale.
Samaneh Aminikhanghahi
Samaneh Aminikhanghahi is an Applied Scientist at the AWS Generative AI Innovation Center, where she works with customers across different verticals to accelerate their adoption of generative AI. She focuses on agentic AI frameworks, building robust evaluation systems, and implementing responsible AI practices that drive sustainable business outcomes.
Osman Santos
Osman Santos is a Sr. Deep Learning Architect in the Generative AI Innovation Center at AWS, where he helps enterprise customers design, build, and scale generative and agentic AI solutions. He focuses on agentic AI, from individual use cases to enterprise-wide platform enablement. Outside of work, Osman enjoys spending time with his family, playing board games, and catching up with the latest anime and sci-fi content.

