Generate recommendations from production traces, validate them with batch evaluation and A/B testing, and ship with confidence.
AI agents that perform well at launch don't stay that way. Models evolve, user behavior shifts, and prompts get reused in contexts they were never designed for. Agent quality quietly degrades. In most teams, the improvement process still looks the same: without automated feedback loops, when a user complains, a developer reads through traces, forms a hypothesis, rewrites the prompt, tests a handful of cases, and ships the fix. Then the cycle repeats, often introducing a new issue for a different user. Until today, Amazon Bedrock AgentCore provided the pieces for you to debug this manually or build custom implementations: check the evaluation scores to detect a quality drop, dig into the traces to determine the root cause, and update the agent with an improved configuration. The developer is the performance engine, relying on intuition rather than on systematic, data-backed evidence. Dedicated science teams and large centralized benchmarks help, but they are neither a practical nor a timely solution for most product teams. Even when you have that machinery, it tends to move on weekly or monthly cycles, while agents drift in production daily.
AgentCore is the platform to build, connect, and optimize agents at scale, with security enforced at the infrastructure layer. Hundreds of developers already use AgentCore to build agents that reason, plan, and act across complex workflows. Today we're announcing new capabilities in AgentCore that complete the observe, evaluate, improve loop for agent performance and quality: recommendations and two ways to validate them.
Recommendations analyze production traces and evaluation outputs to optimize your system prompt or tool descriptions for the evaluator you specify. Batch evaluation tests the recommendation against a predefined test dataset and reports aggregate scores, catching regressions on cases you know matter. When hand-authored scenarios aren't enough, you can also simulate a dataset using an LLM-backed actor to play the role of an end user. A/B testing runs a controlled comparison between versions of an agent through AgentCore Gateway, splitting live production traffic at the percentage you configure and reporting results with confidence intervals and statistical significance. Recommendations propose changes, batch evaluation and A/B testing validate them, and together they replace the manual cycle of reading traces, guessing at fixes, and deploying blind.
"Continuously evaluating and improving agents is critical for driving data-driven value creation. Processes that traditionally required weeks of manual prompt tuning have evolved into rapid, repeatable cycles through the use of AgentCore. By deriving improvement recommendations from production trace data and validating their impact through A/B testing, organizations can optimize performance while ensuring accuracy and effectiveness. This approach enables continuous, highly efficient improvement at scale." Yoshiharu Okuda, Head of Generative AI Business Strategy Division, NTT DATA
How the loop runs in practice
Here is how the loop runs for the model upgrade scenario. The pattern is the same for any change: a prompt refactor, a tool set update, a framework upgrade.
End-to-end traceability in AgentCore captures every model call, tool invocation, and reasoning step as OpenTelemetry-compatible traces managed with AgentCore Observability. Evaluations score these traces automatically across dimensions like goal success rate, tool selection accuracy, helpfulness, and safety, using built-in evaluators, ground-truth comparisons, or custom LLM-as-judge scoring.
Generate a recommendation. Point the Recommendations API at the CloudWatch log group where your agent writes traces. Pick the reward signal as the evaluator you want to optimize for, either a built-in evaluator from AgentCore or a custom evaluator you've built, and choose what to optimize: the system prompt or the tool descriptions. AgentCore reflects on the traces, considering the provided reward signal, and generates a recommendation aimed at improving performance on that reward signal. For tool description recommendations, it only sharpens the tool description without touching the tool implementation. The service proposes, and you decide what to take forward into the validation steps.
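To make the inputs concrete, here is a minimal sketch of what a recommendation request assembles. The field names (`traceSource`, `rewardSignal`, `optimizationTarget`) and values are illustrative assumptions, not the actual AgentCore API shape; consult the service documentation for the real request format.

```python
# Hypothetical request builder for a recommendation run. All field names
# below are illustrative, not the real AgentCore API contract.
def build_recommendation_request(log_group_arn: str, evaluator: str, target: str) -> dict:
    """Gather what a recommendation run needs: where the traces live,
    which reward signal (evaluator) to optimize for, and what to optimize."""
    if target not in ("SYSTEM_PROMPT", "TOOL_DESCRIPTIONS"):
        raise ValueError(f"unsupported optimization target: {target}")
    return {
        "traceSource": {"cloudWatchLogGroupArn": log_group_arn},
        "rewardSignal": {"evaluator": evaluator},  # built-in or custom evaluator
        "optimizationTarget": target,
    }

request = build_recommendation_request(
    "arn:aws:logs:us-east-1:111122223333:log-group:/agentcore/my-agent",
    "goal_success_rate",
    "SYSTEM_PROMPT",
)
```

The point of the shape is the three decisions the text describes: trace source, reward signal, and optimization target.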
Package the change as a configuration bundle. Configurations ship as bundles, which are immutable, versioned snapshots of your agent's configuration keyed by runtime ARN: model ID, system prompt, tool descriptions. Your agent reads its active configuration dynamically at runtime through the AgentCore SDK, so swapping a prompt or a model is a configuration change, not a code change. Create one bundle for your current configuration and another for the recommendation. Bundles are optional. For changes that include code, deploy to a separate runtime endpoint instead.
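A bundle's key property is immutability: a new version never mutates the baseline. A minimal sketch of that idea, with field names and the model ID as placeholder assumptions rather than the AgentCore SDK's actual types:

```python
from dataclasses import dataclass, replace

# Illustrative model of a configuration bundle: an immutable, versioned
# snapshot keyed by runtime ARN. Not the real SDK types.
@dataclass(frozen=True)  # frozen makes the snapshot immutable
class ConfigBundle:
    runtime_arn: str
    version: int
    model_id: str
    system_prompt: str
    tool_descriptions: dict  # tool name -> description text

baseline = ConfigBundle(
    runtime_arn="arn:aws:bedrock-agentcore:us-east-1:111122223333:runtime/my-agent",
    version=1,
    model_id="example-model-id",  # placeholder, not a real model identifier
    system_prompt="You are a market intelligence assistant.",
    tool_descriptions={"get_stock_data": "Fetch real-time stock quotes."},
)

# A recommendation becomes a new version; the baseline stays untouched.
candidate = replace(
    baseline,
    version=2,
    system_prompt="You are a market intelligence assistant. "
                  "Tailor advice to the user's stated strategy.",
)
```

Because both versions exist side by side under the same runtime ARN, promoting or rolling back is just a matter of which version is active.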
Validate offline: batch evaluation. Run your agent against a curated dataset using the new bundle, then evaluate the resulting sessions in batch and compare aggregate scores to your baseline. This catches regressions on use cases you have already defined. Teams often wire batch evaluation into their CI/CD pipelines so no configuration change reaches production without passing their known-good cases.
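The CI/CD gate amounts to a simple comparison of aggregate scores. A sketch of such a regression gate, with the threshold and evaluator names chosen for illustration:

```python
def regression_gate(baseline_scores: dict, candidate_scores: dict,
                    max_drop: float = 0.02) -> list:
    """Compare aggregate evaluator scores between baseline and candidate.
    Flag any evaluator whose mean score drops by more than max_drop;
    an empty result means the candidate passes the gate."""
    failures = []
    for evaluator, baseline in baseline_scores.items():
        candidate = candidate_scores.get(evaluator, 0.0)
        if baseline - candidate > max_drop:
            failures.append((evaluator, baseline, candidate))
    return failures

# The candidate improves goal success but regresses on tool selection,
# so the gate flags it before it reaches production.
failures = regression_gate(
    {"goal_success_rate": 0.91, "tool_selection_accuracy": 0.88},
    {"goal_success_rate": 0.93, "tool_selection_accuracy": 0.84},
)
```

Wired into a pipeline, a non-empty `failures` list fails the build, which is exactly the "no change ships without passing known-good cases" policy described above.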
Validate against live traffic: A/B testing. Configure AgentCore Gateway to split live production traffic between two variants, with the current version as the control and the candidate as the treatment. Variants can be different bundle versions on the same runtime for configuration-only changes, or different gateway targets pointing to separate runtime endpoints for changes that include code. Online evaluation scores every session with your specified evaluators. The A/B test results include confidence intervals and p-values. When you have enough data to give you confidence in the new version's performance, stop the test and promote the new variant by setting it as the default. To roll back, pause the test and the agent reverts to its current configuration.
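To make the confidence intervals and p-values concrete, here is the standard two-proportion z-test on a pass/fail metric such as goal success rate. The service computes these for you; this sketch just shows what the reported numbers mean, using made-up session counts from a 90/10 traffic split.

```python
import math

def two_proportion_ztest(success_a: int, n_a: int, success_b: int, n_b: int):
    """Two-sided z-test comparing control (a) and treatment (b) success
    rates. Returns (difference, 95% CI for the difference, p-value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    diff = p_b - p_a
    # Pooled standard error under the null hypothesis of equal rates
    p_pool = (success_a + success_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pool
    # Normal-approximation two-sided p-value via the error function
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    # Unpooled standard error for the 95% confidence interval
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    return diff, ci, p_value

# Control: 720/900 sessions succeed; treatment: 88/100 sessions succeed.
diff, ci, p = two_proportion_ztest(720, 900, 88, 100)
```

A p-value near 0.05 on a small treatment arm is exactly the "keep collecting data until you have confidence" situation: the Gateway split lets you widen the treatment share as evidence accumulates.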
"What took weeks of manual prompt iteration is now a repeatable cycle with AgentCore: generate a recommendation from production traces, validate it against live traffic with statistical significance, and deploy the winning configuration. Each cycle produces the baseline data for the next; the improvement process compounds." — Masashi Shimizu, Senior Managing Director, Nomura Research Institute, Ltd.
Where we're headed
Today's preview is developer-triggered by design. You choose when to generate a recommendation, which evaluator to target, and whether to promote the result. Our vision is a flywheel where traces feed evaluations, evaluations surface drift, recommendations turn that signal into a concrete change, and A/B testing proves it works. The winning configuration becomes the new baseline, and the traces it produces are the input for the next cycle.

Over time, the flywheel spins with less effort. Recommendations weigh multiple evaluators together, surfacing trade-offs with evidence. They also expand the optimization surface to skills, proposing new ones or refining existing ones based on production usage. Trace analysis clusters production failures into patterns you can address before they multiply. Monitoring alarms launch a recommendation and validation on their own when an evaluator drops below a threshold, landing the result in a review queue. You decide what ships, and the system can do the heavy lifting to get there.
See it in action
The Market Trends Agent sample on GitHub is a market intelligence agent built for investment traders, covering real-time stock data, sector analysis, news search, and personalized trader profiles. For an agent serving traders with different risk profiles, sector interests, and conversational styles, quality degradation is hard to spot and harder to fix without the right tooling.
Walk through the full improvement loop: generate a recommendation that surfaces where the agent fails to personalize advice to a trader's stated strategy or selects the wrong tool when a query spans multiple sectors. Package the change as a configuration bundle version. Validate the fix with batch evaluation across a curated set of trader conversations. Then A/B test the configuration against real trader sessions with statistical confidence before promoting it to production.
Get started
These capabilities are available in preview today through Amazon Bedrock AgentCore in AWS Regions where AgentCore Evaluations is available. During preview, AgentCore Optimization targets system prompts and tool descriptions for agents deployed on AgentCore Runtime and using AgentCore Observability and Evaluations.
Get started through the AgentCore Console or CLI. Read the documentation and follow the step-by-step tutorials here.
About the authors
Amandeep Khurana
Amandeep Khurana is a Principal Product Manager working on Amazon Bedrock AgentCore, focusing on agent operations and performance tooling. He is passionate about building products at the cutting edge of technology and helping customers adopt them to solve their business problems.
Nikhil Kandoi
Nikhil Kandoi is a Principal Engineer on the AgentCore team. Nikhil brings deep expertise in building and scaling intelligent systems spanning multiple AI services, including AWS Lex, Panorama, and Amazon Q. Today, he focuses on the challenges of deploying and managing AI agents at enterprise scale, making large-scale agent deployments reliable and secure.
Bharathi Srinivasan
Bharathi Srinivasan is a Senior Generative AI Data Scientist at AWS. Bharathi works with enterprise customers on large-scale generative AI challenges, including robustness and verification of non-deterministic systems, governance of GenAI and agentic AI platforms, and the quality of dynamic agentic AI systems.

