You can use ToolSimulator, an LLM-powered tool simulation framework inside Strands Evals, to comprehensively and safely test AI agents that depend on external tools, at scale. Instead of risking live API calls that expose personally identifiable information (PII) or trigger unintended actions, or settling for static mocks that break with multi-turn workflows, you can use ToolSimulator's large language model (LLM)-powered simulations to validate your agents. Available today as part of the Strands Evals Software Development Kit (SDK), ToolSimulator helps you catch integration bugs early, test edge cases comprehensively, and ship production-ready agents with confidence.
In this post, you'll learn how to:
- Set up ToolSimulator and register tools for simulation
- Configure stateful tool simulations for multi-turn agent workflows
- Enforce response schemas with Pydantic models
- Integrate ToolSimulator into a complete Strands Evals evaluation pipeline
- Apply best practices for simulation-based agent evaluation
Prerequisites
Before you begin, make sure that you have the following:
- Python 3.10 or later installed in your environment
- Strands Evals SDK installed: pip install strands-evals
- Basic familiarity with Python, including decorators and type hints
- Familiarity with AI agents and tool-calling concepts (API calls, function schemas)
- Pydantic knowledge is helpful for the advanced schema examples, but is not required to get started
- An AWS account is not required to run ToolSimulator locally
Why tool testing challenges your development workflow
Modern AI agents don't just reason. They call APIs, query databases, invoke Model Context Protocol (MCP) servers, and interact with external systems to complete tasks. Your agent's behavior depends not only on its reasoning, but on what those tools return. When you test these agents against live APIs, you run into three challenges that slow you down and put your systems at risk.
Three challenges that live APIs create:
- External dependencies slow you down. Live APIs impose rate limits, experience downtime, and require network connectivity. When you're running hundreds of test cases, these constraints make comprehensive testing impractical.
- Test isolation becomes risky. Real tool calls trigger real side effects. You risk sending actual emails, modifying production databases, or booking actual flights during testing. Your agent tests shouldn't interact with the systems that they're testing against.
- Privacy and security create barriers. Many tools handle sensitive data, including user records, financial information, and PII. Running tests against live systems unnecessarily exposes that data and creates compliance risks.
Why static mocks fall short
You might consider static mocks as an alternative. Static mocks work for straightforward, predictable scenarios, but they require constant maintenance as your APIs evolve. More importantly, they break down in the multi-turn, stateful workflows that real agents perform.
Consider a flight booking agent. It searches for flights with one tool call, then checks booking status with another. The second response should depend on what the first call did. A hardcoded response can't reflect a database that changes state between calls. Static mocks can't capture this.
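This breakdown can be sketched in plain Python with a toy static mock; the tool names, booking IDs, and response shapes below are hypothetical and purely illustrative:

```python
# A static mock returns the same canned answer on every call.
def mock_get_booking_status(booking_id: str) -> dict:
    return {"status": "none"}  # hardcoded: "no booking exists"

bookings: dict[str, dict] = {}  # what a real (or simulated) backend tracks

def book_flight(flight_id: str) -> str:
    """Stand-in for a real booking call that mutates backend state."""
    booking_id = f"BK-{len(bookings) + 1}"
    bookings[booking_id] = {"flight": flight_id, "status": "confirmed"}
    return booking_id

bid = book_flight("SEA-JFK-0800")
# The backend now holds a confirmed booking...
assert bookings[bid]["status"] == "confirmed"
# ...but the static mock still reports nothing, contradicting the first call.
assert mock_get_booking_status(bid)["status"] == "none"
```

The second tool's mock has no way to observe what the first call did, which is exactly the gap a stateful simulator closes.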
What makes ToolSimulator different
ToolSimulator solves these challenges with three essential capabilities that work together to give you safe, scalable agent testing without sacrificing realism.
- Adaptive response generation. Tool outputs reflect what your agent actually asked for, not a fixed template. When your agent searches for Seattle-to-New York flights, ToolSimulator returns plausible options with realistic prices and times, not a generic placeholder.
- Stateful workflow support. Many real-world tools maintain state across calls. A write operation should affect subsequent reads. ToolSimulator maintains consistent shared state across tool calls, making it safe to test database interactions, booking workflows, and multi-step processes without touching production systems.
- Schema enforcement. Developers typically add a post-processing layer that parses raw tool output into a structured format. When a tool returns a malformed response, this layer breaks. ToolSimulator validates responses against Pydantic schemas that you define, catching malformed responses before they reach your agent.
How ToolSimulator works
Figure 1: ToolSimulator (TS) intercepts tool calls and routes them to an LLM-based response generator
ToolSimulator intercepts calls to your registered tools and routes them to an LLM-based response generator. The generator uses the tool schema, your agent's input, and the current simulation state to produce a realistic, context-appropriate response. No handwritten fixtures required.
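The interception pattern itself can be illustrated with an ordinary Python decorator. This is a toy stand-in, not ToolSimulator's actual implementation, which consults an LLM and the shared simulation state instead of returning a canned dict:

```python
from functools import wraps

def simulate(fn):
    """Toy interceptor: invoke a response generator instead of the body."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        # A real simulator would combine the tool schema, the agent's
        # arguments, and the simulation state; here we echo the call.
        return {"tool": fn.__name__, "args": args, "kwargs": kwargs}
    return wrapper

@simulate
def search_flights(origin: str, destination: str, date: str) -> dict:
    """Search for available flights."""
    pass  # never executed while simulated

result = search_flights("SEA", "JFK", date="2025-03-15")
print(result["tool"])  # search_flights
```

Because the wrapper runs instead of the function body, the decorated tool never needs a working implementation during testing.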
Your workflow follows three steps: decorate and register your tools, optionally steer the simulation with context, then let ToolSimulator mock the tool responses when your agent runs.
Figure 2: The three-step ToolSimulator (TS) workflow — Decorate & Register, Steer, Mock
Getting started with ToolSimulator
The following sections walk you through each step of the ToolSimulator workflow, from initial setup to running your first simulation.
Step 1: Decorate and register
Create a ToolSimulator instance, then wrap your tool function with the @tool_simulator.tool() decorator to register it for simulation. The actual function body can stay empty. ToolSimulator intercepts calls before they reach the implementation:
from strands_evals.simulation.tool_simulator import ToolSimulator

tool_simulator = ToolSimulator()

@tool_simulator.tool()
def search_flights(origin: str, destination: str, date: str) -> dict:
    """Search for available flights between two airports on a given date."""
    pass  # The real implementation isn't called during simulation
Step 2: Steer (optional configuration)
By default, ToolSimulator automatically infers how each tool should behave from its schema and docstring. No extra configuration is needed to get started. When you want more control, you can use these three optional parameters to customize simulation behavior:
- share_state_id: Links tools that share the same backend under a common state key. State changes made by one tool (for example, a setter) are immediately visible to subsequent calls by another (for example, a getter).
- initial_state_description: Seeds the simulation with a natural language description of pre-existing state. Richer context produces more realistic and consistent responses.
- output_schema: A Pydantic model defining the expected response structure. ToolSimulator generates responses that conform strictly to this schema.
Step 3: Mock
When your agent calls a registered tool, the ToolSimulator wrapper intercepts the call and routes it to the dynamic response generator. The generator validates the agent's parameters against the tool schema, produces a response that matches the output_schema, and updates the state registry so subsequent tool calls see a consistent world.
Figure 3: The ToolSimulator (TS) simulation flow when the agent calls a registered tool
The following example simulates a flight search tool attached to a flight search assistant:
from strands import Agent
from strands_evals.simulation.tool_simulator import ToolSimulator

# 1. Create a simulator instance
tool_simulator = ToolSimulator()

# 2. Register a tool for simulation with initial state context
@tool_simulator.tool(
    initial_state_description="Flight database: SEA->JFK flights available at 8am, 12pm, and 6pm. Prices range from $180 to $420.",
)
def search_flights(origin: str, destination: str, date: str) -> dict:
    """Search for available flights between two airports on a given date."""
    pass

# 3. Create an agent with the simulated tool and run it
flight_tool = tool_simulator.get_tool("search_flights")
agent = Agent(
    system_prompt="You are a flight search assistant.",
    tools=[flight_tool],
)
response = agent("Find me flights from Seattle to New York on March 15.")
print(response)
# Expected output: A structured list of simulated SEA->JFK flights with times
# and prices consistent with the initial_state_description you provided.
Advanced ToolSimulator usage
The following sections cover three advanced capabilities that give you more control over simulation behavior: running independent instances for parallel testing, configuring shared state for multi-turn workflows, and enforcing custom response schemas.
Run independent simulator instances
You can create multiple ToolSimulator instances side by side. Each instance maintains its own tool registry and state, so you can run parallel experiment configurations in the same codebase:
simulator_a = ToolSimulator()
simulator_b = ToolSimulator()

# Each instance has an independent tool registry and state —
# ideal for comparing agent behavior across different tool setups.
Configure shared state for multi-turn workflows
For stateful tools such as database getters and setters, ToolSimulator maintains consistent shared state across tool calls. Use share_state_id to link tools that operate on the same backend, and initial_state_description to seed the simulation with pre-existing context:
@tool_simulator.tool(
    share_state_id="flight_booking",
    initial_state_description="Flight booking system: SEA->JFK flights available at 8am, 12pm, and 6pm. No bookings currently active.",
)
def search_flights(origin: str, destination: str, date: str) -> dict:
    """Search for available flights between two airports on a given date."""
    pass

@tool_simulator.tool(
    share_state_id="flight_booking",
)
def get_booking_status(booking_id: str) -> dict:
    """Retrieve the current status of a flight booking by booking ID."""
    pass

# Both tools share the "flight_booking" state.
# When search_flights is called, get_booking_status sees the same
# flight availability data in subsequent calls.
Inspect the state before and after agent execution to validate that tool interactions produced the expected changes:
initial_state = tool_simulator.get_state("flight_booking")
# … run the agent …
final_state = tool_simulator.get_state("flight_booking")
# Verify not just the final output, but the full sequence of tool interactions.
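In a test, that comparison can become explicit assertions. The sketch below assumes the state dict exposes a 'previous_calls' list, as in the pipeline example later in this post; the exact shape of the returned state is an assumption here:

```python
# Hypothetical snapshots shaped like tool_simulator.get_state() results.
initial_state = {"previous_calls": []}
final_state = {"previous_calls": ["search_flights", "get_booking_status"]}

# Tool calls recorded during the agent run.
new_calls = final_state["previous_calls"][len(initial_state["previous_calls"]):]

# Assert both the content and the ordering of the interactions.
assert new_calls == ["search_flights", "get_booking_status"]
assert new_calls[0] == "search_flights"  # search must precede the status check
```

Asserting on the call sequence, not just the agent's final text, catches agents that reach a correct answer through the wrong tool path.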
Tip: Seeding state from real data
Because initial_state_description accepts natural language, you can get creative with how you seed context. For tools that interact with tabular data, use a DataFrame.describe() call to generate statistical summaries and pass those statistics directly as the state description. ToolSimulator will generate responses that reflect realistic data distributions, without ever accessing the actual data.
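For example, assuming your flight data lives in a pandas DataFrame (the column names here are hypothetical), a summary-based state description might be built like this:

```python
import pandas as pd

# Hypothetical flight-price table; in practice, this is your real data.
df = pd.DataFrame({
    "price_usd": [180, 220, 310, 420],
    "duration_min": [330, 345, 360, 350],
})

# Summarize the distribution instead of exposing individual rows.
summary = df.describe().to_string()
state_description = (
    "Flight price table with this statistical profile "
    "(no individual rows exposed):\n" + summary
)
# Pass state_description as initial_state_description when registering the tool.
print(state_description.splitlines()[0])
```

The simulator can then produce responses consistent with those statistics while the underlying rows never leave your environment.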
Enforce a custom response schema
By default, ToolSimulator infers a response structure from the tool's docstring and type hints. For tools that follow strict specifications such as OpenAPI or MCP schemas, define the expected response as a Pydantic model and pass it using output_schema:
from pydantic import BaseModel, Field

class FlightSearchResponse(BaseModel):
    flights: list[dict] = Field(..., description="List of available flights with flight number, departure time, and price")
    origin: str = Field(..., description="Origin airport code")
    destination: str = Field(..., description="Destination airport code")
    status: str = Field(default="success", description="Search operation status")
    message: str = Field(default="", description="Additional status message")

@tool_simulator.tool(output_schema=FlightSearchResponse)
def search_flights(origin: str, destination: str, date: str) -> dict:
    """Search for available flights between two airports on a given date."""
    pass

# ToolSimulator validates parameters strictly and returns only valid JSON
# responses that conform to the FlightSearchResponse schema.
Integration with Strands Evals pipelines
ToolSimulator fits naturally into the Strands Evals evaluation framework. The following example shows a complete pipeline, from simulation setup to experiment report, using the GoalSuccessRateEvaluator to score agent performance on tool-calling tasks:
from typing import Any

from pydantic import BaseModel, Field
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import GoalSuccessRateEvaluator
from strands_evals.simulation.tool_simulator import ToolSimulator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

# Set up telemetry and the tool simulator
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()
memory_exporter = telemetry.in_memory_exporter
tool_simulator = ToolSimulator()

# Define the response schema
class FlightSearchResponse(BaseModel):
    flights: list[dict] = Field(..., description="Available flights with number, departure time, and price")
    origin: str = Field(..., description="Origin airport code")
    destination: str = Field(..., description="Destination airport code")
    status: str = Field(default="success", description="Search operation status")
    message: str = Field(default="", description="Additional status message")

# Register tools for simulation
@tool_simulator.tool(
    share_state_id="flight_booking",
    initial_state_description="Flight booking system: SEA->JFK flights at 8am, 12pm, and 6pm. No bookings currently active.",
    output_schema=FlightSearchResponse,
)
def search_flights(origin: str, destination: str, date: str) -> dict[str, Any]:
    """Search for available flights between two airports on a given date."""
    pass

@tool_simulator.tool(share_state_id="flight_booking")
def get_booking_status(booking_id: str) -> dict[str, Any]:
    """Retrieve the current status of a flight booking by booking ID."""
    pass
# Define the evaluation task
def user_task_function(case: Case) -> dict:
    initial_state = tool_simulator.get_state("flight_booking")
    print(f"[State before]: {initial_state.get('initial_state')}")

    search_tool = tool_simulator.get_tool("search_flights")
    status_tool = tool_simulator.get_tool("get_booking_status")
    agent = Agent(
        trace_attributes={"gen_ai.conversation.id": case.session_id, "session.id": case.session_id},
        system_prompt="You are a flight booking assistant.",
        tools=[search_tool, status_tool],
        callback_handler=None,
    )
    agent_response = agent(case.input)
    print(f"[User]: {case.input}")
    print(f"[Agent]: {agent_response}")

    final_state = tool_simulator.get_state("flight_booking")
    print(f"[State after]: {final_state.get('previous_calls', [])}")

    finished_spans = memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(finished_spans, session_id=case.session_id)
    return {"output": str(agent_response), "trajectory": session}
# Define test cases, run the experiment, and display the report
test_cases = [
    Case(
        name="flight_search",
        input="Find me flights from Seattle to New York on March 15.",
        metadata={"category": "flight_booking"},
    ),
]

experiment = Experiment[str, str](
    cases=test_cases,
    evaluators=[GoalSuccessRateEvaluator()],
)
reports = experiment.run_evaluations(user_task_function)
reports[0].run_display()
The task function retrieves the simulated tools, creates an agent, runs the interaction, and returns both the agent's output and the full telemetry trajectory. The trajectory gives evaluators like GoalSuccessRateEvaluator access to the complete sequence of tool calls and model invocations, not just the final response.
Best practices for simulation-based evaluation
The following practices help you get the most out of ToolSimulator across development and evaluation workflows:
- Start with the default configuration for broad coverage. Add configuration overrides only for the specific tool behaviors that you want to control precisely. ToolSimulator's defaults are designed to produce realistic behavior without requiring setup.
- Provide rich initial_state_description values for stateful tools. The more context you seed, the more realistic and consistent the simulated responses will be. Include data ranges, entity counts, and relationship context.
- Use share_state_id for tools that interact with the same backend, so write operations are visible to subsequent reads. This is essential for testing multi-turn workflows like booking, cart management, or database updates.
- Apply output_schema for tools that follow strict specifications, such as OpenAPI or MCP schemas. Schema enforcement catches malformed responses before they reach your agent and break your post-processing layer.
- Validate tool interaction sequences, not just final outputs. Compare state changes before and after agent execution to confirm that tool calls occurred in the right order and produced the right state transitions.
- Start small and expand. Begin with your most common tool interaction scenarios, then expand to edge cases as your evaluation practice matures. Supplement simulation-based testing with targeted live API tests for critical production paths.
Conclusion
ToolSimulator transforms how you test AI agents by replacing risky live API calls with intelligent, adaptive simulations. You can now safely validate complex, stateful workflows at scale, catching integration bugs early and shipping production-ready agents with confidence. Combining ToolSimulator with Strands Evals evaluation pipelines gives you complete visibility into agent behavior without managing test infrastructure or risking real-world side effects.
Next steps
Start testing your AI agents safely today. Install ToolSimulator with the following command:
pip install strands-evals
To continue exploring ToolSimulator and Strands Evals, take these next steps:
- Read the Strands Evals documentation to explore all configuration options, including advanced state management and custom evaluators.
- Try the example to see ToolSimulator in action. Extend the example by adding more tools and testing multi-step agent workflows.
- Explore Amazon Bedrock for the LLM backend options that power ToolSimulator's response generation.
- Learn about AWS Lambda for serverless agent deployment strategies that pair well with ToolSimulator-based testing.
- Join the Strands community forums to ask questions, share your evaluation setups, and connect with other agent builders.
Share your feedback. We'd love to hear how you're using ToolSimulator. Share your feedback, report issues, and suggest features through the Strands Evals GitHub repository or community forums.
About the Authors
Darren Wang
Darren Wang is a Research Engineer at Amazon Web Services, where he bridges cutting-edge AI research and production systems. With a Ph.D. background in speech recognition and 5 years of experience in email anti-spam engineering, Darren transforms early-stage machine learning research into scalable, production-ready solutions that deliver measurable customer impact. Specializing in agent simulation and evaluation frameworks, he empowers developers to build more reliable, testable AI agents through robust testing infrastructure. Outside of work, he enjoys bouldering, playing violin, and anything about cats.
Xuan Qi
Xuan Qi is an Applied Scientist at Amazon Web Services, where she applies her background in physics to tackle complex challenges in machine learning and artificial intelligence. Specializing in ML modeling and simulation, Xuan is passionate about translating scientific principles into practical applications that drive meaningful technological advancements. Her work focuses on developing more intuitive and efficient AI systems that can better understand and interact with the world. Outside of her professional pursuits, Xuan finds balance and creativity through dancing and playing the violin, bringing the precision and harmony of these arts into her scientific endeavors.
Smeet Dhakecha
Smeet Dhakecha is a Research Engineer at Amazon Web Services, working within the Agentic AI Science team. His work spans agent simulation and evaluation systems, as well as the design and deployment of data transformation pipelines to support fast-moving scientific research for model post-training and RL training.
Vinayak Arannil
Vinayak is a Sr. Applied Scientist at Amazon Web Services. With several years of experience, he has worked on various domains of AI like computer vision, natural language processing, recommendation systems, etc. Currently, Vinayak helps build new capabilities on AgentCore and Strands, enabling customers to evaluate their Agentic applications with ease, accuracy, and efficiency.

