AI agents that automate web workflows operate through the browser's web layer: the DOM that Playwright and the Chrome DevTools Protocol (CDP) expose. AgentCore Browser provides a secure, isolated browser environment for this, and it works well for the vast majority of automation: navigating pages, filling forms, clicking elements, extracting content. But the web layer has a hard boundary. Anything that the operating system renders (native dialogs, security prompts, certificate choosers, context menus, even Chrome settings) sits outside the DOM entirely. CDP can't see it, and Playwright can't interact with it.
When a web application calls window.print() and a system print dialog appears, Playwright has no DOM to interact with. When a workflow requires a keyboard shortcut or a right-click context menu, CDP has no mechanism to issue those commands at the OS level. When a browser session encounters a macOS privacy dialog, a Windows Security prompt, or a certificate chooser, they're invisible to the web automation layer. These scenarios tend to surface in production. They're triggered by specific application states, OS configurations, or user permissions, not in testing, where web content is predictable enough to validate against.
The problem compounds for vision-enabled agents. A common architecture is to capture a screenshot, send it to a model, receive back coordinates or instructions, and execute. This loop works well for web content but breaks the moment native UI appears. The screenshot captures it, the model reasons about it, and then there's nothing to act with. CDP can't reach what the OS rendered. The agent sees exactly what to do and has no way to do it.
We're announcing OS Level Actions for AgentCore Browser. This new capability unblocks these scenarios by exposing direct OS control through the InvokeBrowser API, so agents can interact with content visible on the screen, not only what's accessible through the browser's web layer. By combining full-desktop screenshots with mouse and keyboard control at the OS level, agents can observe native UI, reason about it, and act on it within the same session. This post walks through how OS Level Actions work, which actions are supported, and how you can get started.
How OS Level Actions work
OS Level Actions are available for new and existing browser configurations without additional setup. After a session is active, you dispatch actions through the InvokeBrowser API. Each call carries exactly one action, identified by its type and arguments, and returns a SUCCESS or FAILED status. The active session is identified using the x-amzn-browser-session-id header, which ties each OS-level action to the correct browser session.
The expected interaction pattern is an action-screenshot-reaction loop. The agent takes an action (click, type, shortcut), captures a screenshot to observe the current state of the screen, and then decides the next action based on what it sees. This loop lets the agent react to dynamic UI, including native dialogs and OS prompts that can appear mid-workflow.
- Agent sends an action. This can be a mouse click, key press, or shortcut using InvokeBrowser.
- AgentCore executes the action on the full OS desktop and returns SUCCESS or FAILED.
- Agent requests a screenshot to observe the current screen state.
- AgentCore captures the full desktop, including native dialogs, OS modals, and UI outside the browser window, and returns a base64-encoded PNG.
- Agent reasons about the screenshot, sending it to a vision model to determine what happened and what to do next.
- Agent sends the next action based on what it saw, continuing the loop.
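The steps above can be sketched as a small driver function. Everything here is hypothetical scaffolding: act, observe, and decide stand in for your InvokeBrowser dispatch, screenshot capture, and vision model call.

```python
def run_loop(act, observe, decide, first_action, max_steps=10):
    """Sketch of the action-screenshot-reaction loop.

    act:     dispatches one InvokeBrowser action (hypothetical callable)
    observe: captures a full-desktop screenshot (hypothetical callable)
    decide:  asks a vision model for the next action, or None when done
    """
    action, steps = first_action, 0
    while action is not None and steps < max_steps:
        act(action)                  # 1. send the action
        screenshot = observe()       # 2. capture the full desktop
        action = decide(screenshot)  # 3. reason about what to do next
        steps += 1
    return steps
```

The max_steps cap is a practical guard so a confused model can't loop indefinitely.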
Supported actions
OS Level Actions are organized into three categories: mouse control, keyboard input, and visual capture. The following table summarizes the eight actions with their fields and constraints.
| Action | Required fields | Optional fields | Notes |
| --- | --- | --- | --- |
| mouseClick | — | x, y, button, clickCount | Defaults to current position, LEFT, single click. clickCount: 1–10. |
| mouseMove | x, y | — | Moves cursor to coordinates. |
| mouseDrag | endX, endY | startX, startY, button | Drags from start to end. button defaults to LEFT. |
| mouseScroll | — | x, y, deltaX, deltaY | Negative deltaY = scroll down. Range: -1000 to 1000. |
| keyType | text | — | Types a string. Max 10,000 characters. |
| keyPress | key | presses | Presses a key N times. presses: 1–100, defaults to 1. |
| keyShortcut | keys | — | Key combination array. Up to 5 keys, for example, ["ctrl", "a"]. |
| screenshot | — | format | Captures full OS desktop. Returns base64-encoded PNG. |
Mouse actions
Mouse actions cover the full range of pointer interactions: clicking, moving, dragging, and scrolling. Coordinate fields are optional for mouseClick. If omitted, the click lands at the current cursor position with a left-button single click. This is useful when a prior mouseMove has already positioned the cursor. mouseDrag requires the end coordinates (endX, endY); the start position defaults to the current cursor location. mouseScroll accepts a position and delta values for both axes; negative deltaY scrolls down, positive scrolls up. A right-click context menu, for example, is a single mouseClick with button set to RIGHT at the target coordinates. Note that some context menu items might not function as expected because of the virtualized environment in which the browser session runs.
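That right-click, expressed as an InvokeBrowser action payload (the coordinates are illustrative):

```json
{
  "action": {
    "mouseClick": {
      "x": 640,
      "y": 360,
      "button": "RIGHT"
    }
  }
}
```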
Keyboard actions
The three keyboard actions cover different levels of input. keyType is for typing text. It sends characters directly and handles strings up to 10,000 characters. keyPress is for individual keys that need to be pressed repeatedly, such as tab to advance through form fields or escape to dismiss a modal. keyShortcut is for combinations: pass an array of key names and AgentCore presses them simultaneously.
Key names for keyPress and keyShortcut must be lowercase. Supported keys include single characters (a–z, 0–9) and named keys such as enter, tab, space, backspace, delete, escape, ctrl, alt, and shift.
To select all text, for example, you'd use keyShortcut with ["ctrl", "a"]:
{
  "action": {
    "keyShortcut": {
      "keys": ["ctrl", "a"]
    }
  }
}
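And pressing tab three times to advance through form fields is a single keyPress action with a presses count (a payload following the constraints in the table above):

```json
{
  "action": {
    "keyPress": {
      "key": "tab",
      "presses": 3
    }
  }
}
```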
Screenshot
The screenshot action captures the full OS desktop and returns a base64-encoded PNG in the response. It's the only action that returns data. The other actions return only a status (SUCCESS or FAILED) and an error field on failure.
{
  "action": {
    "screenshot": {
      "format": "PNG"
    }
  }
}
Getting started
The following examples walk through the action-screenshot-reaction loop, matching the companion notebook. For the full working notebook with all eight actions demonstrated end to end, start there.
Set up clients and create a browser
You need two clients: a control plane client (bedrock-agentcore-control) for managing browser resources, and a data plane client (bedrock-agentcore) for dispatching actions during a session.
import boto3
import time

browser_boto3 = boto3.client('bedrock-agentcore-control', region_name="us-west-2")
BROWSER_NAME = "browser_with_os_actions"
Before starting a session, you need an AWS Identity and Access Management (IAM) execution role and a browser resource. The execution role requires bedrock-agentcore:InvokeBrowser, bedrock-agentcore:StartBrowserSession, and bedrock-agentcore:StopBrowserSession permissions. The companion notebook includes a helper that creates this role for you:
from helpers.utils import create_agentcore_execution_role, SAMPLE_ROLE_NAME
execution_role_arn = create_agentcore_execution_role(SAMPLE_ROLE_NAME)
With the role created, create a custom browser:
created_browser = browser_boto3.create_browser(
    name=BROWSER_NAME,
    executionRoleArn=execution_role_arn,
    networkConfiguration={
        'networkMode': 'PUBLIC'
    }
)
browser_id = created_browser['browserId']
print(f"Browser ID: {browser_id}")
Start a browser session
With the browser resource created, start a session. The viewPort sets the screen resolution. This determines the coordinate space for mouse actions and the size of captured screenshots. The sessionTimeoutSeconds controls how long the session stays alive before it's automatically terminated.
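If you call the data plane directly with boto3 instead of the notebook helper, the session parameters might look like the following sketch. The parameter names follow the boto3 bedrock-agentcore start_browser_session call; the values shown are illustrative assumptions.

```python
# Sketch: session parameters for a direct data plane start_browser_session call.
# Values are illustrative; substitute your own browser ID, timeout, and viewport.
session_params = {
    "browserIdentifier": "browser-id-placeholder",  # from create_browser
    "sessionTimeoutSeconds": 900,                   # auto-terminate after 15 minutes
    "viewPort": {"width": 1920, "height": 1080},    # coordinate space for mouse actions
}
# dp_client = boto3.client("bedrock-agentcore", region_name="us-west-2")
# session = dp_client.start_browser_session(**session_params)
```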
# These helpers are included in the companion notebook repository
from helpers.browser import get_credentials, invoke, start_session, stop_session

creds, default_region = get_credentials()
BEDROCK_AGENTCORE_DP_ENDPOINT = f"https://bedrock-agentcore.{default_region}.amazonaws.com/"
sid = start_session(BEDROCK_AGENTCORE_DP_ENDPOINT, browser_id, region=default_region, credentials=creds)

# Wait for the session to initialize; adjust if needed for your environment
time.sleep(3)
The start_session helper sends a SigV4-signed PUT request to create the session and returns the sessionId. The invoke helper handles signing and dispatching individual actions.
Invoke an OS-level action
With the session running, you can dispatch OS-level actions through the invoke helper. Each call takes a single action; in this case, a left click at coordinates (600, 370) on the screen:
r = invoke(
    BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
    {"mouseClick": {"x": 600, "y": 370, "button": "LEFT"}},
    region=default_region, credentials=creds, browser_id=browser_id
)
print(f"Mouse click status: {r.status_code}, action: {r.json()['result']}")
The response tells you whether the action succeeded or failed. Coordinates map to screen pixels: if the session viewport is 1920×1080, valid x values range from 0 to 1919 and y values from 0 to 1079. Coordinates outside the screen dimensions return a ValidationException.
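Because out-of-bounds coordinates are rejected, it can help to clamp model-suggested coordinates to the viewport before dispatching them. A minimal sketch: clamp_to_viewport is a hypothetical helper, and the 1920×1080 default is an assumption, so use your session's actual viewport.

```python
def clamp_to_viewport(x, y, width=1920, height=1080):
    """Clamp (x, y) into the valid pixel range [0, width-1] x [0, height-1]."""
    return max(0, min(x, width - 1)), max(0, min(y, height - 1))

# A vision model occasionally returns coordinates slightly off-screen:
x, y = clamp_to_viewport(1935, -4)  # returns (1919, 0)
```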
Capture a screenshot
After each action, the agent needs to observe what happened. The screenshot action captures the full desktop and returns the image as a base64-encoded PNG:
import base64
from IPython.display import Image, display

r = invoke(
    BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
    {"screenshot": {"format": "PNG"}},
    region=default_region, credentials=creds, browser_id=browser_id
)
img_bytes = base64.b64decode(r.json()['result']['screenshot']['data'])
display(Image(img_bytes))
This is the observation step in the loop. The agent sends the screenshot to a vision model, which reasons about what's on screen and returns the next action to take. The cycle repeats until the workflow is complete.
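One way to implement that observation step is the Amazon Bedrock Converse API, which accepts raw image bytes in a message content block. This is a sketch, not the notebook's implementation; the model ID and prompt are illustrative.

```python
# Sketch: packaging a desktop screenshot for the Bedrock Converse API.
# img_bytes is the decoded PNG from the screenshot action.
def build_vision_messages(img_bytes, prompt):
    return [{
        "role": "user",
        "content": [
            {"image": {"format": "png", "source": {"bytes": img_bytes}}},
            {"text": prompt},
        ],
    }]

messages = build_vision_messages(
    b"...",  # placeholder; use the real screenshot bytes
    "Is a native print dialog visible? If so, return the Cancel button coordinates as JSON."
)
# bedrock = boto3.client("bedrock-runtime")
# response = bedrock.converse(modelId="<your-vision-model-id>", messages=messages)
```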
Putting it together: dismissing a print dialog
Here is the action-screenshot-reaction loop in practice. Suppose the agent navigates to a page that triggers window.print(), and a native print dialog appears. The agent can't interact with it through CDP, but it can with OS Level Actions. First, the agent captures a screenshot to see the current state of the screen:
r = invoke(
    BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
    {"screenshot": {"format": "PNG"}},
    region=default_region, credentials=creds, browser_id=browser_id
)
# Send the screenshot to a vision model to identify the dialog and locate the Cancel button.
# The vision model integration depends on your agent architecture; see the Bedrock
# InvokeModel API for how to send images to Claude or other models.
# The model returns coordinates, e.g.: {"x": 410, "y": 535}
The vision model identifies the print dialog and returns the coordinates of the Cancel button. The agent clicks it:
r = invoke(
    BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
    {"mouseClick": {"x": 410, "y": 535, "button": "LEFT"}},
    region=default_region, credentials=creds, browser_id=browser_id
)
print(f"Click status: {r.status_code}, action: {r.json()['result']}")
The agent takes another screenshot to confirm that the dialog was dismissed, and the workflow continues.
Stop the session and clean up
When the workflow is done, stop the session and clean up resources:
stop_session(BEDROCK_AGENTCORE_DP_ENDPOINT, sid, browser_id, region=default_region, credentials=creds)
To delete the browser resource and IAM role:
browser_boto3.delete_browser(browserId=browser_id)
print(f"Browser {browser_id} deleted")
from helpers.utils import delete_agentcore_execution_role, SAMPLE_ROLE_NAME
delete_agentcore_execution_role(SAMPLE_ROLE_NAME)
These steps (act, observe, decide) form the core of the action-screenshot-reaction pattern. The companion notebook walks through all eight supported actions in a live browser session, including mouse drag, scroll, keyboard input, and shortcut combinations.
Conclusion
When we launched Amazon Bedrock AgentCore Browser, it gave AI agents a fully managed, cloud-based browser environment to interact with websites. It navigated pages, extracted content, and automated workflows at scale through Playwright and CDP. OS Level Actions extend that capability beyond the web layer to UI elements visible on the screen. Native dialogs, security prompts, keyboard shortcuts, and browser chrome are no longer blockers. Agents can now observe, reason about, and act on the full OS desktop within the same session.
Combined with AgentCore Browser's existing capabilities, such as visual understanding and framework integration with Playwright and Amazon Nova Act, OS Level Actions close the last gap in browser automation coverage.
To start building:
About the authors
Evandro Franco
Evandro Franco is a Sr. Data Scientist at Amazon Web Services. He's part of the Global GTM team that helps AWS customers overcome business challenges related to AI/ML on AWS, primarily with Amazon Bedrock AgentCore and Strands Agents. He has more than 18 years of experience working with technology, from software development, infrastructure, and serverless to machine learning. In his free time, Evandro enjoys playing with his son, mainly building funny Lego bricks.
Phelipe Fabres
Phelipe Fabres is a Sr. Solutions Architect for Generative AI at AWS for Startups. He's part of a global Frontier AI team focused on customers that are building Foundation Models/LLMs/SLMs. He has extensive experience with agentic systems and software-driven AI systems. He has more than 10 years of experience with software development, from monoliths to event-driven architectures, and holds a Ph.D. in Graph Theory. In his free time, Phelipe enjoys playing with his daughter, mainly board games and drawing princesses.
Saurav Das
Saurav Das is part of the Amazon Bedrock AgentCore Product Management team. He has more than 15 years of experience working with cloud, data, and infrastructure technologies. He has a deep interest in solving customer challenges centered around data and AI infrastructure.
Yanda Hu
Yanda Hu is a software engineer on the Amazon Bedrock AgentCore Engineering team with 5+ years of experience building machine learning and AI solutions at scale. He specializes in designing and delivering scalable agentic systems. He's passionate about the emerging agentic AI landscape, focusing on helping customers overcome real-world challenges in agentic workflows.
Cristiano Scandura
Cristiano has been in the IT industry since 1998. He joined Amazon Web Services (AWS) in 2018, where he worked on projects for enterprise clients. Currently, he specializes in GenAI and machine learning (ML) projects for all industries in AWS Worldwide Public Sector.
Joshua Samuel
Joshua Samuel is a Senior AI/ML Specialist Solutions Architect at AWS who accelerates business transformation through AI/ML and generative AI solutions, based in Melbourne, Australia. A passionate disrupter, he specializes in agentic AI and coding methods; anything that makes developers faster and happier. Outside work, he tinkers with home automation and AI coding projects, and enjoys life with his wife, kids, and dogs.

