Training AI agents that can actually use a computer (opening apps, clicking buttons, browsing the web, writing code) is one of the hardest infrastructure problems in modern AI. It's not a data problem. It's not a model problem. It's a plumbing problem.
You need to spin up hundreds, possibly thousands, of full operating system environments with actual graphical user interfaces. Each one needs to run real software. Each one needs to handle unpredictable crashes. And you need all of them to run concurrently at a cost that doesn't bankrupt a university research lab.
That's the problem OSGym, a new framework from a team of researchers at MIT, UIUC, CMU, USC, UVA, and UC Berkeley, is designed to solve.
https://arxiv.org/pdf/2511.11672
What Is a Computer Use Agent?
Before unpacking the infrastructure, it helps to understand what a computer use agent actually is. Unlike a chatbot that responds to text prompts, a computer use agent observes a screenshot of a desktop, decides what to do (click a button, type text, open a file), and executes that action through keyboard and mouse inputs. Think of it as an AI that can operate any software the way a human would.
Models like Anthropic's Claude Computer Use and OpenAI's Operator are early commercial examples. Research models like UI-TARS, Agent-S2, and CogAgent are pushing the boundaries further. But training any of these systems requires massive amounts of interaction data generated inside real OS environments, and that's where things get expensive and complicated fast.
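Conceptually, every such agent runs the same observe-decide-act loop. The sketch below is a minimal toy illustration of that loop, not any particular product's API; ToyEnv stands in for a real OS sandbox and finishes after three steps.

```python
# Minimal observe-decide-act loop for a computer use agent.
# ToyEnv is a stand-in for a real OS sandbox (it "finishes" after 3 steps).

class ToyEnv:
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return "screenshot-0"                 # a real env returns pixel data

    def step(self, action):
        self.t += 1
        return f"screenshot-{self.t}", self.t >= 3   # (observation, done)

def run_episode(env, policy, max_steps=50):
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(obs)                  # e.g. {"type": "click", "x": 120, "y": 340}
        obs, done = env.step(action)          # executed via virtual mouse/keyboard
        trajectory.append(action)
        if done:
            break
    return trajectory

actions = run_episode(ToyEnv(), lambda obs: {"type": "click", "x": 1, "y": 1})
```

Training data for these agents is exactly such trajectories of (screenshot, action) pairs, collected at scale.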
The Core Problem: OS Sandboxes at Scale
A coding environment or a web browser sandbox is comparatively lightweight to run. A full OS sandbox with a GUI is not. Each virtual machine needs its own bootable disk (around 24 GB), its own CPU and RAM allocation, and its own display stack. Multiply that by hundreds or thousands of parallel instances and you have a resource consumption problem that typical academic compute budgets simply cannot absorb.
On top of resource costs, there is the reliability problem. Software crashes. Browser sessions time out. Applications freeze. If your training pipeline doesn't handle these failures gracefully, one bad VM can stall an entire training batch.
OSGym tackles both problems with four distinct architectural optimizations.
Decentralized OS State Management
The first design choice concerns how the system manages the state of each OS replica: tracking whether it is healthy, what task it is running, and how to recover it if something goes wrong.
A naive approach uses a single centralized manager for all replicas. That is a classic single point of failure: as the replica count grows into the thousands, the central manager becomes overwhelmed, latency increases, and one crash can halt the whole system. OSGym instead gives every OS replica its own dedicated state manager. Each state manager exposes public methods modeled after the OpenAI Gym API (reset, step, and shutdown) but handles its own health monitoring and crash recovery internally. A failure in one replica cannot propagate to any other.
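The per-replica design can be sketched as follows. This is an illustrative reconstruction of the described Gym-style surface (reset/step/shutdown with recovery handled internally), not OSGym's actual code; the backend object wrapping the container is a hypothetical stand-in.

```python
class ReplicaStateManager:
    """One manager per OS replica; failures stay local to this instance."""

    def __init__(self, replica_id, backend):
        self.replica_id = replica_id
        self.backend = backend        # wraps this replica's container/VM
        self.healthy = True

    def reset(self, task):
        try:
            return self.backend.reset(task)
        except Exception:
            self._recover()           # recover only this replica, then retry
            return self.backend.reset(task)

    def step(self, action):
        return self.backend.step(action)

    def shutdown(self):
        self.backend.shutdown()

    def _recover(self):
        # Restart only this replica's backend; no other replica is touched.
        self.backend.restart()
        self.healthy = True

class _FlakyBackend:
    """Toy backend whose first reset fails, to exercise local recovery."""
    def __init__(self):
        self.calls = 0
    def reset(self, task):
        self.calls += 1
        if self.calls == 1:
            raise RuntimeError("boot failure")
        return "ready"
    def step(self, action):
        return "ok"
    def shutdown(self):
        pass
    def restart(self):
        pass

mgr = ReplicaStateManager(0, _FlakyBackend())
state = mgr.reset("open-libreoffice")   # first attempt fails; recovery retries
```

The point of the pattern is that the try/except and restart logic live inside each instance, so a crash never has to travel through a shared coordinator.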
Hardware-Aware OS Replica Orchestration
Here is a non-obvious insight this research surfaces: when you run many OS replicas on a single server, the bottleneck depends on how many replicas you pack per machine. For a small number of replicas per server (low K), the system is CPU-bound, with most replicas fighting over processor time. But as you pack more replicas per server (large K), the bottleneck shifts to RAM, and RAM is dramatically cheaper than CPU.
A 32 GB DDR4 RAM module typically costs 10–20% of what a 16-core CPU costs. OSGym runs replicas as Docker containers (using Docker images from OSWorld as a foundation) rather than full virtual machines to reduce per-replica overhead. By choosing servers with higher RAM capacity and running more replicas per machine, the daily cost drops from around $300 for 128 replicas at K=1 to roughly $30 at K=64, or about $0.234 per replica per day, a figure that fits comfortably within many academic grant budgets.
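The cost arithmetic is simple enough to verify from the two daily fleet costs the paper reports:

```python
# Back-of-envelope check of the reported costs: $300/day at K=1 vs
# $30/day at K=64, both for a 128-replica fleet.

REPLICAS = 128
cost_k1, cost_k64 = 300.0, 30.0           # USD per day for the whole fleet

savings_factor = cost_k1 / cost_k64       # 10x cheaper at K=64
per_replica_k64 = cost_k64 / REPLICAS     # ~$0.234 per replica per day
```

Packing 64 replicas per (RAM-heavy, cheaper) server instead of one per (CPU-heavy) server is where the 10× reduction comes from.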
KVM Virtualization with Copy-on-Write Disk Management
The disk provisioning problem is solved with a filesystem technique called reflink copy-on-write (CoW). Normally, spinning up 128 VM instances would mean duplicating a 24 GB base image 128 times: over 3 TB of storage and 30 seconds of provisioning time per VM.
OSGym instead uses cp --reflink=always on XFS-formatted NVMe drives. Each per-VM disk image shares physical disk blocks with the base image and only allocates new blocks when the VM actually writes to them. The result: 128 VMs consume 366 GB of physical disk instead of 3.1 TB, an 88% reduction, and disk provisioning time drops from 30 seconds to 0.8 seconds per VM, a 37× speedup. Each VM still sees its full 24 GB logical disk with near-native CPU performance.
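To see where the 88% figure comes from: with plain copies, every VM stores the full 24 GB image, while with reflinks each clone stores only the blocks it actually rewrites. A quick sanity check against the paper's measured 366 GB:

```python
# Why reflink copy-on-write shrinks storage: 128 plain copies of a 24 GB
# base image vs. one shared base plus per-VM deltas. The 366 GB figure is
# the paper's measured physical usage, not derived here.

BASE_IMAGE_GB = 24
N_VMS = 128

naive_gb = BASE_IMAGE_GB * N_VMS        # 3072 GB, i.e. ~3.1 TB of duplicates
reflink_gb = 366                        # measured: shared base + written blocks

reduction = 1 - reflink_gb / naive_gb   # ~0.88, the reported 88% reduction
```

Provisioning is fast for the same reason: a reflink copy only creates metadata pointing at existing blocks, rather than physically copying 24 GB.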
Robust Container Pool with Multi-Layer Fault Recovery
OSGym maintains a pre-warmed runner pool (by default, 128 runners per executor node) initialized before training begins. Rather than creating and destroying VMs on demand, runners are recycled between tasks. Before each VM creation, OSGym reads /proc/meminfo and /proc/loadavg to verify that the host can safely accommodate another instance, blocking creation if available memory falls below 10% of total or below 8 GB absolute. Each container is memory-limited to 6 GB to prevent over-provisioning under burst conditions.
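The admission check is straightforward to reproduce. Below is a hedged sketch that applies the stated thresholds (10% of total, or 8 GB absolute) to /proc/meminfo-style text; it takes the text as an argument so it can be exercised off-host, whereas OSGym presumably reads the live file.

```python
# Admission-control sketch: refuse to start another replica when available
# memory drops below 10% of total or below an 8 GB absolute floor.
# Thresholds are the ones stated in the paper; everything else is illustrative.

GB_KB = 1024 * 1024  # /proc/meminfo reports values in kB

def parse_meminfo(text):
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest:
            fields[key.strip()] = int(rest.split()[0])  # value in kB
    return fields

def can_create_replica(meminfo_text, min_frac=0.10, min_abs_gb=8):
    info = parse_meminfo(meminfo_text)
    total, avail = info["MemTotal"], info["MemAvailable"]
    return avail / total >= min_frac and avail >= min_abs_gb * GB_KB

# ~252 GB host with only ~18 GB available: below the 10% line, so blocked.
sample = "MemTotal: 263886000 kB\nMemAvailable: 19000000 kB\n"
ok = can_create_replica(sample)
```

On the real host the same check would read open("/proc/meminfo").read() instead of a sample string.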
The system also tunes Linux kernel parameters that can otherwise cause silent failures at high concurrency: for example, fs.aio-max-nr is raised from 65,536 to 1,048,576, and fs.inotify.max_user_instances from 128 to 8,192. Fault recovery operates at two levels: at the step level, each action gets up to 10 retries by default; at the task level, if a runner fails entirely, the task is automatically reassigned to a fresh runner.
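The two-level recovery policy can be sketched as a pair of wrappers. This is an illustrative reconstruction of the described behavior (10 step-level retries, then task-level reassignment), not OSGym's code; acquire_runner stands in for drawing a fresh runner from the warm pool.

```python
# Two-level fault recovery: retry an action on the same runner up to
# `max_retries` times (step level); if the runner is beyond saving,
# reassign the whole task to a fresh runner (task level).

class RunnerDead(Exception):
    pass

def step_with_retries(runner_step, action, max_retries=10):
    last_err = None
    for _ in range(max_retries):
        try:
            return runner_step(action)
        except Exception as err:
            last_err = err
    raise RunnerDead(f"step failed after {max_retries} retries") from last_err

def run_task(task_actions, acquire_runner, max_reassignments=3):
    for _ in range(max_reassignments):
        runner_step = acquire_runner()     # fresh runner from the warm pool
        try:
            return [step_with_retries(runner_step, a) for a in task_actions]
        except RunnerDead:
            continue                       # task-level reassignment
    raise RuntimeError("task failed on every runner")

# Toy runner that times out twice, then succeeds within the retry budget.
calls = {"n": 0}
def flaky_step(action):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise IOError("session timeout")
    return action.upper()

results = run_task(["click"], lambda: flaky_step)
```

Transient faults (a frozen app, a timed-out browser session) are absorbed at the step level, so task-level reassignment only fires when a runner is genuinely unrecoverable.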
Unified Task Flow and Centralized Data Server
Two design elements are particularly important for developers integrating OSGym. First, every task follows a four-phase unified execution flow (Configure, Reset, Operate, Evaluate) regardless of which software or domain is involved. This standardization makes it easy to add new task types without changing the surrounding infrastructure.
Second, above the replica layer, a centralized data server Python class exposes a single-entry batched interface (__next__ and async_step) that hides all the complexity of state manager communication and queuing. The batched step method is asynchronous, meaning the training loop is not blocked while waiting for OS replicas to complete their actions.
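The batched asynchronous interface can be approximated with asyncio. The name async_step mirrors the described API, but the internals below are a simplified stand-in: a real data server would also handle queuing, timeouts, and replica recovery.

```python
import asyncio

# Simplified sketch of a centralized data server: one async_step fans a
# batch of actions out to all replica state managers concurrently, so the
# training loop never waits on replicas one at a time.

class DataServer:
    def __init__(self, replicas):
        self.replicas = replicas          # one state manager per replica

    async def async_step(self, actions):
        # Fan out concurrently; a slow replica delays only its own slot.
        return await asyncio.gather(
            *(r.step(a) for r, a in zip(self.replicas, actions))
        )

class ToyReplica:
    """Stand-in replica with variable OS-side latency."""
    def __init__(self, rid, delay):
        self.rid, self.delay = rid, delay
    async def step(self, action):
        await asyncio.sleep(self.delay)   # simulate GUI action latency
        return (self.rid, action)

server = DataServer([ToyReplica(i, 0.01 * i) for i in range(4)])
results = asyncio.run(server.async_step(["a", "b", "c", "d"]))
```

asyncio.gather returns results in submission order, so the training loop gets a batch aligned with the actions it sent, while the wall-clock cost is that of the slowest replica rather than the sum of all of them.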
What the Numbers Look Like in Practice
Using 1,024 parallel OS replicas, the system collected trajectories across ten task categories, including LibreOffice Writer, Calc, and Impress, Chrome, Thunderbird, VLC, VS Code, GIMP, OS system configuration, and multi-app workflows, at roughly 1,420 trajectories per minute, compared with 115,654 seconds for the same collection without parallelization. The full dataset cost $43 in cloud compute.
The research team then used that data to fine-tune Qwen2.5-VL 32B via supervised fine-tuning, followed by reinforcement learning with a PPO-based semi-online asynchronous pipeline (200 steps, batch size 64, learning rate 1e-6). The resulting model achieved a 56.3% success rate on the OSWorld-Verified benchmark, competitive with existing methods for a 32B-parameter base model with no task-specific tuning.
Key Takeaways
- Training computer use agents is an infrastructure problem first: Full OS sandboxes with GUIs are far heavier than coding or browser environments; each VM needs ~24 GB of disk, dedicated CPU and RAM, and a display stack. Without careful optimization, scaling to hundreds of replicas is simply unaffordable for most academic labs.
- RAM is a better scaling lever than CPU: OSGym's hardware-aware orchestration shows that packing more replicas per server shifts the bottleneck from CPU to RAM, and RAM is 5–10× cheaper. This single insight cuts per-replica cost from ~$2.10/day to as little as $0.23/day.
- Copy-on-write disk management eliminates the storage wall: By using XFS reflink CoW (cp --reflink=always), OSGym reduces physical disk consumption by 88% and speeds up VM disk provisioning by 37×, turning a 3.1 TB, 30-second-per-VM problem into a 366 GB, 0.8-second one.
- Decentralized state management is the key to robustness at scale: Giving each OS replica its own dedicated state manager keeps failures isolated. Even starting from a fully crashed state, OSGym self-recovers all replicas within a short window, which is critical for uninterrupted long-running training jobs.
- Academic-scale computer use agent research is now financially viable: With 1,024 replicas generating 1,420 trajectories per minute and a full dataset costing just $43 in cloud compute, OSGym brings the infrastructure cost of training general-purpose computer agents within reach of university research budgets.

