OpenAI Releases GPT-5.5, a Fully Retrained Agentic Model That Scores 82.7% on Terminal-Bench 2.0 and 84.9% on GDPval

OpenAI has released GPT-5.5, its most capable model to date and the first fully retrained base model since GPT-4.5. GPT-5.5 is designed to complete complex, multi-step computer tasks with minimal human direction. Think of it as the difference between an assistant who needs a checklist and one who understands the underlying goal and figures out the steps themselves. The release is rolling out today to Plus, Pro, Business, and Enterprise subscribers across ChatGPT and Codex.

What ‘Agentic’ Actually Means Here

An agentic model doesn’t just respond to a single prompt. It takes a sequence of actions, uses tools (like browsing the web, writing code, running scripts, or operating software), checks its own work, and keeps going until the task is done. Prior models often stalled at handoff points, requiring the user to re-prompt or correct course. GPT-5.5 is built to reduce those interruptions.
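The act, check, and continue cycle described above can be sketched as a small loop. This is only an illustration of the general pattern, not a description of how GPT-5.5 works internally: `run_agent`, `plan`, `is_done`, and the two toy tools are all hypothetical names, and the "planner" here is a hard-coded stand-in for a model.

```python
from typing import Callable, Dict, List, Tuple

# One recorded step: (tool name, tool input, tool result).
Step = Tuple[str, str, str]


def run_agent(
    goal: str,
    plan: Callable[[str, List[Step]], Tuple[str, str]],
    tools: Dict[str, Callable[[str], str]],
    is_done: Callable[[str, List[Step]], bool],
    max_steps: int = 10,
) -> List[Step]:
    """Generic agent loop: choose a tool, run it, check the result, repeat."""
    history: List[Step] = []
    for _ in range(max_steps):
        tool_name, tool_input = plan(goal, history)   # decide the next action
        result = tools[tool_name](tool_input)         # use a tool
        history.append((tool_name, tool_input, result))
        if is_done(goal, history):                    # check its own work
            break
    return history


# --- Toy, hypothetical environment used only to exercise the loop ---
workspace = {"code": ""}


def write_code(version: str) -> str:
    workspace["code"] = version
    return f"wrote {version}"


def run_tests(_: str) -> str:
    return "passed" if workspace["code"] == "fix" else "failed"


def plan(goal: str, history: List[Step]) -> Tuple[str, str]:
    if not history:
        return ("write_code", "draft")
    last_tool, _, _ = history[-1]
    if last_tool == "write_code":
        return ("run_tests", "")
    # The tests ran and did not pass, so try a revised version.
    return ("write_code", "fix")


def is_done(goal: str, history: List[Step]) -> bool:
    return bool(history) and history[-1][2] == "passed"


history = run_agent(
    "make the tests pass",
    plan,
    {"write_code": write_code, "run_tests": run_tests},
    is_done,
)
for step in history:
    print(step)
```

In this toy run the loop writes a draft, sees the tests fail, writes a fix, and stops once the tests pass, without any re-prompting in between. The `max_steps` cap is a design choice to keep a confused planner from looping forever.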

OpenAI introduced GPT-5.5 as a model targeted at agentic computer use: it writes and debugs code, browses the web, fills out spreadsheets, and keeps working through multi-step tasks without requiring a human to supervise every move.

The Four Domains Where Gains Are Concentrated

The gains are concentrated in four areas: agentic coding, computer use, knowledge work, and early scientific research. OpenAI describes these as domains ‘where progress depends on reasoning across context and taking action over time.’

For software engineers, the most immediately relevant benchmark is SWE-Bench Pro, which evaluates real-world GitHub issue resolution across four programming languages. GPT-5.5 resolves 58.6% of tasks end-to-end in a single pass. Worth noting: Claude Opus 4.7 scores higher at 64.3% on this same benchmark, though OpenAI has noted that Anthropic reported signs of memorization on a subset of those problems, which may affect the comparison.

For long-horizon coding specifically, OpenAI also reports results on Expert-SWE, an internal benchmark measuring tasks with a median estimated human completion time of 20 hours. GPT-5.5 outperforms GPT-5.4 on Expert-SWE. This benchmark is significant because it reflects the kind of extended, multi-session engineering work (large refactors, feature builds, debugging deep in a codebase) that agentic tools are increasingly being asked to handle autonomously.

Developers who tested the system early said GPT-5.5 has a better grasp of the “shape” of a software system, and can better understand why something is failing, where the fix is needed, and what else in the codebase would be affected.

https://openai.com/index/introducing-gpt-5-5/

For ML engineers and data scientists who spend significant time in terminal environments orchestrating pipelines and debugging scripts, the Terminal-Bench 2.0 results are the most compelling signal. GPT-5.5 scores 82.7% on Terminal-Bench 2.0, which tests complex command-line workflows requiring planning, iteration, and tool coordination, beating Claude Opus 4.7 at 69.4% and Gemini 3.1 Pro at 68.5%. That is a lead of more than 13 percentage points over both, not a marginal one.

For broader knowledge work, GPT-5.5 scores 84.9% on GDPval, which tests agents across 44 occupations of knowledge work. On OSWorld-Verified, a benchmark measuring whether a model can autonomously operate real computer environments, it reaches 78.7%.

GPT-5.5 also ships with a Pro variant built for harder tasks that demand higher accuracy. On BrowseComp, which tests a model’s ability to track down hard-to-find information across the web, GPT-5.5 Pro scores 90.1%, ahead of Gemini 3.1 Pro at 85.9%. The model is also the top-ranked system on the Artificial Analysis Intelligence Index.
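To make the head-to-head comparisons easier to read, the scores quoted in this section can be turned into percentage-point gaps. The sketch below uses only the numbers reported above; the `reported` dictionary and the `gaps` helper are illustrative names, and benchmarks where the article gives no competitor score (GDPval, OSWorld-Verified) are left out rather than guessed.

```python
# Scores in percent, exactly as quoted in the article.
reported = {
    "SWE-Bench Pro": {"GPT-5.5": 58.6, "Claude Opus 4.7": 64.3},
    "Terminal-Bench 2.0": {
        "GPT-5.5": 82.7,
        "Claude Opus 4.7": 69.4,
        "Gemini 3.1 Pro": 68.5,
    },
    "BrowseComp": {"GPT-5.5 Pro": 90.1, "Gemini 3.1 Pro": 85.9},
}


def gaps(scores: dict, reference: str) -> dict:
    """Percentage-point lead of `reference` over every other model.

    A negative value means the reference model trails that competitor.
    """
    return {
        model: round(scores[reference] - score, 1)
        for model, score in scores.items()
        if model != reference
    }


for benchmark, scores in reported.items():
    # The OpenAI entry is the first key in each inner dict.
    reference = next(iter(scores))
    print(benchmark, reference, gaps(scores, reference))
```

Rounding to one decimal place avoids floating-point noise in the subtraction. The output shows the SWE-Bench Pro gap as negative, which matches the article's note that Claude Opus 4.7 leads on that benchmark.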

Speed and Token Efficiency

One concern with more capable models is that they tend to be slower or more expensive to run. OpenAI addressed this directly. GPT-5.5 matches GPT-5.4’s per-token latency in real-world serving while performing better across nearly every evaluation measured. It also uses significantly fewer tokens to complete the same Codex tasks, meaning shorter, more efficient runs even on complex agentic workflows.

On pricing, the standard GPT-5.5 API will be charged at $5 per million input tokens and $30 per million output tokens. For context, GPT-5.4 was priced at $2.50 per million input tokens and $15 per million output tokens, so the per-token price has doubled. OpenAI’s team argued that token efficiency gains offset the cost, since GPT-5.5 completes the same Codex tasks with fewer tokens, meaning cheaper runs overall even at the higher per-token rate. GPT-5.5 Pro, the higher-accuracy variant, is priced at $30 per million input tokens and $180 per million output tokens in the API.

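For teams running Codex at scale, the net math is what matters: if GPT-5.5 completes a task in materially fewer tokens than GPT-5.4, the effective cost per completed workflow can still come out lower despite the higher rate. Because both the input and output rates exactly doubled, the break-even point is a 50% reduction in tokens per task, assuming the mix of input and output tokens stays the same. Any reduction beyond that is a net saving; anything short of it is a net increase.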
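The net math can be worked through with a short calculation. The per-million rates below are the ones quoted in the article; the baseline token counts and the reduction percentages are made-up illustrative values, not measurements, and the sketch assumes input and output tokens shrink by the same fraction.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# Rates as quoted in the article.
GPT_5_4 = Pricing(input_per_million=2.50, output_per_million=15.00)
GPT_5_5 = Pricing(input_per_million=5.00, output_per_million=30.00)


def task_cost(pricing: Pricing, input_tokens: float, output_tokens: float) -> float:
    """USD cost of one task at the given per-million-token rates."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


# Hypothetical baseline: tokens GPT-5.4 spends on one Codex task.
BASELINE_INPUT, BASELINE_OUTPUT = 400_000, 60_000
old_cost = task_cost(GPT_5_4, BASELINE_INPUT, BASELINE_OUTPUT)


def new_cost(token_reduction: float) -> float:
    """GPT-5.5 cost if it needs `token_reduction` (0..1) fewer tokens."""
    scale = 1.0 - token_reduction
    return task_cost(GPT_5_5, BASELINE_INPUT * scale, BASELINE_OUTPUT * scale)


for reduction in (0.3, 0.5, 0.6):
    cost = new_cost(reduction)
    verdict = (
        "same" if math.isclose(cost, old_cost)
        else "cheaper" if cost < old_cost
        else "pricier"
    )
    print(f"{reduction:.0%} fewer tokens: ${cost:.2f} vs ${old_cost:.2f} ({verdict})")
```

With these assumed numbers, a 30% token reduction still leaves the GPT-5.5 run more expensive, 50% breaks even, and 60% comes out cheaper. If a real workload's input and output tokens shrink by different amounts, the break-even shifts, so it is worth plugging in measured token counts.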

Scale and Adoption

OpenAI has seen a surge in Codex usage, with about 4 million developers using the tool weekly. That scale matters for understanding the deployment context: GPT-5.5 is not a research preview but a production model being pushed to a large, active developer base immediately on launch.

Key Takeaways

GPT-5.5 is OpenAI’s first fully retrained base model since GPT-4.5, designed specifically for agentic workflows: it can understand complex goals, use tools, check its own work, and carry multi-step tasks through to completion with minimal human direction.
The biggest performance gains are in agentic coding, computer use, knowledge work, and early scientific research. GPT-5.5 scores 82.7% on Terminal-Bench 2.0, 84.9% on GDPval, and 78.7% on OSWorld-Verified, outperforming both Claude Opus 4.7 and Gemini 3.1 Pro on several key benchmarks.
GPT-5.5 matches GPT-5.4’s per-token latency while being more capable across nearly every benchmark. It also uses significantly fewer tokens to complete the same Codex tasks, meaning better results without a proportional increase in latency or cost per completed workflow.
API pricing increases to $5/M input tokens and $30/M output tokens (up from $2.50 and $15 for GPT-5.4), with GPT-5.5 Pro priced at $30/M input and $180/M output. OpenAI’s team argues token efficiency gains offset the higher per-token rate for most workloads.
GPT-5.5 is rolling out today to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, with roughly 4 million developers already using Codex weekly.

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
