


OpenAI has just launched GPT-5.3-Codex, a new agentic coding model that extends Codex from writing and reviewing code to handling a broad range of work on a computer. The model combines the frontier coding performance of GPT-5.2-Codex with the reasoning and professional knowledge capabilities of GPT-5.2 in a single system, and it runs 25% faster for Codex users thanks to infrastructure and inference improvements.

For developers, GPT-5.3-Codex is positioned as a coding agent that can execute long-running tasks involving research, tool use, and complex execution, while remaining steerable ‘much like a colleague’ during a run.

Frontier agentic capabilities and benchmark results

OpenAI evaluates GPT-5.3-Codex on four key benchmarks that target real-world coding and agentic behavior: SWE-Bench Pro, Terminal-Bench 2.0, OSWorld-Verified, and GDPval.

https://openai.com/index/introducing-gpt-5-3-codex/

On SWE-Bench Pro, a contamination-resistant benchmark built from real GitHub issues and pull requests across four languages, GPT-5.3-Codex reaches 56.8% with xhigh reasoning effort. This is a slight improvement over GPT-5.2-Codex and GPT-5.2 at the same effort level. Terminal-Bench 2.0, which measures the terminal skills that coding agents need, shows a larger gap: GPT-5.3-Codex reaches 77.3%, significantly higher than previous models.


On OSWorld-Verified, an agentic computer-use benchmark where agents complete productivity tasks in a visual desktop environment, GPT-5.3-Codex reaches 64.7%. Humans score around 72% on this benchmark, which gives a rough human-level reference point.

For professional knowledge work, GPT-5.3-Codex is evaluated with GDPval, an evaluation released in 2025 that measures performance on well-specified tasks across 44 occupations. GPT-5.3-Codex achieves 70.9% wins or ties on GDPval, matching GPT-5.2 at high reasoning effort. These tasks include creating presentations, spreadsheets, and other work products that align with typical professional workflows.

A notable systems detail is that GPT-5.3-Codex achieves its results with fewer tokens than previous models, allowing users to “build more” within the same context and cost budgets.

Beyond coding: GDPval and OSWorld

OpenAI emphasizes that software developers, designers, product managers, and data scientists perform a wide range of tasks beyond code generation. GPT-5.3-Codex is built to assist across the software lifecycle: debugging, deployment, monitoring, writing PRDs, editing copy, and working on user research, tests, and metrics.

With custom skills similar to those used in prior GDPval experiments, GPT-5.3-Codex produces complete work products. Examples in the official OpenAI blog include financial advice slide decks, a retail training document, an NPV analysis spreadsheet, and a fashion presentation. Each GDPval task is designed by a domain expert and reflects realistic work from that occupation.


On OSWorld, GPT-5.3-Codex demonstrates stronger computer-use capabilities than previous GPT models. OSWorld-Verified requires the model to use vision to complete diverse tasks in a desktop environment, which aligns closely with how agents operate real applications and tools instead of only generating text.

An interactive collaborator in the Codex app

As models become more capable, OpenAI frames the main challenge as human supervision and control of many agents working in parallel. The Codex app is designed to make managing and directing agents easier, and with GPT-5.3-Codex it gains more interactive behavior.

Codex now provides frequent updates during a run so users can see key decisions and progress. Instead of waiting for a single final output, users can ask questions, discuss approaches, and steer the model in real time. GPT-5.3-Codex explains what it is doing and responds to feedback while keeping context. This ‘follow-up behavior’ can be configured in the Codex app settings.

A model that helped train and deploy itself

GPT-5.3-Codex is the first model in this family that was ‘instrumental in creating itself.’ OpenAI used early versions of GPT-5.3-Codex to debug its own training, manage deployment, and diagnose test results and evaluations.

The OpenAI research team used Codex to monitor and debug the training run, track patterns across the training process, analyze interaction quality, propose fixes, and build applications that visualize behavioral differences relative to prior models. The development team used Codex to optimize and adapt the serving harness, identify context rendering bugs, find the root causes of low cache hit rates, and dynamically scale GPU clusters to maintain stable latency under traffic surges.

During alpha testing, a researcher asked GPT-5.3-Codex to quantify the additional work completed per turn and the effect on productivity. The model generated regex-based classifiers to estimate clarification frequency, positive and negative responses, and task progress, then ran these over session logs and produced a report. Codex also helped build new data pipelines and richer visualizations when standard dashboard tools were insufficient, and it summarized insights from thousands of data points in under 3 minutes.
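To make the idea of regex-based classifiers over session logs concrete, here is a minimal Python sketch. It is not OpenAI's code: the patterns, the label names, and the helper functions `classify_turn` and `summarize` are all hypothetical, and the log format (one string per user turn) is an assumption made for illustration only.

```python
import re
from collections import Counter

# Hypothetical patterns: the actual classifiers generated by the model
# are not published, so these are illustrative stand-ins.
PATTERNS = {
    "clarification": re.compile(r"\b(did you mean|can you clarify|which (file|one))\b", re.I),
    "positive": re.compile(r"\b(thanks|great|perfect|works now)\b", re.I),
    "negative": re.compile(r"\b(wrong|still broken|doesn't work|not what i)\b", re.I),
}


def classify_turn(text):
    """Return the set of labels whose pattern matches this turn."""
    return {label for label, pattern in PATTERNS.items() if pattern.search(text)}


def summarize(turns):
    """Return, per label, the fraction of turns that matched it."""
    counts = Counter()
    for turn in turns:
        counts.update(classify_turn(turn))
    total = len(turns)
    return {label: (counts[label] / total if total else 0.0) for label in PATTERNS}


if __name__ == "__main__":
    # Assumed log format: one string per user turn.
    session = [
        "Did you mean the auth module?",
        "Thanks, works now",
        "That is wrong",
        "ok",
    ]
    for label, rate in summarize(session).items():
        print(f"{label}: {rate:.2f}")
```

Keyword patterns like these are cheap to run over large volumes of logs, but they only give rough estimates. A real analysis would need to validate them against hand-labeled samples.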

Cybersecurity capabilities and safeguards

GPT-5.3-Codex is the first model OpenAI classifies as ‘High capability’ for cybersecurity-related tasks under its Preparedness Framework and the first model it has trained directly to identify software vulnerabilities. OpenAI states that it has no definitive evidence that the model can automate cyber attacks end-to-end, and it is taking a precautionary approach with its most comprehensive cybersecurity safety stack to date.

Mitigations include safety training, automated monitoring, trusted access for advanced capabilities, and enforcement pipelines that incorporate threat intelligence. OpenAI is launching a ‘Trusted Access for Cyber’ pilot, expanding the private beta of Aardvark, a security research agent, and providing free codebase scanning for widely used open-source projects such as Next.js, where Codex was recently used to identify disclosed vulnerabilities.

Key Takeaways

  • Unified frontier model for coding and work: GPT-5.3-Codex combines the coding strength of GPT-5.2-Codex with the reasoning and professional capabilities of GPT-5.2 in a single agentic model, and runs 25% faster in Codex.
  • State-of-the-art on coding and agent benchmarks: The model sets new highs on SWE-Bench Pro (56.8% at xhigh) and Terminal-Bench 2.0 (77.3%), and achieves 64.7% on OSWorld-Verified and 70.9% wins or ties on GDPval, often with fewer tokens than previous models.
  • Supports long-horizon web and app development: Using skills such as ‘develop web game’ and generic follow-ups like ‘fix the bug’ and ‘improve the game,’ GPT-5.3-Codex autonomously developed complex racing and diving games over millions of tokens, demonstrating sustained multi-step development ability.
  • Instrumental in its own training and deployment: Early versions of GPT-5.3-Codex were used to debug the training run, analyze behavior, optimize the serving stack, build custom pipelines, and summarize large-scale alpha logs, making it the first Codex model ‘instrumental in creating itself.’
  • High-capability cyber model with guarded access: GPT-5.3-Codex is the first OpenAI model rated ‘High capability’ for cyber and the first trained directly to identify software vulnerabilities. OpenAI pairs this with Trusted Access for Cyber, an expanded Aardvark beta, and free codebase scanning for projects such as Next.js.


