Introduction
OpenAI has launched gpt-oss-120b and gpt-oss-20b, a brand-new collection of open-weight reasoning models. Released under the Apache 2.0 license, these text-only models are designed for robust instruction following, tool use, and strong reasoning, making them well suited for integration into advanced agentic workflows. The release reflects OpenAI's ongoing commitment to enabling innovation and encouraging collaborative safety within the AI community.
A key question is how these models compare to other leading offerings in the fast-moving open- and semi-open-weight ecosystem. In this blog, we look at GPT-OSS in detail and compare its capabilities with models like GLM-4.5, Qwen3-Thinking, DeepSeek-R1, and Kimi K2.
GPT-OSS: Architecture and Core Strengths
The gpt-oss models build on the foundations of GPT-2 and GPT-3, incorporating a Mixture-of-Experts (MoE) design to improve efficiency during both training and inference. This approach activates only a subset of parameters per token, giving the models the scale of very large systems while keeping compute costs under control.
There are two models in the family:
gpt-oss-120b: 116.8 billion total parameters, with about 5.1 billion active per token across 36 layers.
gpt-oss-20b: 20.9 billion total parameters, with 3.6 billion active per token across 24 layers.
Both models share several architectural choices:
Residual stream dimension of 2880.
Grouped Query Attention with 64 query heads and 8 key-value heads.
Rotary position embeddings for improved contextual reasoning.
Extended context length of 131,072 tokens using YaRN.
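Grouped Query Attention matters because it shrinks the key-value cache that must be held in GPU memory at long context lengths. A back-of-the-envelope sketch of the effect (the head dimension of 64 and 2-byte cache elements below are illustrative assumptions, not official specs):

```python
# Rough KV-cache comparison: full multi-head attention (one KV head per query
# head) vs Grouped Query Attention with 8 shared KV heads.

def kv_cache_bytes(n_kv_heads: int, head_dim: int, n_layers: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # 2 tensors (K and V), each of shape [n_layers, seq_len, n_kv_heads, head_dim]
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

# gpt-oss-120b style: 36 layers, 131,072-token context (head_dim=64 assumed)
mha = kv_cache_bytes(n_kv_heads=64, head_dim=64, n_layers=36, seq_len=131_072)
gqa = kv_cache_bytes(n_kv_heads=8,  head_dim=64, n_layers=36, seq_len=131_072)

print(f"MHA cache: {mha / 2**30:.1f} GiB")  # -> MHA cache: 72.0 GiB
print(f"GQA cache: {gqa / 2**30:.1f} GiB")  # -> GQA cache: 9.0 GiB
print(f"Reduction: {mha // gqa}x")          # -> Reduction: 8x
```

Under these assumptions, sharing 8 KV heads across 64 query heads cuts the full-context cache by 8x, which is a large part of what makes the 131k context practical.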
To make deployment practical, OpenAI applied MXFP4 quantization to the MoE weights. This allows the 120-billion-parameter model to run on a single 80 GB GPU and the 20-billion-parameter variant to operate on hardware with as little as 16 GB of memory.
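A rough arithmetic sketch shows why MXFP4 makes these GPUs sufficient. The specific assumptions here (roughly 4.25 bits per MoE weight once per-block scales are included, about 90% of parameters living in the MoE layers, and the rest kept at 16-bit precision) are illustrative guesses, not published figures:

```python
# Back-of-the-envelope weight-memory estimate under MXFP4 quantization.
# Assumptions (not official numbers): MXFP4 stores 4-bit values plus a shared
# scale per block, ~4.25 bits/weight; ~90% of parameters are MoE weights; the
# remaining ~10% (attention, embeddings) stay in 16-bit precision.

def weight_gb(total_params: float, moe_frac: float = 0.9,
              moe_bits: float = 4.25, other_bits: float = 16.0) -> float:
    bits = total_params * (moe_frac * moe_bits + (1 - moe_frac) * other_bits)
    return bits / 8 / 1e9  # bits -> bytes -> GB

print(f"gpt-oss-120b weights: ~{weight_gb(116.8e9):.0f} GB")
print(f"gpt-oss-20b weights:  ~{weight_gb(20.9e9):.0f} GB")
```

Under these assumptions the 120b weights land just under 80 GB and the 20b weights around 14 GB, consistent with the single-GPU and 16 GB targets (activations and KV cache add overhead on top).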
Another notable feature is variable reasoning effort. Developers can specify "low," "medium," or "high" reasoning levels via the system prompt, which dynamically adjusts the length of the chain-of-thought (CoT). This provides flexibility in balancing accuracy, latency, and compute cost.
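A minimal sketch of what this looks like in practice, assuming an OpenAI-compatible chat endpoint. The `Reasoning: high` phrasing follows the gpt-oss model card's system-prompt convention; check your serving stack's documentation for the exact syntax it expects:

```python
# Sketch: selecting reasoning effort through the system prompt.
# The request shape is standard OpenAI-style chat; "Reasoning: <level>" is the
# gpt-oss convention for setting chain-of-thought effort.

def build_request(question: str, effort: str = "medium") -> dict:
    if effort not in {"low", "medium", "high"}:
        raise ValueError(f"unknown reasoning effort: {effort}")
    return {
        "model": "gpt-oss-120b",
        "messages": [
            {"role": "system", "content": f"Reasoning: {effort}"},
            {"role": "user", "content": question},
        ],
    }

req = build_request("Prove that sqrt(2) is irrational.", effort="high")
print(req["messages"][0]["content"])  # -> Reasoning: high
```

Higher effort lengthens the CoT (better accuracy, more latency and tokens); lower effort does the opposite, so the same deployed model can serve both quick lookups and hard reasoning tasks.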
The models are also trained with built-in support for agentic workflows, including:
A browsing tool for real-time web search and retrieval.
A Python tool for stateful code execution in a Jupyter-like environment.
Support for custom developer functions, enabling complex workflows that interleave reasoning, tool use, and user interaction.
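A custom developer function is typically exposed to the model as a JSON-schema tool definition. The sketch below uses the common OpenAI-style tool format; the `get_order_status` function and its fields are hypothetical examples, not part of the gpt-oss release:

```python
import json

# Hypothetical developer function exposed to the model as an OpenAI-style tool.
get_order_status = {
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the shipping status of a customer order.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The customer's order identifier.",
                },
            },
            "required": ["order_id"],
        },
    },
}

# The tool list rides alongside the chat messages; the model can then
# interleave reasoning, tool calls, and user-facing replies in one session.
payload = {"model": "gpt-oss-120b", "tools": [get_order_status]}
print(json.dumps(payload["tools"][0]["function"]["name"]))
```

When the model decides to call the tool, it emits a structured tool call with arguments matching this schema; your code executes the function and returns the result as a tool message for the model to reason over.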
GPT-OSS in Context: Comparing Performance Across Models
The open-model ecosystem is full of capable contenders, including GLM-4.5, Qwen3 Thinking, DeepSeek R1, and Kimi K2, each with different strengths and trade-offs. Comparing them with GPT-OSS gives a clearer view of how these models perform across reasoning, coding, and agentic workflows.
Reasoning and Knowledge
On broad knowledge and reasoning tasks, GPT-OSS delivers some of the highest scores relative to its size.
On MMLU-Pro, GPT-OSS-120b reaches 90.0%, ahead of GLM-4.5 (84.6%), Qwen3 Thinking (84.4%), DeepSeek R1 (85.0%), and Kimi K2 (81.1%).
On competition-style math tasks, GPT-OSS shines: on AIME 2024 it hits 96.6% with tools, and on AIME 2025 it pushes to 97.9%, outperforming all the others.
On GPQA, a PhD-level science benchmark, GPT-OSS-120b achieves 80.9% with tools, comparable to GLM-4.5 (79.1%) and DeepSeek R1 (81.0%), and just behind Qwen3 Thinking (81.1%).
What makes these numbers notable is the balance between model size and performance. GPT-OSS-120b has 116.8B total parameters, with only 5.1B active per token thanks to its Mixture-of-Experts design. GLM-4.5 and Qwen3 Thinking are significantly larger models, which partly explains their strong tool-use and coding results. DeepSeek R1 also leans toward higher parameter counts and deeper token usage for reasoning tasks (up to 20k tokens per query), while Kimi K2 is tuned as a smaller but more specialized instruct model.
This means GPT-OSS reaches frontier-level reasoning scores while using fewer active parameters, making it more efficient for developers who want deep reasoning without the cost of running very large dense models.
Coding and Software Engineering
Modern AI coding benchmarks focus on a model's ability to understand large codebases, make modifications, and execute multi-step reasoning.
On SWE-bench Verified, GPT-OSS-120b scores 62.4%, close to GLM-4.5 (64.2%) and DeepSeek R1 (≈65.8% in agentic mode).
On Terminal-Bench, GLM-4.5 leads with 37.5%, followed by Kimi K2 at around 30%.
GLM-4.5 also shows strong results in head-to-head agentic coding tasks, with over 50% win rates against Kimi K2 and over 80% against Qwen3, while maintaining a high success rate in tool-based coding workflows.
Here again, model size matters. GLM-4.5 is a much larger model than GPT-OSS-120b, which gives it an edge in agentic coding workflows. But for developers who want solid code-editing capabilities in a model that can run on a single 80 GB GPU, GPT-OSS offers an appealing balance.
Agentic Tool Use and Function Calling
Agentic capabilities, where a model autonomously calls tools, executes functions, and solves multi-step tasks, are increasingly important.
On TAU-bench Retail, GPT-OSS-120b scores 67.8%, compared to GLM-4.5's 79.7% and Kimi K2's 70.6%.
On BFCL-v3, a function-calling benchmark, GLM-4.5 leads with 77.8%, followed by Qwen3 Thinking at 71.9% and GPT-OSS at around 67–68%.
These results highlight a trade-off: GLM-4.5 dominates function-calling and agentic workflows, but it does so as a significantly larger, more resource-intensive model. GPT-OSS delivers competitive results while staying accessible to developers who can't afford multi-GPU clusters.
Putting It All Together
Here's a quick snapshot of how these models stack up:
Benchmark | GPT-OSS-120b (High) | GLM-4.5 | Qwen3 Thinking | DeepSeek R1 | Kimi K2
---|---|---|---|---|---
MMLU-Pro | 90.0% | 84.6% | 84.4% | 85.0% | 81.1%
AIME 2024 | 96.6% (with tools) | ~91% | ~91.4% | ~87.5% | ~69.6%
AIME 2025 | 97.9% (with tools) | ~92% | ~92.3% | ~87.5% | ~49.5%
GPQA Diamond (science) | ~80.9% (with tools) | 79.1% | 81.1% | 81.0% | 75.1%
SWE-bench Verified | 62.4% | 64.2% | — | ~65.8% | 65.8% (agentic)
TAU-bench Retail | 67.8% | 79.7% | ~67.8% | ~63.9% | ~70.6%
BFCL-v3 Function Calling | ~67–68% | 77.8% | 71.9% | 37.0% | —
Key takeaways:
GPT-OSS punches above its weight on reasoning and long-form CoT tasks while using fewer active parameters.
GLM-4.5 is a heavyweight model that excels at agentic workflows and function-calling but requires far more compute.
DeepSeek R1 and Qwen3 offer strong hybrid reasoning performance at larger sizes, while Kimi K2 targets agentic coding workflows with a smaller, more specialized setup.
Conclusion
GPT-OSS brings frontier-level reasoning and long-form CoT capabilities with a smaller active-parameter footprint than many dense models. GLM-4.5 leads in agentic workflows and function-calling but requires significantly more compute. DeepSeek R1 and Qwen3 deliver strong hybrid reasoning at larger scales, while Kimi K2 focuses on specialized coding workflows with a compact setup.
This makes GPT-OSS a compelling balance of reasoning performance, coding ability, and deployment efficiency, well suited for experimentation, integration into agentic systems, or resource-conscious production workloads.
If you want to try the GPT-OSS-20B model, its smaller size makes it practical to run locally on your own hardware using Ollama and expose it via a public API with Clarifai's Local Runners, giving you full control over your compute and keeping your data local. Check out the tutorial here.
If you want to test the full-scale GPT-OSS-120B model, you can try it directly in the playground here.
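Once Ollama is serving the model locally (after `ollama pull gpt-oss:20b`), you can talk to it through its OpenAI-compatible endpoint on the default port 11434. A minimal sketch, using only the standard library; the model tag and endpoint path follow Ollama's published conventions, but verify them against your installed version:

```python
import json
import urllib.request

# Minimal sketch: query a locally running gpt-oss-20b via Ollama's
# OpenAI-compatible chat endpoint (default: http://localhost:11434).

def chat_payload(prompt: str, model: str = "gpt-oss:20b") -> bytes:
    """Build the JSON request body for a single-turn, non-streaming chat."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()

def ask(prompt: str) -> str:
    """Send the prompt to the local server and return the model's reply."""
    req = urllib.request.Request(
        "http://localhost:11434/v1/chat/completions",
        data=chat_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server running, uncomment to try it:
# print(ask("Summarize the Apache 2.0 license in one sentence."))
```

Because the endpoint speaks the OpenAI chat format, existing OpenAI-client code can usually be pointed at the local server by changing only the base URL, which keeps both compute and data on your own machine.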