Sample Page Title

September 16, 2025

OpenAI hadn’t released an open-weight language model since GPT-2 back in 2019. Six years later, they surprised everyone with two: gpt-oss-120b and the smaller gpt-oss-20b.

Naturally, we wanted to know: how do they actually perform?

To find out, we ran both models through our open-source workflow optimization framework, syftr. It evaluates models across different configurations (fast vs. cheap, high vs. low accuracy) and includes support for OpenAI’s new “thinking effort” setting.

In theory, more thinking should mean better answers. In practice? Not always.

We also use syftr to explore questions like “is LLM-as-a-Judge actually working?” and “what workflows perform well across many datasets?”.

Our first results with GPT-OSS might surprise you: the best performer wasn’t the biggest model or the deepest thinker.

Instead, the 20b model with low thinking effort consistently landed on the Pareto frontier, even rivaling the 120b medium configuration on benchmarks like FinanceBench, HotpotQA, and MultihopRAG. Meanwhile, high thinking effort rarely mattered at all.

How we set up our experiments

We didn’t just pit GPT-OSS against itself. Instead, we wanted to see how it stacked up against other strong open-weight models. So we compared gpt-oss-20b and gpt-oss-120b with:

qwen3-235b-a22b
glm-4.5-air
nemotron-super-49b
qwen3-30b-a3b
gemma3-27b-it
phi-4-multimodal-instruct

To test OpenAI’s new “thinking effort” feature, we ran each GPT-OSS model in three modes: low, medium, and high thinking effort. That gave us six configurations in total:

gpt-oss-120b-low / -medium / -high
gpt-oss-20b-low / -medium / -high
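As a rough illustration of how these configurations add up, the sketch below enumerates the six GPT-OSS variants and combines them with the six comparison models listed above. The strings are just the labels used in this article; they are an assumption for illustration and not necessarily the identifiers syftr or any model provider expects.

```python
from itertools import product

# Labels taken from this article; treat them as illustrative names only.
GPT_OSS_SIZES = ["120b", "20b"]
THINKING_EFFORTS = ["low", "medium", "high"]

COMPARISON_MODELS = [
    "qwen3-235b-a22b",
    "glm-4.5-air",
    "nemotron-super-49b",
    "qwen3-30b-a3b",
    "gemma3-27b-it",
    "phi-4-multimodal-instruct",
]

# Every (size, effort) pair becomes one GPT-OSS configuration.
gpt_oss_configs = [
    f"gpt-oss-{size}-{effort}"
    for size, effort in product(GPT_OSS_SIZES, THINKING_EFFORTS)
]

# All LLM options considered in the comparison.
all_llm_options = gpt_oss_configs + COMPARISON_MODELS

if __name__ == "__main__":
    for name in all_llm_options:
        print(name)
    print(f"{len(gpt_oss_configs)} GPT-OSS configurations, "
          f"{len(all_llm_options)} LLM options in total")
```

Treating each thinking-effort level as its own LLM option means a search can compare, say, the 20b model at low effort directly against the 120b model at medium effort.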

For evaluation, we cast a wide net: 5 RAG and agent modes, 16 embedding models, and a range of flow configuration options. To assess model responses, we used GPT-4o-mini and compared answers against known ground truth.

Finally, we tested across four datasets:

FinanceBench (financial reasoning)
HotpotQA (multi-hop QA)
MultihopRAG (retrieval-augmented reasoning)
PhantomWiki (synthetic Q&A pairs)

We optimized workflows twice: once for accuracy + latency, and once for accuracy + cost, capturing the tradeoffs that matter most in real-world deployments.
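To make the idea of a two-objective tradeoff concrete, here is a minimal sketch of how a Pareto frontier can be extracted from a set of evaluated workflows. It is not syftr's implementation; the `Trial` type, the `pareto_frontier` helper, and the numbers are all hypothetical and exist only for illustration. The second objective can be read as either latency or cost, since both are minimized.

```python
from typing import List, NamedTuple


class Trial(NamedTuple):
    """One evaluated workflow (hypothetical structure, for illustration)."""
    name: str
    accuracy: float  # higher is better
    expense: float   # latency in seconds or cost in dollars; lower is better


def pareto_frontier(trials: List[Trial]) -> List[Trial]:
    """Return the trials that no other trial beats on both objectives."""
    frontier: List[Trial] = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive. On ties in expense, the more
    # accurate trial comes first, so the less accurate one is filtered out.
    for trial in sorted(trials, key=lambda t: (t.expense, -t.accuracy)):
        # A trial earns its place only if it is more accurate than
        # everything that is at least as cheap.
        if trial.accuracy > best_accuracy:
            frontier.append(trial)
            best_accuracy = trial.accuracy
    return frontier


# Made-up numbers, not results from the experiments in this article.
trials = [
    Trial("workflow-a", accuracy=0.40, expense=1.0),
    Trial("workflow-b", accuracy=0.55, expense=2.5),
    Trial("workflow-c", accuracy=0.50, expense=3.0),  # dominated by workflow-b
    Trial("workflow-d", accuracy=0.62, expense=6.0),
    Trial("workflow-e", accuracy=0.60, expense=8.0),  # dominated by workflow-d
]

if __name__ == "__main__":
    for trial in pareto_frontier(trials):
        print(f"{trial.name}: accuracy={trial.accuracy:.2f}, expense={trial.expense}")
```

A workflow on the frontier is one where any gain in accuracy would require spending more, and any saving would cost accuracy. Everything off the frontier can be replaced by something that is both cheaper and at least as accurate.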

Optimizing for latency, cost, and accuracy

When we optimized the GPT-OSS models, we looked at two tradeoffs: accuracy vs. latency and accuracy vs. cost. The results were more surprising than we expected:

GPT-OSS 20b (low thinking effort):
Fast, inexpensive, and consistently accurate. This setup appeared on the Pareto frontier repeatedly, making it the best default choice for most non-scientific tasks. In practice, that means quicker responses and lower bills compared to higher thinking efforts.
GPT-OSS 120b (medium thinking effort):
Best suited for tasks that demand deeper reasoning, like financial benchmarks. Use this when accuracy on complex problems matters more than cost.
GPT-OSS 120b (high thinking effort):
Expensive and usually unnecessary. Keep it in your back pocket for edge cases where other models fall short. For our benchmarks, it didn’t add value.

Figure 1: Accuracy-latency optimization with syftr

Figure 2: Accuracy-cost optimization with syftr

Reading the results more carefully

At first glance, the results look straightforward. But there’s an important nuance: an LLM’s top accuracy score depends not just on the model itself, but on how the optimizer weighs it against other models in the mix. To illustrate, let’s look at FinanceBench.

When optimizing for latency, all GPT-OSS models (except high thinking effort) landed with similar Pareto frontiers. In this case, the optimizer had little reason to focus on the 20b low thinking configuration, and its top accuracy was only 51%.

Figure 3: Per-LLM Pareto frontiers for latency optimization on FinanceBench

When optimizing for cost, the picture shifts dramatically. The same 20b low thinking configuration jumps to 57% accuracy, while the 120b medium configuration actually drops 22%. Why? Because the 20b model is far cheaper, so the optimizer shifts more weight toward it.

Figure 4: Per-LLM Pareto frontiers for cost optimization on FinanceBench

The takeaway: Performance depends on context. Optimizers will favor different models depending on whether you’re prioritizing speed, cost, or accuracy. And given the huge search space of possible configurations, there may be even better setups beyond the ones we tested.

Finding agentic workflows that work well for your setup

The new GPT-OSS models performed strongly in our tests, especially the 20b with low thinking effort, which often outpaced more expensive competitors. The bigger lesson? More model and more effort doesn’t always mean more accuracy. Sometimes, paying more just gets you less.

This is exactly why we built syftr and made it open-source. Every use case is different, and the best workflow for you depends on the tradeoffs you care about most. Want lower costs? Faster responses? Maximum accuracy?

Run your own experiments and find the Pareto sweet spot that balances those priorities for your setup.
