
This blog post focuses on new features and improvements. For a comprehensive list, including bug fixes, please see the release notes.
Benchmarking GPT-OSS Across H100s and B200s
OpenAI has released gpt-oss-120b and gpt-oss-20b, a new generation of open-weight reasoning models under the Apache 2.0 license. Built for strong instruction following, powerful tool use, and advanced reasoning, these models are designed for next-generation agentic workflows.
With a Mixture of Experts (MoE) design, an extended context length of 131K tokens, and quantization that allows the 120b model to run on a single 80 GB GPU, GPT-OSS combines massive scale with practical deployment. Developers can adjust reasoning levels from low to high to optimize for speed, cost, or accuracy, and use built-in browsing, code execution, and custom tools for complex workflows.
Our research team benchmarked gpt-oss-120b across NVIDIA B200 and H100 GPUs using vLLM, SGLang, and TensorRT-LLM. Tests covered single-request scenarios and high-concurrency workloads with 50–100 requests. Key findings include:
Single-request speed: B200 with TensorRT-LLM delivers a 0.023s time-to-first-token (TTFT), outperforming dual-H100 setups in several cases.
High concurrency: B200 sustains 7,236 tokens/sec at maximum load with lower per-token latency.
Efficiency: One B200 can replace two H100s for equal or better performance, with lower power use and less complexity.
Performance gains: Some workloads see up to 15x faster inference compared to a single H100.
For detailed benchmarks on throughput, latency, time to first token, and other metrics, read our full blog on NVIDIA B200 vs H100.
If you are looking to deploy GPT-OSS models on H100s, you can do it today on Clarifai across multiple clouds. Support for B200s is coming soon, giving you access to the latest NVIDIA GPUs for testing and production.
Developer Plan
Last month we launched Local Runners, and the response from developers has been incredible. From AI hobbyists to production teams, many have been eager to run open-source models locally on their own hardware while still taking advantage of the Clarifai platform. With Local Runners, you can run and test models on your own machines, then access them through a public API for integration into any application.
Now, with the arrival of the latest GPT-OSS models, including gpt-oss-20b, you can run these advanced reasoning models locally with full control of your compute and the ability to deploy agentic workflows immediately.
To make it even easier, we're introducing the Developer Plan at a promotional price of just $1/month. It includes everything in the Community Plan, plus:
Check out the Developer Plan and start running your own models locally today. If you are ready to run GPT-OSS-20b on your hardware, follow our step-by-step tutorial here.
Published Models
We have expanded our model library with new open-weight and specialized models that are ready to use in your workflows.
The latest additions include:
GPT-OSS-120b – an open-weight language model designed for strong reasoning, advanced tool use, and efficient on-device deployment. It supports extended context lengths and variable reasoning levels, making it ideal for complex agentic applications.
GPT-5, GPT-5 Mini, and GPT-5 Nano – GPT-5 is the flagship model for the most demanding reasoning and generative tasks. GPT-5 Mini offers a faster, cost-effective alternative for real-time applications. GPT-5 Nano delivers ultra-low-latency inference for edge and budget-sensitive deployments.
Qwen3-Coder-30B-A3B-Instruct – a high-efficiency coding model with long-context support and strong agentic capabilities, well suited for code generation, refactoring, and development automation.
You can start exploring these models today in the Clarifai Playground or access them via the API to integrate into your applications.
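As a rough sketch of what API access can look like, the snippet below builds an OpenAI-style chat-completion payload for a hosted model. The base URL, the `reasoning_effort` parameter name, and the model identifier are assumptions for illustration; check the Clarifai API docs for the exact endpoint and parameters your deployment expects.

```python
import json

# Assumed OpenAI-compatible base URL for illustration; verify against Clarifai docs.
CLARIFAI_OPENAI_BASE = "https://api.clarifai.com/v2/ext/openai/v1"

def build_chat_request(model: str, prompt: str, reasoning_effort: str = "medium") -> dict:
    """Build an OpenAI-style chat-completion payload for a hosted GPT-OSS model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # gpt-oss supports low/medium/high reasoning levels; the exact
        # parameter name may vary by serving stack (hypothetical here).
        "reasoning_effort": reasoning_effort,
    }

payload = build_chat_request("gpt-oss-120b", "Summarize MoE routing in two sentences.")
print(json.dumps(payload, indent=2))
```

To actually send the request, you could point any OpenAI-compatible client at `CLARIFAI_OPENAI_BASE` with your personal access token as the API key and pass this payload to the chat-completions route.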
Ollama Support
Ollama makes it simple to download and run powerful open-source models directly on your machine. With Clarifai Local Runners, you can now expose these locally running models via a secure public API.
We've also added an Ollama toolkit to the Clarifai CLI, letting you download, run, and expose Ollama models with a single command.
Read our step-by-step guide on running Ollama models locally and making them accessible via API.
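For context on what a locally running Ollama model exposes, here is a minimal sketch of a request body for Ollama's local REST chat endpoint. The default port (11434) and the `/api/chat` route are from Ollama's own API; the `gpt-oss:20b` model tag is an assumption for illustration, so substitute whatever tag you have pulled.

```python
import json

# Ollama's default local endpoint.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

def build_ollama_chat(model: str, prompt: str) -> bytes:
    """Serialize a chat request for Ollama's local REST API."""
    body = {
        "model": model,  # e.g. "gpt-oss:20b" after pulling it with the ollama CLI
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for a single complete JSON response
    }
    return json.dumps(body).encode("utf-8")

req_bytes = build_ollama_chat("gpt-oss:20b", "Write a haiku about GPUs.")
```

You could POST `req_bytes` to `OLLAMA_CHAT_URL` with `Content-Type: application/json` (for example via `urllib.request`) while Ollama is running; a Local Runner then makes the same model reachable through a public Clarifai API endpoint.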
Playground Improvements
You can now compare multiple models side by side in the Playground instead of testing them one at a time. Quickly spot differences in output, speed, and quality to choose the best fit for your use case.
We've also added enhanced inference controls, Pythonic support, and model version selectors for smoother experimentation.

More Updates
Python SDK:
Improved logging, pipeline handling, authentication, Local Runner support, and code validation.
Added live logging, verbose output, and integration with GitHub repositories for flexible model initialization.
Platform:
Clarifai Organizations:
Ready to start building?
With Clarifai's Compute Orchestration, you can deploy GPT-OSS, Qwen3-Coder, other open-source models, and your own custom models on dedicated GPUs like NVIDIA B200s and H100s, on-prem or in the cloud. Serve models, MCP servers, or full agentic workflows directly from your hardware with full control over performance, cost, and security.