

Comparing SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B

Introduction

The ecosystem of LLM inference frameworks has been evolving quickly. As models become larger and more capable, the frameworks that power them are forced to keep pace, optimizing for everything from latency to throughput to memory efficiency. For developers, researchers, and enterprises alike, the choice of framework can dramatically affect both performance and cost.

In this blog, we bring these considerations together by comparing SGLang, vLLM, and TensorRT-LLM. We evaluate how each performs when serving GPT-OSS-120B on 2x NVIDIA H100 GPUs. The results highlight the distinct strengths of each framework and offer practical guidance on which to choose based on your workload and hardware.

Overview of the Frameworks

SGLang: SGLang was designed around the idea of structured generation. It brings unique abstractions like RadixAttention and specialized state management that allow it to deliver low latency for interactive applications. This makes SGLang especially appealing when the workload requires precise control over outputs, such as when producing structured data formats or working with agentic workflows.
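
To make the interactive use case concrete, here is a minimal sketch of querying a locally running SGLang server through its OpenAI-compatible endpoint. The launch command, port, and model name are assumptions for illustration, not a verbatim recipe; check the SGLang docs for your version.

```python
# Assumes an SGLang server was launched separately, for example:
#   python -m sglang.launch_server --model-path openai/gpt-oss-120b --tp 2 --port 30000
# (launch command and flags are illustrative; check the SGLang docs for your version)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Ask for structured output, the kind of workload SGLang targets.
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Return a JSON object with fields city and country for Paris."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```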

vLLM: vLLM has established itself as one of the leading open-source inference frameworks for serving large language models at scale. Its key advantage lies in throughput, powered by continuous batching and efficient memory management via PagedAttention. It also offers broad support for quantization techniques like INT8, INT4, GPTQ, AWQ, and FP8, making it a versatile choice for those who need to maximize tokens per second across many concurrent requests.
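
As a quick illustration of the batch-throughput use case, the sketch below uses vLLM's offline Python API to generate several prompts in one call, which the engine serves with continuous batching. The model name and tensor-parallel size are assumptions for a 2x H100 setup.

```python
# Minimal sketch of vLLM's offline batch API (model name and tensor_parallel_size
# are assumptions for a 2x H100 setup; adjust for your hardware).
from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-120b", tensor_parallel_size=2)
params = SamplingParams(temperature=0.7, max_tokens=128)

# Continuous batching lets a single call handle many prompts efficiently.
outputs = llm.generate(["Summarize PagedAttention in one sentence."] * 8, params)
for out in outputs:
    print(out.outputs[0].text)
```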

TensorRT-LLM: TensorRT-LLM is NVIDIA's TensorRT-based inference runtime, purpose-built to extract maximum performance from NVIDIA GPUs. It is deeply optimized for the Hopper and Blackwell architectures, which means it takes full advantage of hardware features in the H100 and B200. The result is higher efficiency, faster response times, and better scaling as workloads grow. While it requires a bit more setup and tuning compared to other frameworks, TensorRT-LLM represents NVIDIA's vision for production-grade inference performance.
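
For context, recent TensorRT-LLM releases expose a high-level LLM API that hides much of the engine-building work. The sketch below is a hedged example of that API; the class names, the tensor_parallel_size argument, and the model name are assumptions based on recent releases, and older versions require an explicit engine-build step instead.

```python
# Hedged sketch of TensorRT-LLM's high-level LLM API (names and arguments are
# assumptions based on recent releases; older versions need a separate engine build).
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-120b", tensor_parallel_size=2)
params = SamplingParams(max_tokens=128)

for out in llm.generate(["What makes Hopper GPUs fast for inference?"], params):
    print(out.outputs[0].text)
```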

Framework       Design Focus                             Key Strengths
SGLang          Structured generation, RadixAttention    Low latency, efficient token generation
vLLM            Continuous batching, PagedAttention      High throughput, supports quantization
TensorRT-LLM    TensorRT optimizations                   GPU-level efficiency, lowest latency on H100/B200

Benchmark Setup and Results


To evaluate the three frameworks fairly, we ran GPT-OSS-120B on 2x NVIDIA H100 GPUs under a variety of conditions. GPT-OSS-120B is a large mixture-of-experts model that pushes the boundaries of open-weight performance. Its size and complexity make it a demanding benchmark, which is exactly why it is ideal for testing inference frameworks and hardware.

We measured three key categories of performance (a simple timing sketch follows the list):

  • Latency – How fast the model generates the first token (TTFT) and how quickly it produces subsequent tokens.
  • Throughput – How many tokens per second can be generated under varying levels of concurrency.
  • Concurrency scaling – How well each framework holds up as the number of simultaneous requests increases.
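
The snippet below is a rough sketch of how numbers like these can be collected: it times a single streaming request against an OpenAI-compatible endpoint, which all three frameworks can expose. The endpoint URL and model name are assumptions, each streamed chunk is treated as one token (an approximation), and our actual harness differs in detail.

```python
# Hypothetical timing sketch: measures time-to-first-token and mean per-token
# latency for one streaming request. Endpoint, port, and model name are assumptions.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
token_times = []
stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Explain mixture-of-experts models briefly."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    # Each content-bearing chunk is counted as one token, which is an approximation.
    if chunk.choices and chunk.choices[0].delta.content:
        token_times.append(time.perf_counter())

if token_times:
    ttft = token_times[0] - start
    per_token = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
    print(f"TTFT: {ttft:.3f}s, per-token latency: {per_token:.4f}s")
```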

Latency Results

Let's start with latency. When you care about responsiveness, two things matter most: the time to first token and the per-token latency once decoding begins.

Here is how the three frameworks stacked up:

Time to First Token (seconds)

Concurrency    vLLM     SGLang    TensorRT-LLM
1              0.053    0.125     0.177
10             1.91     1.155     2.496
50             7.546    3.08      4.14
100            1.87     8.99      15.467

Per-Token Latency (seconds)

Concurrency    vLLM     SGLang    TensorRT-LLM
1              0.005    0.004     0.004
10             0.011    0.010     0.009
50             0.021    0.015     0.018
100            0.019    0.021     0.049

What this shows:

  • vLLM posted the fastest time to first token for single requests and at 100 concurrent requests, keeping its first-token latency under control even at the highest load.
  • SGLang had the most stable per-token latency, staying around 4–21 ms across different loads, and delivered the fastest first token at 10 and 50 concurrent requests.
  • TensorRT-LLM showed the slowest time to first token overall but maintained competitive per-token performance at lower concurrency levels.

Throughput Results

When it comes to serving a large number of requests, throughput is the number to watch. Here is how the three frameworks performed as concurrency increased:

Overall Throughput (tokens/second)

Concurrency    vLLM       SGLang     TensorRT-LLM
1              187.15     230.96     242.79
10             863.15     988.18     867.21
50             2211.85    3108.75    2162.95
100            4741.62    3221.84    1942.64

One of the most important findings was that vLLM achieved the highest throughput at 100 concurrent requests, reaching 4,741 tokens per second. SGLang showed the strongest performance at moderate to high concurrency (10 and 50 requests), while TensorRT-LLM delivered the best single-request throughput but scaled less well at high concurrency.
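
For context on how a throughput figure like this can be derived, here is a hedged sketch that fires N concurrent requests at an OpenAI-compatible endpoint and divides the total generated tokens by wall-clock time. The endpoint, model name, and concurrency value are assumptions for illustration, not our exact harness.

```python
# Hypothetical concurrency sweep: total generated tokens / wall time = tokens/sec.
# Endpoint, model name, and CONCURRENCY are assumptions for illustration.
import asyncio
import time
from openai import AsyncOpenAI

CONCURRENCY = 50
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": "Write a short paragraph about GPUs."}],
        max_tokens=256,
    )
    # Count only generated (completion) tokens toward throughput.
    return resp.usage.completion_tokens

async def main() -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(CONCURRENCY)))
    elapsed = time.perf_counter() - start
    print(f"Throughput: {sum(counts) / elapsed:.1f} tokens/sec at concurrency {CONCURRENCY}")

asyncio.run(main())
```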

Framework Analysis and Recommendations

SGLang

  • Strengths: Stable per-token latency, strong throughput at moderate concurrency, good overall balance.

  • Weaknesses: Slower time to first token for single requests; throughput gains flatten at 100 concurrent requests.

  • Best For: Moderate- to high-throughput applications and scenarios that require consistent token generation timing.

vLLM

  • Strengths: Fastest time to first token for single requests and at very high concurrency, highest throughput at 100 concurrent requests, excellent scaling.

  • Weaknesses: Slightly higher per-token latency under heavy load.

  • Best For: Interactive applications, high-concurrency deployments, and scenarios that prioritize fast initial responses and maximum throughput scaling.

TensorRT-LLM

  • Strengths: Best single-request throughput, competitive per-token latency at low concurrency, hardware-optimized performance.

  • Weaknesses: Slowest time to first token, poor scaling at high concurrency, significantly degraded per-token latency at 100 requests.

  • Best For: Single-user or low-concurrency applications, and scenarios where hardware optimization matters more than scaling.

Conclusion

There is no single framework that outperforms the others across all categories. Instead, each has been optimized for different goals, and the right choice depends on your workload and infrastructure.

  • Use vLLM for interactive applications and high-concurrency deployments that need fast responses and maximum throughput scaling.
  • Choose SGLang when moderate-to-high throughput and consistent per-token timing are needed.
  • Deploy TensorRT-LLM for single-user applications or when maximizing hardware efficiency at low concurrency is the priority.

The key takeaway is that choosing the right framework depends on your workload type and hardware availability, rather than on finding a universal winner. Running GPT-OSS-120B on NVIDIA H100 GPUs with these optimized inference frameworks unlocks powerful options for building and deploying AI applications at scale.

It is worth noting that these performance characteristics can shift dramatically depending on your GPU hardware. We also extended the benchmarks to B200 GPUs, where TensorRT-LLM consistently outperformed both SGLang and vLLM across all metrics, thanks to its deeper optimization for NVIDIA's latest hardware architecture.

This highlights that framework selection is not just about software capabilities; it is equally about matching the right framework to your specific hardware to unlock its full performance potential.

 

You can explore the full set of benchmark results here.

Bonus: Serve a Model with Your Preferred Framework

Getting started with these frameworks is straightforward. With Clarifai's Compute Orchestration, you can serve GPT-OSS-120B, other open-weight models, or your own custom models on your preferred inference engine, whether that is SGLang, vLLM, or TensorRT-LLM.

From setting up the runtime to deploying a production-ready API, you can quickly go from model to application. Best of all, you are not locked into a single framework: you can experiment with different runtimes and choose the one that best aligns with your performance and cost requirements.

This flexibility makes it easy to integrate cutting-edge frameworks into your workflows and ensures you are always getting the best possible performance out of your hardware. Check out the documentation to learn how to upload your own models.


