

Comparing SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B

Introduction

The ecosystem of LLM inference frameworks has been evolving quickly. As models become larger and more capable, the frameworks that power them are forced to keep pace, optimizing for everything from latency to throughput to memory efficiency. For developers, researchers, and enterprises alike, the choice of framework can dramatically affect both performance and cost.

In this blog, we bring these considerations together by comparing SGLang, vLLM, and TensorRT-LLM. We evaluate how each performs when serving GPT-OSS-120B on 2x NVIDIA H100 GPUs. The results highlight the distinct strengths of each framework and offer practical guidance on which to choose based on your workload and hardware.

Overview of the Frameworks

SGLang: SGLang was designed around the idea of structured generation. It brings unique abstractions like RadixAttention and specialized state management that allow it to deliver low latency for interactive applications. This makes SGLang especially appealing when the workload requires precise control over outputs, such as when producing structured data formats or working with agentic workflows.
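
To make the interactive use case concrete, here is a minimal sketch of querying a locally running SGLang server through its OpenAI-compatible endpoint. The launch command, port, and model name are assumptions for illustration, not a verbatim recipe; check the SGLang docs for your version.

```python
# Assumes an SGLang server was launched separately, for example:
#   python -m sglang.launch_server --model-path openai/gpt-oss-120b --tp 2 --port 30000
# (launch command and flags are illustrative; check the SGLang docs for your version)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Ask for structured output, the kind of workload SGLang targets.
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Return a JSON object with fields city and country for Paris."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```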

vLLM: vLLM has established itself as one of the leading open-source inference frameworks for serving large language models at scale. Its key advantage lies in throughput, powered by continuous batching and efficient memory management via PagedAttention. It also offers broad support for quantization techniques like INT8, INT4, GPTQ, AWQ, and FP8, making it a versatile choice for those who need to maximize tokens per second across many concurrent requests.
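
As a quick illustration of the batch-throughput use case, the sketch below uses vLLM's offline Python API to generate several prompts in one call, which the engine serves with continuous batching. The model name and tensor-parallel size are assumptions for a 2x H100 setup.

```python
# Minimal sketch of vLLM's offline batch API (model name and tensor_parallel_size
# are assumptions for a 2x H100 setup; adjust for your hardware).
from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-120b", tensor_parallel_size=2)
params = SamplingParams(temperature=0.7, max_tokens=128)

# Continuous batching lets a single call handle many prompts efficiently.
outputs = llm.generate(["Summarize PagedAttention in one sentence."] * 8, params)
for out in outputs:
    print(out.outputs[0].text)
```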

TensorRT-LLM: TensorRT-LLM is NVIDIA's TensorRT-based inference runtime, purpose-built to extract maximum performance from NVIDIA GPUs. It is deeply optimized for the Hopper and Blackwell architectures, which means it takes full advantage of hardware features in the H100 and B200. The result is higher efficiency, faster response times, and better scaling as workloads grow. While it requires a bit more setup and tuning compared to other frameworks, TensorRT-LLM represents NVIDIA's vision for production-grade inference performance.
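
For context, recent TensorRT-LLM releases expose a high-level LLM API that hides much of the engine-building work. The sketch below is a hedged example of that API; the class names, the tensor_parallel_size argument, and the model name are assumptions based on recent releases, and older versions require an explicit engine-build step instead.

```python
# Hedged sketch of TensorRT-LLM's high-level LLM API (names and arguments are
# assumptions based on recent releases; older versions need a separate engine build).
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-120b", tensor_parallel_size=2)
params = SamplingParams(max_tokens=128)

for out in llm.generate(["What makes Hopper GPUs fast for inference?"], params):
    print(out.outputs[0].text)
```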

Framework       Design Focus                             Key Strengths
SGLang          Structured generation, RadixAttention    Low latency, efficient token generation
vLLM            Continuous batching, PagedAttention      High throughput, supports quantization
TensorRT-LLM    TensorRT optimizations                   GPU-level efficiency, lowest latency on H100/B200

Benchmark Setup and Results


To evaluate the three frameworks fairly, we ran GPT-OSS-120B on 2x NVIDIA H100 GPUs under a variety of conditions. GPT-OSS-120B is a large mixture-of-experts model that pushes the boundaries of open-weight performance. Its size and complexity make it a demanding benchmark, which is exactly why it is ideal for testing inference frameworks and hardware.

We measured three key categories of performance (a simple timing sketch follows the list):

  • Latency – How fast the model generates the first token (TTFT) and how quickly it produces subsequent tokens.
  • Throughput – How many tokens per second can be generated under varying levels of concurrency.
  • Concurrency scaling – How well each framework holds up as the number of simultaneous requests increases.
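
The snippet below is a rough sketch of how numbers like these can be collected: it times a single streaming request against an OpenAI-compatible endpoint, which all three frameworks can expose. The endpoint URL and model name are assumptions, each streamed chunk is treated as one token (an approximation), and our actual harness differs in detail.

```python
# Hypothetical timing sketch: measures time-to-first-token and mean per-token
# latency for one streaming request. Endpoint, port, and model name are assumptions.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
token_times = []
stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Explain mixture-of-experts models briefly."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    # Each content-bearing chunk is counted as one token, which is an approximation.
    if chunk.choices and chunk.choices[0].delta.content:
        token_times.append(time.perf_counter())

if token_times:
    ttft = token_times[0] - start
    per_token = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
    print(f"TTFT: {ttft:.3f}s, per-token latency: {per_token:.4f}s")
```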

Latency Results

Let's start with latency. When you care about responsiveness, two things matter most: the time to first token and the per-token latency once decoding begins.

Here is how the three frameworks stacked up:

Time to First Token (seconds)

Concurrency    vLLM     SGLang    TensorRT-LLM
1              0.053    0.125     0.177
10             1.91     1.155     2.496
50             7.546    3.08      4.14
100            1.87     8.99      15.467

Per-Token Latency (seconds)

Concurrency    vLLM     SGLang    TensorRT-LLM
1              0.005    0.004     0.004
10             0.011    0.010     0.009
50             0.021    0.015     0.018
100            0.019    0.021     0.049

What this shows:

  • vLLM posted the fastest time to first token for single requests and at 100 concurrent requests, keeping its first-token latency under control even at the highest load.
  • SGLang had the most stable per-token latency, staying around 4–21 ms across different loads, and delivered the fastest first token at 10 and 50 concurrent requests.
  • TensorRT-LLM showed the slowest time to first token overall but maintained competitive per-token performance at lower concurrency levels.

Throughput Results

When it comes to serving a large number of requests, throughput is the number to watch. Here is how the three frameworks performed as concurrency increased:

Overall Throughput (tokens/second)

Concurrency    vLLM       SGLang     TensorRT-LLM
1              187.15     230.96     242.79
10             863.15     988.18     867.21
50             2211.85    3108.75    2162.95
100            4741.62    3221.84    1942.64

One of the most important findings was that vLLM achieved the highest throughput at 100 concurrent requests, reaching 4,741 tokens per second. SGLang showed the strongest performance at moderate to high concurrency (10 and 50 requests), while TensorRT-LLM delivered the best single-request throughput but scaled less well at high concurrency.
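
For context on how a throughput figure like this can be derived, here is a hedged sketch that fires N concurrent requests at an OpenAI-compatible endpoint and divides the total generated tokens by wall-clock time. The endpoint, model name, and concurrency value are assumptions for illustration, not our exact harness.

```python
# Hypothetical concurrency sweep: total generated tokens / wall time = tokens/sec.
# Endpoint, model name, and CONCURRENCY are assumptions for illustration.
import asyncio
import time
from openai import AsyncOpenAI

CONCURRENCY = 50
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": "Write a short paragraph about GPUs."}],
        max_tokens=256,
    )
    # Count only generated (completion) tokens toward throughput.
    return resp.usage.completion_tokens

async def main() -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(CONCURRENCY)))
    elapsed = time.perf_counter() - start
    print(f"Throughput: {sum(counts) / elapsed:.1f} tokens/sec at concurrency {CONCURRENCY}")

asyncio.run(main())
```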

Framework Analysis and Recommendations

SGLang

  • Strengths: Stable per-token latency, strong throughput at moderate concurrency, good overall balance.

  • Weaknesses: Slower time to first token for single requests; throughput gains flatten at 100 concurrent requests.

  • Best For: Moderate- to high-throughput applications and scenarios that require consistent token generation timing.

vLLM

  • Strengths: Fastest time to first token for single requests and at very high concurrency, highest throughput at 100 concurrent requests, excellent scaling.

  • Weaknesses: Slightly higher per-token latency under heavy load.

  • Best For: Interactive applications, high-concurrency deployments, and scenarios that prioritize fast initial responses and maximum throughput scaling.

TensorRT-LLM

  • Strengths: Best single-request throughput, competitive per-token latency at low concurrency, hardware-optimized performance.

  • Weaknesses: Slowest time to first token, poor scaling at high concurrency, significantly degraded per-token latency at 100 requests.

  • Best For: Single-user or low-concurrency applications, and scenarios where hardware optimization matters more than scaling.

Conclusion

There is no single framework that outperforms the others across all categories. Instead, each has been optimized for different goals, and the right choice depends on your workload and infrastructure.

  • Use vLLM for interactive applications and high-concurrency deployments that need fast responses and maximum throughput scaling.
  • Choose SGLang when moderate-to-high throughput and consistent per-token timing are needed.
  • Deploy TensorRT-LLM for single-user applications or when maximizing hardware efficiency at low concurrency is the priority.

The key takeaway is that choosing the right framework depends on your workload type and hardware availability, rather than on finding a universal winner. Running GPT-OSS-120B on NVIDIA H100 GPUs with these optimized inference frameworks unlocks powerful options for building and deploying AI applications at scale.

It is worth noting that these performance characteristics can shift dramatically depending on your GPU hardware. We also extended the benchmarks to B200 GPUs, where TensorRT-LLM consistently outperformed both SGLang and vLLM across all metrics, thanks to its deeper optimization for NVIDIA's latest hardware architecture.

This highlights that framework selection is not just about software capabilities; it is equally about matching the right framework to your specific hardware to unlock its full performance potential.

 

You can explore the full set of benchmark results here.

Bonus: Serve a Model with Your Preferred Framework

Getting started with these frameworks is straightforward. With Clarifai's Compute Orchestration, you can serve GPT-OSS-120B, other open-weight models, or your own custom models on your preferred inference engine, whether that is SGLang, vLLM, or TensorRT-LLM.

From setting up the runtime to deploying a production-ready API, you can quickly go from model to application. Best of all, you are not locked into a single framework: you can experiment with different runtimes and choose the one that best aligns with your performance and cost requirements.

This flexibility makes it easy to integrate cutting-edge frameworks into your workflows and ensures you are always getting the best possible performance out of your hardware. Check out the documentation to learn how to upload your own models.


