TL;DR
Using custom CUDA kernels and speculative decoding optimized for reasoning workloads, we achieved 414 tokens per second of throughput on Kimi K2.5 running on Nvidia B200 GPUs, making us one of the first providers to reach 400+ tokens per second on a trillion-parameter reasoning model.
Ahead of Nvidia GTC, we're excited to share that the Clarifai Reasoning Engine achieves 414 tokens per second (TPS) of throughput on Kimi K2.5, positioning us among the top inference providers for frontier reasoning models as measured by Artificial Analysis. Running on Nvidia B200 GPU infrastructure, our platform delivers production-grade performance for agentic workflows and complex reasoning tasks.

Figure 1: Clarifai achieves 414 tokens per second on Kimi K2.5, ranking among the fastest inference providers on Artificial Analysis benchmarks.
Why Kimi K2.5 performance matters
Kimi K2.5 is a 1-trillion-parameter reasoning model with a 384-expert Mixture-of-Experts architecture that activates 32 billion parameters per request. Built by Moonshot AI with native multimodal training on 15 trillion mixed visual and text tokens, the model delivers strong performance across key benchmarks: 50.2% on HLE with tools, 76.8% on SWE-Bench Verified, and 78.4% on BrowseComp.
As a reasoning model, Kimi K2.5 generates extended thinking sequences before final answers. Clarifai achieves a time to first answer token of 6 seconds, which includes the model's internal thinking time before it begins producing a response. Throughput directly impacts end-to-end response time for agentic systems, code generation, and multimodal reasoning tasks. At 414 TPS, we deliver the speed required for production deployments.
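To see why throughput dominates end-to-end latency for long reasoning outputs, here is a back-of-envelope model combining the two numbers above. The formula is a simplification (it assumes a constant decode rate after the first answer token), not a description of Clarifai's internal measurement methodology.

```python
# Illustrative latency model: total time is roughly time-to-first-answer-token
# plus decode time at a steady token rate. This is a simplification; real
# decode rates vary with batch load and sequence length.
def end_to_end_seconds(output_tokens: int, tps: float = 414.0, ttft_s: float = 6.0) -> float:
    """Estimate total response time for a single reasoning request."""
    return ttft_s + output_tokens / tps

# A 2,000-token answer at 414 TPS completes in roughly 10.8 seconds.
print(round(end_to_end_seconds(2000), 1))
```

At these speeds, even multi-thousand-token agentic responses stay within interactive latency budgets.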

Figure 2: Time to first answer token (TTFT) performance across inference providers, measured by Artificial Analysis with 10,000 input tokens.
How we optimize for throughput
The Clarifai Reasoning Engine uses three core optimizations for large reasoning models:
Custom CUDA kernels reduce memory stalls and improve cache locality. By optimizing low-level GPU operations, we keep streaming multiprocessors active during inference rather than waiting on data movement.
Speculative decoding predicts likely token paths and prunes misses quickly. This reduces wasted computation during the model's thinking sequence, a pattern common in reasoning workloads.
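The draft-and-verify idea behind speculative decoding can be sketched in a few lines. The toy `draft` and `target` callables below are stand-ins for a cheap proposal model and the full model; this is a conceptual sketch, not Clarifai's production implementation.

```python
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft: Callable[[List[int]], int],
                     target: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    """Propose k tokens with a cheap draft model, then keep the longest
    prefix the target model agrees with; the first miss is pruned early
    and replaced by the target's own token."""
    # Draft phase: cheaply propose k candidate tokens.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # Verify phase: accept draft tokens until the target disagrees.
    accepted, ctx = [], list(prefix)
    for t in proposed:
        if target(ctx) == t:              # target confirms the draft token
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(target(ctx))  # correct the first miss, stop
            break
    return accepted

# Toy models: the target always counts up by 1; the draft agrees except
# when the context length is a multiple of 3.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + (2 if len(ctx) % 3 == 0 else 1)
print(speculative_step([0], draft, target, k=4))  # -> [1, 2, 3]
```

When the draft model agrees often, several tokens are emitted per target-model pass, which is where the throughput gain comes from.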
Adaptive optimization continuously learns from workload behavior. The system dynamically adjusts batching, memory reuse, and execution paths based on actual request patterns. These improvements compound over time, especially for the repetitive tasks common in agentic workflows.
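As one concrete example of adapting to request patterns, a batching policy can grow or shrink batch size from observed latency. The thresholds and policy below are hypothetical, for illustration only; the actual system tunes more dimensions than batch size.

```python
def adapt_batch_size(batch: int, observed_ms: float, target_ms: float,
                     lo: int = 1, hi: int = 64) -> int:
    """Grow the batch when latency has headroom, shrink when it exceeds
    the budget: a simple multiplicative increase/decrease policy."""
    if observed_ms < 0.8 * target_ms:
        return min(hi, batch * 2)    # headroom: pack more requests together
    if observed_ms > target_ms:
        return max(lo, batch // 2)   # over budget: back off
    return batch                     # within band: hold steady

# Simulated feedback loop under a 50 ms latency budget.
batch = 4
for latency_ms in (20.0, 30.0, 55.0, 45.0):
    batch = adapt_batch_size(batch, latency_ms, target_ms=50.0)
print(batch)  # -> 8
```

The point is the feedback loop: repeated, similar agentic requests give the controller a stable signal, which is why the gains compound over time.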
Running on Nvidia B200 infrastructure gives us the hardware foundation to push performance boundaries, while our inference optimization stack delivers the software-level gains.
Building with Kimi K2.5
Kimi K2.5 is now available on the Clarifai Platform. Try it out in the Playground or via the API to get started.
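For a rough sense of what an API call looks like, here is a sketch of a chat-completions-style request built with the Python standard library. The endpoint URL and model identifier are placeholders, not Clarifai's real values; substitute the ones from your Clarifai account and the API documentation.

```python
import json
import urllib.request

# Placeholder endpoint and model ID -- replace with the real values from
# the Clarifai API docs before sending.
ENDPOINT = "https://example.invalid/v1/chat/completions"

payload = {
    "model": "kimi-k2.5",  # placeholder model identifier
    "messages": [{"role": "user", "content": "Plan a 3-step refactor."}],
    "stream": True,        # stream tokens to benefit from the decode rate
}
req = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Authorization": "Bearer YOUR_API_KEY",
             "Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send the request; it is omitted here
# so the sketch stays offline.
print(req.get_method())  # -> POST
```

Streaming the response is the natural fit for a reasoning model, since tokens arrive continuously during the thinking and answer phases.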
If you need dedicated compute to deploy Kimi K2.5 and other top open models at scale for production workloads, get in touch with our team.