Mistral AI has launched Mistral Small 4, a brand new model in the Mistral Small family designed to consolidate several previously separate capabilities into a single deployment target. The Mistral team describes Small 4 as its first model to combine the roles associated with Mistral Small for instruction following, Magistral for reasoning, Pixtral for multimodal understanding, and Devstral for agentic coding. The result is a single model that can operate as a general assistant, a reasoning model, and a multimodal system without requiring model switching across workflows.
Architecture: 128 Experts, Sparse Activation
Architecturally, Mistral Small 4 is a Mixture-of-Experts (MoE) model with 128 experts and 4 active experts per token. The model has 119B total parameters, with 6B active parameters per token, or 8B including embedding and output layers.
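The top-4-of-128 routing pattern can be illustrated with a toy softmax gate. This is a minimal sketch of how sparse MoE activation works in general, not Mistral's actual router implementation; all shapes and values below are illustrative.

```python
import numpy as np

def top_k_route(hidden, gate_weights, k=4):
    """Toy MoE gate: score every expert, keep the top-k, renormalize.

    hidden: (d,) token representation
    gate_weights: (num_experts, d) router matrix
    Returns (expert_indices, mixing_weights) for the k selected experts.
    """
    logits = gate_weights @ hidden            # one score per expert
    top = np.argsort(logits)[-k:]             # indices of the k highest-scoring experts
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                      # softmax over only the selected k
    return top, probs

rng = np.random.default_rng(0)
d, num_experts = 16, 128
idx, w = top_k_route(rng.standard_normal(d), rng.standard_normal((num_experts, d)))
print(len(idx), w.sum())  # 4 experts active per token, mixing weights sum to 1
```

Only the 4 selected experts run their feed-forward computation for that token, which is why the active parameter count (6B) is so much smaller than the total (119B).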
Long Context and Multimodal Support
The model supports a 256k context window, a significant jump for practical engineering use cases. Long-context capacity matters less as a marketing number and more as an operational simplifier: it reduces the need for aggressive chunking, retrieval orchestration, and context pruning in tasks such as long-document analysis, codebase exploration, multi-file reasoning, and agentic workflows. Mistral positions the model for general chat, coding, agentic tasks, and complex reasoning, with text and image inputs and text output. That places Small 4 in the increasingly important class of general-purpose models expected to handle both language-heavy and visually grounded enterprise tasks under one API surface.
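In practice, "less chunking" means a pre-check like the following: estimate whether the whole document fits in the window before falling back to a retrieval pipeline. The chars-per-token heuristic is a rough assumption; a real deployment would use the model's actual tokenizer.

```python
CONTEXT_WINDOW = 256_000  # tokens, per the announced spec

def fits_in_context(text: str, reserved_for_output: int = 8_000,
                    chars_per_token: float = 4.0) -> bool:
    """Rough pre-check: can the whole document go in without chunking?

    chars_per_token is a crude English-text heuristic, not the real tokenizer.
    """
    est_tokens = len(text) / chars_per_token
    return est_tokens + reserved_for_output <= CONTEXT_WINDOW

# ~900K characters (~225K estimated tokens) still fits; ~1.1M does not.
print(fits_in_context("x" * 900_000))    # True
print(fits_in_context("x" * 1_100_000))  # False
```

When this check passes, the retrieval and pruning machinery can be skipped entirely, which is the operational simplification the release emphasizes.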
Configurable Reasoning at Inference Time
A more important product decision than the raw parameter count is the introduction of configurable reasoning effort. Small 4 exposes a per-request reasoning_effort parameter that lets developers trade latency for deeper test-time reasoning. In the official documentation, reasoning_effort="none" is described as producing fast responses with a chat style equivalent to Mistral Small 3.2, while reasoning_effort="high" is intended for more deliberate, step-by-step reasoning with verbosity comparable to earlier Magistral models. This changes the deployment pattern. Instead of routing between one fast model and one reasoning model, dev teams can keep a single model in service and vary inference behavior at request time. That is cleaner from a systems perspective and easier to manage in products where only a subset of queries actually needs expensive reasoning.
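The single-model pattern can be sketched as a request builder that varies only reasoning_effort per call. The "none"/"high" values come from the documentation described above; the model identifier and the overall chat-completions payload shape are illustrative assumptions, not a confirmed API contract.

```python
def build_request(prompt: str, needs_deep_reasoning: bool) -> dict:
    """Build one chat payload; only reasoning_effort varies per request.

    "mistral-small-4" is a hypothetical model ID; the reasoning_effort
    values "none" and "high" follow the documented options.
    """
    return {
        "model": "mistral-small-4",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": "high" if needs_deep_reasoning else "none",
    }

fast = build_request("What is the capital of France?", needs_deep_reasoning=False)
slow = build_request("Prove the loop invariant holds for all n.", needs_deep_reasoning=True)
print(fast["reasoning_effort"], slow["reasoning_effort"])  # none high
```

The routing logic (here a boolean flag) is the only piece the application owns; there is no second model endpoint to provision, monitor, or keep warm.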
Performance Claims and Throughput Positioning
The Mistral team also emphasizes inference efficiency. Small 4 delivers a 40% reduction in end-to-end completion time in a latency-optimized setup and 3x more requests per second in a throughput-optimized setup, both measured against Mistral Small 3. Mistral is not presenting Small 4 as just a larger reasoning model, but as a system aimed at improving the economics of deployment under real serving loads.
Benchmark Results and Output Efficiency
On reasoning benchmarks, Mistral's release focuses on both quality and output efficiency. Mistral's evaluation team reports that Mistral Small 4 with reasoning matches or exceeds GPT-OSS 120B across AA LCR, LiveCodeBench, and AIME 2025, while producing shorter outputs. In the numbers published by Mistral, Small 4 scores 0.72 on AA LCR with 1.6K characters of output, while Qwen models require 5.8K to 6.1K characters for comparable performance. On LiveCodeBench, the Mistral team states that Small 4 outperforms GPT-OSS 120B while producing 20% less output. These are company-published results, but they highlight a more practical metric than benchmark score alone: performance per generated token. For production workloads, shorter outputs directly reduce latency, inference cost, and downstream parsing overhead.
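The published AA LCR figures can be folded into a simple score-per-character metric. The 0.72 / 1.6K numbers are from Mistral's release; treating the Qwen score as roughly equal at 5.8K characters is an assumption based on the release's "comparable performance" framing.

```python
def score_per_kchar(score: float, output_chars_k: float) -> float:
    """Benchmark score per 1K output characters: rewards terser answers."""
    return score / output_chars_k

# AA LCR: Small 4 at 0.72 with 1.6K chars; Qwen assumed comparable at 5.8K chars.
small4 = score_per_kchar(0.72, 1.6)
qwen = score_per_kchar(0.72, 5.8)
print(f"{small4 / qwen:.1f}x")  # roughly 3.6x more score per output character
```

Under those assumptions, the efficiency gap is driven entirely by output length, which is exactly the per-token economics argument the release is making.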

Deployment Details
For self-hosting, Mistral provides specific infrastructure guidance. The company lists a minimum deployment target of 4x NVIDIA HGX H100, 2x NVIDIA HGX H200, or 1x NVIDIA DGX B200, with larger configurations recommended for best performance. The model card on Hugging Face lists support across vLLM, llama.cpp, SGLang, and Transformers, though some paths are marked work in progress, and vLLM is the recommended option. The Mistral team also provides a custom Docker image and notes that fixes related to tool calling and reasoning parsing are still being upstreamed. That is useful detail for engineering teams because it clarifies that support exists, but some pieces are still stabilizing in the broader open-source serving stack.
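A vLLM launch along these lines is a reasonable starting point. This is a configuration sketch only: the checkpoint ID is a placeholder, the tensor-parallel size depends on your GPU count, and the model card (plus Mistral's custom Docker image) should be treated as authoritative over this command.

```shell
# Sketch of self-hosting with vLLM, the recommended serving path per the
# model card. The checkpoint ID below is a placeholder, not a real name;
# pick the tensor-parallel size to match your hardware (Mistral lists
# 4x H100 / 2x H200 / 1x B200 as minimum targets).
vllm serve mistralai/<small-4-checkpoint> \
  --tensor-parallel-size 4 \
  --max-model-len 256000
```

Since the card notes that tool-calling and reasoning-parsing fixes are still being upstreamed, pinning to Mistral's provided Docker image is likely safer than building against vLLM main until those land.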
Key Takeaways
- One unified model: Mistral Small 4 combines instruct, reasoning, multimodal, and agentic coding capabilities in a single model.
- Sparse MoE design: It uses 128 experts with 4 active experts per token, targeting better efficiency than dense models of comparable total size.
- Long-context support: The model supports a 256k context window and accepts text and image inputs with text output.
- Reasoning is configurable: Developers can adjust reasoning_effort at inference time instead of routing between separate fast and reasoning models.
- Open deployment focus: It is released under Apache 2.0 and supports serving through stacks such as vLLM, with multiple checkpoint variants on Hugging Face.
Check out the Model Card on Hugging Face for full technical details.