NVIDIA and Mistral AI Bring 10x Faster Inference for the Mistral 3 Family on GB200 NVL72 GPU Systems


Source: MarkTechPost

NVIDIA announced today a significant expansion of its strategic collaboration with Mistral AI. This partnership coincides with the release of the new Mistral 3 frontier open model family, marking a pivotal moment where hardware acceleration and open-source model architecture have converged to redefine performance benchmarks.

At the heart of this collaboration is a massive leap in inference speed: the new models now run up to 10x faster on NVIDIA GB200 NVL72 systems than on previous-generation H200 systems. This breakthrough unlocks unprecedented efficiency for enterprise-grade AI, promising to ease the latency and cost bottlenecks that have historically plagued the large-scale deployment of reasoning models.

A Generational Leap: 10x Faster on Blackwell

As enterprise demand shifts from simple chatbots to high-reasoning, long-context agents, inference efficiency has become the critical bottleneck. The collaboration between NVIDIA and Mistral AI addresses this head-on by optimizing the Mistral 3 family specifically for the NVIDIA Blackwell architecture.

In production AI systems that must deliver both a strong user experience (UX) and cost-efficient scale, the NVIDIA GB200 NVL72 provides up to 10x higher performance than the previous-generation H200. This is not merely a gain in raw speed; it also translates to significantly higher energy efficiency: the system exceeds 5,000,000 tokens per second per megawatt (MW) at a user interactivity rate of 40 tokens per second.
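To put those figures in perspective, here is a quick back-of-the-envelope calculation using only the numbers quoted above; the derived quantities are illustrative, not additional benchmark results.

```python
# Back-of-the-envelope math from the figures above; derived numbers are illustrative only.
tokens_per_s_per_mw = 5_000_000   # stated efficiency: tokens per second per megawatt
per_user_rate = 40                # stated interactivity target: tokens per second per user

concurrent_streams_per_mw = tokens_per_s_per_mw / per_user_rate
tokens_per_mwh = tokens_per_s_per_mw * 3600

print(f"~{concurrent_streams_per_mw:,.0f} concurrent 40 tok/s user streams per MW")  # ~125,000
print(f"~{tokens_per_mwh / 1e9:.0f} billion tokens generated per megawatt-hour")     # ~18
```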

For data centers grappling with power constraints, this efficiency gain is as critical as the performance boost itself. This generational leap ensures a lower per-token cost while maintaining the high throughput required for real-time applications.

The engine driving this performance is the newly released Mistral 3 family. This suite of models delivers industry-leading accuracy, efficiency, and customization capabilities, covering the spectrum from massive data center workloads to edge device inference.

Mistral Large 3: The Flagship MoE

At the top of the hierarchy sits Mistral Large 3, a state-of-the-art sparse Multimodal and Multilingual Mixture-of-Experts (MoE) model.

  • Total Parameters: 675 Billion
  • Active Parameters: 41 Billion
  • Context Window: 256K tokens

Trained on NVIDIA Hopper GPUs, Mistral Large 3 is designed to handle complex reasoning tasks, offering parity with top-tier closed models while retaining the flexibility of open weights.
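A quick calculation from these published specs shows why the sparse MoE design matters for serving cost; the memory figures below are rough weight-only estimates that ignore quantization scale factors, activations, and KV cache.

```python
# Quick arithmetic on the Mistral Large 3 specs above; illustrative only.
total_params, active_params = 675e9, 41e9

print(f"active fraction per token: {active_params / total_params:.1%}")  # ~6.1%

# Rough weight-only memory footprint at different precisions.
for name, bits in [("BF16", 16), ("FP8", 8), ("NVFP4", 4)]:
    weight_tb = total_params * bits / 8 / 1e12
    print(f"{name:>6}: ~{weight_tb:.2f} TB of weights")
```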

Ministral 3: Dense Power at the Edge

Complementing the large model is the Ministral 3 series, a suite of small, dense, high-performance models designed for speed and versatility.

  • Sizes: 3B, 8B, and 14B parameters.
  • Variants: Base, Instruct, and Reasoning for each size (nine models total).
  • Context Window: 256K tokens across the board.

The Ministral 3 series also excels on the GPQA Diamond accuracy benchmark, delivering higher accuracy while using roughly 100 fewer tokens than comparable models.

Significant Engineering Behind the Speed: A Comprehensive Optimization Stack

The “10x” performance claim is driven by a comprehensive stack of optimizations co-developed by Mistral and NVIDIA engineers. The teams adopted an “extreme co-design” approach, merging hardware capabilities with model architecture adjustments.

TensorRT-LLM Wide Expert Parallelism (Wide-EP)

To fully exploit the massive scale of the GB200 NVL72, NVIDIA employed Wide Expert Parallelism within TensorRT-LLM. This technology provides optimized MoE GroupGEMM kernels, expert distribution, and load balancing.

Crucially, Wide-EP exploits the NVL72’s coherent memory domain and NVLink fabric. It is highly resilient to architectural variations across large MoEs. For instance, Mistral Large 3 utilizes roughly 128 experts per layer, about half as many as comparable models like DeepSeek-R1. Despite this difference, Wide-EP enables the model to realize the high-bandwidth, low-latency, non-blocking benefits of the NVLink fabric, ensuring that the model’s massive size does not result in communication bottlenecks.
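To make the idea concrete, the sketch below shows in plain Python what "wide" expert parallelism has to manage: placing many experts across the GPUs of an NVL72 domain and tracking per-GPU load as tokens are routed. This is a conceptual illustration only, not the TensorRT-LLM implementation; the top-k value and the random router are assumptions.

```python
# Conceptual sketch of wide expert parallelism (not the TensorRT-LLM implementation).
import collections
import random

NUM_EXPERTS, NUM_GPUS, TOP_K = 128, 72, 4  # expert count from the article; top-k assumed

# Simple round-robin placement: expert e lives on GPU e % NUM_GPUS.
expert_to_gpu = {e: e % NUM_GPUS for e in range(NUM_EXPERTS)}

def route(num_tokens: int) -> collections.Counter:
    """Count how many (token, expert) activations land on each GPU for one batch."""
    load = collections.Counter()
    for _ in range(num_tokens):
        # Stand-in for a learned router: pick TOP_K distinct experts per token.
        for e in random.sample(range(NUM_EXPERTS), TOP_K):
            load[expert_to_gpu[e]] += 1
    return load

load = route(4096)
print("max / min per-GPU load:", max(load.values()), "/", min(load.values()))
```

In the real system, the runtime's expert distribution and load balancing keep that max/min gap small, and the NVLink fabric keeps the all-to-all exchange of routed tokens from becoming the bottleneck.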

Native NVFP4 Quantization

One of the most significant technical advancements in this release is the support for NVFP4, a quantization format native to the Blackwell architecture.

For Mistral Large 3, developers can deploy a compute-optimized NVFP4 checkpoint quantized offline using the open-source llm-compressor library.

This approach reduces compute and memory costs while strictly maintaining accuracy. It leverages NVFP4’s higher-precision FP8 scaling factors and finer-grained block scaling to control quantization error. The recipe specifically targets the MoE weights while keeping other components at original precision, allowing the model to deploy seamlessly on the GB200 NVL72 with minimal accuracy loss.
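For illustration, a minimal offline-quantization sketch using llm-compressor's one-shot flow is shown below. The model ID, scheme identifier, ignore patterns, and calibration settings are assumptions that may differ from the published recipe and by library version; consult the llm-compressor documentation for the exact NVFP4 workflow.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "mistralai/Mistral-Large-3"  # hypothetical Hugging Face ID; check the model card

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize Linear layers to NVFP4 while leaving sensitive components (router, attention,
# lm_head) at original precision -- the ignore patterns here are illustrative.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*router.*", "re:.*self_attn.*"],
)

# One-shot calibration pass; dataset name and sample counts are placeholders.
oneshot(
    model=model,
    recipe=recipe,
    dataset="open_platypus",
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("Mistral-Large-3-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Mistral-Large-3-NVFP4")
```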

Disaggregated Serving with NVIDIA Dynamo

Mistral Large 3 utilizes NVIDIA Dynamo, a low-latency distributed inference framework, to disaggregate the prefill and decode phases of inference.

In traditional setups, the prefill phase (processing the input prompt) and the decode phase (generating the output) compete for resources. By rate-matching and disaggregating these phases, Dynamo significantly boosts performance for long-context workloads, such as 8K input/1K output configurations. This ensures high throughput even when utilizing the model’s massive 256K context window.
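The rate-matching idea can be illustrated with a toy calculation. This is not Dynamo's API; the per-GPU throughput numbers are made-up placeholders, and only the 8K/1K workload shape comes from the text.

```python
# Toy rate-matching calculation for disaggregated prefill/decode serving.
prefill_tok_per_s_per_gpu = 20_000    # assumed prompt-ingestion throughput per GPU
decode_tok_per_s_per_gpu = 1_500      # assumed generation throughput per GPU
input_len, output_len = 8_192, 1_024  # the 8K-input / 1K-output workload from the text

# GPU-seconds each phase spends on one request.
prefill_cost = input_len / prefill_tok_per_s_per_gpu
decode_cost = output_len / decode_tok_per_s_per_gpu

# Rate matching: size the two pools so requests flow through both phases at the same rate,
# instead of letting prefill and decode contend for the same GPUs.
decode_gpus_per_prefill_gpu = decode_cost / prefill_cost
print(f"decode GPUs needed per prefill GPU: {decode_gpus_per_prefill_gpu:.2f}")
```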

From Cloud to Edge: Ministral 3 Performance

The optimization efforts extend beyond the massive data centers. Recognizing the growing need for local AI, the Ministral 3 series is engineered for edge deployment, offering flexibility for a variety of needs.

RTX and Jetson Acceleration

The dense Ministral models are optimized for platforms like the NVIDIA GeForce RTX AI PC and NVIDIA Jetson robotics modules.

  • RTX 5090: The Ministral-3B variants can reach blistering inference speeds of 385 tokens per second on the NVIDIA RTX 5090 GPU. This brings workstation-class AI performance to local PCs, enabling fast iteration and greater data privacy.
  • Jetson Thor: For robotics and edge AI, developers can use the vLLM container on NVIDIA Jetson Thor. The Ministral-3-3B-Instruct model achieves 52 tokens per second for single concurrency, scaling up to 273 tokens per second with a concurrency of 8. A minimal vLLM sketch follows this list.
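As a concrete starting point, here is a minimal vLLM sketch of the single-prompt case. The model identifier is an assumption, so check Hugging Face or NGC for the published name; on Jetson Thor you would run this inside the vLLM container mentioned above.

```python
from vllm import LLM, SamplingParams

# Model ID assumed for illustration; verify the published name before running.
llm = LLM(model="mistralai/Ministral-3-3B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the benefits of on-device inference."], params)
print(outputs[0].outputs[0].text)
```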

Broad Framework Support

NVIDIA has collaborated with the open-source community to ensure these models are usable everywhere.

  • Llama.cpp & Ollama: NVIDIA collaborated with these popular frameworks to ensure faster iteration and lower latency for local development (see the Ollama sketch after this list).
  • SGLang: NVIDIA collaborated with SGLang to create an implementation of Mistral Large 3 that supports both disaggregation and speculative decoding.
  • vLLM: NVIDIA worked with vLLM to expand support for kernel integrations, including speculative decoding (EAGLE), Blackwell support, and expanded parallelism.
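For local development, a minimal example with the Ollama Python client looks like the sketch below. The model tag is an assumption and the published Ministral 3 tag may differ, so check the Ollama library listing before running.

```python
import ollama  # pip install ollama; requires a running Ollama server with the model pulled

# Model tag assumed for illustration; the published Ministral 3 tag may differ.
response = ollama.chat(
    model="ministral-3:8b",
    messages=[{"role": "user", "content": "Give three good uses for a local 8B model."}],
)
print(response["message"]["content"])
```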

Production-Ready with NVIDIA NIM

To streamline enterprise adoption, the new models will be available through NVIDIA NIM microservices.

Mistral Large 3 and Ministral-14B-Instruct are currently available through the NVIDIA API catalog and preview API. Soon, enterprise developers will be able to use downloadable NVIDIA NIM microservices. This provides a containerized, production-ready solution that allows enterprises to deploy the Mistral 3 family with minimal setup on any GPU-accelerated infrastructure.
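Because the API catalog exposes an OpenAI-compatible endpoint, trying the hosted preview typically looks like the sketch below. The base URL follows the catalog's usual pattern and the model ID is an assumption, so confirm both on build.nvidia.com.

```python
import os

from openai import OpenAI

# Endpoint follows the NVIDIA API catalog's OpenAI-compatible pattern; model ID is assumed.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

completion = client.chat.completions.create(
    model="mistralai/mistral-large-3",  # hypothetical catalog ID; check the listing
    messages=[{"role": "user", "content": "Draft a rollout plan for a long-context agent."}],
    max_tokens=512,
)
print(completion.choices[0].message.content)
```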

This availability ensures that the specific “10x” performance advantage of the GB200 NVL72 can be realized in production environments without complex custom engineering, democratizing access to frontier-class intelligence.

Conclusion: A New Standard for Open Intelligence

The release of the NVIDIA-accelerated Mistral 3 open model family represents a major leap for AI in the open-source community. By offering frontier-level performance under an open source license, and backing it with a robust hardware optimization stack, Mistral and NVIDIA are meeting developers where they are.

From the massive scale of the GB200 NVL72 utilizing Wide-EP and NVFP4, to the edge-friendly density of Ministral on an RTX 5090, this partnership delivers a scalable, efficient path for deploying AI at every tier. With upcoming optimizations such as speculative decoding with multi-token prediction (MTP) and EAGLE-3 expected to push performance even further, the Mistral 3 family is poised to become a foundational element of the next generation of AI applications.

Available to test!

If you are a developer looking to benchmark these performance gains, you can download the Mistral 3 models directly from Hugging Face or test the deployment-free hosted versions on build.nvidia.com/mistralai to evaluate the latency and throughput for your specific use case.


Check out the models on Hugging Face. You can find more details on the corporate blog and the technical/developer blog.

Thanks to the NVIDIA AI team for the thought leadership and resources for this article. The NVIDIA AI team has supported this content.

Jean-Marc is a successful AI business executive. He leads and accelerates growth for AI-powered solutions and started a computer vision company in 2006. He is a recognized speaker at AI conferences and holds an MBA from Stanford.