NVIDIA Blackwell: Massive Performance Leaps for MoE Inference

Discover how NVIDIA Blackwell and TensorRT-LLM deliver up to 2.8x throughput increases for Mixture of Experts (MoE) models like DeepSeek-R1.

by HowAIWorks Team
Tags: NVIDIA, Blackwell, MoE, Inference, TensorRT-LLM, DeepSeek-R1, AI Infrastructure

Introduction

As AI models grow in complexity and intelligence, the demand for efficient, high-performance inference platforms has never been higher. NVIDIA's Blackwell architecture, combined with the latest software optimizations in TensorRT-LLM, is setting new benchmarks for Mixture of Experts (MoE) models.

Recent benchmarks show that NVIDIA Blackwell is delivering massive performance leaps, particularly for sparse MoE architectures like DeepSeek-R1. This article explores the architectural and software innovations driving these gains.

The Power of GB200 NVL72 for MoE

The NVIDIA GB200 NVL72 is a rack-scale platform designed to handle the most demanding AI workloads. It connects 72 Blackwell GPUs with fifth-generation NVLink, giving each GPU 1.8 TB/s of bidirectional bandwidth so that any two GPUs in the rack can exchange data at full NVLink speed.

  • Optimized for Sparse MoE: Sparse MoE models, like the 671B-parameter DeepSeek-R1, activate only a fraction of their parameters per token (about 37B for DeepSeek-R1). Because the experts are spread across many GPUs, every token must be routed to its selected experts and the results gathered back, which demands high-speed GPU-to-GPU data exchange (a toy routing sketch follows this list).
  • NVFP4 Data Format: Blackwell introduces hardware acceleration for the NVFP4 format—a 4-bit floating point format that maintains accuracy while significantly boosting compute efficiency.
  • Disaggregated Serving: This technique splits prefill and decode operations across different GPUs, taking full advantage of the NVL72's architecture to minimize latency.
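To make the sparsity concrete, the toy PyTorch sketch below shows the core of top-k expert routing: the router scores every expert for each token, and only the few winners actually run. All names and dimensions here are illustrative, not DeepSeek-R1's real configuration.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, gate_weight, num_experts_per_token=8):
    """Toy top-k MoE router: pick a few experts per token.

    hidden:      [num_tokens, hidden_dim] token activations
    gate_weight: [num_experts, hidden_dim] router projection
    Returns the chosen expert ids and their (renormalized) mixing weights.
    """
    # Router logits: one score per (token, expert) pair.
    logits = hidden @ gate_weight.t()                      # [tokens, experts]
    probs = F.softmax(logits, dim=-1)

    # Keep only the top-k experts per token; everything else stays inactive,
    # which is why only a small slice of a huge MoE model runs per token.
    topk_probs, topk_ids = probs.topk(num_experts_per_token, dim=-1)
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    return topk_ids, topk_probs

# Illustrative sizes only (not DeepSeek-R1's real configuration).
tokens = torch.randn(4, 1024)          # 4 tokens, hidden size 1024
gate   = torch.randn(64, 1024)         # 64 experts
expert_ids, expert_weights = route_tokens(tokens, gate)
print(expert_ids)       # which experts each token is sent to
print(expert_weights)   # how their outputs are mixed back together
```

In a multi-GPU deployment the experts live on different devices, so each token's hidden state must be shipped to the GPUs that host its chosen experts and the results gathered back afterwards; that traffic is exactly what the NVL72 NVLink fabric is built to absorb.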

Software Innovations in TensorRT-LLM

Hardware alone isn't enough; NVIDIA’s software stack plays a critical role. The NVIDIA TensorRT-LLM library has introduced several enhancements that have increased Blackwell's throughput by up to 2.8x in just three months.
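For orientation, serving a model through TensorRT-LLM's high-level LLM API looks roughly like the sketch below. The model identifier and arguments are placeholders, so consult the TensorRT-LLM documentation for the options your version supports; the optimizations listed next are applied underneath this API.

```python
# Minimal sketch of the TensorRT-LLM high-level LLM API.
# The checkpoint id and arguments are placeholders, not a verified recipe.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",   # placeholder checkpoint id
    tensor_parallel_size=8,            # split the model across 8 GPUs
)

prompts = ["Explain mixture-of-experts routing in one paragraph."]
params = SamplingParams(max_tokens=256, temperature=0.7)

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```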

  • Programmatic Dependent Launch (PDL): PDL lets a dependent kernel begin launching while the preceding kernel is still finishing, hiding kernel-launch latency. This is essential for maintaining high throughput across various interactivity levels.
  • Kernel Optimizations: Low-level tweaks ensure that Blackwell Tensor Cores are utilized as efficiently as possible.
  • All-to-All Communication: New primitives eliminate intermediate buffers on the receiver side, streamlining the data flow between experts in a distributed system (the collective pattern they accelerate is sketched below).
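The collective pattern behind expert parallelism can be sketched with plain torch.distributed calls, as below. This is only a conceptual illustration of the dispatch step (it assumes a process group is already initialized, e.g. via torchrun), not the fused kernels TensorRT-LLM actually ships.

```python
import torch
import torch.distributed as dist

def exchange_split_sizes(in_splits):
    """Tell every rank how many tokens it will receive from each peer."""
    send = torch.tensor(in_splits, dtype=torch.int64)
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send)   # one count per peer in each direction
    return recv.tolist()

def dispatch_tokens(tokens_per_rank):
    """Send each peer the tokens routed to experts it hosts, in one all-to-all.

    tokens_per_rank: list of [n_i, hidden] tensors, one per destination rank.
    This shows only the communication pattern; optimized implementations
    avoid the intermediate copies made explicit here.
    """
    send = torch.cat(tokens_per_rank, dim=0)
    in_splits = [t.shape[0] for t in tokens_per_rank]
    out_splits = exchange_split_sizes(in_splits)

    recv = send.new_empty(sum(out_splits), send.shape[1])
    dist.all_to_all_single(recv, send,
                           output_split_sizes=out_splits,
                           input_split_sizes=in_splits)
    return recv   # tokens this rank's experts must now process
```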

HGX B200: Performance for Air-Cooled Deployments

For environments that don't require full rack-scale liquid cooling, the NVIDIA HGX B200 platform provides exceptional performance. It uses eight Blackwell GPUs connected via NVLink and leverages two key technologies:

  1. Multi-Token Prediction (MTP): The model drafts several future tokens per decoding step and verifies them in a single forward pass, significantly increasing throughput across all tested input/output sequence lengths (a toy acceptance loop is sketched after this list).
  2. NVFP4 Acceleration: By quantizing to the NVFP4 format via the full NVIDIA software stack, HGX B200 achieves a substantial throughput boost without sacrificing model accuracy (a simplified quantization simulation follows the MTP sketch below).
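The idea behind MTP-style decoding can be illustrated with a small, self-contained acceptance loop: a draft head proposes a few future tokens, the base model verifies them in one forward pass, and every accepted token is one fewer decode step. The callables and the greedy acceptance rule below are stand-ins, not TensorRT-LLM's actual implementation.

```python
import torch

def mtp_decode_step(model, draft_head, context, num_draft_tokens=3):
    """One speculative step with a multi-token-prediction draft head.

    `model` and `draft_head` are placeholder callables; the acceptance rule
    is plain greedy verification, used here only to show the mechanism.
    """
    # 1. The lightweight draft head proposes a few future tokens at once.
    draft = draft_head(context, num_draft_tokens)          # [num_draft_tokens]

    # 2. The full model scores context + draft in a single forward pass.
    logits = model(torch.cat([context, draft]))            # [seq_len, vocab]
    verified = logits[-(num_draft_tokens + 1):-1].argmax(-1)

    # 3. Keep draft tokens only while they match what the model would emit.
    accepted = []
    for proposed, checked in zip(draft.tolist(), verified.tolist()):
        if proposed != checked:
            # The model disagreed: take its token instead and stop.
            return accepted + [checked]
        accepted.append(proposed)

    # 4. All drafts matched: the prediction after the last draft comes free.
    return accepted + [logits[-1].argmax(-1).item()]

# Toy usage with stand-in callables (real deployments use the base model
# and its MTP module).
vocab = 100
toy_model = lambda ids: torch.randn(ids.shape[0], vocab)
toy_draft = lambda ids, k: torch.randint(0, vocab, (k,))
print(mtp_decode_step(toy_model, toy_draft, torch.randint(0, vocab, (10,))))
```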
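NVFP4 itself is a hardware format, but its effect can be approximated in a few lines: quantize small blocks of values to a 4-bit floating-point grid with one scale per block, then check how little the tensor changes. The block size and scale handling below are simplifications for illustration, not the exact NVFP4 specification.

```python
import torch

# Representable magnitudes of a 4-bit E2M1 float (plus a sign bit).
# Block size and scale handling here are deliberately simplified.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_4bit(x, block_size=16):
    """Simulate 4-bit block quantization: scale each small block so its
    largest value lands on the grid's maximum, round every value to the
    nearest representable number, then dequantize for error inspection.
    """
    flat = x.flatten()
    pad = (-flat.numel()) % block_size
    flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.view(-1, block_size)

    # One scale per block: map the block's max magnitude onto 6.0.
    scale = blocks.abs().amax(dim=1, keepdim=True) / E2M1_GRID.max()
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)

    scaled = blocks / scale
    # Snap every value to the nearest representable magnitude, keep the sign.
    idx = (scaled.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
    quant = E2M1_GRID[idx] * scaled.sign()

    return (quant * scale).flatten()[: x.numel()].view_as(x)

w = torch.randn(4, 32)
print((w - fake_quant_4bit(w)).abs().max())   # per-block scaling keeps error small
```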

Conclusion

NVIDIA's commitment to continuous optimization across the entire stack—from hardware architecture to low-level software libraries—is delivering unprecedented value. By leveraging the full capabilities of the Blackwell platform, enterprises can serve more users with lower latency and higher efficiency.

As models like DeepSeek-R1 continue to push the boundaries of what's possible, platforms like GB200 NVL72 and HGX B200 will be the foundation for the next generation of intelligent AI applications.

Frequently Asked Questions

How much faster has DeepSeek-R1 inference become on Blackwell?
NVIDIA Blackwell GPUs have seen a throughput increase of up to 2.8x for DeepSeek-R1 inference over the past three months due to software optimizations.

Which Blackwell platforms are used for MoE inference?
The primary platforms are GB200 NVL72 (rack-scale with 72 GPUs) and HGX B200 (eight GPUs connected via NVLink).

Which technologies drive these performance gains?
Key technologies include the NVFP4 data format, Programmatic Dependent Launch (PDL), and optimized all-to-all communication primitives.

How does GB200 NVL72 support sparse MoE models?
It uses fifth-generation NVLink to provide 1.8 TB/s of bidirectional bandwidth per GPU, enabling the frequent data exchanges required by sparse MoE architectures.
