Introduction
As AI models grow in complexity and intelligence, the demand for efficient, high-performance inference platforms has never been higher. NVIDIA's Blackwell architecture, combined with the latest software optimizations in TensorRT-LLM, is setting new benchmarks for Mixture of Experts (MoE) models.
Recent benchmarks show that NVIDIA Blackwell is delivering massive performance leaps, particularly for sparse MoE architectures like DeepSeek-R1. This article explores the architectural and software innovations driving these gains.
The Power of GB200 NVL72 for MoE
The NVIDIA GB200 NVL72 is a rack-scale platform designed to handle the most demanding AI workloads. It connects 72 Blackwell GPUs into a single NVLink domain using fifth-generation NVLink, offering 1.8 TB/s of bidirectional bandwidth per GPU.
- Optimized for Sparse MoE: Sparse MoE models such as DeepSeek-R1 activate only a fraction of their parameters per token (roughly 37B of 671B total), which demands high-speed data exchange between the GPUs hosting different experts; a routing sketch follows this list.
- NVFP4 Data Format: Blackwell introduces hardware acceleration for NVFP4, a 4-bit floating-point format that pairs low-precision values with fine-grained block scaling to maintain accuracy while significantly boosting compute efficiency (see the quantization sketch after this list).
- Disaggregated Serving: This technique splits the compute-bound prefill phase and the memory-bandwidth-bound decode phase across different GPUs, letting each be scaled independently; the NVL72's interconnect keeps the KV-cache handoff between the two phases fast, minimizing latency.
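To make the sparsity concrete, here is a minimal top-k routing sketch. It is purely illustrative, not TensorRT-LLM's implementation: the expert count, dimensions, and gating weights below are made up. The idea is that a learned gate scores every expert per token, and only the top-k experts' weights ever touch that token.

```python
# Minimal sketch of sparse MoE top-k routing (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
num_tokens, hidden, num_experts, top_k = 4, 8, 16, 2

tokens = rng.standard_normal((num_tokens, hidden))
gate_w = rng.standard_normal((hidden, num_experts))       # router weights
expert_w = rng.standard_normal((num_experts, hidden, hidden))

# Router: softmax over expert scores, then keep only the top-k per token.
# (A real router typically renormalizes probabilities over the selected k.)
logits = tokens @ gate_w
top = np.argsort(logits, axis=-1)[:, -top_k:]             # chosen experts
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

out = np.zeros_like(tokens)
for t in range(num_tokens):
    for e in top[t]:
        # Only the selected experts' weights touch this token.
        out[t] += probs[t, e] * (tokens[t] @ expert_w[e])

print("active experts per token:", top)
```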
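The accuracy story behind NVFP4 comes from that block scaling. The toy below simulates 4-bit (E2M1) quantization with one scale per 16-value block; it is a simplified sketch of the scheme, not the real format (NVFP4 stores FP8 block scales plus a tensor-level scale, and Blackwell performs all of this in hardware).

```python
# Toy simulation of NVFP4-style block-scaled 4-bit quantization.
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1[::-1], E2M1])   # all representable FP4 values

def quantize_fp4(x, block=16):
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 6.0    # 6.0 = max |E2M1|
    scale[scale == 0] = 1.0                               # avoid div-by-zero
    # Snap each scaled value to the nearest representable FP4 value.
    idx = np.abs((x / scale)[..., None] - GRID).argmin(axis=-1)
    return GRID[idx] * scale                              # dequantized result

x = np.random.default_rng(1).standard_normal(64).astype(np.float32)
xq = quantize_fp4(x).ravel()
print("max abs error:", np.abs(x - xq).max())
```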
Software Innovations in TensorRT-LLM
Hardware alone isn't enough; NVIDIA’s software stack plays a critical role. The NVIDIA TensorRT-LLM library has introduced several enhancements that have increased Blackwell's throughput by up to 2.8x in just three months.
- Programmatic Dependent Launch (PDL): PDL lets a dependent kernel begin launching while its predecessor is still running, hiding kernel launch latency; this is essential for maintaining high throughput across various interactivity levels.
- Kernel Optimizations: Low-level kernel tuning ensures that Blackwell Tensor Cores are utilized as efficiently as possible.
- All-to-All Communication: New primitives eliminate intermediate buffers on the receiver side, streamlining the data flow between experts in a distributed system; a simplified dispatch simulation follows this list.
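The dispatch pattern these primitives accelerate looks roughly like the single-process simulation below. This is an illustration of the communication pattern only; real deployments use NVLink/NCCL collectives (e.g., torch.distributed.all_to_all_single), not Python dictionaries, and the rank and token counts here are invented.

```python
# Single-process simulation of all-to-all token dispatch in expert
# parallelism (pattern illustration only; not a real collective).
import numpy as np

world_size = 4                        # pretend ranks, one expert per rank
rng = np.random.default_rng(2)

# Each rank holds 8 tokens; the router assigns every token a destination rank.
tokens = {r: rng.standard_normal((8, 4)) for r in range(world_size)}
dest = {r: rng.integers(0, world_size, 8) for r in range(world_size)}

# All-to-all: every rank sends each token straight to the rank that owns its
# expert. Removing the receiver-side staging buffer means data can land
# directly in the expert's input rather than being copied twice.
expert_inbox = {r: [] for r in range(world_size)}
for src in range(world_size):
    for i, dst in enumerate(dest[src]):
        expert_inbox[dst].append(tokens[src][i])

for r in range(world_size):
    print(f"rank {r} received {len(expert_inbox[r])} tokens")
```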
HGX B200: Performance for Air-Cooled Deployments
For deployments where rack-scale liquid cooling isn't an option, the NVIDIA HGX B200 platform provides exceptional performance. It connects eight Blackwell GPUs via NVLink and leverages two key technologies:
- Multi-Token Prediction (MTP): MTP drafts several tokens per step and verifies them in parallel, significantly increasing throughput across all tested input/output sequence lengths (see the sketch after this list).
- NVFP4 Acceleration: Running in NVFP4 through the full NVIDIA software stack, HGX B200 achieves a substantial throughput boost without sacrificing model accuracy.
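The intuition behind MTP-style speedups can be seen in a toy speculative-decoding loop: a cheap draft proposes several tokens, the full model verifies them in one pass, and the longest agreeing prefix is accepted, yielding multiple tokens per full-model step. Both models below are hypothetical stand-ins, not DeepSeek-R1's actual MTP head.

```python
# Toy sketch of multi-token prediction used as speculative decoding.
import numpy as np

rng = np.random.default_rng(3)
VOCAB = 32

def target_model(ctx):
    # Stand-in for the full model: deterministic next token from context.
    return (sum(ctx) * 31 + len(ctx)) % VOCAB

def draft_model(ctx):
    # Cheap draft head: right most of the time, occasionally wrong.
    t = target_model(ctx)
    return t if rng.random() < 0.8 else (t + 1) % VOCAB

def generate(prompt, steps=16, k=4):
    ctx = list(prompt)
    for _ in range(steps):
        # Draft k tokens autoregressively with the cheap model.
        proposal = []
        for _ in range(k):
            proposal.append(draft_model(ctx + proposal))
        # Verify each position (in practice one batched forward pass) and
        # accept the longest prefix that matches the target model.
        accepted = 0
        for i, tok in enumerate(proposal):
            if target_model(ctx + proposal[:i]) == tok:
                accepted += 1
            else:
                break
        ctx += proposal[:accepted]
        if accepted < k:
            # The target model supplies the token at the first mismatch,
            # so every step still gains at least one correct token.
            ctx.append(target_model(ctx))
    return ctx

print(generate([1, 2, 3]))
```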
Conclusion
NVIDIA's commitment to continuous optimization across the entire stack—from hardware architecture to low-level software libraries—is delivering unprecedented value. By leveraging the full capabilities of the Blackwell platform, enterprises can serve more users with lower latency and higher efficiency.
As models like DeepSeek-R1 continue to push the boundaries of what's possible, platforms like GB200 NVL72 and HGX B200 will be the foundation for the next generation of intelligent AI applications.