Introduction
As AI models grow in complexity and intelligence, the demand for efficient, high-performance inference platforms has never been higher. NVIDIA's Blackwell architecture, combined with the latest software optimizations in TensorRT-LLM, is setting new benchmarks for Mixture of Experts (MoE) models.
Recent benchmarks show that NVIDIA Blackwell is delivering massive performance leaps, particularly for sparse MoE architectures like DeepSeek-R1. This article explores the architectural and software innovations driving these gains.
The Power of GB200 NVL72 for MoE
The NVIDIA GB200 NVL72 is a rack-scale platform designed to handle the most demanding AI workloads. It connects 72 Blackwell GPUs into a single NVLink domain using fifth-generation NVLink, offering 1.8 TB/s of bidirectional bandwidth per GPU.
- Optimized for Sparse MoE: Sparse MoE models such as DeepSeek-R1 activate only a fraction of their parameters per token (roughly 37B of 671B total), which demands high-speed data exchange between the GPUs hosting different experts; a routing sketch follows this list.
- NVFP4 Data Format: Blackwell introduces hardware acceleration for NVFP4, a 4-bit floating-point format that pairs low-precision values with fine-grained block scaling to maintain accuracy while significantly boosting compute efficiency (see the quantization sketch after this list).
- Disaggregated Serving: This technique splits the compute-bound prefill phase and the memory-bandwidth-bound decode phase across different GPUs, letting each be scaled independently; the NVL72's interconnect keeps the KV-cache handoff between the two phases fast, minimizing latency.
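To make the sparsity concrete, here is a minimal top-k routing sketch. It is purely illustrative, not TensorRT-LLM's implementation: the expert count, dimensions, and gating weights below are made up. The idea is that a learned gate scores every expert per token, and only the top-k experts' weights ever touch that token.

```python
# Minimal sketch of sparse MoE top-k routing (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
num_tokens, hidden, num_experts, top_k = 4, 8, 16, 2

tokens = rng.standard_normal((num_tokens, hidden))
gate_w = rng.standard_normal((hidden, num_experts))       # router weights
expert_w = rng.standard_normal((num_experts, hidden, hidden))

# Router: softmax over expert scores, then keep only the top-k per token.
# (A real router typically renormalizes probabilities over the selected k.)
logits = tokens @ gate_w
top = np.argsort(logits, axis=-1)[:, -top_k:]             # chosen experts
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

out = np.zeros_like(tokens)
for t in range(num_tokens):
    for e in top[t]:
        # Only the selected experts' weights touch this token.
        out[t] += probs[t, e] * (tokens[t] @ expert_w[e])

print("active experts per token:", top)
```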
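The accuracy story behind NVFP4 comes from that block scaling. The toy below simulates 4-bit (E2M1) quantization with one scale per 16-value block; it is a simplified sketch of the scheme, not the real format (NVFP4 stores FP8 block scales plus a tensor-level scale, and Blackwell performs all of this in hardware).

```python
# Toy simulation of NVFP4-style block-scaled 4-bit quantization.
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1[::-1], E2M1])   # all representable FP4 values

def quantize_fp4(x, block=16):
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 6.0    # 6.0 = max |E2M1|
    scale[scale == 0] = 1.0                               # avoid div-by-zero
    # Snap each scaled value to the nearest representable FP4 value.
    idx = np.abs((x / scale)[..., None] - GRID).argmin(axis=-1)
    return GRID[idx] * scale                              # dequantized result

x = np.random.default_rng(1).standard_normal(64).astype(np.float32)
xq = quantize_fp4(x).ravel()
print("max abs error:", np.abs(x - xq).max())
```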
Software Innovations in TensorRT-LLM
Hardware alone isn't enough; NVIDIA’s software stack plays a critical role. The NVIDIA TensorRT-LLM library has introduced several enhancements that have increased Blackwell's throughput by up to 2.8x in just three months.
- Programmatic Dependent Launch (PDL): PDL lets a dependent kernel begin launching while its predecessor is still running, hiding kernel launch latency; this is essential for maintaining high throughput across various interactivity levels.
- Kernel Optimizations: Low-level kernel tuning ensures that Blackwell Tensor Cores are utilized as efficiently as possible.
- All-to-All Communication: New primitives eliminate intermediate buffers on the receiver side, streamlining the data flow between experts in a distributed system; a simplified dispatch simulation follows this list.
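The dispatch pattern these primitives accelerate looks roughly like the single-process simulation below. This is an illustration of the communication pattern only; real deployments use NVLink/NCCL collectives (e.g., torch.distributed.all_to_all_single), not Python dictionaries, and the rank and token counts here are invented.

```python
# Single-process simulation of all-to-all token dispatch in expert
# parallelism (pattern illustration only; not a real collective).
import numpy as np

world_size = 4                        # pretend ranks, one expert per rank
rng = np.random.default_rng(2)

# Each rank holds 8 tokens; the router assigns every token a destination rank.
tokens = {r: rng.standard_normal((8, 4)) for r in range(world_size)}
dest = {r: rng.integers(0, world_size, 8) for r in range(world_size)}

# All-to-all: every rank sends each token straight to the rank that owns its
# expert. Removing the receiver-side staging buffer means data can land
# directly in the expert's input rather than being copied twice.
expert_inbox = {r: [] for r in range(world_size)}
for src in range(world_size):
    for i, dst in enumerate(dest[src]):
        expert_inbox[dst].append(tokens[src][i])

for r in range(world_size):
    print(f"rank {r} received {len(expert_inbox[r])} tokens")
```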
HGX B200: Performance for Air-Cooled Deployments
For deployments where rack-scale liquid cooling isn't an option, the NVIDIA HGX B200 platform provides exceptional performance. It connects eight Blackwell GPUs via NVLink and leverages two key technologies:
- Multi-Token Prediction (MTP): MTP drafts several tokens per step and verifies them in parallel, significantly increasing throughput across all tested input/output sequence lengths (see the sketch after this list).
- NVFP4 Acceleration: Running in NVFP4 through the full NVIDIA software stack, HGX B200 achieves a substantial throughput boost without sacrificing model accuracy.
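The intuition behind MTP-style speedups can be seen in a toy speculative-decoding loop: a cheap draft proposes several tokens, the full model verifies them in one pass, and the longest agreeing prefix is accepted, yielding multiple tokens per full-model step. Both models below are hypothetical stand-ins, not DeepSeek-R1's actual MTP head.

```python
# Toy sketch of multi-token prediction used as speculative decoding.
import numpy as np

rng = np.random.default_rng(3)
VOCAB = 32

def target_model(ctx):
    # Stand-in for the full model: deterministic next token from context.
    return (sum(ctx) * 31 + len(ctx)) % VOCAB

def draft_model(ctx):
    # Cheap draft head: right most of the time, occasionally wrong.
    t = target_model(ctx)
    return t if rng.random() < 0.8 else (t + 1) % VOCAB

def generate(prompt, steps=16, k=4):
    ctx = list(prompt)
    for _ in range(steps):
        # Draft k tokens autoregressively with the cheap model.
        proposal = []
        for _ in range(k):
            proposal.append(draft_model(ctx + proposal))
        # Verify each position (in practice one batched forward pass) and
        # accept the longest prefix that matches the target model.
        accepted = 0
        for i, tok in enumerate(proposal):
            if target_model(ctx + proposal[:i]) == tok:
                accepted += 1
            else:
                break
        ctx += proposal[:accepted]
        if accepted < k:
            # The target model supplies the token at the first mismatch,
            # so every step still gains at least one correct token.
            ctx.append(target_model(ctx))
    return ctx

print(generate([1, 2, 3]))
```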
Conclusion
NVIDIA's commitment to continuous optimization across the entire stack—from hardware architecture to low-level software libraries—is delivering unprecedented value. By leveraging the full capabilities of the Blackwell platform, enterprises can serve more users with lower latency and higher efficiency.
As models like DeepSeek-R1 continue to push the boundaries of what's possible, platforms like GB200 NVL72 and HGX B200 will be the foundation for the next generation of intelligent AI applications.