Tencent HPC-Ops: SOTA Performance for LLM Inference

Tencent releases HPC-Ops, a production-grade high-performance operator library for LLM inference, delivering up to 2.22x speedup on NVIDIA H20 GPUs.

by HowAIWorks Team
LLM, Inference Optimization, Tencent, CUDA, AI Infrastructure, Open Source, NVIDIA H20, Machine Learning, High Performance Computing

Introduction

In the rapidly evolving world of Large Language Models (LLMs), the efficiency of the underlying operator library is often the "hidden" bottleneck that determines the speed and cost of inference. As models grow larger and deployment scales increase, the demand for "bare-metal" performance optimization becomes paramount.

Today, Tencent's Hunyuan AI Infra team has released HPC-Ops, a production-grade, high-performance operator library specifically tailored for LLM inference. Following the groundbreaking release of HunyuanImage 3.0-Instruct, HPC-Ops is now available to the open-source community, offering state-of-the-art (SOTA) performance and a modern approach to CUDA kernel development.

Why HPC-Ops Matters

Modern LLM inference involves complex mathematical operations—specifically Attention, Grouped GEMM, and Fused Mixture-of-Experts (MoE). Traditional generic libraries often fail to capture the specific architectural nuances of the latest hardware, leading to wasted compute cycles.

HPC-Ops addresses this by providing deeply optimized kernels tailored for the NVIDIA SM90 architecture (such as the H20 and H100 GPUs). For developers and AI engineers, this means:

  • Tangible Speedups: Observed performance gains of up to 2.22x per operator.
  • Production Ready: Not just a research prototype, but a library proven in Tencent's massive AI infrastructure.
  • Framework Compatibility: Designed for easy integration with vLLM and SGLang, the leading frameworks for LLM serving.

Key Features and Precision Support

HPC-Ops isn't just about raw speed; it's about flexibility across different deployment scenarios. One of its standout features is rich support for modern data types and quantization schemes.

  • Universal Precision: Native support for BF16 and FP8.
  • Advanced Quantization: Support for both per-tensor and block-wise scaling of FP8 weights, allowing developers to balance precision and performance according to their specific requirements (see the sketch after this list).
  • Modern CUDA Approach: The library serves as a "hands-on" tutorial for using CuTe and CUTLASS, demonstrating how to build SOTA kernels with just hundreds of lines of code rather than thousands.
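
To make the two FP8 scaling schemes concrete, here is a minimal PyTorch sketch (illustrative only, not the HPC-Ops API; all function names are ours) that quantizes a weight matrix once per tensor versus once per 128x128 block:

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for e4m3

def quantize_per_tensor(w):
    # One scale for the whole tensor: minimal metadata, coarsest granularity.
    scale = w.abs().max().clamp(min=1e-12) / FP8_MAX
    return (w / scale).to(torch.float8_e4m3fn), scale

def quantize_blockwise(w, block=128):
    # One scale per (block x block) tile: more metadata, tighter local error.
    n, k = w.shape
    tiles = w.reshape(n // block, block, k // block, block)
    scales = tiles.abs().amax(dim=(1, 3)).clamp(min=1e-12) / FP8_MAX
    q = (tiles / scales[:, None, :, None]).to(torch.float8_e4m3fn)
    return q.reshape(n, k), scales

w = torch.randn(4096, 4096)
w_pt, s_pt = quantize_per_tensor(w)  # scale: a single scalar
w_bw, s_bw = quantize_blockwise(w)   # scales: a (32, 32) grid of per-block factors

Block-wise scaling trades a small amount of extra scale metadata for better tolerance to outliers, which is why the choice is left to the deployment scenario.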

Breakdown of Optimized Kernels

The core of HPC-Ops lies in its specialized support for three critical LLM components:

1. Optimized Attention

HPC-Ops provides highly optimized kernels for both the Prefill and Decode phases of attention. This includes robust support for Paged Attention, which is essential for managing memory in production environments with varying sequence lengths.
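
To illustrate what Paged Attention manages, the sketch below (conceptual, not HPC-Ops code) shows the core bookkeeping: the KV cache lives in a shared pool of fixed-size blocks, and each sequence addresses its own blocks through a block table, so sequences of different lengths can share one memory pool without fragmentation.

import torch

# Shared pool of fixed-size KV blocks plus a per-sequence block table
block_size, num_blocks, num_heads, head_dim = 16, 256, 8, 128
key_pool = torch.randn(num_blocks, block_size, num_heads, head_dim)

# A 40-token sequence occupies 3 blocks; the last one is only partially filled.
seq_len = 40
block_table = torch.tensor([7, 42, 3])  # physical block ids assigned to this sequence

def gather_keys(pool, table, seq_len):
    # Fetch the sequence's blocks from the pool and trim to its true length.
    blocks = pool[table]                   # (num_seq_blocks, block_size, heads, dim)
    return blocks.flatten(0, 1)[:seq_len]  # (seq_len, heads, dim)

keys = gather_keys(key_pool, block_table, seq_len)  # torch.Size([40, 8, 128])

An optimized decode kernel performs this gather inside the attention computation itself rather than materializing the keys first, which is what makes the paged layout practical at serving scale.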

2. Grouped GEMM

For models that require handling multiple small matrix multiplications simultaneously, the library offers Quantized Grouped GEMM. This is particularly effective for FP8 weights, significantly reducing the memory bandwidth bottleneck.
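
Conceptually, a grouped GEMM runs one matrix multiply per group over a ragged batch of tokens, each group with its own weight and (in the quantized case) its own scale. The plain-PyTorch reference below shows that pattern under the assumption of per-tensor FP8 weight scales; a fused kernel produces the same result in a single launch without the Python loop.

import torch

def grouped_gemm_reference(x, w_fp8, scales, tokens_per_group):
    # x: (total_tokens, k) activations, ordered so each group's tokens are contiguous
    # w_fp8: (num_group, n, k) FP8 weights; scales: (num_group,) per-tensor scales
    outputs, start = [], 0
    for g, count in enumerate(tokens_per_group.tolist()):
        w_g = w_fp8[g].to(torch.float32) * scales[g]            # dequantize this group's weight
        outputs.append(x[start:start + count].float() @ w_g.T)  # (count, n)
        start += count
    return torch.cat(outputs)                                   # (total_tokens, n)

The exact scaling convention of the library kernel (for example, whether an activation scale is also applied) may differ; this loop is only meant to convey the shape semantics.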

3. Fused MoE

Mixture-of-Experts models are becoming the industry standard for high-performance LLMs (like Mixtral or DeepSeek-V3). HPC-Ops includes Quantized Fused MoE kernels that combine weight quantization with expert-level scaling, maximizing the throughput of MoE architectures.
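
As a mental model, the sketch below is an unfused PyTorch reference for the same computation: top-k routing, one GEMM per expert on dequantized FP8 weights with a per-expert scale, and a weighted combine. A fused kernel collapses these steps into a single launch; the code is purely illustrative and makes no claims about the HPC-Ops API.

import torch

def moe_reference(x, w_fp8, expert_scale, router_logits, top_k=2):
    # x: (tokens, d) float32, w_fp8: (experts, d_out, d), expert_scale: (experts,)
    weights, experts = torch.topk(router_logits.softmax(dim=-1), top_k, dim=-1)
    out = x.new_zeros(x.shape[0], w_fp8.shape[1])
    for e in range(w_fp8.shape[0]):
        token_idx, slot = (experts == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue  # no tokens routed to this expert
        w_e = w_fp8[e].to(torch.float32) * expert_scale[e]  # expert-level dequantization
        out[token_idx] += weights[token_idx, slot, None] * (x[token_idx] @ w_e.T)
    return out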

Getting Started with HPC-Ops

The library is designed to be developer-friendly. To get started, you'll need the following:

  • Hardware: NVIDIA SM90 architecture (e.g., H20/H100).
  • Software: Python 3.8+, a C++17-capable compiler, and CUDA Toolkit 12.8+.

Installation from Source

Building the wheel is straightforward:

git clone https://github.com/Tencent/hpc-ops.git
cd hpc-ops
make wheel
python3 -m pip install dist/*.whl
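
As a quick sanity check after installation (a generic verification step, not from the official docs), import the package and confirm the GPU reports the SM90 compute capability the kernels target:

import torch
import hpc  # the module name used in the usage example below

# SM90 GPUs such as the H100 and H20 report compute capability (9, 0)
print(torch.cuda.get_device_capability())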

Basic Usage Example (FP8 GroupGEMM)

Integrating a high-performance kernel into your Python code is as simple as a few lines:

import torch
import hpc

# Set up dummy FP8 inputs: 1024 tokens split across 8 groups, a
# (num_group, n, k) weight tensor, and one per-tensor scale per group
num_tokens = 1024
num_group, n, k = 8, 4096, 4096
x = torch.randn((num_tokens, k), device="cuda").to(torch.float8_e4m3fn)
w = torch.randn((num_group, n, k), device="cuda").to(torch.float8_e4m3fn)
scale = torch.full((num_group,), 1.0, device="cuda")

# Number of tokens routed to each group (even split for this demo; the dtype is an assumption)
num_tokens_per_group = torch.full((num_group,), num_tokens // num_group,
                                  dtype=torch.int32, device="cuda")

# Run the optimized per-tensor FP8 grouped GEMM kernel
output = hpc.group_gemm_pertensor_fp8(x, w, num_tokens_per_group, scale)

The Roadmap Ahead

Tencent has shared an ambitious roadmap for HPC-Ops, indicating that this is just the beginning:

  • Sparse Attention Kernels: Aimed at long-context LLMs to boost throughput for memory-bound workloads.
  • Extended Quantization: Support for 4-bit/8-bit mixed precision scenarios.
  • Boundary-Breaking Kernels: Overlapping computation with inter-GPU communication to minimize overhead in large-scale distributed inference.

Conclusion

The release of HPC-Ops represents a significant contribution to the AI infrastructure ecosystem. By making high-performance, production-proven CUDA kernels accessible, Tencent enables developers to squeeze every ounce of performance out of modern NVIDIA hardware.

Whether you are building a new inference framework or looking to optimize an existing production pipeline, HPC-Ops provides the tools and the technical roadmap to ensure your LLM inference is as fast and efficient as possible.

We encourage developers to explore the repository, star it to follow its progress, and contribute optimizations that help refine this toolkit for the broader community.

Frequently Asked Questions

What is HPC-Ops?
HPC-Ops is a high-performance operator library for Large Language Model (LLM) inference, developed by the Tencent Hunyuan AI Infra team and used in their large-scale production environments.

Which hardware does HPC-Ops require?
HPC-Ops is specifically optimized for the NVIDIA SM90 architecture (such as the H20 and H100) and requires CUDA 12.8 or higher.

How much of a speedup does HPC-Ops deliver?
HPC-Ops delivers SOTA performance with up to 2.22x speedup on NVIDIA H20 GPUs compared to existing baseline operators.

Can HPC-Ops be integrated into existing inference frameworks?
Yes, it features a clean API designed for seamless integration into popular inference frameworks like vLLM and SGLang.

Which data types and quantization schemes are supported?
HPC-Ops provides native support for multiple data types, including BF16 and FP8 with different quantization schemes (per-tensor and block-wise).
