Tencent Hunyuan Introduces UniRL: Universal RL Post-Training for Multimodal Models

Tencent Hunyuan releases UniRL, a unified infrastructure for reinforcement learning post-training across diverse model families including LLMs, VLMs, and diffusion models.

by Evgeny Ivanov

Introduction

Tencent Hunyuan has released UniRL, an infrastructure for reinforcement learning (RL) post-training of multimodal models. The release aims to provide a single, generalized RL loop that serves several model families: Large Language Models (LLMs), Vision-Language Models (VLMs), diffusion models, flow-matching models, and unified multimodal systems.

A Universal RL Pipeline

The standard RL pipeline typically follows a familiar sequence:

  • Generate
  • Score
  • Advantage
  • Update
  • Sync
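To make the five stages concrete, here is a minimal sketch of one iteration on a toy problem. Everything in it is an assumption made for illustration: `ToyPolicy`, `rl_iteration`, the squared-distance reward, and the mean-reward baseline are hypothetical and are not part of UniRL's API.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyPolicy:
    """Hypothetical one-parameter policy: samples numbers around `mean`."""
    mean: float = 0.0
    version: int = 0

    def sample(self, rng: random.Random) -> float:
        return rng.gauss(self.mean, 1.0)


def rl_iteration(trainer: ToyPolicy, rollout: ToyPolicy, rng: random.Random,
                 batch: int = 64, lr: float = 0.1, target: float = 3.0) -> float:
    # 1. Generate: the rollout copy of the policy produces samples.
    samples = [rollout.sample(rng) for _ in range(batch)]
    # 2. Score: a reward function rates each sample (closer to target is better).
    rewards = [-(s - target) ** 2 for s in samples]
    # 3. Advantage: subtract the batch-mean reward as a simple baseline.
    baseline = sum(rewards) / batch
    advantages = [r - baseline for r in rewards]
    # 4. Update: REINFORCE-style step; d/d(mean) log N(s; mean, 1) = s - mean.
    grad = sum(a * (s - rollout.mean) for a, s in zip(advantages, samples)) / batch
    trainer.mean += lr * grad
    trainer.version += 1
    # 5. Sync: copy the updated weights into the rollout copy.
    rollout.mean, rollout.version = trainer.mean, trainer.version
    return baseline


if __name__ == "__main__":
    rng = random.Random(0)
    trainer, rollout = ToyPolicy(), ToyPolicy()
    for _ in range(200):
        rl_iteration(trainer, rollout, rng)
    print(f"learned mean: {trainer.mean:.2f} (target 3.0)")
```

In a real system the trainer and the rollout engine usually hold separate copies of the weights, which is why the Sync stage exists as its own step.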

UniRL aims to make this cycle model-agnostic. The framework treats the model and the algorithm as two independent axes, so developers can pair different model families with different RL algorithms instead of relying on rigid, hard-coded scenarios.
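One way to picture the two axes is as a pair of small interfaces that a single training step accepts in any combination. The sketch below assumes hypothetical names (`ModelAdapter`, `Algorithm`, `train_step`, and the stand-in classes); it does not reflect UniRL's actual interfaces.

```python
from __future__ import annotations

from typing import Callable, Protocol, Sequence


class ModelAdapter(Protocol):
    """Model axis: how one model family generates samples and takes an update."""
    def generate(self, prompts: Sequence[str]) -> list[str]: ...
    def update(self, samples: Sequence[str], advantages: Sequence[float]) -> None: ...


class Algorithm(Protocol):
    """Algorithm axis: how rewards are turned into per-sample advantages."""
    def advantages(self, rewards: Sequence[float]) -> list[float]: ...


class UppercaseModel:
    """Stand-in for one model family."""
    def __init__(self) -> None:
        self.updates = 0

    def generate(self, prompts: Sequence[str]) -> list[str]:
        return [p.upper() for p in prompts]

    def update(self, samples: Sequence[str], advantages: Sequence[float]) -> None:
        self.updates += 1  # a real adapter would take a gradient step here


class ReverseModel(UppercaseModel):
    """Stand-in for a second model family with different generation."""
    def generate(self, prompts: Sequence[str]) -> list[str]:
        return [p[::-1] for p in prompts]


class MeanBaseline:
    """Advantage = reward minus the batch-mean reward."""
    def advantages(self, rewards: Sequence[float]) -> list[float]:
        mean = sum(rewards) / len(rewards)
        return [r - mean for r in rewards]


class SignOnly:
    """Advantage = sign of the mean-centered reward."""
    def advantages(self, rewards: Sequence[float]) -> list[float]:
        mean = sum(rewards) / len(rewards)
        return [float((r > mean) - (r < mean)) for r in rewards]


def train_step(model: ModelAdapter, algo: Algorithm,
               reward_fn: Callable[[str], float], prompts: Sequence[str]) -> list[float]:
    """One generic step that works for any (model, algorithm) pair."""
    samples = model.generate(prompts)
    rewards = [reward_fn(s) for s in samples]
    adv = algo.advantages(rewards)
    model.update(samples, adv)
    return adv


if __name__ == "__main__":
    for model in (UppercaseModel(), ReverseModel()):
        for algo in (MeanBaseline(), SignOnly()):
            print(type(model).__name__, type(algo).__name__,
                  train_step(model, algo, len, ["a", "bbb"]))
```

Because `train_step` depends only on the two interfaces, adding a new model family or a new algorithm does not require touching the loop itself.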

Broad Model Coverage

UniRL covers multiple modalities and task types, supporting:

  • Text-to-image
  • Text/image-to-video
  • Vision-language tasks
  • Text-only LLMs and VLMs
  • LLM-to-diffusion prompt enhancers
  • Mixed autoregressive and diffusion generation (such as Hunyuan-Image 3 and Bagel)

Pluggable Engines and Scalability

The framework features pluggable rollout engines managed through a unified typed contract. Currently supported engines include train-side, SGLang, and vLLM-Omni. For scalability, UniRL uses FSDP2 sharding and offers multiple deployment modes, all of which can be switched from a single configuration file.
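As a rough illustration of what a typed contract with config-selected engines could look like, consider the sketch below. The request and result types, the registry keys, and the config layout are all hypothetical, and the engines are stubs that do not call SGLang or vLLM-Omni.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class RolloutRequest:
    """Hypothetical typed input shared by every engine."""
    prompts: tuple[str, ...]
    samples_per_prompt: int = 1


@dataclass(frozen=True)
class RolloutResult:
    """Hypothetical typed output shared by every engine."""
    engine: str
    outputs: tuple[str, ...]


class RolloutEngine(Protocol):
    def rollout(self, request: RolloutRequest) -> RolloutResult: ...


class StubEngine:
    """Placeholder engine; a real one would wrap an inference backend."""
    def __init__(self, name: str) -> None:
        self.name = name

    def rollout(self, request: RolloutRequest) -> RolloutResult:
        outputs = tuple(
            f"[{self.name}] {prompt} #{i}"
            for prompt in request.prompts
            for i in range(request.samples_per_prompt)
        )
        return RolloutResult(engine=self.name, outputs=outputs)


# Registry keys are illustrative, not UniRL's real configuration values.
ENGINES: dict[str, Callable[[], RolloutEngine]] = {
    "train_side": lambda: StubEngine("train_side"),
    "sglang": lambda: StubEngine("sglang"),
    "vllm_omni": lambda: StubEngine("vllm_omni"),
}


def build_engine(config: dict) -> RolloutEngine:
    """Pick the engine named in the config; the training loop never changes."""
    name = config["rollout"]["engine"]
    try:
        return ENGINES[name]()
    except KeyError:
        raise ValueError(
            f"unknown rollout engine {name!r}; expected one of {sorted(ENGINES)}"
        ) from None


if __name__ == "__main__":
    engine = build_engine({"rollout": {"engine": "sglang"}})
    print(engine.rollout(RolloutRequest(prompts=("a cat",), samples_per_prompt=2)))
```

Switching engines then amounts to changing one value in the configuration, since callers only see `RolloutRequest` and `RolloutResult`.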

Custom Tencent Algorithms

In addition to the core infrastructure, Tencent has integrated two of its own algorithms into UniRL:

  • Flow-DPPO: A policy optimization method tailored to flow and diffusion models, featuring trust-region masks based on exact divergence.
  • DRPO: A reinforcement learning approach designed for LLMs that uses a smoothed advantage-weighted quadratic regularizer.
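The article does not give the formulas behind either algorithm. Purely to illustrate the general idea of a divergence-based trust-region mask, the sketch below assumes that each sampling step is a Gaussian with a shared standard deviation, so the KL divergence between the old and new policy has a closed form. The function names and the threshold are hypothetical, and this is not Tencent's Flow-DPPO implementation.

```python
from typing import Sequence


def gaussian_kl(mu_old: Sequence[float], mu_new: Sequence[float], sigma: float) -> float:
    """Exact KL between N(mu_old, sigma^2 I) and N(mu_new, sigma^2 I).

    With equal isotropic covariances this reduces to
    ||mu_old - mu_new||^2 / (2 * sigma^2).
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return sq_dist / (2.0 * sigma ** 2)


def trust_region_mask(old_means: Sequence[Sequence[float]],
                      new_means: Sequence[Sequence[float]],
                      sigma: float, max_kl: float) -> list[float]:
    """1.0 for samples still inside the trust region, 0.0 for the rest."""
    return [
        1.0 if gaussian_kl(old, new, sigma) <= max_kl else 0.0
        for old, new in zip(old_means, new_means)
    ]


def masked_mean_loss(losses: Sequence[float], mask: Sequence[float]) -> float:
    """Average only the losses whose samples passed the mask."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(losses, mask)) / kept


if __name__ == "__main__":
    old = [[0.0, 0.0], [0.0, 0.0]]
    new = [[0.1, 0.0], [2.0, 0.0]]  # second sample has drifted far from the old policy
    mask = trust_region_mask(old, new, sigma=1.0, max_kl=0.5)
    print(mask, masked_mean_loss([3.0, 100.0], mask))
```

The point of masking instead of clipping is that samples whose policy has moved too far contribute no gradient at all, while the remaining samples are left unchanged.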

Conclusion

UniRL is a step toward a standardized post-training stack. It provides a common foundation for models that write, see, and generate content, while supporting different types of rollout engines.
