Tencent Hunyuan's UniRL: Universal RL for Multimodal Models

Introduction

Tencent Hunyuan has released UniRL, an infrastructure framework for reinforcement learning (RL) post-training of multimodal models. The release aims to provide a single, generalized RL loop that serves multiple model families: Large Language Models (LLMs), Vision-Language Models (VLMs), diffusion models, flow-matching models, and unified multimodal systems.

A Universal RL Pipeline

The standard RL pipeline typically follows a familiar sequence:

Generate
Score
Advantage
Update
Sync

UniRL aims to make this cycle model-agnostic. The framework treats the model and the algorithm as two independent axes, so developers can pair any supported model family with any supported RL algorithm instead of relying on hard-coded combinations.
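The five-stage cycle and the model/algorithm split can be sketched in a few lines of Python. This is a minimal illustration, not UniRL's actual API: the names `Policy`, `Algorithm`, `Rollout`, `MeanBaseline`, `EchoPolicy`, and `rl_step` are all hypothetical, and the "model" is a toy that never learns anything.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Sequence


@dataclass
class Rollout:
    """One sampled output plus the values the later stages fill in."""
    prompt: str
    output: str
    reward: float = 0.0
    advantage: float = 0.0


class Policy(Protocol):
    """Model axis (hypothetical interface): can sample, be updated, and sync weights."""
    def generate(self, prompts: Sequence[str]) -> List[Rollout]: ...
    def update(self, rollouts: Sequence[Rollout]) -> None: ...
    def sync(self) -> None: ...


class Algorithm(Protocol):
    """Algorithm axis (hypothetical interface): turns rewards into advantages."""
    def advantages(self, rewards: Sequence[float]) -> List[float]: ...


class MeanBaseline:
    """Toy algorithm: advantage = reward minus the batch mean reward."""
    def advantages(self, rewards: Sequence[float]) -> List[float]:
        mean = sum(rewards) / len(rewards)
        return [r - mean for r in rewards]


class EchoPolicy:
    """Toy model: upper-cases the prompt and only counts updates and syncs."""
    def __init__(self) -> None:
        self.updates = 0
        self.syncs = 0

    def generate(self, prompts: Sequence[str]) -> List[Rollout]:
        return [Rollout(prompt=p, output=p.upper()) for p in prompts]

    def update(self, rollouts: Sequence[Rollout]) -> None:
        self.updates += 1

    def sync(self) -> None:
        self.syncs += 1


def rl_step(
    policy: Policy,
    algorithm: Algorithm,
    reward_fn: Callable[[Rollout], float],
    prompts: Sequence[str],
) -> List[Rollout]:
    """Run one Generate -> Score -> Advantage -> Update -> Sync cycle."""
    rollouts = policy.generate(prompts)                          # Generate
    for r in rollouts:                                           # Score
        r.reward = reward_fn(r)
    advs = algorithm.advantages([r.reward for r in rollouts])    # Advantage
    for r, a in zip(rollouts, advs):
        r.advantage = a
    policy.update(rollouts)                                      # Update
    policy.sync()                                                # Sync
    return rollouts
```

Because `rl_step` only depends on the two interfaces, swapping `MeanBaseline` for another algorithm or `EchoPolicy` for another model family would not require touching the loop itself, which is the point of keeping the two axes independent.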

Broad Model Coverage

UniRL covers a range of modalities and task types:

Text-to-image
Text/image-to-video
Vision-language tasks
Text-only LLMs and VLMs
LLM-to-diffusion prompt enhancers
Mixed autoregressive and diffusion generation (such as Hunyuan-Image 3 and Bagel)

Pluggable Engines and Scalability

The framework features pluggable rollout engines managed through a unified typed contract. Currently supported engines are a train-side engine, SGLang, and vLLM-Omni. To scale, UniRL uses FSDP2 sharding and offers multiple deployment modes, all of which can be selected from a single configuration file.
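One way to picture a typed contract with config-driven engine selection is the sketch below. It is an assumption-laden illustration rather than UniRL's real interface: `RolloutRequest`, `RolloutResult`, `RolloutEngine`, `StubEngine`, `build_engine`, and the `rollout_engine` config key are all hypothetical, and the stub does not call SGLang or vLLM-Omni at all.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol, Sequence


@dataclass(frozen=True)
class RolloutRequest:
    """Typed input shared by every engine (hypothetical schema)."""
    prompts: Sequence[str]
    samples_per_prompt: int = 1


@dataclass(frozen=True)
class RolloutResult:
    """Typed output shared by every engine (hypothetical schema)."""
    prompt: str
    output: str
    engine: str


class RolloutEngine(Protocol):
    """The contract: any engine maps a request to a list of results."""
    def rollout(self, request: RolloutRequest) -> List[RolloutResult]: ...


class StubEngine:
    """Stand-in for a real backend; it fabricates outputs instead of running a model."""
    def __init__(self, name: str) -> None:
        self.name = name

    def rollout(self, request: RolloutRequest) -> List[RolloutResult]:
        return [
            RolloutResult(prompt=p, output=f"<{self.name} sample {i}>", engine=self.name)
            for p in request.prompts
            for i in range(request.samples_per_prompt)
        ]


# Hypothetical registry keys; the real configuration values may be spelled differently.
ENGINES: Dict[str, Callable[[], RolloutEngine]] = {
    "train_side": lambda: StubEngine("train_side"),
    "sglang": lambda: StubEngine("sglang"),
    "vllm_omni": lambda: StubEngine("vllm_omni"),
}


def build_engine(config: dict) -> RolloutEngine:
    """Pick the engine named in the config, failing loudly on unknown names."""
    name = config["rollout_engine"]
    if name not in ENGINES:
        raise ValueError(f"unknown rollout engine: {name!r}")
    return ENGINES[name]()


# Switching backends is a one-line config change; the calling code stays the same.
config = {"rollout_engine": "sglang"}
engine = build_engine(config)
results = engine.rollout(RolloutRequest(prompts=["a cat"], samples_per_prompt=2))
```

The training loop only ever sees `RolloutRequest` and `RolloutResult`, so adding a backend means registering one more factory rather than editing the loop.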

Custom Tencent Algorithms

In addition to the core infrastructure, Tencent has integrated two of its own algorithms into UniRL:

Flow-DPPO: A policy optimization method specifically tailored for flow and diffusion models, featuring trust-region masks based on exact divergence.
DRPO: A reinforcement learning approach designed for LLMs, utilizing a smoothed advantage-weighted quadratic regularizer.
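The article does not describe how Flow-DPPO computes its divergence or applies its mask, so the following is only a generic illustration of the idea of a trust-region mask driven by an exact divergence, not the Flow-DPPO algorithm. It assumes each sampling step of the old and new policy is a Gaussian with the same isotropic standard deviation `sigma`, in which case the KL divergence has the closed form ||mu_old - mu_new||^2 / (2 sigma^2). The function names and the `max_kl` threshold are hypothetical.

```python
from typing import List, Sequence


def gaussian_kl(mu_old: Sequence[float], mu_new: Sequence[float], sigma: float) -> float:
    """Closed-form KL between two Gaussians with equal isotropic std `sigma`."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return sq_dist / (2.0 * sigma ** 2)


def trust_region_mask(
    mu_old_batch: Sequence[Sequence[float]],
    mu_new_batch: Sequence[Sequence[float]],
    sigma: float,
    max_kl: float,
) -> List[float]:
    """Return 1.0 for samples still inside the trust region, 0.0 otherwise."""
    return [
        1.0 if gaussian_kl(old, new, sigma) <= max_kl else 0.0
        for old, new in zip(mu_old_batch, mu_new_batch)
    ]


def masked_mean(values: Sequence[float], mask: Sequence[float]) -> float:
    """Average only the unmasked entries; masked samples contribute nothing."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(v * m for v, m in zip(values, mask)) / kept


# Assumed toy data: the first sample moved slightly, the second moved far.
mu_old = [[0.0, 0.0], [0.0, 0.0]]
mu_new = [[0.1, 0.0], [1.0, 1.0]]
mask = trust_region_mask(mu_old, mu_new, sigma=1.0, max_kl=0.05)
loss = masked_mean([2.0, 10.0], mask)  # per-sample losses are placeholders
```

In this toy batch the second sample has drifted past the threshold, so it is dropped from the averaged loss and cannot push the policy further from the old one.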

Conclusion

UniRL is a step toward a standardized post-training stack. It provides a common foundation for models that write, see, and generate content, across different rollout engines.

Resources:

Code: GitHub - Tencent-Hunyuan/UniRL
Paper: arXiv:2606.09821
