Introduction
In the rapidly evolving landscape of Multimodal Large Language Models (MLLMs), many models excel at describing images but struggle with fine-grained visual tasks like precise object detection or depth estimation. This limitation often stems from a "text-dominant optimization bias," where visual signals are treated only as passive inputs.
Tencent Youtu Lab has addressed this challenge with the announcement of Youtu-VL, a versatile 4B-parameter model that introduces a fundamental shift in how vision and language are supervised. By pioneering the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, Youtu-VL treats visual details not just as context, but as targets to be predicted, enabling a single model to master both conversational AI and dense visual perception.
Key Innovation: Vision as Target
The core breakthrough of Youtu-VL is the shift from "vision-as-input" to "vision-as-target". Traditional VLMs tend to discard fine-grained visual details that are not needed for the text they generate, because text is their only training target.
Youtu-VL solves this through three mechanisms (a minimal code sketch follows the list):
- Unified Multimodal Vocabulary: Expanding the text lexicon to include a synergistic visual codebook of 150,000 tokens.
- Autoregressive Vision Reconstruction: During training, the model is tasked with reconstructing discrete visual tokens. This explicitly preserves dense visual information that is usually lost in standard VLM architectures.
- Synergistic Vision Tokenizer: A novel tokenizer that fuses semantic features from SigLIP-2 with geometric features from DINOv3, ensuring the model understands both "what" is in the image and "where" it is located.
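To make "vision-as-target" concrete, here is a minimal, illustrative PyTorch sketch of unified autoregressive supervision. Every name and size in it is an assumption for illustration (only the 150,000-entry visual codebook figure comes from the announcement above); it is not Youtu-VL's actual implementation.

# Illustrative sketch of vision-as-target supervision over a unified vocabulary.
# Sizes and helper names are hypothetical; only the 150,000-entry visual codebook
# is taken from the announcement.
import torch
import torch.nn.functional as F

TEXT_VOCAB_SIZE = 152_064            # assumed text lexicon size (illustrative)
VISUAL_CODEBOOK_SIZE = 150_000       # visual codebook size reported for Youtu-VL
UNIFIED_VOCAB_SIZE = TEXT_VOCAB_SIZE + VISUAL_CODEBOOK_SIZE

def quantize_to_codebook(fused_features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each fused patch feature (e.g. SigLIP-2 semantics concatenated with DINOv3
    geometry) to the index of its nearest codebook entry, yielding discrete visual tokens."""
    dists = torch.cdist(fused_features, codebook)   # (num_patches, codebook_size)
    return dists.argmin(dim=-1)                     # (num_patches,)

def build_targets(text_ids: torch.Tensor, visual_token_ids: torch.Tensor) -> torch.Tensor:
    """Offset visual tokens past the text lexicon so both modalities share one id space,
    then concatenate them into a single autoregressive target sequence."""
    return torch.cat([text_ids, visual_token_ids + TEXT_VOCAB_SIZE], dim=-1)

def unified_autoregressive_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Ordinary next-token cross-entropy, but because the targets include visual tokens,
    the model is penalized for dropping dense visual information."""
    return F.cross_entropy(
        logits[:-1].reshape(-1, UNIFIED_VOCAB_SIZE),
        targets[1:].reshape(-1),
    )

Framing supervision this way is also what allows a single autoregressive decoder to handle dense prediction without bolt-on heads, as described in the next section.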
Architecture and Performance
Despite its compact size of 4 billion parameters, Youtu-VL achieves state-of-the-art (SOTA) or highly competitive results across a wide array of benchmarks. It functions as a "generalist visual agent" that performs complex vision-centric tasks without any task-specific modules or heads.
Vision-Centric Benchmarks
- Visual Grounding: 91.8% mAP on RefCOCO.
- Object Detection: 47.1% mAP on COCO 2017.
- Semantic Segmentation: 54.2 mIoU on ADE20K.
- Depth Estimation: 90.4% δ1 accuracy on NYUv2.
- Human Pose Estimation: 89.1% PCKh@0.5 on MPII.
Multimodal Reasoning
Beyond dense vision, Youtu-VL remains a powerful conversational partner. It scores highly on general multimodal benchmarks such as MMBench (83.9) and DocVQA (94.4), showcasing its strength in OCR and complex reasoning.
Open Source and Accessibility
Tencent has made Youtu-VL highly accessible to the developer community by releasing it on platforms like ModelScope and Hugging Face. Notably, the team has provided GGUF quantized versions, allowing researchers and enthusiasts to run this powerful 4B model on local machines using llama.cpp.
Quickstart with llama.cpp
To run the Q8_0 quantized version locally, you can use the following command:
llama-server -hf tencent/Youtu-VL-4B-Instruct-GGUF:Q8_0 \
  --port 8080 \
  --image-max-tokens 2048 \
  --temp 0.1 \
  -n 12280
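Once the server is running, you can query it through llama-server's OpenAI-compatible chat endpoint. The Python snippet below is a rough sketch rather than an official client: the image file, the prompt, and the assumption that your llama.cpp build and the GGUF release include the multimodal projector needed for image input are things to verify on your side.

# Rough sketch: send an image plus a prompt to the local llama-server started above.
# Assumes the server is on port 8080 and the build supports multimodal input.
import base64
import requests

with open("street_scene.jpg", "rb") as f:   # any local test image (placeholder path)
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Detect every car in this image and return bounding boxes."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])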
Conclusion
Youtu-VL represents a significant step towards truly unified artificial intelligence. By breaking the silos between language modeling and computer vision tasks like segmentation and detection, Tencent Youtu Lab has provided a robust foundation for the next generation of visual agents. Its efficiency and performance make it a compelling choice for both edge deployment and large-scale multimodal research.