Youtu-VL: Unified Vision-Language Supervision

Tencent Youtu Lab introduces Youtu-VL, a 4B parameter model that pioneers the 'vision-as-target' paradigm for advanced visual perception.

by HowAIWorks Team
Computer Vision · Multimodal · Tencent · VLM · Open Source · Deep Learning · GGUF · ModelScope

Introduction

In the rapidly evolving landscape of Multimodal Large Language Models (MLLMs), many models excel at describing images but struggle with fine-grained visual tasks like precise object detection or depth estimation. This limitation often stems from a "text-dominant optimization bias," where visual signals are treated only as passive inputs.

Tencent Youtu Lab has addressed this challenge with Youtu-VL, a versatile 4B-parameter model that introduces a fundamental shift in how vision and language are supervised. By pioneering the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, Youtu-VL treats visual details not just as context but as targets to be predicted, enabling a single model to master both conversational AI and dense visual perception.

Key Innovation: Vision as Target

The core breakthrough of Youtu-VL is the shift from "vision-as-input" to "vision-as-target". Traditional VLMs tend to discard fine-grained visual details during training because they are optimized only to generate text.

Youtu-VL solves this by:

  • Unified Multimodal Vocabulary: Expanding the text lexicon to include a synergistic visual codebook of 150,000 tokens.
  • Autoregressive Vision Reconstruction: During training, the model is tasked with reconstructing discrete visual tokens. This explicitly preserves dense visual information that is usually lost in standard VLM architectures (see the sketch after this list).
  • Synergistic Vision Tokenizer: A novel tokenizer that fuses semantic features from SigLIP-2 with geometric features from DINOv3, ensuring the model understands both "what" is in the image and "where" it is located.
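
To make "vision-as-target" concrete, here is a minimal PyTorch-style sketch of what a unified autoregressive objective over text and visual tokens can look like. It is an illustration under stated assumptions, not Youtu-VL's actual training code: the vocabulary sizes, the visual_weight knob, and the unified_ar_loss helper are hypothetical, and the synergistic tokenizer (SigLIP-2 + DINOv3) is abstracted away into pre-computed discrete visual token ids.

# Hypothetical sketch of unified autoregressive supervision over text + visual tokens.
# Assumptions (not from the Youtu-VL release): vocabulary layout, sizes, loss weighting.
import torch
import torch.nn.functional as F

TEXT_VOCAB = 32_000          # ordinary text tokens (assumed size)
VISUAL_CODEBOOK = 150_000    # discrete visual tokens appended to the vocabulary
UNIFIED_VOCAB = TEXT_VOCAB + VISUAL_CODEBOOK


def unified_ar_loss(logits: torch.Tensor, targets: torch.Tensor,
                    visual_mask: torch.Tensor, visual_weight: float = 1.0) -> torch.Tensor:
    """Next-token cross-entropy where visual tokens are targets, not just inputs.

    logits:      (batch, seq_len, UNIFIED_VOCAB) model predictions
    targets:     (batch, seq_len) token ids in the unified vocabulary
    visual_mask: (batch, seq_len) True where the target token is a visual token
    """
    # Standard causal shift: predict token t+1 from positions <= t.
    logits = logits[:, :-1].reshape(-1, UNIFIED_VOCAB)
    targets = targets[:, 1:].reshape(-1)
    mask = visual_mask[:, 1:].reshape(-1)

    per_token = F.cross_entropy(logits, targets, reduction="none")
    # Text and visual tokens share one autoregressive objective;
    # visual_weight lets the two streams be balanced if desired.
    weights = torch.where(mask, torch.full_like(per_token, visual_weight),
                          torch.ones_like(per_token))
    return (per_token * weights).mean()


# Toy usage: the first half of the target sequence is text, the second half is
# discrete visual tokens reconstructing the image.
batch, seq = 2, 16
logits = torch.randn(batch, seq, UNIFIED_VOCAB)
targets = torch.cat([
    torch.randint(0, TEXT_VOCAB, (batch, seq // 2)),
    torch.randint(TEXT_VOCAB, UNIFIED_VOCAB, (batch, seq // 2)),
], dim=1)
visual_mask = targets >= TEXT_VOCAB
print(unified_ar_loss(logits, targets, visual_mask).item())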

Architecture and Performance

Despite its compact size of 4 billion parameters, Youtu-VL achieves State-of-the-Art (SOTA) or highly competitive results across a wide array of benchmarks. It functions as a "generalist visual agent" that can perform complex vision-centric tasks without any specialized task-specific modules or heads.

Vision-Centric Benchmarks

  • Visual Grounding: Achieved 91.8% mAP on RefCOCO.
  • Object Detection: 47.1% mAP on COCO 2017.
  • Semantic Segmentation: 54.2 mIoU on ADE20K.
  • Depth Estimation: 90.4% δ1 on NYUv2.
  • Human Pose Estimation: 89.1% PCKh@0.5 on MPII.

Multimodal Reasoning

Beyond dense vision, Youtu-VL remains a powerful conversational partner. It posts strong scores on general multimodal benchmarks such as MMBench (83.9) and DocVQA (94.4), showcasing its strength in OCR and complex reasoning.

Open Source and Accessibility

Tencent has made Youtu-VL highly accessible to the developer community by releasing it on platforms like ModelScope and Hugging Face. Notably, the team has provided GGUF quantized versions, allowing researchers and enthusiasts to run this powerful 4B model on local machines using llama.cpp.

Quickstart with llama.cpp

To run the Q8_0 quantized version locally, you can use the following command:

llama-server -hf tencent/Youtu-VL-4B-Instruct-GGUF:Q8_0 \
  --port 8080 \
  --image-max-tokens 2048 \
  --temp 0.1 \
  -n 12280
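
Once the server is running, it exposes llama.cpp's OpenAI-compatible HTTP API on the chosen port. Below is a hedged Python example of sending an image and a question to that endpoint; the /v1/chat/completions path and the base64 image_url message format follow the llama.cpp server's chat-completions convention, while example.jpg and the prompt text are placeholders.

# Query the local llama-server via its OpenAI-compatible chat-completions endpoint.
# Assumptions: server running on localhost:8080 as started above, with multimodal
# support enabled; "example.jpg" and the prompt are placeholders.
import base64
import json
import urllib.request

with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image and list the objects you see."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
    "temperature": 0.1,
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
print(reply["choices"][0]["message"]["content"])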

Conclusion

Youtu-VL represents a significant step towards truly unified artificial intelligence. By breaking down the silos between language modeling and computer vision tasks like segmentation and detection, Tencent Youtu Lab has provided a robust foundation for the next generation of visual agents. Its efficiency and performance make it a compelling choice for both edge deployment and large-scale multimodal research.

Frequently Asked Questions

What is Youtu-VL?
Youtu-VL is a lightweight yet robust 4B-parameter Vision-Language Model developed by Tencent Youtu Lab, designed to bridge the gap between language understanding and fine-grained visual perception.

What is VLUAS?
VLUAS stands for Vision-Language Unified Autoregressive Supervision. It shifts the training objective from treating vision as a passive input to treating it as a supervised target, explicitly reconstructing visual tokens alongside text.

Which vision tasks can Youtu-VL handle?
Without task-specific modules, it can handle object detection, semantic segmentation, depth estimation, visual grounding, and human pose estimation.

Can Youtu-VL run locally?
Yes, Tencent has released GGUF quantized versions (Q8_0, F16) that can be run with llama.cpp on consumer hardware.
