Xiaomi-Robotics-0: Scaling VLA Models for Real-Time Robot Control

Xiaomi-Robotics-0 is a breakthrough Vision-Language-Action (VLA) model trained on 200M timesteps, enabling complex tasks like Lego disassembly and adaptive robot behavior.

by HowAIWorks Team
Tags: Xiaomi, Robotics, VLA Model, Vision-Language-Action, AI Research, Robot Control, Lego Disassembly, Machine Learning, Real-time AI

Introduction

The field of robotics is undergoing a fundamental shift from specialized, task-specific controllers to general-purpose Vision-Language-Action (VLA) models. These models aim to combine the reasoning power of large language models with the visual perception of vision models and the physical dexterity required for robotic control. In a significant leap forward, the Xiaomi Robotics team has announced Xiaomi-Robotics-0, a massive VLA model designed for real-time interaction and complex physical manipulation.

Unlike previous iterations that often struggled with the trade-off between reasoning capacity and real-time execution, Xiaomi-Robotics-0 is optimized for low-latency rollouts. By training on an unprecedented scale of both robotic and general vision-language data, the model demonstrates a level of responsiveness and adaptability that brings us closer to robots that can operate alongside humans in dynamic environments. Similar to advancements we've seen in NVIDIA PersonaPlex for audio, Xiaomi-Robotics-0 pushes the boundaries of integrated multimodal intelligence.

Large-Scale VLA Training

At the core of Xiaomi-Robotics-0 is a training regimen of massive proportions. To build a model that understands both "what" a task is and "how" to physically execute it, the team leveraged two distinct types of data:

1. Robotic Trajectory Data: 200 Million Timesteps

The model was trained on approximately 200 million timesteps of robotic trajectories. This data provides the "muscle memory" for the robot, allowing it to learn the fine-grained control signals required for smooth and precise movements. The dataset includes specialized tasks that are notoriously difficult for AI:

  • Lego Disassembly: Over 338 hours of data focused on the delicate task of identifying and removing individual Lego bricks.
  • Towel Folding: 400 hours of data capturing the complex deformable nature of textiles.
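
To make the notion of a "timestep" concrete, here is a minimal sketch of what one record in a trajectory dataset of this kind might look like. The field names, shapes, and the 30 Hz control rate are illustrative assumptions on our part, not Xiaomi's published data schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrajectoryTimestep:
    """One illustrative timestep of a robot trajectory (hypothetical schema)."""
    wrist_image: np.ndarray      # e.g. (224, 224, 3) RGB from a gripper-mounted camera
    scene_image: np.ndarray      # e.g. (224, 224, 3) RGB from an external camera
    proprioception: np.ndarray   # joint angles and gripper state, e.g. shape (8,)
    instruction: str             # natural-language description of the task
    action: np.ndarray           # control target for the next step, e.g. shape (7,)

# Assuming a 30 Hz control rate, the 338-hour Lego subset alone would contain
# roughly 338 * 3600 * 30 ≈ 36.5 million such timesteps.
```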

2. General Vision-Language Data: 80 Million Samples

To prevent the model from becoming a "blind" controller, the researchers integrated over 80 million samples of general vision-language data. This integration is crucial for maintaining a broad understanding of the visual world and preventing catastrophic forgetting—a common issue where a model loses its general reasoning abilities as it becomes specialized in a specific domain.
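
A common recipe for combining such heterogeneous corpora is to draw every training batch from both sources at a fixed ratio. The sketch below illustrates that general pattern with assumed mixture weights; it is not Xiaomi's published training code, and the 70/30 split is purely illustrative.

```python
import random

# Hypothetical mixture weights; the ratio actually used for Xiaomi-Robotics-0 is not stated here.
MIXTURE = [
    ("robot_trajectories", 0.7),  # ~200M action-labelled timesteps
    ("vision_language", 0.3),     # ~80M general VL samples (captions, VQA, etc.)
]

def sample_source(mixture):
    """Pick the data source for the next training example according to the mixture weights."""
    names, weights = zip(*mixture)
    return random.choices(names, weights=weights, k=1)[0]

# Build one mixed batch; each example keeps its source tag so the training loop
# can route it to the action objective or the language objective.
batch = [sample_source(MIXTURE) for _ in range(256)]
print(batch.count("robot_trajectories"), "action samples,",
      batch.count("vision_language"), "VL samples")
```

Keeping general vision-language samples in every batch is what protects the model's broad visual understanding, as discussed in the section on catastrophic forgetting below.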

Performance and Adaptive Capabilities

What sets Xiaomi-Robotics-0 apart from standard robotic models is its ability to handle complexity and uncertainty. The model doesn't just follow a fixed script; it perceives and reacts to its environment in real time.

Complex Manipulation: The Lego Challenge

One of the most impressive demonstrations of Xiaomi-Robotics-0 is its ability to disassemble Lego structures. The model can handle structures consisting of up to 20 bricks, identifying the correct order and motion required to detach each piece without toppling the rest of the build. This requires a high degree of spatial awareness and force control.

Adaptive Grasping and Robustness

In real-world scenarios, things rarely go perfectly. An object might slip, or a robot might fail to get a secure grip on the first try. Xiaomi-Robotics-0 is designed with adaptive behavior to mitigate these failures. If the model detects a grasping error, it can autonomously switch its motion strategy—such as adjusting the approach angle or increasing the grip force—to ensure the task is completed. This "retry" logic is a critical component for robots operating in unconstrained environments.
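
Conceptually, this recovery behavior can be pictured as a perception-conditioned retry loop. The toy sketch below is our own illustration of the idea: the function names and the simulated success check are hypothetical, and in the actual model the adaptation is learned end to end rather than hand-coded.

```python
import random

def execute_grasp(approach_angle_deg, grip_force):
    """Simulated grasp attempt (stand-in for real gripper and force-sensor feedback)."""
    return random.random() < 0.4 + 0.5 * grip_force  # firmer grip, better odds

def grasp_with_retries(max_attempts=3):
    """Illustrative retry loop: adjust approach angle and grip force after each failure."""
    approach_angle_deg, grip_force = 0.0, 0.4
    for attempt in range(1, max_attempts + 1):
        if execute_grasp(approach_angle_deg, grip_force):
            print(f"Grasp succeeded on attempt {attempt}")
            return True
        # Failure detected: change the approach and squeeze a little harder before retrying.
        approach_angle_deg += 15.0
        grip_force = min(1.0, grip_force + 0.2)
        print(f"Attempt {attempt} failed; retrying at {approach_angle_deg:.0f} degrees "
              f"with grip force {grip_force:.1f}")
    return False

grasp_with_retries()
```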

Overcoming Catastrophic Forgetting

A major challenge in creating advanced VLA models is the tendency for models to lose their general visual understanding when fine-tuned on specific robotic tasks. If a model only sees robot-centric views (like a camera mounted on a gripper), it may lose the ability to recognize common objects from a distance or understand broader context.

Xiaomi-Robotics-0 addresses this by using a joint training approach. By continuously training on general Vision-Language (VL) datasets alongside the robotic action data, the model maintains a "generalist" core while developing "specialist" skills. This ensures that the robot can follow natural language instructions like "pick up the red block next to the coffee mug" without needing a separate object detector.
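
One standard way to implement this kind of joint training is to optimize a weighted sum of an action-prediction loss on robot data and a language-modeling loss on general VL data, so the shared backbone is never updated on action data alone. The PyTorch sketch below shows that pattern with a toy two-headed model and made-up loss weights; it illustrates the co-training idea rather than the exact architecture or objective of Xiaomi-Robotics-0.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVLA(nn.Module):
    """Toy stand-in for a VLA backbone: one shared trunk feeding two heads."""
    def __init__(self, feat_dim=64, action_dim=7, vocab_size=1000):
        super().__init__()
        self.trunk = nn.Linear(feat_dim, 128)           # shared "generalist" core
        self.action_head = nn.Linear(128, action_dim)   # specialist: control outputs
        self.vl_head = nn.Linear(128, vocab_size)       # generalist: next-token logits

    def forward(self, features):
        return torch.relu(self.trunk(features))

model = ToyVLA()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One mixed batch: robot features with action targets, VL features with token targets.
robot_feats, target_actions = torch.randn(32, 64), torch.randn(32, 7)
vl_feats, target_tokens = torch.randn(32, 64), torch.randint(0, 1000, (32,))

action_loss = F.mse_loss(model.action_head(model(robot_feats)), target_actions)
vl_loss = F.cross_entropy(model.vl_head(model(vl_feats)), target_tokens)

# Hypothetical 1.0 / 0.5 weighting: keeping the VL term in every update is what
# stops the shared trunk from drifting away from its general visual understanding.
loss = 1.0 * action_loss + 0.5 * vl_loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```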

Conclusion

Xiaomi-Robotics-0 represents a landmark achievement in the scaling of robotic intelligence. By combining massive datasets, real-time optimization, and adaptive behavioral logic, Xiaomi has created a model that is both highly capable and remarkably robust. The ability to perform delicate tasks like Lego disassembly while maintaining a broad understanding of the visual world sets a new benchmark for VLA models.

As we look toward the future, the techniques used in Xiaomi-Robotics-0—particularly the mitigation of catastrophic forgetting and the focus on real-time execution—will likely become standard practice for the next generation of humanoid robots and industrial assistants. We are moving from robots that follow instructions to robots that understand and adapt to the world around them.

Frequently Asked Questions

What is Xiaomi-Robotics-0?
Xiaomi-Robotics-0 is an advanced Vision-Language-Action (VLA) model designed for high-performance, real-time robotic control and complex manipulation tasks.

What data was the model trained on?
The model was trained on a massive dataset of 200 million timesteps of robot trajectories and over 80 million samples of general vision-language data.

Can the model recover from failed grasps?
Yes, Xiaomi-Robotics-0 features adaptive behavior that allows it to switch grasping motions or retry actions when it detects a failure, such as a slipped object.

What tasks has the model demonstrated?
The model has demonstrated the ability to disassemble Lego structures consisting of up to 20 bricks and fold towels, tasks requiring high precision and visual understanding.

Is the model openly available?
Yes, the source code and pre-trained models are available on GitHub and Hugging Face for the research community.

Continue Your AI Journey

Explore our lessons and glossary to deepen your understanding.