Xiaomi-Robotics-0: Scaling VLA Models for Real-Time Robot Control

Xiaomi-Robotics-0 is a breakthrough Vision-Language-Action (VLA) model trained on 200M timesteps, enabling complex tasks like Lego disassembly and adaptive robot behavior.

by HowAIWorks Team
Tags: Xiaomi, Robotics, VLA Model, Vision-Language-Action, AI Research, Robot Control, Lego Disassembly, Machine Learning, Real-time AI

Introduction

The field of robotics is undergoing a fundamental shift from specialized, task-specific controllers to general-purpose Vision-Language-Action (VLA) models. These models aim to combine the reasoning power of large language models with the visual perception of vision models and the physical dexterity required for robotic control. In a significant leap forward, the Xiaomi Robotics team has announced Xiaomi-Robotics-0, a massive VLA model designed for real-time interaction and complex physical manipulation.

Unlike previous iterations that often struggled with the trade-off between reasoning capacity and real-time execution, Xiaomi-Robotics-0 is optimized for low-latency rollouts. By training on an unprecedented scale of both robotic and general vision-language data, the model demonstrates a level of responsiveness and adaptability that brings us closer to robots that can operate alongside humans in dynamic environments. Similar to advancements we've seen in NVIDIA PersonaPlex for audio, Xiaomi-Robotics-0 pushes the boundaries of integrated multimodal intelligence.

Large-Scale VLA Training

At the core of Xiaomi-Robotics-0 is a training regimen of massive proportions. To build a model that understands both "what" a task is and "how" to physically execute it, the team leveraged two distinct types of data:

1. Robotic Trajectory Data: 200 Million Timesteps

The model was trained on approximately 200 million timesteps of robotic trajectories. This data provides the "muscle memory" for the robot, allowing it to learn the fine-grained control signals required for smooth and precise movements. The dataset includes specialized tasks that are notoriously difficult for AI:

  • Lego Disassembly: Over 338 hours of data focused on the delicate task of identifying and removing individual Lego bricks.
  • Towel Folding: 400 hours of data capturing the complex deformable nature of textiles.
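
To make the notion of a "timestep" concrete, here is a minimal sketch of what one record in a trajectory dataset of this kind might look like. The field names, shapes, and the 30 Hz control rate are illustrative assumptions on our part, not Xiaomi's published data schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrajectoryTimestep:
    """One illustrative timestep of a robot trajectory (hypothetical schema)."""
    wrist_image: np.ndarray      # e.g. (224, 224, 3) RGB from a gripper-mounted camera
    scene_image: np.ndarray      # e.g. (224, 224, 3) RGB from an external camera
    proprioception: np.ndarray   # joint angles and gripper state, e.g. shape (8,)
    instruction: str             # natural-language description of the task
    action: np.ndarray           # control target for the next step, e.g. shape (7,)

# Assuming a 30 Hz control rate, the 338-hour Lego subset alone would contain
# roughly 338 * 3600 * 30 ≈ 36.5 million such timesteps.
```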

2. General Vision-Language Data: 80 Million Samples

To prevent the model from becoming a "blind" controller, the researchers integrated over 80 million samples of general vision-language data. This integration is crucial for maintaining a broad understanding of the visual world and preventing catastrophic forgetting—a common issue where a model loses its general reasoning abilities as it becomes specialized in a specific domain.
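
A common recipe for combining such heterogeneous corpora is to draw every training batch from both sources at a fixed ratio. The sketch below illustrates that general pattern with assumed mixture weights; it is not Xiaomi's published training code, and the 70/30 split is purely illustrative.

```python
import random

# Hypothetical mixture weights; the ratio actually used for Xiaomi-Robotics-0 is not stated here.
MIXTURE = [
    ("robot_trajectories", 0.7),  # ~200M action-labelled timesteps
    ("vision_language", 0.3),     # ~80M general VL samples (captions, VQA, etc.)
]

def sample_source(mixture):
    """Pick the data source for the next training example according to the mixture weights."""
    names, weights = zip(*mixture)
    return random.choices(names, weights=weights, k=1)[0]

# Build one mixed batch; each example keeps its source tag so the training loop
# can route it to the action objective or the language objective.
batch = [sample_source(MIXTURE) for _ in range(256)]
print(batch.count("robot_trajectories"), "action samples,",
      batch.count("vision_language"), "VL samples")
```

Keeping general vision-language samples in every batch is what protects the model's broad visual understanding, as discussed in the section on catastrophic forgetting below.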

Performance and Adaptive Capabilities

What sets Xiaomi-Robotics-0 apart from standard robotic models is its ability to handle complexity and uncertainty. The model doesn't just follow a fixed script; it perceives and reacts to its environment in real time.

Complex Manipulation: The Lego Challenge

One of the most impressive demonstrations of Xiaomi-Robotics-0 is its ability to disassemble Lego structures. The model can handle structures consisting of up to 20 bricks, identifying the correct order and motion required to detach each piece without toppling the rest of the build. This requires a high degree of spatial awareness and force control.

Adaptive Grasping and Robustness

In real-world scenarios, things rarely go perfectly. An object might slip, or a robot might fail to get a secure grip on the first try. Xiaomi-Robotics-0 is designed with adaptive behavior to mitigate these failures. If the model detects a grasping error, it can autonomously switch its motion strategy—such as adjusting the approach angle or increasing the grip force—to ensure the task is completed. This "retry" logic is a critical component for robots operating in unconstrained environments.
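
Conceptually, this recovery behavior can be pictured as a perception-conditioned retry loop. The toy sketch below is our own illustration of the idea: the function names and the simulated success check are hypothetical, and in the actual model the adaptation is learned end to end rather than hand-coded.

```python
import random

def execute_grasp(approach_angle_deg, grip_force):
    """Simulated grasp attempt (stand-in for real gripper and force-sensor feedback)."""
    return random.random() < 0.4 + 0.5 * grip_force  # firmer grip, better odds

def grasp_with_retries(max_attempts=3):
    """Illustrative retry loop: adjust approach angle and grip force after each failure."""
    approach_angle_deg, grip_force = 0.0, 0.4
    for attempt in range(1, max_attempts + 1):
        if execute_grasp(approach_angle_deg, grip_force):
            print(f"Grasp succeeded on attempt {attempt}")
            return True
        # Failure detected: change the approach and squeeze a little harder before retrying.
        approach_angle_deg += 15.0
        grip_force = min(1.0, grip_force + 0.2)
        print(f"Attempt {attempt} failed; retrying at {approach_angle_deg:.0f} degrees "
              f"with grip force {grip_force:.1f}")
    return False

grasp_with_retries()
```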

Overcoming Catastrophic Forgetting

A major challenge in creating advanced VLA models is the tendency for models to lose their general visual understanding when fine-tuned on specific robotic tasks. If a model only sees robot-centric views (like a camera mounted on a gripper), it may lose the ability to recognize common objects from a distance or understand broader context.

Xiaomi-Robotics-0 addresses this by using a joint training approach. By continuously training on general Vision-Language (VL) datasets alongside the robotic action data, the model maintains a "generalist" core while developing "specialist" skills. This ensures that the robot can follow natural language instructions like "pick up the red block next to the coffee mug" without needing a separate object detector.
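
One standard way to implement this kind of joint training is to optimize a weighted sum of an action-prediction loss on robot data and a language-modeling loss on general VL data, so the shared backbone is never updated on action data alone. The PyTorch sketch below shows that pattern with a toy two-headed model and made-up loss weights; it illustrates the co-training idea rather than the exact architecture or objective of Xiaomi-Robotics-0.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVLA(nn.Module):
    """Toy stand-in for a VLA backbone: one shared trunk feeding two heads."""
    def __init__(self, feat_dim=64, action_dim=7, vocab_size=1000):
        super().__init__()
        self.trunk = nn.Linear(feat_dim, 128)           # shared "generalist" core
        self.action_head = nn.Linear(128, action_dim)   # specialist: control outputs
        self.vl_head = nn.Linear(128, vocab_size)       # generalist: next-token logits

    def forward(self, features):
        return torch.relu(self.trunk(features))

model = ToyVLA()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One mixed batch: robot features with action targets, VL features with token targets.
robot_feats, target_actions = torch.randn(32, 64), torch.randn(32, 7)
vl_feats, target_tokens = torch.randn(32, 64), torch.randint(0, 1000, (32,))

action_loss = F.mse_loss(model.action_head(model(robot_feats)), target_actions)
vl_loss = F.cross_entropy(model.vl_head(model(vl_feats)), target_tokens)

# Hypothetical 1.0 / 0.5 weighting: keeping the VL term in every update is what
# stops the shared trunk from drifting away from its general visual understanding.
loss = 1.0 * action_loss + 0.5 * vl_loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```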

Conclusion

Xiaomi-Robotics-0 represents a landmark achievement in the scaling of robotic intelligence. By combining massive datasets, real-time optimization, and adaptive behavioral logic, Xiaomi has created a model that is both highly capable and remarkably robust. The ability to perform delicate tasks like Lego disassembly while maintaining a broad understanding of the visual world sets a new benchmark for VLA models.

As we look toward the future, the techniques used in Xiaomi-Robotics-0—particularly the mitigation of catastrophic forgetting and the focus on real-time execution—will likely become standard practice for the next generation of humanoid robots and industrial assistants. We are moving from robots that follow instructions to robots that understand and adapt to the world around them.

Frequently Asked Questions

What is Xiaomi-Robotics-0?
Xiaomi-Robotics-0 is an advanced Vision-Language-Action (VLA) model designed for high-performance, real-time robotic control and complex manipulation tasks.

What data was the model trained on?
The model was trained on a massive dataset of 200 million timesteps of robot trajectories and over 80 million samples of general vision-language data.

Can the model recover from failed grasps?
Yes, Xiaomi-Robotics-0 features adaptive behavior that allows it to switch grasping motions or retry actions when it detects a failure, such as a slipped object.

What tasks has the model demonstrated?
The model has demonstrated the ability to disassemble Lego structures consisting of up to 20 bricks and fold towels, tasks requiring high precision and visual understanding.

Is the model openly available?
Yes, the source code and pre-trained models are available on GitHub and Hugging Face for the research community.

Continue Your AI Journey

Explore our lessons and glossary to deepen your understanding.