Xiaomi MiMo-V2: Three New State-of-the-Art AI Models

Xiaomi releases MiMo-V2, a groundbreaking AI trio featuring a 1T-parameter Pro model, an omnimodal agent, and a next-gen TTS system.

by HowAIWorks Team
Tags: ai, xiaomi, mimo-v2, llm, multimodal, tts, deepseek, moe, benchmarks, agents

Introduction

Xiaomi has made a massive splash in the AI landscape with the release of MiMo-V2, a suite of three advanced models: MiMo-V2-Pro (LLM), MiMo-V2-Omni (omnimodal), and MiMo-V2-TTS (text-to-speech). Led by Luo Fuli, a key figure in the development of DeepSeek R1, the MiMo team has delivered a series of models that rival the best in the world across benchmarks and real-world applications.

From a trillion-parameter flagship to an autonomous agent capable of online shopping, the MiMo-V2 release signals Xiaomi's intent to become a primary player in the global AI race.

MiMo-V2-Pro: The Flagship LLM

The MiMo-V2-Pro is the heavyweight of the release. It utilizes a Mixture-of-Experts (MoE) architecture with a staggering 1 trillion parameters in total, of which 42 billion are active during any given inference step.

Key Technical Specifications

  • Hybrid Attention: Combines different attention mechanisms for better efficiency and performance.
  • 1-Million Token Context: Able to process vast amounts of data in a single session.
  • Provenance: Known during testing as "Hunter Alpha" on OpenRouter.
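To make the "1 trillion total, 42 billion active" distinction concrete, here is a minimal sketch of how top-k routing works in a Mixture-of-Experts layer. This is an illustration of the general technique, not Xiaomi's actual implementation; the expert count, dimensions, and function names are all invented for the example.

```python
import numpy as np

def moe_forward(x, experts, gate_w, top_k=2):
    """Route one token vector through its top-k experts (simplified MoE layer)."""
    logits = x @ gate_w                     # router score for each expert
    top = np.argsort(logits)[-top_k:]       # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                # softmax over the chosen experts only
    # Only the selected experts run; the rest stay idle -- their parameters
    # count toward the total but not toward the "active" parameters.
    return sum(w * experts[i](x) for i, w in zip(top, weights))

rng = np.random.default_rng(0)
d, n_experts = 8, 4                         # toy sizes; a real model is far larger
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
y = moe_forward(rng.normal(size=d), experts, gate_w, top_k=2)
print(y.shape)  # (8,)
```

Scaled up, this is why a 1T-parameter MoE model can run with roughly the inference cost of a 42B dense model: each token only touches the experts its router selects.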

Benchmark Performance

MiMo-V2-Pro has achieved remarkable scores across international and domestic benchmarks:

  • Artificial Analysis Intelligence Index: 49 points (8th in the world, 2nd among Chinese LLMs).
  • PinchBench: 84.0 (3rd place, immediately following Claude Sonnet 4.6).
  • ClawEval: 61.5 (3rd place, outperforming GPT-5.2).
  • GDPval-AA (Agent Efficiency): Elo 1434 (the highest score for a Chinese model).

Pricing Structure

Xiaomi has set a competitive pricing tier for the Pro model:

  • Standard Context (up to 256K): $1.00 Input / $3.00 Output per million tokens.
  • Extended Context (256K to 1M): $2.00 Input / $6.00 Output per million tokens.
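As a quick sanity check on what these rates mean in practice, the helper below estimates the cost of a single request. Note the assumption: it applies the extended-context rate to the whole request once the combined token count crosses 256K, which may differ from Xiaomi's exact billing rule.

```python
# Hypothetical cost helper based on the published MiMo-V2-Pro rates.
# Assumption: the extended-context rate applies to the entire request once
# its total token count exceeds 256K (Xiaomi's actual tier rule may differ).
PRO_RATES = {
    "standard": {"input": 1.00, "output": 3.00},   # up to 256K context
    "extended": {"input": 2.00, "output": 6.00},   # 256K to 1M context
}

def pro_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one MiMo-V2-Pro request under the rates above."""
    tier = "extended" if input_tokens + output_tokens > 256_000 else "standard"
    r = PRO_RATES[tier]
    return (input_tokens * r["input"] + output_tokens * r["output"]) / 1_000_000

# Example: a 100K-token prompt with a 2K-token answer lands in the standard tier.
print(round(pro_cost(100_000, 2_000), 4))  # 0.106
```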

MiMo-V2-Omni: The Autonomous Multimodal Agent

The MiMo-V2-Omni model is designed for seamless multimodal interaction. It processes text, images, video, and audio through a unified base with dedicated encoders for each modality.

Multimodal Breakthroughs

  • 10+ Hours Audio Processing: Supports continuous audio input in a single request.
  • MM-BrowserComp: Scored 52.0, surpassing Gemini 3 Pro.
  • GDPval-AA: Elo 1435, also outperforming Gemini 3 Pro.

Real-World Demonstrations

Xiaomi showcased the model's autonomous capabilities in two impressive demos:

  1. Autonomous Shopping: The model navigated the entire purchase cycle—searching for reviews on Xiaohongshu, comparing prices on JD.com, bargaining with customer support, and completing the order.
  2. Content Creation: From a single text prompt, the model generated a 15-second multi-scene video, synthesized audio, corrected font rendering errors, and published the result to TikTok.

Pricing: $0.40 Input / $2.00 Output per million tokens.

MiMo-V2-TTS: Advanced Speech Synthesis

The MiMo-V2-TTS model focuses on high-fidelity, emotional speech synthesis. Trained on hundreds of millions of hours of audio and refined through multi-dimensional Reinforcement Learning (RL), it pushes the boundaries of AI voice technology.

Capabilities

  • Emotional Expressiveness: Fine-grained emotion control at the sentence level.
  • Singing: Maintains pitch and rhythm with high accuracy.
  • Dialect Support: Native support for Sichuan, Henan, Cantonese, and Taiwanese dialects.
  • Prosody Transformation: Automatically translates punctuation and particles into natural prosody without extra markup.

Special Offer: Available for free for a limited time (TBA).

Conclusion

The release of MiMo-V2 marks a significant milestone for Xiaomi. By combining massive scale (MiMo-V2-Pro), sophisticated multimodal autonomy (MiMo-V2-Omni), and nuanced emotional expression (MiMo-V2-TTS), Xiaomi has provided a comprehensive toolkit for the next generation of AI applications. With Luo Fuli at the helm, the MiMo project is clearly positioned to be a top contender in the rapidly evolving AI ecosystem.

Want to stay updated on the latest AI models? Explore our AI models catalog, join our AI fundamentals courses, or visit our AI glossary for detailed definitions.

Frequently Asked Questions

What architecture does MiMo-V2-Pro use?
MiMo-V2-Pro uses a Mixture-of-Experts (MoE) architecture with a total of 1 trillion parameters, 42 billion of which are active during inference. It also features hybrid attention and a 1-million-token context window.

What modalities does MiMo-V2-Omni support?
MiMo-V2-Omni features a unified base with individual encoders for text, images, video, and audio. It supports continuous audio processing for over 10 hours in a single request.

What can MiMo-V2-TTS do?
MiMo-V2-TTS is trained on hundreds of millions of hours of audio and fine-tuned via multi-dimensional RL. It can synthesize emotional speech, sing while maintaining rhythm and pitch, and speak several Chinese dialects.

Who leads the MiMo team?
The MiMo team is led by Luo Fuli, one of the key authors behind the renowned DeepSeek R1 model.

Are the MiMo-V2 models available via API?
Yes, all models are accessible via API at platform.xiaomimimo.com and through MiMo Studio.
