Xiaomi MiMo-V2: Three New State-of-the-Art AI Models

Xiaomi releases MiMo-V2, a groundbreaking AI trio featuring a 1T-parameter Pro model, an omnimodal agent, and a next-gen TTS system.

by HowAIWorks Team
Tags: ai, xiaomi, mimo-v2, llm, multimodal, tts, deepseek, moe, benchmarks, agents

Introduction

Xiaomi has made a massive splash in the AI landscape with the release of MiMo-V2, a suite of three advanced models: MiMo-V2-Pro (LLM), MiMo-V2-Omni (omnimodal), and MiMo-V2-TTS (text-to-speech). Led by Luo Fuli, a key figure in the development of DeepSeek R1, the MiMo team has delivered a series of models that rival the best in the world across benchmarks and real-world applications.

From a trillion-parameter flagship to an autonomous agent capable of online shopping, the MiMo-V2 release signals Xiaomi's intent to become a primary player in the global AI race.

MiMo-V2-Pro: The Flagship LLM

The MiMo-V2-Pro is the heavyweight of the release. It utilizes a Mixture-of-Experts (MoE) architecture with a staggering 1 trillion parameters in total, of which 42 billion are active during any given inference step.

Key Technical Specifications

  • Hybrid Attention: Combines different attention mechanisms for better efficiency and performance.
  • 1-Million Token Context: Able to process vast amounts of data in a single session.
  • Provenance: Known during testing as "Hunter Alpha" on OpenRouter.
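To make the "1 trillion total, 42 billion active" distinction concrete, here is a minimal sketch of how top-k routing works in a Mixture-of-Experts layer. This is an illustration of the general technique, not Xiaomi's actual implementation; the expert count, dimensions, and function names are all invented for the example.

```python
import numpy as np

def moe_forward(x, experts, gate_w, top_k=2):
    """Route one token vector through its top-k experts (simplified MoE layer)."""
    logits = x @ gate_w                     # router score for each expert
    top = np.argsort(logits)[-top_k:]       # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                # softmax over the chosen experts only
    # Only the selected experts run; the rest stay idle -- their parameters
    # count toward the total but not toward the "active" parameters.
    return sum(w * experts[i](x) for i, w in zip(top, weights))

rng = np.random.default_rng(0)
d, n_experts = 8, 4                         # toy sizes; a real model is far larger
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
y = moe_forward(rng.normal(size=d), experts, gate_w, top_k=2)
print(y.shape)  # (8,)
```

Scaled up, this is why a 1T-parameter MoE model can run with roughly the inference cost of a 42B dense model: each token only touches the experts its router selects.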

Benchmark Performance

MiMo-V2-Pro has achieved remarkable scores across international and domestic benchmarks:

  • Artificial Analysis Intelligence Index: 49 points (8th in the world, 2nd among Chinese LLMs).
  • PinchBench: 84.0 (3rd place, immediately following Claude Sonnet 4.6).
  • ClawEval: 61.5 (3rd place, outperforming GPT-5.2).
  • GDPval-AA (Agent Efficiency): Elo 1434 (the highest score for a Chinese model).

Pricing Structure

Xiaomi has set a competitive pricing tier for the Pro model:

  • Standard Context (up to 256K): $1.00 Input / $3.00 Output per million tokens.
  • Extended Context (256K to 1M): $2.00 Input / $6.00 Output per million tokens.
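As a quick sanity check on what these rates mean in practice, the helper below estimates the cost of a single request. Note the assumption: it applies the extended-context rate to the whole request once the combined token count crosses 256K, which may differ from Xiaomi's exact billing rule.

```python
# Hypothetical cost helper based on the published MiMo-V2-Pro rates.
# Assumption: the extended-context rate applies to the entire request once
# its total token count exceeds 256K (Xiaomi's actual tier rule may differ).
PRO_RATES = {
    "standard": {"input": 1.00, "output": 3.00},   # up to 256K context
    "extended": {"input": 2.00, "output": 6.00},   # 256K to 1M context
}

def pro_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one MiMo-V2-Pro request under the rates above."""
    tier = "extended" if input_tokens + output_tokens > 256_000 else "standard"
    r = PRO_RATES[tier]
    return (input_tokens * r["input"] + output_tokens * r["output"]) / 1_000_000

# Example: a 100K-token prompt with a 2K-token answer lands in the standard tier.
print(round(pro_cost(100_000, 2_000), 4))  # 0.106
```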

MiMo-V2-Omni: The Autonomous Multimodal Agent

The MiMo-V2-Omni model is designed for seamless multimodal interaction. It processes text, images, video, and audio through a unified base with dedicated encoders for each modality.

Multimodal Breakthroughs

  • 10+ Hours Audio Processing: Supports continuous audio input in a single request.
  • MM-BrowserComp: Scored 52.0, surpassing Gemini 3 Pro.
  • GDPval-AA: Elo 1435, also outperforming Gemini 3 Pro.

Real-World Demonstrations

Xiaomi showcased the model's autonomous capabilities in two impressive demos:

  1. Autonomous Shopping: The model navigated the entire purchase cycle—searching for reviews on Xiaohongshu, comparing prices on JD.com, bargaining with customer support, and completing the order.
  2. Content Creation: From a single text prompt, the model generated a 15-second multi-scene video, synthesized audio, corrected font rendering errors, and published the result to TikTok.

Pricing: $0.40 Input / $2.00 Output per million tokens.

MiMo-V2-TTS: Advanced Speech Synthesis

The MiMo-V2-TTS model focuses on high-fidelity, emotional speech synthesis. Trained on hundreds of millions of hours of audio and refined through multi-dimensional Reinforcement Learning (RL), it pushes the boundaries of AI voice technology.

Capabilities

  • Emotional Expressiveness: Fine-grained emotion control at the sentence level.
  • Singing: Maintains pitch and rhythm with high accuracy.
  • Dialect Support: Native support for Sichuan, Henan, Cantonese, and Taiwanese dialects.
  • Prosody Transformation: Automatically translates punctuation and particles into natural prosody without extra markup.

Special Offer: Available for free for a limited time (TBA).

Conclusion

The release of MiMo-V2 marks a significant milestone for Xiaomi. By combining massive scale (MiMo-V2-Pro), sophisticated multimodal autonomy (MiMo-V2-Omni), and nuanced emotional expression (MiMo-V2-TTS), Xiaomi has provided a comprehensive toolkit for the next generation of AI applications. With Luo Fuli at the helm, the MiMo project is clearly positioned to be a top contender in the rapidly evolving AI ecosystem.

Want to stay updated on the latest AI models? Explore our AI models catalog, join our AI fundamentals courses, or visit our AI glossary for detailed definitions.

Frequently Asked Questions

What architecture does MiMo-V2-Pro use?
MiMo-V2-Pro uses a Mixture-of-Experts (MoE) architecture with a total of 1 trillion parameters, 42 billion of which are active during inference. It also features hybrid attention and a 1-million-token context window.

What modalities does MiMo-V2-Omni support?
MiMo-V2-Omni features a unified base with individual encoders for text, images, video, and audio. It supports continuous audio processing for over 10 hours in a single request.

What can MiMo-V2-TTS do?
MiMo-V2-TTS is trained on hundreds of millions of hours of audio and fine-tuned via multi-dimensional RL. It can synthesize emotional speech, sing while maintaining rhythm and pitch, and speak several Chinese dialects.

Who leads the MiMo team?
The MiMo team is led by Luo Fuli, one of the key authors behind the renowned DeepSeek R1 model.

Are the MiMo-V2 models available via API?
Yes, all models are accessible via API at platform.xiaomimimo.com and through MiMo Studio.
