Introduction
Meituan's LongCat team has open-sourced LongCat-Flash-Omni, a state-of-the-art omni-modal AI model with 560 billion total parameters, of which roughly 27 billion are activated per token, designed to excel at real-time audio-visual interaction. The model represents a significant advance in multimodal AI, seamlessly integrating powerful offline multi-modal understanding with real-time audio–visual interaction in a single unified framework.
LongCat-Flash-Omni leverages the high-performance Shortcut-connected Mixture-of-Experts (MoE) architecture with zero-computation experts, augmented by efficient multimodal perception and speech reconstruction modules. Through an effective curriculum-inspired progressive training strategy, the model achieves comprehensive multimodal capabilities while maintaining strong unimodal performance. The model is released under the MIT License, making it accessible to researchers and developers worldwide.
This release marks an important milestone in open-source omni-modal AI, as LongCat-Flash-Omni demonstrates state-of-the-art performance across multiple benchmarks while supporting a context window of up to 128K tokens, enabling advanced capabilities in long-term memory, multi-turn dialogue, and temporal reasoning across multiple modalities.
Model Architecture and Key Features
Unified Omni-Modal Capabilities
LongCat-Flash-Omni is an open-source omni-modal model that achieves state-of-the-art cross-modal comprehension performance. The model seamlessly integrates:
- Offline multi-modal understanding: Powerful capabilities for processing text, images, audio, and video
- Real-time audio–visual interaction: Low-latency processing and streaming speech generation
- Unified framework: All capabilities integrated within a single all-in-one architecture
The model's architecture combines an efficient large language model (LLM) backbone with carefully designed lightweight modality encoders and decoders, along with a chunk-wise audio–visual feature interleaving mechanism. This design enables the model to achieve low-latency, high-quality audio–visual processing while maintaining strong performance across all modalities.
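To make the chunk-wise interleaving idea concrete, here is a minimal sketch in Python. It is an illustration only, not the actual LongCat-Flash-Omni implementation: the chunk granularity, feature shapes, and the `interleave_av_chunks` helper are all assumptions.

```python
# Minimal sketch of chunk-wise audio-visual feature interleaving.
# Chunk sizes, token shapes, and names are illustrative assumptions,
# not the actual LongCat-Flash-Omni implementation.
from typing import Iterator, List

def interleave_av_chunks(
    audio_feats: List[list],   # per-window audio feature tokens (e.g., one short time window each)
    video_feats: List[list],   # visual feature tokens covering the same time windows
) -> Iterator[list]:
    """Yield one fused token chunk per time window, so the LLM backbone sees
    audio and vision aligned in time instead of one long block of audio
    followed by one long block of video."""
    for audio_chunk, video_chunk in zip(audio_feats, video_feats):
        # Within a window, audio tokens and the co-occurring video tokens are
        # concatenated before being fed to the decoder-only backbone.
        yield audio_chunk + video_chunk

# Example: three consecutive windows of a streaming call
audio = [["a0", "a1"], ["a2", "a3"], ["a4", "a5"]]
video = [["v0"], ["v1"], ["v2"]]
for chunk in interleave_av_chunks(audio, video):
    print(chunk)   # ['a0', 'a1', 'v0'], ['a2', 'a3', 'v1'], ...
```

Interleaving at chunk level is what keeps latency low: the backbone can start attending to the first window while later windows are still being encoded.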
Large-Scale with Low-Latency Processing
One of the model's standout features is its ability to handle large-scale processing with low latency:
- 560 billion total parameters with 27 billion activated through MoE architecture
- 128K token context window for advanced long-term memory and multi-turn dialogue
- Low-latency audio–visual interaction through efficient architecture design
- Streaming speech generation capabilities for real-time applications
The efficient LLM backbone, combined with lightweight modality encoders and decoders, enables the model to process audio and visual inputs with minimal delay while maintaining high quality. This makes it suitable for real-time applications such as live video analysis, interactive voice assistants, and streaming media processing.
Effective Early-Fusion Training
LongCat-Flash-Omni adopts an innovative multi-stage pretraining pipeline that progressively incorporates text, audio, and visual modalities:
- Balanced data strategy: Ensures strong performance across all modalities
- Early-fusion training paradigm: Integrates modalities from the beginning of training
- Progressive incorporation: Text, audio, and visual modalities added in stages
- No modality degradation: Maintains strong performance in each individual modality
This training approach ensures that the model achieves comprehensive omni-modal performance without sacrificing capabilities in any single modality. The curriculum-inspired progressive training strategy allows the model to learn complex cross-modal relationships while maintaining expertise in unimodal tasks.
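The sketch below illustrates what a curriculum of progressively added modalities could look like. The stage names, ordering, and mixture ratios are assumptions for illustration; they are not the published LongCat-Flash-Omni recipe.

```python
# Illustrative curriculum for progressive multimodal pretraining.
# Stage names and data-mixture ratios are assumptions, not Meituan's recipe.
STAGES = [
    {"name": "text_pretrain", "mixture": {"text": 1.0}},
    {"name": "add_audio",     "mixture": {"text": 0.7, "audio": 0.3}},
    {"name": "add_vision",    "mixture": {"text": 0.5, "audio": 0.2, "image": 0.3}},
    {"name": "omni_balanced", "mixture": {"text": 0.4, "audio": 0.2, "image": 0.2, "video": 0.2}},
]

def batch_composition(stage: dict, batch_size: int) -> dict:
    """Split a batch across modalities according to the stage's mixture, so
    earlier-learned modalities keep receiving gradient signal while a new
    modality is phased in (the 'no modality degradation' goal)."""
    return {m: round(batch_size * w) for m, w in stage["mixture"].items()}

for stage in STAGES:
    print(stage["name"], batch_composition(stage, batch_size=256))
```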
Efficient Training Infrastructure
The LongCat team developed a Modality-Decoupled Parallelism training scheme, in which the components handling different modalities are parallelized and scheduled independently rather than being forced into a single monolithic parallel layout:
- Significantly enhanced efficiency: Optimized for large-scale multimodal training
- Parallel processing: Different modalities can be processed in parallel
- Scalable architecture: Designed to handle the complexity of omni-modal training
This training infrastructure enables efficient training of the massive 560B parameter model while managing the computational complexity of processing multiple modalities simultaneously.
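The configuration sketch below conveys the general idea of decoupling parallelism per component: small modality encoders can be replicated while the huge MoE backbone is sharded. The component names, group sizes, and strategy mapping are hypothetical and do not describe Meituan's actual training system.

```python
# Conceptual sketch: each component gets its own parallelization plan.
# All component names, world size, and group sizes are illustrative
# assumptions, not the actual Modality-Decoupled Parallelism scheme.
WORLD_SIZE = 16  # hypothetical number of GPUs in the training job

PARALLEL_PLAN = {
    "vision_encoder": {"data_parallel": WORLD_SIZE},                 # lightweight: replicate everywhere
    "audio_encoder":  {"data_parallel": WORLD_SIZE},                 # lightweight: replicate everywhere
    "llm_backbone":   {"tensor_parallel": 4, "expert_parallel": 4},  # massive MoE: shard tensors and experts
}

def describe(plan: dict) -> str:
    """Decoupling the plans lets encoder work for the next chunk overlap
    with backbone compute for the current chunk, instead of every module
    sharing one layout sized for the largest component."""
    return ", ".join(f"{k}={v}" for k, v in plan.items())

for component, plan in PARALLEL_PLAN.items():
    print(f"{component}: {describe(plan)}")
```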
Evaluation Results
Omni-Modality Performance
LongCat-Flash-Omni demonstrates strong performance across multiple omni-modal benchmarks:
| Benchmark | LongCat-Flash-Omni Instruct | Gemini-2.5-Pro | Gemini-2.5-Flash | Qwen3-Omni Instruct | Qwen2.5-Omni Instruct |
|---|---|---|---|---|---|
| OmniBench | 61.38 | 66.80 | 54.99 | 58.41 | 48.16 |
| WorldSense | 60.89 | 63.96 | 58.72 | 52.01 | 46.69 |
| DailyOmni | 82.38 | 80.61 | 80.78 | 69.33 | 47.45 |
| UNO-Bench | 49.90 | 64.48 | 54.30 | 42.10 | 32.60 |
The model achieves competitive results, particularly excelling in DailyOmni with a score of 82.38, outperforming both Gemini-2.5-Pro and Gemini-2.5-Flash. This demonstrates the model's strong capabilities in understanding and reasoning across multiple modalities in everyday scenarios.
Vision and Image-to-Text Performance
In vision tasks, LongCat-Flash-Omni shows strong performance across various benchmarks:
- MMBench-EN (test): 87.5
- MMBench-ZH (test): 88.7
- RealWorldQA: 74.8
- MMStar: 70.9
- MathVista (mini): 77.9
- MMMU (val): 70.7
- MM-Vet: 69.0
The model demonstrates particularly strong performance in multilingual vision understanding, achieving 88.7 on MMBench-ZH (test), which evaluates vision-language understanding in Chinese.
Audio Performance
LongCat-Flash-Omni excels in audio understanding and generation tasks, with capabilities spanning:
- ASR (Automatic Speech Recognition): Strong performance across multiple languages, enabling accurate transcription of speech in various linguistic contexts
- TTS (Text-to-Speech): High-quality speech generation capabilities with natural-sounding output
- Audio understanding: Effective processing of audio inputs for comprehension tasks, including understanding context and intent from spoken language
- Real-time streaming: Low-latency audio processing that enables interactive voice applications
The model's audio capabilities are particularly notable for real-time interaction, supporting streaming speech generation and low-latency audio processing. This makes it suitable for applications requiring immediate audio-visual responses, such as interactive assistants and live video analysis systems.
Text Performance
Despite being an omni-modal model, LongCat-Flash-Omni maintains strong text-only capabilities across multiple domains:
- Instruction Following: Competitive performance on IFEval (82.44% accuracy) and other instruction following benchmarks, demonstrating the model's ability to follow complex instructions accurately
- Mathematical Reasoning: Strong performance on MATH500 (97.60% accuracy) and AIME24 (72.92 average), showing robust mathematical problem-solving capabilities
- General Reasoning: Good performance on GPQA-diamond (74.41% accuracy) and other reasoning tasks, indicating strong logical reasoning abilities
- Coding: Solid performance on HumanEval+ (90.85% pass@1) and MBPP+ (80.16% pass@1), demonstrating practical programming capabilities
The model's ability to maintain strong text performance while excelling in multimodal tasks demonstrates the effectiveness of its curriculum-inspired progressive training strategy, which ensures comprehensive capabilities across all modalities without degradation.
Technical Specifications and Requirements
Model Architecture Details
LongCat-Flash-Omni is built on the LongCat-Flash architecture with several key components:
- MoE Architecture: 560B total parameters with 27B activated
- Shortcut-connected MoE: High-performance architecture with zero-computation experts
- Multimodal Perception Modules: Efficient encoders for processing different modalities
- Speech Reconstruction Modules: Specialized components for audio generation
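The zero-computation-expert idea can be sketched as an MoE layer in which some "experts" simply return their input unchanged, so tokens routed to them skip the FFN compute entirely. The toy layer below is a minimal illustration; the expert counts, top-k routing, and dimensions are assumptions and do not reflect LongCat-Flash-Omni's actual configuration.

```python
import torch
import torch.nn as nn

# Toy MoE layer with "zero-computation" (identity) experts.
# Sizes and routing are illustrative assumptions only.
class ToyMoE(nn.Module):
    def __init__(self, d_model=64, n_ffn_experts=4, n_zero_experts=2, top_k=1):
        super().__init__()
        self.ffn_experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_ffn_experts)
        ])
        self.n_zero_experts = n_zero_experts                  # identity experts: no FLOPs
        self.router = nn.Linear(d_model, n_ffn_experts + n_zero_experts)
        self.top_k = top_k

    def forward(self, x):                                     # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)
        expert_idx = scores.topk(self.top_k, dim=-1).indices.squeeze(-1)
        out = torch.empty_like(x)
        for i, e in enumerate(expert_idx.tolist()):
            if e >= len(self.ffn_experts):
                out[i] = x[i]                                 # zero-computation expert: pass-through
            else:
                out[i] = self.ffn_experts[e](x[i])            # regular FFN expert
        return out

tokens = torch.randn(8, 64)
print(ToyMoE()(tokens).shape)  # torch.Size([8, 64])
```

Because "easy" tokens can be routed to identity experts, the average compute per token falls below what the activated parameter count alone would suggest.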
Hardware Requirements
Due to its massive size, LongCat-Flash-Omni requires significant computational resources:
- Minimum for FP8 format: At least one node (e.g., 8×H20-141G)
- Minimum for BF16 format: At least two nodes (e.g., 16×H800-80G)
- Tensor Parallelism: Required for model distribution across devices
- Expert Parallelism: Necessary for MoE architecture
The model uses distributed inference with Tensor Parallelism (TP) and Expert Parallelism (EP) to handle its large parameter count efficiently.
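A rough back-of-envelope check shows why those node counts are the stated minimums. This counts weights only and ignores KV cache, activations, and framework overhead, so real deployments need additional headroom.

```python
# Back-of-envelope memory check for the stated hardware minimums (weights only).
TOTAL_PARAMS = 560e9

def weight_memory_gb(bytes_per_param: float) -> float:
    return TOTAL_PARAMS * bytes_per_param / 1e9

fp8_gb = weight_memory_gb(1)    # ~560 GB  -> fits on one 8 x H20-141G node (~1128 GB total)
bf16_gb = weight_memory_gb(2)   # ~1120 GB -> needs two 8 x H800-80G nodes (~1280 GB total)
print(f"FP8 weights:  {fp8_gb:.0f} GB")
print(f"BF16 weights: {bf16_gb:.0f} GB")
```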
Software Requirements
The model requires specific software infrastructure:
- Python: >= 3.10.0 (recommended: Anaconda)
- PyTorch: >= 2.8
- CUDA: >= 12.9
- SGLang: Custom branch with LongCat-Flash-Omni support
The LongCat team has implemented basic adaptations in SGLang to support running the model, though official SGLang does not yet natively support LongCat-Flash-Omni.
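Before installing the custom SGLang branch, it can be worth sanity-checking the environment against the version floors listed above. The snippet below is a simple check based on those stated requirements, not an official validation script.

```python
# Quick sanity check against the stated software requirements.
import sys
import torch

assert sys.version_info >= (3, 10), "Python >= 3.10.0 is required"
torch_major_minor = tuple(int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
assert torch_major_minor >= (2, 8), "PyTorch >= 2.8 is required"

print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__, "| CUDA:", torch.version.cuda)
```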
Open-Source Availability and Usage
Model Access
LongCat-Flash-Omni is available on Hugging Face under the MIT License. The model weights are distributed across multiple devices due to the MoE architecture, and Hugging Face Transformers or vLLM will automatically download weights based on the model name.
For environments where automatic downloading isn't feasible, users can manually download the model:
```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download meituan-longcat/LongCat-Flash-Omni --local-dir ./LongCat-Flash-Omni
```
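The same download can be done programmatically with the huggingface_hub Python API, which may be more convenient in scripted setups:

```python
# Programmatic equivalent of the CLI download above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meituan-longcat/LongCat-Flash-Omni",
    local_dir="./LongCat-Flash-Omni",
)
```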
Installation and Setup
The LongCat team provides detailed installation instructions:
- Environment Setup: Create a conda environment with Python 3.10+
- SGLang Installation: Clone the custom branch with LongCat-Flash-Omni support
- Demo Installation: Clone the LongCat-Flash-Omni repository and install dependencies
The installation process requires careful attention to dependencies and hardware configuration to ensure proper model loading and inference.
Inference Options
Users can run the model in different configurations:
- Single-node inference: For smaller deployments or testing
- Multi-node inference: For production-scale deployments requiring multiple GPU nodes
The model supports both configurations with appropriate TP and EP settings to distribute the computational load across available hardware.
Real-World Applications and Access
Web Interface
LongCat-Flash-Omni is available through a web interface at longcat.ai, though the web version currently supports audio interaction features only. The full service with complete multimodal capabilities will be provided in subsequent updates.
Mobile Applications
The LongCat team has released mobile applications for both Android and iOS:
- Android: Available through QR code download
- iOS: Available in the App Store (currently Chinese App Store only) by searching "LongCat"
These mobile applications provide access to LongCat-Flash-Omni's capabilities on mobile devices, enabling users to interact with the model through their smartphones and tablets.
Why It Matters
Advancing Open-Source Omni-Modal AI
LongCat-Flash-Omni represents a significant contribution to the open-source AI community:
- State-of-the-art performance: Competitive with proprietary models like Gemini-2.5-Pro
- Open-source availability: MIT License enables research and commercial use
- Comprehensive capabilities: Unified framework for multiple modalities
- Real-time interaction: Low-latency audio-visual processing
The model's open-source release enables researchers and developers worldwide to access cutting-edge omni-modal AI capabilities without relying on proprietary APIs or services.
Technical Innovation
The model introduces several technical innovations:
- Modality-Decoupled Parallelism: Efficient training scheme for multimodal models
- Shortcut-connected MoE: High-performance architecture with zero-computation experts
- Chunk-wise audio-visual interleaving: Efficient mechanism for real-time processing
- Curriculum-inspired progressive training: Effective strategy for multimodal learning
These innovations contribute to the broader AI research community and may influence future model architectures and training methodologies.
Practical Applications
LongCat-Flash-Omni's capabilities enable various practical applications:
- Real-time video analysis: Processing and understanding video content in real-time
- Interactive voice assistants: Natural conversation with audio-visual understanding
- Multimodal content creation: Generating content across text, image, audio, and video
- Long-context reasoning: Applications requiring understanding of extended multimodal contexts
The model's 128K token context window and real-time processing capabilities make it suitable for applications requiring both depth and speed.
Limitations and Considerations
Computational Requirements
The model's massive size presents significant challenges:
- High hardware requirements: Requires at least a full node of high-end data-center GPUs (e.g., 8×H20-141G for FP8, or two such nodes for BF16)
- Infrastructure complexity: Multi-node setups add operational complexity
- Cost considerations: Running the model requires substantial computational resources
These requirements may limit accessibility for smaller organizations or individual researchers, though cloud-based solutions may help address this challenge.
Current Limitations
The LongCat team acknowledges several limitations:
- Web interface limitations: Currently supports audio interaction only
- iOS availability: Limited to Chinese App Store initially
- Software dependencies: Requires custom SGLang branch, not yet in official release
- Evaluation scope: Model not comprehensively evaluated for every possible downstream application
Developers should carefully assess accuracy, safety, and fairness before deploying the model in sensitive or high-risk scenarios.
Responsible Use
As with all large language models, users should consider:
- Performance variations: May vary across different languages and domains
- Safety and fairness: Requires careful evaluation for sensitive applications
- Legal compliance: Users must comply with applicable laws and regulations
- Data protection: Consider privacy and content safety requirements
The MIT License does not grant rights to use Meituan trademarks or patents, and users should understand and comply with all applicable terms.
Conclusion
LongCat-Flash-Omni represents a significant milestone in open-source omni-modal AI, combining state-of-the-art performance with real-time audio-visual interaction capabilities. The model's 560 billion parameters, unified multimodal framework, and competitive benchmark results demonstrate the potential of open-source AI to match or exceed proprietary solutions.
The model's release under the MIT License, combined with comprehensive technical documentation and accessible interfaces, makes advanced omni-modal AI capabilities available to researchers and developers worldwide. While the computational requirements are substantial, the model's performance and capabilities justify the investment for many applications.
As the AI community continues to push the boundaries of multimodal understanding, LongCat-Flash-Omni provides a valuable open-source foundation for future research and development. The model's innovations in training methodology, architecture design, and real-time processing will likely influence the next generation of multimodal AI systems.
Explore more about multimodal AI, large language models, and mixture-of-experts in our Glossary, and learn about other AI models in our Models catalog.