Meta Omnilingual ASR: 1,600+ Languages Support

Meta introduces Omnilingual ASR supporting 1,600+ languages including 500 low-resource languages, with in-context learning for new languages.

by HowAIWorks Team
Meta, ASR, Speech Recognition, Multilingual AI, wav2vec, Low-Resource Languages, AI Research, FAIR, Open Source

Introduction

Meta's Fundamental AI Research (FAIR) team has introduced Omnilingual Automatic Speech Recognition (ASR), a groundbreaking suite of models that deliver automatic speech recognition for more than 1,600 languages, including 500 low-resource languages never before transcribed by AI. This announcement represents a significant step toward making speech-to-text technology universally accessible, addressing the digital divide that has left many language communities without high-quality transcription capabilities.

The release includes Omnilingual wav2vec 2.0, a massively multilingual speech representation model trained with self-supervised learning and scaled up to 7B parameters, which can be leveraged for other downstream speech-related tasks. Meta is also releasing the Omnilingual ASR Corpus, a unique collection of transcribed speech in 350 underserved languages, curated in collaboration with global partners.

What sets Omnilingual ASR apart is its ability to extend to entirely new languages with just a few in-context examples—a paradigm shift from traditional systems that require expert-driven fine-tuning. This "bring your own language" approach makes speech technology accessible to communities that previously lacked the resources or expertise to develop ASR systems for their languages.

Omnilingual ASR: Unprecedented scale and performance

Architecture and capabilities

Omnilingual ASR addresses the fundamental challenge of scaling automatic speech recognition to thousands of languages by introducing two architectural variants. First, Meta scaled its previous wav2vec 2.0 speech encoder to 7B parameters for the first time, producing rich, massively multilingual semantic representations from raw, untranscribed speech data.

The system then uses two decoder variants to map those representations into character tokens:

  • CTC decoder: Relies on the traditional connectionist temporal classification (CTC) objective (greedy decoding is sketched below the list)
  • LLM-ASR decoder: Leverages a transformer decoder, commonly used in LLMs, introducing a step change in ASR performance
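
To make the CTC option concrete, the sketch below shows the standard greedy decoding rule used with CTC-trained models: pick the most likely token per frame, collapse repeated tokens, and drop blanks. This is a generic PyTorch illustration, not Meta's implementation, and the toy vocabulary is invented for the example.

import torch

def ctc_greedy_decode(logits: torch.Tensor, id_to_char: dict, blank_id: int = 0) -> str:
    """logits: (time, vocab) frame-level scores produced by the acoustic encoder."""
    frame_ids = logits.argmax(dim=-1).tolist()          # best token per frame
    chars, prev = [], None
    for token_id in frame_ids:
        if token_id != prev and token_id != blank_id:   # collapse repeats, skip blanks
            chars.append(id_to_char[token_id])
        prev = token_id
    return "".join(chars)

# Toy usage with a 3-token vocabulary: 0 = blank, 1 = "h", 2 = "i"
toy_logits = torch.tensor([
    [0.1, 0.8, 0.1],    # frame predicts "h"
    [0.1, 0.8, 0.1],    # repeated "h" is collapsed
    [0.9, 0.05, 0.05],  # blank frame is dropped
    [0.1, 0.1, 0.8],    # frame predicts "i"
])
print(ctc_greedy_decode(toy_logits, {1: "h", 2: "i"}))  # -> "hi"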

The 7B-LLM-ASR system achieves state-of-the-art performance across 1,600+ languages, with character error rates (CER) below 10 for 78% of those languages. This represents unprecedented quality at an unprecedented scale, making high-quality transcriptions available to language communities that have been historically underserved by AI technology.
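
For readers unfamiliar with the metric, character error rate is the edit distance between hypothesis and reference transcripts, measured in characters and divided by the reference length, usually reported as a percentage; a CER below 10 therefore means fewer than one character error per ten reference characters on average. A minimal, self-contained computation is sketched below; it is illustrative and not Meta's evaluation code.

def character_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return 100.0 * dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(character_error_rate("omnilingual speech", "omnilingal speach"))  # ~11.1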

Performance metrics

Key performance characteristics of Omnilingual ASR include:

  • 1,600+ languages supported: Including 500 low-resource languages never before transcribed by AI
  • 78% of languages achieve CER < 10: Demonstrating high-quality transcription across the vast majority of supported languages
  • State-of-the-art performance: Outperforming existing systems across the full range of supported languages
  • Scalable architecture: From lightweight 300M versions to powerful 7B models

The system's ability to handle such a diverse range of languages—from widely spoken languages to those with minimal digital presence—demonstrates the power of self-supervised learning and massive multilingual training.

Bring your own language: In-context learning for ASR

Paradigm shift in language extension

One of the most significant innovations in Omnilingual ASR is its ability to extend to entirely new languages with just a few in-context examples. This represents a fundamental shift from traditional ASR systems, where languages not included at release time could only be added through expert-driven fine-tuning—a path inaccessible to most communities.

The LLM-inspired system brings in-context learning capabilities from the field of large language models to speech recognition. In practice, this means that a speaker of an unsupported language can provide only a handful of paired audio-text samples and obtain usable transcription quality, without large-scale training data, specialized expertise, or access to high-end compute.
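
As a rough illustration of what this workflow might look like, the sketch below assembles a handful of paired audio-text examples and passes them as context to a transcription call. The OmnilingualASR class, the transcribe method, the model name, and the file paths are hypothetical placeholders used only to convey the idea; consult Meta's official release for the actual interface.

from dataclasses import dataclass

@dataclass
class PairedExample:
    audio_path: str   # short recording in the new language
    text: str         # its human-provided transcription

# A handful of paired audio-text samples supplied by a community member
context_examples = [
    PairedExample("samples/utt_01.wav", "first transcribed sentence in the new language"),
    PairedExample("samples/utt_02.wav", "second transcribed sentence in the new language"),
    PairedExample("samples/utt_03.wav", "third transcribed sentence in the new language"),
]

# Hypothetical calls, shown only to illustrate the few-shot workflow:
# model = OmnilingualASR.load("7B-LLM-ASR")                    # placeholder loader
# transcript = model.transcribe("samples/new_utterance.wav",
#                               context=context_examples)      # in-context conditioning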

Practical implications

This approach has profound implications for language preservation and digital inclusion:

  • Accessibility: Communities can add their languages without requiring AI expertise
  • Scalability: The barrier to entry is dramatically reduced
  • Preservation: Endangered and low-resource languages can be brought into the digital age
  • Democratization: Speech technology becomes accessible to communities worldwide

While few-shot performance obtained this way cannot yet match that of fully trained systems, it offers a far more scalable path to bringing new languages into digital reach. This democratizes access to speech technology and empowers language communities to preserve and digitize their linguistic heritage.

A suite of models for various use cases

Model variants

Meta is releasing a full suite of models designed for different use cases and computational constraints:

  • 300M models: Lightweight versions designed for low-power devices and edge computing
  • 7B models: Powerful versions offering top-tier accuracy for a variety of use cases
  • CTC and LLM-ASR variants: Different decoding approaches optimized for different scenarios

The general-purpose wav2vec 2.0 speech foundation model is also made available in various sizes. Researchers and developers alike can use it for speech-related tasks beyond ASR, such as the following (a short feature-extraction sketch appears after the list):

  • Speech translation
  • Voice activity detection
  • Speaker identification
  • Emotion recognition
  • Other downstream audio processing tasks
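
As a small, hedged example of this feature-extraction pattern, the sketch below pulls frame-level representations from a publicly available wav2vec 2.0 checkpoint via the Hugging Face transformers library and mean-pools them into an utterance embedding that a downstream classifier (speaker ID, emotion recognition, and so on) could consume. It uses the small facebook/wav2vec2-base checkpoint for illustration; the Omnilingual 7B encoder is distributed through Meta's own release.

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
encoder.eval()

waveform = torch.randn(16000)  # 1 second of 16 kHz audio (placeholder signal)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden_states = encoder(**inputs).last_hidden_state  # (1, frames, hidden_dim)

# Mean-pool frame representations into one utterance embedding, then feed it
# to any lightweight downstream classifier (speaker ID, emotion, etc.).
utterance_embedding = hidden_states.mean(dim=1)           # (1, hidden_dim)
print(utterance_embedding.shape)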

Open source availability

The models and code are released under a permissive Apache 2.0 license, while the data is provided under a CC-BY license. The models are based on FAIR's open source fairseq2 framework, empowering researchers, developers, and language advocates worldwide to:

  • Advance speech solutions for their own use cases
  • Tailor models for specific languages or domains
  • Build upon the latest tools and technologies in the PyTorch ecosystem
  • Contribute to the democratization of speech technology

This open approach ensures that the benefits of Omnilingual ASR extend beyond Meta's immediate use cases, enabling innovation across the global research and development community.

Built with global partners

Training corpus assembly

Omnilingual ASR's training corpus is one of the largest ever assembled for ASR in both volume and linguistic diversity. It integrates:

  • Publicly available datasets: Leveraging existing open-source speech data
  • Community-sourced recordings: Collected through multiple partnerships
  • Commissioned data: Native speakers recruited and compensated by local organizations

To reach languages with little or no digital presence, Meta worked with local organizations that recruited and compensated native speakers, often in remote or under-documented regions. This collaborative approach was essential for gathering authentic speech data from communities that are typically underrepresented in digital datasets.

The Omnilingual ASR Corpus

Meta is releasing the commissioned part of the training corpus as the Omnilingual ASR Corpus to further benefit the ASR research community. To date, it is the largest ultra-low-resource spontaneous ASR dataset ever made available, covering hundreds of languages never seen before by ASR systems.

The corpus includes the following (a data-loading sketch follows this list):

  • 350 underserved languages: Languages with minimal existing digital resources
  • Spontaneous speech: Natural, conversational recordings rather than scripted content
  • Diverse speakers: Representing various dialects, accents, and speaking styles
  • Cultural context: Preserving linguistic and cultural authenticity
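
For researchers who want to work with the released data, the sketch below shows a typical way to stream a speech corpus with the Hugging Face datasets library. The dataset identifier, configuration name, and field names are placeholders rather than confirmed values; check Meta's release page for the actual location and schema.

from datasets import load_dataset

corpus = load_dataset(
    "facebook/omnilingual-asr-corpus",   # placeholder identifier, not confirmed
    name="lig_Latn",                     # placeholder language/config name
    split="train",
    streaming=True,                      # stream instead of downloading everything
)

for example in corpus.take(3):
    audio = example["audio"]             # commonly {"array": ..., "sampling_rate": ...}
    print(example["transcription"], audio["sampling_rate"])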

Partnership network

Beyond commissioned partnerships, collaborations through the Language Technology Partner Program have brought together:

  • Linguists: Providing essential linguistic expertise
  • Researchers: Contributing technical knowledge and validation
  • Language communities: Ensuring cultural and linguistic authenticity

Meta joined forces with organizations such as:

  • Mozilla Foundation's Common Voice: A community-driven initiative for open speech data
  • Lanfrica/NaijaVoices: Working directly with local communities in Africa

These partnerships have been instrumental in infusing Omnilingual ASR with deep linguistic knowledge and cultural understanding, ensuring that the technology meets local needs and empowers diverse language communities globally.

Technical innovations

Scaling wav2vec 2.0 to 7B parameters

The scaling of wav2vec 2.0 to 7B parameters represents a significant technical achievement. This massively multilingual speech representation model:

  • Learns from untranscribed speech: Leveraging self-supervised learning to extract meaningful representations
  • Captures cross-lingual patterns: Identifying commonalities across languages
  • Enables transfer learning: Providing rich representations that downstream tasks can build on
  • Supports in-context learning: Enabling few-shot adaptation to new languages

The 7B parameter scale allows the model to capture more nuanced linguistic patterns and cross-lingual relationships, enabling better performance on low-resource languages that share characteristics with better-resourced languages.
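
One common way to exploit such a pretrained encoder, sketched below under stated assumptions, is to freeze it and train only a small character-level CTC head for a new language. The checkpoint name, vocabulary size, and toy data are placeholders; with the released 7B encoder the same pattern would simply swap in the larger model.

import torch
from torch import nn
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
for p in encoder.parameters():
    p.requires_grad = False          # keep the multilingual representations fixed

vocab_size = 40                      # blank + characters of the new language (placeholder)
ctc_head = nn.Linear(encoder.config.hidden_size, vocab_size)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.AdamW(ctc_head.parameters(), lr=1e-4)

# One toy training step with random stand-in data
audio = torch.randn(1, 16000)                     # (batch, samples) at 16 kHz
targets = torch.randint(1, vocab_size, (1, 12))   # character ids of the transcript
with torch.no_grad():
    frames = encoder(audio).last_hidden_state     # (1, frames, hidden)
log_probs = ctc_head(frames).log_softmax(-1).transpose(0, 1)  # (frames, batch, vocab)
loss = ctc_loss(log_probs, targets,
                input_lengths=torch.tensor([log_probs.size(0)]),
                target_lengths=torch.tensor([12]))
loss.backward()
optimizer.step()
print(float(loss))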

LLM-ASR architecture

The LLM-ASR approach introduces a transformer decoder architecture, commonly used in large language models, to speech recognition. This innovation:

  • Enables in-context learning: Bringing few-shot capabilities to ASR
  • Improves long-tail performance: Better handling of low-resource languages
  • Supports flexible decoding: More adaptable to different languages and domains
  • Reduces data requirements: Less training data needed for new languages

The combination of the 7B wav2vec 2.0 encoder with the transformer decoder creates a powerful system that can generalize across languages and adapt quickly to new ones.
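
The sketch below illustrates the structural idea only, using a tiny randomly initialized PyTorch decoder rather than Meta's actual architecture or weights: character tokens are generated autoregressively while each decoding step cross-attends to the encoder's speech representations. All sizes and the start-token id are toy placeholders.

import torch
from torch import nn

d_model, vocab_size, max_len = 256, 100, 64     # toy sizes, placeholders

embed = nn.Embedding(vocab_size, d_model)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
to_vocab = nn.Linear(d_model, vocab_size)

speech_frames = torch.randn(1, 200, d_model)    # stand-in for encoder output

bos_id = 1
tokens = torch.tensor([[bos_id]])               # start-of-sequence token
for _ in range(max_len):
    tgt = embed(tokens)
    seq_len = tokens.size(1)
    causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
    hidden = decoder(tgt, memory=speech_frames, tgt_mask=causal_mask)
    next_id = to_vocab(hidden[:, -1]).argmax(-1, keepdim=True)   # greedy choice
    tokens = torch.cat([tokens, next_id], dim=1)                 # append character

print(tokens.shape)  # (1, max_len + 1) generated character ids

Because the decoder consumes context tokens the same way an LLM does, conditioning it on a few paired audio-text examples is what makes the in-context behavior described earlier possible.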

Why it matters

Breaking down language barriers

Omnilingual ASR represents a significant step toward delivering a truly universal transcription system and expanding access to speech technology worldwide. This work supports Meta's goal of building technology to help bring the world closer together by:

  • Enabling communication: Making speech accessible across diverse linguistic backgrounds
  • Preserving languages: Helping digitize and preserve endangered languages
  • Reducing digital divide: Providing speech technology to underserved communities
  • Supporting accessibility: Enabling speech-to-text for users worldwide

Impact on low-resource languages

The inclusion of 500 low-resource languages that have never been transcribed by AI before has profound implications:

  • Digital inclusion: Bringing these languages into the digital age
  • Cultural preservation: Enabling documentation and preservation of linguistic heritage
  • Educational access: Supporting language learning and education
  • Economic opportunity: Enabling new applications and services for these communities

Research and development implications

The release of Omnilingual ASR, including models, data, and code, enables:

  • Further research: Researchers can build upon these foundations
  • Commercial applications: Developers can integrate ASR into their products
  • Language advocacy: Communities can develop tools for their languages
  • Innovation: New applications and use cases can emerge

The open-source nature of the release ensures that the benefits extend far beyond Meta's immediate applications, fostering innovation across the global AI and language technology community.

Conclusion

Meta's introduction of Omnilingual ASR represents a landmark achievement in making automatic speech recognition truly universal. With support for over 1,600 languages—including 500 low-resource languages never before transcribed by AI—the system addresses a critical gap in digital accessibility and language preservation.

The ability to extend to new languages with just a few in-context examples democratizes access to speech technology, empowering communities worldwide to bring their languages into the digital age. Combined with the release of the Omnilingual ASR Corpus and open-source models, this work provides the foundation for continued innovation in multilingual speech recognition.

As AI continues to evolve, ensuring that technology serves all language communities—not just those with abundant digital resources—becomes increasingly important. Omnilingual ASR takes a significant step toward this goal, breaking down language barriers and expanding access to speech technology worldwide.

Explore more about speech recognition and audio processing in our Glossary, and learn about Meta's AI research and other developments in our Blog.

Frequently Asked Questions

What is Omnilingual ASR?
Omnilingual ASR is Meta's suite of automatic speech recognition models that support over 1,600 languages, including 500 low-resource languages never before transcribed by AI. It uses a 7B-parameter wav2vec 2.0 encoder and LLM-inspired decoders.

How does it differ from traditional ASR systems?
Unlike traditional systems that require expert-driven fine-tuning, Omnilingual ASR can extend to entirely new languages with just a few in-context examples, bringing in-context learning capabilities from LLMs to speech recognition.

How accurate is it?
The 7B-LLM-ASR system achieves state-of-the-art performance across 1,600+ languages, with character error rates (CER) below 10 for 78% of those languages.

What is being released?
Meta is releasing the full suite of ASR models (300M to 7B parameters), the Omnilingual wav2vec 2.0 speech foundation model, and the Omnilingual ASR Corpus with transcribed speech in 350 underserved languages.

Under what licenses are the releases available?
All models are released under the Apache 2.0 license, while the data is provided under a CC-BY license, making them freely available for research and commercial use.

How was the training data collected?
The training corpus integrates publicly available datasets with community-sourced recordings collected through partnerships with local organizations, Mozilla Foundation's Common Voice, and Lanfrica/NaijaVoices, often in remote or under-documented regions.
