NVIDIA Omni-Embed-Nemotron-3B: Unified Multimodal Retrieval Model for Text, Image, Audio, and Video

NVIDIA releases Omni-Embed-Nemotron-3B, a versatile multimodal embedding model for text, image, audio, and video content in RAG systems.

by HowAIWorks Team
ai, nvidia, multimodal, embedding, retrieval, rag, nemotron, text-embedding, video-retrieval, audio-retrieval, ai-models, artificial-intelligence

Introduction

NVIDIA has released Omni-Embed-Nemotron-3B, a multimodal embedding model for retrieval-augmented generation (RAG) systems. The model can encode text, image, audio, and video content, either individually or in combination, enabling cross-modal retrieval capabilities that were previously challenging to achieve.

Built on the foundation of the Qwen2.5-Omni-3B architecture, Omni-Embed-Nemotron-3B is specifically designed to serve as a foundational component in multimodal RAG systems, opening new possibilities for AI applications that need to understand and retrieve information across diverse content types.

Model Architecture and Design

Foundation and Architecture

Omni-Embed-Nemotron-3B is built on the Qwen2.5-Omni-3B foundation model, but with a crucial architectural difference:

Thinker-Only Design:

  • Utilizes only the "Thinker" component from the original Thinker-Talker architecture
  • Focuses on multimodal understanding rather than response generation
  • Optimized specifically for embedding and retrieval tasks
  • Approximately 4.7 billion parameters in total (see the note on naming below)

Note on Model Naming: The "3B" in the model name refers to the underlying Qwen2.5-Omni-3B language model, while the total parameter count (approximately 4.7B) also includes the vision and audio encoders that are part of the Thinker component.

Multimodal Processing:

  • Vision Encoder: Processes images and video frames
  • Audio Encoder: Handles audio content and soundtracks
  • Language Model: Processes text content and captions
  • Independent Processing: Audio and video are encoded separately to preserve temporal structure

Key Architectural Innovations

Separate Audio and Video Streams: Unlike the original Omni model that interleaves audio and video tokens with TMRoPE, Omni-Embed-Nemotron-3B keeps audio and video streams separate. This design choice:

  • Preserves full temporal structure of both audio and video
  • Improves retrieval performance compared to interleaved approaches
  • Enables more accurate cross-modal matching
  • Maintains the integrity of temporal information

Bi-Encoder Architecture: The model employs a bi-encoder design where:

  • Queries and candidate inputs are embedded independently
  • Contrastive learning aligns relevant query-content pairs
  • Unrelated pairs are pushed apart in the shared embedding space
  • Enables efficient similarity computation and retrieval
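
As an illustration of the contrastive alignment described above, the sketch below shows a standard in-batch, InfoNCE-style objective for a bi-encoder. The loss form, temperature value, and negative-sampling strategy here are illustrative assumptions, not published training details of Omni-Embed-Nemotron-3B.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    # Each query is paired with its own document; all other documents in the
    # batch act as negatives (the temperature value is an illustrative choice).
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    logits = query_emb @ doc_emb.T / temperature          # [B, B] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Random tensors stand in for real query/document embeddings
loss = in_batch_contrastive_loss(torch.randn(8, 2048), torch.randn(8, 2048))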

Technical Specifications

Input Capabilities

Supported Modalities:

  • Text: Strings, lists of strings, or pre-tokenized text
  • Images: PIL.Image, numpy arrays, or torch tensors
  • Video: Video files (.mp4), numpy arrays, or torch tensors
  • Audio: Waveform arrays, torch tensors, or audio files

Input Processing:

  • Maximum Context Length: 32,768 tokens
  • Batch Processing: Supports batched inputs across all modalities
  • Flexible Input: Can process any combination of modalities in a single query
  • Format Support: Multiple input formats for each modality type

Output Specifications

Embedding Characteristics:

  • Dimension: 2048-dimensional float vectors
  • Normalization: L2-normalized embeddings for efficient similarity computation
  • Consistency: Same embedding space across all modalities
  • Optimization: Trained specifically for retrieval rather than generation

Performance Metrics:

  • Model Size: Approximately 4.7 billion parameters (see the note on model naming above)
  • Precision: BF16 (bfloat16) for efficient computation
  • Hardware: Optimized for NVIDIA GPU-accelerated systems
  • Memory: Efficient memory usage with flash attention implementation

Multimodal Retrieval Capabilities

Cross-Modal Retrieval

Text-to-Video Retrieval:

  • Find relevant videos using text queries
  • Support for complex video content understanding
  • Temporal understanding of video sequences
  • High accuracy on video retrieval benchmarks

Text-to-Audio Retrieval:

  • Locate audio content using text descriptions
  • Support for various audio formats and lengths
  • Understanding of audio content semantics
  • Effective for music and speech retrieval

Visual Document Retrieval:

  • Process and retrieve document images
  • Support for complex document layouts
  • Text extraction and understanding from images
  • Integration with document management systems

Advanced Retrieval Features

Multimodal Queries:

  • Support for queries combining multiple modalities
  • Complex cross-modal reasoning capabilities
  • Flexible query construction
  • Rich context understanding

Custom Retrieval Tasks:

  • Adaptable to specific domain requirements
  • Support for custom similarity metrics
  • Integration with existing retrieval systems
  • Extensible architecture for new modalities
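
For the "integration with existing retrieval systems" point above, here is a minimal sketch of serving the model's L2-normalized embeddings from an off-the-shelf vector index such as FAISS. The index choice and the random placeholder vectors are illustrative assumptions, not an officially documented integration.

import numpy as np
import faiss

dim = 2048  # embedding dimension of Omni-Embed-Nemotron-3B

# Placeholder corpus embeddings; in practice these come from the model
doc_embeddings = np.random.randn(1000, dim).astype(np.float32)
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

index = faiss.IndexFlatIP(dim)   # inner product == cosine for L2-normalized vectors
index.add(doc_embeddings)

# Placeholder query embedding
query = np.random.randn(1, dim).astype(np.float32)
query /= np.linalg.norm(query, axis=1, keepdims=True)

scores, ids = index.search(query, 5)   # top-5 most similar documents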

Training and Evaluation

Training Datasets

The model was trained on a diverse collection of publicly available datasets:

Text Datasets:

  • HotpotQA: Complex question-answering
  • MIRACL: Multilingual information retrieval
  • Natural Questions (NQ): Real-world question answering
  • Stack Exchange: Technical Q&A content
  • SQuAD: Reading comprehension
  • Tiger Math/Stack: Mathematical reasoning

Multimodal Datasets:

  • DocMatix-IR: Document retrieval
  • Vidore-ColPali-Training: Visual document retrieval
  • Wiki-SS-NQ: Wikipedia-based question answering

Data Scale:

  • Images: Approximately 100 million to 1 billion images across training datasets
  • Text: Several hundred million tokens from diverse text sources
  • Multimodal Content: Combined text-image pairs, video-audio-text triplets
  • Diverse Sources: Multiple languages, domains, and content types
  • Quality Control: Curated and validated datasets with proper licensing

Performance Benchmarks

Video Retrieval Performance: The model demonstrates strong performance on video retrieval benchmarks:

LPM Dataset (NDCG@10):

  • Strong performance across different modalities
  • Competitive text-to-video retrieval capabilities
  • Effective cross-modal understanding
  • Performance varies by modality type and query complexity

FineVideo Dataset (NDCG@10):

  • High-quality video content retrieval performance
  • Effective temporal understanding of video sequences
  • Robust performance across diverse video types
  • Consistent results across different video lengths and formats

Text Retrieval Benchmarks:

  • Strong performance on standard text retrieval tasks (nDCG@10 metrics)
  • Effective semantic understanding across multiple languages
  • Competitive with specialized text embedding models
  • Consistent performance across different text lengths and complexities

ViDoRe V1 Performance (nDCG@5):

  • Document retrieval capabilities with visual understanding
  • Effective processing of complex document layouts
  • Cross-modal document search across text and visual elements
  • Robust performance on document-image retrieval tasks

Software Integration and Usage

Installation Requirements

Dependencies:

pip install git+https://github.com/huggingface/transformers.git@v4.51.3-Qwen2.5-Omni-preview
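
The usage example below also imports process_mm_info from the qwen_omni_utils helper package, which is distributed separately; if it is not already in your environment, it can typically be installed from PyPI:

pip install qwen-omni-utils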

Hardware Requirements:

  • NVIDIA GPU with CUDA support
  • Recommended: A100 40GB, A100 80GB, or H100 80GB
  • Linux operating system
  • TensorRT and Triton support
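
As a rough sizing estimate: at BF16 precision (2 bytes per parameter), the roughly 4.7 billion parameters alone occupy about 9.4 GB of GPU memory before activations and long multimodal inputs are accounted for, leaving comfortable headroom on the recommended 40 GB-class and larger GPUs.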

Usage Example

Basic Implementation:

import torch
from qwen_omni_utils import process_mm_info
import torch.nn.functional as F
from transformers import AutoModel, AutoProcessor

# Load the model in BF16 with FlashAttention-2 (a CUDA-capable GPU is assumed)
model_name_or_path = "nvidia/omni-embed-nemotron-3b"
model = AutoModel.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
).eval().to("cuda")

# Prepare multimodal content
documents = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "passage: This is a passage to be embedded"
            },
            {
                "type": "video",
                "video": "path/to/video.mp4"
            },
            {
                "type": "audio",
                "audio": "path/to/audio.wav"
            }
        ]
    },
]

# Load the processor and prepare the multimodal inputs
processor = AutoProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
documents_texts = processor.apply_chat_template(documents, add_generation_prompt=False, tokenize=False)
audio, images, videos = process_mm_info(documents, use_audio_in_video=False)

# Encode all modalities into a single batch of model inputs
batch_dict = processor(
    text=documents_texts, 
    images=images, 
    videos=videos, 
    audio=audio,
    return_tensors="pt",
    text_kwargs={"truncation": True, "padding": True, "max_length": 204800},
    videos_kwargs={"min_pixels": 32*14*14, "max_pixels": 64*28*28, "use_audio_in_video": False},
    audio_kwargs={"max_length": 2048000},
)

# Compute embeddings: mean-pool the final hidden states over non-padding tokens, then L2-normalize
batch_dict = batch_dict.to(model.device)
with torch.no_grad():
    last_hidden_states = model(**batch_dict, output_hidden_states=True).hidden_states[-1]
    attention_mask = batch_dict["attention_mask"]
    last_hidden_states_masked = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    embedding = last_hidden_states_masked.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    embedding = F.normalize(embedding, dim=-1)
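
For retrieval, queries are embedded with the same recipe and compared to document embeddings by dot product (the vectors are already L2-normalized, so this equals cosine similarity). The snippet below is a minimal sketch that reuses the variables from the example above; the "query:" prefix mirrors the "passage:" prefix and is an assumption based on that convention rather than documented usage.

# Embed a text-only query with the same mean-pooling recipe (sketch)
query_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "query: a passage about multimodal retrieval"},
        ],
    },
]
query_text = processor.apply_chat_template(query_messages, add_generation_prompt=False, tokenize=False)
query_inputs = processor(
    text=query_text,
    return_tensors="pt",
    text_kwargs={"truncation": True, "padding": True, "max_length": 204800},
).to(model.device)

with torch.no_grad():
    hidden = model(**query_inputs, output_hidden_states=True).hidden_states[-1]
    mask = query_inputs["attention_mask"]
    hidden = hidden.masked_fill(~mask[..., None].bool(), 0.0)
    query_embedding = F.normalize(hidden.sum(dim=1) / mask.sum(dim=1)[..., None], dim=-1)

# Cosine similarity reduces to a dot product for L2-normalized embeddings
scores = query_embedding @ embedding.T   # shape: [num_queries, num_documents]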

Integration Options

Runtime Engines:

  • TensorRT: Optimized inference on NVIDIA hardware
  • Triton: Scalable model serving
  • Custom Integration: Direct API integration

Supported Hardware:

  • A100 40GB and 80GB
  • H100 80GB
  • Other CUDA-compatible GPUs
  • Linux-based systems

Applications and Use Cases

Multimodal RAG Systems

Enhanced Retrieval:

  • Retrieve relevant content across multiple modalities
  • Support for complex queries combining text, images, audio, and video
  • Improved context understanding for language models
  • More accurate and comprehensive retrieval results

Conversational AI:

  • Rich input understanding across modalities
  • Support for multimedia conversations
  • Enhanced context awareness
  • Improved user experience

Multimedia Search Engines

Cross-Modal Search:

  • Find images using text descriptions
  • Locate videos using audio queries
  • Search audio content using visual cues
  • Complex multimodal search capabilities

Content Discovery:

  • Intelligent content recommendation
  • Similarity-based content discovery
  • Cross-platform content matching
  • Enhanced user engagement

Enterprise Applications

Document Management:

  • Visual document retrieval
  • Audio document processing
  • Video content indexing
  • Comprehensive content search

Knowledge Management:

  • Multimodal knowledge bases
  • Cross-modal information retrieval
  • Enhanced search capabilities
  • Improved information discovery

Research and Development

Research Applications

Academic Research:

  • Multimodal understanding research
  • Cross-modal learning studies
  • Retrieval system development
  • AI model evaluation

Industry Research:

  • Product development
  • User experience research
  • Content analysis
  • Recommendation systems

Development Opportunities

Custom Applications:

  • Domain-specific retrieval systems
  • Specialized multimodal applications
  • Integration with existing systems
  • Custom model fine-tuning

Open Source Ecosystem:

  • Community contributions
  • Model improvements
  • Application development
  • Research collaboration

Licensing and Terms

License Information

NVIDIA OneWay Noncommercial License:

  • Research and development use only
  • Non-commercial applications
  • Academic and research institutions
  • Open source projects

Additional Terms:

  • NVIDIA Software and Model Evaluation License applies
  • Qwen Research License Agreement applies
  • Third-party open source software included
  • Review license terms before use
  • Compliance with all applicable terms

Usage Restrictions

Intended Use:

  • Research and development
  • Non-commercial applications
  • Educational purposes
  • Open source projects

Prohibited Uses:

  • Commercial applications without proper licensing
  • Military applications
  • Surveillance systems
  • Harmful or illegal activities

Future Development and Roadmap

Planned Enhancements

Model Improvements:

  • Enhanced multimodal understanding
  • Improved retrieval accuracy
  • Better cross-modal alignment
  • Extended modality support

Performance Optimizations:

  • Faster inference times
  • Reduced memory requirements
  • Better hardware utilization
  • Improved scalability

Feature Additions:

  • Additional modality support
  • Enhanced query capabilities
  • Better integration tools
  • Improved documentation

Community Contributions

Open Source Development:

  • Community-driven improvements
  • Bug fixes and optimizations
  • New feature development
  • Documentation enhancements

Research Collaboration:

  • Academic partnerships
  • Industry collaboration
  • Joint research projects
  • Knowledge sharing

Conclusion

NVIDIA Omni-Embed-Nemotron-3B represents a significant advancement in multimodal embedding technology, providing a unified solution for retrieval across text, image, audio, and video modalities. This model opens new possibilities for AI applications that need to understand and retrieve information across diverse content types.

Key Achievements:

  • Unified Multimodal Processing: Single model for text, image, audio, and video embedding
  • Cross-Modal Retrieval: Advanced capabilities for finding content across different modalities
  • High Performance: Strong results on multiple retrieval benchmarks
  • Research Focus: Designed specifically for research and development applications
  • Open Architecture: Built on proven Qwen2.5-Omni foundation

Future Impact:

Omni-Embed-Nemotron-3B has the potential to accelerate the development of sophisticated multimodal AI applications by providing a robust foundation for retrieval-augmented generation systems. The model's ability to understand and process multiple modalities simultaneously represents a significant step forward in creating more intelligent and context-aware AI systems.

The release of this model marks an important milestone in the evolution of multimodal AI, providing researchers and developers with powerful tools to build the next generation of AI applications that can truly understand and work with the rich, multimodal nature of human communication and content.

Interested in learning more about multimodal AI and embedding models? Explore our AI fundamentals courses to understand how AI models work, check out our glossary of AI terms for key concepts like embedding and retrieval-augmented generation, or discover the latest AI models and AI tools in our comprehensive catalog.

Frequently Asked Questions

What is Omni-Embed-Nemotron-3B?
Omni-Embed-Nemotron-3B is a versatile multimodal embedding model that can encode content across text, image, audio, and video modalities, either individually or in combination, and is designed for multimodal retrieval-augmented generation (RAG) systems.

Which retrieval tasks does the model support?
The model supports text-to-video retrieval, text-to-audio retrieval, visual document retrieval, and custom multimodal embedding tasks. It can process any combination of text, images, audio, and video content and produces 2048-dimensional embeddings.

What architecture is it built on?
The model is based on the Qwen2.5-Omni-3B architecture and uses only the Thinker component for multimodal understanding. It has approximately 4.7 billion parameters and outputs 2048-dimensional embeddings with a maximum context length of 32,768 tokens.

How does it differ from traditional embedding models?
Unlike single-modal embedding models, Omni-Embed-Nemotron-3B can process multiple modalities simultaneously and supports cross-modal retrieval, so a query in one modality can find relevant content in another.

What is it intended to be used for?
The model is designed for multimodal RAG systems, multimedia search engines, cross-modal retrieval systems, and conversational AI with rich input understanding across text, images, audio, and video content.

Continue Your AI Journey

Explore our lessons and glossary to deepen your understanding.