Introduction
NVIDIA has released Omni-Embed-Nemotron-3B, a multimodal embedding model that marks a significant step forward for retrieval-augmented generation (RAG) systems. The model can encode content across text, image, audio, and video, either individually or in combination, enabling cross-modal retrieval capabilities that were previously difficult to achieve.
Built on the Qwen2.5-Omni-3B architecture, Omni-Embed-Nemotron-3B is designed to serve as a core retrieval component in multimodal RAG systems, opening new possibilities for AI applications that need to understand and retrieve information across diverse content types.
Model Architecture and Design
Foundation and Architecture
Omni-Embed-Nemotron-3B is built on the Qwen2.5-Omni-3B foundation model, but with a crucial architectural difference:
Thinker-Only Design:
- Utilizes only the "Thinker" component from the original Thinker-Talker architecture
- Focuses on multimodal understanding rather than response generation
- Optimized specifically for embedding and retrieval tasks
- Approximately 4.7 billion parameters in total
Note on Model Naming: The "3B" in the model name refers to the language-model component of the base Qwen2.5-Omni-3B architecture; the full model, including the vision and audio encoders, totals roughly 4.7B parameters.
Multimodal Processing:
- Vision Encoder: Processes images and video frames
- Audio Encoder: Handles audio content and soundtracks
- Language Model: Processes text content and captions
- Independent Processing: Audio and video are encoded separately to preserve temporal structure
Key Architectural Innovations
Separate Audio and Video Streams: Unlike the original Omni model that interleaves audio and video tokens with TMRoPE, Omni-Embed-Nemotron-3B keeps audio and video streams separate. This design choice:
- Preserves full temporal structure of both audio and video
- Improves retrieval performance compared to interleaved approaches
- Enables more accurate cross-modal matching
- Maintains the integrity of temporal information
Bi-Encoder Architecture: The model employs a bi-encoder design where:
- Queries and candidate inputs are embedded independently
- Contrastive learning pulls relevant query-content pairs together (see the training-objective sketch after this list)
- Unrelated pairs are pushed apart in the shared embedding space
- Enables efficient similarity computation and retrieval
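The contrastive setup described above can be made concrete with a short sketch. The snippet below is a minimal, illustrative version of an in-batch InfoNCE-style objective for a bi-encoder; the exact loss, temperature, and hard-negative mining strategy used to train Omni-Embed-Nemotron-3B are not documented here, so the function name and temperature value are assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, doc_emb, temperature=0.05):
    # query_emb, doc_emb: [batch, dim] embeddings produced independently by the bi-encoder.
    # In-batch negatives: every other document in the batch acts as a negative for a query.
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                       # [batch, batch] similarity matrix
    targets = torch.arange(q.size(0), device=q.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
Relevant query-document pairs sit on the diagonal of the similarity matrix and are pulled together, while off-diagonal in-batch negatives are pushed apart, which is exactly the behavior described in the list above.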
Technical Specifications
Input Capabilities
Supported Modalities:
- Text: Strings, lists of strings, or pre-tokenized text
- Images: PIL.Image, numpy arrays, or torch tensors
- Video: Video files (.mp4), numpy arrays, or torch tensors
- Audio: Waveform arrays, torch tensors, or audio files
Input Processing:
- Maximum Context Length: 32,768 tokens
- Batch Processing: Supports batched inputs across all modalities
- Flexible Input: Can process any combination of modalities in a single query (see the example after this list)
- Format Support: Multiple input formats for each modality type
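To illustrate the flexible input format, each query or document is expressed as a chat-style message whose content list mixes modality entries. The snippet mirrors the usage example later in this article; the file paths and the "query: " prefix are illustrative placeholders.
text_query = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "query: the slide that explains the attention mechanism"},
        ],
    },
]

image_text_document = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "passage: slide 12 of the lecture deck"},
            {"type": "image", "image": "path/to/slide_12.png"},
        ],
    },
]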
Output Specifications
Embedding Characteristics:
- Dimension: 2048-dimensional float vectors
- Normalization: L2-normalized, so cosine similarity reduces to a simple dot product (see the sketch after this list)
- Consistency: Queries and candidates from all modalities map into the same embedding space
- Optimization: Trained with a contrastive objective specifically for retrieval tasks
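Because the embeddings are L2-normalized and share a single space, ranking candidates reduces to a dot product between the query embedding and a matrix of candidate embeddings. A minimal sketch, assuming embeddings have already been computed as in the usage example later in this article:
import torch

def top_k(query_emb, doc_embs, k=5):
    # query_emb: [2048] L2-normalized query embedding
    # doc_embs:  [num_docs, 2048] L2-normalized candidate embeddings
    # Candidates may come from any modality: text, image, audio, or video.
    scores = doc_embs @ query_emb                         # dot product == cosine similarity
    return torch.topk(scores, k=min(k, doc_embs.size(0)))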
Performance Metrics:
- Model Size: Approximately 4.7B parameters (the "3B" in the name refers to the base architecture's language model)
- Precision: BF16 (bfloat16) for efficient computation
- Hardware: Optimized for NVIDIA GPU-accelerated systems
- Memory: Efficient memory usage with flash attention implementation
Multimodal Retrieval Capabilities
Cross-Modal Retrieval
Text-to-Video Retrieval:
- Find relevant videos using text queries
- Support for complex video content understanding
- Temporal understanding of video sequences
- High accuracy on video retrieval benchmarks
Text-to-Audio Retrieval:
- Locate audio content using text descriptions
- Support for various audio formats and lengths
- Understanding of audio content semantics
- Effective for music and speech retrieval
Visual Document Retrieval:
- Process and retrieve document images
- Support for complex document layouts
- Text extraction and understanding from images
- Integration with document management systems
Advanced Retrieval Features
Multimodal Queries:
- Support for queries combining multiple modalities
- Complex cross-modal reasoning capabilities
- Flexible query construction
- Rich context understanding
Custom Retrieval Tasks:
- Adaptable to specific domain requirements
- Support for custom similarity metrics
- Integration with existing retrieval systems and vector indexes (see the sketch after this list)
- Extensible architecture for new modalities
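As one example of integrating with an existing retrieval stack, the embeddings can be indexed in a vector search library such as FAISS. FAISS is not a stated dependency of the model, so treat the sketch below as one possible integration; an exact inner-product index is equivalent to cosine similarity for L2-normalized vectors, and the random arrays stand in for real embeddings.
import faiss
import numpy as np

dim = 2048  # embedding dimension from the output specifications above

# Exact inner-product index; with L2-normalized embeddings this is cosine similarity.
index = faiss.IndexFlatIP(dim)

doc_embeddings = np.random.rand(1000, dim).astype("float32")   # placeholder for real document embeddings
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
index.add(doc_embeddings)

query = np.random.rand(1, dim).astype("float32")               # placeholder for a real query embedding
query /= np.linalg.norm(query, axis=1, keepdims=True)
scores, ids = index.search(query, 5)                           # top-5 nearest documents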
Training and Evaluation
Training Datasets
The model was trained on a diverse collection of publicly available datasets:
Text Datasets:
- HotpotQA: Complex question-answering
- MIRACL: Multilingual information retrieval
- Natural Questions (NQ): Real-world question answering
- Stack Exchange: Technical Q&A content
- SQuAD: Reading comprehension
- Tiger Math/Stack: Mathematical reasoning
Multimodal Datasets:
- DocMatix-IR: Document retrieval
- Vidore-ColPali-Training: Visual document retrieval
- Wiki-SS-NQ: Wikipedia-based question answering
Data Scale:
- Images: Approximately 100 million to 1 billion images across training datasets
- Text: Several hundred million tokens from diverse text sources
- Multimodal Content: Combined text-image pairs, video-audio-text triplets
- Diverse Sources: Multiple languages, domains, and content types
- Quality Control: Curated and validated datasets with proper licensing
Performance Benchmarks
Video Retrieval Performance: The model demonstrates strong performance on video retrieval benchmarks:
LPM Dataset (NDCG@10):
- Strong performance across different modalities
- Competitive text-to-video retrieval capabilities
- Effective cross-modal understanding
- Performance varies by modality type and query complexity
FineVideo Dataset (NDCG@10):
- High-quality video content retrieval performance
- Effective temporal understanding of video sequences
- Robust performance across diverse video types
- Consistent results across different video lengths and formats
Text Retrieval Benchmarks:
- Strong performance on standard text retrieval tasks (nDCG@10 metrics)
- Effective semantic understanding across multiple languages
- Competitive with specialized text embedding models
- Consistent performance across different text lengths and complexities
ViDoRe V1 Performance (nDCG@5):
- Document retrieval capabilities with visual understanding
- Effective processing of complex document layouts
- Cross-modal document search across text and visual elements
- Robust performance on document-image retrieval tasks
Software Integration and Usage
Installation Requirements
Dependencies:
pip install git+https://github.com/huggingface/transformers.git@v4.51.3-Qwen2.5-Omni-preview
Hardware Requirements:
- NVIDIA GPU with CUDA support
- Recommended: A100 40GB, A100 80GB, or H100 80GB
- Linux operating system
- TensorRT and Triton Inference Server supported as runtime engines (see Integration Options below)
Usage Example
Basic Implementation:
import torch
from qwen_omni_utils import process_mm_info
import torch.nn.functional as F
from transformers import AutoModel, AutoProcessor
# Load the embedding model
model_name_or_path = "nvidia/omni-embed-nemotron-3b"
model = AutoModel.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)
# Prepare multimodal content
documents = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "passage: This is a passage to be embedded"
            },
            {
                "type": "video",
                "video": "path/to/video.mp4"
            },
            {
                "type": "audio",
                "audio": "path/to/audio.wav"
            }
        ]
    },
]
# Process and embed
processor = AutoProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
documents_texts = processor.apply_chat_template(documents, add_generation_prompt=False, tokenize=False)
audio, images, videos = process_mm_info(documents, use_audio_in_video=False)
# Generate embeddings
batch_dict = processor(
    text=documents_texts,
    images=images,
    videos=videos,
    audio=audio,
    return_tensors="pt",
    text_kwargs={"truncation": True, "padding": True, "max_length": 204800},
    videos_kwargs={"min_pixels": 32*14*14, "max_pixels": 64*28*28, "use_audio_in_video": False},
    audio_kwargs={"max_length": 2048000},
)
# Compute embeddings via masked mean pooling over the last hidden states, then L2-normalize
with torch.no_grad():
    last_hidden_states = model(**batch_dict, output_hidden_states=True).hidden_states[-1]
    attention_mask = batch_dict["attention_mask"]
    last_hidden_states_masked = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    embedding = last_hidden_states_masked.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    embedding = F.normalize(embedding, dim=-1)
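To complete the retrieval loop, a query is embedded with the same pipeline and scored against the document embeddings. The snippet below reuses the model, processor, and pooling code from the example above; the "query: " prefix is an assumption that mirrors the "passage: " prefix in the document example, so check the model card for the exact prompt convention.
# Embed a text-only query with the same model and processor
query_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "query: a clip that explains the topic of the passage"},
        ],
    },
]
query_texts = processor.apply_chat_template(query_messages, add_generation_prompt=False, tokenize=False)
query_batch = processor(
    text=query_texts,
    return_tensors="pt",
    text_kwargs={"truncation": True, "padding": True, "max_length": 204800},
)
with torch.no_grad():
    hidden = model(**query_batch, output_hidden_states=True).hidden_states[-1]
    mask = query_batch["attention_mask"]
    hidden = hidden.masked_fill(~mask[..., None].bool(), 0.0)
    query_embedding = F.normalize(hidden.sum(dim=1) / mask.sum(dim=1)[..., None], dim=-1)

# Cosine similarity between the query and the document embeddings computed above
scores = query_embedding @ embedding.T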
Integration Options
Runtime Engines:
- TensorRT: Optimized inference on NVIDIA hardware
- Triton: Scalable model serving
- Custom Integration: Direct API integration
Supported Hardware:
- A100 40GB and 80GB
- H100 80GB
- Other CUDA-compatible GPUs
- Linux-based systems
Applications and Use Cases
Multimodal RAG Systems
Enhanced Retrieval:
- Retrieve relevant content across multiple modalities
- Support for complex queries combining text, images, audio, and video
- Improved context understanding for language models
- More accurate and comprehensive retrieval results
Conversational AI:
- Rich input understanding across modalities
- Support for multimedia conversations
- Enhanced context awareness
- Improved user experience
Multimedia Search Engines
Cross-Modal Search:
- Find images using text descriptions
- Locate videos using audio queries
- Search audio content using visual cues
- Complex multimodal search capabilities
Content Discovery:
- Intelligent content recommendation
- Similarity-based content discovery
- Cross-platform content matching
- Enhanced user engagement
Enterprise Applications
Document Management:
- Visual document retrieval
- Audio document processing
- Video content indexing
- Comprehensive content search
Knowledge Management:
- Multimodal knowledge bases
- Cross-modal information retrieval
- Enhanced search capabilities
- Improved information discovery
Research and Development
Research Applications
Academic Research:
- Multimodal understanding research
- Cross-modal learning studies
- Retrieval system development
- AI model evaluation
Industry Research:
- Product development
- User experience research
- Content analysis
- Recommendation systems
Development Opportunities
Custom Applications:
- Domain-specific retrieval systems
- Specialized multimodal applications
- Integration with existing systems
- Custom model fine-tuning
Open Source Ecosystem:
- Community contributions
- Model improvements
- Application development
- Research collaboration
Licensing and Terms
License Information
NVIDIA OneWay Noncommercial License:
- Research and development use only
- Non-commercial applications
- Academic and research institutions
- Open source projects
Additional Terms:
- NVIDIA Software and Model Evaluation License applies
- Qwen Research License Agreement applies
- Third-party open source software included
- Review license terms before use
- Compliance with all applicable terms
Usage Restrictions
Intended Use:
- Research and development
- Non-commercial applications
- Educational purposes
- Open source projects
Prohibited Uses:
- Commercial applications without proper licensing
- Military applications
- Surveillance systems
- Harmful or illegal activities
Future Development and Roadmap
Planned Enhancements
Model Improvements:
- Enhanced multimodal understanding
- Improved retrieval accuracy
- Better cross-modal alignment
- Extended modality support
Performance Optimizations:
- Faster inference times
- Reduced memory requirements
- Better hardware utilization
- Improved scalability
Feature Additions:
- Additional modality support
- Enhanced query capabilities
- Better integration tools
- Improved documentation
Community Contributions
Open Source Development:
- Community-driven improvements
- Bug fixes and optimizations
- New feature development
- Documentation enhancements
Research Collaboration:
- Academic partnerships
- Industry collaboration
- Joint research projects
- Knowledge sharing
Conclusion
NVIDIA Omni-Embed-Nemotron-3B represents a significant advancement in multimodal embedding technology, providing a unified solution for retrieval across text, image, audio, and video modalities. This model opens new possibilities for AI applications that need to understand and retrieve information across diverse content types.
Key Achievements:
- Unified Multimodal Processing: Single model for text, image, audio, and video embedding
- Cross-Modal Retrieval: Advanced capabilities for finding content across different modalities
- High Performance: Strong results on multiple retrieval benchmarks
- Research Focus: Designed specifically for research and development applications
- Open Architecture: Built on proven Qwen2.5-Omni foundation
Future Impact:
Omni-Embed-Nemotron-3B has the potential to accelerate the development of sophisticated multimodal AI applications by providing a robust foundation for retrieval-augmented generation systems. The model's ability to understand and process multiple modalities simultaneously represents a significant step forward in creating more intelligent and context-aware AI systems.
The release of this model marks an important milestone in the evolution of multimodal AI, providing researchers and developers with powerful tools to build the next generation of AI applications that can truly understand and work with the rich, multimodal nature of human communication and content.
Sources
- NVIDIA Omni-Embed-Nemotron-3B on Hugging Face
- Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model - arXiv Paper
- NV-Retriever: Improving text embedding models with effective hard-negative mining - arXiv Paper
- NVIDIA Developer Platform
- Qwen2.5-Omni-3B Foundation Model
Interested in learning more about multimodal AI and embedding models? Explore our AI fundamentals courses to understand how AI models work, check out our glossary of AI terms for key concepts like embedding and retrieval-augmented generation, or discover the latest AI models and AI tools in our comprehensive catalog.