MobileLLM-Pro: Meta's 1B On-Device Model with 128k Context

Meta Reality Labs releases MobileLLM-Pro, a 1B parameter language model optimized for on-device inference with a 128k context window and near-lossless int4 quantization.

by HowAIWorks Team
Tags: MobileLLM, Meta, On-Device AI, Language Model, Quantization, Mobile AI, Edge Computing, FAIR, Reality Labs, 1B Parameters

Introduction

On October 16, 2025, Meta Reality Labs announced the release of MobileLLM-Pro, a groundbreaking 1B parameter language model specifically designed for efficient on-device inference. This latest addition to the MobileLLM series represents a significant advancement in mobile AI capabilities, delivering high-quality performance while maintaining the resource efficiency required for real-world mobile deployment.

MobileLLM-Pro addresses the growing demand for powerful language models that can run directly on mobile devices without requiring cloud connectivity. The model's innovative architecture and quantization techniques make it particularly suitable for applications requiring privacy, low latency, and offline functionality.

What is MobileLLM-Pro?

MobileLLM-Pro (also known as MobileLLM-P1) is a 1.084B parameter foundational language model that combines efficient architecture with advanced quantization techniques to deliver state-of-the-art performance on mobile devices. The model is available in two variants: a pre-trained base model and an instruction-tuned version optimized for specific use cases.

Core Architecture

Model Specifications:

  • Parameters: 1,084M (1.08B)
  • Layers: 30
  • Attention Heads: 20
  • KV Heads: 4
  • Dimension: 1280
  • Hidden Dimension: 6144
  • Vocabulary Size: 202,048
  • Context Length: 128k tokens
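
These specifications are consistent with a standard Llama-style decoder. As a sanity check, the short sketch below recovers the reported parameter count from the numbers above; it assumes grouped-query attention, a gated (SwiGLU-style) feed-forward network, tied input/output embeddings, and ignores the small normalization parameters, since the model card does not spell these details out.

# Rough parameter count for MobileLLM-Pro from the published specs.
# Assumptions (not confirmed by the model card): Llama-style blocks with a
# gated SwiGLU feed-forward network, grouped-query attention, and tied
# input/output embeddings; normalization parameters are ignored.
dim, hidden_dim, layers = 1280, 6144, 30
heads, kv_heads, vocab = 20, 4, 202_048
head_dim = dim // heads                                  # 64

attn = dim * dim * 2 + dim * kv_heads * head_dim * 2     # Wq, Wo plus Wk, Wv
ffn = 3 * dim * hidden_dim                               # gate, up, down projections
embeddings = vocab * dim                                 # shared with the output head

total = layers * (attn + ffn) + embeddings
print(f"{total / 1e6:.0f}M parameters")                  # ~1084M, matching the spec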

Key Innovations

Interleaved Local-Global Attention:

  • 3:1 Ratio: Three local attention layers for every global attention layer
  • 512-Token Local Window: Local layers attend within a 512-token window, cutting the cost of long sequences
  • Memory Efficiency: Lowers KV cache size from 117MB to 40MB for an 8k context (a back-of-the-envelope check follows this list)
  • Speed Improvement: 1.8x faster prefill latency compared to fully global attention
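
The memory savings follow directly from the architecture. The sketch below roughly reproduces the figures above; it assumes an int8 (1-byte) KV cache, an approximate 22/8 split of the 30 layers implied by the 3:1 ratio, and that local layers only ever cache the most recent 512 tokens. The small gaps to the quoted 117MB and 40MB presumably come from implementation details.

# Approximate KV cache size at an 8k context, assuming an int8 (1 byte) cache.
# The exact local/global layer split is an assumption based on the 3:1 ratio.
layers, kv_heads, head_dim = 30, 4, 64
kv_dim = kv_heads * head_dim                  # 256 values each for K and V
context, window = 8192, 512

def kv_bytes(tokens, n_layers):
    return n_layers * 2 * tokens * kv_dim     # 2x for keys and values, 1 byte each

global_layers, local_layers = 8, 22           # ~3:1 split of 30 layers

fully_global = kv_bytes(context, layers)
interleaved = kv_bytes(context, global_layers) + kv_bytes(window, local_layers)

print(f"fully global: {fully_global / 2**20:.1f} MiB")   # ~120 MiB
print(f"interleaved:  {interleaved / 2**20:.1f} MiB")    # ~37.5 MiB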

Advanced Quantization:

  • Near-Lossless int4: Less than 1.3% quality degradation
  • CPU Optimization: int4 weights (group size 32), int8 dynamic activations
  • Accelerator Support: int4 per-channel weights for specialized hardware
  • Flexible Deployment: Multiple quantization strategies for different use cases

Performance Benchmarks

Base Model Performance

MobileLLM-Pro demonstrates superior performance across multiple evaluation benchmarks:

| Benchmark | P1 (FP) | P1 (Q-CPU) | P1 (Q-Acc) | Gemma 3 1B | Llama 3.2 1B |
| --- | --- | --- | --- | --- | --- |
| HellaSwag | 67.11% | 64.89% | 65.10% | 62.30% | 65.69% |
| BoolQ | 76.24% | 77.49% | 76.36% | 63.20% | 62.51% |
| PIQA | 76.55% | 76.66% | 75.52% | 73.80% | 75.14% |
| SocialIQA | 50.87% | 51.18% | 50.05% | 48.90% | 45.60% |
| TriviaQA | 39.85% | 37.26% | 36.42% | 39.80% | 23.81% |
| NatQ | 15.76% | 15.43% | 13.19% | 9.48% | 5.48% |
| ARC-c | 52.62% | 52.45% | 51.24% | 38.40% | 38.28% |
| ARC-e | 76.28% | 76.58% | 75.73% | 73.00% | 63.47% |
| WinoGrande | 62.83% | 62.43% | 61.96% | 58.20% | 61.09% |
| OBQA | 43.60% | 44.20% | 40.40% | 37.20% | - |
| NIH | 100.00% | 96.44% | 98.67% | - | - |

Instruction-Tuned Model Performance

The instruction-tuned variant shows competitive performance on specialized tasks:

| Benchmark | P1 (IFT) | Gemma 3 1B (IFT) | Llama 3.2 1B (IFT) |
| --- | --- | --- | --- |
| MMLU | 44.8% | 29.9% | 49.3% |
| IFEval | 62.0% | 80.2% | 59.5% |
| MBPP | 46.8% | 35.2% | 39.6% |
| HumanEval | 59.8% | 41.5% | 37.8% |
| ARC-C | 62.7% | 59.4% | - |
| HellaSwag | 58.4% | 41.2% | - |
| BFCL v2 | 29.4% | 25.7% | - |
| Open Rewrite | 51.0% | 41.6% | - |
| TLDR9+ | 16.8% | 16.8% | - |

Competitive Advantages

Performance Leadership:

  • 5.7% improvement over Gemma 3 1B on average
  • 7.9% improvement over Llama 3.2 1B on average
  • Superior reasoning: Better performance on complex reasoning tasks
  • Knowledge retention: Strong performance on knowledge-intensive benchmarks

Technical Innovations

Knowledge Distillation Training

MobileLLM-Pro uses advanced knowledge distillation techniques:

  • Teacher Model: Llama 4 Scout as the knowledge source
  • Loss Function: KL Divergence for effective knowledge transfer
  • Training Data: Less than 2T fully open-source tokens
  • Efficiency: Achieves high performance with reduced training data
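
Conceptually, logit-level distillation with a KL-divergence loss looks like the sketch below. This is a minimal illustration in plain PyTorch, not Meta's training code; the temperature value is an arbitrary choice here.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # batchmean reduction matches the mathematical definition of KL divergence;
    # the t**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t

# Toy example: two token positions over a 202,048-token vocabulary.
student = torch.randn(2, 202_048)
teacher = torch.randn(2, 202_048)
loss = distillation_loss(student, teacher)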

Long-Context Processing

The model's 128k context window enables:

  • Document Summarization: Processing of long documents
  • Information Retrieval: Context-aware search and analysis
  • Conversation Memory: Maintaining context across extended interactions
  • Code Analysis: Understanding large codebases

Quantization Techniques

Group-wise Quantization (CPU):

  • int4 weights with group size 32
  • int8 dynamic activations for optimal performance
  • int8 KV cache for memory efficiency
  • 0.4% quality regression compared to full precision

Per-channel Quantization (Accelerators):

  • int4 per-channel weights for specialized hardware
  • 1.3% quality regression compared to full precision
  • Optimized for ANE & HTP (Apple Neural Engine & Hexagon Tensor Processor)
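
The difference between the two schemes is where the scales live: one scale per group of 32 weights versus one per output channel. The sketch below is illustrative only, assuming symmetric int4 rounding; torchao's actual kernels additionally handle bit packing, zero points, and activation quantization.

import torch

def quantize_int4_groupwise(weight, group_size=32):
    """Symmetric int4 quantization with one scale per group of 32 weights."""
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)
    scale = w.abs().amax(dim=-1, keepdim=True) / 7            # int4 range is [-8, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return (q * scale).reshape(out_features, in_features)      # dequantized view

def quantize_int4_per_channel(weight):
    """Symmetric int4 quantization with one scale per output channel."""
    scale = weight.abs().amax(dim=-1, keepdim=True) / 7
    q = torch.clamp(torch.round(weight / scale), -8, 7)
    return q * scale

w = torch.randn(1280, 1280)
err_group = (w - quantize_int4_groupwise(w)).abs().mean()
err_channel = (w - quantize_int4_per_channel(w)).abs().mean()
# Finer-grained groups track the weight distribution more closely, which is
# consistent with the smaller (0.4%) regression reported for the CPU path.
print(err_group.item(), err_channel.item())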

Latency and Performance Benchmarks

Real-World Performance

Benchmarks were run on a Samsung Galaxy S25 CPU and a Samsung Galaxy S24 Hexagon Tensor Processor (HTP):

| Metric / Prompt Length | 2k | 4k | 8k |
| --- | --- | --- | --- |
| CPU Prefill Latency (s) | 8.9 | 24.8 | 63.5 |
| CPU Decode Speed (tok/s) | 33.6 | 24.8 | 19.7 |
| HTP Prefill Latency (s) | 1.96 | 3.38 | 9.82 |
| HTP Decode Speed (tok/s) | 31.60 | 28.95 | 22.77 |
| KV Cache Size (MB) | 14 | 23 | 40 |
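
These figures make it easy to estimate end-to-end response time: roughly, prefill latency plus the number of generated tokens divided by the decode speed. A quick sketch using the HTP numbers from the table above:

# Rough response-time estimate: prefill latency + generated tokens / decode speed.
# Values are taken from the HTP rows of the table above.
htp = {2048: (1.96, 31.60), 4096: (3.38, 28.95), 8192: (9.82, 22.77)}

def estimate_seconds(prompt_tokens, new_tokens):
    prefill_s, decode_tok_per_s = htp[prompt_tokens]
    return prefill_s + new_tokens / decode_tok_per_s

# e.g. a 4k-token prompt with a 256-token reply: about 12 seconds on-device
print(f"{estimate_seconds(4096, 256):.1f} s")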

Memory Efficiency

  • Model Size: 590MB with 4-bit groupwise quantization
  • KV Cache Optimization: Significant reduction in memory footprint
  • Scalable Performance: Maintains efficiency across different context lengths

Use Cases and Applications

Mobile-First Applications

On-Device Chatbots:

  • Privacy-preserving conversations
  • Offline functionality
  • Low-latency responses
  • No data transmission to servers

Document Processing:

  • Long document summarization
  • Information extraction
  • Content analysis
  • Multi-language support

Code Assistance:

  • On-device code completion
  • Debugging assistance
  • Code explanation
  • Programming tutorials

Enterprise Applications

Edge Computing:

  • IoT device intelligence
  • Real-time processing
  • Bandwidth optimization
  • Data privacy compliance

Mobile Development:

  • App-integrated AI features
  • Offline AI capabilities
  • Reduced server costs
  • Enhanced user experience

Implementation and Usage

Basic Usage

Loading the model with Hugging Face Transformers:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_ID = "facebook/MobileLLM-Pro"

def generate(prompt, model, tokenizer, chat=False):
    if chat:
        prompt = f"<|user|>\n{prompt}\n<|assistant|>\n"
    
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to(model.device)
    outputs = model.generate(input_ids, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Load model
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Generate response
prompt = "Why are open-source on-device language models great?"
result = generate(prompt, model, tokenizer, chat=True)
print(result)
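
If the released tokenizer ships a chat template (an assumption here, not confirmed from the model card), the manual prompt formatting above can be replaced with the standard Transformers helper:

# Alternative prompt construction via the tokenizer's chat template, assuming
# the instruction-tuned checkpoint provides one. Reuses the tokenizer and
# model loaded above.
messages = [
    {"role": "user", "content": "Why are open-source on-device language models great?"}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))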

Quantization Implementation

4-bit Groupwise Quantization:

import torch
from torchao.quantization import quantize_
from torchao.quantization.qat import QATConfig, IntxFakeQuantizeConfig

# Prepare the model for quantization-aware training:
# int8 dynamic per-token activations and int4 weights with group size 32
activation_config = IntxFakeQuantizeConfig(
    torch.int8, "per_token", is_symmetric=False
)
weight_config = IntxFakeQuantizeConfig(
    torch.int4, group_size=32, is_symmetric=True, is_dynamic=True
)
qat_config = QATConfig(activation_config=activation_config, weight_config=weight_config, step="prepare")
quantize_(model, qat_config)

Training and Data

Training Methodology

Knowledge Distillation:

  • Teacher Model: Llama 4 Scout
  • Student Model: MobileLLM-Pro
  • Loss Function: KL Divergence
  • Training Data: <2T open-source tokens

Data Mix:

  • Educational Web Data: Primary training source
  • Coding Data: Programming and software development content
  • Mathematics: Mathematical reasoning and problem-solving
  • Wikipedia: General knowledge and factual information
  • Scientific Papers: Academic and research content
  • Q&A Forums: Question-answer pairs and discussions

Training Process

Pre-training Phase:

  • Large-scale language modeling
  • Knowledge acquisition from teacher model
  • Efficient parameter utilization
  • Context length optimization

Instruction Fine-tuning:

  • Specialized task training
  • Tool calling capabilities
  • Question answering optimization
  • Rewriting and summarization skills

Competitive Landscape

Advantages Over Competitors

vs. Gemma 3 1B:

  • 5.7% average improvement across benchmarks
  • Better reasoning capabilities
  • Superior long-context handling
  • More efficient quantization

vs. Llama 3.2 1B:

  • 7.9% average improvement across benchmarks
  • Enhanced knowledge retention
  • Better instruction following
  • Improved mobile optimization

Market Position

MobileLLM-Pro positions Meta as a leader in:

  • On-device AI: Efficient mobile language models
  • Quantization Technology: Near-lossless compression
  • Edge Computing: Mobile-first AI solutions
  • Open Source: Released under the FAIR Noncommercial (NC) license for research and development

Future Implications

Mobile AI Evolution

MobileLLM-Pro represents several important trends:

Efficiency Focus:

  • Models that deliver high performance with minimal resources
  • Advanced quantization techniques for mobile deployment
  • Optimized architectures for edge computing

Privacy and Security:

  • On-device processing for data protection
  • Reduced dependency on cloud services
  • Enhanced user privacy controls

Accessibility:

  • Democratized access to powerful AI capabilities
  • Reduced infrastructure requirements
  • Lower barriers to AI adoption

Industry Impact

Mobile Development:

  • Enhanced app capabilities with on-device AI
  • Reduced server costs and complexity
  • Improved user experience through faster responses

Enterprise Applications:

  • Edge computing solutions
  • Privacy-compliant AI implementations
  • Cost-effective AI deployment

Conclusion

MobileLLM-Pro represents a significant milestone in mobile AI development, demonstrating that it's possible to achieve state-of-the-art performance in a compact, efficient package. By combining innovative architecture with advanced quantization techniques, Meta has created a model that pushes the boundaries of what's possible with on-device language processing.

Key Takeaways:

  • Performance Excellence: Outperforms Gemma 3 1B by 5.7% and Llama 3.2 1B by 7.9% on average
  • Efficiency Innovation: 1.8x faster prefill with 3:1 local-global attention ratio
  • Memory Optimization: Reduces KV cache from 117MB to 40MB for 8k context
  • Near-Lossless Quantization: Less than 1.3% quality degradation with int4 quantization
  • Long-Context Support: 128k token context window for complex applications
  • Mobile-First Design: Optimized for real-world mobile deployment scenarios

This development highlights that mobile AI is reaching new levels of sophistication, with models that can deliver powerful capabilities while maintaining the efficiency and privacy requirements of mobile devices. The combination of advanced performance with practical deployment characteristics positions MobileLLM-Pro as a transformative tool for mobile AI applications.

Want to learn more about mobile AI and edge computing? Explore our AI models catalog, check out our AI fundamentals courses, or browse our glossary of AI terms for deeper understanding. For information about other mobile AI tools, visit our AI tools section.

Frequently Asked Questions

What is MobileLLM-Pro?
MobileLLM-Pro is Meta's 1B parameter language model designed for efficient on-device inference, featuring a 128k context window, near-lossless int4 quantization, and competitive performance against larger models.

How does MobileLLM-Pro achieve its efficiency?
MobileLLM-Pro uses interleaved local-global attention (3:1 ratio) to reduce prefill latency by 1.8x and lower KV cache size from 117MB to 40MB, making it highly efficient for mobile deployment.

How does MobileLLM-Pro compare to other 1B models?
MobileLLM-Pro outperforms Gemma 3 1B by 5.7% and Llama 3.2 1B by 7.9% on average across reasoning, knowledge, and long-context retrieval benchmarks.

What quantization options does MobileLLM-Pro offer?
MobileLLM-Pro offers int4 quantization with less than 1.3% quality degradation: group-wise quantization for CPU (0.4% regression) and per-channel quantization for accelerators (1.3% regression).

What context length does MobileLLM-Pro support?
MobileLLM-Pro supports up to 128k tokens, enabling long-context understanding for applications like document summarization and information retrieval.

Continue Your AI Journey

Explore our lessons and glossary to deepen your understanding.