Introduction
NVIDIA has introduced TTT-E2E (Test-Time Training with End-to-End formulation), an approach that rethinks how large language models handle long context. Unlike methods that give up either quality or latency as context grows, TTT-E2E compresses long context into the model's weights through next-token prediction, achieving constant inference latency regardless of context length while matching or improving on the loss of full attention.
Published on January 9, 2026, by researchers Yu Sun and Yejin Choi from NVIDIA, TTT-E2E addresses the most fundamental problem in long-context AI research: scaling with context length in terms of both loss and latency. The method represents a paradigm shift from treating context as external memory to compressing it directly into the model's weights during inference, similar to how humans compress experience into intuitive understanding.
The Fundamental Problem: LLM Memory vs Human Memory
How Humans Remember
Humans excel at improving with more "context" in the form of life experience, despite imperfect recall of exact details. For example, after your first machine learning lecture, you might not remember the instructor's first word, but the intuition you learned likely helps you understand new concepts years later. This is because humans compress massive amounts of experience into their brains, preserving important, predictive information while leaving out many details.
How LLMs Currently Remember
Traditional transformers with self-attention are designed for nearly lossless recall, maintaining full memory of every token by caching and comparing their keys and values. This approach has a critical limitation: the cost per token grows linearly with context length. Processing the 10-millionth token takes one million times longer than processing the 10th token.
To address this, modern architectures combine full attention with approximations like:
- Sliding-window attention - Limited context window
- Mamba - State space models
- Gated DeltaNet - Recurrent architectures
While these approximations have constant cost per token, they become significantly less effective than full attention as context grows, losing important information that would help predict future tokens.
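To make one of these approximations concrete, here is a minimal PyTorch sketch (mine, not from the paper) of a sliding-window causal mask: each token may attend only to the most recent `window` positions, which caps the per-token cost but makes anything older than the window invisible to the layer.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where True marks key positions a query may attend to.

    Query position i can see key positions j with i - window < j <= i,
    so per-token cost is O(window) instead of O(i) -- but any token older
    than `window` is simply dropped from the attention computation.
    """
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (T, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions,   shape (1, T)
    return (j <= i) & (j > i - window)

print(sliding_window_causal_mask(seq_len=8, window=3).int())
```

A mask like this can be passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`. Mamba and Gated DeltaNet take a different route (a fixed-size recurrent state) but share the same trade-off: constant per-token cost in exchange for lossy memory.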
TTT-E2E: Compressing Context into Weights
The Core Innovation
TTT-E2E solves the long-context problem through compression. Just as humans compress experience into their brains, TTT-E2E compresses context into model weights during test time. The key insight: we know that training with next-token prediction compresses massive amounts of data into model weights, so why not continue this process at test time on the given context?
The TTT-E2E process (sketched in code after these steps):
- Step 1: Meta-learning preparation - During training, the model learns how to learn from context through meta-learning
- Step 2: Test-time training - During inference, the model continues training through next-token prediction on the given context
- Step 3: Context compression - Important, predictive information gets compressed into the model weights
- Step 4: Constant latency inference - The model uses compressed weights for prediction, maintaining constant latency
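Here is a minimal PyTorch sketch of those four steps, assuming the meta-learned initialization (step 1) is already available. `TinyLM`, the chunk size, learning rate, and number of gradient steps are illustrative assumptions, not the paper's architecture or hyperparameters.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Toy stand-in for the language model (assumed, not the paper's architecture)."""
    def __init__(self, vocab: int = 256, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                      # tokens: (B, T) -> logits: (B, T, vocab)
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)

def next_token_loss(model: TinyLM, tokens: torch.Tensor) -> torch.Tensor:
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))

def compress_context(base: TinyLM, context: torch.Tensor,
                     chunk: int = 512, lr: float = 1e-3, steps: int = 1) -> TinyLM:
    """Steps 2-3: continue next-token prediction on the given context at test time,
    so its predictive structure ends up in a private copy of the weights."""
    model = copy.deepcopy(base)                     # leave the shared base weights untouched
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for start in range(0, context.size(1) - 1, chunk):
        piece = context[:, start:start + chunk + 1]
        for _ in range(steps):
            opt.zero_grad()
            next_token_loss(model, piece).backward()
            opt.step()
    return model

@torch.no_grad()
def generate(model: TinyLM, prompt: torch.Tensor, n_new: int = 16) -> torch.Tensor:
    """Step 4: decode with the compressed weights. Only a fixed-size recurrent
    state is carried forward, so per-token work is constant."""
    out, state = model.rnn(model.embed(prompt))
    tokens, nxt = prompt, model.head(out[:, -1:]).argmax(-1)
    for _ in range(n_new):
        tokens = torch.cat([tokens, nxt], dim=1)
        out, state = model.rnn(model.embed(nxt), state)   # constant work per new token
        nxt = model.head(out[:, -1:]).argmax(-1)
    return tokens

base = TinyLM()                                     # stands in for the meta-learned init (step 1)
long_context = torch.randint(0, 256, (1, 4096))
adapted = compress_context(base, long_context)
print(generate(adapted, long_context[:, -8:]).shape)    # torch.Size([1, 24])
```

The important property is in `generate`: nothing there reads a KV cache or re-attends to the compressed context, so per-token decode cost does not depend on how long that context was.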
End-to-End Formulation
TTT-E2E is "end-to-end" in two critical ways:
- Inner loop optimization - Directly optimizes the next-token prediction loss at the end of the network, in contrast to prior work on long-context TTT (e.g., Titans)
- Outer loop optimization - Directly optimizes the final loss after TTT, ensuring the meta-learning prepares the model effectively
This dual optimization creates a model that is fundamentally prepared to learn from context during inference, rather than simply processing it.
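Schematically (my notation, not the paper's), the two loops form a bilevel objective in which the inner and outer losses are the same next-token prediction loss:

```latex
\begin{aligned}
\text{inner loop (test time):}\quad
  & W^{*}(\theta, x_{1:t}) \;=\; \text{a few gradient steps on } \mathcal{L}_{\mathrm{NTP}}(x_{1:t}; W),
    \text{ starting from } W = \theta, \\
\text{outer loop (training):}\quad
  & \min_{\theta}\; \mathbb{E}_{x}\Big[\, \mathcal{L}_{\mathrm{NTP}}\big(x_{t+1:T};\, W^{*}(\theta, x_{1:t})\big) \Big].
\end{aligned}
```

Because the outer loss is evaluated after the inner updates, its gradient has to flow through those updates; this is the gradients-of-gradients requirement that reappears in the Limitations section.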
Performance Results: Scaling in Both Dimensions
Loss Scaling: From Worst to Best
The left panel of Figure 1 shows TTT-E2E's remarkable performance in terms of loss scaling:
- At 128K context: TTT-E2E turns the worst performance line (gray) into the best (light green)
- Loss ∆ improvement: Maintains the same advantage over full attention as context length increases
- No degradation: While other methods produce worse loss ∆ in longer context, TTT-E2E maintains consistent advantage
The loss ∆ metric is computed as (loss of the reported method) − (loss of transformer with full attention), so full attention itself is the flat line at y=0. TTT-E2E not only matches full attention's loss performance but actually improves upon it while maintaining constant latency.
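Written out (my notation, matching the prose definition above):

```latex
\Delta_{\text{method}}(c) \;=\; \mathcal{L}_{\text{method}}(c) \;-\; \mathcal{L}_{\text{full attention}}(c),
\qquad\text{so}\qquad \Delta_{\text{full attention}}(c) \equiv 0,
```

and a negative ∆ means the method achieves lower loss than full attention at context length c.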
Latency Scaling: Constant Time Inference
The right panel demonstrates TTT-E2E's constant inference latency:
- 2.7x speedup over full attention for 128K context on NVIDIA H100
- 35x speedup over full attention for 2M context
- Constant latency regardless of context length, similar to RNNs
- No scaling walls observed across extensive experiments
All tested models have 3B parameters and were trained with 164B tokens, ensuring fair comparison across methods.
The Unique Achievement
TTT-E2E is the first method to show signs of life on the fundamental problem of scaling with context length in both loss and latency. All other methods exhibit qualitatively different trends:
- Transformers with full attention: Scale well in loss but not latency
- RNNs (Mamba 2, Gated DeltaNet): Scale well in latency but not loss
- TTT-E2E: Scales well in both dimensions
The research community may finally have a basic solution to long context in 2026.
How TTT-E2E Works: Technical Deep Dive
Meta-Learning for Test-Time Training
The effectiveness of TTT-E2E depends on proper preparation during training. The model undergoes meta-learning that prepares its initialization for test-time training:
- Outer loop: Optimizes the model's initial parameters to be good at learning from context
- Inner loop: During test time, the model performs gradient updates on the given context
- End-to-end: Both loops are optimized together, ensuring the model learns how to learn effectively
This meta-learning preparation is crucial: without it, naive test-time training would not compress long context effectively.
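A simplified sketch of that training mechanic (mine, not the paper's code): run one inner next-token-prediction step with `create_graph=True`, evaluate the same loss on later tokens using the updated weights, and backpropagate that outer loss into the initialization. The toy model, the single inner step, and all hyperparameters are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

# Toy stand-in model: embedding -> linear head (for illustration only).
model = nn.Sequential(nn.Embedding(256, 32), nn.Linear(32, 256))

def ntp_loss(params: dict, tokens: torch.Tensor) -> torch.Tensor:
    """Next-token prediction loss with an explicit parameter dict, so the
    inner update below stays on the autograd graph."""
    logits = functional_call(model, params, (tokens[:, :-1],))
    return F.cross_entropy(logits.reshape(-1, 256), tokens[:, 1:].reshape(-1))

def outer_step(theta: dict, context: torch.Tensor, future: torch.Tensor,
               inner_lr: float = 1e-2) -> torch.Tensor:
    # Inner loop: one test-time training step on the context (same NTP loss).
    inner_loss = ntp_loss(theta, context)
    grads = torch.autograd.grad(inner_loss, list(theta.values()), create_graph=True)
    updated = {name: p - inner_lr * g for (name, p), g in zip(theta.items(), grads)}
    # Outer loop: the loss that actually matters -- future tokens, updated weights.
    return ntp_loss(updated, future)

theta = {name: p.clone().detach().requires_grad_(True) for name, p in model.named_parameters()}
meta_opt = torch.optim.Adam(theta.values(), lr=1e-3)

tokens = torch.randint(0, 256, (1, 257))
context, future = tokens[:, :129], tokens[:, 128:]

for _ in range(3):                                  # a few meta-training steps
    meta_opt.zero_grad()
    loss_after_ttt = outer_step(theta, context, future)
    loss_after_ttt.backward()                       # gradient flows *through* the inner update
    meta_opt.step()
print(float(loss_after_ttt))
```

The outer optimizer never sees the inner loss directly; it only sees how well the updated weights predict future tokens, which is what "end-to-end" means here.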
Next-Token Prediction as Compression
During test time, TTT-E2E compresses context through next-token prediction:
- Context processing: The model receives a long context sequence
- Gradient updates: The model performs a few gradient steps on next-token prediction
- Weight updates: Model weights are updated to better predict tokens in this context
- Compression achieved: Important, predictive information is now encoded in the weights
- Inference: The model uses updated weights for final predictions with constant latency
This process mirrors how humans learn: we don't store every detail, but we compress experience into understanding that helps us predict and understand future situations.
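One concrete way to see the "compression" claim (a toy check of my own, not an experiment from the paper): after a few test-time gradient steps on the first part of a context, the model's next-token loss on the remaining, held-out part of that same context should drop, because the prefix's predictive structure is now in the weights. The model, the synthetic repetitive context, and the hyperparameters are all assumptions.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Embedding(64, 32), nn.Linear(32, 64))   # toy LM (assumed)

def ntp_loss(m: nn.Module, tokens: torch.Tensor) -> torch.Tensor:
    logits = m(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, 64), tokens[:, 1:].reshape(-1))

# A context with strong internal structure: a repeating motif.
motif = torch.randint(0, 64, (1, 32))
context = motif.repeat(1, 64)                        # (1, 2048) tokens
prefix, heldout = context[:, :1536], context[:, 1536:]

before = ntp_loss(model, heldout).item()

ttt = copy.deepcopy(model)                           # test-time copy; base weights untouched
opt = torch.optim.SGD(ttt.parameters(), lr=0.1)
for _ in range(50):                                  # a few gradient steps on the prefix only
    opt.zero_grad()
    ntp_loss(ttt, prefix).backward()
    opt.step()

after = ntp_loss(ttt, heldout).item()
print(f"held-out next-token loss: {before:.2f} -> {after:.2f}")  # typically drops on this toy data
```

On real text the gain comes from subtler regularities than a repeating motif, but the mechanism is the same: whatever helps predict the rest of the context gets written into the weights.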
Constant Latency Architecture
The key to constant latency is that TTT-E2E doesn't need to attend to every token during inference:
- Compressed representation: Context information is already in the weights
- No full attention: The model doesn't need to compare against all previous tokens
- Efficient inference: Similar to RNNs, processing each new token has constant cost
- Scalable: Works the same way whether context is 1K or 2M tokens
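A back-of-the-envelope comparison makes the latency point clear. For full attention, decoding token t means comparing the new query against all t cached keys, so per-token arithmetic grows with t; when the context lives in the weights, the per-token work is a fixed set of matrix multiplies. The rough counts below cover a single attention layer only and ignore MLPs, layer count, and memory bandwidth, so they will not match the measured 2.7x and 35x wall-clock speedups quoted above; they only show the linear-versus-constant shape.

```python
def attention_decode_flops(t: int, d: int = 4096) -> float:
    """Rough per-token FLOPs for one full-attention layer at position t:
    fixed QKV/output projections (~8*d^2) plus attention over t cached keys (~4*t*d)."""
    return 8 * d * d + 4 * t * d

def weights_only_decode_flops(d: int = 4096) -> float:
    """Rough per-token FLOPs when the context is compressed into the weights:
    the same fixed projections, with no lookup over past tokens."""
    return 8 * d * d

for t in (1_000, 128_000, 2_000_000):
    ratio = attention_decode_flops(t) / weights_only_decode_flops()
    print(f"context {t:>9,} tokens: full attention does ~{ratio:.1f}x the per-token arithmetic")
```

At 1K context the attention term is negligible, which is why short-context latency is not the issue; the gap only opens up as the context stretches into the hundreds of thousands or millions of tokens.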
Comparison with Existing Methods
Transformers with Full Attention
Strengths:
- Excellent loss scaling with context length
- Maintains all context information
- Proven architecture
Weaknesses:
- Linear latency scaling with context
- Becomes prohibitively slow for very long contexts
- Memory grows linearly with context (the KV cache), while total attention compute grows quadratically with sequence length
TTT-E2E Advantage: Matches or exceeds loss performance while maintaining constant latency.
RNNs (Mamba 2, Gated DeltaNet)
Strengths:
- Constant latency regardless of context length
- Memory efficient
- Fast inference
Weaknesses:
- Loss degrades significantly with longer context
- Loses important information
- Less effective than full attention
TTT-E2E Advantage: Maintains constant latency like RNNs while achieving better loss than full attention.
Prior Test-Time Training Methods
Previous TTT approaches:
- Often used auxiliary tasks or separate objectives
- Didn't optimize end-to-end
- Less effective for long-context compression
TTT-E2E Advantage: Direct optimization of next-token prediction with end-to-end meta-learning makes it far more effective.
The Role of RAG in the TTT-E2E Era
TTT-E2E and RAG (Retrieval-Augmented Generation) serve complementary roles:
TTT-E2E is like updating the human brain—it compresses context into intuitive, predictive understanding that persists and helps with future tasks. The model's weights encode learned patterns and insights.
RAG is like writing things down and looking them up in a notepad—it provides access to detailed information when specifics matter, like shopping for a long list of groceries.
The Relationship:
- A person's productivity is mostly determined by their brain, not by the notepads they use
- AI agent productivity is mostly determined by how well it compresses context into predictive information
- RAG remains valuable for accessing specific details, but TTT-E2E provides the foundational understanding
Both approaches will likely coexist, with TTT-E2E handling the core reasoning and RAG providing detailed retrieval when needed.
Limitations and Future Work
Current Limitations
Meta-Learning Overhead:
- Current meta-learning implementation is 3.4x slower than standard pre-training for short context (8K)
- This is because the standard FlashAttention API does not support gradients of gradients (double backward), which the end-to-end outer loop requires; see the sketch after the list below
- The limitation affects training time, not inference time
Potential Solutions:
- Custom attention kernels - Develop kernels that support gradients of gradients
- Hybrid initialization - Initialize TTT-E2E from a standard Transformer pre-trained without TTT
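To see what "gradients of gradients" means here: the outer loop has to differentiate through an inner gradient step, which requires double backward through every op on the inner path, including attention. The sketch below is my own illustration with a plain, unfused attention implementation, where second-order gradients work out of the box; the real constraint concerns the fused FlashAttention kernels, whose standard interfaces do not expose a double backward.

```python
import torch
import torch.nn.functional as F

def plain_attention(q, k, v):
    """Unfused scaled dot-product attention: ordinary matmuls and a softmax,
    so autograd can differentiate it twice (unlike typical fused flash kernels)."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ v

d = 16
w = torch.randn(d, d, requires_grad=True)          # stand-in "inner" parameter
x = torch.randn(1, 8, d)

# Inner loss: an attention output that depends on w.
q = x @ w
inner_loss = plain_attention(q, x, x).pow(2).mean()

# First backward, kept on the graph so it can itself be differentiated.
(g,) = torch.autograd.grad(inner_loss, w, create_graph=True)

# "Outer" loss evaluated on the updated parameter: differentiating this
# w.r.t. w requires the gradient-of-gradient path through the attention op.
w_updated = w - 0.1 * g
outer_loss = (x @ w_updated).pow(2).mean()
outer_loss.backward()                              # double backward succeeds here
print(w.grad.shape)
```

Running the same pattern through the fast fused attention kernels is where today's standard APIs fall short, which is why the article attributes the 3.4x training overhead to this missing kernel support.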
Future Research Directions
The research opens several exciting directions for the community:
- Larger models - Extending TTT-E2E beyond the 3B-parameter scale evaluated in this work
- Efficiency improvements - Optimizing the meta-learning phase for faster training through custom kernels or hybrid initializations
- Applications - Extending the approach to the long-document, conversational, and agentic settings discussed below
Implications for AI Development
Paradigm Shift in Context Handling
TTT-E2E represents a fundamental shift in how we think about context in AI:
Traditional Approach:
- Context is external memory that must be attended to
- Longer context means more computation
- Trade-off between performance and efficiency
TTT-E2E Approach:
- Context is compressed into model weights
- Longer context doesn't mean more inference computation
- No trade-off—better performance and efficiency together
Impact on AI Applications
This breakthrough could transform many AI applications:
Long Document Processing:
- Process entire codebases, books, or research papers with constant latency
- Maintain understanding across very long documents
- Enable new applications requiring deep context understanding
Conversational AI:
- Maintain context across extremely long conversations
- Learn user preferences and patterns over time
- Provide consistent, context-aware responses
Agentic AI:
- AI agents that learn from long interaction histories
- Compress experience into actionable knowledge
- Improve performance over time through context compression
Scientific Research:
- Process entire research corpora efficiently
- Maintain understanding across multiple papers
- Enable new forms of scientific discovery
Research Contributions
Novel Methodology
TTT-E2E introduces several innovative concepts:
- Test-time training for context compression - Novel approach to handling long context
- End-to-end meta-learning - Unified optimization of training and test-time learning
- Constant-latency inference - First method to scale well with context length in both loss and latency
- Next-token prediction as compression - Leverages existing training objective for new purpose
Comprehensive Evaluation
The research includes extensive validation:
- Multiple context lengths - Tested up to 2M tokens, with detailed analysis at 128K and 2M
- Multiple architectures - Compared with transformers, Mamba 2, and Gated DeltaNet
- Rigorous benchmarks - Extensive experiments showing no scaling walls
- Performance analysis - Detailed comparison of loss and latency scaling
Open Research
NVIDIA has made the research accessible:
- Published paper - "End-to-End Test-Time Training for Long Context" (paper and code are publicly available)
- Reproducible experiments - Clear experimental setup and results
- Community contribution - Advancing the field of long-context AI
Conclusion
NVIDIA's TTT-E2E (Test-Time Training with End-to-End formulation) represents a fundamental breakthrough in long-context AI research. By enabling LLMs to compress long context into model weights through next-token prediction, TTT-E2E achieves what no previous method could: scaling well in both loss and latency as context length increases.
Key Achievements:
- 2.7x speedup over full attention for 128K context on NVIDIA H100
- 35x speedup for 2M context, demonstrating constant latency regardless of length
- Superior loss scaling - Maintains advantage over full attention as context grows
- No scaling walls - Extensive experiments show consistent scaling trends
- First solution to the fundamental problem of scaling in both dimensions
The Significance:
TTT-E2E reimagines LLM memory by treating context compression as a core capability, similar to how humans compress experience into intuitive understanding. This approach eliminates the traditional trade-off between performance and efficiency, opening new possibilities for AI applications that require deep, long-context understanding.
The research suggests that 2026 may finally see a basic solution to long context in AI, with TTT-E2E providing a foundation for the next generation of context-aware AI systems. As the community addresses current limitations and extends the approach to larger models and new applications, we may see transformative improvements in how AI systems understand and work with long-form content.
Ready to explore more AI breakthroughs? Check out our AI Fundamentals course to understand the building blocks of modern AI, visit our glossary for key terms like transformer and large language models, or browse our models catalog to learn about the latest AI developments.