NVIDIA TTT-E2E: Test-Time Training Long Context

NVIDIA introduces TTT-E2E, enabling LLMs to compress long context into model weights via next-token prediction, achieving constant inference latency regardless of context length.

by HowAIWorks Team
nvidia, test-time-training, long-context, llms, context-compression, ai-research, machine-learning, transformer, mamba, rnn, inference-optimization

Introduction

NVIDIA has introduced TTT-E2E (Test-Time Training with End-to-End formulation), a groundbreaking approach that fundamentally reimagines how large language models handle long context. Unlike traditional methods that struggle with either performance or latency as context grows, TTT-E2E enables LLMs to compress long context into their model weights through next-token prediction, achieving constant inference latency regardless of context length while maintaining superior performance.

Published on January 9, 2026, by researchers Yu Sun and Yejin Choi from NVIDIA, TTT-E2E addresses the most fundamental problem in long-context AI research: scaling with context length in terms of both loss and latency. The method represents a paradigm shift from treating context as external memory to compressing it directly into the model's weights during inference, similar to how humans compress experience into intuitive understanding.

The Fundamental Problem: LLM Memory vs Human Memory

How Humans Remember

Humans excel at improving with more "context" in the form of life experience, despite imperfect recall of exact details. For example, after your first machine learning lecture, you might not remember the instructor's first word, but the intuition you learned likely helps you understand new concepts years later. This is because humans compress massive amounts of experience into their brains, preserving important, predictive information while leaving out many details.

How LLMs Currently Remember

Traditional transformers with self-attention are designed for nearly lossless recall, maintaining full memory of every token by caching and comparing their keys and values. This approach has a critical limitation: the cost per token grows linearly with context length. Processing the 10-millionth token takes one million times longer than processing the 10th token.
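
To make that contrast concrete, here is a back-of-the-envelope cost model (our own illustration, not from the paper) comparing the per-token work of attending over a growing KV cache with the per-token work of updating a fixed-size state. The head dimension, head count, and state size below are arbitrary placeholder values.

```python
# Back-of-the-envelope decode-cost model (illustrative only, not from the paper).
# Full attention compares each new token against every cached key/value pair,
# so per-token work grows with the number of tokens already seen. A model with
# a fixed-size state (an RNN, or TTT-E2E's updated weights) does a constant
# amount of work per token.

def attention_ops_per_token(tokens_seen: int, head_dim: int = 128, num_heads: int = 32) -> int:
    """Rough multiply-add count to attend one new token over the KV cache."""
    # One dot product per cached token for the scores, plus one weighted sum over values.
    return tokens_seen * head_dim * num_heads * 2

def constant_state_ops_per_token(state_size: int = 4096) -> int:
    """Rough count for updating/reading a fixed-size state, independent of history."""
    return state_size * state_size  # e.g. one matrix-vector-style update

for n in (10, 10_000, 10_000_000):
    print(f"{n:>12,} tokens seen: "
          f"attention ~{attention_ops_per_token(n):,} ops, "
          f"fixed state ~{constant_state_ops_per_token():,} ops")
```
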

To address this, modern architectures combine full attention with approximations like:

  • Sliding-window attention - Attends only within a limited local window
  • Mamba - A state space model with a fixed-size recurrent state
  • Gated DeltaNet - A gated recurrent architecture

While these approximations have constant cost per token, they become significantly less effective in longer context compared to full attention, losing important information that would help predict the future.

TTT-E2E: Compressing Context into Weights

The Core Innovation

TTT-E2E solves the long-context problem through compression. Just as humans compress experience into their brains, TTT-E2E compresses context into model weights during test time. The key insight: we know that training with next-token prediction compresses massive amounts of data into model weights, so why not continue this process at test time on the given context?

The TTT-E2E Process (a code sketch follows the list):

  • Step 1: Meta-learning preparation - During training, the model learns how to learn from context through meta-learning
  • Step 2: Test-time training - During inference, the model continues training through next-token prediction on the given context
  • Step 3: Context compression - Important, predictive information gets compressed into the model weights
  • Step 4: Constant latency inference - The model uses compressed weights for prediction, maintaining constant latency
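
As a rough illustration of steps 2-4, here is a minimal PyTorch-style sketch. It is our own, not the authors' released code: the model, chunk size, optimizer, learning rate, and the 512-token local window used during decoding are all illustrative assumptions, and a real implementation would also fold in the meta-learned components from step 1.

```python
import torch
import torch.nn.functional as F

def test_time_train(model, context_ids, chunk_size=2048, lr=1e-4, steps_per_chunk=1):
    """Steps 2-3: continue next-token prediction on the given context so its
    predictive information is compressed into the model's weights."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for start in range(0, context_ids.size(1) - 1, chunk_size):
        chunk = context_ids[:, start:start + chunk_size + 1]   # inputs + shifted targets
        inputs, targets = chunk[:, :-1], chunk[:, 1:]
        for _ in range(steps_per_chunk):
            logits = model(inputs)                              # [batch, seq, vocab]
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   targets.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()                                          # context -> weights
    return model

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=64, window=512):
    """Step 4: decode with the adapted weights. Only a short local window is fed
    per step (an assumption of this sketch), so per-token cost does not grow with
    how long the compressed context was."""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids[:, -window:])
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids
```
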

End-to-End Formulation

TTT-E2E is "end-to-end" in two critical ways:

  1. Inner loop optimization - The inner loop directly optimizes the next-token prediction loss at the end of the network, rather than an auxiliary objective, in contrast to prior work on long-context TTT (e.g., Titans)
  2. Outer loop optimization - Directly optimizes the final loss after TTT, ensuring the meta-learning prepares the model effectively

This dual optimization creates a model that is fundamentally prepared to learn from context during inference, rather than simply processing it.
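
In symbols, one way to write the two loops down (our simplified notation, shown with a single inner gradient step; the paper's exact formulation may differ):

```latex
\[
\underbrace{\;W(c) \;=\; W_0 \;-\; \eta\,\nabla_{W}\,\mathcal{L}_{\mathrm{NTP}}(W_0;\,c)\;}_{\text{inner loop: test-time training on the context } c}
\qquad
\underbrace{\;\min_{W_0,\;\eta}\;\; \mathbb{E}\!\left[\,\mathcal{L}_{\mathrm{NTP}}\big(W(c);\;\text{future tokens}\big)\right]\;}_{\text{outer loop: meta-learning of the initialization}}
\]
```
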

Performance Results: Scaling in Both Dimensions

Loss Scaling: From Worst to Best

The left panel of Figure 1 shows TTT-E2E's remarkable performance in terms of loss scaling:

  • At 128K context: TTT-E2E turns the worst-performing curve (gray) into the best (light green)
  • Consistent loss ∆: Maintains the same advantage over full attention as context length increases
  • No degradation: While other methods' loss ∆ worsens at longer context, TTT-E2E's advantage holds

The loss ∆ metric is computed as (loss of the reported method) − (loss of transformer with full attention), so full attention itself is the flat line at y=0. TTT-E2E not only matches full attention's loss performance but actually improves upon it while maintaining constant latency.
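
Restated as a formula (this is just the definition above, with c denoting the evaluation context length):

```latex
\[
\Delta\mathcal{L}_{\text{method}}(c) \;=\; \mathcal{L}_{\text{method}}(c) \;-\; \mathcal{L}_{\text{full attention}}(c),
\qquad \text{so } \Delta\mathcal{L}_{\text{full attention}}(c) \equiv 0 .
\]
```
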

Latency Scaling: Constant Time Inference

The right panel demonstrates TTT-E2E's constant inference latency:

  • 2.7x speedup over full attention for 128K context on NVIDIA H100
  • 35x speedup over full attention for 2M context
  • Constant latency regardless of context length, similar to RNNs
  • No scaling walls observed across extensive experiments

All tested models have 3B parameters and were trained with 164B tokens, ensuring fair comparison across methods.

The Unique Achievement

TTT-E2E is the first method to show a sign of life on the fundamental problem of scaling with context length in terms of both loss and latency. All other methods exhibit qualitatively different trends:

  • Transformers with full attention: Scale well in loss but not latency
  • RNNs (Mamba 2, Gated DeltaNet): Scale well in latency but not loss
  • TTT-E2E: Scales well in both dimensions

The research community may finally have a basic solution to long context in 2026.

How TTT-E2E Works: Technical Deep Dive

Meta-Learning for Test-Time Training

The effectiveness of TTT-E2E depends on proper preparation during training. The model undergoes meta-learning that prepares its initialization for test-time training:

  • Outer loop: Optimizes the model's initial parameters to be good at learning from context
  • Inner loop: During test time, the model performs gradient updates on the given context
  • End-to-end: Both loops are optimized together, ensuring the model learns how to learn effectively

This meta-learning approach is crucial—without it, simple test-time training would not be effective for long-context compression.
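
The following toy PyTorch sketch shows why the outer loop needs gradients of gradients: it differentiates the post-TTT loss back through one inner update via create_graph=True. The tiny embedding-plus-linear model, the synthetic token batches, and the learning rates are all illustrative placeholders, not the paper's setup; in a real transformer, this is the point where the attention kernel must support double backward.

```python
import torch
import torch.nn.functional as F

vocab, dim = 100, 32
emb = torch.nn.Embedding(vocab, dim)
head = torch.nn.Linear(dim, vocab)
params = list(emb.parameters()) + list(head.parameters())
meta_opt = torch.optim.Adam(params, lr=1e-3)
inner_lr = 0.1

def ntp_loss(weights, ids, targets):
    # Functional forward pass so the loss can be evaluated at "fast" weights.
    e_w, h_w, h_b = weights
    hidden = F.embedding(ids, e_w)
    logits = hidden @ h_w.t() + h_b
    return F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))

context_ids = torch.randint(0, vocab, (1, 64))   # synthetic "context"
future_ids = torch.randint(0, vocab, (1, 64))    # synthetic "continuation"

fast = (emb.weight, head.weight, head.bias)

# Inner loop: one next-token-prediction step on the context, kept differentiable.
inner = ntp_loss(fast, context_ids[:, :-1], context_ids[:, 1:])
grads = torch.autograd.grad(inner, fast, create_graph=True)   # <-- second-order path
fast_updated = tuple(w - inner_lr * g for w, g in zip(fast, grads))

# Outer loop: evaluate the post-TTT weights on the continuation and backprop
# through the inner update into the original initialization.
outer = ntp_loss(fast_updated, future_ids[:, :-1], future_ids[:, 1:])
meta_opt.zero_grad()
outer.backward()
meta_opt.step()
```
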

Next-Token Prediction as Compression

During test time, TTT-E2E compresses context through next-token prediction:

  1. Context processing: The model receives a long context sequence
  2. Gradient updates: The model performs a few gradient steps on next-token prediction
  3. Weight updates: Model weights are updated to better predict tokens in this context
  4. Compression achieved: Important, predictive information is now encoded in the weights
  5. Inference: The model uses updated weights for final predictions with constant latency

This process mirrors how humans learn: we don't store every detail, but we compress experience into understanding that helps us predict and understand future situations.
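
One way to see next-token prediction as literal compression (a standard information-theoretic reading, not something specific to this paper): the model's average cross-entropy on the context's continuation, measured in bits, is the per-token code length an ideal arithmetic coder driven by the model would need, so lower loss after test-time training means the context has been packed into the weights more effectively. The snippet below just computes that quantity; the vocabulary size and tensor shapes are arbitrary.

```python
import math
import torch
import torch.nn.functional as F

def bits_per_token(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Average negative log2-probability the model assigns to the true next tokens."""
    nats = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return nats.item() / math.log(2)

# Uniform logits over a 50k vocabulary give exactly log2(50_000) ≈ 15.6 bits per
# token, i.e. no compression at all. A model whose weights have absorbed the
# context should drive this number down on that context's continuation.
logits = torch.zeros(1, 128, 50_000)
targets = torch.randint(0, 50_000, (1, 128))
print(f"{bits_per_token(logits, targets):.2f} bits/token")
```
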

Constant Latency Architecture

The key to constant latency is that TTT-E2E doesn't need to attend to every token during inference:

  • Compressed representation: Context information is already in the weights
  • No full attention: The model doesn't need to compare against all previous tokens
  • Efficient inference: Similar to RNNs, processing each new token has constant cost
  • Scalable: Works the same way whether context is 1K or 2M tokens

Comparison with Existing Methods

Transformers with Full Attention

Strengths:

  • Excellent loss scaling with context length
  • Maintains all context information
  • Proven architecture

Weaknesses:

  • Linear latency scaling with context
  • Becomes prohibitively slow for very long contexts
  • Attention compute over the full sequence grows quadratically, while KV-cache memory grows linearly

TTT-E2E Advantage: Matches or exceeds loss performance while maintaining constant latency.

RNNs (Mamba 2, Gated DeltaNet)

Strengths:

  • Constant latency regardless of context length
  • Memory efficient
  • Fast inference

Weaknesses:

  • Loss degrades significantly with longer context
  • Loses important information
  • Less effective than full attention

TTT-E2E Advantage: Maintains constant latency like RNNs while achieving better loss than full attention.

Prior Test-Time Training Methods

Previous TTT approaches:

  • Often used auxiliary tasks or separate objectives
  • Didn't optimize end-to-end
  • Less effective for long-context compression

TTT-E2E Advantage: Direct optimization of next-token prediction with end-to-end meta-learning makes it far more effective.

The Role of RAG in the TTT-E2E Era

TTT-E2E and RAG (Retrieval-Augmented Generation) serve complementary roles:

TTT-E2E is like updating the human brain—it compresses context into intuitive, predictive understanding that persists and helps with future tasks. The model's weights encode learned patterns and insights.

RAG is like writing things down and looking them up in a notepad—it provides access to detailed information when specifics matter, like shopping for a long list of groceries.

The Relationship:

  • A person's productivity is mostly determined by their brain, not by the notepads they carry
  • An AI agent's productivity is likewise mostly determined by how well it compresses context into predictive information
  • RAG remains valuable for accessing specific details, but TTT-E2E provides the foundational understanding

Both approaches will likely coexist, with TTT-E2E handling the core reasoning and RAG providing detailed retrieval when needed.

Limitations and Future Work

Current Limitations

Meta-Learning Overhead:

  • Current meta-learning implementation is 3.4x slower than standard pre-training for short context (8K)
  • This is because the standard API of FlashAttention does not support gradients of gradients
  • The limitation affects training time, not inference time

Potential Solutions:

  1. Custom attention kernels - Develop kernels that support gradients of gradients
  2. Hybrid initialization - Initialize TTT-E2E from a standard Transformer pre-trained without TTT

Future Research Directions

The research opens several exciting directions for the community:

  • Larger models - Extending TTT-E2E beyond the 3B-parameter scale evaluated so far
  • Efficiency improvements - Speeding up the meta-learning phase through custom kernels or hybrid initialization
  • Open questions - Further directions the work leaves to the community

Implications for AI Development

Paradigm Shift in Context Handling

TTT-E2E represents a fundamental shift in how we think about context in AI:

Traditional Approach:

  • Context is external memory that must be attended to
  • Longer context means more computation
  • Trade-off between performance and efficiency

TTT-E2E Approach:

  • Context is compressed into model weights
  • Longer context doesn't mean more inference computation
  • No trade-off—better performance and efficiency together

Impact on AI Applications

This breakthrough could transform many AI applications:

Long Document Processing:

  • Process entire codebases, books, or research papers with constant latency
  • Maintain understanding across very long documents
  • Enable new applications requiring deep context understanding

Conversational AI:

  • Maintain context across extremely long conversations
  • Learn user preferences and patterns over time
  • Provide consistent, context-aware responses

Agentic AI:

  • AI agents that learn from long interaction histories
  • Compress experience into actionable knowledge
  • Improve performance over time through context compression

Scientific Research:

  • Process entire research corpora efficiently
  • Maintain understanding across multiple papers
  • Enable new forms of scientific discovery

Research Contributions

Novel Methodology

TTT-E2E introduces several innovative concepts:

  • Test-time training for context compression - Novel approach to handling long context
  • End-to-end meta-learning - Unified optimization of training and test-time learning
  • Scaling in both dimensions - First method to scale well in both loss and latency as context grows
  • Next-token prediction as compression - Leverages existing training objective for new purpose

Comprehensive Evaluation

The research includes extensive validation:

  • Multiple context lengths - Tested up to 2M tokens, with detailed analysis at 128K and 2M
  • Multiple architectures - Compared with transformers, Mamba 2, and Gated DeltaNet
  • Rigorous benchmarks - Extensive experiments showing no scaling walls
  • Performance analysis - Detailed comparison of loss and latency scaling

Open Research

NVIDIA has made the research accessible:

  • Published paper - "End-to-End Test-Time Training for Long Context" (paper and code are publicly available)
  • Reproducible experiments - Clear experimental setup and results
  • Community contribution - Advancing the field of long-context AI

Conclusion

NVIDIA's TTT-E2E (Test-Time Training with End-to-End formulation) represents a fundamental breakthrough in long-context AI research. By enabling LLMs to compress long context into model weights through next-token prediction, TTT-E2E achieves what no previous method could: scaling well in both loss and latency as context length increases.

Key Achievements:

  • 2.7x speedup over full attention for 128K context on NVIDIA H100
  • 35x speedup for 2M context, demonstrating constant latency regardless of length
  • Superior loss scaling - Maintains advantage over full attention as context grows
  • No scaling walls - Extensive experiments show consistent scaling trends
  • First solution to the fundamental problem of scaling in both dimensions

The Significance:

TTT-E2E reimagines LLM memory by treating context compression as a core capability, similar to how humans compress experience into intuitive understanding. This approach eliminates the traditional trade-off between performance and efficiency, opening new possibilities for AI applications that require deep, long-context understanding.

The research suggests that 2026 may finally see a basic solution to long context in AI, with TTT-E2E providing a foundation for the next generation of context-aware AI systems. As the community addresses current limitations and extends the approach to larger models and new applications, we may see transformative improvements in how AI systems understand and work with long-form content.

Ready to explore more AI breakthroughs? Check out our AI Fundamentals course to understand the building blocks of modern AI, visit our glossary for key terms like transformer and large language models, or browse our models catalog to learn about the latest AI developments.

Sources

  • "End-to-End Test-Time Training for Long Context" - NVIDIA research paper by Yu Sun and Yejin Choi (paper and code publicly available)

This article covers groundbreaking research in long-context AI. For more cutting-edge AI news and analysis, check out our blog or explore related topics in our AI fundamentals courses.

Frequently Asked Questions

What is TTT-E2E?
TTT-E2E (Test-Time Training with End-to-End formulation) enables LLMs to compress long context into model weights through next-token prediction during inference, achieving constant latency regardless of context length while maintaining performance.

How does TTT-E2E compare to transformers and RNNs?
TTT-E2E outperforms both: transformers scale well in loss but not latency, while RNNs like Mamba scale well in latency but not loss. TTT-E2E is the only method that scales well in both dimensions.

What speedups does TTT-E2E achieve?
TTT-E2E achieves a 2.7x speedup over full attention for 128K context and a 35x speedup for 2M context on NVIDIA H100, with constant inference latency regardless of context length and superior loss scaling.

How does TTT-E2E compress context into weights?
During test time, the model continues training through next-token prediction on the given context, compressing important information into its weights, similar to how humans compress experience into memory.

Why is meta-learning needed?
Meta-learning during training prepares the model's initialization for test-time training, making the end-to-end approach effective: the model learns how to learn from context during inference.

What are the current limitations?
The current meta-learning implementation is 3.4x slower than standard pre-training for short contexts because FlashAttention's standard API does not support gradients of gradients; this can be addressed with custom kernels or hybrid initialization.

Continue Your AI Journey

Explore our lessons and glossary to deepen your understanding.