NVIDIA TTT-E2E: Test-Time Training Long Context

NVIDIA introduces TTT-E2E, enabling LLMs to compress long context into model weights via next-token prediction, achieving constant inference latency regardless of context length.

by HowAIWorks Team
nvidia, test-time-training, long-context, llms, context-compression, ai-research, machine-learning, transformer, mamba, rnn, inference-optimization

Introduction

NVIDIA has introduced TTT-E2E (Test-Time Training with End-to-End formulation), a groundbreaking approach that fundamentally reimagines how large language models handle long context. Unlike traditional methods that struggle with either performance or latency as context grows, TTT-E2E enables LLMs to compress long context into their model weights through next-token prediction, achieving constant inference latency regardless of context length while maintaining superior performance.

Published on January 9, 2026, by researchers Yu Sun and Yejin Choi from NVIDIA, TTT-E2E addresses the most fundamental problem in long-context AI research: scaling with context length in terms of both loss and latency. The method represents a paradigm shift from treating context as external memory to compressing it directly into the model's weights during inference, similar to how humans compress experience into intuitive understanding.

The Fundamental Problem: LLM Memory vs Human Memory

How Humans Remember

Humans excel at improving with more "context" in the form of life experience, despite imperfect recall of exact details. For example, after your first machine learning lecture, you might not remember the instructor's first word, but the intuition you learned likely helps you understand new concepts years later. This is because humans compress massive amounts of experience into their brains, preserving important, predictive information while leaving out many details.

How LLMs Currently Remember

Traditional transformers with self-attention are designed for nearly lossless recall, maintaining full memory of every token by caching and comparing their keys and values. This approach has a critical limitation: the cost per token grows linearly with context length. Processing the 10-millionth token takes one million times longer than processing the 10th token.
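
To make that contrast concrete, here is a back-of-the-envelope cost model (our own illustration, not from the paper) comparing the per-token work of attending over a growing KV cache with the per-token work of updating a fixed-size state. The head dimension, head count, and state size below are arbitrary placeholder values.

```python
# Back-of-the-envelope decode-cost model (illustrative only, not from the paper).
# Full attention compares each new token against every cached key/value pair,
# so per-token work grows with the number of tokens already seen. A model with
# a fixed-size state (an RNN, or TTT-E2E's updated weights) does a constant
# amount of work per token.

def attention_ops_per_token(tokens_seen: int, head_dim: int = 128, num_heads: int = 32) -> int:
    """Rough multiply-add count to attend one new token over the KV cache."""
    # One dot product per cached token for the scores, plus one weighted sum over values.
    return tokens_seen * head_dim * num_heads * 2

def constant_state_ops_per_token(state_size: int = 4096) -> int:
    """Rough count for updating/reading a fixed-size state, independent of history."""
    return state_size * state_size  # e.g. one matrix-vector-style update

for n in (10, 10_000, 10_000_000):
    print(f"{n:>12,} tokens seen: "
          f"attention ~{attention_ops_per_token(n):,} ops, "
          f"fixed state ~{constant_state_ops_per_token():,} ops")
```
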

To address this, modern architectures combine full attention with approximations like:

  • Sliding-window attention - Attends only within a limited local window
  • Mamba - A state space model with a fixed-size recurrent state
  • Gated DeltaNet - A gated recurrent architecture

While these approximations have constant cost per token, they become significantly less effective in longer context compared to full attention, losing important information that would help predict the future.

TTT-E2E: Compressing Context into Weights

The Core Innovation

TTT-E2E solves the long-context problem through compression. Just as humans compress experience into their brains, TTT-E2E compresses context into model weights during test time. The key insight: we know that training with next-token prediction compresses massive amounts of data into model weights, so why not continue this process at test time on the given context?

The TTT-E2E Process (a code sketch follows the list):

  • Step 1: Meta-learning preparation - During training, the model learns how to learn from context through meta-learning
  • Step 2: Test-time training - During inference, the model continues training through next-token prediction on the given context
  • Step 3: Context compression - Important, predictive information gets compressed into the model weights
  • Step 4: Constant latency inference - The model uses compressed weights for prediction, maintaining constant latency
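
As a rough illustration of steps 2-4, here is a minimal PyTorch-style sketch. It is our own, not the authors' released code: the model, chunk size, optimizer, learning rate, and the 512-token local window used during decoding are all illustrative assumptions, and a real implementation would also fold in the meta-learned components from step 1.

```python
import torch
import torch.nn.functional as F

def test_time_train(model, context_ids, chunk_size=2048, lr=1e-4, steps_per_chunk=1):
    """Steps 2-3: continue next-token prediction on the given context so its
    predictive information is compressed into the model's weights."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for start in range(0, context_ids.size(1) - 1, chunk_size):
        chunk = context_ids[:, start:start + chunk_size + 1]   # inputs + shifted targets
        inputs, targets = chunk[:, :-1], chunk[:, 1:]
        for _ in range(steps_per_chunk):
            logits = model(inputs)                              # [batch, seq, vocab]
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   targets.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()                                          # context -> weights
    return model

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=64, window=512):
    """Step 4: decode with the adapted weights. Only a short local window is fed
    per step (an assumption of this sketch), so per-token cost does not grow with
    how long the compressed context was."""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids[:, -window:])
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids
```
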

End-to-End Formulation

TTT-E2E is "end-to-end" in two critical ways:

  1. Inner loop optimization - The inner loop directly optimizes the next-token prediction loss at the end of the network, rather than an auxiliary objective, in contrast to prior work on long-context TTT (e.g., Titans)
  2. Outer loop optimization - Directly optimizes the final loss after TTT, ensuring the meta-learning prepares the model effectively

This dual optimization creates a model that is fundamentally prepared to learn from context during inference, rather than simply processing it.
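
In symbols, one way to write the two loops down (our simplified notation, shown with a single inner gradient step; the paper's exact formulation may differ):

```latex
\[
\underbrace{\;W(c) \;=\; W_0 \;-\; \eta\,\nabla_{W}\,\mathcal{L}_{\mathrm{NTP}}(W_0;\,c)\;}_{\text{inner loop: test-time training on the context } c}
\qquad
\underbrace{\;\min_{W_0,\;\eta}\;\; \mathbb{E}\!\left[\,\mathcal{L}_{\mathrm{NTP}}\big(W(c);\;\text{future tokens}\big)\right]\;}_{\text{outer loop: meta-learning of the initialization}}
\]
```
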

Performance Results: Scaling in Both Dimensions

Loss Scaling: From Worst to Best

The left panel of Figure 1 shows TTT-E2E's remarkable performance in terms of loss scaling:

  • At 128K context: TTT-E2E turns the worst-performing curve (gray) into the best (light green)
  • Consistent loss ∆: Maintains the same advantage over full attention as context length increases
  • No degradation: While other methods' loss ∆ worsens at longer context, TTT-E2E's advantage holds

The loss ∆ metric is computed as (loss of the reported method) − (loss of transformer with full attention), so full attention itself is the flat line at y=0. TTT-E2E not only matches full attention's loss performance but actually improves upon it while maintaining constant latency.
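
Restated as a formula (this is just the definition above, with c denoting the evaluation context length):

```latex
\[
\Delta\mathcal{L}_{\text{method}}(c) \;=\; \mathcal{L}_{\text{method}}(c) \;-\; \mathcal{L}_{\text{full attention}}(c),
\qquad \text{so } \Delta\mathcal{L}_{\text{full attention}}(c) \equiv 0 .
\]
```
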

Latency Scaling: Constant Time Inference

The right panel demonstrates TTT-E2E's constant inference latency:

  • 2.7x speedup over full attention for 128K context on NVIDIA H100
  • 35x speedup over full attention for 2M context
  • Constant latency regardless of context length, similar to RNNs
  • No scaling walls observed across extensive experiments

All tested models have 3B parameters and were trained with 164B tokens, ensuring fair comparison across methods.

The Unique Achievement

TTT-E2E is the first method to show a sign of life on the fundamental problem of scaling with context length in terms of both loss and latency. All other methods exhibit qualitatively different trends:

  • Transformers with full attention: Scale well in loss but not latency
  • RNNs (Mamba 2, Gated DeltaNet): Scale well in latency but not loss
  • TTT-E2E: Scales well in both dimensions

The research community may finally have a basic solution to long context in 2026.

How TTT-E2E Works: Technical Deep Dive

Meta-Learning for Test-Time Training

The effectiveness of TTT-E2E depends on proper preparation during training. The model undergoes meta-learning that prepares its initialization for test-time training:

  • Outer loop: Optimizes the model's initial parameters to be good at learning from context
  • Inner loop: During test time, the model performs gradient updates on the given context
  • End-to-end: Both loops are optimized together, ensuring the model learns how to learn effectively

This meta-learning approach is crucial—without it, simple test-time training would not be effective for long-context compression.
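
The following toy PyTorch sketch shows why the outer loop needs gradients of gradients: it differentiates the post-TTT loss back through one inner update via create_graph=True. The tiny embedding-plus-linear model, the synthetic token batches, and the learning rates are all illustrative placeholders, not the paper's setup; in a real transformer, this is the point where the attention kernel must support double backward.

```python
import torch
import torch.nn.functional as F

vocab, dim = 100, 32
emb = torch.nn.Embedding(vocab, dim)
head = torch.nn.Linear(dim, vocab)
params = list(emb.parameters()) + list(head.parameters())
meta_opt = torch.optim.Adam(params, lr=1e-3)
inner_lr = 0.1

def ntp_loss(weights, ids, targets):
    # Functional forward pass so the loss can be evaluated at "fast" weights.
    e_w, h_w, h_b = weights
    hidden = F.embedding(ids, e_w)
    logits = hidden @ h_w.t() + h_b
    return F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))

context_ids = torch.randint(0, vocab, (1, 64))   # synthetic "context"
future_ids = torch.randint(0, vocab, (1, 64))    # synthetic "continuation"

fast = (emb.weight, head.weight, head.bias)

# Inner loop: one next-token-prediction step on the context, kept differentiable.
inner = ntp_loss(fast, context_ids[:, :-1], context_ids[:, 1:])
grads = torch.autograd.grad(inner, fast, create_graph=True)   # <-- second-order path
fast_updated = tuple(w - inner_lr * g for w, g in zip(fast, grads))

# Outer loop: evaluate the post-TTT weights on the continuation and backprop
# through the inner update into the original initialization.
outer = ntp_loss(fast_updated, future_ids[:, :-1], future_ids[:, 1:])
meta_opt.zero_grad()
outer.backward()
meta_opt.step()
```
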

Next-Token Prediction as Compression

During test time, TTT-E2E compresses context through next-token prediction:

  1. Context processing: The model receives a long context sequence
  2. Gradient updates: The model performs a few gradient steps on next-token prediction
  3. Weight updates: Model weights are updated to better predict tokens in this context
  4. Compression achieved: Important, predictive information is now encoded in the weights
  5. Inference: The model uses updated weights for final predictions with constant latency

This process mirrors how humans learn: we don't store every detail, but we compress experience into understanding that helps us predict and understand future situations.
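
One way to see next-token prediction as literal compression (a standard information-theoretic reading, not something specific to this paper): the model's average cross-entropy on the context's continuation, measured in bits, is the per-token code length an ideal arithmetic coder driven by the model would need, so lower loss after test-time training means the context has been packed into the weights more effectively. The snippet below just computes that quantity; the vocabulary size and tensor shapes are arbitrary.

```python
import math
import torch
import torch.nn.functional as F

def bits_per_token(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Average negative log2-probability the model assigns to the true next tokens."""
    nats = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return nats.item() / math.log(2)

# Uniform logits over a 50k vocabulary give exactly log2(50_000) ≈ 15.6 bits per
# token, i.e. no compression at all. A model whose weights have absorbed the
# context should drive this number down on that context's continuation.
logits = torch.zeros(1, 128, 50_000)
targets = torch.randint(0, 50_000, (1, 128))
print(f"{bits_per_token(logits, targets):.2f} bits/token")
```
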

Constant Latency Architecture

The key to constant latency is that TTT-E2E doesn't need to attend to every token during inference:

  • Compressed representation: Context information is already in the weights
  • No full attention: The model doesn't need to compare against all previous tokens
  • Efficient inference: Similar to RNNs, processing each new token has constant cost
  • Scalable: Works the same way whether context is 1K or 2M tokens

Comparison with Existing Methods

Transformers with Full Attention

Strengths:

  • Excellent loss scaling with context length
  • Maintains all context information
  • Proven architecture

Weaknesses:

  • Linear latency scaling with context
  • Becomes prohibitively slow for very long contexts
  • Attention compute over the full sequence grows quadratically, while KV-cache memory grows linearly

TTT-E2E Advantage: Matches or exceeds loss performance while maintaining constant latency.

RNNs (Mamba 2, Gated DeltaNet)

Strengths:

  • Constant latency regardless of context length
  • Memory efficient
  • Fast inference

Weaknesses:

  • Loss degrades significantly with longer context
  • Loses important information
  • Less effective than full attention

TTT-E2E Advantage: Maintains constant latency like RNNs while achieving better loss than full attention.

Prior Test-Time Training Methods

Previous TTT approaches:

  • Often used auxiliary tasks or separate objectives
  • Didn't optimize end-to-end
  • Less effective for long-context compression

TTT-E2E Advantage: Direct optimization of next-token prediction with end-to-end meta-learning makes it far more effective.

The Role of RAG in the TTT-E2E Era

TTT-E2E and RAG (Retrieval-Augmented Generation) serve complementary roles:

TTT-E2E is like updating the human brain—it compresses context into intuitive, predictive understanding that persists and helps with future tasks. The model's weights encode learned patterns and insights.

RAG is like writing things down and looking them up in a notepad—it provides access to detailed information when specifics matter, like shopping for a long list of groceries.

The Relationship:

  • A person's productivity is mostly determined by their brain, not by the notepads they carry
  • An AI agent's productivity is likewise mostly determined by how well it compresses context into predictive information
  • RAG remains valuable for accessing specific details, but TTT-E2E provides the foundational understanding

Both approaches will likely coexist, with TTT-E2E handling the core reasoning and RAG providing detailed retrieval when needed.

Limitations and Future Work

Current Limitations

Meta-Learning Overhead:

  • Current meta-learning implementation is 3.4x slower than standard pre-training for short context (8K)
  • This is because the standard API of FlashAttention does not support gradients of gradients
  • The limitation affects training time, not inference time

Potential Solutions:

  1. Custom attention kernels - Develop kernels that support gradients of gradients
  2. Hybrid initialization - Initialize TTT-E2E from a standard Transformer pre-trained without TTT

Future Research Directions

The research opens several exciting directions for the community:

  • Larger models - Extending TTT-E2E beyond the 3B-parameter scale evaluated so far
  • Efficiency improvements - Speeding up the meta-learning phase through custom kernels or hybrid initialization
  • Open questions - Further directions the work leaves to the community

Implications for AI Development

Paradigm Shift in Context Handling

TTT-E2E represents a fundamental shift in how we think about context in AI:

Traditional Approach:

  • Context is external memory that must be attended to
  • Longer context means more computation
  • Trade-off between performance and efficiency

TTT-E2E Approach:

  • Context is compressed into model weights
  • Longer context doesn't mean more inference computation
  • No trade-off—better performance and efficiency together

Impact on AI Applications

This breakthrough could transform many AI applications:

Long Document Processing:

  • Process entire codebases, books, or research papers with constant latency
  • Maintain understanding across very long documents
  • Enable new applications requiring deep context understanding

Conversational AI:

  • Maintain context across extremely long conversations
  • Learn user preferences and patterns over time
  • Provide consistent, context-aware responses

Agentic AI:

  • AI agents that learn from long interaction histories
  • Compress experience into actionable knowledge
  • Improve performance over time through context compression

Scientific Research:

  • Process entire research corpora efficiently
  • Maintain understanding across multiple papers
  • Enable new forms of scientific discovery

Research Contributions

Novel Methodology

TTT-E2E introduces several innovative concepts:

  • Test-time training for context compression - Novel approach to handling long context
  • End-to-end meta-learning - Unified optimization of training and test-time learning
  • Scaling in both dimensions - First method to scale well in both loss and latency as context grows
  • Next-token prediction as compression - Leverages existing training objective for new purpose

Comprehensive Evaluation

The research includes extensive validation:

  • Multiple context lengths - Tested up to 2M tokens, with detailed analysis at 128K and 2M
  • Multiple architectures - Compared with transformers, Mamba 2, and Gated DeltaNet
  • Rigorous benchmarks - Extensive experiments showing no scaling walls
  • Performance analysis - Detailed comparison of loss and latency scaling

Open Research

NVIDIA has made the research accessible:

  • Published paper - "End-to-End Test-Time Training for Long Context" (paper and code are publicly available)
  • Reproducible experiments - Clear experimental setup and results
  • Community contribution - Advancing the field of long-context AI

Conclusion

NVIDIA's TTT-E2E (Test-Time Training with End-to-End formulation) represents a fundamental breakthrough in long-context AI research. By enabling LLMs to compress long context into model weights through next-token prediction, TTT-E2E achieves what no previous method could: scaling well in both loss and latency as context length increases.

Key Achievements:

  • 2.7x speedup over full attention for 128K context on NVIDIA H100
  • 35x speedup for 2M context, demonstrating constant latency regardless of length
  • Superior loss scaling - Maintains advantage over full attention as context grows
  • No scaling walls - Extensive experiments show consistent scaling trends
  • First solution to the fundamental problem of scaling in both dimensions

The Significance:

TTT-E2E reimagines LLM memory by treating context compression as a core capability, similar to how humans compress experience into intuitive understanding. This approach eliminates the traditional trade-off between performance and efficiency, opening new possibilities for AI applications that require deep, long-context understanding.

The research suggests that 2026 may finally see a basic solution to long context in AI, with TTT-E2E providing a foundation for the next generation of context-aware AI systems. As the community addresses current limitations and extends the approach to larger models and new applications, we may see transformative improvements in how AI systems understand and work with long-form content.

Ready to explore more AI breakthroughs? Check out our AI Fundamentals course to understand the building blocks of modern AI, visit our glossary for key terms like transformer and large language models, or browse our models catalog to learn about the latest AI developments.

Sources

  • "End-to-End Test-Time Training for Long Context" - NVIDIA research paper by Yu Sun and Yejin Choi (paper and code publicly available)

This article covers groundbreaking research in long-context AI. For more cutting-edge AI news and analysis, check out our blog or explore related topics in our AI fundamentals courses.

Frequently Asked Questions

What is TTT-E2E?
TTT-E2E (Test-Time Training with End-to-End formulation) enables LLMs to compress long context into model weights through next-token prediction during inference, achieving constant latency regardless of context length while maintaining performance.

How does TTT-E2E compare to transformers and RNNs?
TTT-E2E outperforms both: transformers scale well in loss but not latency, while RNNs like Mamba scale well in latency but not loss. TTT-E2E is the only method that scales well in both dimensions.

What speedups does TTT-E2E achieve?
TTT-E2E achieves a 2.7x speedup over full attention for 128K context and a 35x speedup for 2M context on NVIDIA H100, with constant inference latency regardless of context length and superior loss scaling.

How does TTT-E2E compress context into weights?
During test time, the model continues training through next-token prediction on the given context, compressing important information into its weights, similar to how humans compress experience into memory.

Why is meta-learning needed?
Meta-learning during training prepares the model's initialization for test-time training, making the end-to-end approach effective: the model learns how to learn from context during inference.

What are the current limitations?
The current meta-learning implementation is 3.4x slower than standard pre-training for short contexts because FlashAttention's standard API does not support gradients of gradients; this can be addressed with custom kernels or hybrid initialization.

Continue Your AI Journey

Explore our lessons and glossary to deepen your understanding.