Introduction
As AI development matures, the focus is shifting from crafting perfect prompts to a more fundamental challenge: context engineering. In a comprehensive engineering blog post published on September 29, 2025, Anthropic's engineering team revealed that building effective AI agents is less about finding the right words and more about answering a critical question: "What configuration of context is most likely to generate our model's desired behavior?"
Released alongside Claude Sonnet 4.5, this guidance represents years of experience building sophisticated agents like Claude Code and marks a significant evolution in AI development methodology. Context represents the set of tokens included when sampling from a large language model (LLM), and the engineering challenge is optimizing the utility of those tokens against inherent LLM constraints to consistently achieve desired outcomes.
This methodology complements Anthropic's recent guidance on writing effective tools for agents and the Claude Agent SDK, forming a comprehensive framework for building production-ready AI agents.
From Prompt Engineering to Context Engineering
The Evolution of AI Development
Prompt Engineering emerged as the primary focus in early LLM applications, centered on writing and organizing instructions for optimal outcomes. This approach worked well for one-shot classification and simple text generation tasks.
Context Engineering represents the natural progression, encompassing strategies for curating and maintaining the optimal set of tokens during inference. This includes:
- System instructions
- Tool definitions
- Model Context Protocol (MCP) integrations
- External data sources
- Message history
- Memory systems
As agents operate over multiple turns and longer time horizons, they generate exponentially more data that could be relevant for the next inference step. Context engineering addresses the challenge of cyclically refining this information to maintain only what's necessary.
Why This Shift Matters
The difference is fundamental:
- Prompt engineering: Discrete task of writing a single prompt
- Context engineering: Iterative process where curation happens at each inference step
This shift reflects the move from simple, stateless interactions to sophisticated agents that maintain state, use tools, and operate over extended periods.
The Attention Budget: Why Context is Finite
Understanding Context Rot
Research on "needle-in-a-haystack" benchmarking has uncovered a critical phenomenon: context rot. As the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases.
This degradation occurs across all models, though some exhibit a gentler decline than others. The implication is clear: context must be treated as a finite resource with diminishing marginal returns.
Architectural Constraints
The attention scarcity stems from the transformer architecture itself:
N² Complexity: Every token in the context can attend to every other token, creating n² pairwise relationships for n tokens. As context length increases, the model's ability to capture these relationships gets stretched thin.
Training Distribution: Models develop attention patterns from training data where shorter sequences are more common than longer ones. This means models have less experience with, and fewer specialized parameters for, long-range dependencies.
Position Encoding: Techniques like position-encoding interpolation let models handle sequences longer than those they were trained on by remapping positions into the original range, though at some cost to the model's sense of token position.
These factors create a performance gradient rather than a hard cliff - models remain capable at longer contexts but may show reduced precision for information retrieval and long-range reasoning.
Anatomy of Effective Context
The Guiding Principle
Good context engineering means finding the smallest possible set of high-signal tokens that maximize the likelihood of the desired outcome. Here's how this applies to each component:
1. System Prompts: The Right Altitude
System prompts should use simple, direct language at the "right altitude" - a balance between two extremes:
Too Low (Brittle):
- Hardcoded complex logic
- Excessive if-else instructions
- Fragile, difficult to maintain
- Over-specification
Too High (Vague):
- Overly general guidance
- False assumptions about shared context
- Unclear expectations
- Under-specification
Optimal Approach:
- Specific enough to guide behavior effectively
- Flexible enough to provide strong heuristics
- Organized into distinct sections (background, instructions, tool guidance, output description)
- Use XML tags or Markdown headers for clear delineation
- Start minimal, add based on observed failure modes
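To make the "right altitude" concrete, here is a minimal sketch of a system prompt organized into distinct XML-tagged sections. The section names and wording are illustrative assumptions, not a template from Anthropic's post:

```python
# Illustrative system prompt with distinct, XML-delimited sections:
# background, instructions, tool guidance, and output description.
SYSTEM_PROMPT = """
<background>
You are a coding assistant working in a Python monorepo.
</background>

<instructions>
Prefer small, reviewable changes. When a task is ambiguous,
state your assumption in one sentence and proceed.
</instructions>

<tool_guidance>
Use search tools to locate symbols before reading whole files.
</tool_guidance>

<output_description>
Reply with a short plan, then the change itself.
</output_description>
""".strip()
```

Each section is specific enough to guide behavior but leaves room for the model's own judgment, and the tags make it easy to add or remove sections as failure modes are observed.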
2. Tools: Promoting Efficiency
Tools define the contract between agents and their information/action space. Effective tools:
Design Principles:
- Minimal overlap in functionality
- Self-contained and robust to errors
- Crystal clear intended use
- Descriptive, unambiguous parameters
- Token-efficient return values
- Encourage efficient agent behaviors
Common Failure Modes:
- Bloated tool sets covering too much functionality
- Ambiguous decision points about which tool to use
- Unclear boundaries between similar tools
- Verbose or low-signal outputs
Best Practice: If a human engineer can't definitively say which tool should be used in a given situation, an AI agent can't be expected to do better.
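A tool definition following these principles might look like the sketch below. The schema mirrors the common JSON-schema style of tool definitions; the tool name, fields, and wording are illustrative, and the exact format your platform expects may differ:

```python
# A self-contained tool definition: unambiguous purpose, descriptive
# parameters, and a cap that keeps return values token-efficient.
SEARCH_FILES_TOOL = {
    "name": "search_files",
    "description": (
        "Search file contents with a regular expression. "
        "Use this to locate code before reading a file; "
        "do not use it to read entire files."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "pattern": {
                "type": "string",
                "description": "Regular expression to match, e.g. 'def main'",
            },
            "max_results": {
                "type": "integer",
                "description": "Cap on matches returned, to keep output small",
                "default": 20,
            },
        },
        "required": ["pattern"],
    },
}
```

Note how the description states both when to use the tool and when not to, removing the ambiguous decision points called out above.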
3. Examples: Quality Over Quantity
Few-shot prompting remains a best practice, but implementation matters:
Avoid:
- Laundry lists of edge cases
- Attempting to cover every possible scenario
- Redundant or similar examples
Prefer:
- Diverse, canonical examples
- Clear portrayal of expected behavior
- Representative edge cases
- "Pictures worth a thousand words" - examples are often more powerful than lengthy descriptions
Agentic Search: Just-in-Time Context Retrieval
The Traditional Approach
Many AI applications employ embedding-based pre-inference retrieval to surface important context. This involves:
- Creating vector embeddings of information
- Storing in vector databases
- Retrieving similar content based on query embeddings
Limitations:
- Requires maintaining indices
- Can become stale
- Limited by pre-defined retrieval strategies
- Doesn't adapt to agent's actual needs
The Agentic Alternative
Modern agents can explore their environment autonomously through tools, discovering context just-in-time:
How It Works:
- Agents use tools like `glob`, `grep`, `list_directory`, and `read_file`
- Each interaction yields context that informs the next decision
- File sizes suggest complexity
- Naming conventions hint at purpose
- Timestamps proxy for relevance
- Layer-by-layer understanding
Advantages:
- No stale indexing
- Self-managed context window
- Focused on relevant subsets
- Adapts to actual task requirements
- Bypasses complex syntax trees
Trade-offs:
- Runtime exploration is slower than pre-computed retrieval
- Requires opinionated engineering to ensure proper navigation
- Without guidance, agents can waste context on dead-ends
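The exploration primitives behind this approach can be sketched in a few lines. These mimic glob- and grep-style tools; real agent tools would add result caps and richer metadata such as file sizes and timestamps:

```python
import fnmatch
import os
import re

def glob_files(root, pattern):
    """Return paths under `root` whose file names match a glob pattern."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(dirpath, name))
    return matches

def grep_file(path, regex):
    """Return (line_number, line) pairs matching `regex` in one file."""
    pattern = re.compile(regex)
    hits = []
    with open(path, encoding="utf-8", errors="replace") as f:
        for i, line in enumerate(f, start=1):
            if pattern.search(line):
                hits.append((i, line.rstrip("\n")))
    return hits
```

An agent chains calls like these, using each result (which files exist, which lines match) to decide what to inspect next rather than relying on a pre-built index.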
The Hybrid Strategy
The most effective approach often combines both methods:
Example: Claude Code:
- CLAUDE.md files dropped into context up front (speed)
- Primitives like `glob` and `grep` for just-in-time navigation (flexibility)
- Balances speed with adaptability
Decision Factors:
- Task characteristics
- Content dynamism
- Performance requirements
- Agent capabilities
As model capabilities improve, the trend moves toward letting intelligent models act intelligently with progressively less human curation.
Long-Horizon Task Management
The Challenge
Long-horizon tasks require agents to maintain coherence and goal-directed behavior over sequences where token count exceeds the context window. Examples include:
- Large codebase migrations (tens of minutes)
- Comprehensive research projects (hours)
- Extended software development tasks (30+ hours)
Waiting for larger context windows isn't a complete solution - context pollution and information relevance concerns persist at all scales.
Technique 1: Compaction
What It Is: Summarizing conversation history and reinitiating with a compressed version.
Implementation:
- Pass message history to the model for summarization
- Preserve critical details: architectural decisions, unresolved bugs, implementation details
- Discard redundant content: tool outputs, repeated messages
- Continue with compressed context plus recently accessed resources
Example from Claude Code:
- Summarize and compress critical details
- Keep five most recently accessed files
- Maintain continuity without hitting limits
Best Practices:
- Maximize recall first (capture everything relevant)
- Then improve precision (eliminate superfluous content)
- Tune on complex agent traces
- Clear tool calls and results from deep history
Tool Result Clearing: One of the safest forms of compaction - once a tool result is used, why keep the raw output deep in history? This feature launched on the Claude Developer Platform in September 2025.
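Tool result clearing can be sketched as replacing old tool outputs with a short placeholder while leaving recent results intact. This is a simplified illustration, not the platform's actual implementation:

```python
# Replace raw tool outputs deep in history with a placeholder,
# keeping only the most recent tool results intact.
def clear_old_tool_results(messages, keep_recent=3):
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    # All but the last `keep_recent` tool results get cleared
    to_clear = set(tool_indices[:-keep_recent]) if keep_recent else set(tool_indices)
    cleared = []
    for i, m in enumerate(messages):
        if i in to_clear:
            cleared.append({**m, "content": "[tool result cleared]"})
        else:
            cleared.append(m)
    return cleared
```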
Example Compaction Approach:
```python
# Simplified compaction strategy.
# count_tokens and summarize_messages are assumed helpers, e.g. a
# tokenizer-based counter and an LLM summarization call.
def compact_context(message_history, max_tokens=100000):
    if count_tokens(message_history) < max_tokens:
        return message_history

    # Keep system prompts and the most recent messages verbatim
    system_messages = [m for m in message_history if m['role'] == 'system']
    recent_messages = message_history[-10:]  # Keep last 10 messages

    # Summarize everything else in between
    middle_section = [
        m for m in message_history[:-10] if m['role'] != 'system'
    ]
    summary = {
        'role': 'system',
        'content': f'Previous context summary: {summarize_messages(middle_section)}'
    }
    return system_messages + [summary] + recent_messages
```
Technique 2: Structured Note-Taking
What It Is: Agents regularly write notes to persistent memory outside the context window.
Implementation:
- Agent maintains notes files (e.g., NOTES.md, TODO.md)
- Tracks progress across complex tasks
- Maintains critical context and dependencies
- Reads notes after context resets
Example Note Structure:
```markdown
# Agent Memory - Project Alpha

## Current Objective
Implement user authentication system with OAuth2 support

## Progress
- ✅ Set up database schema (users, sessions tables)
- ✅ Implemented password hashing with bcrypt
- 🔄 Working on OAuth2 provider integration
- ⏳ Pending: Email verification flow

## Key Decisions
- Using JWT tokens with 24-hour expiration
- Refresh tokens stored in secure HTTP-only cookies
- Rate limiting: 5 login attempts per 15 minutes

## Next Steps
1. Complete Google OAuth2 integration
2. Add email verification endpoint
3. Write integration tests for auth flow
```
Example: Claude Playing Pokémon:
- Maintains precise tallies across thousands of game steps
- "For the last 1,234 steps I've been training in Route 1, Pikachu gained 8 levels toward target of 10"
- Develops maps of explored regions
- Remembers key achievements and strategic patterns
- Continues multi-hour sequences after resets
Anthropic Memory Tool:
- Public beta on Claude Developer Platform
- File-based system for storing information outside context
- Build knowledge bases over time
- Maintain project state across sessions
- Reference previous work without keeping everything in context
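A file-based memory along these lines can be sketched in a few lines. The file path and note format are illustrative; the actual memory tool's API differs:

```python
from pathlib import Path

# Minimal file-based memory: the agent appends notes while working
# and reloads them after a context reset. Path/format are illustrative.
NOTES_PATH = Path("NOTES.md")

def append_note(text):
    """Append one bullet-point note to persistent memory."""
    with NOTES_PATH.open("a", encoding="utf-8") as f:
        f.write(f"- {text}\n")

def load_notes():
    """Read all notes back, e.g. at the start of a fresh context."""
    if not NOTES_PATH.exists():
        return ""
    return NOTES_PATH.read_text(encoding="utf-8")
```

Because the notes live outside the context window, they survive compaction and resets; the agent pays tokens for them only when it explicitly reads them back.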
- Released alongside Claude Sonnet 4.5 in September 2025
Technique 3: Multi-Agent Architectures
What It Is: Specialized sub-agents handle focused tasks with clean context windows, coordinated by a main agent.
How It Works:
- Main agent maintains high-level plan and coordination
- Sub-agents perform deep technical work
- Each sub-agent explores extensively (tens of thousands of tokens)
- Returns condensed summary (1,000-2,000 tokens)
- Clear separation of concerns
Advantages:
- Detailed search context isolated within sub-agents
- Lead agent focuses on synthesis and analysis
- Parallel exploration where beneficial
- Reduced context pollution in main agent
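The coordination pattern above can be sketched as follows, with `run_subagent` standing in for a real model call (it is a placeholder, not an actual API):

```python
# Orchestrator sketch: each sub-agent explores in its own clean
# context and returns only a condensed summary to the lead agent.

def run_subagent(task):
    """Placeholder: a real implementation would call a model with a
    fresh context, let it explore extensively, and request a short
    (roughly 1,000-2,000 token) summary of its findings."""
    return f"summary of findings for: {task}"

def lead_agent(plan):
    """Coordinate sub-agents and synthesize their condensed results."""
    findings = []
    for task in plan:
        # Detailed exploration stays inside the sub-agent's context;
        # only the condensed summary crosses the boundary.
        findings.append(run_subagent(task))
    return "\n".join(findings)
```

The key design choice is the boundary: tens of thousands of exploration tokens never enter the lead agent's window, only the summaries it needs for synthesis.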
Use Case: Research Systems: Anthropic's multi-agent research system showed substantial improvements over single-agent systems on complex research tasks.
Choosing the Right Technique
Match the technique to task characteristics:
- Compaction: Maintains conversational flow for extensive back-and-forth
- Note-taking: Excels for iterative development with clear milestones
- Multi-agent: Handles complex research/analysis requiring parallel exploration
Practical Implementation Strategies
Start Simple
Anthropic's recurring advice: "Do the simplest thing that works."
- Test minimal prompts with the best available model
- Observe performance on your specific tasks
- Add clear instructions based on failure modes
- Iterate based on real-world usage
Monitor Key Metrics
Track context usage and efficiency:
- Token consumption per task
- Success rates at different context lengths
- Tool usage patterns
- Time to task completion
- Error rates and types
Iterative Refinement
Context engineering is an ongoing process:
- Deploy with minimal viable context
- Measure performance and behavior
- Identify bottlenecks and failures
- Optimize based on data
- Repeat continuously
Balance Autonomy and Guidance
As models improve:
- Give more autonomy to capable models
- Reduce prescriptive engineering
- Let intelligent models act intelligently
- Maintain safety boundaries
- Iterate based on capability advances
Industry Applications
Software Development
Context engineering enables:
- Extended coding sessions (30+ hours of focused work)
- Large codebase navigation and modification
- Multi-file refactoring with coherence
- Bug tracking across long debugging sessions
- Documentation generation from scattered sources
Research and Analysis
Long-horizon capabilities support:
- Comprehensive literature reviews
- Multi-source information synthesis
- Extended hypothesis exploration
- Complex data analysis workflows
- Report generation from extensive research
Customer Support
Context management improves:
- Multi-turn problem resolution
- Account history maintenance
- Cross-department information coordination
- Escalation context preservation
- Long-term customer relationship tracking
Future Implications
Evolving Best Practices
As models become more capable:
- Less prescriptive prompting needed
- More autonomous exploration possible
- Adaptive context strategies emerge
- Dynamic tool selection improves
- Self-optimizing systems develop
Remaining Challenges
Even with improving models:
- Context will remain a precious resource
- Attention budget management stays critical
- Token efficiency remains important
- Information relevance continues to matter
- Strategic curation provides value
The Path Forward
The field is converging on a simple agent definition: LLMs autonomously using tools in a loop.
As underlying models improve:
- Autonomy can scale proportionally
- Agents navigate nuanced problem spaces
- Error recovery becomes more robust
- Complex multi-step tasks become feasible
- Human oversight requirements decrease
Key Takeaways
For Developers
- Think holistically about entire context state, not just prompts
- Treat context as finite resource with diminishing returns
- Start minimal and add based on observed failures
- Enable agent exploration through well-designed tools
- Choose appropriate techniques for long-horizon tasks
- Iterate continuously based on real-world performance
For Organizations
- Invest in context engineering as core AI competency
- Monitor token efficiency as key performance metric
- Design for long-horizon tasks from the start
- Balance speed with adaptability in retrieval strategies
- Prepare for increasing autonomy as models improve
- Maintain safety boundaries while enabling exploration
Core Principles
- Smallest possible set of high-signal tokens
- Right altitude for instructions - specific yet flexible
- Efficient tools that promote good agent behaviors
- Quality examples over exhaustive edge cases
- Just-in-time retrieval when feasible
- Persistent memory for extended tasks
- Clear separation of concerns in multi-agent systems
Conclusion
Context engineering represents a fundamental shift in building with LLMs. As we move from simple prompt optimization to sophisticated agent development, success depends on thoughtfully curating what information enters the model's limited attention budget at each step.
Whether implementing compaction for long-horizon tasks, designing token-efficient tools, or enabling just-in-time exploration, the guiding principle remains constant: find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome.
The techniques outlined by Anthropic will continue evolving as models improve. While smarter models require less prescriptive engineering and operate with more autonomy, treating context as a precious, finite resource will remain central to building reliable, effective agents.
As the field progresses, the most successful AI systems will be those that master the art of context engineering - understanding not just what to tell an AI agent, but what information to provide, when to provide it, and how to maintain coherence across extended interactions.
Sources
- Anthropic Engineering - Effective context engineering for AI agents
- Anthropic - Building effective AI agents
- Anthropic - How we built our multi-agent research system
- Claude Developer Platform - Memory and Context Management
Want to deepen your understanding of AI agent development? Explore our AI fundamentals courses, check out our glossary of AI terms, or discover AI development tools in our comprehensive catalog. For more on Anthropic's AI models, visit our Claude model pages.