Introduction
The effectiveness of AI agents is fundamentally limited by the quality of tools we provide them. In a groundbreaking engineering blog post, Anthropic shares their comprehensive methodology for writing high-quality tools that maximize agent performance across a wide range of real-world tasks.
The Model Context Protocol (MCP) can empower LLM agents with potentially hundreds of tools, but the key question remains: How do we make those tools maximally effective?
The Fundamental Challenge: Deterministic vs. Non-Deterministic Systems
Traditional Software Development
When we write traditional software, we establish contracts between deterministic systems. A function call like `getWeather("NYC")` will always fetch weather data in the exact same manner every time it's called.
The Agent Tool Paradigm
Tools for AI agents represent a fundamentally different paradigm - a contract between deterministic systems and non-deterministic agents. When a user asks "Should I bring an umbrella today?", an agent might:
- Call the weather tool
- Answer from general knowledge
- Ask clarifying questions about location
- Occasionally hallucinate or fail to grasp tool usage
This requires completely rethinking our approach to software development for agents.
Anthropic's Proven Methodology
1. Building and Testing Prototypes
Start with Quick Prototypes
It's difficult to anticipate which tools agents will find ergonomic without hands-on testing. Anthropic recommends:
- Standing up quick prototypes of your tools
- Using Claude Code to write tools with proper documentation
- Wrapping tools in local MCP servers or Desktop extensions
- Testing tools yourself to identify rough edges
- Collecting user feedback to build intuition
Connection Methods
- Claude Code: `claude mcp add <name> <command> [args...]`
- Claude Desktop: Navigate to `Settings > Developer` or `Settings > Extensions`
- API Testing: Pass tools directly into Anthropic API calls (see the sketch below)
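For the third option, a minimal TypeScript sketch using the `@anthropic-ai/sdk` client is shown below. The `get_weather` tool, its schema, and the model ID are illustrative; substitute your own tool definitions and model.

```typescript
// Minimal sketch: quick manual testing by passing a tool definition straight
// into an Anthropic API call. The get_weather tool, its schema, and the model
// ID are illustrative placeholders.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  tools: [
    {
      name: "get_weather",
      description: "Get the current weather for a city. Returns temperature and conditions.",
      input_schema: {
        type: "object",
        properties: {
          city: { type: "string", description: "City name, e.g. 'New York'" },
        },
        required: ["city"],
      },
    },
  ],
  messages: [{ role: "user", content: "Should I bring an umbrella in NYC today?" }],
});

console.log(response.content); // inspect whether Claude chose to call the tool
```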
2. Comprehensive Evaluation Framework
Generating Evaluation Tasks
Create evaluation tasks grounded in real-world use cases:
Strong Evaluation Tasks:
- "Schedule a meeting with Jane next week to discuss our latest Acme Corp project. Attach the notes from our last project planning meeting and reserve a conference room."
- "Customer ID 9182 reported being charged three times for a single purchase. Find all relevant log entries and determine if any other customers were affected."
- "Customer Sarah Chen submitted a cancellation request. Prepare a retention offer by determining why they're leaving, what offer would be compelling, and any risk factors."
Weak Evaluation Tasks:
- "Schedule a meeting with jane@acme.corp next week."
- "Search payment logs for
purchase_complete
andcustomer_id=9182
." - "Find cancellation request by Customer ID 45892."
Evaluation Best Practices
- Generate dozens of prompt-response pairs
- Use realistic data sources and services
- Avoid overly simplistic sandbox environments
- Require multiple tool calls (potentially dozens)
- Pair each prompt with verifiable responses
- Use flexible verifiers that don't reject correct responses due to formatting differences
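To illustrate the last two points, here is a minimal sketch of a flexible verifier in TypeScript. The task shape and the substring-based check are assumptions for this example, not the verifiers Anthropic used.

```typescript
// Sketch of a "flexible" verifier: check that the agent's final response contains
// the facts we care about, instead of demanding an exact string match that would
// penalize harmless formatting differences. The EvalTask shape is hypothetical.
interface EvalTask {
  prompt: string;
  expectedFacts: string[]; // substrings a correct response must mention
}

function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

function verify(task: EvalTask, agentResponse: string): boolean {
  const response = normalize(agentResponse);
  return task.expectedFacts.every((fact) => response.includes(normalize(fact)));
}

// Example: the refund-investigation task above, verified loosely.
const task: EvalTask = {
  prompt: "Customer ID 9182 reported being charged three times for a single purchase...",
  expectedFacts: ["9182", "three", "duplicate charge"],
};
console.log(verify(task, "Customer 9182 was hit by a duplicate charge three times on May 2."));
```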
3. Running Systematic Evaluations
Programmatic Evaluation Setup
- Use simple agentic loops with alternating LLM API and tool calls
- One loop per evaluation task
- Include reasoning and feedback blocks in system prompts
- Enable interleaved thinking for Claude to probe agent behavior
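A rough TypeScript sketch of such a loop using the Anthropic messages API is shown below. The `runTool` callback, the model ID, and the deliberately loose typing are placeholders to adapt to your own harness.

```typescript
// Rough sketch of one agentic loop per evaluation task: call the model, execute
// any requested tools locally, feed the results back, and stop once the model
// makes no further tool calls.
import Anthropic from "@anthropic-ai/sdk";

async function runEvalTask(
  client: Anthropic,
  tools: any[], // tool definitions in the Anthropic tool-schema format
  prompt: string,
  runTool: (name: string, input: unknown) => Promise<string>,
): Promise<string> {
  const messages: any[] = [{ role: "user", content: prompt }];

  while (true) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-5",
      max_tokens: 4096,
      tools,
      messages,
    });
    messages.push({ role: "assistant", content: response.content });

    const toolUses = response.content.filter((b: any) => b.type === "tool_use");
    if (toolUses.length === 0) {
      // No further tool calls: the text block is the agent's final answer.
      const text = response.content.find((b: any) => b.type === "text");
      return text ? (text as any).text : "";
    }

    // Execute each requested tool and return the results to the model.
    const results = await Promise.all(
      toolUses.map(async (use: any) => ({
        type: "tool_result",
        tool_use_id: use.id,
        content: await runTool(use.name, use.input),
      })),
    );
    messages.push({ role: "user", content: results });
  }
}
```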
Key Metrics to Track
- Top-level accuracy: Overall task completion success
- Runtime: Total time for individual tool calls and tasks
- Token consumption: Efficiency of tool usage
- Tool errors: Frequency and types of errors
- Tool call patterns: Common workflows agents pursue
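One way to capture these metrics is a simple per-task record; the field names below are illustrative rather than taken from the original post.

```typescript
// Sketch of a per-task result record covering the metrics listed above.
interface EvalTaskResult {
  taskId: string;
  succeeded: boolean;            // contributes to top-level accuracy
  runtimeMs: number;             // wall-clock time for the whole task
  toolCallRuntimesMs: number[];  // individual tool call durations
  tokensUsed: number;            // total input + output tokens consumed
  toolErrors: { tool: string; message: string }[];
  toolCallSequence: string[];    // ordered tool names, for spotting common workflows
}
```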
Performance Results
Anthropic's internal evaluations show dramatic improvements:
- Slack MCP servers: Significant accuracy improvements with Claude-optimized tools
- Asana MCP servers: Enhanced performance on held-out test sets
- SWE-bench: State-of-the-art performance after tool description refinements
Core Principles for Effective Agent Tools
1. Choosing the Right Tools
Strategic Tool Selection
- Implement tools that reflect natural task subdivisions
- Reduce the number of tools loaded into agent context
- Offload agentic computation from context back into tool calls
- Minimize agent's risk of making mistakes
Tool Consolidation Opportunities
Track common workflows to identify opportunities for:
- Combining multiple tool calls into single operations
- Creating higher-level abstractions
- Reducing cognitive load on agents
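For example, a single `schedule_event` tool can fold an availability lookup and an event creation into one call, so the agent no longer has to shuttle intermediate results through its context. All names and helpers in this sketch are hypothetical.

```typescript
// Hypothetical consolidation: one schedule_event tool replaces two lower-level calls.
declare function findFirstOpenSlot(
  attendees: string[],
  durationMinutes: number,
  earliestStart: string,
  latestStart: string,
): Promise<{ start: string } | null>;

declare function createCalendarEvent(
  attendees: string[],
  slot: { start: string },
  durationMinutes: number,
): Promise<{ id: string }>;

interface ScheduleEventInput {
  attendees: string[];   // attendee email addresses
  durationMinutes: number;
  earliestStart: string; // ISO 8601 date
  latestStart: string;   // ISO 8601 date
}

async function scheduleEvent(input: ScheduleEventInput): Promise<string> {
  // Step 1 (previously its own tool call): find a slot everyone can attend.
  const slot = await findFirstOpenSlot(
    input.attendees, input.durationMinutes, input.earliestStart, input.latestStart,
  );
  if (!slot) {
    return "No shared availability in the requested window. Try a wider date range.";
  }
  // Step 2 (previously another tool call): book the event in that slot.
  const event = await createCalendarEvent(input.attendees, slot, input.durationMinutes);
  return `Scheduled for ${slot.start} (${input.durationMinutes} min). Event ID: ${event.id}`;
}
```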
2. Namespacing for Clear Boundaries
Effective Namespacing Strategies
- Prefix-based: `asana_projects_search`, `asana_users_search`
- Suffix-based: `search_projects_asana`, `search_users_asana`
- Source-based: Tools grouped by data source or service
Impact on Performance
Anthropic found that namespacing choices have non-trivial effects on tool-use evaluations. Effects vary by LLM, so choose naming schemes based on your own evaluations.
3. Returning Meaningful Context
High Signal Information Priority
Tool implementations should return only high-signal information:
Avoid:
- Low-level technical identifiers (`uuid`, `256px_image_url`, `mime_type`)
- Cryptic alphanumeric UUIDs
- Excessive technical details
Prefer:
- Natural language names and terms
- Contextually relevant fields (`name`, `image_url`, `file_type`)
- Semantically meaningful identifiers
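As a contrast, here is an illustrative pair of responses describing the same document; every field and value is made up, but the difference in signal is the point.

```typescript
// Illustrative contrast: two tool responses for the same document.
const lowSignal = {
  uuid: "7f3a9c1e-4b2d-4e8a-9f01-2c5d6e7a8b90",
  mime_type: "application/vnd.google-apps.document",
  "256px_image_url": "https://cdn.example.com/thumb/7f3a9c1e_256.png",
};

const highSignal = {
  name: "Q3 Project Planning Notes",
  file_type: "Google Doc",
  image_url: "https://cdn.example.com/thumb/project-planning-notes.png",
  last_edited_by: "Jane Rivera",
};
```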
Response Format Flexibility
Implement `response_format` enum parameters for flexibility:
enum ResponseFormat {
DETAILED = "detailed",
CONCISE = "concise"
}
Example Impact:
- Detailed response: 206 tokens with full context and IDs
- Concise response: 72 tokens with essential information only
- Token efficiency: concise responses use roughly one-third as many tokens (~65% fewer)
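A minimal sketch of a tool honoring this parameter might look like the following; the customer record and its fields are hypothetical.

```typescript
// Sketch of a tool that shapes its output based on the response_format parameter.
enum ResponseFormat {
  DETAILED = "detailed",
  CONCISE = "concise",
}

function formatCustomer(customer: Record<string, unknown>, format: ResponseFormat): string {
  if (format === ResponseFormat.CONCISE) {
    // Essential fields only: fewer tokens, usually enough for the next step.
    return JSON.stringify({ name: customer.name, status: customer.status });
  }
  // Full record, including IDs the agent may need for follow-up tool calls.
  return JSON.stringify(customer);
}
```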
4. Optimizing for Token Efficiency
Context Management Strategies
- Pagination: Break large responses into manageable chunks
- Range selection: Allow agents to specify data ranges
- Filtering: Enable targeted data retrieval
- Truncation: Implement sensible default limits (25,000 tokens for Claude Code)
Helpful Truncation Example
[Previous 50 messages truncated for brevity.
Use filters or pagination to access specific messages.]
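Putting pagination and helpful truncation together, a sketch might look like this; the message type, default page size, and wording are assumptions for illustration.

```typescript
// Sketch of pagination plus an explicit truncation notice in a tool response.
interface LogMessage {
  timestamp: string;
  text: string;
}

function pageOfMessages(all: LogMessage[], page: number, pageSize = 20): string {
  const start = page * pageSize;
  const slice = all.slice(start, start + pageSize);
  const lines = slice.map((m) => `${m.timestamp} ${m.text}`);

  const remaining = all.length - (start + slice.length);
  if (remaining > 0) {
    // Tell the agent what was left out and how to retrieve it, instead of silently clipping.
    lines.push(
      `[${remaining} more messages truncated. Pass page=${page + 1} or a filter to see them.]`,
    );
  }
  return lines.join("\n");
}
```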
Error Response Optimization
Unhelpful Error:
Error: Invalid input format
Helpful Error:
Error: Invalid date format. Please use YYYY-MM-DD format.
Example: 2024-01-15
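In code, the difference is small but matters to the agent. This sketch validates a hypothetical date parameter and returns the actionable message.

```typescript
// Sketch of an actionable validation error: name the problem, the expected
// format, and a concrete example, rather than a bare "invalid input".
function parseDateParam(
  value: string,
): { ok: true; date: string } | { ok: false; error: string } {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(value)) {
    return {
      ok: false,
      error: "Error: Invalid date format. Please use YYYY-MM-DD format. Example: 2024-01-15",
    };
  }
  return { ok: true, date: value };
}
```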
5. Prompt Engineering Tool Descriptions
Writing Effective Descriptions
Think of how you'd describe your tool to a new team member:
- Make implicit context explicit: Specialized query formats, niche terminology, resource relationships
- Avoid ambiguity: Clear input/output descriptions with strict data models
- Use unambiguous parameter names: `user_id` instead of `user`
- Include examples: Show expected usage patterns
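Pulling these guidelines together, a tool definition might look like the sketch below. The payments log-search tool itself is hypothetical, but its description makes the query format, parameter semantics, and defaults explicit.

```typescript
// Sketch of a tool definition applying the guidelines above: explicit query
// format, unambiguous parameter names, and an inline example.
const searchPaymentLogsTool = {
  name: "payments_search_logs",
  description:
    "Search the payments service logs. Queries are key=value pairs joined by AND, " +
    "for example 'event=purchase_complete AND customer_id=9182'. Returns at most " +
    "`limit` entries, newest first, each with timestamp, event, customer_id, and amount.",
  input_schema: {
    type: "object",
    properties: {
      query: {
        type: "string",
        description: "Filter expression of key=value pairs joined by AND.",
      },
      customer_id: {
        type: "string",
        description: "Exact customer ID to scope the search to (not a free-text name).",
      },
      limit: {
        type: "integer",
        description: "Maximum number of entries to return (default 25).",
      },
    },
    required: ["query"],
  },
};
```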
Impact of Description Refinements
Even small refinements to tool descriptions can yield dramatic improvements. Anthropic achieved state-of-the-art performance on SWE-bench after precise refinements, dramatically reducing error rates and improving task completion.
Advanced Techniques
Agent-Agent Collaboration
Using Claude to Optimize Tools
Anthropic demonstrates how to use Claude Code to:
- Automatically evaluate tool efficacy
- Generate optimization suggestions
- Improve tool descriptions and specifications
- Create comprehensive evaluation frameworks
Iterative Improvement Process
- Build prototype → Test locally
- Run evaluation → Measure performance
- Analyze results → Identify improvement areas
- Optimize tools → Implement changes
- Repeat → Continue until strong performance achieved
Response Structure Optimization
Format Impact on Performance
Different response structures (XML, JSON, Markdown) can impact evaluation performance because LLMs are trained on next-token prediction and perform better with formats matching their training data.
Best Practices:
- Test different formats with your specific tasks
- Choose based on evaluation results
- Consider agent-specific preferences
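For instance, the same result can be rendered as JSON or as Markdown; which one an agent handles better is something to measure in your own evaluations rather than assume. The values below are made up.

```typescript
// The same tool result rendered two ways for comparison in evaluations.
const asJson = JSON.stringify({
  meeting: "Project planning",
  time: "2024-01-15 10:00",
  room: "4B",
});

const asMarkdown = [
  "## Project planning",
  "- Time: 2024-01-15 10:00",
  "- Room: 4B",
].join("\n");
```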
Real-World Applications
Enterprise Tool Development
Internal Knowledge Systems
- Slack integration tools
- Asana project management tools
- Customer support systems
- Document management platforms
Performance Improvements
- Reduced error rates
- Faster task completion
- Better agent understanding
- Improved user satisfaction
Developer Tools
MCP Server Development
- Model Context Protocol integration
- Tool annotation for access control
- Command-line interfaces
- API compatibility layers
Future Implications
Evolving Agent Capabilities
As agents become more capable, the tools they use must evolve alongside them. Anthropic's systematic, evaluation-driven approach ensures that:
- Tools remain effective as LLMs improve
- New capabilities are properly supported
- Performance continues to increase
- User experience remains optimal
Industry Impact
This methodology represents a fundamental shift in software development:
- From deterministic to non-deterministic: Rethinking software contracts
- From human-centric to agent-centric: Designing for AI understanding
- From static to adaptive: Tools that evolve with agent capabilities
Key Takeaways
For Developers
- Start with prototypes and test early
- Build comprehensive evaluations with real-world tasks
- Use systematic metrics to measure improvement
- Collaborate with agents to optimize tools
- Focus on agent ergonomics over traditional software patterns
For Organizations
- Invest in evaluation frameworks for tool development
- Prioritize agent-friendly design in tool architecture
- Measure performance systematically across all tools
- Iterate continuously based on agent feedback
- Plan for evolving capabilities in tool design
Conclusion
Anthropic's comprehensive guide to writing effective tools for AI agents represents a paradigm shift in software development. By treating tools as contracts between deterministic systems and non-deterministic agents, we can create more effective, intuitive, and powerful AI systems.
The key insight is that agents are only as effective as the tools we give them. Through systematic evaluation, iterative improvement, and agent-centric design, we can build tools that maximize agent performance across a wide range of real-world tasks.
As the AI landscape continues to evolve, this methodology provides a solid foundation for building tools that will remain effective as agents become increasingly capable. The future of AI development lies not just in improving the models themselves, but in creating the tools and interfaces that allow them to reach their full potential.
Sources
- Anthropic Engineering Blog - Writing effective tools for agents — with agents
- Model Context Protocol Documentation
- Anthropic API Documentation
- Claude Code Documentation
Interested in learning more about AI agent development and tool creation? Explore our AI fundamentals courses, check out our glossary of AI terms, or discover AI tools in our comprehensive catalog.