Introduction
The effectiveness of AI agents is fundamentally limited by the quality of tools we provide them. In a groundbreaking engineering blog post, Anthropic shares their comprehensive methodology for writing high-quality tools that maximize agent performance across a wide range of real-world tasks.
The Model Context Protocol (MCP) can empower LLM agents with potentially hundreds of tools, but the key question remains: How do we make those tools maximally effective?
The Fundamental Challenge: Deterministic vs. Non-Deterministic Systems
Traditional Software Development
When we write traditional software, we establish contracts between deterministic systems. A function call like `getWeather("NYC")` will always fetch weather data in the exact same manner every time it's called.
The Agent Tool Paradigm
Tools for AI agents represent a fundamentally different paradigm - a contract between deterministic systems and non-deterministic agents. When a user asks "Should I bring an umbrella today?", an agent might:
- Call the weather tool
- Answer from general knowledge
- Ask clarifying questions about location
- Occasionally hallucinate or fail to grasp tool usage
This requires completely rethinking our approach to software development for agents.
Anthropic's Proven Methodology
1. Building and Testing Prototypes
Start with Quick Prototypes
It's difficult to anticipate which tools agents will find ergonomic without hands-on testing. Anthropic recommends:
- Standing up quick prototypes of your tools
- Using Claude Code to write tools with proper documentation
- Wrapping tools in local MCP servers or Desktop extensions
- Testing tools yourself to identify rough edges
- Collecting user feedback to build intuition
Connection Methods
- Claude Code: `claude mcp add <name> <command> [args...]`
- Claude Desktop: Navigate to `Settings > Developer` or `Settings > Extensions`
- API Testing: Pass tools directly into Anthropic API calls (see the sketch below)
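For the third option, a minimal TypeScript sketch using the `@anthropic-ai/sdk` client is shown below. The `get_weather` tool, its schema, and the model ID are illustrative; substitute your own tool definitions and model.

```typescript
// Minimal sketch: quick manual testing by passing a tool definition straight
// into an Anthropic API call. The get_weather tool, its schema, and the model
// ID are illustrative placeholders.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  tools: [
    {
      name: "get_weather",
      description: "Get the current weather for a city. Returns temperature and conditions.",
      input_schema: {
        type: "object",
        properties: {
          city: { type: "string", description: "City name, e.g. 'New York'" },
        },
        required: ["city"],
      },
    },
  ],
  messages: [{ role: "user", content: "Should I bring an umbrella in NYC today?" }],
});

console.log(response.content); // inspect whether Claude chose to call the tool
```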
2. Comprehensive Evaluation Framework
Generating Evaluation Tasks
Create evaluation tasks grounded in real-world use cases:
Strong Evaluation Tasks:
- "Schedule a meeting with Jane next week to discuss our latest Acme Corp project. Attach the notes from our last project planning meeting and reserve a conference room."
- "Customer ID 9182 reported being charged three times for a single purchase. Find all relevant log entries and determine if any other customers were affected."
- "Customer Sarah Chen submitted a cancellation request. Prepare a retention offer by determining why they're leaving, what offer would be compelling, and any risk factors."
Weak Evaluation Tasks:
- "Schedule a meeting with jane@acme.corp next week."
- "Search payment logs for
purchase_complete
andcustomer_id=9182
." - "Find cancellation request by Customer ID 45892."
Evaluation Best Practices
- Generate dozens of prompt-response pairs
- Use realistic data sources and services
- Avoid overly simplistic sandbox environments
- Require multiple tool calls (potentially dozens)
- Pair each prompt with verifiable responses
- Use flexible verifiers that don't reject correct responses due to formatting differences
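To illustrate the last two points, here is a minimal sketch of a flexible verifier in TypeScript. The task shape and the substring-based check are assumptions for this example, not the verifiers Anthropic used.

```typescript
// Sketch of a "flexible" verifier: check that the agent's final response contains
// the facts we care about, instead of demanding an exact string match that would
// penalize harmless formatting differences. The EvalTask shape is hypothetical.
interface EvalTask {
  prompt: string;
  expectedFacts: string[]; // substrings a correct response must mention
}

function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

function verify(task: EvalTask, agentResponse: string): boolean {
  const response = normalize(agentResponse);
  return task.expectedFacts.every((fact) => response.includes(normalize(fact)));
}

// Example: the refund-investigation task above, verified loosely.
const task: EvalTask = {
  prompt: "Customer ID 9182 reported being charged three times for a single purchase...",
  expectedFacts: ["9182", "three", "duplicate charge"],
};
console.log(verify(task, "Customer 9182 was hit by a duplicate charge three times on May 2."));
```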
3. Running Systematic Evaluations
Programmatic Evaluation Setup
- Use simple agentic loops with alternating LLM API and tool calls
- One loop per evaluation task
- Include reasoning and feedback blocks in system prompts
- Enable interleaved thinking for Claude to probe agent behavior
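A rough TypeScript sketch of such a loop using the Anthropic messages API is shown below. The `runTool` callback, the model ID, and the deliberately loose typing are placeholders to adapt to your own harness.

```typescript
// Rough sketch of one agentic loop per evaluation task: call the model, execute
// any requested tools locally, feed the results back, and stop once the model
// makes no further tool calls.
import Anthropic from "@anthropic-ai/sdk";

async function runEvalTask(
  client: Anthropic,
  tools: any[], // tool definitions in the Anthropic tool-schema format
  prompt: string,
  runTool: (name: string, input: unknown) => Promise<string>,
): Promise<string> {
  const messages: any[] = [{ role: "user", content: prompt }];

  while (true) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-5",
      max_tokens: 4096,
      tools,
      messages,
    });
    messages.push({ role: "assistant", content: response.content });

    const toolUses = response.content.filter((b: any) => b.type === "tool_use");
    if (toolUses.length === 0) {
      // No further tool calls: the text block is the agent's final answer.
      const text = response.content.find((b: any) => b.type === "text");
      return text ? (text as any).text : "";
    }

    // Execute each requested tool and return the results to the model.
    const results = await Promise.all(
      toolUses.map(async (use: any) => ({
        type: "tool_result",
        tool_use_id: use.id,
        content: await runTool(use.name, use.input),
      })),
    );
    messages.push({ role: "user", content: results });
  }
}
```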
Key Metrics to Track
- Top-level accuracy: Overall task completion success
- Runtime: Total time for individual tool calls and tasks
- Token consumption: Efficiency of tool usage
- Tool errors: Frequency and types of errors
- Tool call patterns: Common workflows agents pursue
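One way to capture these metrics is a simple per-task record; the field names below are illustrative rather than taken from the original post.

```typescript
// Sketch of a per-task result record covering the metrics listed above.
interface EvalTaskResult {
  taskId: string;
  succeeded: boolean;            // contributes to top-level accuracy
  runtimeMs: number;             // wall-clock time for the whole task
  toolCallRuntimesMs: number[];  // individual tool call durations
  tokensUsed: number;            // total input + output tokens consumed
  toolErrors: { tool: string; message: string }[];
  toolCallSequence: string[];    // ordered tool names, for spotting common workflows
}
```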
Performance Results
Anthropic's internal evaluations show dramatic improvements:
- Slack MCP servers: Significant accuracy improvements with Claude-optimized tools
- Asana MCP servers: Enhanced performance on held-out test sets
- SWE-bench: State-of-the-art performance after tool description refinements
Core Principles for Effective Agent Tools
1. Choosing the Right Tools
Strategic Tool Selection
- Implement tools that reflect natural task subdivisions
- Reduce the number of tools loaded into agent context
- Offload agentic computation from context back into tool calls
- Minimize agent's risk of making mistakes
Tool Consolidation Opportunities
Track common workflows to identify opportunities for:
- Combining multiple tool calls into single operations
- Creating higher-level abstractions
- Reducing cognitive load on agents
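For example, a single `schedule_event` tool can fold an availability lookup and an event creation into one call, so the agent no longer has to shuttle intermediate results through its context. All names and helpers in this sketch are hypothetical.

```typescript
// Hypothetical consolidation: one schedule_event tool replaces two lower-level calls.
declare function findFirstOpenSlot(
  attendees: string[],
  durationMinutes: number,
  earliestStart: string,
  latestStart: string,
): Promise<{ start: string } | null>;

declare function createCalendarEvent(
  attendees: string[],
  slot: { start: string },
  durationMinutes: number,
): Promise<{ id: string }>;

interface ScheduleEventInput {
  attendees: string[];   // attendee email addresses
  durationMinutes: number;
  earliestStart: string; // ISO 8601 date
  latestStart: string;   // ISO 8601 date
}

async function scheduleEvent(input: ScheduleEventInput): Promise<string> {
  // Step 1 (previously its own tool call): find a slot everyone can attend.
  const slot = await findFirstOpenSlot(
    input.attendees, input.durationMinutes, input.earliestStart, input.latestStart,
  );
  if (!slot) {
    return "No shared availability in the requested window. Try a wider date range.";
  }
  // Step 2 (previously another tool call): book the event in that slot.
  const event = await createCalendarEvent(input.attendees, slot, input.durationMinutes);
  return `Scheduled for ${slot.start} (${input.durationMinutes} min). Event ID: ${event.id}`;
}
```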
2. Namespacing for Clear Boundaries
Effective Namespacing Strategies
- Prefix-based: `asana_projects_search`, `asana_users_search`
- Suffix-based: `search_projects_asana`, `search_users_asana`
- Source-based: Tools grouped by data source or service
Impact on Performance
Anthropic found that namespacing choices have non-trivial effects on tool-use evaluations. Effects vary by LLM, so choose naming schemes based on your own evaluations.
3. Returning Meaningful Context
High Signal Information Priority
Tool implementations should return only high-signal information:
Avoid:
- Low-level technical identifiers (`uuid`, `256px_image_url`, `mime_type`)
- Cryptic alphanumeric UUIDs
- Excessive technical details
Prefer:
- Natural language names and terms
- Contextually relevant fields (`name`, `image_url`, `file_type`)
- Semantically meaningful identifiers
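As a contrast, here is an illustrative pair of responses describing the same document; every field and value is made up, but the difference in signal is the point.

```typescript
// Illustrative contrast: two tool responses for the same document.
const lowSignal = {
  uuid: "7f3a9c1e-4b2d-4e8a-9f01-2c5d6e7a8b90",
  mime_type: "application/vnd.google-apps.document",
  "256px_image_url": "https://cdn.example.com/thumb/7f3a9c1e_256.png",
};

const highSignal = {
  name: "Q3 Project Planning Notes",
  file_type: "Google Doc",
  image_url: "https://cdn.example.com/thumb/project-planning-notes.png",
  last_edited_by: "Jane Rivera",
};
```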
Response Format Flexibility
Implement `response_format` enum parameters for flexibility:
enum ResponseFormat {
DETAILED = "detailed",
CONCISE = "concise"
}
Example Impact:
- Detailed response: 206 tokens with full context and IDs
- Concise response: 72 tokens with essential information only
- Token efficiency: concise responses use roughly one-third as many tokens (~65% fewer)
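A minimal sketch of a tool honoring this parameter might look like the following; the customer record and its fields are hypothetical.

```typescript
// Sketch of a tool that shapes its output based on the response_format parameter.
enum ResponseFormat {
  DETAILED = "detailed",
  CONCISE = "concise",
}

function formatCustomer(customer: Record<string, unknown>, format: ResponseFormat): string {
  if (format === ResponseFormat.CONCISE) {
    // Essential fields only: fewer tokens, usually enough for the next step.
    return JSON.stringify({ name: customer.name, status: customer.status });
  }
  // Full record, including IDs the agent may need for follow-up tool calls.
  return JSON.stringify(customer);
}
```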
4. Optimizing for Token Efficiency
Context Management Strategies
- Pagination: Break large responses into manageable chunks
- Range selection: Allow agents to specify data ranges
- Filtering: Enable targeted data retrieval
- Truncation: Implement sensible default limits (25,000 tokens for Claude Code)
Helpful Truncation Example
[Previous 50 messages truncated for brevity.
Use filters or pagination to access specific messages.]
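Putting pagination and helpful truncation together, a sketch might look like this; the message type, default page size, and wording are assumptions for illustration.

```typescript
// Sketch of pagination plus an explicit truncation notice in a tool response.
interface LogMessage {
  timestamp: string;
  text: string;
}

function pageOfMessages(all: LogMessage[], page: number, pageSize = 20): string {
  const start = page * pageSize;
  const slice = all.slice(start, start + pageSize);
  const lines = slice.map((m) => `${m.timestamp} ${m.text}`);

  const remaining = all.length - (start + slice.length);
  if (remaining > 0) {
    // Tell the agent what was left out and how to retrieve it, instead of silently clipping.
    lines.push(
      `[${remaining} more messages truncated. Pass page=${page + 1} or a filter to see them.]`,
    );
  }
  return lines.join("\n");
}
```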
Error Response Optimization
Unhelpful Error:
Error: Invalid input format
Helpful Error:
Error: Invalid date format. Please use YYYY-MM-DD format.
Example: 2024-01-15
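In code, the difference is small but matters to the agent. This sketch validates a hypothetical date parameter and returns the actionable message.

```typescript
// Sketch of an actionable validation error: name the problem, the expected
// format, and a concrete example, rather than a bare "invalid input".
function parseDateParam(
  value: string,
): { ok: true; date: string } | { ok: false; error: string } {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(value)) {
    return {
      ok: false,
      error: "Error: Invalid date format. Please use YYYY-MM-DD format. Example: 2024-01-15",
    };
  }
  return { ok: true, date: value };
}
```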
5. Prompt Engineering Tool Descriptions
Writing Effective Descriptions
Think of how you'd describe your tool to a new team member:
- Make implicit context explicit: Specialized query formats, niche terminology, resource relationships
- Avoid ambiguity: Clear input/output descriptions with strict data models
- Use unambiguous parameter names: `user_id` instead of `user`
- Include examples: Show expected usage patterns
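Pulling these guidelines together, a tool definition might look like the sketch below. The payments log-search tool itself is hypothetical, but its description makes the query format, parameter semantics, and defaults explicit.

```typescript
// Sketch of a tool definition applying the guidelines above: explicit query
// format, unambiguous parameter names, and an inline example.
const searchPaymentLogsTool = {
  name: "payments_search_logs",
  description:
    "Search the payments service logs. Queries are key=value pairs joined by AND, " +
    "for example 'event=purchase_complete AND customer_id=9182'. Returns at most " +
    "`limit` entries, newest first, each with timestamp, event, customer_id, and amount.",
  input_schema: {
    type: "object",
    properties: {
      query: {
        type: "string",
        description: "Filter expression of key=value pairs joined by AND.",
      },
      customer_id: {
        type: "string",
        description: "Exact customer ID to scope the search to (not a free-text name).",
      },
      limit: {
        type: "integer",
        description: "Maximum number of entries to return (default 25).",
      },
    },
    required: ["query"],
  },
};
```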
Impact of Description Refinements
Even small refinements to tool descriptions can yield dramatic improvements. Anthropic achieved state-of-the-art performance on SWE-bench after precise refinements, dramatically reducing error rates and improving task completion.
Advanced Techniques
Agent-Agent Collaboration
Using Claude to Optimize Tools
Anthropic demonstrates how to use Claude Code to:
- Automatically evaluate tool efficacy
- Generate optimization suggestions
- Improve tool descriptions and specifications
- Create comprehensive evaluation frameworks
Iterative Improvement Process
- Build prototype → Test locally
- Run evaluation → Measure performance
- Analyze results → Identify improvement areas
- Optimize tools → Implement changes
- Repeat → Continue until strong performance achieved
Response Structure Optimization
Format Impact on Performance
Different response structures (XML, JSON, Markdown) can impact evaluation performance because LLMs are trained on next-token prediction and perform better with formats matching their training data.
Best Practices:
- Test different formats with your specific tasks
- Choose based on evaluation results
- Consider agent-specific preferences
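For instance, the same result can be rendered as JSON or as Markdown; which one an agent handles better is something to measure in your own evaluations rather than assume. The values below are made up.

```typescript
// The same tool result rendered two ways for comparison in evaluations.
const asJson = JSON.stringify({
  meeting: "Project planning",
  time: "2024-01-15 10:00",
  room: "4B",
});

const asMarkdown = [
  "## Project planning",
  "- Time: 2024-01-15 10:00",
  "- Room: 4B",
].join("\n");
```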
Real-World Applications
Enterprise Tool Development
Internal Knowledge Systems
- Slack integration tools
- Asana project management tools
- Customer support systems
- Document management platforms
Performance Improvements
- Reduced error rates
- Faster task completion
- Better agent understanding
- Improved user satisfaction
Developer Tools
MCP Server Development
- Model Context Protocol integration
- Tool annotation for access control
- Command-line interfaces
- API compatibility layers
Future Implications
Evolving Agent Capabilities
As agents become more capable, the tools they use must evolve alongside them. Anthropic's systematic, evaluation-driven approach ensures that:
- Tools remain effective as LLMs improve
- New capabilities are properly supported
- Performance continues to increase
- User experience remains optimal
Industry Impact
This methodology represents a fundamental shift in software development:
- From deterministic to non-deterministic: Rethinking software contracts
- From human-centric to agent-centric: Designing for AI understanding
- From static to adaptive: Tools that evolve with agent capabilities
Key Takeaways
For Developers
- Start with prototypes and test early
- Build comprehensive evaluations with real-world tasks
- Use systematic metrics to measure improvement
- Collaborate with agents to optimize tools
- Focus on agent ergonomics over traditional software patterns
For Organizations
- Invest in evaluation frameworks for tool development
- Prioritize agent-friendly design in tool architecture
- Measure performance systematically across all tools
- Iterate continuously based on agent feedback
- Plan for evolving capabilities in tool design
Conclusion
Anthropic's comprehensive guide to writing effective tools for AI agents represents a paradigm shift in software development. By treating tools as contracts between deterministic systems and non-deterministic agents, we can create more effective, intuitive, and powerful AI systems.
The key insight is that agents are only as effective as the tools we give them. Through systematic evaluation, iterative improvement, and agent-centric design, we can build tools that maximize agent performance across a wide range of real-world tasks.
As the AI landscape continues to evolve, this methodology provides a solid foundation for building tools that will remain effective as agents become increasingly capable. The future of AI development lies not just in improving the models themselves, but in creating the tools and interfaces that allow them to reach their full potential.
Sources
- Anthropic Engineering Blog - Writing effective tools for agents — with agents
- Model Context Protocol Documentation
- Anthropic API Documentation
- Claude Code Documentation
Interested in learning more about AI agent development and tool creation? Explore our AI fundamentals courses, check out our glossary of AI terms, or discover AI tools in our comprehensive catalog.