Introduction
On October 17, 2025, IBM Research and the University of Washington announced the release of Toucan, the largest and most comprehensive publicly available collection of tool-calling scenarios for AI agents. With 1.5 million real-world task sequences spanning more than 2,000 web services, Toucan is designed to transform how AI agents learn to interact with the world and accomplish practical tasks.
As one enthusiast noted on LinkedIn: "Toucan changes everything. This isn't another simulated dataset. It captures actual API executions in real environments. Complete interaction chains from start to finish."
The dataset addresses one of the most critical challenges in AI development: teaching large language models to properly call and execute tools through APIs. It captured the community's attention almost immediately after release, becoming a top trending dataset on Hugging Face.
Key highlights:
- 1.5 million real-world tool-calling trajectories
- 2,000+ different web services and APIs
- 5x larger than previous open-source datasets
- Real MCP (Model Context Protocol) environments
- Proven performance improvements on leading benchmarks
- Open-source availability on Hugging Face
The Challenge of Tool-Calling in AI
Why Tool-Calling Matters
Tool-calling represents perhaps the most essential capability that distinguishes AI agents from simple chatbots. Without the ability to find, select, and deploy external tools and applications, large language models remain limited to text generation rather than becoming truly useful autonomous agents.
The Evolution from Chatbots to Agents:
- Traditional LLMs: Limited to text-based responses
- Modern AI Agents: Can interact with external systems, APIs, and tools
- Future Vision: Fully autonomous agents capable of complex real-world tasks
Training Data Challenges
Teaching LLMs to properly execute tool-calling has historically been extremely difficult due to several key challenges:
Data Scarcity:
- High-quality tool-calling examples are rare on the internet
- Most existing data lacks real-world complexity
- Synthetic data often fails to capture nuanced API interactions
- Limited diversity in available tool-calling scenarios
Quality Requirements:
- Need for end-to-end interaction sequences
- Requirement for realistic error handling
- Complex multi-tool workflows
- Real-world context and constraints
Scale Limitations:
- Previous datasets were too small for effective training
- Limited coverage of different tool types and domains
- Insufficient examples of parallel tool calling
- Lack of comprehensive evaluation scenarios
Toucan Dataset Overview
Dataset Specifications
Scale and Scope:
- 1.5 million tool-calling trajectories
- 2,000+ unique web services and APIs
- 500 MCP (Model Context Protocol) servers
- Multiple complexity levels and task types
- Comprehensive coverage across domains
Data Quality:
- Real-world scenarios: Actual API executions, not simulations
- Complete interaction chains: From task initiation to completion
- Quality-rated: Each trajectory rated for difficulty and quality
- Diverse complexity: From simple single-tool tasks to complex multi-tool workflows
- Error handling: Includes realistic failure scenarios and recovery
Comparison with Existing Datasets
Size Advantage:
- Toucan: 1.5 million trajectories
- NVIDIA Nemotron: 310,000 trajectories (the previous largest)
- Roughly 5x larger than the next-biggest open-source dataset
- Most diverse tool coverage available
Quality Differences:
- Real vs. Synthetic: Toucan uses actual API executions
- Comprehensive Coverage: 2,000+ tools vs. limited tool sets in other datasets
- Multi-modal Support: Includes various input and output modalities
- Production-Ready: Based on real MCP server implementations
Technical Architecture and Creation Process
Data Collection Methodology
MCP Server Sourcing: The research team began by gathering Model Context Protocol server metadata from two primary sources:
GitHub Repositories:
- Comprehensive scan of open-source MCP servers
- Filtering for active, well-maintained projects
- Quality assessment of server implementations
- Documentation and API completeness evaluation
Smithery.ai Platform:
- Curated collection of production MCP servers
- Verified server functionality and reliability
- Access to diverse tool categories and domains
- Real-world usage patterns and examples
Quality Filtering Process (a schematic sketch follows this list):
- Error Testing: Removed servers that returned frequent error messages
- Functionality Verification: Ensured all tools work as expected
- Documentation Quality: Required clear API documentation
- Reliability Assessment: Tested server stability and response times
- Final Selection: 500 high-quality MCP servers chosen for dataset creation
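The team has not released its screening code, so the sketch below is only a hypothetical illustration of the filtering criteria listed above: the ServerStats fields, thresholds, and example servers are all invented for demonstration.
from dataclasses import dataclass

# Hypothetical illustration of the server-screening criteria described
# above; field names, thresholds, and example servers are assumptions,
# not the Toucan team's actual tooling.

@dataclass
class ServerStats:
    name: str
    has_docs: bool       # clear API documentation available
    error_rate: float    # fraction of probe calls that returned errors
    latency_s: float     # mean response time in seconds

def screen_servers(candidates, max_error_rate=0.2, max_latency_s=10.0):
    """Keep only documented, reliable, and responsive MCP servers."""
    return [
        s for s in candidates
        if s.has_docs
        and s.error_rate <= max_error_rate
        and s.latency_s <= max_latency_s
    ]

# Example: the flaky server is dropped, the other two are kept
candidates = [
    ServerStats("weather-mcp", True, 0.05, 1.2),
    ServerStats("flaky-mcp", True, 0.60, 0.8),
    ServerStats("calendar-mcp", True, 0.10, 2.5),
]
print([s.name for s in screen_servers(candidates)])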
AI-Powered Dataset Generation
Multi-Model Approach: The Toucan creation process employed multiple specialized AI models for different aspects of dataset generation; a schematic sketch of the full pipeline follows the lists below:
Task Scenario Generation (5 Models):
- Diverse Perspectives: Five different open-source LLMs used for variety
- Creative Task Design: Models generated plausible, realistic scenarios
- Domain Coverage: Ensured broad coverage across different use cases
- Complexity Variation: Generated tasks of varying difficulty levels
- Real-World Grounding: Based scenarios on actual tool capabilities
Agent Trajectory Construction (3 Models):
- Step-by-Step Planning: Models created detailed execution plans
- Tool Selection Logic: Demonstrated proper tool choice reasoning
- Error Handling: Included realistic failure and recovery scenarios
- Multi-Tool Coordination: Showed complex workflows with multiple tools
- Human-Like Interaction: Generated natural dialogue and explanations
Quality Assessment (2 Models):
- Difficulty Rating: Assessed task complexity and challenge level
- Quality Scoring: Evaluated trajectory completeness and realism
- Selection Criteria: Identified best examples for final dataset
- Consistency Checking: Ensured logical flow and coherence
- Benchmark Alignment: Selected examples suitable for evaluation
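The announcement describes this pipeline at a high level rather than releasing orchestration code, so the following is a minimal sketch of one plausible shape for the three stages. The call_llm function and all model identifiers are placeholders, not the models the team actually used.
import random

TASK_MODELS = [f"task-gen-model-{i}" for i in range(5)]     # 5 task generators
TRAJECTORY_MODELS = [f"agent-model-{i}" for i in range(3)]  # 3 agent models
JUDGE_MODELS = [f"judge-model-{i}" for i in range(2)]       # 2 quality raters

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a real LLM call."""
    return f"[{model}] response to: {prompt[:40]}..."

def generate_trajectory(server_metadata: str) -> dict:
    # Stage 1: a randomly chosen generator drafts a realistic task
    task = call_llm(random.choice(TASK_MODELS),
                    f"Write a realistic task using these tools: {server_metadata}")
    # Stage 2: an agent model plans and executes the task step by step
    trajectory = call_llm(random.choice(TRAJECTORY_MODELS),
                          f"Plan and execute: {task}")
    # Stage 3: judge models rate difficulty and quality for filtering
    ratings = [call_llm(m, f"Rate difficulty and quality: {trajectory}")
               for m in JUDGE_MODELS]
    return {"task": task, "trajectory": trajectory, "ratings": ratings}

print(generate_trajectory("weather.get_forecast, calendar.create_event"))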
Dataset Structure and Format
Trajectory Components: Each Toucan trajectory follows a standardized structure based on real-world tool-calling scenarios:
Basic Structure:
- Task Description: Natural language description of the desired outcome
- Planning Phase: Agent creates a step-by-step execution plan
- Tool Execution: Actual API calls and tool interactions
- Result Summary: Friendly summary of what was accomplished
The trajectories vary in complexity, covering everything from analyzing sales reports and drafting business summaries to scheduling meetings with colleagues and sending out calendar invites.
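To make that structure concrete, here is what a single record might look like once loaded. The field names below are illustrative only, not Toucan's actual schema; consult the dataset card on Hugging Face for the exact fields.
# Illustrative shape of one trajectory, mirroring the four components
# above. Field names are hypothetical, not the dataset's real schema.
example_trajectory = {
    "task": "Summarize last quarter's sales and schedule a review meeting.",
    "plan": [
        "1. Pull the Q3 sales report via the analytics tool",
        "2. Draft a summary of the key figures",
        "3. Create a calendar event and invite the team",
    ],
    "tool_calls": [
        {"tool": "analytics.get_report", "arguments": {"quarter": "Q3"},
         "result": {"revenue": 1200000, "growth": "8%"}},
        {"tool": "calendar.create_event",
         "arguments": {"title": "Q3 Sales Review", "attendees": ["team"]},
         "result": {"status": "created"}},
    ],
    "summary": "Q3 revenue was $1.2M (8% growth); review meeting scheduled.",
    "quality": {"difficulty": "medium", "rating": 4.5},
}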
Performance Benchmarks and Results
Benchmark Evaluation Framework
Primary Benchmarks: Toucan's effectiveness was evaluated using several industry-standard benchmarks:
Berkeley Function Calling Leaderboard v3 (BFCLv3):
- Industry Standard: Widely recognized benchmark for tool-calling capabilities
- Comprehensive Testing: Covers various tool-calling scenarios and complexities
- Real-World Relevance: Based on practical use cases and applications
- Comparative Analysis: Enables direct comparison with other models and approaches
MCP-Universe Benchmark:
- Salesforce Development: Created by Salesforce for real tool-calling evaluation
- Diverse Domains: Financial analysis, 3D design, web search, and more
- Production Scenarios: Based on actual enterprise use cases
- Multi-Modal Tasks: Includes various input and output modalities
τ-Bench and τ²-Bench:
- Specialized Evaluation: Focus on retail, airline, and telecommunications environments
- Domain-Specific Testing: Industry-relevant scenarios and challenges
- Performance Metrics: Detailed analysis of tool-calling accuracy and efficiency
- Real-World Applications: Based on actual business processes and workflows
Performance Improvements
Qwen Model Family Results: Models fine-tuned on Toucan data showed impressive performance gains across all tested configurations:
Qwen-2.5-7B:
- τ-Bench Improvement: An increase of up to 7 percentage points
- τ²-Bench Enhancement: Significant performance boost
- Efficiency Gains: Better tool selection and execution
- Error Reduction: Fewer failed tool calls and better recovery
Qwen-2.5-14B:
- Scaling Benefits: Larger model showed even greater improvements
- Complex Task Handling: Better performance on multi-tool scenarios
- Reasoning Enhancement: Improved tool selection logic
- Consistency Improvement: More reliable task completion
Qwen-2.5-32B:
- BFCL V3 Results: An improvement of nearly 9 percentage points
- GPT-4.5-Preview Comparison: Narrowly outperformed OpenAI's GPT-4.5-Preview, which is estimated to have at least a trillion parameters
- Parameter Efficiency: Achieved superior results with significantly fewer parameters
- Benchmark Success: Also showed strong performance on MCP-Universe benchmark
Cross-Benchmark Consistency:
- MCP-Universe: Strong performance across diverse real-world tasks
- Multi-Domain Success: Consistent improvements across different application areas
- Scalability Demonstration: Performance gains scaled with model size
- Generalization Ability: Improvements transferred to unseen tasks and domains
Parallel Tool Calling Capabilities
Efficiency Innovation: One of Toucan's unique features is its emphasis on parallel tool calling, where AI agents learn to execute multiple tools simultaneously:
Economic Benefits: As Zhangchen Xu, a graduate student at the University of Washington who helped build the dataset as an IBM intern, explained: "You can imagine how parallel calling improves efficiency, which can lower the cost of running agentic systems."
Technical Implementation (illustrated by the sketch after this list):
- 20% Parallel Scenarios: One-fifth of Toucan's scenarios require multiple simultaneous tool calls
- Efficiency Focus: Designed to teach AI agents more economical operation
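To see why parallelism pays off, consider two tool calls with no data dependency between them: an agent can issue both at once rather than waiting for each in turn. The sketch below is a toy illustration using asyncio; the tools and their latencies are invented stand-ins for real API round-trips.
import asyncio
import time

async def get_weather(city: str) -> str:
    await asyncio.sleep(1.0)  # simulated API latency
    return f"Weather in {city}: sunny"

async def get_calendar(day: str) -> str:
    await asyncio.sleep(1.0)  # simulated API latency
    return f"Calendar for {day}: 2 meetings"

async def main():
    start = time.perf_counter()
    # Independent calls run concurrently: ~1s total instead of ~2s serially
    weather, calendar = await asyncio.gather(
        get_weather("Seattle"), get_calendar("Monday")
    )
    print(weather, "|", calendar)
    print(f"elapsed: {time.perf_counter() - start:.1f}s")

asyncio.run(main())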
Real-World Applications and Use Cases
Key Application Areas
Toucan's diverse scenarios cover a wide range of practical applications that demonstrate the potential of well-trained AI agents:
Business Automation:
- Analyzing sales reports and generating business summaries
- Scheduling meetings and sending calendar invites
- Coordinating between multiple business tools and platforms
- Managing complex multi-step workflows
Data Processing:
- Integrating information from multiple sources
- Processing and analyzing large datasets
- Generating reports with real-time data
- Handling various file formats and data types
Communication and Coordination:
- Managing multi-channel communications
- Coordinating tasks across different platforms
- Facilitating team collaboration
- Automating routine administrative tasks
Model Context Protocol (MCP) Integration
Understanding MCP Architecture
Standardized Interface: The Model Context Protocol, originally developed and open-sourced by Anthropic, provides a standardized way for AI agents to interact with external tools and services:
Core Components:
- MCP Servers: Topic-based software libraries that provide access to specific tools and APIs
- Standardized Communication: Consistent interface for tool discovery and execution
- Security Framework: Built-in security measures for safe tool execution
- Extensibility: Easy addition of new tools and services to the ecosystem
Toucan's MCP Focus:
- Real MCP Environments: Dataset based on actual MCP server implementations
- Production Readiness: Tools and services that are actively used in real applications
- Comprehensive Coverage: 500 different MCP servers representing diverse tool categories
- Future Compatibility: Training that prepares models for the evolving MCP ecosystem
Tool Discovery and Selection
Toucan teaches AI agents how to intelligently select and coordinate multiple tools for complex tasks, with scenarios covering everything from simple single-tool operations to sophisticated multi-tool workflows.
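For readers who want to see what discovery and execution look like in practice, here is a minimal client sketch using the official MCP Python SDK (the mcp package). The server command and tool name are placeholders, and the SDK's API surface may differ across versions, so treat this as an outline rather than a drop-in implementation.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Placeholder: launch a local MCP server over stdio
server_params = StdioServerParameters(command="python", args=["my_mcp_server.py"])

async def main():
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Discovery: ask the server which tools it exposes
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, "-", tool.description)
            # Execution: invoke one tool by name with JSON arguments
            result = await session.call_tool(
                "get_forecast", arguments={"city": "Seattle"}
            )
            print(result)

asyncio.run(main())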
Research Impact and Academic Significance
Key Research Contributions
Advancing Tool-Calling Research: As Rameswar Panda, the IBM researcher who led the team behind Toucan, explained: "Tool-calling is central to AI agents. How can you train better agents? Through diverse, high-quality examples sourced from the real world."
Methodological Innovations:
- Real vs. Synthetic Data: Demonstrating the superiority of real API executions over simulated data
- Multi-Model Generation: Using multiple LLMs for different aspects of dataset creation
- Quality Assessment: Implementing systematic quality rating for trajectory selection
- Scale Impact: Proving that larger, higher-quality datasets lead to better performance
Academic-Industry Collaboration
Partnership Success: The collaboration between IBM Research and the University of Washington exemplifies effective academic-industry partnerships, with graduate student Zhangchen Xu contributing as an IBM intern.
Open Science Approach:
- Public release of the complete dataset
- Detailed methodology documentation
- Community-driven research advancement
- Educational resource creation
Future Development and Roadmap
Planned Enhancements
Dataset Expansion: The research team plans to onboard new MCP servers with a wider range of tools that have come online since June 2025, when they collected their seed data.
Ongoing Research: As IBM researcher Adriana Meza noted: "We're repurposing part of the Toucan code for these new projects and of course building on all the tool-calling knowledge we acquired in putting the dataset together."
Reinforcement Learning Integration
Future RL Development: The research team plans to create a reinforcement learning gymnasium and benchmark to give LLMs more experience with enterprise workflows. This will build on the tool-calling knowledge acquired in creating the Toucan dataset and help advance the field of AI agent training.
Technical Access and Implementation
Getting Started with Toucan
Dataset Access:
The Toucan dataset is freely available through Hugging Face at Agent-Ark/Toucan-1.5M, making it accessible to researchers and developers worldwide.
Basic Usage:
from datasets import load_dataset
# Load the Toucan dataset
dataset = load_dataset("Agent-Ark/Toucan-1.5M")
# Access training trajectories
trajectories = dataset['train']
Data Characteristics:
- Complete tool-calling interaction sequences
- Quality-rated trajectories with difficulty assessments
- Rich metadata including tool usage patterns
- Comprehensive documentation for research use
Performance Optimization
Training Considerations: The Toucan dataset's large scale (1.5 million trajectories) requires careful consideration of computational resources and training strategies. Researchers should plan for appropriate hardware and training time when working with this comprehensive dataset.
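One practical option, using the standard streaming mode of the Hugging Face datasets library rather than anything Toucan-specific, is to iterate over trajectories without downloading all 1.5 million records up front:
from datasets import load_dataset

# Stream trajectories instead of materializing the full dataset,
# which avoids holding 1.5M examples on disk or in memory at once
stream = load_dataset("Agent-Ark/Toucan-1.5M", split="train", streaming=True)

for i, example in enumerate(stream):
    print(example)  # inspect a single trajectory record
    if i >= 2:      # stop after a few examples
        break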
Industry Impact and Adoption
Transforming AI Development
Paradigm Shift: Toucan represents a significant shift in how AI agents are developed and trained:
From Synthetic to Real:
- Authentic Data: Moving from synthetic, simulated data to real-world interactions
- Practical Relevance: Training on scenarios that actually occur in production environments
- Improved Generalization: Better performance on real-world tasks due to realistic training data
- Reduced Gap: Narrowing the gap between training and deployment environments
Scale and Quality:
- Unprecedented Scale: Largest dataset of its kind enables more effective training
- Quality Focus: Emphasis on high-quality, curated examples improves training outcomes
- Comprehensive Coverage: Broad coverage of tools and scenarios enables versatile AI agents
- Continuous Improvement: Framework for ongoing dataset expansion and improvement
Research and Development Impact
Academic and Industry Interest: The release of Toucan has generated significant interest in the AI research community, as evidenced by its immediate popularity on Hugging Face. The dataset provides researchers with unprecedented access to high-quality, real-world tool-calling scenarios for developing and evaluating AI agents.
Key Research Areas:
- Tool-calling methodology development
- Multi-tool coordination strategies
- Error handling and recovery in AI agents
- Parallel tool execution optimization
Conclusion
The release of IBM's Toucan dataset marks a pivotal moment in the evolution of AI agents, providing the research and development community with an unprecedented resource for training more capable, practical, and reliable tool-calling systems. With its 1.5 million real-world scenarios spanning 2,000 different web services, Toucan addresses the critical data scarcity problem that has long hindered the development of truly effective AI agents.
Key Achievements:
- Unprecedented Scale: 5x larger than previous open-source datasets
- Real-World Authenticity: Actual API executions rather than synthetic simulations
- Proven Performance: Demonstrated improvements on leading benchmarks
- Open Access: Freely available to the global research community
- Industry Collaboration: Successful partnership between academia and industry
- Future Foundation: Establishes framework for continued dataset expansion and improvement
Transformative Impact:
Toucan's impact extends far beyond its immediate technical contributions. By providing high-quality, realistic training data, it enables the development of AI agents that can truly understand and navigate the complex, multi-tool workflows that characterize modern digital work environments. The dataset's emphasis on parallel tool calling and sophisticated error handling prepares AI systems for the practical challenges they will face in real-world deployments.
Looking Forward:
The success of Toucan demonstrates the value of collaborative, open-science approaches to AI development. As the research team continues to expand the dataset with new MCP servers and develops reinforcement learning frameworks, Toucan is positioned to remain at the forefront of AI agent research and development.
The enthusiastic community response, with Toucan becoming a top trending dataset on Hugging Face immediately after release, underscores the strong demand for high-quality, realistic training data in the AI agent development community.
Toucan represents a significant step forward in AI agent training data quality and realism. By providing access to real-world tool-calling scenarios at unprecedented scale, it enables researchers and developers to build more capable and reliable AI agents that can effectively navigate the complex world of modern digital tools and services.
Sources
- Toucan: A new goldmine for tool-calling AI agents - IBM Research, October 17, 2025
- Toucan Dataset on Hugging Face - Agent-Ark
- TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments - IBM Research Study
- Berkeley Function Calling Leaderboard - UC Berkeley
- Model Context Protocol - Anthropic