IBM Releases Toucan: Largest Open-Source Tool-Calling Dataset with 1.5M Real-World Scenarios

IBM and the University of Washington release Toucan, a dataset of 1.5 million real-world tool-calling scenarios designed to train better AI agents.

by HowAIWorks Team
IBM, Toucan, Tool Calling, AI Agents, Dataset, Open Source, MCP, API, AI Training, Machine Learning, AI Development, Research, Berkeley Function Calling, Qwen, AI Models, Agent Training, Real-World Data, Multimodal AI, AI Performance, AI Benchmarks

Introduction

On October 17, 2025, IBM Research and the University of Washington announced the release of Toucan, the largest and most comprehensive publicly available collection of tool-calling scenarios for AI agents. With 1.5 million real-world task sequences spanning 2,000 different web services, Toucan is designed to transform how AI agents learn to interact with the world and accomplish practical tasks.

As one enthusiast noted on LinkedIn: "Toucan changes everything. This isn't another simulated dataset. It captures actual API executions in real environments. Complete interaction chains from start to finish."

The dataset addresses one of the most critical challenges in AI development: teaching large language models to properly call and execute tools through APIs. It captured the community's attention almost immediately after release, becoming a top trending dataset on Hugging Face.

Key highlights:

  • 1.5 million real-world tool-calling trajectories
  • 2,000+ different web services and APIs
  • 5x larger than the previous largest open-source dataset
  • Real MCP (Model Context Protocol) environments
  • Proven performance improvements on leading benchmarks
  • Open-source availability on Hugging Face

The Challenge of Tool-Calling in AI

Why Tool-Calling Matters

Tool-calling is the capability that most clearly distinguishes AI agents from simple chatbots. Without the ability to find, select, and deploy external tools and applications, large language models remain limited to text generation rather than becoming truly useful autonomous agents.

The Evolution from Chatbots to Agents:

  • Traditional LLMs: Limited to text-based responses
  • Modern AI Agents: Can interact with external systems, APIs, and tools
  • Future Vision: Fully autonomous agents capable of complex real-world tasks

Training Data Challenges

Teaching LLMs to properly execute tool-calling has historically been extremely difficult due to several key challenges:

Data Scarcity:

  • High-quality tool-calling examples are rare on the internet
  • Most existing data lacks real-world complexity
  • Synthetic data often fails to capture nuanced API interactions
  • Limited diversity in available tool-calling scenarios

Quality Requirements:

  • Need for end-to-end interaction sequences
  • Requirement for realistic error handling
  • Complex multi-tool workflows
  • Real-world context and constraints

Scale Limitations:

  • Previous datasets were too small for effective training
  • Limited coverage of different tool types and domains
  • Insufficient examples of parallel tool calling
  • Lack of comprehensive evaluation scenarios

Toucan Dataset Overview

Dataset Specifications

Scale and Scope:

  • 1.5 million tool-calling trajectories
  • 2,000+ unique web services and APIs
  • 500 MCP (Model Context Protocol) servers
  • Multiple complexity levels and task types
  • Comprehensive coverage across domains

Data Quality:

  • Real-world scenarios: Actual API executions, not simulations
  • Complete interaction chains: From task initiation to completion
  • Quality-rated: Each trajectory rated for difficulty and quality
  • Diverse complexity: From simple single-tool tasks to complex multi-tool workflows
  • Error handling: Includes realistic failure scenarios and recovery

Comparison with Existing Datasets

Size Advantage:

  • Toucan: 1.5 million trajectories
  • Nvidia Nemotron: 310,000 trajectories (previous largest)
  • 5x larger than the next biggest open-source dataset
  • Most diverse tool coverage available

Quality Differences:

  • Real vs. Synthetic: Toucan uses actual API executions
  • Comprehensive Coverage: 2,000+ tools vs. limited tool sets in other datasets
  • Multi-modal Support: Includes various input and output modalities
  • Production-Ready: Based on real MCP server implementations

Technical Architecture and Creation Process

Data Collection Methodology

MCP Server Sourcing: The research team began by gathering Model Context Protocol server metadata from two primary sources:

GitHub Repositories:

  • Comprehensive scan of open-source MCP servers
  • Filtering for active, well-maintained projects
  • Quality assessment of server implementations
  • Documentation and API completeness evaluation

Smithery.ai Platform:

  • Curated collection of production MCP servers
  • Verified server functionality and reliability
  • Access to diverse tool categories and domains
  • Real-world usage patterns and examples

Quality Filtering Process:

  • Error Testing: Removed servers that returned frequent error messages
  • Functionality Verification: Ensured all tools work as expected
  • Documentation Quality: Required clear API documentation
  • Reliability Assessment: Tested server stability and response times
  • Final Selection: 500 high-quality MCP servers chosen for dataset creation
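To make the screening step above concrete, here is a minimal sketch of such a filtering pass. Everything in it — the ServerReport shape, the 10% error threshold, the server names — is an illustrative assumption, not the team's actual tooling:

from dataclasses import dataclass

@dataclass
class ServerReport:
    name: str
    probe_results: list[bool]  # outcome of each probe call against the server's tools
    has_docs: bool

def screen_servers(reports: list[ServerReport], max_error_rate: float = 0.1) -> list[str]:
    # Keep servers whose probe calls rarely error and that ship usable docs.
    selected = []
    for report in reports:
        failures = report.probe_results.count(False)
        error_rate = failures / max(len(report.probe_results), 1)
        if error_rate <= max_error_rate and report.has_docs:
            selected.append(report.name)
    return selected

# Example: one stable server, one too flaky to keep.
reports = [
    ServerReport("weather-mcp", [True, True, True, True], True),
    ServerReport("flaky-mcp", [True, False, False, False], True),
]
print(screen_servers(reports))  # ['weather-mcp']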

AI-Powered Dataset Generation

Multi-Model Approach: The Toucan creation process employed multiple specialized AI models for different aspects of dataset generation:

Task Scenario Generation (5 Models):

  • Diverse Perspectives: Five different open-source LLMs used for variety
  • Creative Task Design: Models generated plausible, realistic scenarios
  • Domain Coverage: Ensured broad coverage across different use cases
  • Complexity Variation: Generated tasks of varying difficulty levels
  • Real-World Grounding: Based scenarios on actual tool capabilities

Agent Trajectory Construction (3 Models):

  • Step-by-Step Planning: Models created detailed execution plans
  • Tool Selection Logic: Demonstrated proper tool choice reasoning
  • Error Handling: Included realistic failure and recovery scenarios
  • Multi-Tool Coordination: Showed complex workflows with multiple tools
  • Human-Like Interaction: Generated natural dialogue and explanations

Quality Assessment (2 Models):

  • Difficulty Rating: Assessed task complexity and challenge level
  • Quality Scoring: Evaluated trajectory completeness and realism
  • Selection Criteria: Identified best examples for final dataset
  • Consistency Checking: Ensured logical flow and coherence
  • Benchmark Alignment: Selected examples suitable for evaluation
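The 5 / 3 / 2 division of labor described above can be pictured as a three-stage pipeline. The sketch below is schematic only: the generate() helper stands in for whatever inference API each model is served through, and the model identifiers and prompts are placeholders, not the actual configuration used to build Toucan:

import random

# Placeholder model pools mirroring the 5 / 3 / 2 split described above;
# the identifiers and prompts are made up for illustration.
TASK_MODELS = [f"task-llm-{i}" for i in range(5)]
TRAJECTORY_MODELS = [f"agent-llm-{i}" for i in range(3)]
JUDGE_MODELS = [f"judge-llm-{i}" for i in range(2)]

def generate(model: str, prompt: str) -> str:
    # Stand-in for a real inference call; returns a dummy completion.
    return f"[{model}] completion for: {prompt[:40]}..."

def build_example(server_metadata: str) -> dict:
    # Stage 1: one of five scenario models invents a plausible user task.
    task = generate(random.choice(TASK_MODELS),
                    f"Write a realistic user task for these tools: {server_metadata}")
    # Stage 2: one of three agent models plans and executes the tool calls.
    trajectory = generate(random.choice(TRAJECTORY_MODELS),
                          f"Solve step by step with tool calls: {task}")
    # Stage 3: two judge models rate difficulty and quality of the result.
    ratings = [generate(m, f"Rate this trajectory: {trajectory}") for m in JUDGE_MODELS]
    return {"task": task, "trajectory": trajectory, "ratings": ratings}

print(build_example("get_weather, book_flight"))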

Dataset Structure and Format

Trajectory Components: Each Toucan trajectory follows a standardized structure based on real-world tool-calling scenarios:

Basic Structure:

  • Task Description: Natural language description of the desired outcome
  • Planning Phase: Agent creates a step-by-step execution plan
  • Tool Execution: Actual API calls and tool interactions
  • Result Summary: Friendly summary of what was accomplished

The trajectories vary in complexity, covering everything from analyzing sales reports and drafting business summaries to scheduling meetings with colleagues and sending out calendar invites.
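A made-up record illustrating this four-part shape might look like the following; the field names here are assumptions for illustration, so consult the dataset card on Hugging Face for the real schema:

trajectory_record = {
    "task": "Summarize last quarter's sales report and email it to the team",
    "plan": [
        "Fetch the latest quarterly sales report",
        "Summarize the key figures",
        "Email the summary to the team alias",
    ],
    "tool_calls": [
        {"tool": "fetch_report", "arguments": {"quarter": "Q3"}},
        {"tool": "send_email", "arguments": {"to": "team@example.com", "body": "..."}},
    ],
    "summary": "Sent the team a summary of the Q3 sales report.",
}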

Performance Benchmarks and Results

Benchmark Evaluation Framework

Primary Benchmarks: Toucan's effectiveness was evaluated using several industry-standard benchmarks:

Berkeley Function Calling Leaderboard v3 (BFCLv3):

  • Industry Standard: Widely recognized benchmark for tool-calling capabilities
  • Comprehensive Testing: Covers various tool-calling scenarios and complexities
  • Real-World Relevance: Based on practical use cases and applications
  • Comparative Analysis: Enables direct comparison with other models and approaches

MCP-Universe Benchmark:

  • Salesforce Development: Created by Salesforce for real tool-calling evaluation
  • Diverse Domains: Financial analysis, 3D design, web search, and more
  • Production Scenarios: Based on actual enterprise use cases
  • Multi-Modal Tasks: Includes various input and output modalities

τ-Bench and τ²-Bench:

  • Specialized Evaluation: Focus on retail, airline, and telecommunications environments
  • Domain-Specific Testing: Industry-relevant scenarios and challenges
  • Performance Metrics: Detailed analysis of tool-calling accuracy and efficiency
  • Real-World Applications: Based on actual business processes and workflows

Performance Improvements

Qwen Model Family Results: Models fine-tuned on Toucan data showed impressive performance gains across all tested configurations:

Qwen-2.5-7B:

  • τ-Bench Improvement: Up to 7 percentage points increase
  • τ²-Bench Enhancement: Significant performance boost
  • Efficiency Gains: Better tool selection and execution
  • Error Reduction: Fewer failed tool calls and better recovery

Qwen-2.5-14B:

  • Scaling Benefits: Larger model showed even greater improvements
  • Complex Task Handling: Better performance on multi-tool scenarios
  • Reasoning Enhancement: Improved tool selection logic
  • Consistency Improvement: More reliable task completion

Qwen-2.5-32B:

  • BFCL V3 Results: Nearly 9 percentage points improvement
  • GPT-4.5-Preview Comparison: Narrowly outperformed OpenAI's GPT-4.5-Preview, which is estimated to have at least a trillion parameters
  • Parameter Efficiency: Achieved superior results with significantly fewer parameters
  • Benchmark Success: Also showed strong performance on MCP-Universe benchmark

Cross-Benchmark Consistency:

  • MCP-Universe: Strong performance across diverse real-world tasks
  • Multi-Domain Success: Consistent improvements across different application areas
  • Scalability Demonstration: Performance gains scaled with model size
  • Generalization Ability: Improvements transferred to unseen tasks and domains

Parallel Tool Calling Capabilities

Efficiency Innovation: One of Toucan's unique features is its emphasis on parallel tool calling, where AI agents learn to execute multiple tools simultaneously:

Economic Benefits: As Zhangchen Xu, a graduate student at the University of Washington who helped build the dataset as an IBM intern, explained: "You can imagine how parallel calling improves efficiency, which can lower the cost of running agentic systems."

Technical Implementation:

  • 20% Parallel Scenarios: One-fifth of Toucan's scenarios require multiple simultaneous tool calls
  • Efficiency Focus: Designed to teach AI agents more economical operation
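To see why parallel calling saves wall-clock time, consider this minimal asyncio sketch. The tool names and latencies are invented, and asyncio.sleep stands in for real network calls:

import asyncio

async def call_tool(name: str, latency: float) -> str:
    # Stand-in for a real tool call; sleeps to mimic network latency.
    await asyncio.sleep(latency)
    return f"{name}: done"

async def main():
    # Issuing both calls at once means total wall-clock time is roughly
    # the slower call (~1.2s) rather than the sum of both (~2.2s).
    results = await asyncio.gather(
        call_tool("get_weather", 1.0),
        call_tool("get_calendar", 1.2),
    )
    print(results)

asyncio.run(main())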

Real-World Applications and Use Cases

Key Application Areas

Toucan's diverse scenarios cover a wide range of practical applications that demonstrate the potential of well-trained AI agents:

Business Automation:

  • Analyzing sales reports and generating business summaries
  • Scheduling meetings and sending calendar invites
  • Coordinating between multiple business tools and platforms
  • Managing complex multi-step workflows

Data Processing:

  • Integrating information from multiple sources
  • Processing and analyzing large datasets
  • Generating reports with real-time data
  • Handling various file formats and data types

Communication and Coordination:

  • Managing multi-channel communications
  • Coordinating tasks across different platforms
  • Facilitating team collaboration
  • Automating routine administrative tasks

Model Context Protocol (MCP) Integration

Understanding MCP Architecture

Standardized Interface: The Model Context Protocol, originally developed and open-sourced by Anthropic, provides a standardized way for AI agents to interact with external tools and services:

Core Components:

  • MCP Servers: Topic-based software libraries that provide access to specific tools and APIs
  • Standardized Communication: Consistent interface for tool discovery and execution
  • Security Framework: Built-in security measures for safe tool execution
  • Extensibility: Easy addition of new tools and services to the ecosystem
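In MCP, a server advertises its tools through a standard listing call, with each tool described by a name, a human-readable description, and a JSON Schema for its arguments. The descriptor below shows that shape; the get_forecast tool itself is a made-up example, not one of Toucan's 500 servers:

forecast_tool = {
    "name": "get_forecast",
    "description": "Return the weather forecast for a city",
    "inputSchema": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "days": {"type": "integer", "minimum": 1, "maximum": 7},
        },
        "required": ["city"],
    },
}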

Toucan's MCP Focus:

  • Real MCP Environments: Dataset based on actual MCP server implementations
  • Production Readiness: Tools and services that are actively used in real applications
  • Comprehensive Coverage: 500 different MCP servers representing diverse tool categories
  • Future Compatibility: Training that prepares models for the evolving MCP ecosystem

Tool Discovery and Selection

Toucan teaches AI agents how to intelligently select and coordinate multiple tools for complex tasks, with scenarios covering everything from simple single-tool operations to sophisticated multi-tool workflows.
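Models trained on Toucan learn tool selection through the LLM's own reasoning over tool descriptions, but the underlying matching problem can be illustrated with a deliberately naive keyword-overlap scorer. This toy sketch is not how Toucan-trained agents work; it only shows what "selecting the right tool for a task" means mechanically:

def select_tools(task: str, tools: list[dict], top_k: int = 2) -> list[str]:
    # Score each tool by word overlap between the task and its description.
    task_words = set(task.lower().split())
    scored = []
    for tool in tools:
        desc_words = set(tool["description"].lower().split())
        scored.append((len(task_words & desc_words), tool["name"]))
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]

tools = [
    {"name": "get_forecast", "description": "return the weather forecast for a city"},
    {"name": "send_email", "description": "send an email message to a recipient"},
]
print(select_tools("email me tomorrow's weather forecast", tools))
# ['get_forecast', 'send_email']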

Research Impact and Academic Significance

Key Research Contributions

Advancing Tool-Calling Research: As Rameswar Panda, the IBM researcher who led the team behind Toucan, explained: "Tool-calling is central to AI agents. How can you train better agents? Through diverse, high-quality examples sourced from the real world."

Methodological Innovations:

  • Real vs. Synthetic Data: Demonstrating the superiority of real API executions over simulated data
  • Multi-Model Generation: Using multiple LLMs for different aspects of dataset creation
  • Quality Assessment: Implementing systematic quality rating for trajectory selection
  • Scale Impact: Proving that larger, higher-quality datasets lead to better performance

Academic-Industry Collaboration

Partnership Success: The collaboration between IBM Research and the University of Washington exemplifies effective academic-industry partnerships, with graduate student Zhangchen Xu contributing as an IBM intern.

Open Science Approach:

  • Public release of the complete dataset
  • Detailed methodology documentation
  • Community-driven research advancement
  • Educational resource creation

Future Development and Roadmap

Planned Enhancements

Dataset Expansion: The research team plans to onboard new MCP servers with a wider range of tools that have come online since June 2025, when they collected their seed data.

Ongoing Research: As IBM researcher Adriana Meza noted: "We're repurposing part of the Toucan code for these new projects and of course building on all the tool-calling knowledge we acquired in putting the dataset together."

Reinforcement Learning Integration

Future RL Development: The research team plans to create a reinforcement learning gymnasium and benchmark to give LLMs more experience with enterprise workflows. This will build on the tool-calling knowledge acquired in creating the Toucan dataset and help advance the field of AI agent training.

Technical Access and Implementation

Getting Started with Toucan

Dataset Access: The Toucan dataset is freely available through Hugging Face at Agent-Ark/Toucan-1.5M, making it accessible to researchers and developers worldwide.

Basic Usage:

from datasets import load_dataset

# Load the Toucan dataset
dataset = load_dataset("Agent-Ark/Toucan-1.5M")

# Access training trajectories
trajectories = dataset['train']
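
Once loaded, individual trajectories can be inspected like any Hugging Face dataset. The snippet below only peeks at the structure; the actual column names should be read from the dataset card rather than assumed:

# Peek at the first trajectory and the dataset size.
example = trajectories[0]
print(example.keys())         # column names as published on the dataset card
print(trajectories.num_rows)  # on the order of 1.5M rows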

Data Characteristics:

  • Complete tool-calling interaction sequences
  • Quality-rated trajectories with difficulty assessments
  • Rich metadata including tool usage patterns
  • Comprehensive documentation for research use

Performance Optimization

Training Considerations: The Toucan dataset's large scale (1.5 million trajectories) requires careful consideration of computational resources and training strategies. Researchers should plan for appropriate hardware and training time when working with this comprehensive dataset.
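One practical way to keep memory bounded is the datasets library's streaming mode, which iterates over records without downloading the full dataset up front. A minimal sketch:

from datasets import load_dataset

# Stream records instead of materializing all 1.5M trajectories at once.
stream = load_dataset("Agent-Ark/Toucan-1.5M", split="train", streaming=True)

for i, example in enumerate(stream):
    # ...feed `example` into a filtering or fine-tuning pipeline here...
    if i >= 2:  # just peek at the first few records in this sketch
        break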

Industry Impact and Adoption

Transforming AI Development

Paradigm Shift: Toucan represents a significant shift in how AI agents are developed and trained:

From Synthetic to Real:

  • Authentic Data: Moving from synthetic, simulated data to real-world interactions
  • Practical Relevance: Training on scenarios that actually occur in production environments
  • Improved Generalization: Better performance on real-world tasks due to realistic training data
  • Reduced Gap: Narrowing the gap between training and deployment environments

Scale and Quality:

  • Unprecedented Scale: Largest dataset of its kind enables more effective training
  • Quality Focus: Emphasis on high-quality, curated examples improves training outcomes
  • Comprehensive Coverage: Broad coverage of tools and scenarios enables versatile AI agents
  • Continuous Improvement: Framework for ongoing dataset expansion and improvement

Research and Development Impact

Academic and Industry Interest: The release of Toucan has generated significant interest in the AI research community, as evidenced by its immediate popularity on Hugging Face. The dataset provides researchers with unprecedented access to high-quality, real-world tool-calling scenarios for developing and evaluating AI agents.

Key Research Areas:

  • Tool-calling methodology development
  • Multi-tool coordination strategies
  • Error handling and recovery in AI agents
  • Parallel tool execution optimization

Conclusion

The release of IBM's Toucan dataset marks a pivotal moment in the evolution of AI agents, providing the research and development community with an unprecedented resource for training more capable, practical, and reliable tool-calling systems. With its 1.5 million real-world scenarios spanning 2,000 different web services, Toucan addresses the critical data scarcity problem that has long hindered the development of truly effective AI agents.

Key Achievements:

  • Unprecedented Scale: 5x larger than previous open-source datasets
  • Real-World Authenticity: Actual API executions rather than synthetic simulations
  • Proven Performance: Demonstrated improvements on leading benchmarks
  • Open Access: Freely available to the global research community
  • Industry Collaboration: Successful partnership between academia and industry
  • Future Foundation: Establishes framework for continued dataset expansion and improvement

Transformative Impact:

Toucan's impact extends far beyond its immediate technical contributions. By providing high-quality, realistic training data, it enables the development of AI agents that can truly understand and navigate the complex, multi-tool workflows that characterize modern digital work environments. The dataset's emphasis on parallel tool calling and sophisticated error handling prepares AI systems for the practical challenges they will face in real-world deployments.

Looking Forward:

The success of Toucan demonstrates the value of collaborative, open-science approaches to AI development. As the research team continues to expand the dataset with new MCP servers and develops reinforcement learning frameworks, Toucan is positioned to remain at the forefront of AI agent research and development.

The enthusiastic community response, with Toucan becoming a top trending dataset on Hugging Face immediately after release, demonstrates strong demand for high-quality, realistic training data in the AI agent development community.

Toucan represents a significant step forward in AI agent training data quality and realism. By providing access to real-world tool-calling scenarios at unprecedented scale, it enables researchers and developers to build more capable and reliable AI agents that can effectively navigate the complex world of modern digital tools and services.

Interested in learning more about AI agents and tool-calling capabilities? Explore our AI Fundamentals course to understand how AI systems work, check out our glossary of AI terms for key concepts like API and machine learning, or discover the latest AI models and AI tools in our comprehensive catalog.

Frequently Asked Questions

What is Toucan?
Toucan is the largest publicly available collection of tool-calling scenarios, featuring 1.5 million real-life task sequences that collectively invoke 2,000 different web services through Model Context Protocol (MCP) servers.

How is Toucan different from other tool-calling datasets?
Unlike simulated datasets, Toucan captures actual API executions in real environments with complete interaction chains from start to finish. It's over 5 times larger than the next largest open-source dataset (Nvidia's Nemotron, with 310,000 trajectories).

What kinds of tasks does Toucan cover?
Toucan covers diverse real-world scenarios including analyzing sales reports, drafting business summaries, scheduling meetings, sending calendar invites, and complex multi-tool workflows across various domains.

How was Toucan created?
IBM and University of Washington researchers used five open-source LLMs to generate task scenarios and three additional models to construct step-by-step agent trajectories, all based on metadata from 500 MCP servers gathered from GitHub and Smithery.ai.

What performance improvements does Toucan deliver?
Models fine-tuned on Toucan showed up to 7 percentage points of improvement on τ-Bench and τ²-Bench, with Qwen-2.5-32B improving nearly 9 percentage points on BFCL V3 and outperforming GPT-4.5-Preview.

Can Toucan be used in commercial projects?
Toucan is released as open-source research, but you should check the specific licensing terms on Hugging Face. It's designed primarily for research and development of better AI agents.

Continue Your AI Journey

Explore our lessons and glossary to deepen your understanding.