IBM Releases Toucan: Largest Open-Source Tool-Calling Dataset with 1.5M Real-World Scenarios

IBM and the University of Washington release Toucan, a dataset of 1.5 million real-world tool-calling scenarios designed to train better AI agents.

by HowAIWorks Team
IBM, Toucan, Tool Calling, AI Agents, Dataset, Open Source, MCP, API, AI Training, Machine Learning, AI Development, Research, Berkeley Function Calling, Qwen, AI Models, Agent Training, Real-World Data, Multimodal AI, AI Performance, AI Benchmarks

Introduction

On October 17, 2025, IBM Research and the University of Washington announced the release of Toucan, the largest and most comprehensive publicly available collection of tool-calling scenarios for AI agents. With 1.5 million real-world task sequences spanning 2,000 different web services, Toucan is designed to transform how AI agents learn to interact with the world and accomplish practical tasks.

As one enthusiast noted on LinkedIn: "Toucan changes everything. This isn't another simulated dataset. It captures actual API executions in real environments. Complete interaction chains from start to finish."

The dataset addresses one of the most critical challenges in AI development: teaching large language models to properly call and execute tools through APIs. It captured the community's attention almost immediately after release, becoming a top trending dataset on Hugging Face.

Key highlights:

  • 1.5 million real-world tool-calling trajectories
  • 2,000+ different web services and APIs
  • 5x larger than the previous largest open-source dataset
  • Real MCP (Model Context Protocol) environments
  • Proven performance improvements on leading benchmarks
  • Open-source availability on Hugging Face

The Challenge of Tool-Calling in AI

Why Tool-Calling Matters

Tool-calling is the capability that most clearly distinguishes AI agents from simple chatbots. Without the ability to find, select, and deploy external tools and applications, large language models remain limited to text generation rather than becoming truly useful autonomous agents.

The Evolution from Chatbots to Agents:

  • Traditional LLMs: Limited to text-based responses
  • Modern AI Agents: Can interact with external systems, APIs, and tools
  • Future Vision: Fully autonomous agents capable of complex real-world tasks

Training Data Challenges

Teaching LLMs to properly execute tool-calling has historically been extremely difficult due to several key challenges:

Data Scarcity:

  • High-quality tool-calling examples are rare on the internet
  • Most existing data lacks real-world complexity
  • Synthetic data often fails to capture nuanced API interactions
  • Limited diversity in available tool-calling scenarios

Quality Requirements:

  • Need for end-to-end interaction sequences
  • Requirement for realistic error handling
  • Complex multi-tool workflows
  • Real-world context and constraints

Scale Limitations:

  • Previous datasets were too small for effective training
  • Limited coverage of different tool types and domains
  • Insufficient examples of parallel tool calling
  • Lack of comprehensive evaluation scenarios

Toucan Dataset Overview

Dataset Specifications

Scale and Scope:

  • 1.5 million tool-calling trajectories
  • 2,000+ unique web services and APIs
  • 500 MCP (Model Context Protocol) servers
  • Multiple complexity levels and task types
  • Comprehensive coverage across domains

Data Quality:

  • Real-world scenarios: Actual API executions, not simulations
  • Complete interaction chains: From task initiation to completion
  • Quality-rated: Each trajectory rated for difficulty and quality
  • Diverse complexity: From simple single-tool tasks to complex multi-tool workflows
  • Error handling: Includes realistic failure scenarios and recovery

Comparison with Existing Datasets

Size Advantage:

  • Toucan: 1.5 million trajectories
  • Nvidia Nemotron: 310,000 trajectories (previous largest)
  • 5x larger than the next biggest open-source dataset
  • Most diverse tool coverage available

Quality Differences:

  • Real vs. Synthetic: Toucan uses actual API executions
  • Comprehensive Coverage: 2,000+ tools vs. limited tool sets in other datasets
  • Multi-modal Support: Includes various input and output modalities
  • Production-Ready: Based on real MCP server implementations

Technical Architecture and Creation Process

Data Collection Methodology

MCP Server Sourcing: The research team began by gathering Model Context Protocol server metadata from two primary sources:

GitHub Repositories:

  • Comprehensive scan of open-source MCP servers
  • Filtering for active, well-maintained projects
  • Quality assessment of server implementations
  • Documentation and API completeness evaluation

Smithery.ai Platform:

  • Curated collection of production MCP servers
  • Verified server functionality and reliability
  • Access to diverse tool categories and domains
  • Real-world usage patterns and examples

Quality Filtering Process:

  • Error Testing: Removed servers that returned frequent error messages
  • Functionality Verification: Ensured all tools work as expected
  • Documentation Quality: Required clear API documentation
  • Reliability Assessment: Tested server stability and response times
  • Final Selection: 500 high-quality MCP servers chosen for dataset creation
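To make the screening step above concrete, here is a minimal sketch of such a filtering pass. Everything in it — the ServerReport shape, the 10% error threshold, the server names — is an illustrative assumption, not the team's actual tooling:

from dataclasses import dataclass

@dataclass
class ServerReport:
    name: str
    probe_results: list[bool]  # outcome of each probe call against the server's tools
    has_docs: bool

def screen_servers(reports: list[ServerReport], max_error_rate: float = 0.1) -> list[str]:
    # Keep servers whose probe calls rarely error and that ship usable docs.
    selected = []
    for report in reports:
        failures = report.probe_results.count(False)
        error_rate = failures / max(len(report.probe_results), 1)
        if error_rate <= max_error_rate and report.has_docs:
            selected.append(report.name)
    return selected

# Example: one stable server, one too flaky to keep.
reports = [
    ServerReport("weather-mcp", [True, True, True, True], True),
    ServerReport("flaky-mcp", [True, False, False, False], True),
]
print(screen_servers(reports))  # ['weather-mcp']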

AI-Powered Dataset Generation

Multi-Model Approach: The Toucan creation process employed multiple specialized AI models for different aspects of dataset generation:

Task Scenario Generation (5 Models):

  • Diverse Perspectives: Five different open-source LLMs used for variety
  • Creative Task Design: Models generated plausible, realistic scenarios
  • Domain Coverage: Ensured broad coverage across different use cases
  • Complexity Variation: Generated tasks of varying difficulty levels
  • Real-World Grounding: Based scenarios on actual tool capabilities

Agent Trajectory Construction (3 Models):

  • Step-by-Step Planning: Models created detailed execution plans
  • Tool Selection Logic: Demonstrated proper tool choice reasoning
  • Error Handling: Included realistic failure and recovery scenarios
  • Multi-Tool Coordination: Showed complex workflows with multiple tools
  • Human-Like Interaction: Generated natural dialogue and explanations

Quality Assessment (2 Models):

  • Difficulty Rating: Assessed task complexity and challenge level
  • Quality Scoring: Evaluated trajectory completeness and realism
  • Selection Criteria: Identified best examples for final dataset
  • Consistency Checking: Ensured logical flow and coherence
  • Benchmark Alignment: Selected examples suitable for evaluation
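The 5 / 3 / 2 division of labor described above can be pictured as a three-stage pipeline. The sketch below is schematic only: the generate() helper stands in for whatever inference API each model is served through, and the model identifiers and prompts are placeholders, not the actual configuration used to build Toucan:

import random

# Placeholder model pools mirroring the 5 / 3 / 2 split described above;
# the identifiers and prompts are made up for illustration.
TASK_MODELS = [f"task-llm-{i}" for i in range(5)]
TRAJECTORY_MODELS = [f"agent-llm-{i}" for i in range(3)]
JUDGE_MODELS = [f"judge-llm-{i}" for i in range(2)]

def generate(model: str, prompt: str) -> str:
    # Stand-in for a real inference call; returns a dummy completion.
    return f"[{model}] completion for: {prompt[:40]}..."

def build_example(server_metadata: str) -> dict:
    # Stage 1: one of five scenario models invents a plausible user task.
    task = generate(random.choice(TASK_MODELS),
                    f"Write a realistic user task for these tools: {server_metadata}")
    # Stage 2: one of three agent models plans and executes the tool calls.
    trajectory = generate(random.choice(TRAJECTORY_MODELS),
                          f"Solve step by step with tool calls: {task}")
    # Stage 3: two judge models rate difficulty and quality of the result.
    ratings = [generate(m, f"Rate this trajectory: {trajectory}") for m in JUDGE_MODELS]
    return {"task": task, "trajectory": trajectory, "ratings": ratings}

print(build_example("get_weather, book_flight"))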

Dataset Structure and Format

Trajectory Components: Each Toucan trajectory follows a standardized structure based on real-world tool-calling scenarios:

Basic Structure:

  • Task Description: Natural language description of the desired outcome
  • Planning Phase: Agent creates a step-by-step execution plan
  • Tool Execution: Actual API calls and tool interactions
  • Result Summary: Friendly summary of what was accomplished

The trajectories vary in complexity, covering everything from analyzing sales reports and drafting business summaries to scheduling meetings with colleagues and sending out calendar invites.
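A made-up record illustrating this four-part shape might look like the following; the field names here are assumptions for illustration, so consult the dataset card on Hugging Face for the real schema:

trajectory_record = {
    "task": "Summarize last quarter's sales report and email it to the team",
    "plan": [
        "Fetch the latest quarterly sales report",
        "Summarize the key figures",
        "Email the summary to the team alias",
    ],
    "tool_calls": [
        {"tool": "fetch_report", "arguments": {"quarter": "Q3"}},
        {"tool": "send_email", "arguments": {"to": "team@example.com", "body": "..."}},
    ],
    "summary": "Sent the team a summary of the Q3 sales report.",
}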

Performance Benchmarks and Results

Benchmark Evaluation Framework

Primary Benchmarks: Toucan's effectiveness was evaluated using several industry-standard benchmarks:

Berkeley Function Calling Leaderboard v3 (BFCLv3):

  • Industry Standard: Widely recognized benchmark for tool-calling capabilities
  • Comprehensive Testing: Covers various tool-calling scenarios and complexities
  • Real-World Relevance: Based on practical use cases and applications
  • Comparative Analysis: Enables direct comparison with other models and approaches

MCP-Universe Benchmark:

  • Salesforce Development: Created by Salesforce for real tool-calling evaluation
  • Diverse Domains: Financial analysis, 3D design, web search, and more
  • Production Scenarios: Based on actual enterprise use cases
  • Multi-Modal Tasks: Includes various input and output modalities

τ-Bench and τ²-Bench:

  • Specialized Evaluation: Focus on retail, airline, and telecommunications environments
  • Domain-Specific Testing: Industry-relevant scenarios and challenges
  • Performance Metrics: Detailed analysis of tool-calling accuracy and efficiency
  • Real-World Applications: Based on actual business processes and workflows

Performance Improvements

Qwen Model Family Results: Models fine-tuned on Toucan data showed impressive performance gains across all tested configurations:

Qwen-2.5-7B:

  • τ-Bench Improvement: Up to 7 percentage points increase
  • τ²-Bench Enhancement: Significant performance boost
  • Efficiency Gains: Better tool selection and execution
  • Error Reduction: Fewer failed tool calls and better recovery

Qwen-2.5-14B:

  • Scaling Benefits: Larger model showed even greater improvements
  • Complex Task Handling: Better performance on multi-tool scenarios
  • Reasoning Enhancement: Improved tool selection logic
  • Consistency Improvement: More reliable task completion

Qwen-2.5-32B:

  • BFCL V3 Results: Nearly 9 percentage points improvement
  • GPT-4.5-Preview Comparison: Narrowly outperformed OpenAI's GPT-4.5-Preview, which is estimated to have at least a trillion parameters
  • Parameter Efficiency: Achieved superior results with significantly fewer parameters
  • Benchmark Success: Also showed strong performance on MCP-Universe benchmark

Cross-Benchmark Consistency:

  • MCP-Universe: Strong performance across diverse real-world tasks
  • Multi-Domain Success: Consistent improvements across different application areas
  • Scalability Demonstration: Performance gains scaled with model size
  • Generalization Ability: Improvements transferred to unseen tasks and domains

Parallel Tool Calling Capabilities

Efficiency Innovation: One of Toucan's unique features is its emphasis on parallel tool calling, where AI agents learn to execute multiple tools simultaneously:

Economic Benefits: As Zhangchen Xu, a graduate student at the University of Washington who helped build the dataset as an IBM intern, explained: "You can imagine how parallel calling improves efficiency, which can lower the cost of running agentic systems."

Technical Implementation:

  • 20% Parallel Scenarios: One-fifth of Toucan's scenarios require multiple simultaneous tool calls
  • Efficiency Focus: Designed to teach AI agents more economical operation
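To see why parallel calling saves wall-clock time, consider this minimal asyncio sketch. The tool names and latencies are invented, and asyncio.sleep stands in for real network calls:

import asyncio

async def call_tool(name: str, latency: float) -> str:
    # Stand-in for a real tool call; sleeps to mimic network latency.
    await asyncio.sleep(latency)
    return f"{name}: done"

async def main():
    # Issuing both calls at once means total wall-clock time is roughly
    # the slower call (~1.2s) rather than the sum of both (~2.2s).
    results = await asyncio.gather(
        call_tool("get_weather", 1.0),
        call_tool("get_calendar", 1.2),
    )
    print(results)

asyncio.run(main())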

Real-World Applications and Use Cases

Key Application Areas

Toucan's diverse scenarios cover a wide range of practical applications that demonstrate the potential of well-trained AI agents:

Business Automation:

  • Analyzing sales reports and generating business summaries
  • Scheduling meetings and sending calendar invites
  • Coordinating between multiple business tools and platforms
  • Managing complex multi-step workflows

Data Processing:

  • Integrating information from multiple sources
  • Processing and analyzing large datasets
  • Generating reports with real-time data
  • Handling various file formats and data types

Communication and Coordination:

  • Managing multi-channel communications
  • Coordinating tasks across different platforms
  • Facilitating team collaboration
  • Automating routine administrative tasks

Model Context Protocol (MCP) Integration

Understanding MCP Architecture

Standardized Interface: The Model Context Protocol, originally developed and open-sourced by Anthropic, provides a standardized way for AI agents to interact with external tools and services:

Core Components:

  • MCP Servers: Topic-based software libraries that provide access to specific tools and APIs
  • Standardized Communication: Consistent interface for tool discovery and execution
  • Security Framework: Built-in security measures for safe tool execution
  • Extensibility: Easy addition of new tools and services to the ecosystem
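In MCP, a server advertises its tools through a standard listing call, with each tool described by a name, a human-readable description, and a JSON Schema for its arguments. The descriptor below shows that shape; the get_forecast tool itself is a made-up example, not one of Toucan's 500 servers:

forecast_tool = {
    "name": "get_forecast",
    "description": "Return the weather forecast for a city",
    "inputSchema": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "days": {"type": "integer", "minimum": 1, "maximum": 7},
        },
        "required": ["city"],
    },
}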

Toucan's MCP Focus:

  • Real MCP Environments: Dataset based on actual MCP server implementations
  • Production Readiness: Tools and services that are actively used in real applications
  • Comprehensive Coverage: 500 different MCP servers representing diverse tool categories
  • Future Compatibility: Training that prepares models for the evolving MCP ecosystem

Tool Discovery and Selection

Toucan teaches AI agents how to intelligently select and coordinate multiple tools for complex tasks, with scenarios covering everything from simple single-tool operations to sophisticated multi-tool workflows.
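Models trained on Toucan learn tool selection through the LLM's own reasoning over tool descriptions, but the underlying matching problem can be illustrated with a deliberately naive keyword-overlap scorer. This toy sketch is not how Toucan-trained agents work; it only shows what "selecting the right tool for a task" means mechanically:

def select_tools(task: str, tools: list[dict], top_k: int = 2) -> list[str]:
    # Score each tool by word overlap between the task and its description.
    task_words = set(task.lower().split())
    scored = []
    for tool in tools:
        desc_words = set(tool["description"].lower().split())
        scored.append((len(task_words & desc_words), tool["name"]))
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]

tools = [
    {"name": "get_forecast", "description": "return the weather forecast for a city"},
    {"name": "send_email", "description": "send an email message to a recipient"},
]
print(select_tools("email me tomorrow's weather forecast", tools))
# ['get_forecast', 'send_email']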

Research Impact and Academic Significance

Key Research Contributions

Advancing Tool-Calling Research: As Rameswar Panda, the IBM researcher who led the team behind Toucan, explained: "Tool-calling is central to AI agents. How can you train better agents? Through diverse, high-quality examples sourced from the real world."

Methodological Innovations:

  • Real vs. Synthetic Data: Demonstrating the superiority of real API executions over simulated data
  • Multi-Model Generation: Using multiple LLMs for different aspects of dataset creation
  • Quality Assessment: Implementing systematic quality rating for trajectory selection
  • Scale Impact: Proving that larger, higher-quality datasets lead to better performance

Academic-Industry Collaboration

Partnership Success: The collaboration between IBM Research and the University of Washington exemplifies effective academic-industry partnerships, with graduate student Zhangchen Xu contributing as an IBM intern.

Open Science Approach:

  • Public release of the complete dataset
  • Detailed methodology documentation
  • Community-driven research advancement
  • Educational resource creation

Future Development and Roadmap

Planned Enhancements

Dataset Expansion: The research team plans to onboard new MCP servers with a wider range of tools that have come online since June 2025, when they collected their seed data.

Ongoing Research: As IBM researcher Adriana Meza noted: "We're repurposing part of the Toucan code for these new projects and of course building on all the tool-calling knowledge we acquired in putting the dataset together."

Reinforcement Learning Integration

Future RL Development: The research team plans to create a reinforcement learning gymnasium and benchmark to give LLMs more experience with enterprise workflows. This will build on the tool-calling knowledge acquired in creating the Toucan dataset and help advance the field of AI agent training.

Technical Access and Implementation

Getting Started with Toucan

Dataset Access: The Toucan dataset is freely available through Hugging Face at Agent-Ark/Toucan-1.5M, making it accessible to researchers and developers worldwide.

Basic Usage:

from datasets import load_dataset

# Load the Toucan dataset
dataset = load_dataset("Agent-Ark/Toucan-1.5M")

# Access training trajectories
trajectories = dataset['train']
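
Once loaded, individual trajectories can be inspected like any Hugging Face dataset. The snippet below only peeks at the structure; the actual column names should be read from the dataset card rather than assumed:

# Peek at the first trajectory and the dataset size.
example = trajectories[0]
print(example.keys())         # column names as published on the dataset card
print(trajectories.num_rows)  # on the order of 1.5M rows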

Data Characteristics:

  • Complete tool-calling interaction sequences
  • Quality-rated trajectories with difficulty assessments
  • Rich metadata including tool usage patterns
  • Comprehensive documentation for research use

Performance Optimization

Training Considerations: The Toucan dataset's large scale (1.5 million trajectories) requires careful consideration of computational resources and training strategies. Researchers should plan for appropriate hardware and training time when working with this comprehensive dataset.
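One practical way to keep memory bounded is the datasets library's streaming mode, which iterates over records without downloading the full dataset up front. A minimal sketch:

from datasets import load_dataset

# Stream records instead of materializing all 1.5M trajectories at once.
stream = load_dataset("Agent-Ark/Toucan-1.5M", split="train", streaming=True)

for i, example in enumerate(stream):
    # ...feed `example` into a filtering or fine-tuning pipeline here...
    if i >= 2:  # just peek at the first few records in this sketch
        break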

Industry Impact and Adoption

Transforming AI Development

Paradigm Shift: Toucan represents a significant shift in how AI agents are developed and trained:

From Synthetic to Real:

  • Authentic Data: Moving from synthetic, simulated data to real-world interactions
  • Practical Relevance: Training on scenarios that actually occur in production environments
  • Improved Generalization: Better performance on real-world tasks due to realistic training data
  • Reduced Gap: Narrowing the gap between training and deployment environments

Scale and Quality:

  • Unprecedented Scale: Largest dataset of its kind enables more effective training
  • Quality Focus: Emphasis on high-quality, curated examples improves training outcomes
  • Comprehensive Coverage: Broad coverage of tools and scenarios enables versatile AI agents
  • Continuous Improvement: Framework for ongoing dataset expansion and improvement

Research and Development Impact

Academic and Industry Interest: The release of Toucan has generated significant interest in the AI research community, as evidenced by its immediate popularity on Hugging Face. The dataset provides researchers with unprecedented access to high-quality, real-world tool-calling scenarios for developing and evaluating AI agents.

Key Research Areas:

  • Tool-calling methodology development
  • Multi-tool coordination strategies
  • Error handling and recovery in AI agents
  • Parallel tool execution optimization

Conclusion

The release of IBM's Toucan dataset marks a pivotal moment in the evolution of AI agents, providing the research and development community with an unprecedented resource for training more capable, practical, and reliable tool-calling systems. With its 1.5 million real-world scenarios spanning 2,000 different web services, Toucan addresses the critical data scarcity problem that has long hindered the development of truly effective AI agents.

Key Achievements:

  • Unprecedented Scale: 5x larger than previous open-source datasets
  • Real-World Authenticity: Actual API executions rather than synthetic simulations
  • Proven Performance: Demonstrated improvements on leading benchmarks
  • Open Access: Freely available to the global research community
  • Industry Collaboration: Successful partnership between academia and industry
  • Future Foundation: Establishes framework for continued dataset expansion and improvement

Transformative Impact:

Toucan's impact extends far beyond its immediate technical contributions. By providing high-quality, realistic training data, it enables the development of AI agents that can truly understand and navigate the complex, multi-tool workflows that characterize modern digital work environments. The dataset's emphasis on parallel tool calling and sophisticated error handling prepares AI systems for the practical challenges they will face in real-world deployments.

Looking Forward:

The success of Toucan demonstrates the value of collaborative, open-science approaches to AI development. As the research team continues to expand the dataset with new MCP servers and develops reinforcement learning frameworks, Toucan is positioned to remain at the forefront of AI agent research and development.

The enthusiastic community response, with Toucan becoming a top trending dataset on Hugging Face immediately after release, demonstrates strong demand for high-quality, realistic training data in the AI agent development community.

Toucan represents a significant step forward in AI agent training data quality and realism. By providing access to real-world tool-calling scenarios at unprecedented scale, it enables researchers and developers to build more capable and reliable AI agents that can effectively navigate the complex world of modern digital tools and services.

Interested in learning more about AI agents and tool-calling capabilities? Explore our AI Fundamentals course to understand how AI systems work, check out our glossary of AI terms for key concepts like API and machine learning, or discover the latest AI models and AI tools in our comprehensive catalog.

Frequently Asked Questions

What is Toucan?
Toucan is the largest publicly available collection of tool-calling scenarios, featuring 1.5 million real-life task sequences that collectively invoke 2,000 different web services through Model Context Protocol (MCP) servers.

How is Toucan different from other tool-calling datasets?
Unlike simulated datasets, Toucan captures actual API executions in real environments with complete interaction chains from start to finish. It's over 5 times larger than the next largest open-source dataset (Nvidia's Nemotron, with 310,000 trajectories).

What kinds of tasks does Toucan cover?
Toucan covers diverse real-world scenarios including analyzing sales reports, drafting business summaries, scheduling meetings, sending calendar invites, and complex multi-tool workflows across various domains.

How was Toucan created?
IBM and University of Washington researchers used five open-source LLMs to generate task scenarios and three additional models to construct step-by-step agent trajectories, all based on metadata from 500 MCP servers gathered from GitHub and Smithery.ai.

What performance improvements does Toucan deliver?
Models fine-tuned on Toucan showed up to 7 percentage points of improvement on τ-Bench and τ²-Bench, with Qwen-2.5-32B improving nearly 9 percentage points on BFCL V3 and outperforming GPT-4.5-Preview.

Can Toucan be used in commercial projects?
Toucan is released as open-source research, but you should check the specific licensing terms on Hugging Face. It's designed primarily for research and development of better AI agents.

Continue Your AI Journey

Explore our lessons and glossary to deepen your understanding.