Introduction
Earlier in 2025, Google announced plans to bring computer use capabilities to developers via the Gemini API. Today, the company has fulfilled that promise by releasing the Gemini 2.5 Computer Use model, marking a significant advancement in AI agent capabilities. This specialized model, built on Gemini 2.5 Pro's visual understanding and reasoning capabilities, enables AI agents to interact directly with graphical user interfaces by clicking, typing, and scrolling through web pages and mobile applications just as humans do.
Available now through the Gemini API on Google AI Studio and Vertex AI, this model outperforms competing solutions on multiple benchmarks while delivering lower latency. The release represents a crucial step toward building powerful, general-purpose AI agents capable of completing complex digital tasks that traditionally require human interaction with software interfaces.
What Makes Computer Use Different
Beyond API Interactions
While AI models can already interface with software through structured APIs, many digital tasks still require direct interaction with graphical user interfaces. Activities such as filling out web forms, navigating complex applications, manipulating interactive elements like dropdowns and filters, and operating behind login screens have remained challenging for AI systems.
The ability to natively complete these tasks by controlling user interfaces represents a fundamental shift in how AI agents can assist with digital workflows. Rather than requiring developers to build custom API integrations for every application, agents powered by the Gemini 2.5 Computer Use model can work with any web-based or mobile interface.
Core Capabilities
The model introduces sophisticated UI interaction capabilities through the new computer_use tool in the Gemini API (a minimal setup sketch follows the list below):
- Visual Understanding: Analyzes screenshots of digital environments to understand interface elements
- Action Planning: Determines appropriate UI actions based on user requests and current context
- Interaction Execution: Performs clicks, typing, scrolling, and other interface manipulations
- Iterative Processing: Continuously evaluates results and adjusts actions to complete tasks
- Confirmation Handling: Requests user approval for high-stakes actions like making purchases
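Under the hood, enabling the tool is a single configuration change on a standard Gemini API request. The snippet below is a minimal sketch based on the google-genai Python SDK's published quickstart pattern; the ComputerUse and Environment type names, the excluded-functions field, and the preview model ID are assumptions to verify against the current documentation.

```python
# Minimal sketch: enable the computer_use tool on a Gemini API request.
# Type names and the preview model ID are assumptions; check current docs.
from pathlib import Path

from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

config = types.GenerateContentConfig(
    tools=[
        types.Tool(
            computer_use=types.ComputerUse(
                environment=types.Environment.ENVIRONMENT_BROWSER,
                # Optionally narrow the action space, e.g.:
                # excluded_predefined_functions=["drag_and_drop"],  # assumed field name
            )
        )
    ],
)

# The model works from a screenshot of the current UI state plus the task text.
screenshot = Path("current_page.png").read_bytes()  # any PNG of the live page

response = client.models.generate_content(
    model="gemini-2.5-computer-use-preview",  # assumed preview model ID
    contents=[
        types.Part(text="Open the pricing page and find the cheapest plan."),
        types.Part.from_bytes(data=screenshot, mime_type="image/png"),
    ],
    config=config,
)
print(response.candidates[0].content.parts)  # typically a function call to execute
```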
How the Model Works
Agent Loop Architecture
The Gemini 2.5 Computer Use model operates within an iterative agent loop. The process begins with the user providing a task request along with a screenshot of the current environment and a history of recent actions. Developers can optionally specify which UI functions to exclude or include custom functions for specific use cases.
The model analyzes these inputs and generates a response, typically a function call representing a UI action such as clicking a button or typing text. For certain sensitive actions, the model may request end-user confirmation before proceeding. The client-side code then executes the received action.
After execution, a new screenshot and the current URL are sent back to the model as a function response, restarting the loop. This iterative process continues until the task is complete, an error occurs, or the interaction is terminated by a safety response or user decision.
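The loop is straightforward to sketch in code. The following is a schematic, not Google's reference implementation: take_screenshot and execute_action are hypothetical stand-ins for your browser-automation layer, and the function-call/function-response wiring follows the SDK's standard tool-use pattern.

```python
# Schematic agent loop: screenshot -> model -> action -> new screenshot -> repeat.
# take_screenshot() and execute_action() are hypothetical callables supplied by a
# browser-automation layer such as Playwright (see the executor sketch later on).
from google import genai
from google.genai import types

client = genai.Client()
MODEL = "gemini-2.5-computer-use-preview"  # assumed preview model ID

def run_task(goal, config, take_screenshot, execute_action, max_steps=20):
    history = [types.Content(role="user", parts=[
        types.Part(text=goal),
        types.Part.from_bytes(data=take_screenshot(), mime_type="image/png"),
    ])]
    for _ in range(max_steps):
        response = client.models.generate_content(
            model=MODEL, contents=history, config=config)
        turn = response.candidates[0].content
        history.append(turn)
        calls = [p.function_call for p in turn.parts if p.function_call]
        if not calls:  # no action requested: the model considers the task done
            return turn
        parts = []
        for call in calls:
            url = execute_action(call.name, dict(call.args))  # click, type, scroll...
            parts.append(types.Part.from_function_response(
                name=call.name,
                response={"url": url},  # current URL, per the protocol above
            ))
        # A fresh screenshot accompanies the function response to restart the loop.
        parts.append(types.Part.from_bytes(
            data=take_screenshot(), mime_type="image/png"))
        history.append(types.Content(role="user", parts=parts))
    raise TimeoutError("task did not finish within max_steps")
```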
Input Processing
The model accepts three primary inputs for each iteration:
- User Request: The task or goal to be accomplished
- Environment Screenshot: Visual representation of the current UI state
- Action History: Context from previous steps in the workflow
This combination allows the model to maintain awareness of its progress while adapting to changes in the interface as it navigates through multi-step tasks.
Performance Benchmarks
Leading Quality Metrics
The Gemini 2.5 Computer Use model demonstrates strong performance across multiple evaluation frameworks. On the Online-Mind2Web benchmark, which tests real-world web navigation capabilities, the model achieves over 70% accuracy while maintaining the lowest latency among competing solutions at approximately 225 seconds per task.
The model also excels on WebVoyager, a benchmark testing complex web-based task completion, and AndroidWorld, which evaluates mobile UI control capabilities. These results come from a combination of self-reported metrics, independent evaluations run by Browserbase, and Google's internal testing.
Latency Advantage
A critical advantage of the Gemini 2.5 Computer Use model is its combination of high accuracy with low latency. While some competing models may achieve similar or higher accuracy scores on individual benchmarks, they typically require significantly more time to complete tasks. This latency advantage makes the model more practical for real-world applications where users expect responsive agent behavior.
The model's efficiency stems from Gemini 2.5 Pro's underlying architecture, which was designed to balance powerful visual reasoning capabilities with fast inference speeds.
Platform Optimization
Web Browser Focus
The Gemini 2.5 Computer Use model is primarily optimized for web browser environments, where it demonstrates its strongest performance. Web applications represent a significant portion of modern software usage, making browser control a high-value capability for AI agents.
The model can navigate complex web interfaces, including:
- Multi-page forms requiring sequential data entry
- Dynamic content that loads based on user interactions
- Login flows and authenticated sessions
- Interactive elements like dropdowns, date pickers, and filters
- Content behind modals and overlay interfaces
Mobile UI Control
Beyond web browsers, the model shows strong promise for mobile UI control tasks on platforms like Android. The AndroidWorld benchmark results indicate the model can effectively navigate mobile applications, adapting its interaction patterns to touch-based interfaces and mobile-specific UI conventions.
Desktop Limitations
Currently, the model is not optimized for desktop OS-level control, such as manipulating native operating system interfaces or desktop applications. This limitation reflects the initial development focus on web and mobile environments, which offer more standardized interaction patterns and broader applicability across use cases.
Safety and Responsible Development
Built-in Safety Features
Google has implemented multiple layers of safety controls to address the unique risks associated with AI agents that can control computers. These risks include intentional misuse by users, unexpected model behavior, and vulnerability to prompt injections and scams in web environments.
Safety features are trained directly into the model to recognize and respond appropriately to potentially harmful scenarios. The Gemini 2.5 Computer Use System Card provides detailed documentation of these safety considerations and mitigation strategies.
Developer Safety Controls
Developers building with the model have access to additional safety mechanisms:
Per-Step Safety Service: An inference-time safety service that evaluates each proposed action before it is executed. This out-of-model service acts as a guardrail, preventing potentially dangerous actions from being carried out even if the model suggests them.
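Wiring a per-step check into the agent loop is mechanical. The sketch below shows only the generic pattern; safety_check and ask_user are hypothetical stand-ins, since Google's per-step service is configured through the API rather than called directly like this.

```python
# Generic per-step guardrail: evaluate every proposed action before executing it.
# safety_check() and ask_user() are hypothetical stand-ins for an out-of-model
# safety service and a confirmation UI; Google's service is API-configured.
def guarded_execute(call, execute_action, safety_check, ask_user):
    verdict = safety_check(name=call.name, args=dict(call.args))
    if verdict == "block":
        return {"error": f"action {call.name} blocked by safety policy"}
    if verdict == "require_confirmation" and not ask_user(call):
        return {"error": "user declined the action"}
    return execute_action(call.name, dict(call.args))
```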
System Instructions: Developers can specify through system instructions that agents must either refuse or request user confirmation before taking specific types of high-stakes actions. Examples include actions that could:
- Compromise system integrity or security
- Bypass CAPTCHA verification
- Control medical devices
- Complete financial transactions
- Delete or modify critical data
These controls empower developers to customize safety behaviors based on their specific application requirements and risk tolerances.
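In practice, these constraints can be expressed as a system instruction passed alongside the tool configuration. The wording below is illustrative, not Google's recommended template:

```python
# Illustrative system instruction gating high-stakes actions; example wording only.
from google.genai import types

SAFETY_INSTRUCTION = """
Never attempt to bypass CAPTCHA verification.
Before completing a financial transaction, deleting or modifying data,
or changing security settings, stop and ask the user to confirm.
"""

config = types.GenerateContentConfig(
    system_instruction=SAFETY_INSTRUCTION,
    tools=[types.Tool(computer_use=types.ComputerUse(
        environment=types.Environment.ENVIRONMENT_BROWSER))],
)
```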
Best Practices
Google's documentation includes comprehensive recommendations for developers on safety measures and best practices. While the implemented safeguards reduce risk, Google emphasizes that developers must thoroughly test their systems before launch to ensure safe and reliable behavior in production environments.
Early Use Cases
Google Internal Deployments
Google teams have already deployed the Gemini 2.5 Computer Use model to production environments for several use cases. UI testing has emerged as a particularly valuable application, where the model can automatically navigate interfaces and verify functionality, significantly accelerating software development cycles.
Versions of this model power several Google products:
- Project Mariner: Google's experimental browser-based assistant
- Firebase Testing Agent: Automated testing for mobile applications
- AI Mode in Search: Agentic capabilities within Google Search
These internal deployments have provided valuable feedback for model refinement and demonstrated practical applications of computer use capabilities.
Developer Community Applications
Participants in Google's early access program have tested the model across various domains:
- Personal Assistants: Agents that can complete tasks across multiple web services on behalf of users
- Workflow Automation: Automated processes that navigate business applications to complete routine tasks
- UI Testing: Comprehensive testing of web and mobile applications without manual intervention
Early testers report strong results, with the model successfully completing complex multi-step tasks that previously required significant manual effort or custom integration work.
Real-World Examples
Google has demonstrated the model's capabilities through practical examples that showcase its ability to handle complex, multi-step workflows across different applications.
Pet Care Appointment Booking
In one demonstration, the model was given the task: "From a pet care signup form, get all details for any pet with a California residency and add them as a guest in my spa CRM. Then, set up a follow-up visit appointment with the specialist Anima Lavar for October 10th anytime after 8am. The reason for the visit is the same as their requested treatment."
The agent successfully navigated between two separate websites, extracted relevant information from a signup form, populated fields in a CRM system, and scheduled an appointment with specific parameters. This example demonstrates the model's ability to:
- Navigate across multiple web applications
- Extract and transfer data between systems
- Interpret context-specific requirements
- Complete authenticated actions
Digital Board Organization
Another demonstration tasked the agent with organizing a chaotic digital sticky note board: "My art club brainstormed tasks ahead of our fair. The board is chaotic and I need your help organizing the tasks into some categories I created. Go to the sticky note application and ensure notes are clearly in the right sections. Drag them there if not."
The agent analyzed the board layout, identified appropriate categories, and dragged notes to their correct locations. This showcases the model's capabilities in:
- Understanding spatial relationships in UI elements
- Performing drag-and-drop interactions
- Making categorization decisions based on content
- Working with dynamic, user-generated interfaces
These demonstrations highlight how the Gemini 2.5 Computer Use model can automate workflows that previously required human judgment and manual interaction across complex digital environments.
Getting Started
Access Options
The Gemini 2.5 Computer Use model is available in public preview through the Gemini API:
- Google AI Studio: For developers and researchers exploring AI capabilities
- Vertex AI: For enterprise deployments requiring Google Cloud infrastructure
Developers can immediately begin experimenting with the model through a demo environment hosted by Browserbase, allowing hands-on experience without initial setup requirements.
Development Resources
Google provides comprehensive documentation and reference implementations to help developers build agent loops:
- Local Development: Integration with Playwright for browser automation on local machines (see the executor sketch below)
- Cloud Deployment: Integration with Browserbase for cloud-based VM environments
- API Documentation: Complete reference for the computer_use tool and Gemini API
- Sample Code: Example implementations demonstrating common patterns
The documentation includes both general guidance for Google AI Studio users and enterprise-specific instructions for Vertex AI deployments.
Community Engagement
Google has established a Developer Forum where developers can share feedback, discuss use cases, and help guide the model's future development roadmap. This community engagement approach allows Google to gather diverse perspectives on capabilities, limitations, and desired improvements.
Implications for AI Development
Expanding Agent Capabilities
The release of the Gemini 2.5 Computer Use model represents a significant expansion in what AI agents can accomplish independently. By enabling direct UI interaction, the model removes a major barrier to agent deployment across existing software ecosystems.
This capability reduces the need for custom API integrations, allowing agents to work with legacy systems, third-party applications, and services that lack programmatic interfaces. The potential applications span virtually any domain where software interfaces are used.
Competitive Landscape
Google's entry into computer use capabilities intensifies competition in the AI agent space. Anthropic previously released similar capabilities for Claude, establishing computer use as an important frontier in LLM development.
The focus on achieving lower latency while maintaining high accuracy suggests Google is positioning Gemini 2.5 Computer Use for production applications where responsiveness matters. This practical orientation distinguishes the offering from research-focused demonstrations.
Enterprise Adoption Potential
The availability through Vertex AI, Google's enterprise AI platform, indicates Google's intention to target business customers. Enterprise use cases like automated testing, workflow automation, and business process optimization represent significant market opportunities where computer use capabilities can deliver immediate value.
The emphasis on safety controls and developer customization options aligns with enterprise requirements for controlled, auditable AI deployments.
Challenges and Limitations
Current Constraints
The Gemini 2.5 Computer Use model has several acknowledged limitations:
- Desktop OS Control: Not yet optimized for native desktop application control
- Visual Complexity: May struggle with highly dynamic or visually complex interfaces
- Action Reliability: Occasional misidentification of UI elements or inappropriate actions
- Context Length: Limited by the amount of action history that can be maintained
These limitations represent areas for future improvement as the technology matures. Google encourages developers to thoroughly test their systems and provide feedback through the Developer Forum to help guide future development priorities.
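The context-length constraint in particular has a simple client-side mitigation: cap how much action history is resent each turn. A minimal sketch, assuming the history structure from the agent loop above, where the first element holds the original goal and later turns arrive in (model action, function response) pairs:

```python
# Cap resent history: keep the original goal plus the most recent exchanges.
# MAX_EXCHANGES is an arbitrary example value; tune it to your tasks and limits.
MAX_EXCHANGES = 5

def trim_history(history: list) -> list:
    # history[0] is the original user goal; later turns come in
    # (model action, function response) pairs that must stay together.
    tail = history[1:]
    return [history[0]] + tail[-2 * MAX_EXCHANGES:]
```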
Conclusion
Google's release of the Gemini 2.5 Computer Use model represents a significant milestone in AI agent development. By enabling agents to interact directly with user interfaces through clicking, typing, and scrolling, the model removes a major barrier to deploying AI automation across existing software ecosystems.
The model's strong performance on web and mobile control benchmarks, combined with its low latency, positions it as a practical solution for production applications rather than just a research demonstration. The emphasis on safety controls and developer customization reflects Google's awareness of the risks associated with agents that can control computers.
Early deployments within Google and among beta testers demonstrate the model's potential across use cases including UI testing, workflow automation, and personal assistance. As the technology matures and developers gain experience building with it, new applications will likely emerge.
For developers interested in building the next generation of AI agents, the Gemini 2.5 Computer Use model provides powerful capabilities accessible through familiar Google Cloud infrastructure. The public preview availability allows immediate experimentation and development of innovative agent-powered applications.
The computer use capability represents an important step toward more capable, general-purpose AI agents that can assist with the full range of digital tasks humans perform daily. As this technology evolves, it will play an increasingly important role in how we interact with software and complete complex workflows.
To learn more about AI agents and their capabilities, explore our AI Fundamentals course or discover other AI tools and models shaping the future of artificial intelligence.