Google Ironwood TPU and Axion VMs: AI Inference Power

Google announces that Ironwood, its seventh-generation TPU, will be generally available in the coming weeks, alongside new Axion-based VMs, delivering a 10X peak performance gain over TPU v5p and up to 2X better price-performance than comparable x86 VMs.

by HowAIWorks Team
Google · Ironwood TPU · Axion · AI Infrastructure · TPU · Cloud Computing · AI Training · AI Inference · Vertex AI · Google Cloud

Introduction

Google Cloud announced that Ironwood, its seventh-generation Tensor Processing Unit (TPU), will be generally available in the coming weeks, and expanded its Axion portfolio with new Arm-based virtual machines designed for AI workloads. These announcements mark a significant milestone in Google's custom silicon strategy, targeting what the company calls the "age of inference"—a shift from training massive models to serving them at scale to millions of users.

Ironwood represents Google's most powerful and energy-efficient custom silicon to date, offering a 10X peak performance improvement over TPU v5p and more than 4X better performance per chip compared to TPU v6e (Trillium) for both training and inference workloads. The new Axion-based instances, including N4A (now in preview) and C4A metal (coming soon), deliver up to 2X better price-performance than comparable x86-based VMs, making them ideal for the operational backbone of modern AI applications.

These products are part of Google's AI Hypercomputer system—an integrated supercomputing platform that brings together compute, networking, storage, and software to optimize system-level performance. The announcements come as organizations increasingly need both specialized AI accelerators for model serving and efficient general-purpose compute for supporting workloads like data preparation, ingestion, and application servers.

Ironwood TPU: Purpose-built for scale

Performance and capabilities

Ironwood TPU is designed for the most demanding AI workloads, from large-scale model training and complex reinforcement learning to high-volume, low-latency inference and model serving. Key performance characteristics include:

  • 10X peak performance improvement over TPU v5p
  • More than 4X better performance per chip compared to TPU v6e (Trillium) for training and inference
  • Most powerful and energy-efficient TPU in Google's history
  • 9,216 chips in a superpod with breakthrough Inter-Chip Interconnect (ICI) networking at 9.6 Tb/s
  • 1.77 PB of shared High Bandwidth Memory (HBM) across the superpod (about 192 GB per chip)

The massive connectivity enables thousands of chips to communicate quickly and access shared memory, overcoming data bottlenecks even for the most demanding models. This scale is essential for training frontier models like Google's Gemini, Veo, and Imagen, as well as Anthropic's Claude.

System-level design and reliability

Ironwood TPUs are integrated into Google's AI Hypercomputer architecture, which optimizes performance at the system level rather than just individual components. The system includes:

  • Optical Circuit Switching (OCS) technology: A dynamic, reconfigurable fabric that routes traffic around interconnect failures automatically, ensuring uninterrupted availability even at massive scale
  • Liquid cooling infrastructure: Deployed at gigawatt scale with fleet-wide uptime of approximately 99.999% since 2020
  • Titanium architecture: Offloads networking and storage processing from the host to dedicated hardware, improving overall system efficiency

According to a recent IDC report, AI Hypercomputer customers achieved on average:

  • 353% three-year ROI
  • 28% lower IT costs
  • 55% more efficient IT teams

Early customer adoption

Several organizations are already leveraging Ironwood for their AI workloads:

Anthropic plans to access up to 1 million TPUs to scale their Claude models. James Bradbury, Head of Compute at Anthropic, noted: "Ironwood's improvements in both inference performance and training scalability will help us scale efficiently while maintaining the speed and reliability our customers expect."

Lightricks, known for their creative AI tools, used Google Cloud TPUs to achieve breakthrough training efficiency for LTX-2, their leading open-source multimodal generative model. Yoav HaCohen, PhD, Director of Foundational Generative AI Research at Lightricks, expressed enthusiasm about Ironwood's potential: "We believe that Ironwood will enable us to create more nuanced, precise, and higher-fidelity image and video generation for our millions of global customers."

Essential AI, focused on building powerful, open frontier models, found the platform easy to onboard. Philip Monk, Infrastructure Lead at Essential AI, stated: "The platform was incredibly easy to onboard, allowing our engineers to immediately leverage its power and focus on accelerating AI breakthroughs."

Axion: Redefining general-purpose compute

New Axion-based instances

Google expanded its Axion portfolio with two new Arm-based instances designed for different use cases:

N4A (Preview)

N4A is Google's most cost-effective N series virtual machine to date, now available in preview. It offers:

  • Up to 2X better price-performance than comparable current-generation x86-based VMs
  • Up to 64 vCPUs and 512 GB of DDR5 memory
  • 50 Gbps networking
  • Support for Custom Machine Types, plus Hyperdisk Balanced and Hyperdisk Throughput storage

N4A is ideal for:

  • Microservices and containerized applications
  • Open-source databases
  • Batch and data analytics workloads
  • Development environments and experimentation
  • Data preparation and web serving jobs that support AI applications
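
Provisioning a VM like this is a standard Compute Engine operation. As a rough sketch, the snippet below uses the google-cloud-compute Python client to create an Arm-based instance with an Arm64 image and a Hyperdisk Balanced boot disk; note that the "n4a-standard-16" machine type name is an assumption for illustration, so check the N4A documentation for actual names and regional availability.

```python
# Sketch: creating an Arm-based Compute Engine VM with the
# google-cloud-compute client. The machine type name below is an
# assumption for illustration, not a confirmed N4A type name.
from google.cloud import compute_v1

def create_arm_vm(project: str, zone: str, name: str) -> None:
    boot_disk = compute_v1.AttachedDisk(
        boot=True,
        auto_delete=True,
        initialize_params=compute_v1.AttachedDiskInitializeParams(
            # Arm64 image to match the Axion (Arm) CPU.
            source_image="projects/debian-cloud/global/images/family/debian-12-arm64",
            disk_type=f"zones/{zone}/diskTypes/hyperdisk-balanced",
            disk_size_gb=100,
        ),
    )
    instance = compute_v1.Instance(
        name=name,
        machine_type=f"zones/{zone}/machineTypes/n4a-standard-16",  # assumed name
        disks=[boot_disk],
        network_interfaces=[
            compute_v1.NetworkInterface(network="global/networks/default")
        ],
    )
    operation = compute_v1.InstancesClient().insert(
        project=project, zone=zone, instance_resource=instance
    )
    operation.result()  # block until the create operation finishes

# Example (hypothetical project and zone):
# create_arm_vm("my-project", "us-central1-a", "n4a-demo")
```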

C4A metal (Preview Soon)

C4A metal will be Google's first Arm-based bare-metal instance, coming soon in preview. It provides:

  • Up to 96 vCPUs and 768 GB of DDR5 memory
  • Hyperdisk storage and up to 100 Gbps networking
  • Dedicated physical servers for specialized workloads

C4A metal is designed for:

  • Android development
  • Automotive in-car systems
  • Software with strict licensing requirements
  • Scale test farms
  • Running complex simulations

Axion instances comparison

| Instance Type | Status | Optimized For | vCPUs | Memory | Networking | Storage | Key Features |
|---|---|---|---|---|---|---|---|
| N4A | Preview | Price-performance and flexibility | Up to 64 | 512 GB DDR5 | 50 Gbps | Hyperdisk Balanced, Hyperdisk Throughput | Custom Machine Types; most cost-effective N series |
| C4A metal | Preview soon | Specialized workloads, bare metal | Up to 96 | 768 GB DDR5 | 100 Gbps | Hyperdisk | Dedicated physical servers; first Arm-based bare metal |
| C4A | Available | Consistently high performance | Up to 72 | 576 GB DDR5 | 100 Gbps (Tier 1) | Titanium SSD (up to 6 TB local); Hyperdisk Balanced, Throughput, Extreme | Advanced maintenance controls; high-performance workloads |

Together, the C and N series allow organizations to lower the total cost of running their business without compromising on performance or workload-specific requirements. Axion's inherent efficiency also makes it valuable for modern AI workflows, excelling at the operational backbone: supporting high-volume data preparation, ingestion, and running application servers that host intelligent applications.

Customer impact

Early adopters of Axion instances are seeing significant improvements:

Vimeo tested N4A instances for their massive video transcoding platform. Joe Peled, Sr. Director of Hosting & Delivery Ops at Vimeo, reported: "We've observed a 30% improvement in performance for our core transcoding workload compared to comparable x86 VMs. This points to a clear path for improving our unit economics and scaling our services more profitably."

ZoomInfo, operating a massive data intelligence platform, measured a 60% improvement in price-performance for their core data processing pipelines running on Dataflow and Java services in GKE. Sergei Koren, Chief Infrastructure Architect at ZoomInfo, noted: "This allows us to scale our platform more efficiently and deliver more value to our customers, faster."

Rise, a programmatic advertising company, migrated to Google Cloud's Axion portfolio and achieved a 20% reduction in compute consumption while maintaining low and stable latency with C4A instances. Or Ben Dahan, Cloud & Software Architect at Rise, shared: "We slashed our compute consumption by 20% while maintaining low and stable latency. Additionally, C4A enabled us to leverage Hyperdisk with precisely the IOPS we need for our stateful workloads, regardless of instance size."

Software ecosystem and developer tools

Training enhancements

Google is enhancing its software stack to support Ironwood and improve the developer experience:

  • MaxText enhancements: Improvements to MaxText, a high-performance, open-source LLM framework, make it easier to implement the latest training and reinforcement learning optimization techniques, including:
    • Supervised Fine-Tuning (SFT)
    • Generative Reinforcement Policy Optimization (GRPO)
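
In the group-relative formulation commonly associated with GRPO, a group of completions is sampled per prompt, each completion is scored by a reward function, and each reward is normalized against its group's mean and standard deviation to produce advantages, removing the need for a separate value model. The snippet below is a minimal, framework-agnostic sketch of that advantage computation, not MaxText's actual API:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages for GRPO-style RL.

    rewards: shape (num_prompts, group_size), one reward per sampled
    completion. Each reward is normalized against its own group's
    mean and standard deviation.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Illustrative rewards: 2 prompts, 4 sampled completions each.
rewards = np.array([[0.1, 0.7, 0.4, 0.9],
                    [0.2, 0.2, 0.8, 0.5]])
# Above-group-mean completions get positive advantage (reinforced);
# below-mean completions get negative advantage (discouraged).
print(grpo_advantages(rewards).round(2))
```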

Inference optimizations

For inference workloads, Google recently announced:

  • Enhanced TPU support in vLLM: Developers can switch between GPUs and TPUs, or run both, with only minor configuration changes (a minimal sketch follows this list)
  • GKE Inference Gateway: Intelligently load balances across TPU servers, reducing:
    • Time-to-first-token (TTFT) latency by up to 96%
    • Serving costs by up to 30%
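
To see what the vLLM portability claim looks like in practice, here is a minimal offline-inference script using vLLM's standard Python API. On a host with the TPU build of vLLM installed, the same script can run against TPUs without code changes; the model name is illustrative.

```python
# Minimal vLLM offline inference. Which accelerator serves the model
# (GPU or TPU) depends on the vLLM backend installed on the host; the
# script itself stays the same. The model name is illustrative.
from vllm import LLM, SamplingParams

prompts = [
    "Explain what a TPU superpod is in one sentence.",
    "Why does time-to-first-token matter for interactive AI apps?",
]
sampling = SamplingParams(temperature=0.7, max_tokens=128)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```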

These software improvements underpin AI Hypercomputer's performance and reliability for training, tuning, and serving demanding AI workloads at scale. Deep integration across the stack, from data-center-wide hardware optimizations to open software and managed services, is what makes Ironwood Google's most powerful and energy-efficient TPU to date.

The age of inference

Shifting focus from training to serving

Google's announcements reflect a broader industry shift from training frontier models to serving them at scale. This "age of inference" is characterized by:

  • Constantly shifting model architectures: New models and techniques emerge rapidly
  • Rise of agentic workflows: Applications that require orchestration and tight coordination between general-purpose compute and ML acceleration
  • Near-exponential growth in demand: Organizations need to serve models to millions of users with low latency
  • New opportunities for custom silicon: Vertically co-optimized system architectures become essential

The combination approach

To thrive in this era, organizations need both:

  • Purpose-built AI accelerators (like Ironwood TPUs) for model training and serving
  • Efficient, general-purpose CPUs (like Axion instances) for supporting workloads:
    • High-volume data preparation
    • Data ingestion
    • Application servers hosting intelligent applications
    • Operational backbone of AI systems

This combination approach gives organizations ultimate flexibility and capability for the most demanding workloads, whether using Ironwood and Axion together or mixing them with other compute options available on AI Hypercomputer.

Why it matters

Performance and cost efficiency

The announcements represent significant improvements in both performance and cost efficiency:

  • Ironwood: A 10X peak performance gain over TPU v5p and more than 4X better per-chip performance than Trillium let organizations train and serve larger models more efficiently
  • Axion N4A: Up to 2X better price-performance makes AI infrastructure more accessible
  • System-level optimizations: AI Hypercomputer's integrated approach delivers better ROI and lower costs

Scalability for frontier models

Ironwood's ability to scale to 9,216 chips in a superpod with massive shared memory addresses the needs of frontier model developers:

  • Training at scale: Large models require massive compute resources
  • Inference at scale: Serving models to millions of users demands high throughput and low latency
  • Reliability: Optical Circuit Switching ensures uninterrupted availability even at massive scale

Developer experience

The software ecosystem improvements make it easier for developers to:

  • Switch between hardware: Move between GPUs and TPUs with minimal configuration changes
  • Optimize inference: Use GKE Inference Gateway to reduce latency and costs
  • Implement best practices: Leverage MaxText enhancements for training optimization

Competitive positioning

Google's custom silicon strategy positions the company to compete effectively in the AI infrastructure market:

  • Vertical integration: Hardware and software co-design enables optimizations impossible with off-the-shelf components
  • Energy efficiency: Custom silicon designed for AI workloads is more energy-efficient than general-purpose processors
  • Cost structure: Better price-performance makes AI infrastructure more accessible to organizations of all sizes

Conclusion

Google's announcement that Ironwood TPU will be generally available in the coming weeks, along with new Axion-based VMs, represents a significant advancement in AI infrastructure. Ironwood delivers unprecedented performance for training and inference workloads, while Axion instances provide cost-effective general-purpose compute for supporting AI applications.

The combination of purpose-built AI accelerators and efficient general-purpose CPUs addresses the needs of the "age of inference," where organizations must serve models at scale while maintaining cost efficiency. Early customer results demonstrate real-world impact: Anthropic plans to access up to 1 million TPUs, Vimeo saw 30% performance improvements, and ZoomInfo achieved 60% better price-performance.

As AI continues to drive unprecedented computational demands, custom silicon and vertically integrated systems like AI Hypercomputer become essential for meeting future needs. Organizations can sign up to test Ironwood, Axion N4A, or C4A metal today to evaluate these capabilities for their workloads.

Explore more about AI infrastructure, Tensor Processing Units, and cloud computing in our Glossary, and learn about Google's Gemini and other AI models in our Models catalog.
