Google TPUv7 Ironwood: Challenging Nvidia's AI Dominance

With TPUv7 Ironwood, Google is commercializing its AI chips externally. Anthropic's 1 million TPU order signals a potential end to Nvidia's CUDA dominance.

by HowAIWorks Team
ai, google, tpu, tpu-v7, ironwood, nvidia, anthropic, ai-hardware, gpu, cuda, ai-infrastructure, semiconductors

Introduction

Google has made a decisive move to commercialize its Tensor Processing Units (TPUs) externally, marking a fundamental shift in strategy that could reshape the AI hardware landscape. The announcement of TPUv7 Ironwood and the revelation of Anthropic's massive 1 million TPU order signal that Google is positioning itself as a direct competitor to Nvidia's dominance in AI acceleration hardware.

The two best models in the world, Anthropic's Claude Opus 4.5 and Google's Gemini 3, have the majority of their training and inference infrastructure running on Google's TPUs and Amazon's Trainium. Now Google is selling TPUs physically to multiple firms, raising the question: is this the end of Nvidia's dominance?

This development represents a critical inflection point in the AI infrastructure market. The cost structure of AI-driven software deviates considerably from traditional software, with chip microarchitecture and system architecture playing vital roles in development and scalability. Firms that have an advantage in infrastructure will also have an advantage in the ability to deploy and scale AI applications.

Google's Strategic Shift: From Internal to External TPU Sales

The Anthropic Deal: A $52 Billion Milestone

The Anthropic deal marks a major milestone in Google's push to externalize TPUs. GCP CEO Thomas Kurian reportedly played a central role in the negotiations. Google committed early by investing aggressively in Anthropic's funding rounds, even accepting no voting rights and a 15% cap on its ownership stake in order to expand TPU use beyond Google's internal workloads.

Deal Structure:

  • 400,000 TPUv7 Ironwoods: Worth ~$10 billion in finished racks, sold directly to Anthropic by Broadcom
  • 600,000 TPUv7 units: Rented through GCP in a deal estimated at $42 billion of RPO (remaining performance obligations)
  • Total Value: Approximately $52 billion
  • Infrastructure Partners: Fluidstack handles setup, TeraWulf and Cipher Mining supply datacenter infrastructure

Beyond renting capacity in Google datacenters through GCP, Anthropic will deploy TPUs in its own facilities, positioning Google to compete directly with Nvidia as a true merchant hardware vendor.

Expanding Customer Base

Google's externalization strategy extends beyond Anthropic:

  • Meta: Renewed interest in buying TPUs, with Google developing native PyTorch support specifically for Meta
  • xAI: Confirmed as a major external TPU customer
  • SSI (Safe Superintelligence): Exploring TPU deployments
  • OpenAI: Even without deploying TPUs yet, OpenAI has already saved ~30% on its entire Nvidia fleet thanks to the competitive threat

The competitive pressure from TPUs has already forced Nvidia to offer better pricing, demonstrating how the mere threat of TPU adoption creates leverage for customers.

TPUv7 Ironwood: Technical Specifications

Microarchitecture Improvements

TPUv7 Ironwood represents a significant leap forward in Google's TPU silicon design:

Key Specifications:

  • Peak FLOPs: Up to 4,614 teraflops per chip
  • Memory: 8-Hi HBM3E (same capacity as GB200)
  • Memory Bandwidth: Slight shortfall compared to GB200, but competitive
  • World Size: Up to 9,216 TPUs in a single cluster
  • Manufacturing: Built on advanced process node with Broadcom as co-designer
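
To put the per-chip number in context, here is a back-of-the-envelope calculation (an illustration based on the figures above, not an official benchmark) of the theoretical peak compute of a fully populated ICI cluster:

```python
# Rough cluster-scale arithmetic from the spec figures above.
# Real workloads will achieve only a fraction of this theoretical peak.
peak_tflops_per_chip = 4_614      # TPUv7 peak teraflops per chip
max_world_size = 9_216            # TPUs in a single ICI cluster

cluster_peak_eflops = peak_tflops_per_chip * max_world_size / 1_000_000
print(f"Theoretical cluster peak: ~{cluster_peak_eflops:.1f} EFLOPS")
# -> Theoretical cluster peak: ~42.5 EFLOPS
```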

Performance Evolution:

  • TPUv4/v5: Significantly lower compute throughput than Nvidia flagships
  • TPUv6 Trillium: Came very close to H100/H200 on FLOPs, but arrived roughly two years later
  • TPUv7 Ironwood: Nearly matches GB200 on FLOPs, available only a few quarters later

The shift in Google's design philosophy became clear with TPUv6 and TPUv7, which were designed post-LLM era and reflect the increased emphasis on training large language models.

Total Cost of Ownership Advantage

While theoretical performance is important, what matters is real-world performance per Total Cost of Ownership (TCO):

TCO Comparison:

  • Google's internal TCO: TPUv7 is ~44% lower than GB200 server TCO
  • External customer TCO: Up to ~30% lower than GB200, ~41% lower than GB300
  • Effective FLOPs: TPUs can achieve higher Model FLOP Utilization (MFU) than Blackwell, potentially reaching 40% MFU vs. 30% for GB300

The key insight is that TPUs can reach higher realized MFU than Blackwell, which translates into higher effective FLOPs for Ironwood. This is because marketed GPU FLOPs from Nvidia are significantly inflated—Hopper only reached ~80% of peak in optimized tests, Blackwell in the 70s, and AMD's MI300 series in the 50s-60s.
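
To see how much realized MFU matters, consider a quick sketch. If peak FLOPs are roughly comparable, as the text argues for TPUv7 versus Blackwell, the effective-throughput gap reduces to the ratio of realized utilization (the MFU values below are the illustrative figures quoted above, not measurements):

```python
# Effective throughput = peak FLOPs x realized Model FLOP Utilization (MFU).
# Illustrative MFU figures from the text; peak FLOPs are treated as equal.
tpu_mfu, gb300_mfu = 0.40, 0.30

relative_effective_flops = tpu_mfu / gb300_mfu
print(f"At equal peak FLOPs, TPUv7 delivers ~{relative_effective_flops:.2f}x "
      f"the effective FLOPs of GB300")   # ~1.33x
```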

Why TPU FLOPs Are More Realistic:

  • Google places high emphasis on RAS (Reliability, Availability, Serviceability)
  • TPUs have been internal-facing with less pressure to inflate specifications
  • TPU clock frequencies are more sustainable, avoiding aggressive DVFS (Dynamic Voltage and Frequency Scaling)

ICI Network Architecture: The Secret Sauce

3D Torus Topology

One of the most distinctive features of the TPU is its extremely large scale-up world size through the ICI (Inter-Chip Interconnect) protocol:

Network Architecture:

  • Building Block: 4x4x4 3D torus consisting of 64 TPUs (one physical rack)
  • Maximum World Size: 9,216 TPUs in a single ICI cluster
  • Topology: 3D torus with each TPU connecting to 6 neighbors (2 per axis)
  • Interconnect: Mix of copper DAC cables and optical transceivers

Connection Strategy:

  • Interior TPUs: Connect via copper within the 4x4x4 cube
  • Face/Edge/Corner TPUs: Use optical transceivers for inter-cube connections
  • Optical Circuit Switches (OCSs): Enable reconfigurable network topologies
  • Attach Ratio: 1.5 optical transceivers per TPUv7
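
The torus itself is easy to reason about in code. The sketch below (an illustration of the topology, not Google's actual routing logic) enumerates the six neighbors of a chip at coordinates (x, y, z) inside one 4x4x4 building block, with wraparound links on every axis:

```python
# Illustrative 3D-torus neighbor calculation for the 4x4x4 building block
# described above. This models the topology only, not real ICI routing.
from itertools import product

DIMS = (4, 4, 4)  # one physical rack: 4 x 4 x 4 = 64 TPUs

def torus_neighbors(x, y, z, dims=DIMS):
    """Return the 6 neighbors (2 per axis) of a chip in a wraparound 3D torus."""
    coord = (x, y, z)
    neighbors = []
    for axis in range(3):
        for step in (-1, +1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]  # wraparound link
            neighbors.append(tuple(n))
    return neighbors

# Every chip, including corners and edges, has exactly 6 torus neighbors.
assert all(len(torus_neighbors(*c)) == 6 for c in product(range(4), repeat=3))
print(torus_neighbors(0, 0, 0))
# [(3, 0, 0), (1, 0, 0), (0, 3, 0), (0, 1, 0), (0, 0, 3), (0, 0, 1)]
```

In the real system, the intra-cube links are copper while the wraparound and cube-to-cube links go through optical transceivers and OCSs, as described above.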

Advantages of ICI Architecture

World Size: The 9,216 TPU maximum world size is far larger than the 64-72 GPU scale-up domains typical of Nvidia systems, enabling training of extremely large models.

Reconfigurability: OCSs allow the network to support thousands of different topologies, enabling precise matching of data parallelism, tensor parallelism, and pipeline parallelism requirements.

Fungibility: Complete fungibility of cubes means slices can be formed from any set of cubes, improving fault tolerance and resource utilization.

Lower Cost: The mesh network reduces the overall number of switches and ports needed, eliminating costs from switch-to-switch connections.

Low Latency: Direct links between TPUs enable much lower latency for physically close or directly connected TPUs, with better data locality.

Datacenter Network (DCN)

Beyond the ICI layer, Google's Datacenter Network (DCN) connects up to 147,456 TPUs across multiple ICI clusters:

  • DCNI Layer: Uses optical circuit switches similar to ICI
  • Aggregation Blocks: Connect multiple 9,216 TPU ICI pods
  • Incremental Expansion: New aggregation blocks can be added without significant rewiring
  • Bandwidth Upgrades: Link speeds can be refreshed without changing fundamental architecture
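
Read together, the scale figures are internally consistent: the DCN maximum corresponds to a whole number of ICI pods (a quick sanity check, not an official breakdown):

```python
# Sanity-checking the scale figures quoted above.
tpus_per_ici_pod = 9_216
dcn_max_tpus = 147_456

pods_per_dcn_domain = dcn_max_tpus // tpus_per_ici_pod
assert pods_per_dcn_domain * tpus_per_ici_pod == dcn_max_tpus
print(pods_per_dcn_domain)   # 16 ICI pods per DCN domain
```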

Software Strategy: Opening the Ecosystem

PyTorch Native Support

Google has made a monumental shift in its TPU software strategy:

Previous Approach:

  • Only first-class support for JAX/XLA:TPU stack
  • PyTorch treated as second-class citizen through PyTorch/XLA
  • Relied on lazy tensor graph capture
  • No support for PyTorch native distributed APIs

New Strategy:

  • Native TPU PyTorch Backend: Moving to eager execution by default
  • Integration with torch.compile: Full support for PyTorch's compilation stack
  • DTensor & torch.distributed: Native support for PyTorch parallelism APIs
  • Pallas Kernel Integration: Custom TPU kernels as codegen target for Torch Dynamo/Inductor

This shift is primarily driven by Meta's renewed interest in buying TPUs, as Meta does not want to move to JAX. The new PyTorch-to-TPU integration will create a smoother transition for ML scientists used to PyTorch on GPUs.
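
To make the contrast concrete, here is a minimal sketch of what eager-first PyTorch on TPU could look like. The native backend is not yet public, so the "tpu" device string below is an assumption for illustration rather than a documented API; today's path still goes through PyTorch/XLA:

```python
# Hypothetical sketch of the planned native TPU backend. The "tpu" device
# string is assumed, not a real PyTorch device today.
import torch
import torch.nn as nn

device = torch.device("tpu")               # assumed device string
model = nn.Linear(4096, 4096).to(device)   # eager-mode placement

# torch.compile is PyTorch's standard 2.x compilation entry point; under the
# new strategy it would lower through Dynamo/Inductor to Pallas TPU kernels.
compiled = torch.compile(model)

x = torch.randn(8, 4096, device=device)
y = compiled(x)
```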

vLLM and SGLang Support

Google has also invested heavily in open ecosystem inference:

vLLM TPU Support:

  • Beta support for TPU v5p/v6e through unique PyTorch-to-JAX lowering
  • TPU-optimized paged attention kernels
  • Compute-comms overlapped GEMM kernels
  • All-fused MoE (Mixture of Experts) with 3-4x speedup

Current Limitations:

  • Experimental support for single-host disaggregated prefill-decode
  • No multi-host wide expert parallelism (wideEP) disaggregated prefill or multi-token prediction (MTP) support yet
  • Limited model support compared to CUDA backend
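
Despite those gaps, the offline-inference workflow itself is unchanged from standard vLLM usage. A minimal sketch, assuming vLLM's TPU build is installed on a TPU host (the model name is a placeholder, not a statement of what is supported on TPU today):

```python
# Minimal vLLM offline-inference sketch. Backend selection happens at install
# time (vLLM's TPU build on a TPU VM); the model below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain what an optical circuit switch does."], params)
print(outputs[0].outputs[0].text)
```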

Critical Missing Piece: Open Source XLA

However, Google still has a critical gap in its software strategy:

Still Closed Source:

  • XLA:TPU compiler: Not open-sourced
  • TPU runtime: Not open-sourced
  • Networking libraries: Not open-sourced
  • MegaScaler codebase: Multi-pod training code remains proprietary

This has led to frustrated users unable to debug issues with their code. We strongly believe that open-sourcing XLA:TPU and TPU runtime would rapidly accelerate adoption, similar to how PyTorch and Linux being open-sourced increased their adoption.

Impact on the AI Hardware Market

Competitive Dynamics

Nvidia's Response:

  • Issued reassuring PR telling everyone to "keep calm and carry on"
  • Offered better pricing to customers (OpenAI saved ~30% without even deploying TPUs)
  • Defended against "circular economy" criticism regarding AI startup investments

Market Implications:

  • First serious challenger to Nvidia's AI hardware monopoly
  • Competitive pricing pressure already affecting Nvidia's margins
  • Supply chain shifts: Broadcom securing massive TPU orders
  • Neocloud market transformation: New financing templates with hyperscaler backstops

The "More TPU, Less GPU Capex" Dynamic

A key insight from the market dynamics: the more TPUs that Meta, SSI, xAI, OpenAI, and Anthropic buy, the more GPU capex they save. This creates a virtuous cycle for TPU adoption:

  • Anthropic: 1M TPU order reduces dependence on Nvidia GPUs
  • OpenAI: 30% savings on GPU fleet without deploying a single TPU
  • Meta: Renewed interest driven by native PyTorch support
  • xAI & SSI: Exploring TPU deployments to reduce costs

Neocloud Market Reshaping

Google's deal structure has reshaped the Neocloud market:

New Financing Template:

  • Off-balance-sheet "IOU": Google offers a credit backstop for datacenter leases
  • Solves duration mismatch: GPU clusters (4-5 years) vs. datacenter leases (15+ years)
  • Enables growth: Neoclouds can secure capacity without long-term commitments

Key Beneficiaries:

  • Fluidstack: Handles TPU setup and management
  • TeraWulf & Cipher Mining: Supply datacenter infrastructure
  • Crypto miners: Control power capacity through PPAs, pivoting to AI infrastructure

Constraint for Nvidia-Backed Neoclouds:

  • Neoclouds with Nvidia investment (CoreWeave, Nebius, Crusoe, Together, Lambda, Firmus, Nscale) have incentive to not adopt competing technology
  • This creates a gap in the market for TPU hosting, currently filled by crypto miners + Fluidstack

Why Anthropic Is Betting on TPUs

Technical Advantages

Effective FLOPs Utilization:

  • TPUs can achieve 40% MFU with proper optimization
  • This provides ~52% lower TCO per effective PFLOP compared to GB300 NVL72
  • Even at ~19% MFU, TPU cost per effective FLOP matches the GB300 baseline (see the worked example below)
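
A simplified version of that arithmetic, using the rough relative-TCO and MFU figures quoted in this article (illustrative values, not Anthropic's internal numbers):

```python
# Cost per effective FLOP = relative TCO / realized MFU, with peak FLOPs
# treated as roughly equal. All inputs are the approximate figures from the
# text, not measured data.
def cost_per_effective_flop(relative_tco, mfu):
    return relative_tco / mfu

gb300 = cost_per_effective_flop(relative_tco=1.00, mfu=0.30)
tpu_v7 = cost_per_effective_flop(relative_tco=0.59, mfu=0.40)   # ~41% lower TCO

saving = 1 - tpu_v7 / gb300
print(f"TPUv7 is ~{saving:.0%} cheaper per effective FLOP")   # ~56% under these inputs

# Break-even MFU: the utilization at which TPUv7 merely matches GB300's
# cost per effective FLOP.
breakeven_mfu = 0.59 * 0.30
print(f"Break-even MFU: ~{breakeven_mfu:.0%}")                # ~18%, near the ~19% quoted
```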

Memory Bandwidth:

  • Cost per unit of memory bandwidth is much lower than on GB300
  • At small message sizes (16MB-64MB), TPUs achieve higher memory bandwidth utilization than GPUs
  • Critical for inference workloads, especially bandwidth-intensive decode steps
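
Why bandwidth dominates decode: each generated token requires streaming roughly the full set of weights (plus KV cache) out of HBM, so tokens per second is bounded by memory bandwidth times the utilization actually achieved. A rough roofline-style sketch with placeholder numbers:

```python
# Rough estimate of decode throughput for a bandwidth-bound step. All values
# are illustrative placeholders, not TPUv7 or GB300 measurements.
def decode_tokens_per_sec(model_bytes, hbm_bw_bytes_per_sec, bw_utilization):
    # One decode step reads approximately all weights once from HBM.
    return hbm_bw_bytes_per_sec * bw_utilization / model_bytes

model_bytes = 70e9     # e.g. a 70B-parameter model at 1 byte per parameter
hbm_bw = 7.4e12        # ~7.4 TB/s of HBM bandwidth per chip (approximate)

print(decode_tokens_per_sec(model_bytes, hbm_bw, bw_utilization=0.9))   # ~95 tok/s
print(decode_tokens_per_sec(model_bytes, hbm_bw, bw_utilization=0.6))   # ~63 tok/s
```

The higher the bandwidth utilization at realistic message sizes, the more of that theoretical ceiling a chip actually delivers during decode.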

System-Level Engineering:

  • Google's expertise in system design compensates for any silicon-level gaps
  • Proven track record: Gemini 3 trained entirely on TPUs
  • Anthropic employs ex-Google compiler experts who understand both the TPU stack and Anthropic's model architectures

Economic Advantages

Price Cuts Enabled by TPU Efficiency:

  • Anthropic's Opus 4.5 release included a ~67% price cut on the API
  • Lower verbosity and higher token efficiency (76% fewer tokens to match Sonnet's best score)
  • Higher token efficiency could effectively raise Anthropic's realized per-token pricing despite the headline price cut

Diversification Strategy:

  • Reduces dependence on Nvidia
  • Provides leverage in negotiations
  • Enables cost optimization across multiple hardware platforms

Future Roadmap: TPUv8AX and TPUv8X

While full details remain behind paywalls, reporting points to next-generation TPUs:

  • TPUv8AX (Sunfish): Next-generation TPU variant
  • TPUv8X (Zebrafish): Alternative next-generation design
  • Comparison with Vera Rubin: Nvidia's next-generation GPU architecture

These future generations will likely continue closing the gap with Nvidia's latest offerings while maintaining TPU's TCO advantages.

Conclusion

Google's TPUv7 Ironwood and the strategic shift to external commercialization represent a fundamental challenge to Nvidia's GPU AI hardware dominance. The combination of competitive performance, superior TCO, and unique system architecture makes TPUs a viable alternative for major AI labs.

Key Takeaways:

  • Anthropic's 1M TPU order validates TPU's technical and economic advantages
  • ICI network architecture enables world sizes far beyond typical GPU clusters
  • Software ecosystem is improving with PyTorch native support and vLLM integration
  • Competitive pressure is already forcing Nvidia to offer better pricing
  • Market dynamics favor TPU adoption as more labs explore alternatives

However, Google still needs to open-source critical software components (XLA:TPU compiler, runtime, MegaScaler) to fully challenge Nvidia's CUDA moat. The ecosystem advantage remains Nvidia's strongest defense, but Google's aggressive externalization strategy and technical improvements are making TPUs increasingly attractive.

The AI hardware market is entering a new phase of competition, with Google positioning TPUs as a serious alternative to Nvidia's GPUs. Whether this marks the beginning of the end for Nvidia's dominance or simply creates a more competitive market remains to be seen, but the shift is undeniable.

For AI practitioners and organizations, this competition is beneficial—driving innovation, lowering costs, and providing more choices for AI infrastructure. The era of a single dominant AI hardware vendor may be coming to an end.

To learn more about AI hardware and infrastructure, explore our AI Architecture and GPU Computing guides, or check out related articles on AI infrastructure and hardware acceleration.


Frequently Asked Questions

What is TPUv7 Ironwood?

TPUv7 Ironwood is Google's latest tensor processing unit, designed for AI training and inference. It features up to 4,614 teraflops per chip and can scale to clusters of 9,216 TPUs, competing directly with Nvidia's Blackwell GPUs.

Why is Anthropic's TPU order significant?

Anthropic signed a deal for 1 million TPUv7 chips worth approximately $52 billion, with 400k chips purchased directly and 600k rented through GCP. This provides better TCO (total cost of ownership) than Nvidia GPUs, with up to ~52% lower cost per effective FLOP.

How does TPUv7 compare to Nvidia's GB200?

TPUv7 Ironwood nearly matches GB200 on peak FLOPs and memory bandwidth, with similar memory capacity. However, TPUv7 offers 30-44% lower TCO thanks to Google's system-level engineering and more realistic performance specifications.

What is the ICI network architecture?

Google's Inter-Chip Interconnect (ICI) uses a 3D torus topology with optical circuit switches, enabling world sizes of up to 9,216 TPUs. This is far larger than the 64-72 GPU scale-up domains typical of Nvidia systems and provides reconfigurable network topologies for different parallelism strategies.

How mature is the TPU software ecosystem?

Google is making major strides with native PyTorch support and vLLM integration, but critical components such as the XLA:TPU compiler, the TPU runtime, and the MegaScaler multi-pod code remain closed source, limiting broader adoption.

Does this end Nvidia's dominance?

Google's external TPU commercialization, combined with interest from Meta, xAI, SSI, and potentially OpenAI, represents the first serious challenge to Nvidia's AI hardware dominance. The competitive pressure has already forced Nvidia to offer better pricing to customers.
