Introduction
Google has made a decisive move to commercialize its Tensor Processing Units (TPUs) externally, marking a fundamental shift in strategy that could reshape the AI hardware landscape. The announcement of TPUv7 Ironwood and the revelation of Anthropic's massive 1 million TPU order signal that Google is positioning itself as a direct competitor to Nvidia's dominance in AI acceleration hardware.
The two best models in the world, Anthropic's Claude 4.5 Opus and Google's Gemini 3, run the majority of their training and inference on Google's TPUs and Amazon's Trainium. Now Google is selling TPUs as physical hardware to multiple firms, raising the question: is this the end of Nvidia's dominance?
This development represents a critical inflection point in the AI infrastructure market. The cost structure of AI-driven software deviates considerably from traditional software, with chip microarchitecture and system architecture playing vital roles in development and scalability. Firms that have an advantage in infrastructure will also have an advantage in the ability to deploy and scale AI applications.
Google's Strategic Shift: From Internal to External TPU Sales
The Anthropic Deal: A $52 Billion Milestone
The Anthropic deal marks a major milestone in Google's push to externalize TPUs. We understand that GCP CEO Thomas Kurian played a central role in the negotiations. Google committed early, investing aggressively in Anthropic's funding rounds and even accepting no voting rights and a 15% cap on its ownership stake in order to expand TPU use beyond Google's internal workloads.
Deal Structure:
- 400,000 TPUv7 Ironwoods: Worth ~$10 billion in finished racks, sold directly to Anthropic by Broadcom
- 600,000 TPUv7 units: Rented through GCP in a deal estimated at $42 billion of RPO (remaining performance obligations)
- Total Value: Approximately $52 billion
- Infrastructure Partners: Fluidstack handles setup, TeraWulf and Cipher Mining supply datacenter infrastructure
Beyond renting capacity in Google datacenters through GCP, Anthropic will deploy TPUs in its own facilities, positioning Google to compete directly with Nvidia as a true merchant hardware vendor.
Expanding Customer Base
Google's externalization strategy extends beyond Anthropic:
- Meta: Renewed interest in buying TPUs, with Google developing native PyTorch support specifically for Meta
- xAI: Confirmed as a major external TPU customer
- SSI (Safe Superintelligence): Exploring TPU deployments
- OpenAI: Has not deployed TPUs yet, but has already saved ~30% on its entire Nvidia fleet thanks to the competitive threat
The competitive pressure from TPUs has already forced Nvidia to offer better pricing, demonstrating how the mere threat of TPU adoption creates leverage for customers.
TPUv7 Ironwood: Technical Specifications
Microarchitecture Improvements
TPUv7 Ironwood represents a significant leap forward in Google's TPU silicon design:
Key Specifications:
- Peak FLOPs: Up to 4,614 teraflops (FP8) per chip
- Memory: 8-Hi HBM3E (same capacity as GB200)
- Memory Bandwidth: Slight shortfall compared to GB200, but competitive
- World Size: Up to 9,216 TPUs in a single cluster
- Manufacturing: Built on an advanced process node, co-designed with Broadcom
Performance Evolution:
- TPUv4/v5: Significantly lower compute throughput than Nvidia flagships
- TPUv6 Trillium: Came very close to H100/H200 on FLOPs, but 2 years later
- TPUv7 Ironwood: Nearly matches GB200 on FLOPs, available only a few quarters later
The shift in Google's design philosophy became clear with TPUv6 and TPUv7, which were designed post-LLM era and reflect the increased emphasis on training large language models.
Total Cost of Ownership Advantage
While theoretical performance is important, what matters is real-world performance per Total Cost of Ownership (TCO):
TCO Comparison:
- Google's internal TCO: TPUv7 is ~44% lower than GB200 server TCO
- External customer TCO: Up to ~30% lower than GB200, ~41% lower than GB300
- Effective FLOPs: TPUs can achieve higher Model FLOP Utilization (MFU) than Blackwell, potentially reaching 40% MFU vs. 30% for GB300
The key insight is that TPUs can sustain higher realized MFU than Blackwell, which translates into higher effective FLOPs for Ironwood. Marketed Nvidia GPU FLOPs are significantly inflated relative to what is achievable: even in optimized tests, Hopper reached only ~80% of its marketed peak, Blackwell the 70s (percent), and AMD's MI300 series the 50s-60s.
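As a rough illustration of the gap between marketed and effective throughput, the sketch below simply multiplies peak FLOPs by a utilization fraction. The Ironwood peak and the 40%/30% MFU figures come from the comparison above; everything else is arithmetic.

```python
def effective_tflops(peak_tflops: float, utilization: float) -> float:
    """Effective throughput = marketed peak x realized utilization (e.g. MFU)."""
    return peak_tflops * utilization

# TPUv7 Ironwood per-chip peak and the MFU figures cited above.
ironwood_peak = 4614.0
print(effective_tflops(ironwood_peak, 0.40))  # ~1846 effective TFLOPS at 40% MFU
print(effective_tflops(ironwood_peak, 0.30))  # ~1384 effective TFLOPS at 30% MFU
```

The ten-point MFU gap cited for TPU vs. GB300 is worth roughly a third more effective FLOPs per chip, before any difference in peak specifications is considered.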
Why TPU FLOPs Are More Realistic:
- Google places high emphasis on RAS (Reliability, Availability, Serviceability)
- TPUs have been internal-facing with less pressure to inflate specifications
- TPU clock frequencies are more sustainable, avoiding aggressive DVFS (Dynamic Voltage and Frequency Scaling)
ICI Network Architecture: The Secret Sauce
3D Torus Topology
One of the most distinctive features of the TPU is its extremely large scale-up world size through the ICI (Inter-Chip Interconnect) protocol:
Network Architecture:
- Building Block: 4x4x4 3D torus consisting of 64 TPUs (one physical rack)
- Maximum World Size: 9,216 TPUs in a single ICI cluster
- Topology: 3D torus with each TPU connecting to 6 neighbors (2 per axis; see the sketch after these lists)
- Interconnect: Mix of copper DAC cables and optical transceivers
Connection Strategy:
- Interior TPUs: Connect via copper within the 4x4x4 cube
- Face/Edge/Corner TPUs: Use optical transceivers for inter-cube connections
- Optical Circuit Switches (OCSs): Enable reconfigurable network topologies
- Attach Ratio: 1.5 optical transceivers per TPUv7
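A minimal sketch of the torus addressing described above: each TPU sits at integer coordinates (x, y, z) and links to two neighbors per axis, with wraparound at the edges. This is an illustration of the topology, not Google's actual routing code.

```python
from itertools import product

def torus_neighbors(coord, dims):
    """Return the 6 neighbors of `coord` in a 3D torus of size `dims`, with wraparound."""
    x, y, z = coord
    X, Y, Z = dims
    return [
        ((x + 1) % X, y, z), ((x - 1) % X, y, z),
        (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
        (x, y, (z + 1) % Z), (x, y, (z - 1) % Z),
    ]

# One physical rack is a 4x4x4 cube of 64 TPUs.
cube = list(product(range(4), range(4), range(4)))
assert len(cube) == 64
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
# [(1, 0, 0), (3, 0, 0), (0, 1, 0), (0, 3, 0), (0, 0, 1), (0, 0, 3)]
```

The wraparound links are exactly the ones that leave a cube at its faces, edges, and corners; in the physical system those crossings ride on optical transceivers and OCSs rather than on the copper used inside the cube.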
Advantages of ICI Architecture
World Size: The 9,216 TPU maximum world size is far larger than the typical 64-72 GPU scale-up domain, enabling training of extremely large models.
Reconfigurability: OCSs allow the network to support thousands of different topologies, enabling precise matching of data parallelism, tensor parallelism, and pipeline parallelism requirements.
Fungibility: Complete fungibility of cubes means slices can be formed from any set of cubes, improving fault tolerance and resource utilization.
Lower Cost: The direct-connect torus reduces the overall number of switches and ports needed, eliminating the cost of switch-to-switch links.
Low Latency: Direct links between TPUs enable much lower latency for physically close or directly connected TPUs, with better data locality.
Datacenter Network (DCN)
Beyond the ICI layer, Google's Datacenter Network (DCN) connects up to 147,456 TPUs across multiple ICI clusters:
- DCNI Layer: Uses optical circuit switches similar to ICI
- Aggregation Blocks: Connect multiple 9,216 TPU ICI pods
- Incremental Expansion: New aggregation blocks can be added without significant rewiring
- Bandwidth Upgrades: Link speeds can be refreshed without changing fundamental architecture
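The scale hierarchy implied by these numbers is easy to check; the figures below are simply the quotients of the quantities quoted above.

```python
# Scale hierarchy implied by the figures above.
tpus_per_cube = 4 * 4 * 4                  # one physical rack
cubes_per_pod = 9_216 // tpus_per_cube     # cubes in a maximal ICI pod
pods_per_dcn = 147_456 // 9_216            # ICI pods behind one DCN domain
print(tpus_per_cube, cubes_per_pod, pods_per_dcn)  # 64 144 16
```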
Software Strategy: Opening the Ecosystem
PyTorch Native Support
Google has made a monumental shift in its TPU software strategy:
Previous Approach:
- Only first-class support for JAX/XLA:TPU stack
- PyTorch treated as second-class citizen through PyTorch/XLA
- Relied on lazy tensor graph capture
- No support for PyTorch native distributed APIs
New Strategy:
- Native TPU PyTorch Backend: Moving to eager execution by default
- Integration with torch.compile: Full support for PyTorch's compilation stack
- DTensor & torch.distributed: Native support for PyTorch parallelism APIs
- Pallas Kernel Integration: Custom TPU kernels as codegen target for Torch Dynamo/Inductor
This shift is primarily driven by Meta's renewed interest in buying TPUs, as Meta does not want to move to JAX. The new PyTorch <> TPU integration will create a smoother transition for ML scientists used to PyTorch on GPUs.
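To make the contrast concrete, here is a minimal sketch. The first half is the existing PyTorch/XLA lazy-tensor pattern; the second half is the ordinary eager + torch.compile style that GPU users write today and that a native TPU backend would aim to run with little or no change. Device strings and backend names for the new native path have not been published, so none are assumed here.

```python
import torch

# Existing path: PyTorch/XLA lazy tensors (graphs captured, then compiled by XLA).
import torch_xla.core.xla_model as xm

device = xm.xla_device()                      # lazy XLA tensors on the TPU
x = torch.randn(1024, 1024, device=device)
y = torch.nn.functional.gelu(x @ x)
xm.mark_step()                                # flush the captured graph for execution

# Target style: eager by default, with torch.compile (Dynamo/Inductor) as the
# optimization path; on TPU this is where Pallas kernels would be generated.
model = torch.nn.Linear(1024, 1024)
compiled = torch.compile(model)
out = compiled(torch.randn(8, 1024))
```

The practical difference for users is debuggability: eager execution plus torch.distributed/DTensor means the same mental model and tooling as a GPU run, instead of reasoning about lazily captured graphs.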
vLLM and SGLang Support
Google has also invested heavily in open ecosystem inference:
vLLM TPU Support:
- Beta support for TPU v5p/v6e through unique PyTorch-to-JAX lowering
- TPU-optimized paged attention kernels
- Compute-comms overlapped GEMM kernels
- All-fused MoE (Mixture of Experts) with 3-4x speedup
Current Limitations:
- Experimental support for single-host disaggregated prefill-decode
- No multi-host wide expert parallelism (wideEP) disaggregated prefill or multi-token prediction (MTP) support yet
- Limited model support compared to CUDA backend
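For orientation, the snippet below uses vLLM's standard offline Python API; on a TPU host with the TPU build of vLLM installed, the same code is served by the TPU backend with the kernels listed above. The model name is a placeholder, and the TPU install itself is outside the scope of this sketch.

```python
from vllm import LLM, SamplingParams

# Standard vLLM offline inference; backend selection (GPU vs. TPU) is decided
# by the installed build, not by this code.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # placeholder model
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain what an optical circuit switch does."], params)
print(outputs[0].outputs[0].text)
```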
Critical Missing Piece: Open Source XLA
However, Google still has a critical gap in its software strategy:
Still Closed Source:
- XLA:TPU compiler: Not open-sourced
- TPU runtime: Not open-sourced
- Networking libraries: Not open-sourced
- MegaScaler codebase: Multi-pod training code remains proprietary
This has left frustrated users unable to debug issues in their own code. We strongly believe that open-sourcing XLA:TPU and the TPU runtime would rapidly accelerate adoption, much as open source did for PyTorch and Linux.
Impact on the AI Hardware Market
Competitive Dynamics
Nvidia's Response:
- Issued reassuring PR telling everyone to "keep calm and carry on"
- Offered better pricing to customers (OpenAI saved ~30% without even deploying TPUs)
- Defended against "circular economy" criticism regarding AI startup investments
Market Implications:
- First serious challenger to Nvidia's AI hardware monopoly
- Competitive pricing pressure already affecting Nvidia's margins
- Supply chain shifts: Broadcom securing massive TPU orders
- Neocloud market transformation: New financing templates with hyperscaler backstops
The "More TPU, Less GPU Capex" Dynamic
A key insight from the market dynamics: The more TPUs that Meta/SSI/xAI/OAI/Anthropic buy, the more GPU capex they save. This creates a virtuous cycle for TPU adoption:
- Anthropic: 1M TPU order reduces dependence on Nvidia GPUs
- OpenAI: 30% savings on GPU fleet without deploying a single TPU
- Meta: Renewed interest driven by native PyTorch support
- xAI & SSI: Exploring TPU deployments to reduce costs
Neocloud Market Reshaping
Google's deal structure has reshaped the Neocloud market:
New Financing Template:
- Off-balance-sheet "IOU": Google offers credit backstop for datacenter leases
- Solves duration mismatch: GPU clusters (4-5 years) vs. datacenter leases (15+ years)
- Enables growth: Neoclouds can secure capacity without long-term commitments
Key Beneficiaries:
- Fluidstack: Handles TPU setup and management
- TeraWulf & Cipher Mining: Supply datacenter infrastructure
- Crypto miners: Control power capacity through PPAs, pivoting to AI infrastructure
Constraint for Nvidia-Backed Neoclouds:
- Neoclouds with Nvidia investment (CoreWeave, Nebius, Crusoe, Together, Lambda, Firmus, Nscale) have an incentive not to adopt competing technology
- This creates a gap in the market for TPU hosting, currently filled by crypto miners + Fluidstack
Why Anthropic Is Betting on TPUs
Technical Advantages
Effective FLOPs Utilization:
- TPUs can achieve 40% MFU with proper optimization
- This provides ~52% lower TCO per effective PFLOP compared to GB300 NVL72
- Even at 19% MFU, TPU TCO matches GB300 baseline
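The 19% figure follows directly from the other two numbers: cost per effective FLOP scales as 1/MFU, so a quick consistency check looks like this.

```python
# Cost per effective FLOP scales as 1 / MFU, so if the TPU is ~52% cheaper
# per effective PFLOP at 40% MFU, parity with GB300 is reached at:
tpu_mfu = 0.40
tco_advantage = 0.52                     # "~52% lower TCO per effective PFLOP"
breakeven_mfu = tpu_mfu * (1 - tco_advantage)
print(f"{breakeven_mfu:.0%}")            # ~19%, the break-even figure above
```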
Memory Bandwidth:
- Cost per unit of memory bandwidth is much lower than on GB300
- At small message sizes (16MB-64MB), TPUs achieve higher memory bandwidth utilization than GPUs
- Critical for inference workloads, especially bandwidth-intensive decode steps
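To see why this matters for decode, note that each generated token must stream the active weights through HBM, so a rough lower bound on per-token latency is bytes moved divided by achieved bandwidth. The numbers below (a 70B-parameter model at 8 bits per weight, ~7 TB/s of HBM bandwidth) are illustrative assumptions, not measured figures for any specific chip.

```python
# Back-of-envelope, bandwidth-bound decode latency per token on one chip.
# All numbers are illustrative assumptions.
weight_bytes = 70e9 * 1          # 70B parameters at 1 byte each (8-bit)
peak_hbm_bw = 7e12               # ~7 TB/s peak HBM bandwidth (assumed)
bw_utilization = 0.7             # achieved fraction of peak at realistic message sizes

t_per_token = weight_bytes / (peak_hbm_bw * bw_utilization)
print(f"{t_per_token * 1e3:.1f} ms/token lower bound")   # ~14.3 ms
```

Raising the achieved bandwidth fraction at small message sizes lowers this bound directly, which is why the utilization advantage at 16MB-64MB messages matters for decode throughput.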
System-Level Engineering:
- Google's expertise in system design compensates for any silicon-level gaps
- Proven track record: Gemini 3 trained entirely on TPUs
- Anthropic has ex-Google compiler experts who understand both TPU stack and model architecture
Economic Advantages
Price Cuts Enabled by TPU Efficiency:
- Anthropic's Opus 4.5 release included a ~67% price cut on the API
- Lower verbosity and higher token efficiency (76% fewer tokens to match Sonnet's best score)
- Could effectively raise Anthropic's realized token pricing
Diversification Strategy:
- Reduces dependence on Nvidia
- Provides leverage in negotiations
- Enables cost optimization across multiple hardware platforms
Future Roadmap: TPUv8AX and TPUv8X
While full details remain behind a paywall, two next-generation TPU designs are already named:
- TPUv8AX (Sunfish): Next-generation TPU variant
- TPUv8X (Zebrafish): Alternative next-generation design
- Comparison with Vera Rubin: Nvidia's next-generation GPU architecture
These future generations will likely continue closing the gap with Nvidia's latest offerings while maintaining TPU's TCO advantages.
Conclusion
Google's TPUv7 Ironwood and the strategic shift to external commercialization represent a fundamental challenge to Nvidia's GPU AI hardware dominance. The combination of competitive performance, superior TCO, and unique system architecture makes TPUs a viable alternative for major AI labs.
Key Takeaways:
- Anthropic's 1M TPU order validates TPU's technical and economic advantages
- ICI network architecture enables world sizes far beyond typical GPU clusters
- Software ecosystem is improving with PyTorch native support and vLLM integration
- Competitive pressure is already forcing Nvidia to offer better pricing
- Market dynamics favor TPU adoption as more labs explore alternatives
However, Google still needs to open-source critical software components (XLA:TPU compiler, runtime, MegaScaler) to fully challenge Nvidia's CUDA moat. The ecosystem advantage remains Nvidia's strongest defense, but Google's aggressive externalization strategy and technical improvements are making TPUs increasingly attractive.
The AI hardware market is entering a new phase of competition, with Google positioning TPUs as a serious alternative to Nvidia's GPUs. Whether this marks the beginning of the end for Nvidia's dominance or simply creates a more competitive market remains to be seen, but the shift is undeniable.
For AI practitioners and organizations, this competition is beneficial—driving innovation, lowering costs, and providing more choices for AI infrastructure. The era of a single dominant AI hardware vendor may be coming to an end.