Google DeepMind Demos Fault-Tolerant Distributed Training Across Four US Regions
Google DeepMind has published results for Decoupled DiLoCo, a distributed training system that combines Pathways (asynchronous orchestration that lets each chip exchange data at its own pace) with DiLoCo (which minimises cross-datacenter bandwidth by synchronising only infrequently) to eliminate the need for global chip synchronisation. The system trained a 12B Gemma model across four US regions over low-bandwidth networks. It also demonstrated mixed-hardware training, running TPUv6e and TPUv5p generations simultaneously without performance degradation. Self-healing was verified by injecting artificial hardware failures: the system isolated the disruption, continued training, and reintegrated the recovered hardware once it returned.
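The announcement does not include reference code, but the underlying DiLoCo pattern it builds on is easy to sketch. Below is a minimal, illustrative Python sketch of plain (synchronous) DiLoCo: each worker takes many cheap local steps on its own data, and the only cross-datacenter communication is an infrequent averaging of outer "pseudo-gradients", applied with Nesterov momentum as in the original DiLoCo paper. Everything concrete here is an assumption for illustration (the toy quadratic loss, local_loss_grad, the step counts and learning rates), and the Pathways side of Decoupled DiLoCo (per-chip asynchrony, failure isolation, worker reintegration) is not modelled.

```python
import numpy as np

def local_loss_grad(params, batch):
    # Toy quadratic-loss gradient; stands in for a real model's backward pass.
    return params - batch

rng = np.random.default_rng(0)
n_workers, dim = 4, 8              # e.g. one "worker" per region
inner_steps, outer_rounds = 50, 10
inner_lr = 0.1                     # illustrative inner step size
outer_lr, momentum = 0.7, 0.9      # outer Nesterov values from the DiLoCo paper

global_params = rng.normal(size=dim)
outer_velocity = np.zeros(dim)

for _ in range(outer_rounds):
    deltas = []
    for w in range(n_workers):
        # Inner loop: many cheap local steps, no cross-region traffic.
        local = global_params.copy()
        for _ in range(inner_steps):
            batch = rng.normal(loc=w, size=dim)  # each worker sees its own shard
            local -= inner_lr * local_loss_grad(local, batch)
        # Outer pseudo-gradient: how far this worker moved from the global point.
        deltas.append(global_params - local)
    # The ONLY cross-datacenter communication: average the pseudo-gradients.
    avg_delta = np.mean(deltas, axis=0)
    # Outer update with Nesterov momentum.
    outer_velocity = momentum * outer_velocity + avg_delta
    global_params -= outer_lr * (avg_delta + momentum * outer_velocity)

# Toy check: with quadratic losses centred at 0..3, the consensus optimum is 1.5.
print("distance to consensus optimum:", np.abs(global_params - 1.5).mean())
```

On the announcement's description, Decoupled DiLoCo relaxes even this outer lockstep: a failed worker is isolated rather than stalling the averaging step, and rejoins the outer average once it recovers.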
Why It Matters
Eliminating the global synchronisation constraint in distributed training has major implications for training cost, geographic flexibility, and resilience. Self-healing removes one of the largest operational risks in large-scale training runs: a single hardware failure no longer stalls the entire job. For organisations planning frontier model training infrastructure, Decoupled DiLoCo's approach could make geographically distributed, heterogeneous hardware clusters a viable training substrate, reducing dependency on co-located, high-bandwidth GPU clusters.