Google DeepMind Demos Fault-Tolerant Distributed Training Across Four US Regions
Google DeepMind has published results for Decoupled DiLoCo, a distributed training system that combines Pathways (asynchronous orchestration that lets each chip exchange data at its own pace) with DiLoCo (which minimises cross-datacenter bandwidth by synchronising only infrequently) to eliminate the need for global chip synchronisation. The system trained a 12B Gemma model across four US regions over low-bandwidth networks. It also demonstrated mixed-hardware training, running TPUv6e and TPUv5p generations simultaneously without performance degradation. Self-healing was verified by injecting artificial hardware failures: the system isolated the disruption, continued training, and reintegrated the recovered hardware once it returned.
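The announcement does not include reference code, but the underlying DiLoCo pattern it builds on is easy to sketch. Below is a minimal, illustrative Python sketch of plain (synchronous) DiLoCo: each worker takes many cheap local steps on its own data, and the only cross-datacenter communication is an infrequent averaging of outer "pseudo-gradients", applied with Nesterov momentum as in the original DiLoCo paper. Everything concrete here is an assumption for illustration (the toy quadratic loss, local_loss_grad, the step counts and learning rates), and the Pathways side of Decoupled DiLoCo (per-chip asynchrony, failure isolation, worker reintegration) is not modelled.

```python
import numpy as np

def local_loss_grad(params, batch):
    # Toy quadratic-loss gradient; stands in for a real model's backward pass.
    return params - batch

rng = np.random.default_rng(0)
n_workers, dim = 4, 8              # e.g. one "worker" per region
inner_steps, outer_rounds = 50, 10
inner_lr = 0.1                     # illustrative inner step size
outer_lr, momentum = 0.7, 0.9      # outer Nesterov values from the DiLoCo paper

global_params = rng.normal(size=dim)
outer_velocity = np.zeros(dim)

for _ in range(outer_rounds):
    deltas = []
    for w in range(n_workers):
        # Inner loop: many cheap local steps, no cross-region traffic.
        local = global_params.copy()
        for _ in range(inner_steps):
            batch = rng.normal(loc=w, size=dim)  # each worker sees its own shard
            local -= inner_lr * local_loss_grad(local, batch)
        # Outer pseudo-gradient: how far this worker moved from the global point.
        deltas.append(global_params - local)
    # The ONLY cross-datacenter communication: average the pseudo-gradients.
    avg_delta = np.mean(deltas, axis=0)
    # Outer update with Nesterov momentum.
    outer_velocity = momentum * outer_velocity + avg_delta
    global_params -= outer_lr * (avg_delta + momentum * outer_velocity)

# Toy check: with quadratic losses centred at 0..3, the consensus optimum is 1.5.
print("distance to consensus optimum:", np.abs(global_params - 1.5).mean())
```

On the announcement's description, Decoupled DiLoCo relaxes even this outer lockstep: a failed worker is isolated rather than stalling the averaging step, and rejoins the outer average once it recovers.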
Why It Matters
Eliminating the global synchronisation constraint in distributed training has major implications for training cost, geographic flexibility, and resilience. Self-healing removes one of the largest operational risks in large-scale training runs: a single hardware failure no longer stalls the entire job. For organisations planning frontier model training infrastructure, Decoupled DiLoCo's approach could make geographically distributed, heterogeneous hardware clusters a viable training substrate, reducing dependency on co-located, high-bandwidth GPU clusters.