A fault tolerant framework for distributed training with negligible overhead

26/01/2025 Frontiers Journals

In deep learning model training, checkpoint-based error recovery is a simple and effective form of fault tolerance. By periodically saving the model's state during training, a job can resume from the most recent checkpoint after a failure, limiting the amount of computation that is lost. However, frequent checkpointing introduces overhead that reduces training efficiency, while infrequent checkpointing risks losing more training time when an error does occur.
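This tradeoff can be seen in a minimal checkpoint-and-resume loop. The sketch below is a generic PyTorch-style illustration, not the paper's BAFT implementation; train_one_step, CKPT_PATH, and CKPT_INTERVAL are hypothetical placeholders.

# Minimal sketch of periodic checkpointing with resume (generic illustration,
# not BAFT). train_one_step, CKPT_PATH, and CKPT_INTERVAL are placeholders.
import os
import torch

CKPT_PATH = "checkpoint.pt"   # hypothetical checkpoint location
CKPT_INTERVAL = 100           # steps between checkpoints: the tunable knob

def save_checkpoint(model, optimizer, step):
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, CKPT_PATH)

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0                                   # fresh run, start at step 0
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1                       # resume after the saved step

def train(model, optimizer, total_steps, train_one_step):
    start = load_checkpoint(model, optimizer)      # recover after a failure
    for step in range(start, total_steps):
        train_one_step(model, optimizer, step)
        # A small interval limits lost work after a crash but pays the save
        # cost more often; a large interval does the opposite.
        if step % CKPT_INTERVAL == 0:
            save_checkpoint(model, optimizer, step)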
To resolve this dilemma, a research team led by Minyi GUO published new research on 15 January 2025 in Frontiers of Computer Science, co-published by Higher Education Press and Springer Nature.
The team proposed a fault tolerance solution that perceives idle system resources during training. By characterizing how idle resources are distributed under different parallelism modes, they designed a scheduling algorithm that coordinates existing computation with the additional fault tolerance work. Building on this foundation, they reengineered the task scheduler in distributed training to manage training tasks and fault tolerance tasks together, alleviating the conflict between checkpointing frequency and training efficiency in distributed training scenarios.
In this research, they analyzed the distribution of idle time (bubble time) on compute devices during distributed training, as well as the resource usage of checkpoint recording at each step. They then proposed a fault tolerance mechanism that perceives this bubble time and uses it to record model checkpoints, as sketched below. Finally, they integrated this checkpoint recording mechanism with elastic training, achieving automated fault tolerance in distributed training.
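As a conceptual illustration of the bubble-aware idea (a simplified sketch, not BAFT's actual scheduler), the snippet below copies parameters to pinned host memory on a side CUDA stream while the device would otherwise sit idle in a pipeline bubble; schedule, slot.is_bubble, and slot.run are hypothetical placeholders.

# Conceptual sketch of bubble-aware checkpointing (illustration only, not
# BAFT's code). Device-to-host snapshot copies are launched on a side CUDA
# stream during pipeline bubbles, overlapping time the GPU would otherwise waste.
import torch

copy_stream = torch.cuda.Stream()   # side stream dedicated to checkpoint copies
host_buffers = {}                   # pinned host-side staging buffers

def snapshot_in_bubble(model):
    """Asynchronously copy parameters to pinned host memory."""
    with torch.cuda.stream(copy_stream):
        for name, param in model.named_parameters():
            if name not in host_buffers:
                host_buffers[name] = torch.empty(
                    param.shape, dtype=param.dtype, pin_memory=True)
            host_buffers[name].copy_(param.detach(), non_blocking=True)

def train_step(model, schedule):
    for slot in schedule:               # hypothetical pipeline schedule
        if slot.is_bubble:
            snapshot_in_bubble(model)   # fill idle bubble time with copies
        else:
            slot.run()                  # normal forward/backward micro-batch
    copy_stream.synchronize()           # snapshot copies have completed
    # The host-side buffers can then be flushed to persistent storage in the
    # background without stalling the next training step.

The sketch ignores the coordination needed to keep a snapshot consistent with a single training step (parameters must not be updated while they are being copied); scheduling that coordination alongside normal computation is precisely what the paper's framework addresses.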
To validate the framework's performance in real training scenarios, they ran training tasks with different models and configurations on a training cluster. The experimental results indicate that the framework applies effectively to distributed training, and that the checkpointing overhead it introduces is only about 1% relative to training without any fault tolerance mechanism, outperforming comparable fault tolerance frameworks in efficiency.
DOI: 10.1007/s11704-023-3401-5
Runzhe CHEN, Guandong LU, Yakai WANG, Rui ZHANG, Zheng HU, Yanming MIAO, Zhifang CAI, Jingwen LENG, Minyi GUO. BAFT: bubble-aware fault-tolerant framework for distributed DNN training with hybrid parallelism. Front. Comput. Sci., 2025, 19(1): 191102, https://doi.org/10.1007/s11704-023-3401-5
Attached files
  • Figure: The procedures of BAFT
Regions: Asia, China
Keywords: Applied science, Artificial Intelligence, Computing

