In deep learning model training, checkpoint-based error recovery is a simple and effective form of fault tolerance. By periodically saving the model's state during training, a job can resume from the most recent checkpoint after an error, limiting how much computation is lost. However, frequent checkpointing adds overhead that reduces training efficiency, while infrequent checkpointing means more training time is lost when an error does occur.
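As a rough illustration of this trade-off, the sketch below saves a checkpoint every N steps and resumes from the most recent one on restart. The file path, interval, and tiny model are hypothetical placeholders, not details from the paper.

```python
# Minimal sketch of periodic checkpointing and resume in PyTorch (illustrative only).
import os
import torch
import torch.nn as nn

CKPT_PATH = "checkpoint.pt"   # hypothetical path
EVERY_N_STEPS = 100           # the interval that trades overhead against lost work

model = nn.Linear(32, 2)      # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
start_step = 0

# Resume from the most recent checkpoint if one exists.
if os.path.exists(CKPT_PATH):
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

for step in range(start_step, 1000):
    x, y = torch.randn(8, 32), torch.randn(8, 2)
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Saving more often loses less work after a failure but adds more I/O overhead.
    if step % EVERY_N_STEPS == 0:
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, CKPT_PATH)
```

A shorter interval shrinks the window of lost work at the cost of more frequent serialization and I/O, which is exactly the dilemma the paper targets.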
To resolve this dilemma, a research team led by Minyi GUO published their new research on 15 January 2025 in Frontiers of Computer Science, co-published by Higher Education Press and Springer Nature.
The team proposes a fault tolerance solution that perceives idle system resources during training. By characterizing how idle resources are distributed under different parallelism modes, they designed a scheduling algorithm that coordinates the existing computation with the additional fault tolerance work. Building on this foundation, they reengineered the task scheduler for distributed training to manage training tasks and fault tolerance tasks together, alleviating the conflict between checkpointing frequency and training efficiency in distributed training scenarios.
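As a loose illustration of this kind of coordination (not the paper's actual scheduler), the sketch below applies a simple rule: in each training step, dispatch only as many checkpoint sub-tasks as the step's predicted idle time can absorb. The names CkptTask, cost_ms, and schedule_step are hypothetical.

```python
# Illustrative scheduling rule: fit checkpoint sub-tasks into a step's idle budget.
from dataclasses import dataclass
from typing import List

@dataclass
class CkptTask:
    name: str
    cost_ms: float   # estimated time to copy/write this shard

def schedule_step(bubble_ms: float, pending: List[CkptTask]) -> List[CkptTask]:
    """Pick as many pending checkpoint sub-tasks as fit into this step's bubble."""
    chosen, budget = [], bubble_ms
    for task in pending:
        if task.cost_ms <= budget:
            chosen.append(task)
            budget -= task.cost_ms
    return chosen

# Example: a 12 ms bubble absorbs the first two shards below.
pending = [CkptTask("shard0", 5.0), CkptTask("shard1", 6.0), CkptTask("shard2", 8.0)]
print([t.name for t in schedule_step(12.0, pending)])  # ['shard0', 'shard1']
```

Deferred sub-tasks simply wait for later steps, so checkpoint work never competes with training computation for device time.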
In this research, they first analyzed how idle time (bubble time) is distributed across compute devices during distributed training, along with the resource usage of checkpoint recording at each step. They then proposed a fault tolerance mechanism that perceives and exploits bubble time to record model checkpoints. Finally, they integrated this checkpoint recording mechanism with elastic training, enabling automated fault tolerance in distributed training.
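To make the bubble-time idea concrete, here is a minimal sketch assuming a hypothetical on_pipeline_bubble hook that the training loop calls whenever the device goes idle: each bubble copies one shard of the model state to host memory, and a background thread writes a full snapshot once all shards are collected. This is an assumption-laden illustration, not the mechanism implemented in the paper.

```python
# Sketch: use idle (bubble) time to snapshot the model without stalling training.
import threading
import torch
import torch.nn as nn

class BubbleCheckpointer:
    def __init__(self, model, path="bubble_ckpt.pt", shards=4):
        self.model = model
        self.path = path
        keys = list(model.state_dict().keys())
        # Split parameter names into shards so one shard fits in a single bubble.
        self.shards = [keys[i::shards] for i in range(shards)]
        self.next_shard = 0
        self.host_copy = {}

    def on_pipeline_bubble(self):
        """Called when the device is idle; copies one shard of state to the CPU."""
        state = self.model.state_dict()
        for k in self.shards[self.next_shard]:
            self.host_copy[k] = state[k].detach().to("cpu", non_blocking=True)
        self.next_shard += 1
        if self.next_shard == len(self.shards):
            # Full snapshot assembled: persist it from a background thread.
            snapshot = dict(self.host_copy)
            threading.Thread(target=torch.save, args=(snapshot, self.path)).start()
            self.next_shard = 0
            self.host_copy = {}

# Usage (hypothetical): the training loop calls on_pipeline_bubble() in idle windows.
ckpt = BubbleCheckpointer(nn.Linear(32, 2))
for _ in range(4):
    ckpt.on_pipeline_bubble()   # after 4 bubbles a full snapshot is written
```

Splitting the state into shards keeps each copy small enough to fit inside a single bubble, so checkpoint recording overlaps with time the device would otherwise waste.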
To validate the framework in realistic training scenarios, they ran training tasks with different models and configurations on a training cluster. The experimental results indicate that the fault tolerance framework developed in this study applies effectively to distributed training. Moreover, the checkpointing overhead is only about 1% relative to training without any fault tolerance mechanism, surpassing comparable fault tolerance frameworks in efficiency.
DOI: 10.1007/s11704-023-3401-5