-
Publication No.: US20250103959A1
Publication Date: 2025-03-27
Application No.: US18885339
Application Date: 2024-09-13
Inventor: Liang Shen, Dianhai Yu, Weibao Gong, Jinle Zeng, Haifeng Wang
IPC: G06N20/00
Abstract: Provided are a performance optimization method for a model training device, an electronic device, and a storage medium, relating to the fields of deep learning, large model training, and distributed parallel strategies. The method includes: determining a communication timing of a current model training device with respect to a target model block at a target sorting position, so that the current device can perform collective communication synchronously with the other model training devices of a plurality of model training devices with respect to the model blocks at the target sorting position; and performing the collective communication on a backward gradient of the target model block at the communication timing.
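The abstract does not say how the communication timing is determined. As a rough illustration only, the sketch below assumes that all devices agree on one order of sorting positions and that each device issues the collective for a position as soon as its own gradient is ready and every earlier position in the agreed order has already been issued. The function name `communication_schedule` and both parameters are hypothetical and do not come from the patent.

```python
def communication_schedule(ready_order, agreed_order):
    """Return (local_step, position) pairs in the order collectives are issued.

    ready_order:  positions in the order this device finishes their backward
                  gradients (may differ from device to device).
    agreed_order: the order of sorting positions shared by all devices
                  (an assumption of this sketch).
    """
    ready = set()
    next_index = 0  # index into agreed_order of the next collective to issue
    schedule = []
    for step, position in enumerate(ready_order):
        ready.add(position)
        # Issue every collective whose turn has come and whose gradient is ready.
        while next_index < len(agreed_order) and agreed_order[next_index] in ready:
            schedule.append((step, agreed_order[next_index]))
            next_index += 1
    return schedule


# Two devices finish their backward passes in different local orders.
device_a = communication_schedule([2, 0, 1], agreed_order=[2, 1, 0])
device_b = communication_schedule([2, 1, 0], agreed_order=[2, 1, 0])
print(device_a)
print(device_b)
```

Under these assumptions both devices issue their collectives in the same position order, so matching calls pair up across devices even though the local step at which each call happens differs.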
-
Publication No.: US20250139327A1
Publication Date: 2025-05-01
Application No.: US18895722
Application Date: 2024-09-25
Inventor: Liang Shen, Jinle Zeng, Hongxiang Hao, Weibao Gong, Dianhai Yu, Haifeng Wang
IPC: G06F30/20
Abstract: A method for processing a model operator includes: determining an operator set for model networking, wherein the operator set comprises a plurality of operators; determining a storage amount occupied by an output tensor of each operator in the operator set and a computation time period consumed in a forward computation of each operator in the operator set; and determining a first operator participating in recomputation in a model from the operator set, based on the storage amounts and the computation time periods of the plurality of operators.
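The abstract states only that the recomputation operators are chosen from the storage amounts and forward computation times, not which rule is used. A minimal sketch, assuming a greedy rule that prefers operators freeing the most memory per millisecond of recomputation until a memory target is met, could look like this. `OperatorProfile`, `select_recompute_operators`, and `bytes_to_free` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorProfile:
    name: str
    output_bytes: int   # storage occupied by the operator's output tensor
    forward_ms: float   # time consumed by the operator's forward computation


def select_recompute_operators(profiles, bytes_to_free):
    """Greedily pick operators to recompute (assumed rule, not the patent's).

    Operators are ranked by bytes freed per millisecond of extra forward
    computation, and picked until the requested amount of memory is freed.
    """
    ranked = sorted(
        profiles,
        key=lambda p: p.output_bytes / max(p.forward_ms, 1e-9),
        reverse=True,
    )
    chosen, freed = [], 0
    for profile in ranked:
        if freed >= bytes_to_free:
            break
        chosen.append(profile.name)
        freed += profile.output_bytes
    return chosen


profiles = [
    OperatorProfile("matmul", output_bytes=400, forward_ms=8.0),
    OperatorProfile("gelu", output_bytes=400, forward_ms=0.5),
    OperatorProfile("layernorm", output_bytes=200, forward_ms=0.4),
    OperatorProfile("dropout", output_bytes=100, forward_ms=0.1),
]
print(select_recompute_operators(profiles, bytes_to_free=600))
```

With these made-up numbers the expensive `matmul` is kept in memory, while the cheap elementwise operators are recomputed during the backward pass.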
-
Publication No.: US20250029010A1
Publication Date: 2025-01-23
Application No.: US18895264
Application Date: 2024-09-24
Inventor: Dianhai Yu, Gexiao Tian, Weibao Gong, Haifeng Wang, Yongsheng Xu, Jiabin Yang
IPC: G06N20/00
Abstract: A cluster-based training method includes: in response to a hardware fault in a training node, selecting a target standby node from a plurality of standby nodes, and obtaining a target training snapshot of a model training task in the training node, in which the target training snapshot includes training state data of the model training task; and initializing the target standby node based on a container image of a model training program in the training node and the training state data, so as to replace the training node with the target standby node and continue executing the model training task.
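The failover flow in this abstract can be sketched in a few lines. The example below is an in-memory toy, not the patented system: it assumes that the first healthy standby node is selected and that the most recent snapshot is the target snapshot, neither of which the abstract specifies. `Node`, `Snapshot`, and `fail_over` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Snapshot:
    step: int     # training step at which the snapshot was taken
    state: dict   # training state data (weights, optimizer state, ...)


@dataclass
class Node:
    name: str
    healthy: bool = True
    image: Optional[str] = None   # container image of the training program
    state: dict = field(default_factory=dict)
    step: int = 0


def fail_over(failed_node, standby_nodes, snapshots):
    """Replace a faulty training node with a standby node (illustrative only)."""
    # Assumption: pick the first healthy standby node.
    target = next((n for n in standby_nodes if n.healthy), None)
    if target is None:
        raise RuntimeError("no healthy standby node available")
    if not snapshots:
        raise RuntimeError("no training snapshot available")
    # Assumption: the target snapshot is the most recent one.
    snapshot = max(snapshots, key=lambda s: s.step)
    # Initialize the standby node from the failed node's image and the snapshot.
    target.image = failed_node.image
    target.state = dict(snapshot.state)
    target.step = snapshot.step
    return target


failed = Node("train-0", healthy=False, image="trainer:v1")
standbys = [Node("standby-0", healthy=False), Node("standby-1")]
history = [Snapshot(100, {"w": 1}), Snapshot(200, {"w": 2})]
replacement = fail_over(failed, standbys, history)
print(replacement.name, replacement.step)
```

A real cluster would pull the image from a registry and load the snapshot from shared storage; the sketch only shows the order of the steps.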
-