CHECKPOINT STATE STORAGE FOR MACHINE-LEARNING MODEL TRAINING

    Publication number: US20230229905A1

    Publication date: 2023-07-20

    Application number: US17578326

    Filing date: 2022-01-18

    Inventor: Yuan YU

    CPC classification number: G06N3/08

    Abstract: A method for training a machine-learning model. A plurality of nodes are assigned for training the machine-learning model. Nodes include agents comprising at least an agent processing unit and local memory. Each agent manages, via a local network, one or more workers that include a worker processing unit. Shards of a training data set are distributed for parallel processing by workers at different nodes. Each worker processing unit is configured to iteratively train on minibatches of a shard, and to report checkpoint states indicating updated parameters for storage in local memory. Based at least on recognizing a worker processing unit failing, the failed worker processing unit is reassigned and initialized based at least on a checkpoint state stored in local memory.
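
The checkpoint-and-recover flow described in the abstract can be illustrated with a minimal single-process sketch. It is only an illustration under stated assumptions, not the claimed implementation: the names `Agent`, `CheckpointState`, `sgd_step`, and `run_worker` are hypothetical, the "local memory" is a plain dictionary, the model is a one-parameter linear fit, and a worker failure is simulated by returning early at a chosen step.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Minibatch = List[Tuple[float, float]]  # (x, y) pairs; toy data format (assumption)


@dataclass
class CheckpointState:
    step: int            # number of minibatches of the shard already processed
    params: List[float]  # updated model parameters after that step


@dataclass
class Agent:
    """Hypothetical agent that keeps each worker's latest checkpoint in local memory."""
    local_memory: Dict[int, CheckpointState] = field(default_factory=dict)

    def report(self, worker_id: int, state: CheckpointState) -> None:
        # Store a copy so later in-place changes by the worker cannot corrupt it.
        self.local_memory[worker_id] = CheckpointState(state.step, list(state.params))

    def latest(self, worker_id: int) -> Optional[CheckpointState]:
        return self.local_memory.get(worker_id)


def sgd_step(params: List[float], minibatch: Minibatch, lr: float = 0.1) -> List[float]:
    """One gradient step for the toy model y = w * x with squared-error loss."""
    w = params[0]
    grad = sum(2 * (w * x - y) * x for x, y in minibatch) / len(minibatch)
    return [w - lr * grad]


def run_worker(worker_id: int, agent: Agent, shard: List[Minibatch],
               init_params: List[float], fail_at: Optional[int] = None) -> bool:
    """Train on the shard's minibatches, reporting a checkpoint after each one.

    If the agent already holds a checkpoint for this worker, training resumes
    from it. Returns True when the shard is finished, or False if the simulated
    failure at step `fail_at` is hit.
    """
    ckpt = agent.latest(worker_id)
    if ckpt is not None:
        step, params = ckpt.step, list(ckpt.params)
    else:
        step, params = 0, list(init_params)

    while step < len(shard):
        if fail_at is not None and step == fail_at:
            return False  # simulated worker failure before processing this minibatch
        params = sgd_step(params, shard[step])
        step += 1
        agent.report(worker_id, CheckpointState(step, params))
    return True
```

In this sketch, calling `run_worker` once with `fail_at` set and then again without it, against the same `Agent`, stands in for reassigning a failed worker and initializing it from the checkpoint state held in local memory; only the minibatches after the last reported checkpoint are reprocessed.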
