TRAINING NEURAL NETWORKS BASED ON DUAL PIPELINE ARCHITECTURES

    Publication Number: US20220138524A1

    Publication Date: 2022-05-05

    Application Number: US17151007

    Application Date: 2021-01-15

    Abstract: Embodiments of the present disclosure include systems and methods for training neural networks based on dual pipeline architectures. In some embodiments, a first set of compute elements is configured to implement a first set of layers of a first instance of a neural network. A second set of compute elements is configured to implement a second set of layers of the first instance of the neural network. The second set of compute elements is further configured to implement a first set of layers of a second instance of the neural network. The first set of compute elements is further configured to implement a second set of layers of the second instance of the neural network. The first set of layers of the first instance of the neural network and the first set of layers of the second instance of the neural network are each configured to receive training data.
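
    To make the mirrored layer placement concrete, here is a minimal sketch in plain Python, assuming a toy eight-layer model split in half: one device group hosts the front half of instance A and the back half of instance B, the other group hosts the complementary halves, and both instances ingest training data at their own first stage. The layer count, toy layer functions, and group names are illustrative assumptions, not the patented implementation.

```python
# Minimal sketch of the dual-pipeline layer assignment described in the
# abstract. The layer computation and group names are illustrative only.

def make_layers(n):
    # Toy layers: each layer just adds its index to the activation.
    return [lambda x, i=i: x + i for i in range(n)]

NUM_LAYERS = 8
SPLIT = NUM_LAYERS // 2

instance_a = make_layers(NUM_LAYERS)
instance_b = make_layers(NUM_LAYERS)

# Mirrored placement: group 1 hosts the first half of instance A and the
# second half of instance B; group 2 hosts the complementary halves.
group1 = {"A_front": instance_a[:SPLIT], "B_back": instance_b[SPLIT:]}
group2 = {"A_back": instance_a[SPLIT:], "B_front": instance_b[:SPLIT]}

def run_stage(layers, x):
    for layer in layers:
        x = layer(x)
    return x

def dual_pipeline_forward(batch_a, batch_b):
    # Both instances receive training data at their own first set of layers,
    # which live on opposite device groups.
    a = run_stage(group1["A_front"], batch_a)   # instance A, stage 1 (group 1)
    b = run_stage(group2["B_front"], batch_b)   # instance B, stage 1 (group 2)
    a = run_stage(group2["A_back"], a)          # instance A, stage 2 (group 2)
    b = run_stage(group1["B_back"], b)          # instance B, stage 2 (group 1)
    return a, b

print(dual_pipeline_forward(0.0, 0.0))
```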

    INTEGRATED HARDWARE ARCHITECTURE AND DISTRIBUTION STRATEGY OPTIMIZATION FOR DEEP LEARNING MODELS

    Publication Number: US20250061533A1

    Publication Date: 2025-02-20

    Application Number: US18452162

    Application Date: 2023-08-18

    Abstract: A training optimization system implements algorithmic solutions to solve the conjoined problem of accelerator architecture search and model partitioning for distributed training. The system makes the multi-dimensional optimization space of architecture search and device placement tractable by reducing the number of accelerator architectures explored through area-based heuristics and employing a novel integer linear program (ILP), the size of which is dependent only on the number of operators. The ILP scheduling optimization also explores the partitioning of operators across cores, known as intra-operator parallelism. Despite the vast space, the ILP described herein requires significantly less time to perform the optimizations across all explored accelerator configurations. Based on the optimal backward and forward pass latencies, the system leverages a novel dynamic programming (DP) approach to determine the device placement and model partitioning scheme.
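
    The dynamic-programming step can be illustrated with a small sketch that partitions a linear chain of operators into contiguous stages, one per accelerator, so that the slowest stage is as fast as possible. The per-operator latencies, the device count, and the bottleneck objective below are hypothetical stand-ins for the backward and forward pass estimates the ILP would supply.

```python
# A minimal sketch, under assumptions, of a dynamic-programming pass that
# splits a linear chain of operators into contiguous stages (one per
# accelerator) so the slowest stage is as fast as possible.
import functools

op_latency = [4.0, 2.0, 7.0, 1.0, 3.0, 5.0, 2.0]  # hypothetical per-op cost
NUM_DEVICES = 3

prefix = [0.0]
for t in op_latency:
    prefix.append(prefix[-1] + t)

def stage_cost(i, j):
    # Total latency of operators i..j-1 placed on a single device.
    return prefix[j] - prefix[i]

@functools.lru_cache(maxsize=None)
def best_bottleneck(i, devices_left):
    # Minimal achievable bottleneck latency for operators i.. on the
    # remaining devices.
    n = len(op_latency)
    if devices_left == 1:
        return stage_cost(i, n)
    best = float("inf")
    # Leave at least one operator for each remaining device.
    for j in range(i + 1, n - devices_left + 2):
        best = min(best,
                   max(stage_cost(i, j), best_bottleneck(j, devices_left - 1)))
    return best

print("min bottleneck latency:", best_bottleneck(0, NUM_DEVICES))
```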

    MITIGATING COMMUNICATION BOTTLENECKS DURING PARAMETER EXCHANGE IN DATA-PARALLEL DNN TRAINING

    Publication Number: US20200160171A1

    Publication Date: 2020-05-21

    Application Number: US16276250

    Application Date: 2019-02-14

    Abstract: Technologies are disclosed herein for dynamically generating communication primitives for use in model parameter synchronization during data-parallel DNN training by packing directed spanning trees. An interconnect topology for communication between GPUs in a computing system is determined. A set of directed spanning trees is generated and packed for transmitting data between the GPUs using the interconnect topology. The directed spanning trees define the connections between GPUs that are to be utilized for the transmission and the amount of data to be transmitted on each connection. Program code is generated for implementing the data transfer defined by the directed spanning trees. When the program code is executed, the directed spanning trees are used to pipeline the transmission of chunks of data, such as model parameters used during data-parallel DNN training, between the GPUs. The program code can also determine an optimal chunk size for data to be transferred between the GPUs.
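
    As a rough illustration of the tree-packing idea, the sketch below builds two spanning trees (a fan-out tree and a hop-by-hop chain) over a hypothetical fully connected 4-GPU topology and assigns parameter chunks to them round-robin. The topology, tree construction, and chunk schedule are assumptions for illustration, not the disclosed packing algorithm or its generated program code.

```python
# A minimal sketch, under simplifying assumptions, of splitting broadcast
# traffic across multiple directed spanning trees of a GPU interconnect.
from collections import deque

# Hypothetical fully connected 4-GPU topology (directed adjacency lists).
topology = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}

def bfs_tree(root):
    """Breadth-first spanning tree: the root fans data out directly."""
    visited, edges, queue = {root}, [], deque([root])
    while queue:
        u = queue.popleft()
        for v in topology[u]:
            if v not in visited:
                visited.add(v)
                edges.append((u, v))
                queue.append(v)
    return edges

def dfs_tree(root):
    """Depth-first spanning tree: data is forwarded hop by hop."""
    visited, edges = {root}, []
    def visit(u):
        for v in topology[u]:
            if v not in visited:
                visited.add(v)
                edges.append((u, v))
                visit(v)
    visit(root)
    return edges

# Two trees that route traffic over largely different links.
trees = [bfs_tree(0), dfs_tree(0)]

# Pipeline the parameter buffer as fixed-size chunks, assigning each chunk
# to a tree round-robin so transfers on the two trees can overlap.
chunks = [f"params_chunk_{i}" for i in range(6)]
for i, chunk in enumerate(chunks):
    print(chunk, "->", trees[i % len(trees)])
```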

    AUTOMATIC LATENCY OPTIMIZATION FOR CPU-BASED DNN SERVING

    Publication Number: US20250060998A1

    Publication Date: 2025-02-20

    Application Number: US18452326

    Application Date: 2023-08-18

    Abstract: Systems and methods for optimizing thread allocation in a model serving system include estimating a batch size for inference requests. An optimal configuration is then determined that defines the number of inference instances, the number of threads per inference instance, and the sub-batch size per inference instance for processing a batch of inference requests of the estimated batch size using intra-operator parallelism, such that average per-batch latency is minimized. The optimal configuration is determined with reference to a plurality of predetermined model profiles that define single-inference average batch latencies for different combinations of thread counts and batch sizes; the predetermined model profiles are used as input to a dynamic programming algorithm that identifies configurations that minimize the average per-batch latency.
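
    A simplified version of the configuration search might look like the sketch below, which scans candidate combinations of instance count, threads per instance, and sub-batch size against a hypothetical profile table. The thread budget, profile latencies, and brute-force scan are invented for illustration; the described system instead feeds such profiles into a dynamic programming algorithm.

```python
# A minimal sketch, under assumptions, of choosing (instances, threads per
# instance, sub-batch size) from profiled latencies. The profile table and
# thread budget below are hypothetical.
import math

TOTAL_THREADS = 8
BATCH_SIZE = 16

# Hypothetical profile: (threads, batch) -> single-inference avg latency (ms).
profile = {
    (1, 4): 40.0, (1, 8): 75.0, (1, 16): 150.0,
    (2, 4): 22.0, (2, 8): 42.0, (2, 16): 80.0,
    (4, 4): 13.0, (4, 8): 24.0, (4, 16): 46.0,
    (8, 4):  9.0, (8, 8): 16.0, (8, 16): 30.0,
}

best = None
for instances in range(1, TOTAL_THREADS + 1):
    threads = TOTAL_THREADS // instances           # threads per instance
    sub_batch = math.ceil(BATCH_SIZE / instances)  # sub-batch per instance
    latency = profile.get((threads, sub_batch))
    if latency is None:
        continue  # no profile entry for this configuration
    # Instances run concurrently on disjoint cores, so per-batch latency is
    # approximately the latency of one instance on its sub-batch.
    if best is None or latency < best[0]:
        best = (latency, instances, threads, sub_batch)

print("best config (latency ms, instances, threads, sub-batch):", best)
```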

    SELECTIVE DATA STRUCTURE ENCODING FOR DEEP NEURAL NETWORK TRAINING

    Publication Number: US20220414457A1

    Publication Date: 2022-12-29

    Application Number: US17362751

    Application Date: 2021-06-29

    Abstract: Methods, systems, apparatuses, and computer-readable storage mediums described herein are directed to techniques for efficient data encoding for neural network training. In particular, the embodiments described herein train a DNN based on a selective encoding (e.g., compressing) of data structures that are generated during training. For example, multiple training sessions may be performed where, in each training session, a different set of data structures generated by various operators of the DNN is encoded. Memory allocation information generated based on each training session is analyzed to determine which combination of encoded data structures results in a reduction of the memory required to train the DNN.
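
    The selection loop can be sketched as trying each combination of encoded data structures and keeping the one with the smallest memory footprint. The structure names, sizes, compression ratios, fixed encoder overhead, and exhaustive subset search below are illustrative assumptions standing in for the memory allocation analysis performed across real training sessions.

```python
# A minimal sketch, under assumptions, of the selection loop: try encoding
# different subsets of intermediate data structures, estimate the memory each
# trial would need, and keep the cheapest combination.
from itertools import combinations

# Hypothetical intermediate structures (MB) and per-structure encoded ratios.
structures = {"conv1_act": 512, "conv2_act": 768, "relu_mask": 128, "fc_act": 256}
encode_ratio = {"conv1_act": 0.30, "conv2_act": 0.25, "relu_mask": 0.05, "fc_act": 0.50}
ENCODE_OVERHEAD_MB = 16  # assumed fixed cost of encoding one structure

def session_memory(encoded_set):
    """Training memory (MB) if the given structures are stored encoded."""
    total = 0.0
    for name, size in structures.items():
        if name in encoded_set:
            total += size * encode_ratio[name] + ENCODE_OVERHEAD_MB
        else:
            total += size
    return total

best_combo, best_mem = None, float("inf")
names = list(structures)
for k in range(len(names) + 1):
    for combo in combinations(names, k):
        mem = session_memory(set(combo))
        if mem < best_mem:
            best_combo, best_mem = combo, mem

print(f"best encoding choice: {best_combo} -> {best_mem:.1f} MB")
```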
