Adaptive batch reuse on deep memories

    Publication Number: US12039450B2

    Publication Date: 2024-07-16

    Application Number: US16424115

    Application Date: 2019-05-28

    Inventor: Abhinav Vishnu

    Abstract: A method of adaptive batch reuse includes prefetching, from a CPU to a GPU, a first plurality of mini-batches comprising a subset of a training dataset. The GPU trains the neural network for the current epoch by reusing the first plurality of mini-batches, without discarding them, a number of times given by a reuse count value. The GPU also runs a validation set to obtain a validation error for the current epoch. If the validation error for the current epoch is less than that of the previous epoch, the reuse count value is incremented for the next epoch; if it is greater, the reuse count value is decremented for the next epoch.
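
    A minimal sketch of the adaptive reuse loop the abstract describes, assuming a generic training setup; prefetch_minibatches, train_step, and validate are illustrative placeholders, not names from the patent:

        import random

        def prefetch_minibatches(dataset, n, batch_size=32):
            # Stand-in for the CPU-to-GPU prefetch of n mini-batches.
            return [random.sample(dataset, batch_size) for _ in range(n)]

        def train_step(model, batch):
            pass  # placeholder for one GPU training step

        def validate(model, val_set):
            return random.random()  # placeholder validation error

        def train_with_adaptive_reuse(model, dataset, val_set, epochs, prefetch=4):
            reuse_count, prev_err = 1, float("inf")
            for _ in range(epochs):
                batches = prefetch_minibatches(dataset, prefetch)
                for _ in range(reuse_count):       # reuse without discard
                    for batch in batches:
                        train_step(model, batch)
                err = validate(model, val_set)
                if err < prev_err:
                    reuse_count += 1               # error fell: reuse more next epoch
                elif err > prev_err:
                    reuse_count = max(1, reuse_count - 1)  # error rose: back off
                prev_err = err

        train_with_adaptive_reuse(model=None, dataset=list(range(1000)),
                                  val_set=list(range(100)), epochs=3)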

    Dropout for accelerated deep learning in heterogeneous architectures

    Publication Number: US11620525B2

    Publication Date: 2023-04-04

    Application Number: US16141648

    Application Date: 2018-09-25

    Inventor: Abhinav Vishnu

    Abstract: A heterogeneous processing system includes at least one central processing unit (CPU) core and at least one graphics processing unit (GPU) core. The system computes an activation for each of a plurality of neurons in a first network layer of a neural network. It randomly drops a first subset of those neurons and keeps a second subset. The activations of the kept neurons are forwarded to the CPU core and coalesced to generate a set of coalesced activation sub-matrices.
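
    A toy illustration of the drop-and-coalesce step, assuming NumPy on the host; function and variable names here are invented for the sketch, not taken from the patent:

        import numpy as np

        def dropout_and_coalesce(activations, drop_prob=0.5, rng=None):
            # Randomly drop a first subset of neurons, keep the second subset,
            # and pack the kept activations into a dense sub-matrix.
            rng = rng or np.random.default_rng()
            keep_mask = rng.random(activations.shape[1]) >= drop_prob
            kept = np.flatnonzero(keep_mask)      # indices of kept neurons
            return activations[:, kept], kept     # coalesced sub-matrix

        layer_out = np.random.rand(64, 1024)      # batch x neurons, from the GPU
        sub_matrix, kept = dropout_and_coalesce(layer_out)

    Coalescing lets subsequent layers operate on a smaller dense matrix rather than a masked full-size one.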

    Proactive management of inter-GPU network links

    Publication Number: US11436060B2

    Publication Date: 2022-09-06

    Application Number: US16552065

    Application Date: 2019-08-27

    Abstract: Systems, apparatuses, and methods for proactively managing inter-processor network links are disclosed. A computing system includes at least a control unit and a plurality of processing units. Each processing unit includes a compute module and a configurable link interface. The control unit dynamically adjusts the clock frequency and link width of each processing unit's configurable link interface based on the data transfer size and computation time of each layer of a neural network, so as to reduce each layer's execution time. By tuning the clock frequency and link width on a per-layer basis, communication is closely overlapped with computation, allowing layers to complete more quickly.
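
    One way to read the per-layer adjustment is as picking the cheapest link configuration whose transfer time still hides under the layer's computation time; the configuration table and bandwidth model below are illustrative assumptions, not values from the patent:

        LINK_CONFIGS = [  # (clock_GHz, lanes), cheapest first
            (0.5, 4), (1.0, 4), (1.0, 8), (2.0, 8), (2.0, 16),
        ]

        def transfer_time(num_bytes, clock_ghz, lanes, bytes_per_lane_cycle=1):
            return num_bytes / (clock_ghz * 1e9 * lanes * bytes_per_lane_cycle)

        def pick_link_config(layer_bytes, layer_compute_secs):
            for clock, lanes in LINK_CONFIGS:
                if transfer_time(layer_bytes, clock, lanes) <= layer_compute_secs:
                    return clock, lanes  # cheapest config that still overlaps
            return LINK_CONFIGS[-1]      # otherwise use the fastest link

        # A 64 MiB transfer under a 10 ms layer selects (1.0 GHz, 8 lanes).
        print(pick_link_config(64 * 2**20, 0.010))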

    ALLREDUCE ENHANCED DIRECT MEMORY ACCESS FUNCTIONALITY

    Publication Number: US20210406209A1

    Publication Date: 2021-12-30

    Application Number: US17032195

    Application Date: 2020-09-25

    Abstract: Systems, apparatuses, and methods for performing an allreduce operation on an enhanced direct memory access (DMA) engine are disclosed. A system implements a machine learning application which includes a first kernel and a second kernel. The first kernel corresponds to a first portion of a machine learning model while the second kernel corresponds to a second portion of the machine learning model. The first kernel is invoked on a plurality of compute units, and the second kernel is converted into commands executable by an enhanced DMA engine to perform a collective communication operation. The first kernel executes on the plurality of compute units in parallel with the enhanced DMA engine executing the commands for the collective communication operation. As a result, the allreduce operation runs on the enhanced DMA engine in parallel with the compute units.
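
    A rough sketch of the overlap, modeling the enhanced DMA engine as a host thread; the names and the threading model are assumptions for illustration only:

        import threading
        import numpy as np

        def compute_kernel(x):
            return np.tanh(x)                     # stand-in for the first kernel

        def dma_allreduce(shards, out):
            # Reduce step of allreduce; broadcasting the result back to
            # peers is omitted for brevity.
            out[:] = np.sum(shards, axis=0)

        acts = np.random.rand(1024)
        shards = np.random.rand(4, 1024)          # gradients from 4 devices
        reduced = np.empty(1024)

        dma = threading.Thread(target=dma_allreduce, args=(shards, reduced))
        dma.start()                               # collective runs "on the DMA engine"
        result = compute_kernel(acts)             # compute proceeds concurrently
        dma.join()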

    COMPUTER RESOURCE SCHEDULING USING GENERATIVE ADVERSARIAL NETWORKS

    Publication Number: US20200379814A1

    Publication Date: 2020-12-03

    Application Number: US16425878

    Application Date: 2019-05-29

    Abstract: Techniques for scheduling resources on a managed computer system are provided herein. A generative adversarial network generates predicted resource utilization. An orchestrator trains the generative adversarial network and provides the predicted resource utilization to a resource scheduler for use when the quality of the prediction is above a threshold. The quality is measured as the ability of the generator component of the generative adversarial network to “fool” the discriminator component into misclassifying the predicted resource utilization as real (i.e., of the type actually measured from the computer system).
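
    A minimal sketch of the orchestrator's gating logic, with toy generator and discriminator stand-ins; the “real” utilization statistics assumed here are invented for illustration:

        import numpy as np

        rng = np.random.default_rng(0)

        def generator(noise):
            return noise * 0.5 + 0.5              # fake utilization trace

        def discriminator(trace):
            # Toy check: call a trace "real" if its mean is near that of
            # measured utilization (assumed ~0.75 here).
            return abs(trace.mean() - 0.75) < 0.1

        def fooled_rate(trials=200):
            fakes = (generator(rng.random(8)) for _ in range(trials))
            return sum(discriminator(f) for f in fakes) / trials

        QUALITY_THRESHOLD = 0.5
        if fooled_rate() > QUALITY_THRESHOLD:
            # Quality is high enough: hand a prediction to the scheduler.
            print("scheduler input:", generator(rng.random(8)))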

    PROACTIVE MANAGEMENT OF INTER-GPU NETWORK LINKS

    Publication Number: US20210064444A1

    Publication Date: 2021-03-04

    Application Number: US16552065

    Application Date: 2019-08-27

    Abstract: Systems, apparatuses, and methods for proactively managing inter-processor network links are disclosed. A computing system includes at least a control unit and a plurality of processing units. Each processing unit includes a compute module and a configurable link interface. The control unit dynamically adjusts the clock frequency and link width of each processing unit's configurable link interface based on the data transfer size and computation time of each layer of a neural network, so as to reduce each layer's execution time. By tuning the clock frequency and link width on a per-layer basis, communication is closely overlapped with computation, allowing layers to complete more quickly.

    ADAPTIVE FILTER REPLACEMENT IN CONVOLUTIONAL NEURAL NETWORKS

    Publication Number: US20210012203A1

    Publication Date: 2021-01-14

    Application Number: US16508277

    Application Date: 2019-07-10

    Abstract: Systems, methods, and devices for increasing the inference speed of a trained convolutional neural network (CNN). A first computation speed of first filters having a first filter size in a layer of the CNN is determined, and a second computation speed of second filters having a second filter size in the layer is determined. The size of at least one of the first filters is changed to the second filter size if the second computation speed is faster than the first. In some implementations, the CNN is then retrained to generate a retrained CNN. If a key performance indicator loss of the retrained CNN exceeds a threshold, fewer of the first filters are changed to the second filter size.
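
    The speed comparison might look like the following timing harness; the direct convolution is a deliberately naive stand-in for a tuned GPU kernel, and all names are illustrative:

        import time
        import numpy as np

        def conv_time(image, k, trials=3):
            # Time a direct 2-D convolution with a k x k filter; keep the best run.
            kernel = np.ones((k, k))
            best = float("inf")
            for _ in range(trials):
                start = time.perf_counter()
                h, w = image.shape[0] - k + 1, image.shape[1] - k + 1
                out = np.empty((h, w))
                for i in range(h):
                    for j in range(w):
                        out[i, j] = np.sum(image[i:i+k, j:j+k] * kernel)
                best = min(best, time.perf_counter() - start)
            return best

        img = np.random.rand(64, 64)
        t3, t5 = conv_time(img, 3), conv_time(img, 5)
        chosen = 3 if t3 < t5 else 5  # replace filters with the faster size
        print(f"3x3: {t3:.4f}s  5x5: {t5:.4f}s  -> use {chosen}x{chosen}")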

    RUNTIME EXTENSION FOR NEURAL NETWORK TRAINING WITH HETEROGENEOUS MEMORY

    Publication Number: US20200042859A1

    Publication Date: 2020-02-06

    Application Number: US16194958

    Application Date: 2018-11-19

    Abstract: Systems, apparatuses, and methods for managing buffers in a neural network implementation with heterogeneous memory are disclosed. A system includes a neural network coupled to a first memory and a second memory. The first memory is a relatively low-capacity, high-bandwidth memory while the second memory is a relatively high-capacity, low-bandwidth memory. During a forward propagation pass of the neural network, a run-time manager monitors the usage of the buffers for the various layers of the neural network. During a backward propagation pass of the neural network, the run-time manager determines how to move the buffers between the first and second memories based on the monitored buffer usage during the forward propagation pass. As a result, the run-time manager is able to reduce memory access latency for the layers of the neural network during the backward propagation pass.
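
    A compact illustration of the placement decision, assuming the backward pass consumes buffers in reverse forward order; the capacity model and names are invented for the sketch:

        FAST_MEM_CAPACITY = 2  # buffers the fast, low-capacity memory can hold

        def plan_backward_placement(forward_order):
            # Buffers produced last are needed first during backpropagation,
            # so they get the fast memory; the rest spill to slow memory.
            backward_order = list(reversed(forward_order))
            fast = backward_order[:FAST_MEM_CAPACITY]
            slow = backward_order[FAST_MEM_CAPACITY:]
            return fast, slow

        layers = ["conv1", "conv2", "fc1", "fc2"]  # forward-pass order
        fast, slow = plan_backward_placement(layers)
        print("fast memory:", fast, "| slow memory:", slow)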
