Emulating fine-grained sparsity in a systolic array

    Publication Number: US11500962B1

    Publication Date: 2022-11-15

    Application Number: US16917033

    Application Date: 2020-06-30

    Abstract: To take advantage of the architecture of a systolic array tailored to perform sparse matrix multiplications, a weight matrix can be converted into a set of constrained fine-grained sparse weight matrices. The conversion process may include receiving a request to perform a matrix multiplication operation with a weight matrix, and determining that the weight matrix satisfies a sparsity condition to convert the weight matrix into a set of constrained fine-grained sparse weight matrices. The weight matrix can then be converted into a set of constrained fine-grained sparse weight matrices. Computer instructions can then be generated for an integrated circuit device to perform the requested matrix multiplication operation as a set of sparse matrix multiplication operations using the set of constrained fine-grained sparse weight matrices.
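The conversion described above can be sketched in plain Python. The exact sparsity constraint is not specified in the abstract, so this sketch assumes a hypothetical "one nonzero per group of g columns" pattern; the function name and the round-robin assignment are illustrative, not the patent's method:

```python
def to_constrained_sparse(W, g=4):
    """Split weight matrix W (a list of rows) into a set of matrices
    that each have at most one nonzero per g-column group in every row,
    and that sum back to W. The '1-in-g' constraint here is an assumed
    stand-in for the patent's fine-grained sparsity condition."""
    rows, cols = len(W), len(W[0])
    assert cols % g == 0
    # Number of pieces needed = nonzero count of the densest group in W.
    max_nnz = max(
        sum(1 for c in range(c0, c0 + g) if W[r][c] != 0)
        for r in range(rows) for c0 in range(0, cols, g)
    )
    parts = [[[0] * cols for _ in range(rows)] for _ in range(max_nnz)]
    for r in range(rows):
        for c0 in range(0, cols, g):
            k = 0  # round-robin: spread this group's nonzeros across pieces
            for c in range(c0, c0 + g):
                if W[r][c] != 0:
                    parts[k][r][c] = W[r][c]
                    k += 1
    return parts
```

Each resulting piece satisfies the constrained pattern, so each sparse multiplication can run on the specialized systolic array, and summing the partial products recovers the dense result.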

    Programmable computations in direct memory access engine

    Publication Number: US11494326B1

    Publication Date: 2022-11-08

    Application Number: US17301273

    Application Date: 2021-03-30

    Inventors: Kun Xu; Ron Diamant

    Abstract: To perform complex arithmetic operations in neural networks without compromising the performance of the neural network accelerator, a programmable computation unit is integrated with a direct memory access (DMA) engine that is used to exchange neural network parameters between the neural network accelerator and system memory. The DMA engine may include a calculation circuit operable to perform a multiply-and-add calculation on a set of operands, and an operand selector circuit operable to select a source for each operand of the calculation circuit. The DMA engine may also include a control circuit operable to retrieve a meta-descriptor for performing a computation, configure the operand selector circuit based on the meta-descriptor, and use the calculation circuit to perform the computation based on the meta-descriptor to generate a computation result.
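A behavioral model of that datapath might look as follows. The meta-descriptor layout, the operand-source names, and the class names are assumptions for illustration; only the multiply-and-add circuit and the operand selector are taken from the abstract:

```python
from dataclasses import dataclass

# Assumed operand sources the selector circuit can route to the calculator.
IMMEDIATE, MEMORY = "imm", "mem"

@dataclass
class MetaDescriptor:
    """Hypothetical layout: three (source, value) selectors feeding a
    fused multiply-and-add, result = a * b + c."""
    a: tuple
    b: tuple
    c: tuple

class DmaComputeUnit:
    def __init__(self, memory):
        self.memory = memory  # system memory the DMA engine already accesses

    def _select(self, source, value):
        # Operand selector circuit: choose where each operand comes from.
        if source == IMMEDIATE:
            return value
        if source == MEMORY:
            return self.memory[value]  # value is interpreted as an address
        raise ValueError(source)

    def execute(self, md: MetaDescriptor):
        # Control circuit: configure selectors from the meta-descriptor,
        # then run the calculation circuit.
        a = self._select(*md.a)
        b = self._select(*md.b)
        c = self._select(*md.c)
        return a * b + c  # multiply-and-add calculation
```

Because the unit lives inside the DMA engine, the multiply-and-add can be applied to parameters as they stream between system memory and the accelerator, without occupying the accelerator itself.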

    Scheduling neural network computations based on memory capacity

    Publication Number: US11461631B2

    Publication Date: 2022-10-04

    Application Number: US15933225

    Application Date: 2018-03-22

    Abstract: Disclosed herein are techniques for scheduling and executing multi-layer neural network computations for multiple contexts. In one embodiment, a method comprises determining a set of computation tasks to be executed, the set of computation tasks including a first computation task and a second computation task, as well as a third computation task and a fourth computation task to provide input data for the first and second computation tasks; determining a first execution batch comprising the first and second computation tasks; determining a second execution batch comprising at least the third computation task to be executed before the first execution batch; determining whether to include the fourth computation task in the second execution batch based on whether the memory device has sufficient capacity to hold the input data and output data of both the third and fourth computation tasks; and executing the second execution batch followed by the first execution batch.
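The capacity test that decides the second batch's membership can be sketched directly from the abstract. The task representation (`input_size`/`output_size` fields) is a made-up convenience; only the decision rule comes from the text:

```python
def build_second_batch(task3, task4, memory_capacity):
    """Include task4 alongside task3 in the second execution batch only
    if the memory device can hold the input and output data of both
    tasks at once; otherwise task4 must run in a later batch.
    Tasks are dicts with assumed 'input_size'/'output_size' fields."""
    needed = (task3["input_size"] + task3["output_size"]
              + task4["input_size"] + task4["output_size"])
    batch = [task3]
    if needed <= memory_capacity:
        batch.append(task4)
    return batch
```

Batching the producer tasks together lets their consumers (the first batch) run back-to-back, at the cost of needing both producers' inputs and outputs resident at the same time, which is exactly what the capacity check guards.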

    Matrix transpose hardware acceleration

    Publication Number: US11435941B1

    Publication Date: 2022-09-06

    Application Number: US16911127

    Application Date: 2020-06-24

    Abstract: In one example, an apparatus comprises: a memory array having an array of memory elements arranged in rows and columns, each memory element being configured to store a data element; and a memory access circuit configured to: perform a row write operation to store a first group of data elements at a first row of the array of memory elements; perform a column read operation at a first column of the array of memory elements to obtain a second group of data elements; and perform a column write operation to store a third group of data elements at the first column of the array of memory elements to replace the second group of data elements.
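The access pattern in the abstract amounts to a streaming transpose buffer: rows are written in, columns are read out (already transposed), and each freed column is immediately refilled with the next group. A minimal functional model, with the square dimensions and single-port behavior as assumptions:

```python
class TransposeBuffer:
    """Sketch of the abstract's memory array. Row writes store incoming
    groups; a column read returns a transposed group; a column write
    replaces the group just read, so the next matrix streams in while
    the current one streams out."""
    def __init__(self, n):
        self.cells = [[0] * n for _ in range(n)]

    def write_row(self, r, values):          # row write operation
        self.cells[r] = list(values)

    def read_column(self, c):                # column read operation
        return [row[c] for row in self.cells]

    def write_column(self, c, values):       # column write operation
        for r, v in enumerate(values):
            self.cells[r][c] = v
```

Alternating the read/write orientation between matrices is what makes the transpose double-buffered in a single array rather than needing two.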

    Multi-model training pipeline in distributed systems

    Publication Number: US20210303988A1

    Publication Date: 2021-09-30

    Application Number: US16835161

    Application Date: 2020-03-30

    Abstract: A first worker node of a distributed system computes a first set of gradients using a first neural network model and a first set of weights associated with the first neural network model. The first set of gradients are transmitted from the first worker node to a second worker node of the distributed system. The second worker node computes a first set of synchronized gradients based on the first set of gradients. While the first set of synchronized gradients are being computed, the first worker node computes a second set of gradients using a second neural network model and a second set of weights associated with the second neural network model. The second set of gradients are transmitted from the first worker node to the second worker node. The second worker node computes a second set of synchronized gradients based on the second set of gradients.
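The overlap described above, synchronizing one model's gradients while computing the next model's, can be sketched with threads standing in for the two worker nodes. The callables and data shapes are placeholders; only the pipelining structure reflects the abstract:

```python
import threading

def pipelined_train_step(compute_grads, synchronize, models):
    """For each model, the first worker computes gradients, then hands
    them to the second worker for synchronization on a background
    thread, so the next model's gradient computation overlaps with the
    previous model's synchronization."""
    sync_threads = []
    for model in models:
        grads = compute_grads(model)                 # first worker node
        t = threading.Thread(target=synchronize,     # second worker node,
                             args=(model, grads))    # runs concurrently
        t.start()
        sync_threads.append(t)
    for t in sync_threads:
        t.join()  # all synchronized gradients ready before the next step
```

The benefit is that the communication latency of gradient synchronization for one model is hidden behind useful compute for the other.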

    Synchronization of concurrent computation engines

    Publication Number: US11061654B1

    Publication Date: 2021-07-13

    Application Number: US16217797

    Application Date: 2018-12-12

    Abstract: Provided are systems and methods for synchronizing program code execution for a plurality of execution engines in an integrated circuit device. In some cases, the operation of one execution engine may be dependent on the operation of another execution engine. To accommodate this dependency, the instructions for the first execution engine can include a set-event instruction and the instructions for the second execution engine can include a wait-on-event instruction. The wait-on-event instruction can cause the second execution engine to wait for the first execution engine to reach the set-event instruction. In this way, the two execution engines can be synchronized around the data or resource dependency.
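The set-event / wait-on-event pairing maps naturally onto an event primitive. This sketch models the two execution engines as callables sharing a `threading.Event`; the names and the single shared value are illustrative:

```python
import threading

def make_engines(shared):
    """Two execution engines synchronized around a data dependency.
    The first engine's 'set-event instruction' is event.set(); the
    second engine's 'wait-on-event instruction' is event.wait()."""
    event = threading.Event()

    def first_engine():
        shared["data"] = 42   # produce the data the second engine needs
        event.set()           # set-event: signal the dependency is met

    def second_engine(out):
        event.wait()          # wait-on-event: block until first engine signals
        out.append(shared["data"])

    return first_engine, second_engine
```

Because the wait is attached to a specific event rather than to global execution order, each engine can otherwise run its instruction stream independently, which is the point of the scheme.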
