Dynamic processing element array expansion

    Publication number: US11868895B2

    Publication date: 2024-01-09

    Application number: US18154576

    Application date: 2023-01-13

    CPC classification number: G06N3/08 G06N3/04

    Abstract: A computer-implemented method includes receiving a neural network model that includes a tensor operation, dividing the tensor operation into a set of sub-operations, and generating instructions for performing a plurality of sub-operations of the set of sub-operations on respective computing engines of a plurality of computing engines on a same integrated circuit device or on different integrated circuit devices. Each sub-operation of the set of sub-operations generates a portion of a final output of the tensor operation. An inference is made based on a result of a sub-operation of the plurality of sub-operations, or based on results of the plurality of sub-operations.
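
    As a rough illustration of the idea, the sketch below splits one matrix multiplication into row-block sub-operations, each of which could be dispatched to a separate computing engine. Every name here is illustrative rather than the patent's actual API, and NumPy stands in for the hardware.

```python
import numpy as np

def split_matmul(a, b, num_engines):
    # Divide the tensor operation into sub-operations: each engine gets
    # a block of rows of `a` and computes a disjoint slice of the output.
    row_blocks = np.array_split(a, num_engines, axis=0)
    partial_results = [block @ b for block in row_blocks]  # one per engine
    # Concatenating the portions reproduces the undivided operation, and an
    # inference could already be drawn from any single finished portion.
    return np.concatenate(partial_results, axis=0)

a, b = np.random.rand(8, 4), np.random.rand(4, 3)
assert np.allclose(split_matmul(a, b, num_engines=4), a @ b)
```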

    Global modulo allocation in neural network compilation

    Publication number: US11809849B1

    Publication date: 2023-11-07

    Application number: US17326175

    Application date: 2021-05-20

    CPC classification number: G06F8/452 G06F9/3853 G06F13/28 G06N3/04

    Abstract: In one example, a method performed by a compiler comprises: receiving a dataflow graph of a neural network, the neural network comprising a neural network operator; receiving information about the computation resources and memory resources of a neural network hardware accelerator intended to execute the neural network operator; determining, based on the dataflow graph, iterations of an operation on elements of a tensor included in the neural network operator; determining, based on the information, a mapping between the elements of the tensor and addresses in a portion of the local memory, as well as a number of the iterations of the operation to be included in each batch, wherein the iterations in a batch are to be executed in parallel by the neural network hardware accelerator; and generating a schedule for executing the batches of iterations of the operation.
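
    A minimal sketch of the modulo idea follows, assuming a single memory bank of fixed size and a fixed batch width; `plan_iterations`, `bank_size`, and `batch_size` are hypothetical names, not the compiler's real interface.

```python
def plan_iterations(num_elements, bank_size, batch_size):
    # Modulo mapping: element i lives at local address i % bank_size, so a
    # tensor larger than the bank reuses addresses across later batches.
    address_of = [i % bank_size for i in range(num_elements)]
    # Group iterations into fixed-width batches executed in parallel.
    schedule = [list(range(s, min(s + batch_size, num_elements)))
                for s in range(0, num_elements, batch_size)]
    return address_of, schedule

addresses, schedule = plan_iterations(num_elements=10, bank_size=4, batch_size=4)
# addresses -> [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]; three batches of up to four.
```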

    Neural network processing based on subgraph recognition

    Publication number: US11714992B1

    Publication date: 2023-08-01

    Application number: US16219760

    Application date: 2018-12-13

    CPC classification number: G06N3/04 G06F9/4881 G06F9/30003 G06F16/9024

    Abstract: Systems and methods for providing executable instructions to a neural network processor are provided. In one example, a system comprises a database that stores a plurality of executable instructions and a plurality of subgraph identifiers, each subgraph identifier of the plurality of subgraph identifiers being associated with a subset of instructions of the plurality of executable instructions. The system further includes a compiler configured to: identify a computational subgraph from a computational graph of a neural network model; compute a subgraph identifier for the computational subgraph; based on whether the subgraph identifier is included in the plurality of subgraph identifiers, either obtain, from the database, first instructions associated with the subgraph identifier, or generate second instructions representing the computational subgraph; and provide the first instructions or the second instructions for execution by a neural network processor to perform computation operations for the neural network model.
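
    The lookup the abstract describes resembles content-addressed caching. The sketch below hashes a canonical description of a subgraph and reuses previously compiled instructions when the identifier is already known; `compile_subgraph` is a stand-in stub, and hashing joined operator names is an assumed canonicalization, not the patent's actual scheme.

```python
import hashlib

compiled_cache = {}  # subgraph identifier -> executable instructions

def compile_subgraph(ops):
    # Stand-in for real instruction generation by the compiler backend.
    return [f"EXEC {op}" for op in ops]

def instructions_for(ops):
    # Compute an identifier for the subgraph from a canonical description.
    key = hashlib.sha256("|".join(ops).encode()).hexdigest()
    if key not in compiled_cache:
        compiled_cache[key] = compile_subgraph(ops)  # "second instructions"
    return compiled_cache[key]                       # "first instructions"

instructions_for(["matmul", "relu"])  # compiled and cached
instructions_for(["matmul", "relu"])  # served from the cache
```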

    Workload reduction for non-maximum suppression operation

    Publication number: US11562554B1

    Publication date: 2023-01-24

    Application number: US16949749

    Application date: 2020-11-12

    Abstract: A technique for improving the computational time for performing a non-maximum suppression operation may include receiving a request to perform a non-maximum suppression operation on a set of candidate predictions of a computing task, and performing a statistical analysis on a set of confidence scores corresponding to the set of candidate predictions to determine a standard deviation of the set of confidence scores. A confidence score threshold can be determined based on the standard deviation. Candidate predictions having a confidence score below the confidence score threshold can then be discarded to form a reduced set of candidate predictions. Additional candidate predictions can be discarded from the reduced set of candidate predictions based on an intersection-over-union overlap metric, and the remaining candidate predictions from the reduced set of candidate predictions can be provided as a result of the non-maximum suppression operation.
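
    A compact sketch of the two-stage reduction, assuming the statistical pre-filter drops candidates scoring more than k standard deviations below the mean (the patent's exact threshold rule may differ), followed by ordinary greedy IoU suppression:

```python
import numpy as np

def iou(a, b):
    # Boxes as (x1, y1, x2, y2); standard intersection-over-union.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def reduced_nms(boxes, scores, k=1.0, iou_thresh=0.5):
    # Statistical pre-filter: derive a confidence threshold from the
    # standard deviation and discard low-scoring candidates up front.
    keep = scores >= scores.mean() - k * scores.std()
    boxes, scores = boxes[keep], scores[keep]
    # Conventional greedy suppression on the reduced candidate set.
    order = np.argsort(-scores)
    selected = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in selected):
            selected.append(i)
    return boxes[selected], scores[selected]

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
reduced_nms(boxes, scores)  # pre-filter drops the third box, IoU the second
```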

    Registers for restricted memory

    Publication number: US11294599B1

    Publication date: 2022-04-05

    Application number: US16891438

    Application date: 2020-06-03

    Abstract: Provided are integrated circuits and methods for operating integrated circuits. An integrated circuit can include a plurality of memory banks and an execution engine including a set of execution components. Each execution component can be associated with a respective memory bank and can read from and write to the respective memory bank. The integrated circuit can further include a set of registers each associated with a respective memory bank from the plurality of memory banks. The integrated circuit can further be operable to load to or store from the set of registers in parallel, and load to or store from the set of registers serially. A parallel operation followed by a serial operation enables data to be moved from many memory banks into one memory bank. A serial operation followed by a parallel operation enables data to be moved from one memory bank into many memory banks.
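
    The parallel-then-serial pattern can be modeled with plain lists standing in for memory banks and per-bank registers; the sketch below is purely illustrative and gathers one word from every bank into a single destination bank.

```python
def gather(banks, src_addr, dst_bank, dst_addr):
    # Parallel load: every bank's execution component reads one word
    # into the register tied to that bank, all in the same step.
    registers = [bank[src_addr] for bank in banks]
    # Serial store: the registers drain one at a time into a single bank.
    for i, value in enumerate(registers):
        banks[dst_bank][dst_addr + i] = value

# Four banks of eight words each; move word 0 of every bank into bank 2.
banks = [[b * 10 + a for a in range(8)] for b in range(4)]
gather(banks, src_addr=0, dst_bank=2, dst_addr=4)
```

    Reversing the two phases (a serial load of the registers followed by a parallel store) would model the opposite movement, from one bank out to many banks.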

    Low latency neural network model loading

    Publication number: US11182314B1

    Publication date: 2021-11-23

    Application number: US16698761

    Application date: 2019-11-27

    Abstract: An integrated circuit device implementing a neural network accelerator may have a peripheral bus interface to interface with a host memory, and neural network models can be loaded from the host memory onto the state buffer of the neural network accelerator for execution by the array of processing elements. The neural network accelerator may also have a memory interface to interface with a local memory. The local memory may store neural network models from the host memory, and the models can be loaded from the local memory into the state buffer with reduced latency as compared to loading from the host memory. In systems with multiple accelerators, the models in the local memory can also be shared amongst different accelerators.
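
    In effect the local memory acts as a model cache in front of the host. A minimal sketch, with dictionaries standing in for host DRAM and the accelerator-local memory (all names are illustrative):

```python
host_memory = {"resnet": b"...weights..."}  # stand-in for host DRAM
local_memory = {}                           # accelerator-local model cache

def load_model(name):
    # First load pays the slow peripheral-bus trip to host memory;
    # later loads hit the local copy and reach the state buffer faster.
    if name not in local_memory:
        local_memory[name] = host_memory[name]
    return local_memory[name]

load_model("resnet")  # fetched from host and cached locally
load_model("resnet")  # served from local memory with reduced latency
```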

    Multicast master

    Publication number: US10831693B1

    Publication date: 2020-11-10

    Application number: US16145122

    Application date: 2018-09-27

    Abstract: Provided are integrated circuit devices and methods for operating integrated circuit devices. In various examples, an integrated circuit device can include a master port operable to send transactions to target components of the device. The master port can have point-to-point connections with each of the targets. The master port can be configured with a first address range for a first target, a second address range for a second target, and a multicast address range for both the first and second targets. When the master port receives a request with an address that is in the multicast address range, the master port can generate, from the one request, a transaction for each of the first and second targets.
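
    A small sketch of the routing decision, with Python ranges standing in for configured address windows; the target names and window bounds are invented for illustration.

```python
RANGES = {
    "target0": range(0x0000, 0x1000),  # unicast window for the first target
    "target1": range(0x1000, 0x2000),  # unicast window for the second target
}
MULTICAST = range(0x2000, 0x3000)      # maps onto both targets

def route(addr, payload):
    # One request in the multicast range fans out into one transaction
    # per configured target; unicast ranges route point-to-point.
    if addr in MULTICAST:
        return [(target, payload) for target in RANGES]
    for target, window in RANGES.items():
        if addr in window:
            return [(target, payload)]
    raise ValueError("address not mapped")

route(0x2500, "data")  # -> one transaction each for target0 and target1
```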

    Efficient utilization of processing element array

    Publication number: US12198041B2

    Publication date: 2025-01-14

    Application number: US18352768

    Application date: 2023-07-14

    Abstract: Generating instructions for programming a processing element array to implement a convolution operation can include determining that the convolution operation under-utilizes the processing element array. The convolution operation involves using the processing element array to perform a series of matrix multiplications between a set of filters and a set of input matrices. Each filter comprises a weight matrix. Each input matrix is assigned to a respective row in the processing element array. Under-utilization can be determined through detecting that less than a threshold number of rows would be used concurrently. In response to determining that the convolution operation under-utilizes the processing element array, instructions can be added for modifying the convolution operation to increase the number of rows used concurrently. The added instructions are executable to cause at least one input matrix to be processed in parallel across more rows compared to processing without modifying the convolution operation.
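
    As a rough model, under-utilization detection and the resulting split factor might look like the following; `PE_ROWS` and the 50% threshold are assumed values, not figures from the patent.

```python
PE_ROWS = 128  # rows in the processing element array (illustrative)

def split_factor(rows_used, threshold=0.5):
    # Detect under-utilization: the series of matrix multiplications would
    # occupy too few rows concurrently, so split each input matrix so that
    # it is processed in parallel across more rows.
    if rows_used >= threshold * PE_ROWS:
        return 1  # enough rows are busy; leave the convolution unmodified
    return max(1, PE_ROWS // rows_used)

# A convolution using 16 of 128 rows gets split 8 ways to fill the array.
assert split_factor(16) == 8
```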

    Compilation time reduction for memory and compute bound neural networks

    Publication number: US12079734B1

    Publication date: 2024-09-03

    Application number: US17878824

    Application date: 2022-08-01

    CPC classification number: G06N3/10 G06N3/04 G06N3/08

    Abstract: Techniques for reducing a compilation time for compiling a neural network are disclosed. A description of a neural network is received by a compiler. A plurality of operators are identified based on the description of the neural network. A plurality of subgraphs are formed, each including one or more operators. For each subgraph, a performance factor is calculated based on a compute usage and a memory usage associated with the operators included in the subgraph. The performance factor is compared to a threshold. Based on the comparison, either the subgraph is classified as a compute bound subgraph and a set of memory optimizations are suppressed or the subgraph is classified as a memory bound subgraph and a set of compute optimizations are suppressed.
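
    A minimal sketch of the classification step, assuming the performance factor is a compute-to-memory usage ratio (the patent does not fix the formula here, and the threshold value is illustrative):

```python
def classify_subgraph(compute_usage, memory_usage, threshold=1.0):
    # Performance factor: how compute-heavy the subgraph's operators are
    # relative to their memory traffic (assumed ratio form).
    factor = compute_usage / max(memory_usage, 1e-9)
    if factor > threshold:
        # Memory optimizations cannot help a compute-bound subgraph,
        # so suppressing them saves compilation time.
        return "compute-bound", {"suppress": "memory optimizations"}
    return "memory-bound", {"suppress": "compute optimizations"}

print(classify_subgraph(compute_usage=8.0, memory_usage=2.0))
# -> ('compute-bound', {'suppress': 'memory optimizations'})
```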
