Restructuring a multi-dimensional array

    Publication Number: US10445638B1

    Publication Date: 2019-10-15

    Application Number: US15908236

    Filing Date: 2018-02-28

    Abstract: Disclosed herein are techniques for performing neural network computations. In one embodiment, an apparatus may include an array of processing elements, the array having a configurable first effective dimension and a configurable second effective dimension. The apparatus may also include a controller configured to determine at least one of: a first number of input data sets to be provided to the array at a first time or a second number of output data sets to be generated by the array at a second time, and to configure, based on at least one of the first number or the second number, at least one of the first effective dimension or the second effective dimension of the array.
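
    A minimal Python sketch of the idea, assuming a hypothetical controller routine (configure_array) and illustrative physical dimensions that are not taken from the patent: the effective dimensions are simply clamped to the number of input and output data sets the workload needs.

    PHYSICAL_ROWS = 128   # assumed physical row count of the processing-element array
    PHYSICAL_COLS = 64    # assumed physical column count

    def configure_array(num_input_sets: int, num_output_sets: int) -> tuple:
        """Pick (effective_rows, effective_cols) for the next computation.

        Rows consume input data sets and columns produce output data sets, so the
        effective dimensions are limited to what the workload actually uses.
        """
        effective_rows = min(num_input_sets, PHYSICAL_ROWS)
        effective_cols = min(num_output_sets, PHYSICAL_COLS)
        return effective_rows, effective_cols

    # Example: 16 input data sets and 8 output data sets need only a 16x8 sub-array.
    print(configure_array(16, 8))   # -> (16, 8)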

    Throughput increase for compute engine

    Publication Number: US12260214B1

    Publication Date: 2025-03-25

    Application Number: US17937332

    Filing Date: 2022-09-30

    Abstract: A compute channel can have multiple computational circuit blocks coupled in series to form a pipeline. The compute channel can perform a computation on an input tensor to generate an output tensor based on an instruction. When the computation does not require all of the computational circuit blocks, the throughput of the compute channel can be increased by splitting the data elements of the input tensor into multiple input data streams. The multiple input data streams are provided to respective subsets of one or more computational circuit blocks in the pipeline using bypass circuitry of the computational circuit blocks, and the computation can be performed on the multiple input data streams in the respective subsets of one or more computational circuit blocks to generate multiple output data streams corresponding to the output tensor.
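
    A rough Python model of the stream splitting described above, under the assumption that the computation needs only one circuit block per stream; the function names (split_into_streams, run_pipeline) and the scale-by-two stage are illustrative, not from the patent.

    def split_into_streams(elements, num_streams):
        """Interleave the flat data elements of a tensor into num_streams streams."""
        return [elements[i::num_streams] for i in range(num_streams)]

    def compute_stage(x):
        # Stand-in for the single computational circuit block this operation needs.
        return 2 * x

    def run_pipeline(elements, num_streams):
        """Apply the same one-stage computation to each stream; the parallel subsets
        of circuit blocks are modeled here as independent Python loops."""
        streams = split_into_streams(elements, num_streams)
        out_streams = [[compute_stage(x) for x in s] for s in streams]
        # Re-interleave the output streams back into the output tensor's element order.
        merged = [None] * len(elements)
        for i, s in enumerate(out_streams):
            merged[i::num_streams] = s
        return merged

    print(run_pipeline([1, 2, 3, 4, 5, 6], num_streams=2))  # -> [2, 4, 6, 8, 10, 12]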

    Efficient utilization of processing element array

    Publication Number: US12198041B2

    Publication Date: 2025-01-14

    Application Number: US18352768

    Filing Date: 2023-07-14

    Abstract: Generating instructions for programming a processing element array to implement a convolution operation can include determining that the convolution operation under-utilizes the processing element array. The convolution operation involves using the processing element array to perform a series of matrix multiplications between a set of filters and a set of input matrices. Each filter comprises a weight matrix. Each input matrix is assigned to a respective row in the processing element array. Under-utilization can be determined by detecting that fewer than a threshold number of rows would be used concurrently. In response to determining that the convolution operation under-utilizes the processing element array, instructions can be added for modifying the convolution operation to increase the number of rows used concurrently. The added instructions are executable to cause at least one input matrix to be processed in parallel across more rows compared to processing without modifying the convolution operation.
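
    A simplified Python sketch of the under-utilization check; the array size, threshold, and split-in-half strategy below are illustrative assumptions rather than the patent's actual parameters.

    ARRAY_ROWS = 128
    UTILIZATION_THRESHOLD = 64   # minimum number of rows we want active concurrently

    def plan_row_assignment(num_input_matrices):
        """Return a list of (row, work_item) assignments for the PE array."""
        if num_input_matrices >= UTILIZATION_THRESHOLD:
            # Enough concurrent rows: one input matrix per row, no modification needed.
            return [(r, "matrix_%d" % r) for r in range(num_input_matrices)]

        # Under-utilized: split each input matrix into two halves so that twice as
        # many rows are used concurrently (each half is an independent work item).
        assignments = []
        for m in range(num_input_matrices):
            assignments.append((2 * m,     "matrix_%d_top_half" % m))
            assignments.append((2 * m + 1, "matrix_%d_bottom_half" % m))
        return assignments

    print(len(plan_row_assignment(16)))   # 32 rows used concurrently instead of 16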

    Multiple accumulate busses in a systolic array

    Publication Number: US12182064B2

    Publication Date: 2024-12-31

    Application Number: US18446357

    Filing Date: 2023-08-08

    Abstract: Systems and methods are provided to enable parallelized multiply-accumulate operations in a systolic array. Each column of the systolic array can include multiple busses enabling independent transmission of input partial sums along the respective bus. Each processing element of a given columnar bus can receive an input partial sum from a prior element of the given columnar bus and perform arithmetic operations on the input partial sum. Each processing element can generate an output partial sum based on the arithmetic operations and provide the output partial sum to a next processing element of the given columnar bus, without the output partial sum being processed by any intervening processing element of the column that uses a different columnar bus. Use of columnar busses can enable parallelization to increase speed or enable increased latency at individual processing elements.
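
    The bypass behavior can be pictured with a short Python simulation of a single column; the two-bus layout and the alternating PE-to-bus assignment are assumptions made for illustration.

    NUM_BUSSES = 2

    def column_accumulate(products):
        """products[i] is the multiply result produced by processing element i.
        Returns one accumulated partial sum per columnar bus."""
        partial_sums = [0] * NUM_BUSSES
        for pe_index, product in enumerate(products):
            bus = pe_index % NUM_BUSSES          # which columnar bus this PE drives
            # The PE adds its product to the partial sum arriving on its own bus;
            # the other bus bypasses this PE position unmodified.
            partial_sums[bus] += product
        return partial_sums

    # Eight PEs in the column: even-indexed PEs feed bus 0, odd-indexed PEs feed bus 1.
    print(column_accumulate([1, 2, 3, 4, 5, 6, 7, 8]))   # -> [16, 20]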

    Configurable function approximation based on switching mapping table content

    Publication Number: US11423313B1

    Publication Date: 2022-08-23

    Application Number: US16218082

    Filing Date: 2018-12-12

    Abstract: Methods and systems for performing hardware approximation of functions are provided. In one example, a system comprises a controller, configurable arithmetic circuits, and a mapping table. The mapping table stores a first set of function parameters in a first mode of operation and stores a second set of function parameters in a second mode of operation. Depending on the mode of operation, the controller may configure the arithmetic circuits to compute a first approximation result of a function at an input value based on the first set of function parameters, or to compute a second approximation result of the function at the input value based on the second set of function parameters and to perform post-processing, such as quantization, of the second approximation result.
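
    One common way to realize such a mapping table is piecewise-linear approximation; the sketch below is an illustrative Python model (the segment count, the exp(x) target function, and the quantization step are assumptions, not the patent's parameters).

    import math

    NUM_SEGMENTS = 16
    X_MIN, X_MAX = 0.0, 4.0
    STEP = (X_MAX - X_MIN) / NUM_SEGMENTS

    # "Mapping table" for one mode: (slope, bias) pairs approximating exp(x) on [0, 4).
    table_mode1 = []
    for i in range(NUM_SEGMENTS):
        x0, x1 = X_MIN + i * STEP, X_MIN + (i + 1) * STEP
        slope = (math.exp(x1) - math.exp(x0)) / STEP
        bias = math.exp(x0) - slope * x0
        table_mode1.append((slope, bias))

    def approximate(x, table, quantize=False, scale=0.25):
        """Evaluate the piecewise-linear approximation; optionally quantize the result."""
        i = min(int((x - X_MIN) / STEP), NUM_SEGMENTS - 1)
        slope, bias = table[i]
        y = slope * x + bias
        if quantize:                      # second mode: post-process (quantize) the result
            y = round(y / scale) * scale
        return y

    print(approximate(1.3, table_mode1))                 # close to exp(1.3) = 3.669...
    print(approximate(1.3, table_mode1, quantize=True))  # rounded to 0.25 steps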

    DATA-TYPE-AWARE CLOCK-GATING

    Publication Number: US20220188073A1

    Publication Date: 2022-06-16

    Application Number: US17247475

    Filing Date: 2020-12-11

    Abstract: To reduce power consumption, data bits or a portion of a data register that is not expected to toggle frequently can be grouped together, and be clock-gated independently from the rest of the data register. The grouping of the data bits can be determined based on the data types of the workload being operated on. For a data register configured to store a numeric value that supports multiple data types, the portion of the data register being clock-gated may store a group of data bits that are unused for one or more data types of the multiple data types supported by the data register. The portion of the data register being clock-gated can also be a group of data bits that remain unchanged or have a constant value for numeric values within a certain numeric range that is frequently operated on.
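
    A small Python model of the idea, assuming an illustrative 32-bit register split into two bit groups, with bfloat16 as the data type whose unused low bits are gated; the group boundaries and names are hypothetical.

    # Bit groups of a 32-bit register: (low_bit, high_bit), inclusive.
    BIT_GROUPS = {
        "low_mantissa": (0, 15),    # unused by bfloat16, which has only a 7-bit mantissa
        "high_bits": (16, 31),      # sign, exponent, and upper mantissa bits
    }

    # Which groups actually toggle for each supported data type.
    ACTIVE_GROUPS = {
        "fp32": {"low_mantissa", "high_bits"},
        "bf16": {"high_bits"},      # low 16 bits are clock-gated for bf16 workloads
    }

    def clocked_update(register, new_value, dtype):
        """Update only the bit groups enabled for dtype; gated groups hold their value."""
        result = register
        for name, (lo, hi) in BIT_GROUPS.items():
            mask = ((1 << (hi - lo + 1)) - 1) << lo
            if name in ACTIVE_GROUPS[dtype]:
                result = (result & ~mask) | (new_value & mask)
            # else: group is clock-gated, so its bits keep their previous value
        return result

    reg = 0x0000_0000
    reg = clocked_update(reg, 0xDEAD_BEEF, "bf16")
    print(hex(reg))   # 0xdead0000 -- the low 16 bits were gated and never toggled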

    Parallelism within a systolic array using multiple accumulate busses

    Publication Number: US11232062B1

    Publication Date: 2022-01-25

    Application Number: US16915783

    Filing Date: 2020-06-29

    Abstract: Systems and methods are provided to enable parallelized multiply-accumulate operations in a systolic array. Each column of the systolic array can include multiple busses enabling independent transmission of input partial sums along the respective bus. Each processing element can include a plurality of interconnects to receive a plurality of inputs corresponding to the multiple busses. Each processing element of a given columnar bus can receive an input from a prior element of the given columnar bus at an active bus position and perform arithmetic operations on the input. Each processing element can further receive a plurality of inputs at passive bus positions and provide the plurality of inputs to subsequent processing elements without the plurality of inputs being processed by the processing element. Use of columnar busses can enable parallelization to increase speed or enable increased latency at individual processing elements.
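
    A compact Python sketch of a single processing element with one active and one passive bus position; the two-bus configuration, the function name, and the values are illustrative assumptions.

    def processing_element(bus_inputs, weight, feature, active_bus):
        """bus_inputs: list of input partial sums, one per columnar bus.
        Returns the outputs driven onto each bus by this processing element."""
        outputs = list(bus_inputs)                   # passive positions: pass through unchanged
        outputs[active_bus] += weight * feature      # active position: multiply-accumulate
        return outputs

    # Two busses; this PE is active on bus 1, so bus 0's partial sum is untouched.
    print(processing_element([10, 20], weight=3, feature=4, active_bus=1))  # -> [10, 32]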

    EFFICIENT UTILIZATION OF PROCESSING ELEMENT ARRAY

    Publication Number: US20210158132A1

    Publication Date: 2021-05-27

    Application Number: US16698461

    Filing Date: 2019-11-27

    Abstract: A computer-implemented method includes receiving a neural network model for implementation using a processing element array, where the neural network model includes a convolution operation on a set of input feature maps and a set of filters. The method also includes determining, based on the neural network model, that the convolution operation utilizes less than a threshold number of rows in the processing element array for applying a set of filter elements to the set of input feature maps, where the set of filter elements includes one filter element in each filter of the set of filters. The method further includes generating, for the convolution operation and based on the neural network model, a first instruction and a second instruction for execution by respective rows in the processing element array, where the first instruction and the second instruction use different filter elements of a filter in the set of filters.
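
    As an illustration only, the two generated instructions can be thought of as two rows applying different elements of the same filter to (shifted copies of) one input feature map; the 1-D example below uses made-up values, not the patent's instruction format.

    input_feature_map = [1, 2, 3, 4, 5]
    filter_elements = [10, 20]        # two elements of one filter

    def row_instruction(feature_map, filter_element, shift):
        """One row multiplies the (shifted) feature map by a single filter element."""
        return [filter_element * x for x in feature_map[shift:]]

    # First instruction: row 0 applies filter element 0 with no shift.
    row0 = row_instruction(input_feature_map, filter_elements[0], shift=0)
    # Second instruction: row 1 applies filter element 1 to the shifted feature map.
    row1 = row_instruction(input_feature_map, filter_elements[1], shift=1)

    # Column-wise sums of the two rows give the 1-D convolution outputs.
    conv = [a + b for a, b in zip(row0, row1)]
    print(conv)   # -> [50, 80, 110, 140]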

    ACCELERATED QUANTIZED MULTIPLY-AND-ADD OPERATIONS

    Publication Number: US20200293284A1

    Publication Date: 2020-09-17

    Application Number: US16891010

    Filing Date: 2020-06-02

    Abstract: Disclosed herein are techniques for accelerating convolution operations or other matrix multiplications in applications such as neural networks. In one example, an apparatus comprises a first circuit, a second circuit, and a third circuit. The first circuit is configured to: receive first values in a first format, the first values being generated from one or more asymmetric quantization operations of second values in a second format, and generate difference values based on subtracting a third value from each of the first values, the third value representing a zero value in the first format. The second circuit is configured to generate a sum of products in the first format using the difference values. The third circuit is configured to convert the sum of products from the first format to the second format based on scaling the sum of products with a scaling factor.
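
    A plain-Python sketch of the three circuits' roles for a single dot product, with illustrative scales, zero points, and example vectors (the uint8 quantization format here is an assumption).

    def quantized_dot(q_a, q_b, zero_a, zero_b, scale_a, scale_b):
        """Dot product of two asymmetrically quantized vectors, dequantized at the end."""
        # First circuit: subtract each tensor's zero point to get signed difference values.
        diff_a = [a - zero_a for a in q_a]
        diff_b = [b - zero_b for b in q_b]
        # Second circuit: integer sum of products of the difference values.
        acc = sum(da * db for da, db in zip(diff_a, diff_b))
        # Third circuit: scale the integer accumulator back to the original format.
        return acc * (scale_a * scale_b)

    # uint8 values quantized as q = round(x / scale) + zero_point.
    q_a, zero_a, scale_a = [130, 140, 120], 128, 0.05   # represents [0.1, 0.6, -0.4]
    q_b, zero_b, scale_b = [12, 8, 10],     10, 0.2     # represents [0.4, -0.4, 0.0]
    print(quantized_dot(q_a, q_b, zero_a, zero_b, scale_a, scale_b))  # -> -0.2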
