Accelerated quantized multiply-and-add operations

    Publication number: US10678508B2

    Publication date: 2020-06-09

    Application number: US15934681

    Filing date: 2018-03-23

    Abstract: Disclosed herein are techniques for accelerating convolution operations or other matrix multiplications in applications such as neural networks. A computer-implemented method includes receiving low-precision inputs for a convolution operation from a storage device, and subtracting a low-precision value representing a high-precision zero value from the low-precision inputs to generate difference values, where the low-precision inputs are asymmetrically quantized from high-precision inputs. The method also includes performing multiplication and summation operations on the difference values to generate a sum of products, and generating a high-precision output by scaling the sum of products with a scaling factor.
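    The flow in the abstract (subtract zero points, integer multiply-accumulate, then rescale) can be sketched in NumPy. This is a minimal illustration of asymmetric-quantization arithmetic, not the patented hardware: the function name, the treatment of both operands symmetrically, and the specific dtypes are assumptions.

```python
import numpy as np

def quantized_matmul(a_q, b_q, a_zero, b_zero, a_scale, b_scale):
    # Subtract the low-precision zero points (the low-precision values that
    # represent high-precision zero) to form signed difference values.
    a_diff = a_q.astype(np.int32) - a_zero
    b_diff = b_q.astype(np.int32) - b_zero
    # Integer multiply-and-accumulate produces the sum of products.
    acc = a_diff @ b_diff
    # Scale the integer sum of products back to a high-precision output.
    return acc * (a_scale * b_scale)
```

    With inputs quantized as q = round(x / scale) + zero, the result matches the floating-point product of the original values up to quantization error.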

    Configuration of a deep vector engine using an opcode table, control table, and datapath table

    Publication number: US12271732B1

    Publication date: 2025-04-08

    Application number: US17937333

    Filing date: 2022-09-30

    Abstract: A technique to program a compute channel having multiple computational circuit blocks coupled in series in a pipeline can include receiving a machine instruction for the compute channel. The machine instruction is decoded to obtain an opcode, and the opcode can be used as an index to access an opcode entry in an opcode table. The opcode entry contains a pointer to a microoperation, and the pointer can be used to access a microoperation represented by a control entry in a control table and a datapath configuration entry in a datapath table. The microoperation can then be issued to the compute channel by configuring the compute channel with the control entry and the datapath configuration entry.
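    The two-level table lookup described above can be sketched as a table-driven decoder. The table contents, the instruction encoding (opcode in the high byte), and the entry fields are illustrative assumptions; the patent specifies only the opcode-to-pointer-to-entry indirection.

```python
# Hypothetical tables for a pipelined compute channel.  Each opcode maps to a
# pointer; the pointer selects a microoperation, represented by a control
# entry plus a datapath configuration entry at the same index.
OPCODE_TABLE = {0x01: 0, 0x02: 1}
CONTROL_TABLE = [
    {"stages_enabled": 0b0011},  # control entry for microoperation 0
    {"stages_enabled": 0b1111},  # control entry for microoperation 1
]
DATAPATH_TABLE = [
    {"alu_op": "add"},           # datapath configuration for microoperation 0
    {"alu_op": "mul"},           # datapath configuration for microoperation 1
]

def decode_and_issue(machine_instruction):
    # Decode the machine instruction to obtain the opcode
    # (assumed here to occupy the high byte).
    opcode = machine_instruction >> 8
    # The opcode indexes the opcode table, yielding a pointer used to access
    # both the control entry and the datapath configuration entry.
    ptr = OPCODE_TABLE[opcode]
    return CONTROL_TABLE[ptr], DATAPATH_TABLE[ptr]
```

    The indirection lets many opcodes share one microoperation, and lets a microoperation be redefined by editing the control and datapath tables without touching the opcode table.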

    Increasing performance of computational array accelerators

    Publication number: US12182691B1

    Publication date: 2024-12-31

    Application number: US17249900

    Filing date: 2021-03-17

    Abstract: To improve performance of a computational array, the architecture of the array can be modified to allow the processing engines of a column to operate in parallel and the clock frequency of the array to be increased. The processing engines of each column of the array can be grouped into a series of row groups. The processing engines of each row group can be loaded with input values, and computations on the input values can be carried out in parallel to generate the column output. One or more flip-flop stages can be inserted into the computational logic of each of the processing engines. The computational logic can then be distributed across the flip-flop stages to reduce the propagation delay between flip-flop stages of the processing engine, hence allowing the clock frequency of the array to be increased.
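    The row-group arithmetic is easy to check in software: partitioning a column into groups, computing per-group partial sums (concurrently, in hardware), and summing the partials gives the same column output as a single serial accumulation. This sketch only models the arithmetic equivalence; the flip-flop pipelining is a timing optimization with no functional analogue here.

```python
def column_output(weights, inputs, group_size):
    # Partition the column's processing engines into row groups.  In the
    # hardware each group computes its partial sum in parallel; the list
    # comprehension stands in for that parallelism.
    partials = [
        sum(w * x for w, x in zip(weights[g:g + group_size],
                                  inputs[g:g + group_size]))
        for g in range(0, len(weights), group_size)
    ]
    # The per-group partial sums combine into the column output.
    return sum(partials)
```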

    Resizable scratchpad memory
    Invention grant

    Publication number: US12045475B1

    Publication date: 2024-07-23

    Application number: US17457502

    Filing date: 2021-12-03

    Abstract: Techniques for implementing a dynamically resizable memory region for alternative use in a memory are described. The techniques may include using two concurrent address maps corresponding to two address ranges for a memory represented as an array of memory blocks. The first address range can be mapped to the memory with starting addresses of the memory blocks incrementing sequentially along each row. The second address range can be mapped to the memory with starting addresses of the memory blocks incrementing sequentially along each column. When an access request is received having a target address belonging to the first address range, the target address is provided as the memory address to access the memory. When an access request is received having a target address belonging to the second address range, the target address is translated by address translation logic into a memory address to access the memory.
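    The two concurrent address maps amount to a transpose of the block index. A sketch of the translation logic, assuming a row-major physical layout of rows × cols blocks and treating the first range as a pass-through (base addresses and parameter names are assumptions):

```python
def to_memory_address(target, first_base, second_base, rows, cols, block_size):
    size = rows * cols * block_size
    if first_base <= target < first_base + size:
        # First address range: block starting addresses increment
        # sequentially along each row, matching the physical layout,
        # so the target maps through unchanged.
        return target - first_base
    if second_base <= target < second_base + size:
        # Second address range: block starting addresses increment
        # sequentially along each column, so the block index is transposed
        # before forming the physical memory address.
        offset = target - second_base
        block, within = divmod(offset, block_size)
        col, row = divmod(block, rows)
        return (row * cols + col) * block_size + within
    raise ValueError("target address outside both mapped ranges")
```

    Because both maps cover the same physical blocks, carving out the resizable region is just a matter of how much of each range software chooses to use.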

    Data selection circuit
    Invention grant

    Publication number: US11868875B1

    Publication date: 2024-01-09

    Application number: US16127170

    Filing date: 2018-09-10

    CPC classification number: G06N3/065 G06N3/049 G11C11/54

    Abstract: Provided are systems and methods for operating a neural network processor, wherein the processor includes an input selector circuit that can be configured to select the data that will be input into the processor's computational array. In various implementations, the selector circuit can determine, for a row of the array, whether the row input will be the output from a buffer memory or data that the input selector circuit has selected for a different row. The row can receive an input feature map from a set of input data or an input feature map that was selected for inputting into a different row, such that the input feature map is input into more than one row at a time. The selector circuit can also include a delay circuit, so that the duplicated input feature map can be input into the computational array later than the original input feature map.
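    The selector's routing decision can be modeled per row: each row either takes its own buffer-memory stream or replays another row's selected stream after a delay, so one input feature map feeds more than one row. This is a behavioral sketch only; the function name, the `duplicate_from` encoding, and the `None` padding for delay cycles are assumptions.

```python
def route_row_inputs(buffer_streams, duplicate_from, delay):
    # buffer_streams[i]: the stream the buffer memory would feed into row i.
    # duplicate_from[i]: None if row i takes its own buffer output, or the
    # index of the row whose selected input is duplicated into row i.
    routed = {}
    for row, src in enumerate(duplicate_from):
        if src is None:
            routed[row] = list(buffer_streams[row])
        else:
            # The duplicated input feature map enters this row `delay`
            # cycles after the original enters the source row.
            routed[row] = [None] * delay + list(buffer_streams[src])
    return routed
```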

    Memory access for multiple circuit components

    Publication number: US11775430B1

    Publication date: 2023-10-03

    Application number: US17000842

    Filing date: 2020-08-24

    CPC classification number: G06F12/08 G06N3/063 G11C11/418 G11C11/419

    Abstract: Disclosed herein are techniques for performing memory access. In one embodiment, an integrated circuit includes a port and an access engine. The integrated circuit is coupled with a memory device. The access engine is configured to: receive, from an access requester device, a request to access data stored at a memory device; and based on receiving the request: provide, via the port, a sequential access of a plurality of portions of the data to the access requester device; and access the plurality of portions of the data in a parallel form at the memory device for the access requester device. The sequential access can include a sequential write access or a sequential read access of the plurality of portions of the data.
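    The serial-port / parallel-memory split can be sketched at the byte level: one wide memory access moves all portions of the data at once, while the port carries them one portion at a time. Function names and the use of byte strings are illustrative assumptions.

```python
def read_sequentially(memory_word, port_width):
    # The access engine fetches every portion of the data from the memory
    # device in parallel (a single wide read), then presents the portions
    # to the access requester one at a time through the port.
    return [memory_word[i:i + port_width]
            for i in range(0, len(memory_word), port_width)]

def write_sequentially(portions):
    # The requester pushes portions through the port one at a time; the
    # access engine assembles them and commits them to the memory device
    # in parallel (a single wide write).
    return b"".join(portions)
```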

    Registers for restricted memory
    Invention grant

    Publication number: US10678479B1

    Publication date: 2020-06-09

    Application number: US16204943

    Filing date: 2018-11-29

    Abstract: Provided are integrated circuits and methods for operating integrated circuits. An integrated circuit can include a plurality of memory banks and an execution engine including a set of execution components. Each execution component can be associated with a respective memory bank, and can read from and write to only the respective memory bank. The integrated circuit can further include a set of registers each associated with a respective memory bank from the plurality of memory banks. The integrated circuit can further be operable to load to or store from the set of registers in parallel, and load to or store from the set of registers serially. A parallel operation followed by a serial operation enables data to be moved from many memory banks into one memory bank. A serial operation followed by a parallel operation enables data to be moved from one memory bank into many memory banks.
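    The many-banks-to-one-bank movement described above (parallel load followed by serial store) can be sketched with plain lists standing in for banks and registers; the function name and argument layout are assumptions.

```python
def gather_to_one_bank(banks, registers, src_addr, dst_bank, dst_addr):
    # Parallel load: each register reads from its associated memory bank
    # (the only bank its execution component can access) at src_addr.
    for i, bank in enumerate(banks):
        registers[i] = bank[src_addr]
    # Serial store: the registers are written one after another into a
    # single bank, gathering data from many banks into one.
    for i, value in enumerate(registers):
        banks[dst_bank][dst_addr + i] = value
```

    Running the two phases in the opposite order (serial load from one bank, then parallel store) would scatter data from one bank out to many, the other movement the abstract describes.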
