DATAFLOW ACCELERATOR ARCHITECTURE FOR GENERAL MATRIX-MATRIX MULTIPLICATION AND TENSOR COMPUTATION IN DEEP LEARNING

    Publication No.: US20210374210A1

    Publication date: 2021-12-02

    Application No.: US17374988

    Filing date: 2021-07-13

    Abstract: A general matrix-matrix multiplication (GEMM) dataflow accelerator circuit is disclosed that includes a smart 3D stacking DRAM architecture. The accelerator circuit includes a memory bank, a peripheral lookup table stored in the memory bank, and a first vector buffer to store a first vector that is used as a row address into the lookup table. The circuit includes a second vector buffer to store a second vector that is used as a column address into the lookup table, and lookup table buffers to receive and store lookup table entries from the lookup table. The circuit further includes adders to sum the first product and a second product, and an output buffer to store the sum. The lookup table buffers determine a product of the first vector and the second vector without performing a multiply operation. The embodiments include a hierarchical lookup architecture to reduce latency. Accumulation results are propagated in a systolic manner.
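
The multiply-free product described in this abstract can be illustrated with a minimal Python sketch. It assumes 4-bit unsigned operands and a 16x16 table; the names `build_lut`, `lut_dot`, and `lut_gemm` are hypothetical and do not come from the patent. The table is filled by repeated addition, and each product is then read with one vector element as the row address and the other as the column address, so only additions remain on the compute path.

```python
# Sketch only: operand width and table layout are assumptions, not patent details.
BITS = 4
SIZE = 1 << BITS  # 16 possible values per operand


def build_lut(size=SIZE):
    """Build lut[row][col] == row * col using repeated addition only."""
    lut = []
    for row in range(size):
        entries = []
        acc = 0
        for _ in range(size):
            entries.append(acc)
            acc += row  # next column entry is the previous one plus `row`
        lut.append(entries)
    return lut


LUT = build_lut()


def lut_dot(a_vec, b_vec, lut=LUT):
    """Dot product where each product is a table read, then summed by an adder."""
    if len(a_vec) != len(b_vec):
        raise ValueError("vectors must have the same length")
    total = 0
    for a, b in zip(a_vec, b_vec):
        total += lut[a][b]  # a is the row address, b is the column address
    return total


def lut_gemm(A, B, lut=LUT):
    """Matrix product C = A x B built from lut_dot (lists of lists of small ints)."""
    columns = list(zip(*B))
    return [[lut_dot(row, col, lut) for col in columns] for row in A]
```

For example, `lut_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]])` reads eight table entries and adds them pairwise, with no multiplication at lookup time.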

    SMART IN-MODULE REFRESH FOR DRAM
    Type: Invention application (status: in force)

    Publication No.: US20160307619A1

    Publication date: 2016-10-20

    Application No.: US14850938

    Filing date: 2015-09-10

    Abstract: A Dynamic Random Access Memory (DRAM) module (105) is disclosed. The DRAM module (105) can include a plurality of banks (205-1, 205-2, 205-3, 205-4) to store data and a refresh engine (115) that can be used to refresh one of the plurality of banks (205-1, 205-2, 205-3, 205-4). The DRAM module (105) can also include a Smart Refresh Component (305) that can advise the refresh engine (115) which bank to refresh using an out-of-order per-bank refresh. The Smart Refresh Component (305) can use a logic (415) to identify a farthest bank in the pending transactions in the transaction queue (430) at the time of refresh.
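
One way to read the "farthest bank" selection is sketched below in Python. It is a sketch under stated assumptions: "farthest" is interpreted as the bank whose first pending access sits deepest in the transaction queue, banks with no pending access are treated as infinitely far, and the function name `pick_refresh_bank` and the list-of-bank-ids queue model are hypothetical.

```python
def pick_refresh_bank(banks, transaction_queue):
    """Return the candidate bank whose first pending access is farthest from the queue head.

    banks             -- bank ids still due for refresh (assumed input)
    transaction_queue -- bank ids of pending transactions, head of queue first
    """
    if not banks:
        raise ValueError("no banks are due for refresh")

    # Record the queue position of the first pending access to each bank.
    first_use = {}
    for position, bank in enumerate(transaction_queue):
        first_use.setdefault(bank, position)

    # A bank with no pending access is treated as infinitely far away,
    # so refreshing it delays no queued transaction.
    return max(banks, key=lambda bank: first_use.get(bank, float("inf")))
```

With banks `[0, 1, 2, 3]` and a queue of `[2, 0, 3, 1]`, the sketch picks bank 1, because every other bank is needed sooner.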

    HBM RAS CACHE ARCHITECTURE
    Type: Invention application

    Publication No.: US20250077370A1

    Publication date: 2025-03-06

    Application No.: US18953042

    Filing date: 2024-11-19

    Abstract: According to one general aspect, an apparatus may include a plurality of stacked integrated circuit dies that include a memory cell die and a logic die. The memory cell die may be configured to store data at a memory address. The logic die may include an interface to the stacked integrated circuit dies that is configured to communicate memory accesses between the memory cell die and at least one external device. The logic die may include a reliability circuit configured to ameliorate data errors within the memory cell die. The reliability circuit may include a spare memory configured to store data, and an address table configured to map a memory address associated with an error to the spare memory. The reliability circuit may be configured to determine whether the memory access is associated with an error and, if so, to complete the memory access with the spare memory.
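
The address-table redirection in this abstract can be sketched as a small Python model. Everything named here is hypothetical (`ReliabilityCircuit`, `mark_faulty`, the dictionary standing in for the memory cell die, the slot allocation policy); the sketch only shows the general idea of completing an access from spare memory when its address is mapped as faulty.

```python
class ReliabilityCircuit:
    """Toy model: redirect accesses to faulty addresses into a spare memory."""

    def __init__(self, spare_slots):
        self.cells = {}                     # stands in for the memory cell die
        self.spare = [None] * spare_slots   # spare memory on the logic die
        self.address_table = {}             # faulty address -> spare slot index

    def mark_faulty(self, address):
        """Map an address associated with an error to the next free spare slot."""
        if address in self.address_table:
            return self.address_table[address]
        slot = len(self.address_table)
        if slot >= len(self.spare):
            raise RuntimeError("no spare slots left")
        self.address_table[address] = slot
        return slot

    def write(self, address, value):
        slot = self.address_table.get(address)
        if slot is None:
            self.cells[address] = value     # normal path: memory cell die
        else:
            self.spare[slot] = value        # error path: spare memory

    def read(self, address):
        slot = self.address_table.get(address)
        if slot is None:
            return self.cells.get(address)
        return self.spare[slot]
```

In this model, a write issued after `mark_faulty(0x10)` lands in the spare memory, and later reads of `0x10` are served from there while the original cell is left untouched.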

    DATAFLOW ACCELERATOR ARCHITECTURE FOR GENERAL MATRIX-MATRIX MULTIPLICATION AND TENSOR COMPUTATION IN DEEP LEARNING

    Publication No.: US20200183837A1

    Publication date: 2020-06-11

    Application No.: US16388863

    Filing date: 2019-04-18

    Abstract: A tensor computation dataflow accelerator semiconductor circuit is disclosed. The dataflow accelerator includes a DRAM bank and a peripheral array of multiply-and-add units disposed adjacent to the DRAM bank. The peripheral array of multiply-and-add units is configured to form a pipelined dataflow chain in which partial output data from one multiply-and-add unit from among the array of multiply-and-add units is fed into another multiply-and-add unit from among the array of multiply-and-add units for data accumulation. Near-DRAM-processing dataflow (NDP-DF) accelerator unit dies may be stacked atop a base die. The base die may be disposed on a passive silicon interposer adjacent to a processor or a controller. The NDP-DF accelerator units may process partial matrix output data in parallel. The partial matrix output data may be propagated in a forward or backward direction. The tensor computation dataflow accelerator may perform a partial matrix transposition.
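
The accumulation chain in this abstract can be pictured with a short Python sketch. It is a behavioral model only: the function names `mac_chain` and `chained_matvec` are hypothetical, timing and pipelining are not modeled, and the sketch assumes each multiply-and-add unit adds one weight-input product to the partial sum it receives from the previous unit.

```python
def mac_chain(weights, inputs, partial_in=0):
    """Pass a partial sum through a chain of multiply-and-add units.

    Each loop iteration stands for one unit: it multiplies its weight by its
    input and adds the result to the partial output of the previous unit.
    """
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    partial = partial_in
    for w, x in zip(weights, inputs):
        partial = partial + w * x
    return partial


def chained_matvec(matrix, vector):
    """Matrix-vector product with one chain per output element."""
    return [mac_chain(row, vector) for row in matrix]
```

A long dot product can also be split across two chains by feeding the first chain's result into the second as `partial_in`, for example `mac_chain([3], [6], mac_chain([1, 2], [4, 5]))`.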

    SOFTWARE STACK AND PROGRAMMING FOR DPU OPERATIONS

    Publication No.: US20180121130A1

    Publication date: 2018-05-03

    Application No.: US15426015

    Filing date: 2017-02-06

    Abstract: A system includes a library, a compiler, a driver and at least one dynamic random access memory (DRAM) processing unit (DPU). The library may determine at least one DPU operation corresponding to a received command. The compiler may form at least one DPU instruction for the DPU operation. The driver may send the at least one DPU instruction to at least one DPU. The DPU may include at least one computing cell array that includes a plurality of DRAM-based computing cells arranged in an array having at least one column. The at least one column may include at least three rows of DRAM-based computing cells configured to provide a logic function that operates on a first row and a second row of the at least three rows and to store a result of the logic function in a third row of the at least three rows.
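
The three-row computing cell arrangement can be modeled in a few lines of Python. The abstract does not name the logic function, so NOR is used here purely as an assumed example; the class `ComputingCellArray`, its `execute` method, and the bit-list row representation are likewise hypothetical.

```python
def nor(a, b):
    """Assumed example logic function on single bits (not named in the abstract)."""
    return 1 - (a | b)


class ComputingCellArray:
    """Toy model of an array whose columns each span at least three rows."""

    def __init__(self, rows, cols):
        if rows < 3:
            raise ValueError("at least three rows are required")
        self.rows = [[0] * cols for _ in range(rows)]

    def execute(self, src1, src2, dst, logic=nor):
        """Apply `logic` column by column to two rows and store the result in a third."""
        self.rows[dst] = [
            logic(a, b) for a, b in zip(self.rows[src1], self.rows[src2])
        ]
        return self.rows[dst]
```

Loading rows 0 and 1 with `[0, 0, 1, 1]` and `[0, 1, 0, 1]` and calling `execute(0, 1, 2)` writes the column-wise NOR into row 2, which is how a single DPU instruction could operate on every column at once in this model.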

    DATAFLOW ACCELERATOR ARCHITECTURE FOR GENERAL MATRIX-MATRIX MULTIPLICATION AND TENSOR COMPUTATION IN DEEP LEARNING

    Publication No.: US20200184001A1

    Publication date: 2020-06-11

    Application No.: US16388860

    Filing date: 2019-04-18

    Abstract: A general matrix-matrix multiplication (GEMM) dataflow accelerator circuit is disclosed that includes a smart 3D stacking DRAM architecture. The accelerator circuit includes a memory bank, a peripheral lookup table stored in the memory bank, and a first vector buffer to store a first vector that is used as a row address into the lookup table. The circuit includes a second vector buffer to store a second vector that is used as a column address into the lookup table, and lookup table buffers to receive and store lookup table entries from the lookup table. The circuit further includes adders to sum the first product and a second product, and an output buffer to store the sum. The lookup table buffers determine a product of the first vector and the second vector without performing a multiply operation. The embodiments include a hierarchical lookup architecture to reduce latency. Accumulation results are propagated in a systolic manner.
