DISTRIBUTED REGISTER FILE CACHE TO REDUCE L1 BANDWIDTH REQUIREMENTS

    Publication No.: US20250068473A1

    Publication date: 2025-02-27

    Application No.: US18453867

    Filing date: 2023-08-22

    Abstract: Described herein is a graphics processor comprising a graphics processing cluster coupled with a memory interface, the graphics processing cluster including a plurality of processing resources. A processing resource of the plurality of processing resources includes a register file and first circuitry. The register file includes a first plurality of registers associated with a first hardware thread of a plurality of hardware threads of the processing resource and a second plurality of registers associated with a second hardware thread of the plurality of hardware threads. The first circuitry is configured to facilitate access to memory on behalf of the plurality of hardware threads and to store metadata for memory access requests from the plurality of hardware threads.

    CROSS-THREAD REGISTER SHARING FOR MATRIX MULTIPLICATION COMPUTE

    Publication No.: US20240168807A1

    Publication date: 2024-05-23

    Application No.: US18056949

    Filing date: 2022-11-18

    CPC classification number: G06F9/5027 G06F9/48 G06F9/522 G06F15/8046

    Abstract: An apparatus to facilitate cross-thread register sharing for matrix multiplication compute is disclosed. The apparatus includes matrix acceleration hardware comprising a plurality of data processing units, wherein the respective plurality of data processing units are to: receive a decoded instruction for a first thread having a first register space, wherein the decoded instruction is for a matrix multiplication operation and comprises an indication to utilize a second register space of a second thread for an operand of the decoded instruction for the first thread; access the second register space of the second thread to obtain data for the operand of the decoded instruction; and perform the matrix multiplication operation for the first thread using the data for the operand from the second register space of the second thread.
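
The operand lookup described in this abstract can be illustrated with a minimal software sketch. It is not the patented hardware: the per-thread register spaces are modeled as plain dictionaries, and the names `execute_matmul`, `src_b_thread`, and the register labels are hypothetical, chosen only for illustration.

```python
def matmul(a, b):
    """Plain row-by-column matrix product on nested lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def execute_matmul(register_spaces, thread_id, dst, src_a, src_b, src_b_thread=None):
    """Run a matrix multiply for `thread_id`.

    `src_b_thread` is a hypothetical stand-in for the instruction's indication
    that operand B lives in another thread's register space. When it is None,
    both operands come from the issuing thread's own registers.
    """
    own = register_spaces[thread_id]
    b_space = own if src_b_thread is None else register_spaces[src_b_thread]
    # The result is always written to the issuing thread's register space.
    own[dst] = matmul(own[src_a], b_space[src_b])
    return own[dst]


# Thread 0 holds A in r0; thread 1 holds B in r5.
register_spaces = {
    0: {"r0": [[1, 2], [3, 4]]},
    1: {"r5": [[5, 6], [7, 8]]},
}
result = execute_matmul(register_spaces, 0, "r1", "r0", "r5", src_b_thread=1)
```

In this sketch thread 0 reads B directly from thread 1's registers instead of keeping its own copy, which is the sharing idea the abstract describes.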

    DETERMINISTIC BROADCASTING FROM SHARED MEMORY

    Publication No.: US20240111534A1

    Publication date: 2024-04-04

    Application No.: US17957486

    Filing date: 2022-09-30

    CPC classification number: G06F9/30047 G06F9/3009 G06F9/542

    Abstract: Embodiments described herein provide a technique to enable a broadcast load from an L1 cache or shared local memory to register files associated with hardware threads of a graphics core. One embodiment provides a graphics processor comprising a cache memory and a graphics core coupled with the cache memory. The graphics core includes a plurality of hardware threads and memory access circuitry to facilitate access to memory by the plurality of hardware threads. The graphics core is configurable to process a plurality of load requests from the plurality of hardware threads, detect duplicate load requests within the plurality of load requests, perform a single read from the cache memory in response to the duplicate load requests, and transmit data associated with the duplicate load requests to requesting hardware threads.
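
The detect-duplicates, read-once, broadcast sequence in this abstract can be sketched in software as follows. This is an illustrative model only, assuming one outstanding load per thread and a dictionary standing in for the cache; `broadcast_loads` and its parameters are hypothetical names, not part of the patent.

```python
from collections import defaultdict


def broadcast_loads(requests, memory):
    """Serve load requests with one read per distinct address.

    requests: list of (thread_id, address) pairs, one load per thread (assumed).
    memory:   dict mapping address -> value, standing in for the L1 cache.
    Returns (results, reads): per-thread data and the number of reads performed.
    """
    # Group requesting threads by address to detect duplicate loads.
    waiters = defaultdict(list)
    for thread_id, address in requests:
        waiters[address].append(thread_id)

    results = {}
    reads = 0
    for address, thread_ids in waiters.items():
        value = memory[address]  # single read for all duplicates
        reads += 1
        for thread_id in thread_ids:  # broadcast to every requester
            results[thread_id] = value
    return results, reads


memory = {0x40: 7, 0x80: 9}
requests = [(0, 0x40), (1, 0x40), (2, 0x80), (3, 0x40)]
results, reads = broadcast_loads(requests, memory)
```

Here four load requests touch only two distinct addresses, so the sketch performs two reads rather than four.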

    Use of a single instruction set architecture (ISA) instruction for vector normalization

    Publication No.: US11593069B2

    Publication date: 2023-02-28

    Application No.: US17477939

    Filing date: 2021-09-17

    Abstract: Embodiments described herein are generally directed to an improved vector normalization instruction. An embodiment of a method includes, responsive to receipt by a GPU of a single instruction specifying a vector normalization operation to be performed on V vectors: (i) generating V squared length values, N at a time, by a first processing unit, by, for each N sets of inputs, each representing multiple component vectors for N of the vectors, performing N parallel dot product operations on the N sets of inputs; and (ii) generating V sets of outputs representing multiple normalized component vectors of the V vectors, N at a time, by a second processing unit, by, for each N squared length values of the V squared length values, performing N parallel operations on the N squared length values, wherein each of the N parallel operations implements a combination of a reciprocal square root function and a vector scaling function.
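
The two stages named in this abstract (dot products for squared lengths, then reciprocal square root combined with scaling) can be written out as a sequential sketch. The hardware performs N operations in parallel on two processing units; the loops below only mimic the batching. The function name `normalize_vectors` and the default batch size are assumptions, and zero-length vectors are not handled.

```python
import math


def normalize_vectors(vectors, n=4):
    """Normalize V vectors, N at a time, in two stages (illustrative only)."""
    # Stage 1: squared length of each vector via a dot product with itself,
    # processed in batches of N.
    squared_lengths = []
    for start in range(0, len(vectors), n):
        batch = vectors[start:start + n]
        squared_lengths.extend(sum(c * c for c in v) for v in batch)

    # Stage 2: reciprocal square root of each squared length, fused with
    # scaling of the vector components, again in batches of N.
    outputs = []
    for start in range(0, len(vectors), n):
        batch = zip(vectors[start:start + n], squared_lengths[start:start + n])
        for v, sq in batch:
            inv_len = 1.0 / math.sqrt(sq)  # raises ZeroDivisionError for a zero vector
            outputs.append([c * inv_len for c in v])
    return outputs


normalized = normalize_vectors([[3.0, 4.0, 0.0], [0.0, 0.0, 2.0]], n=2)
```

Mathematically, each output is $\hat{v} = v \cdot (v \cdot v)^{-1/2}$, so the first vector above becomes approximately (0.6, 0.8, 0.0) up to floating-point rounding.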
