Patent search ap:("INTEL CORPORATION") AND inv:"Supratim Pal" Page 1

1.

Invention application
DISTRIBUTED REGISTER FILE CACHE TO REDUCE L1 BANDWIDTH REQUIREMENTS (status: valid)

Publication (announcement) number: US20250068473A1

Publication (announcement) date: 2025-02-27

Application number: US18453867

Application date: 2023-08-22

Applicant: Intel Corporation

Inventor: Jorge Eduardo Parra Osorio , Jiasheng Chen , Supratim Pal , James Valerio

IPC: G06F9/50 , G06F9/30

Abstract: Described herein is a graphics processor comprising a graphics processing cluster coupled with the memory interface, the graphics processing cluster including a plurality of processing resources, a processing resource of the plurality of processing resources including a register file including a first plurality of registers associated with a first hardware thread of a plurality of hardware threads of the processing resource and a second plurality of registers associated with a second hardware thread of the plurality of hardware threads of the processing resource and first circuitry configured to facilitate access to memory on behalf of the plurality of hardware threads and store metadata for memory access requests from the plurality of hardware threads.

2.

Invention publication
CROSS-THREAD REGISTER SHARING FOR MATRIX MULTIPLICATION COMPUTE (status: under examination, published)

Publication (announcement) number: US20240168807A1

Publication (announcement) date: 2024-05-23

Application number: US18056949

Application date: 2022-11-18

Applicant: Intel Corporation

Inventor: Jorge Eduardo Parra Osorio , Guei-Yuan Lueh , Maxim Kazakov , Fangwen Fu , Supratim Pal , Kaiyu Chen

IPC: G06F9/50 , G06F9/48 , G06F9/52 , G06F15/80

CPC classification number: G06F9/5027 , G06F9/48 , G06F9/522 , G06F15/8046

Abstract: An apparatus to facilitate cross-thread register sharing for matrix multiplication compute is disclosed. The apparatus includes matrix acceleration hardware comprising a plurality of data processing units, wherein the respective plurality of data processing units are to: receive a decoded instruction for a first thread having a first register space, wherein the decoded instruction is for a matrix multiplication operation and comprises an indication to utilize a second register space of a second thread for an operand of the decoded instruction for the first thread; access the second register space of the second thread to obtain data for the operand of the decoded instruction; and perform the matrix multiplication operation for the first thread using the data for the operand from the second register space of the second thread.
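The operand-sharing idea in this abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the claimed hardware: the dictionary-based register spaces, the register names, and the `b_from_thread` parameter are all hypothetical stand-ins for the "indication to utilize a second register space" carried by the decoded instruction.

```python
# Hypothetical model: each hardware thread owns a register space, represented
# here as a dict mapping a register name to a matrix (list of rows).

def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def execute_matmul(register_spaces, thread_id, dst, src_a, src_b, b_from_thread=None):
    """Run dst = src_a @ src_b on behalf of thread_id.

    If b_from_thread is given (an assumed encoding of the cross-thread
    indication), operand src_b is read from that thread's register space
    instead of the issuing thread's own space.
    """
    own = register_spaces[thread_id]
    b_space = register_spaces[b_from_thread] if b_from_thread is not None else own
    own[dst] = matmul(own[src_a], b_space[src_b])  # result stays with the issuing thread
    return own[dst]


spaces = {
    0: {"r0": [[1, 2], [3, 4]]},  # first thread's register space
    1: {"r1": [[5, 6], [7, 8]]},  # second thread's register space
}
result = execute_matmul(spaces, 0, "r2", "r0", "r1", b_from_thread=1)
print(result)  # [[19, 22], [43, 50]]
```

In this model the shared operand is read in place, so the first thread never copies the second thread's matrix into its own register space.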

3.

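Invention publication
DETERMINISTIC BROADCASTING FROM SHARED MEMORY (status: under examination, published)

Publication (announcement) number: US20240111534A1

Publication (announcement) date: 2024-04-04

Application number: US17957486

Application date: 2022-09-30

Applicant: Intel Corporation

Inventor: Fangwen Fu , Chunhui Mei , Maxim Kazakov , Biju George , Jorge Parra , Supratim Pal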

IPC: G06F9/30 , G06F9/54

CPC classification number: G06F9/30047 , G06F9/3009 , G06F9/542

Abstract: Embodiments described herein provide a technique to enable a broadcast load from an L1 cache or shared local memory to register files associated with hardware threads of a graphics core. One embodiment provides a graphics processor comprising a cache memory and a graphics core coupled with the cache memory. The graphics core includes a plurality of hardware threads and memory access circuitry to facilitate access to memory by the plurality of hardware threads. The graphics core is configurable to process a plurality of load requests from the plurality of hardware threads, detect duplicate load requests within the plurality of load requests, perform a single read from the cache memory in response to the duplicate load requests, and transmit data associated with the duplicate load requests to requesting hardware threads.
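The duplicate-detection step described in this abstract can be sketched in software as grouping requests by address before touching the cache. This is a minimal illustration, assuming requests are `(thread_id, address)` pairs and the cache is a plain dictionary; the function name `broadcast_loads` and the data layout are hypothetical, not taken from the patent.

```python
def broadcast_loads(cache, requests):
    """Serve load requests with one cache read per unique address.

    requests: list of (thread_id, address) pairs.
    Returns (delivered, reads), where delivered maps each thread_id to the
    {address: value} data it receives and reads counts the cache accesses.
    """
    # Detect duplicates: collect every requesting thread under its address.
    by_address = {}
    for thread_id, address in requests:
        by_address.setdefault(address, []).append(thread_id)

    delivered = {}
    reads = 0
    for address, thread_ids in by_address.items():
        value = cache[address]  # single read for all duplicates of this address
        reads += 1
        for thread_id in thread_ids:  # broadcast to every requesting thread
            delivered.setdefault(thread_id, {})[address] = value
    return delivered, reads


cache = {0x100: "A", 0x140: "B"}
requests = [(0, 0x100), (1, 0x100), (2, 0x140), (3, 0x100)]
delivered, reads = broadcast_loads(cache, requests)
print(reads)  # 2
```

Four requests are served with two cache reads here, which is the bandwidth saving the abstract is pointing at.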

4.

Invention publication
GRAPHICS PROCESSORS AND GRAPHICS PROCESSING UNITS HAVING DOT PRODUCT ACCUMULATE INSTRUCTION FOR HYBRID FLOATING POINT FORMAT (status: under examination, published)

Publication (announcement) number: US20230195685A1

Publication (announcement) date: 2023-06-22

Application number: US18170900

Application date: 2023-02-17

Applicant: Intel Corporation

Inventor: Subramaniam Maiyuran , Shubra Marwaha , Ashutosh Garg , Supratim Pal , Jorge Parra , Chandra Gurram , Varghese George , Darin Starkey , Guei-Yuan Lueh

IPC: G06F15/78 , G06F9/30 , G06F12/128 , G06F17/16 , G06F12/0811 , G06F12/02 , G06F12/0866 , G06F7/544 , G06F9/50 , G06F17/18 , G06F9/38 , G06F12/0891 , G06F12/06 , G06F12/0888 , G06F12/0802 , G06T1/60 , G06F12/0871 , G06T1/20 , H03M7/46 , G06F12/0875 , G06F12/0862 , G06F15/80 , G06F12/0897 , G06F12/0893 , G06F12/0804 , G06F12/0882 , G06F7/575 , G06F12/1009 , G06F12/0895 , G06F7/58 , G06T15/06 , G06N3/08

CPC classification number: G06F15/7839 , G06F9/30043 , G06F12/128 , G06F17/16 , G06F12/0811 , G06F12/0238 , G06F12/0866 , G06F9/30014 , G06F7/5443 , G06F9/5077 , G06F12/0246 , G06F17/18 , G06F9/3887 , G06F12/0891 , G06F12/0607 , G06F12/0888 , G06F12/0802 , G06T1/60 , G06F9/30079 , G06F12/0871 , G06F9/30036 , G06T1/20 , H03M7/46 , G06F12/0215 , G06F12/0875 , G06F12/0862 , G06F15/8046 , G06F9/30047 , G06F9/30065 , G06F12/0897 , G06F9/5011 , G06F12/0893 , G06F12/0804 , G06F12/0882 , G06F9/3001 , G06F7/575 , G06F12/1009 , G06F9/3004 , G06F12/0895 , G06F7/588 , G06F2212/401 , G06F2212/1044 , G06F9/3867 , G06F9/3818 , G06F9/3802 , G06F2212/455 , G06F2212/1021 , G06F2212/60 , G06F2212/1008 , G06T15/06 , G06N3/08 , G06F2212/302

Abstract: Described herein is a graphics processing unit (GPU) configured to receive an instruction having multiple operands, where the instruction is a single instruction multiple data (SIMD) instruction configured to use a bfloat16 (BF16) number format and the BF16 number format is a sixteen-bit floating point format having an eight-bit exponent. The GPU can process the instruction using the multiple operands, where to process the instruction includes to perform a multiply operation, perform an addition to a result of the multiply operation, and apply a rectified linear unit function to a result of the addition.
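As a rough illustration of the arithmetic named in this abstract, the sketch below rounds inputs to BF16 (which shares the sign bit and eight exponent bits of IEEE float32 and keeps seven mantissa bits) and then applies multiply, add, and ReLU per lane. Several details are assumptions made for the example rather than anything stated in the source: round-to-nearest-even conversion, accumulation in Python's double precision, and the function names `to_bf16` and `mad_relu`.

```python
import struct


def to_bf16(x):
    """Round a finite float (within float32 range) to the nearest bfloat16 value.

    The result is returned as an ordinary Python float. NaN is passed through
    unchanged; this sketch does not model NaN payloads.
    """
    if x != x:  # NaN
        return x
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 low bits that BF16 drops.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def mad_relu(src0, src1, src2):
    """Per-lane relu(bf16(src0) * bf16(src1) + src2) over equal-length lists."""
    return [max(0.0, to_bf16(a) * to_bf16(b) + c) for a, b, c in zip(src0, src1, src2)]


print(to_bf16(3.14159))                               # 3.140625
print(mad_relu([1.0, 2.0], [3.0, -4.0], [0.5, 1.0]))  # [3.5, 0.0]
```

The second lane shows the ReLU step: the multiply-add result is negative, so the output is clamped to zero.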

5.

Granted invention patent
Instruction and logic for systolic dot product with accumulate (status: valid)

Publication (announcement) number: US11640297B2

Publication (announcement) date: 2023-05-02

Application number: US17304153

Application date: 2021-06-15

Applicant: Intel Corporation

Inventor: Subramaniam Maiyuran , Guei-Yuan Lueh , Supratim Pal , Ashutosh Garg , Chandra S. Gurram , Jorge E. Parra , Junjie Gu , Konrad Trifunovic , Hong Bin Liao , Mike B. MacPherson , Shubh B. Shah , Shubra Marwaha , Stephen Junkins , Timothy R. Bauer , Varghese George , Weiyu Chen

IPC: G06F9/30 , G06T1/20 , G06F9/38

Abstract: Embodiments described herein provide for an instruction and associated logic to enable GPGPU program code to access special purpose hardware logic to accelerate dot product operations. One embodiment provides for a graphics processing unit comprising a fetch unit to fetch an instruction for execution and a decode unit to decode the instruction into a decoded instruction. The decoded instruction is a matrix instruction to cause the graphics processing unit to perform a parallel dot product operation. The GPGPU also includes systolic dot product circuitry to execute the decoded instruction across one or more SIMD lanes using multiple systolic layers, wherein to execute the decoded instruction, a dot product computed at a first systolic layer is to be output to a second systolic layer, wherein each systolic layer includes one or more sets of interconnected multipliers and adders, each set of multipliers and adders to generate a dot product.
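The layer-to-layer flow described in this abstract can be modeled for a single SIMD lane as a chain of partial dot products. The sketch below is an illustration under assumptions, not the patented circuit: the accumulator is assumed to enter at the first layer, the operands are assumed to be split evenly across layers, and `systolic_dot_accumulate` and `depth` are hypothetical names.

```python
def systolic_dot_accumulate(acc, a, b, depth):
    """Compute acc + dot(a, b) by passing a running sum through `depth` layers.

    Each layer takes an equal slice of the operands, forms that slice's dot
    product with its own multipliers and adders, adds the value handed over
    by the previous layer, and outputs the sum to the next layer.
    """
    if len(a) != len(b) or len(a) % depth != 0:
        raise ValueError("operands must have equal length divisible by depth")
    chunk = len(a) // depth
    partial = acc  # assumed: the accumulator input feeds the first layer
    for layer in range(depth):
        lo = layer * chunk
        layer_dot = sum(x * y for x, y in zip(a[lo:lo + chunk], b[lo:lo + chunk]))
        partial = partial + layer_dot  # this layer's output feeds the next layer
    return partial


print(systolic_dot_accumulate(10, [1, 2, 3, 4], [5, 6, 7, 8], depth=2))  # 80
```

A hardware array would pipeline the layers so that different lanes or instructions occupy different layers at once; the loop here only reproduces the data dependency.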

6.

Granted invention patent
Use of a single instruction set architecture (ISA) instruction for vector normalization (status: valid)

Publication (announcement) number: US11593069B2

Publication (announcement) date: 2023-02-28

Application number: US17477939

Application date: 2021-09-17

Applicant: Intel Corporation

Inventor: Abhishek Rhisheekesan , Supratim Pal , Shashank Lakshminarayana , Subramaniam Maiyuran

IPC: G06F7/552 , G06F9/30

Abstract: Embodiments described herein are generally directed to an improved vector normalization instruction. An embodiment of a method includes, responsive to receipt by a GPU of a single instruction specifying a vector normalization operation to be performed on V vectors: (i) generating V squared length values, N at a time, by a first processing unit, by, for each N sets of inputs, each representing multiple component vectors for N of the vectors, performing N parallel dot product operations on the N sets of inputs; and (ii) generating V sets of outputs representing multiple normalized component vectors of the V vectors, N at a time, by a second processing unit, by, for each N squared length values of the V squared length values, performing N parallel operations on the N squared length values, wherein each of the N parallel operations implements a combination of a reciprocal square root function and a vector scaling function.
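The two stages in this abstract map onto a short loop: a dot product of each vector with itself gives its squared length, and a reciprocal square root followed by scaling gives the normalized vector. The sketch below is a sequential software illustration under assumptions; the function name `normalize_vectors`, the list-based vector layout, and the batching parameter `n` are hypothetical, and the batches that hardware would process in parallel are simply iterated here.

```python
import math


def normalize_vectors(vectors, n):
    """Normalize V vectors, handling them in batches of up to n.

    Zero-length vectors are not handled in this sketch (they would raise
    ZeroDivisionError).
    """
    normalized = []
    for start in range(0, len(vectors), n):
        batch = vectors[start:start + n]
        # Stage 1 (first processing unit): n dot products -> n squared lengths.
        squared_lengths = [sum(c * c for c in v) for v in batch]
        # Stage 2 (second processing unit): reciprocal square root, then scale.
        for v, sq in zip(batch, squared_lengths):
            inv_len = 1.0 / math.sqrt(sq)
            normalized.append([c * inv_len for c in v])
    return normalized


out = normalize_vectors([[3.0, 4.0], [0.0, 2.0], [1.0, 0.0]], n=2)
print(out)  # approximately [[0.6, 0.8], [0.0, 1.0], [1.0, 0.0]]
```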

7.

Granted invention patent
Instructions and logic for vector multiply add with zero skipping (status: valid)

Publication (announcement) number: US11314515B2

Publication (announcement) date: 2022-04-26

Application number: US16724831

Application date: 2019-12-23

Applicant: Intel Corporation

Inventor: Supratim Pal , Sasikanth Avancha , Ishwar Bhati , Wei-Yu Chen , Dipankar Das , Ashutosh Garg , Chandra S. Gurram , Junjie Gu , Guei-Yuan Lueh , Subramaniam Maiyuran , Jorge E. Parra , Sudarshan Srinivasan , Varghese George

IPC: G06F9/38 , G06F9/30

Abstract: Embodiments described herein provide for an instruction and associated logic to enable a vector multiply add instruction with automatic zero skipping for sparse input. One embodiment provides for a general-purpose graphics processor comprising logic to perform operations comprising fetching a hardware macro instruction having a predicate mask, a repeat count, and a set of initial operands, where the initial operands include a destination operand and multiple source operands. The hardware macro instruction is configured to perform one or more multiply/add operations on input data associated with a set of matrices.
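One way to picture zero skipping is a repeated multiply/add in which iterations whose scalar input is zero are never issued. The sketch below is an illustration only: the abstract does not say how the predicate mask is derived or applied, so deriving the skip decision from the scalar's value is an assumption, and `madd_zero_skip`, `scalars`, and `rows` are hypothetical names.

```python
def madd_zero_skip(dst, scalars, rows, repeat_count):
    """Accumulate dst[j] += scalars[i] * rows[i][j] for i in range(repeat_count).

    Iterations whose scalar is zero are skipped, since they cannot change dst.
    Returns the updated destination and the number of multiply/add steps
    actually issued.
    """
    issued = 0
    for i in range(repeat_count):
        if scalars[i] == 0:
            continue  # assumed skip condition: a zero input contributes nothing
        for j in range(len(dst)):
            dst[j] += scalars[i] * rows[i][j]
        issued += 1
    return dst, issued


dst, issued = madd_zero_skip([0, 0], [2, 0, 3], [[1, 1], [9, 9], [1, 2]], repeat_count=3)
print(dst, issued)  # [5, 8] 2
```

With one zero among three scalars, only two of the three multiply/add steps run, and the result matches the dense computation.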

8.

Granted invention patent
Computing efficient cross channel operations in parallel computing machines using systolic arrays (status: valid)

Publication (announcement) number: US11182337B1

Publication (announcement) date: 2021-11-23

Application number: US16900236

Application date: 2020-06-12

Applicant: Intel Corporation

Inventor: Subramaniam Maiyuran , Jorge Parra , Supratim Pal , Chandra Gurram

IPC: G06F15/80 , G06N20/00 , G06F17/16

Abstract: An apparatus to facilitate computing efficient cross channel operations in parallel computing machines using systolic arrays is disclosed. The apparatus includes a plurality of registers and one or more processing elements communicably coupled to the plurality of registers. The one or more processing elements include a systolic array circuit to perform cross-channel operations on source data received from a single source register of the plurality of registers, the systolic array circuit modified to receive inputs from the single source register and route elements of the single source register to multiple channels in the systolic array circuit.

9.

Granted invention patent
Instruction and logic for systolic dot product with accumulate (status: valid)

Publication (announcement) number: US11042370B2

Publication (announcement) date: 2021-06-22

Application number: US15957728

Application date: 2018-04-19

Applicant: Intel Corporation

Inventor: Subramaniam Maiyuran , Guei-Yuan Lueh , Supratim Pal , Ashutosh Garg , Chandra S. Gurram , Jorge E. Parra , Junjie Gu , Konrad Trifunovic , Hong Bin Liao , Mike B. Macpherson , Shubh B. Shah , Shubra Marwaha , Stephen Junkins , Timothy R. Bauer , Varghese George , Weiyu Chen

IPC: G06F9/30 , G06T1/20 , G06F9/38

Abstract: Embodiments described herein provide for an instruction and associated logic to enable GPGPU program code to access special purpose hardware logic to accelerate dot product operations. One embodiment provides for a graphics processing unit comprising a fetch unit to fetch an instruction for execution and a decode unit to decode the instruction into a decoded instruction. The decoded instruction is a matrix instruction to cause the graphics processing unit to perform a parallel dot product operation. The GPGPU also includes a systolic dot product unit to execute the decoded instruction across one or more SIMD lanes using multiple systolic layers, wherein to execute the decoded instruction, a dot product computed at a first systolic layer is to be output to a second systolic layer, wherein each systolic layer includes one or more sets of interconnected multipliers and adders, each set of multipliers and adders to generate a dot product.

10.

Granted invention patent
Register sharing mechanism (status: valid)

Publication (announcement) number: US10983794B2

Publication (announcement) date: 2021-04-20

Application number: US16443285

Application date: 2019-06-17

Applicant: Intel Corporation

Inventor: Guei-Yuan Lueh , Subramaniam Maiyuran , Weiyu Chen , Konrad Trifunovic , Supratim Pal , Chandra S. Gurram , Jorge E. Parra , Pratik J. Ashar , Tomasz Bujewski

IPC: G06F9/30 , G06F9/54 , G06F9/48 , G06F12/1009 , G06F9/50

Abstract: A processor to facilitate register sharing is disclosed. The processor includes a plurality of execution units (EUs), each including a General Purpose Register File (GRF) having a plurality of registers; and register sharing hardware to divide the plurality of registers into a first set of registers dedicated for execution of a first set of threads and a second set of registers shared for execution of a second set of threads.
