-
Publication No.: US12248788B2
Publication Date: 2025-03-11
Application No.: US17691690
Filing Date: 2022-03-10
Applicant: NVIDIA Corporation
Inventor: Prakash Bangalore Prabhakar , Gentaro Hirota , Ronny Krashinsky , Ze Long , Brian Pharris , Rajballav Dash , Jeff Tuckey , Jerome F. Duluk, Jr. , Lacky Shah , Luke Durant , Jack Choquette , Eric Werness , Naman Govil , Manan Patel , Shayani Deb , Sandeep Navada , John Edmondson , Greg Palmer , Wish Gandhi , Ravi Manyam , Apoorv Parle , Olivier Giroux , Shirish Gadre , Steve Heinrich
Abstract: Distributed shared memory (DSMEM) comprises blocks of memory that are distributed or scattered across a processor (such as a GPU). Threads executing on a processing core local to one memory block are able to access a memory block local to a different processing core. In one embodiment, shared access to these DSMEM allocations distributed across a collection of processing cores is implemented by communications between the processing cores. Such distributed shared memory provides very low latency memory access for processing cores located in proximity to the memory blocks, and also gives more distant processing cores a way to access the memory blocks using interconnects that do not interfere with the processing cores' access to main or global memory such as memory backed by an L2 cache. Such distributed shared memory supports cooperative parallelism and strong scaling across multiple processing cores by permitting data sharing and communications that were previously possible only within the same processing core.
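The idea in the abstract can be sketched abstractly as follows. This is a minimal toy model in Python, not NVIDIA's hardware implementation: each "core" owns a local memory block, and any core can address any other core's block through a (rank, offset) pair rather than through global memory. The class and method names are illustrative assumptions, not terms from the patent.

```python
# Toy model of distributed shared memory (DSMEM): each core owns a fast local
# block, but all blocks in a cluster are mutually addressable by (rank, offset).

class Core:
    def __init__(self, rank, block_size):
        self.rank = rank
        self.local_block = [0] * block_size  # fast, core-local storage

class Cluster:
    """A collection of cores whose local blocks are mutually addressable."""
    def __init__(self, num_cores, block_size):
        self.cores = [Core(r, block_size) for r in range(num_cores)]

    def store(self, rank, offset, value):
        # Write into the block local to core `rank` -- possibly a remote core.
        self.cores[rank].local_block[offset] = value

    def load(self, rank, offset):
        # Read from the block local to core `rank` -- possibly a remote core.
        return self.cores[rank].local_block[offset]

cluster = Cluster(num_cores=4, block_size=8)
# Core 0 writes into core 3's local block; core 3 can then read it locally.
cluster.store(rank=3, offset=5, value=42)
print(cluster.load(rank=3, offset=5))  # -> 42
```

In real CUDA code on hardware that supports this feature, the analogous mechanism is a thread block cluster, where one block maps another block's shared memory into its own address space.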
-
Publication No.: US11182207B2
Publication Date: 2021-11-23
Application No.: US16450508
Filing Date: 2019-06-24
Applicant: NVIDIA CORPORATION
Inventor: Gentaro Hirota , Brian Pharris , Jeff Tuckey , Robert Overman , Stephen Jones
Abstract: Techniques are disclosed for reducing the latency between the completion of a producer task and the launch of a consumer task that depends on the producer task. Such latency exists when the information needed to launch the consumer task is unavailable when the producer task completes. Accordingly, techniques are disclosed in which a task management unit initiates the retrieval of the information needed to launch the consumer task from memory in parallel with the launch of the producer task. Because the retrieval of such information is initiated in parallel with the launch of the producer task, the information is often available when the producer task completes, allowing the consumer task to be launched without delay. The disclosed techniques therefore reduce the latency between completing the producer task and launching the consumer task.
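The overlap described above can be sketched with ordinary host-side threads. This is an illustrative model of the scheduling idea, not the patented hardware: the metadata fetch is started at the same time the producer task is launched, so by the time the producer finishes, the consumer's launch information is already resident. All function names and delay values are hypothetical.

```python
import threading
import time

# Sketch: while the producer task runs, a "task management unit" thread
# prefetches the consumer task's launch metadata from (slow) memory.

def fetch_launch_metadata(task_id, result, delay=0.05):
    time.sleep(delay)                 # models a slow memory access
    result["metadata"] = {"task": task_id, "grid": (4, 4)}

def run_producer(delay=0.1):
    time.sleep(delay)                 # models the producer task's execution

result = {}
prefetch = threading.Thread(target=fetch_launch_metadata,
                            args=("consumer", result))
prefetch.start()                      # fetch begins when the producer launches
run_producer()                        # producer runs in parallel with the fetch
prefetch.join()

# The producer has completed and the metadata is already available,
# so the consumer task can launch without waiting on memory.
print(result["metadata"]["task"])     # -> consumer
```

The point of the sketch is the ordering: because the fetch latency (0.05) is hidden under the producer's runtime (0.1), the consumer sees zero additional launch delay.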
-
Publication No.: US10423424B2
Publication Date: 2019-09-24
Application No.: US13631685
Filing Date: 2012-09-28
Applicant: NVIDIA Corporation
Inventor: Lincoln G. Garlick , Philip Browning Johnson , Rafal Zboinski , Jeff Tuckey , Samuel H. Duncan , Peter C. Mills
IPC: G06F9/38
Abstract: Techniques are disclosed for performing an auxiliary operation via a compute engine associated with a host computing device. The method includes determining that the auxiliary operation is directed to the compute engine, and determining that the auxiliary operation is associated with a first context comprising a first set of state parameters. The method further includes determining a first subset of state parameters related to the auxiliary operation based on the first set of state parameters. The method further includes transmitting the first subset of state parameters to the compute engine, and transmitting the auxiliary operation to the compute engine. One advantage of the disclosed technique is that surface area and power consumption are reduced within the processor by utilizing copy engines that have no context switching capability.
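The state-subsetting step in this abstract can be sketched as a simple filter. This is an illustrative model, not the patented circuit: instead of switching the compute engine to a full context, the host selects only the state parameters the auxiliary operation depends on and transmits them alongside the operation. All parameter names and the operation taxonomy are hypothetical.

```python
# Sketch: a copy engine with no context-switching capability receives only
# the subset of context state that its operation actually needs.

FULL_CONTEXT = {
    "page_table_base": 0x4000,
    "copy_alignment": 256,
    "shader_state": "...",        # irrelevant to a copy operation
    "raster_state": "...",        # irrelevant to a copy operation
}

# Which state parameters each class of auxiliary operation depends on.
RELEVANT_STATE = {"copy": ("page_table_base", "copy_alignment")}

def dispatch_auxiliary(op, context):
    # Determine the subset of state parameters related to the operation,
    # then transmit the subset together with the operation itself.
    subset = {k: context[k] for k in RELEVANT_STATE[op]}
    return {"op": op, "state": subset}

packet = dispatch_auxiliary("copy", FULL_CONTEXT)
print(sorted(packet["state"]))  # -> ['copy_alignment', 'page_table_base']
```

Shipping only the relevant subset is what lets the engine omit context-switching hardware, which is the surface-area and power advantage the abstract claims.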