-
Publication No.: US20240362084A1
Publication Date: 2024-10-31
Application No.: US18677458
Filing Date: 2024-05-29
Applicant: Intel Corporation
CPC Classes: G06F9/52, G06F9/4881, G06F9/522, G06T1/20
Abstract: An apparatus to facilitate thread synchronization is disclosed. The apparatus comprises one or more processors to execute a producer thread to generate a plurality of commands, execute a consumer thread to process the plurality of commands, and synchronize the producer thread with the consumer thread, including updating a producer fence value upon generation of in-order commands, updating a consumer fence value upon processing of the in-order commands, and performing a synchronization operation based on the consumer fence value, wherein the producer fence value and the consumer fence value each correspond to an order position of an in-order command.
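The fence mechanism in this abstract can be sketched in software. The following is a minimal illustrative model, not the patented hardware: fence values track the order position of the last in-order command generated or processed, and a synchronization operation blocks until the consumer fence reaches a target position. All class and function names are assumptions for the example.

```python
import threading

class FenceSync:
    """Illustrative producer/consumer fence pair keyed by order position."""

    def __init__(self):
        self.producer_fence = 0   # order position of last command generated
        self.consumer_fence = 0   # order position of last command processed
        self._cv = threading.Condition()

    def signal_produced(self, position):
        with self._cv:
            self.producer_fence = position
            self._cv.notify_all()

    def signal_consumed(self, position):
        with self._cv:
            self.consumer_fence = position
            self._cv.notify_all()

    def wait_produced(self, position):
        with self._cv:
            self._cv.wait_for(lambda: self.producer_fence >= position)

    def wait_consumed(self, position):
        # Synchronization operation: block until the consumer has processed
        # the in-order command at `position`.
        with self._cv:
            self._cv.wait_for(lambda: self.consumer_fence >= position)

def run_demo(n_commands=5):
    sync, queue = FenceSync(), []

    def producer():
        for i in range(1, n_commands + 1):
            queue.append(f"cmd{i}")
            sync.signal_produced(i)       # fence advances per in-order command

    def consumer():
        for i in range(1, n_commands + 1):
            sync.wait_produced(i)
            _ = queue[i - 1]              # process the command in order
            sync.signal_consumed(i)

    t1, t2 = threading.Thread(target=producer), threading.Thread(target=consumer)
    t1.start(); t2.start()
    sync.wait_consumed(n_commands)        # synchronize on the last command
    t1.join(); t2.join()
    return sync.producer_fence, sync.consumer_fence
```

Using order positions rather than a simple flag lets a waiter target any intermediate command, which is the property the abstract's "order position" wording suggests.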
-
Publication No.: US20240329990A1
Publication Date: 2024-10-03
Application No.: US18740430
Filing Date: 2024-06-11
Applicant: Apple Inc.
Inventors: Deepankar Duggal, Kulin N Kothari, Mridul Agarwal, Chang Xu, Yanran Yang, Richard F Russo, Yuan C Chou, Douglas C Holman
CPC Classes: G06F9/30087, G06F9/3802, G06F9/522
Abstract: A system, e.g., a system on a chip (SOC), may include one or more processors. A processor may execute an instruction synchronization barrier (ISB) instruction to enforce an ordering constraint on instructions. To execute the ISB instruction, the processor may determine whether contexts of the processor required for execution of instructions older than the ISB instruction are consumed for the older instructions. Responsive to determining that the contexts are consumed for the older instructions, the processor may initiate fetching of an instruction younger than the ISB instruction, without waiting for the older instructions to retire.
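The key claim is an ordering relaxation: younger-instruction fetch may begin after context consumption but before retirement. A toy event-ordering model (not real hardware, names illustrative) makes the distinction concrete:

```python
# Toy event-ordering model of the relaxed ISB described in the abstract:
# fetching of an instruction younger than the ISB begins once the contexts
# required by older instructions are consumed, without waiting for those
# older instructions to retire.

def isb_timeline(older_count=3):
    events = []
    consumed = [False] * older_count
    for i in range(older_count):
        consumed[i] = True                  # older instruction i reads its context
        events.append(f"consume[{i}]")
    if all(consumed):                       # the ISB's relaxed wait condition
        events.append("fetch-younger")      # younger fetch starts here...
    for i in range(older_count):
        events.append(f"retire[{i}]")       # ...before retirement completes
    return events
```

A strict ISB would place `fetch-younger` after every `retire[i]`; the relaxed ordering above is what shortens the pipeline bubble.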
-
Publication No.: US20240289132A1
Publication Date: 2024-08-29
Application No.: US18660763
Filing Date: 2024-05-10
Applicant: NVIDIA Corporation
Inventors: Apoorv PARLE, Ronny KRASHINSKY, John EDMONDSON, Jack CHOQUETTE, Shirish GADRE, Steve HEINRICH, Manan PATEL, Prakash Bangalore PRABHAKAR, JR., Ravi MANYAM, Wish GANDHI, Lacky SHAH, Alexander L. Minkin
CPC Classes: G06F9/3887, G06F9/522, G06F13/1689, G06F13/4022, G06T1/20, G06T1/60, H04L49/101
Abstract: This specification describes a programmatic multicast technique enabling one thread (for example, in a cooperative group array (CGA) on a GPU) to request data on behalf of one or more other threads (for example, executing on respective processor cores of the GPU). The multicast is supported by tracking circuitry that interfaces between multicast requests received from processor cores and the available memory. The multicast is designed to reduce cache (for example, layer 2 cache) bandwidth utilization, enabling strong scaling and smaller tile sizes.
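The bandwidth saving comes from collapsing N per-core loads into one tracked request. A minimal software sketch (illustrative names; the real mechanism is hardware tracking circuitry) counts backing-cache reads to show the effect:

```python
# Sketch of the tracked-multicast idea: one requesting thread fetches a value
# on behalf of a group of consumer cores, so the backing cache is read once
# instead of once per consumer.

class MulticastTracker:
    def __init__(self, memory):
        self.memory = memory
        self.cache_reads = 0      # stands in for L2 bandwidth consumed

    def multicast_load(self, address, consumers):
        value = self.memory[address]    # a single cache/memory read...
        self.cache_reads += 1
        # ...which the tracking logic then delivers to every consumer core.
        return {core: value for core in consumers}

def demo():
    mem = {0x100: 42}
    tracker = MulticastTracker(mem)
    delivered = tracker.multicast_load(0x100, consumers=[0, 1, 2, 3])
    return tracker.cache_reads, delivered
```

Without multicast, the same four consumers would issue four reads; here the read count stays at one regardless of group size.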
-
Publication No.: US12073262B2
Publication Date: 2024-08-27
Application No.: US17338898
Filing Date: 2021-06-04
Applicant: Graphcore Limited
Inventors: Ola Torudbakken, Wei-Lin Guay
IPC Classes: G06F9/52, G06F9/38, G06F9/54, G06F15/173
CPC Classes: G06F9/522, G06F9/3851, G06F9/543, G06F9/544, G06F15/173, G06F15/17325
Abstract: A host system compiles a set of local programs which are provided over a network to a plurality of subsystems. By defining the synchronisation activity on the host, and then providing that information to the subsystems, the host can service a large number of subsystems. The defined synchronisation activity includes defining the synchronisation groups between which synchronisation barriers occur and the points during program execution at which data exchange with the host occurs. Defining synchronisation activity between the subsystems allows a large number of subsystems to be connected whilst minimising the required exchanges with the host.
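The host's role here is to pre-compile a synchronization plan so subsystems coordinate among themselves. A hedged sketch of that split, with threads standing in for subsystems and all names assumed for illustration:

```python
import threading

# Sketch: the host pre-compiles which subsystems form each barrier group and
# ships the plan with the local programs; subsystems then synchronise among
# themselves without further host traffic.

def compile_plan(groups):
    # groups: list of lists of subsystem ids that share a synchronisation barrier
    return [(set(g), threading.Barrier(len(g))) for g in groups]

def run_subsystem(sid, plan, log, lock):
    for members, barrier in plan:
        if sid in members:
            barrier.wait()            # synchronise with the defined group
            with lock:
                log.append(sid)

def demo():
    # Host-side: two pairwise groups, then a global barrier across all four.
    plan = compile_plan([[0, 1], [2, 3], [0, 1, 2, 3]])
    log, lock = [], threading.Lock()
    threads = [threading.Thread(target=run_subsystem, args=(s, plan, log, lock))
               for s in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    return sorted(log)
```

Because the plan is fixed at compile time, no subsystem needs to negotiate group membership at runtime, which is what keeps host exchanges minimal.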
-
Publication No.: US20240231957A9
Publication Date: 2024-07-11
Application No.: US17973234
Filing Date: 2022-10-25
Applicant: Intel Corporation
Inventors: Fangwen Fu, Chunhui Mei, John A. Wiegert, Yongsheng Liu, Ben J. Ashbaugh
CPC Classes: G06F9/522, G06F9/4881
Abstract: Embodiments described herein provide a technique to facilitate the synchronization of workgroups executed on multiple graphics cores of a graphics core cluster. One embodiment provides a graphics core including a cache memory and a graphics core coupled with the cache memory. The graphics core includes execution resources to execute an instruction via a plurality of hardware threads and barrier circuitry to synchronize execution of the plurality of hardware threads, wherein the barrier circuitry is configured to provide a plurality of re-usable named barriers.
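"Re-usable named barriers" can be modelled in software with a name-to-barrier map whose barriers reset after each full arrival. A minimal sketch, with all names and counts assumed:

```python
import threading

class NamedBarriers:
    """Illustrative pool of named, re-usable barriers."""

    def __init__(self, specs):
        # specs: {barrier_name: participating thread count}
        self._barriers = {name: threading.Barrier(n) for name, n in specs.items()}

    def arrive_and_wait(self, name):
        # threading.Barrier auto-resets once all parties arrive, so the same
        # named barrier can be re-used across successive phases.
        self._barriers[name].wait()

def demo(phases=3, threads=4):
    nb = NamedBarriers({"tile_sync": threads})
    counts = [0] * phases
    lock = threading.Lock()

    def worker():
        for p in range(phases):
            with lock:
                counts[p] += 1
            nb.arrive_and_wait("tile_sync")   # same named barrier every phase

    ts = [threading.Thread(target=worker) for _ in range(threads)]
    for t in ts: t.start()
    for t in ts: t.join()
    return counts
```

Naming lets independent thread groups pick distinct barriers from a shared pool instead of contending for a single hardware barrier.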
-
Publication No.: US20240220335A1
Publication Date: 2024-07-04
Application No.: US18148993
Filing Date: 2022-12-30
Applicant: Intel Corporation
Inventors: Chunhui Mei, Yongsheng Liu, John A. Wiegert, Vasanth Ranganathan, Ben J. Ashbaugh, Fangwen Fu, Hong Jiang, Guei-Yuan Lueh, James Valerio, Alan M. Curtis, Maxim Kazakov
CPC Classes: G06F9/522, G06F9/3877, G06F9/5072, G06F9/3887
Abstract: Synchronization for data multicast in compute core clusters is described. An example of an apparatus includes one or more processors including at least a graphics processing unit (GPU), the GPU including one or more clusters of cores and a memory, wherein each cluster of cores includes a plurality of cores, each core including one or more processing resources, shared local memory, and gateway circuitry, wherein the GPU is to initiate broadcast of a data element from a producer core to one or more consumer cores, and synchronize the broadcast of the data element utilizing the gateway circuitry of the producer core and the one or more consumer cores, and wherein synchronizing the broadcast of the data element includes establishing a multi-core barrier for broadcast of the data element.
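The multi-core barrier ensures consumers do not read the broadcast element before the producer has published it. A hedged software model of that handshake (threads stand in for cores; names are illustrative):

```python
import threading

# Illustrative model of the gateway-synchronized broadcast: a multi-core
# barrier makes consumer cores wait until the producer core has published the
# data element, after which every consumer reads the same value.

def broadcast_demo(num_consumers=3, value=7):
    barrier = threading.Barrier(num_consumers + 1)   # producer + consumers
    shared = {}                                      # stands in for shared memory
    received = [None] * num_consumers

    def producer():
        shared["elem"] = value       # publish the element before the barrier
        barrier.wait()

    def consumer(i):
        barrier.wait()               # multi-core barrier: wait for producer
        received[i] = shared["elem"]

    ts = [threading.Thread(target=producer)]
    ts += [threading.Thread(target=consumer, args=(i,)) for i in range(num_consumers)]
    for t in ts: t.start()
    for t in ts: t.join()
    return received
```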
-
Publication No.: US20240220314A1
Publication Date: 2024-07-04
Application No.: US18091441
Filing Date: 2022-12-30
Inventor: Harris Gasparakis
CPC Classes: G06F9/4881, G06F9/522
Abstract: A processing system flexibly schedules workgroups across kernels based on data dependencies between workgroups to enhance processing efficiency. The workgroups are partitioned into subsets based on the data dependencies, and workgroups of a first subset that produces data are scheduled to execute immediately before workgroups of a second subset that consumes the data generated by the first subset. Thus, the processing system does not execute one kernel at a time, but instead schedules workgroups across kernels based on the data dependencies between them. By limiting the sizes of the subsets to the amount of data that can be stored at local caches, the processing system increases the probability that data to be consumed by workgroups of a subset will be resident in a local cache and will not require a memory access.
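The scheduling idea reduces to interleaving producer and consumer workgroup subsets whose size is bounded by cache capacity. A minimal sketch under assumed names and a one-to-one producer/consumer dependency:

```python
# Sketch: partition producer/consumer workgroups into subsets no larger than
# the cache can hold, then run each producer subset immediately before the
# consumer subset that depends on its output.

def schedule(producers, consumers, cache_capacity):
    """Interleave dependent workgroups in cache-sized subsets."""
    order = []
    for start in range(0, len(producers), cache_capacity):
        subset = slice(start, start + cache_capacity)
        order.extend(producers[subset])    # produce a cache-sized batch...
        order.extend(consumers[subset])    # ...then immediately consume it
    return order

def demo():
    producers = [f"P{i}" for i in range(4)]   # workgroups of the producing kernel
    consumers = [f"C{i}" for i in range(4)]   # workgroups of the consuming kernel
    return schedule(producers, consumers, cache_capacity=2)
```

Running kernels whole (`P0..P3` then `C0..C3`) would evict `P0`'s output before `C0` runs; the interleaved order keeps each batch cache-resident.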
-
Publication No.: US11995463B2
Publication Date: 2024-05-28
Application No.: US17237752
Filing Date: 2021-04-22
IPC Classes: G06F9/48, G06F3/06, G06F9/52, G06N20/00, G06F9/30, G06F9/38, G06F15/78, G06F15/80, G06F17/16, G06N5/04
CPC Classes: G06F9/4818, G06F3/0604, G06F3/0659, G06F3/0673, G06F9/4881, G06F9/52, G06N20/00, G06F9/30018, G06F9/30087, G06F9/3869, G06F9/3871, G06F9/522, G06F15/7807, G06F15/7846, G06F15/8053, G06F17/16, G06N5/04
Abstract: A system to support a machine learning (ML) operation comprises an array-based inference engine comprising a plurality of processing tiles, each comprising at least one or more of an on-chip memory (OCM) configured to maintain data for local access by components in the processing tile and one or more processing units configured to perform one or more computation tasks on the data in the OCM by executing a set of task instructions. The system also comprises a data streaming engine configured to stream data between a memory and the OCMs, and an instruction streaming engine configured to distribute said set of task instructions to the corresponding processing tiles to control their operations and to synchronize said set of task instructions executed by each processing tile, such that each processing tile waits for its current task to finish before starting a new one.
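The tile-level rule — finish the current task before starting the next — is a per-task barrier across tiles. A toy model with threads as tiles (all names and counts are assumptions):

```python
import threading

# Toy model of the instruction-streaming synchronization: every processing
# tile must finish its current task before any tile starts the next one,
# enforced here with a barrier between consecutive tasks.

def run_tiles(num_tiles=3, num_tasks=4):
    barrier = threading.Barrier(num_tiles)
    log, lock = [], threading.Lock()

    def tile(tid):
        for task in range(num_tasks):
            with lock:
                log.append(task)     # execute the current task
            barrier.wait()           # wait for all tiles before the next task

    ts = [threading.Thread(target=tile, args=(i,)) for i in range(num_tiles)]
    for t in ts: t.start()
    for t in ts: t.join()
    return log                       # task ids in global execution order
```

The barrier guarantees the log is grouped by task: every tile's task k entry precedes any tile's task k+1 entry.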
-
Publication No.: US20240111609A1
Publication Date: 2024-04-04
Application No.: US17958213
Filing Date: 2022-09-30
Applicant: Intel Corporation
Inventors: Biju George, Supratim Pal, James Valerio, Vasanth Ranganathan, Fangwen Fu, Chunhui Mei
CPC Classes: G06F9/522, G06F9/30098
Abstract: Low-latency synchronization utilizing local team barriers for thread team processing is described. An example of an apparatus includes one or more processors including a graphics processor, the graphics processor including a plurality of processing resources; and memory for storage of data including data for graphics processing, wherein the graphics processor is to receive a request for establishment of a local team barrier for a thread team, the thread team being allocated to a first processing resource, the thread team including multiple threads; determine requirements and designated threads for the local team barrier; and establish the local team barrier in a local register of the first processing resource based at least in part on the requirements and designated threads for the local barrier.
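A local team barrier backed by a register can be modelled as a counter that tracks arrivals from the designated threads and releases the team when full. A sketch with the counter standing in for the local register (names illustrative):

```python
import threading

class LocalTeamBarrier:
    """Illustrative team barrier: a counter models the local-register state."""

    def __init__(self, team_size):
        self.team_size = team_size
        self.arrived = 0            # the "local register": arrival count
        self.generation = 0         # distinguishes successive uses
        self._cv = threading.Condition()

    def arrive_and_wait(self):
        with self._cv:
            gen = self.generation
            self.arrived += 1
            if self.arrived == self.team_size:
                self.arrived = 0          # reset the register for re-use
                self.generation += 1
                self._cv.notify_all()     # last arrival releases the team
            else:
                self._cv.wait_for(lambda: self.generation != gen)

def demo(team_size=4):
    bar = LocalTeamBarrier(team_size)
    done, lock = [], threading.Lock()

    def member(i):
        bar.arrive_and_wait()
        with lock:
            done.append(i)

    ts = [threading.Thread(target=member, args=(i,)) for i in range(team_size)]
    for t in ts: t.start()
    for t in ts: t.join()
    return sorted(done)
```

Keeping the state in a resource-local register (rather than shared memory) is what makes the barrier low-latency: designated threads never leave their processing resource to synchronize.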
-
Publication No.: US11947928B2
Publication Date: 2024-04-02
Application No.: US17017557
Filing Date: 2020-09-10
CPC Classes: G06F7/5443, G06F9/3867, G06F9/522, G06F40/20, G06N3/063
Abstract: Systems and methods are provided for a multi-die dot-product engine (DPE) to provision large-scale machine learning inference applications. The multi-die DPE leverages a multi-chip architecture. For example, a multi-chip interface can include a plurality of DPE chips, where each DPE chip performs inference computations for performing deep learning operations. A hardware interface between a memory of a host computer and the plurality of DPE chips communicatively connects the plurality of DPE chips to the memory of the host computer system during an inference operation such that the deep learning operations are spanned across the plurality of DPE chips. Due to the multi-die architecture, multiple silicon devices are allowed to be used for inference, thereby enabling power-efficient inference for large-scale machine learning applications and complex deep neural networks. The multi-die DPE can be used to build a multi-device DNN inference system performing specific applications, such as object recognition, with high accuracy.
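Spanning a dot-product workload across dies amounts to slicing the operands, computing per-die partial multiply-accumulates, and summing the partials. A hedged sketch (die count and slicing are assumptions for the example, not the patented interface):

```python
# Sketch of spanning a dot product across multiple DPE dies: the host splits
# the vectors into per-die slices, each die computes a partial
# multiply-accumulate, and the host combines the partial results.

def die_dot(a_slice, b_slice):
    # One die's multiply-accumulate over its slice of the operands.
    return sum(x * y for x, y in zip(a_slice, b_slice))

def multi_die_dot(a, b, num_dies):
    chunk = (len(a) + num_dies - 1) // num_dies   # ceil-divide work per die
    partials = [die_dot(a[i * chunk:(i + 1) * chunk],
                        b[i * chunk:(i + 1) * chunk])
                for i in range(num_dies)]
    return sum(partials)   # host combines the per-die partial results

def demo():
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = [8, 7, 6, 5, 4, 3, 2, 1]
    return multi_die_dot(a, b, num_dies=4)
```

Because dot products are associative sums, the partition is exact: the multi-die result equals the single-device result regardless of how the slices fall.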
-