THREAD SYNCHRONIZATION MECHANISM
    Invention Publication

    Publication No.: US20240362084A1

    Publication Date: 2024-10-31

    Application No.: US18677458

    Filing Date: 2024-05-29

    Applicant: Intel Corporation

    IPC Classification: G06F9/52 G06F9/48 G06T1/20

    Abstract: An apparatus to facilitate thread synchronization is disclosed. The apparatus comprises one or more processors to execute a producer thread to generate a plurality of commands, execute a consumer thread to process the plurality of commands, and synchronize the producer thread with the consumer thread, including updating a producer fence value upon generation of in-order commands, updating a consumer fence value upon processing of the in-order commands, and performing a synchronization operation based on the consumer fence value, wherein the producer fence value and the consumer fence value each correspond to the order position of an in-order command.
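
    The fence scheme in this abstract resembles a monotonic-counter handshake between two threads. Below is a minimal C++ sketch (all names are hypothetical, not from the patent) in which the producer fence holds the order position of the last command generated, the consumer fence holds the order position of the last command processed, and the synchronization operation waits on the consumer fence:

        // Sketch of a producer/consumer fence handshake. Each fence stores the
        // order position of the most recent in-order command generated or
        // processed, as in the abstract above.
        #include <atomic>
        #include <cstdint>
        #include <thread>
        #include <vector>

        std::atomic<uint64_t> producer_fence{0};
        std::atomic<uint64_t> consumer_fence{0};

        constexpr uint64_t kNumCommands = 16;   // assumed command count
        std::vector<int> commands(kNumCommands);

        void producer() {
            for (uint64_t pos = 1; pos <= kNumCommands; ++pos) {
                commands[pos - 1] = static_cast<int>(pos);            // generate
                producer_fence.store(pos, std::memory_order_release); // publish
            }
            // Synchronization operation: block until the consumer fence shows
            // that every in-order command has been processed.
            while (consumer_fence.load(std::memory_order_acquire) < kNumCommands)
                std::this_thread::yield();
        }

        void consumer() {
            for (uint64_t pos = 1; pos <= kNumCommands; ++pos) {
                while (producer_fence.load(std::memory_order_acquire) < pos)
                    std::this_thread::yield();                        // wait
                int cmd = commands[pos - 1];                          // process
                (void)cmd;
                consumer_fence.store(pos, std::memory_order_release);
            }
        }

        int main() {
            std::thread p(producer), c(consumer);
            p.join();
            c.join();
        }

    Because both fences only ever increase, either side can compare a fence against a target order position without locking; the release/acquire pairing orders the command data with the fence update.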

    NAMED AND CLUSTER BARRIERS
    Invention Publication

    Publication No.: US20240231957A9

    Publication Date: 2024-07-11

    Application No.: US17973234

    Filing Date: 2022-10-25

    Applicant: Intel Corporation

    IPC Classification: G06F9/52 G06F9/48

    CPC Classification: G06F9/522 G06F9/4881

    Abstract: Embodiments described herein provide a technique to facilitate the synchronization of workgroups executed on multiple graphics cores of a graphics core cluster. One embodiment provides a graphics core cluster including a cache memory and a graphics core coupled with the cache memory. The graphics core includes execution resources to execute an instruction via a plurality of hardware threads and barrier circuitry to synchronize execution of the plurality of hardware threads, wherein the barrier circuitry is configured to provide a plurality of re-usable named barriers.
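
    A rough software analogue of re-usable named barriers is a registry that maps a barrier name to a reusable rendezvous point. In the sketch below (hypothetical names; C++20 std::barrier, which is itself reusable across phases, stands in for the barrier circuitry), any thread can synchronize on a barrier simply by naming it:

        // Registry mapping a barrier name to a reusable barrier shared by a
        // fixed number of threads.
        #include <barrier>
        #include <cstddef>
        #include <map>
        #include <mutex>
        #include <string>
        #include <thread>
        #include <vector>

        class NamedBarriers {
        public:
            explicit NamedBarriers(std::ptrdiff_t threads_per_barrier)
                : count_(threads_per_barrier) {}

            void arrive_and_wait(const std::string& name) {
                get(name).arrive_and_wait();
            }

        private:
            std::barrier<>& get(const std::string& name) {
                std::lock_guard<std::mutex> lock(mu_);
                auto it = barriers_.find(name);
                if (it == barriers_.end())
                    it = barriers_.try_emplace(name, count_).first;
                return it->second;  // map nodes are stable; safe to use unlocked
            }

            std::ptrdiff_t count_;
            std::mutex mu_;
            std::map<std::string, std::barrier<>> barriers_;
        };

        int main() {
            constexpr int kThreads = 4;
            NamedBarriers barriers(kThreads);
            std::vector<std::thread> team;
            for (int i = 0; i < kThreads; ++i)
                team.emplace_back([&] {
                    barriers.arrive_and_wait("phase");  // first rendezvous
                    barriers.arrive_and_wait("phase");  // same barrier, re-used
                });
            for (auto& t : team) t.join();
        }

    The name-to-barrier indirection is the point of the illustration: threads agree on a label rather than sharing a pointer to a single hardwired barrier.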

    DATA DEPENDENCY-AWARE SCHEDULING
    Invention Publication

    Publication No.: US20240220314A1

    Publication Date: 2024-07-04

    Application No.: US18091441

    Filing Date: 2022-12-30

    Inventor: Harris Gasparakis

    IPC Classification: G06F9/48 G06F9/52

    CPC Classification: G06F9/4881 G06F9/522

    Abstract: A processing system flexibly schedules workgroups across kernels based on data dependencies between workgroups to enhance processing efficiency. The workgroups are partitioned into subsets based on the data dependencies, and workgroups of a first subset that produces data are scheduled to execute immediately before workgroups of a second subset that consumes the data generated by the first subset. Thus, the processing system does not execute one kernel at a time, but instead schedules workgroups across kernels based on the data dependencies between them. By limiting the sizes of the subsets to the amount of data that can be stored in local caches, the processing system increases the probability that data to be consumed by the workgroups of a subset will be resident in a local cache and will not require a memory access.
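
    The interleaving described here can be pictured as follows: rather than running the producer kernel to completion and then the consumer kernel, workgroups are grouped into cache-sized subsets and the schedule alternates producer and consumer subsets. A minimal C++ sketch (the kernel names and subset size are assumptions for illustration, not values from the patent):

        // Builds an interleaved schedule in which each producing subset runs
        // immediately before the subset that consumes its output.
        #include <algorithm>
        #include <cstddef>
        #include <iostream>
        #include <string>
        #include <vector>

        struct Workgroup {
            std::string kernel;   // which kernel this workgroup belongs to
            std::size_t index;    // workgroup index within that kernel
        };

        std::vector<Workgroup> buildSchedule(std::size_t num_workgroups,
                                             std::size_t subset_size) {
            std::vector<Workgroup> schedule;
            for (std::size_t s = 0; s < num_workgroups; s += subset_size) {
                std::size_t end = std::min(s + subset_size, num_workgroups);
                // Producer subset first: its output lands in the local cache.
                for (std::size_t i = s; i < end; ++i)
                    schedule.push_back({"producerKernel", i});
                // Consumer subset immediately after, while that data is hot.
                for (std::size_t i = s; i < end; ++i)
                    schedule.push_back({"consumerKernel", i});
            }
            return schedule;
        }

        int main() {
            // Subset size would be tuned so one subset's output fits in the
            // local cache; 2 is an arbitrary illustrative value.
            for (const auto& wg : buildSchedule(8, 2))
                std::cout << wg.kernel << " workgroup " << wg.index << '\n';
        }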

    SYNCHRONIZATION UTILIZING LOCAL TEAM BARRIERS FOR THREAD TEAM PROCESSING

    Publication No.: US20240111609A1

    Publication Date: 2024-04-04

    Application No.: US17958213

    Filing Date: 2022-09-30

    Applicant: Intel Corporation

    IPC Classification: G06F9/52 G06F9/30

    CPC Classification: G06F9/522 G06F9/30098

    Abstract: Low-latency synchronization utilizing local team barriers for thread team processing is described. An example of an apparatus includes one or more processors including a graphics processor, the graphics processor including a plurality of processing resources; and memory for storage of data, including data for graphics processing, wherein the graphics processor is to receive a request for establishment of a local team barrier for a thread team, the thread team being allocated to a first processing resource and including multiple threads; determine requirements and designated threads for the local team barrier; and establish the local team barrier in a local register of the first processing resource based at least in part on the requirements and designated threads for the local team barrier.
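
    A software stand-in for such a team barrier is a sense-reversing barrier whose whole state (arrival count plus phase) is small enough to live in register-sized words shared only by the team's designated threads. A minimal C++ sketch under those assumptions:

        // Sense-reversing barrier for a fixed-size thread team. The last
        // arriving thread resets the count and advances the phase; the rest
        // spin until the phase changes, after which the barrier is re-usable.
        #include <atomic>
        #include <cstdint>
        #include <thread>
        #include <vector>

        class TeamBarrier {
        public:
            explicit TeamBarrier(uint32_t team_size) : team_size_(team_size) {}

            void wait() {
                uint32_t phase = phase_.load(std::memory_order_acquire);
                if (arrived_.fetch_add(1, std::memory_order_acq_rel) + 1
                        == team_size_) {
                    arrived_.store(0, std::memory_order_relaxed);    // reset
                    phase_.fetch_add(1, std::memory_order_release);  // release
                } else {
                    while (phase_.load(std::memory_order_acquire) == phase)
                        std::this_thread::yield();
                }
            }

        private:
            const uint32_t team_size_;
            std::atomic<uint32_t> arrived_{0};
            std::atomic<uint32_t> phase_{0};
        };

        int main() {
            constexpr uint32_t kTeamSize = 4;
            TeamBarrier barrier(kTeamSize);
            std::vector<std::thread> team;
            for (uint32_t t = 0; t < kTeamSize; ++t)
                team.emplace_back([&] {
                    barrier.wait();   // phase 1
                    barrier.wait();   // phase 2: same barrier, re-used
                });
            for (auto& th : team) th.join();
        }

    Keeping the barrier state local to the team, rather than in a device-wide structure, is what makes the low-latency claim plausible: only the designated threads ever touch it.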

    Multi-die dot-product engine to provision large scale machine learning inference applications

    Publication No.: US11947928B2

    Publication Date: 2024-04-02

    Application No.: US17017557

    Filing Date: 2020-09-10

    摘要: Systems and methods are provided for a multi-die dot-product engine (DPE) to provision large-scale machine learning inference applications. The multi-die DPE leverages a multi-chip architecture. For example, a multi-chip interface can include a plurality of DPE chips, where each DPE chip performs inference computations for performing deep learning operations. A hardware interface between a memory of a host computer and the plurality of DPE chips communicatively connects the plurality of DPE chips to the memory of the host computer system during an inference operation such that the deep learning operations are spanned across the plurality of DPE chips. Due to the multi-die architecture, multiple silicon devices are allowed to be used for inference, thereby enabling power-efficient inference for large-scale machine learning applications and complex deep neural networks. The multi-die DPE can be used to build a multi-device DNN inference system performing specific applications, such as object recognition, with high accuracy.