COLLAPSING BUBBLES IN A PROCESSING UNIT PIPELINE

    Publication No.: US20210096877A1

    Publication Date: 2021-04-01

    Application No.: US16583969

    Application Date: 2019-09-26

    Abstract: An arithmetic logic unit (ALU) pipeline of a processing unit collapses execution bubbles in response to a stall at a stage of the ALU pipeline. An execution bubble occurs at the pipeline in response to an invalid instruction being placed in the pipeline for execution. The invalid instruction thus consumes an available “slot” in the pipeline, and proceeds through the pipeline until a stall in a subsequent stage (that is, a stage after the stage executing the invalid instruction) is detected. In response to detecting the stall, the ALU continues to execute instructions that are behind the invalid instruction in the pipeline, thereby collapsing the execution bubble and conserving resources of the ALU.
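
    The mechanism described above can be illustrated with a minimal sketch, assuming a simple list-of-slots pipeline model; the shifting scheme and all names are illustrative, not taken from the patent.

```python
# A minimal sketch of bubble collapsing, assuming a list-of-slots pipeline
# model; names and the shifting scheme are illustrative, not from the patent.
BUBBLE = None  # an invalid instruction occupies its slot as a "bubble"

def advance_with_collapse(pipeline, stall_stage):
    """One cycle of a stalled pipeline with bubble collapsing.

    pipeline[i] is the instruction at stage i (higher index = further
    along). Stages at index >= stall_stage hold in place this cycle.
    If a bubble sits behind the stall, the younger instructions advance
    into its slot, collapsing the bubble instead of stalling too.
    """
    new = list(pipeline)
    bubbles = [i for i in range(stall_stage) if pipeline[i] is BUBBLE]
    if not bubbles:
        return new  # no bubble to collapse; everything behind also holds
    b = max(bubbles)  # oldest bubble behind the stall boundary
    for i in range(b, 0, -1):  # shift stages 0..b-1 forward by one slot
        new[i] = new[i - 1]
    new[0] = BUBBLE  # entry slot is freed for the next issued instruction
    return new
```

    For example, with a stall at stage 2, `advance_with_collapse(['A', None, 'B', 'C'], 2)` lets instruction A advance into the bubble slot, yielding `[None, 'A', 'B', 'C']`.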

    GRAPHICS CONTEXT BOUNCING
    Invention Application

    Publication No.: US20200379767A1

    Publication Date: 2020-12-03

    Application No.: US16426613

    Application Date: 2019-05-30

    Abstract: A method of context bouncing includes receiving, at a command processor of a graphics processing unit (GPU), a conditional execute packet providing a hash identifier corresponding to an encapsulated state. The encapsulated state includes one or more context state packets following the conditional execute packet. A command packet following the encapsulated state is executed based at least in part on determining whether the hash identifier of the encapsulated state matches one of a plurality of hash identifiers of active context states currently stored at the GPU.
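
    A hedged sketch of this conditional-execute flow, with all function and parameter names assumed for illustration:

```python
def process_conditional_execute(hash_id, state_packets, command_packet,
                                active_hashes, apply_state, execute):
    """Sketch of context bouncing: skip re-applying the encapsulated
    state when its hash already matches an active context state stored
    at the GPU; otherwise apply the state packets before executing."""
    if hash_id not in active_hashes:
        for pkt in state_packets:
            apply_state(pkt)          # load the encapsulated context state
        active_hashes.add(hash_id)    # it is now an active context state
    execute(command_packet)           # run the command packet either way
```

    On a hash match the context state packets are skipped entirely, which is the saving the abstract describes.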

    SOFTWARE-CONTROLLED VARIABLE WAVEFRONT SIZE EXECUTION AT GPU

    Publication No.: US20190278605A1

    Publication Date: 2019-09-12

    Application No.: US16425625

    Application Date: 2019-05-29

    Abstract: A system includes a processor configured to operate in at least a first mode and a second mode. In the first mode the processor operates to execute an instruction for an entire wavefront before executing a next instruction for the entire wavefront. In the second mode the processor operates to execute a set of instructions for a portion of a wavefront before executing the set of instructions for another portion of the same wavefront. The system further includes a memory coupled to the processor. The memory is configured to store a shader program for execution by the processor, wherein the shader program includes at least one indication associated with one of the first mode or the second mode. The processor is further to implement one of the first mode or the second mode while executing the shader program responsive to the at least one indication present in the shader program.
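
    The two execution orders can be contrasted in a short sketch; the portion size and trace format are assumptions for illustration:

```python
def run_first_mode(instructions, wavefront):
    """Mode 1: execute each instruction for the entire wavefront before
    moving to the next instruction."""
    return [(ins, lane) for ins in instructions for lane in wavefront]

def run_second_mode(instructions, wavefront, portion_size):
    """Mode 2: execute the whole instruction set for one portion of the
    wavefront before moving on to the next portion (size assumed)."""
    trace = []
    for start in range(0, len(wavefront), portion_size):
        portion = wavefront[start:start + portion_size]
        for ins in instructions:
            trace.extend((ins, lane) for lane in portion)
    return trace
```

    Both modes perform the same work; only the interleaving of instructions and wavefront lanes differs.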

    Optimized Context Switching for Long-Running Processes
    Invention Application
    Status: Pending (Published)

    Publication No.: US20140157287A1

    Publication Date: 2014-06-05

    Application No.: US13691066

    Application Date: 2012-11-30

    CPC classification number: G06F9/461

    Abstract: Methods, systems, and computer readable storage media embodiments allow for low overhead context switching of threads. In embodiments, applications, such as, but not limited to, iterative data-parallel applications, substantially reduce the overhead of context switching by adding user- or higher-level-program configurability of the state to be saved upon preemption of an executing thread. These methods, systems, and computer readable storage media include aspects of running a group of threads on a processor, saving state information by respective threads in the group in response to a signal from a scheduler, and pre-empting running of the group after the saving of the state information.
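
    A minimal sketch of the configurable state-saving idea, with hypothetical names; a real implementation operates at the runtime or driver level, not in Python:

```python
class WorkerThread:
    """Illustrative thread whose user code configures which state keys
    must survive preemption (all names are hypothetical)."""
    def __init__(self, state, save_keys):
        self.state = state          # full live state
        self.save_keys = save_keys  # user-configured subset to save
        self.saved = None

    def save_state(self):
        # Save only the configured subset, reducing switch overhead.
        self.saved = {k: self.state[k] for k in self.save_keys}

def preempt_group(group):
    """Scheduler signal: each thread in the group saves its configured
    state, then the group can be preempted."""
    for t in group:
        t.save_state()
    return [t.saved for t in group]
```

    Because only the user-designated state is saved, large scratch data that can be recomputed is simply dropped at the switch.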


    DUAL VECTOR ARITHMETIC LOGIC UNIT
    Invention Publication

    Publication No.: US20240168719A1

    Publication Date: 2024-05-23

    Application No.: US18414164

    Application Date: 2024-01-16

    CPC classification number: G06F7/57 G06F9/3867 G06F17/16 G06T1/20 G06F15/8015

    Abstract: A processing system executes wavefronts at multiple arithmetic logic unit (ALU) pipelines of a single instruction multiple data (SIMD) unit in a single execution cycle. The ALU pipelines each include a number of ALUs that execute instructions on wavefront operands that are collected from vector general-purpose register (VGPR) banks at a cache and output results of the instructions executed on the wavefronts at a buffer. By storing wavefronts supplied by the VGPR banks at the cache, a greater number of wavefronts can be made available to the SIMD unit without increasing the VGPR bandwidth, enabling multiple ALU pipelines to execute instructions during a single execution cycle.
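
    One way to picture the operand-collecting cache is the sketch below; the 4-bank layout, register-to-bank mapping, and conflict rule are assumptions for illustration, not details from the patent:

```python
class OperandCache:
    """Illustrative operand cache between VGPR banks and ALU pipelines:
    each bank serves one read per cycle, but cached operands are reused
    without a bank access, so two pipelines can often be fed in the same
    cycle without increasing VGPR bandwidth."""
    NUM_BANKS = 4  # assumed bank count

    def __init__(self, vgpr):
        self.vgpr = vgpr
        self.cache = {}

    def fetch(self, regs):
        """Return operand values for one cycle, or None on a bank conflict."""
        values, banks_used = [], set()
        for reg in regs:
            if reg in self.cache:
                values.append(self.cache[reg])  # cache hit: no bank access
                continue
            bank = reg % self.NUM_BANKS  # assumed reg-to-bank mapping
            if bank in banks_used:
                return None  # two uncached reads hit the same bank
            banks_used.add(bank)
            self.cache[reg] = self.vgpr[reg]
            values.append(self.vgpr[reg])
        return values
```

    A previously fetched operand costs no bank read, which is how the cache lets more wavefront operands reach the SIMD unit per cycle.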

    PROCESSING UNIT WITH SMALL FOOTPRINT ARITHMETIC LOGIC UNIT

    Publication No.: US20240143283A1

    Publication Date: 2024-05-02

    Application No.: US18219268

    Application Date: 2023-07-07

    CPC classification number: G06F7/57 G06F17/16 G06N3/08

    Abstract: A parallel processing unit employs an arithmetic logic unit (ALU) having a relatively small footprint, thereby reducing the overall power consumption and circuit area of the processing unit. To support the smaller footprint, the ALU includes multiple stages to execute operations corresponding to a received instruction. The ALU executes at least one operation at a precision indicated by the received instruction, and then reduces the resulting data of the at least one operation to a smaller size before providing the results to another stage of the ALU to continue execution of the instruction.
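
    A rough sketch of the staged execute-then-reduce idea; the truncation used for the size reduction here is an assumption, not the patent's actual reduction method:

```python
def reduce_width(value, bits):
    """Keep only the low `bits` bits (illustrative, lossy size reduction)."""
    return value & ((1 << bits) - 1)

def small_footprint_alu(a, b, precision_bits, reduced_bits):
    """Sketch of a staged ALU: stage 1 multiplies at the instruction's
    indicated precision; the result is then reduced to a smaller width
    before it is handed to the next stage, shrinking the inter-stage
    datapath and hence the ALU's footprint."""
    stage1 = (a * b) & ((1 << precision_bits) - 1)  # op at full precision
    return reduce_width(stage1, reduced_bits)       # smaller inter-stage bus
```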

    SPATIAL PARTITIONING IN A MULTI-TENANCY GRAPHICS PROCESSING UNIT

    Publication No.: US20220237851A1

    Publication Date: 2022-07-28

    Application No.: US17706811

    Application Date: 2022-03-29

    Abstract: A graphics processing unit (GPU) or other apparatus includes a plurality of shader engines. The apparatus also includes a first front end (FE) circuit and one or more second FE circuits. The first FE circuit is configured to schedule geometry workloads for the plurality of shader engines in a first mode. The first FE circuit is configured to schedule geometry workloads for a first subset of the plurality of shader engines and the one or more second FE circuits are configured to schedule geometry workloads for a second subset of the plurality of shader engines in a second mode. In some cases, a partition switch is configured to selectively connect the first FE circuit or the one or more second FE circuits to the second subset of the plurality of shader engines depending on whether the apparatus is in the first mode or the second mode.
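
    The two scheduling modes can be sketched as a mapping from FE circuits to shader engines; the round-robin split in the second mode is illustrative:

```python
def schedule_assignment(fe_circuits, shader_engines, mode):
    """Map each front end (FE) circuit to the shader engines it schedules.
    Mode 1: the first FE circuit schedules all shader engines.
    Mode 2: the engines are split among the FE circuits (the round-robin
    split policy here is an assumption for illustration)."""
    if mode == 1:
        return {fe_circuits[0]: list(shader_engines)}
    assignment = {fe: [] for fe in fe_circuits}
    for i, se in enumerate(shader_engines):
        assignment[fe_circuits[i % len(fe_circuits)]].append(se)
    return assignment
```

    The partition switch in the abstract corresponds to selecting which of these two mappings is wired up.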

    SPARSE MATRIX-VECTOR MULTIPLICATION

    Publication No.: US20220197973A1

    Publication Date: 2022-06-23

    Application No.: US17125457

    Application Date: 2020-12-17

    Abstract: A processing system includes a first set and a second set of general-purpose registers (GPRs) and memory access circuitry that fetches nonzero values of a sparse matrix into consecutive slots in the first set. The memory access circuitry also fetches values of an expanded matrix into consecutive slots in the second set of GPRs. The expanded matrix is formed based on values of a vector and locations of the nonzero values in the sparse matrix. The processing system also includes a set of multipliers that concurrently perform multiplication of the nonzero values in slots of the first set of GPRs with the values of the vector in corresponding slots of the second set. Reduced sum circuitry accumulates results from the set of multipliers for rows of the sparse matrix.
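
    A minimal software sketch of this data layout and reduction, with the two register sets modeled as Python lists:

```python
def spmv(rows, vector):
    """Sparse matrix-vector multiply. rows[i] is the list of
    (column, value) nonzeros in row i of the sparse matrix. Nonzero
    values are packed into consecutive slots of one register set; the
    matching vector values (the 'expanded matrix') into the other, so
    the multiplies line up slot-for-slot."""
    gpr_a, gpr_b, row_ptr = [], [], [0]
    for row in rows:
        for col, val in row:
            gpr_a.append(val)           # nonzero values, consecutive slots
            gpr_b.append(vector[col])   # expanded vector values, aligned
        row_ptr.append(len(gpr_a))
    products = [a * b for a, b in zip(gpr_a, gpr_b)]  # concurrent multiplies
    # Reduced-sum: accumulate the products belonging to each row.
    return [sum(products[row_ptr[i]:row_ptr[i + 1]]) for i in range(len(rows))]
```

    For the matrix [[1, 0, 2], [0, 3, 0]] and vector [4, 5, 6], the packed multiplies are 1*4, 2*6, 3*5 and the row sums are [16, 15].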

    SPLIT FRAME RENDERING
    Invention Application

    Publication No.: US20190318527A1

    Publication Date: 2019-10-17

    Application No.: US16452831

    Application Date: 2019-06-26

    Abstract: Improvements in the graphics processing pipeline that allow multiple pipelines to cooperate to render a single frame are disclosed. Two approaches are provided. In a first approach, world-space pipelines for the different graphics processing pipelines process all work for draw calls received from a central processing unit (CPU). In a second approach, the world-space pipelines divide up the work. Work that is divided is synchronized and redistributed at various points in the world-space pipeline. In either approach, the triangles output by the world-space pipelines are distributed to the screen-space pipelines based on the portions of the render surface overlapped by the triangles. Triangles are rendered by screen-space pipelines associated with the render surface portions overlapped by those triangles.
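
    The screen-space distribution step can be sketched as follows; the bounding-box overlap test and the checkerboard tile-to-pipeline mapping are assumptions for illustration:

```python
def distribute_triangles(triangles, tile_size, num_pipelines):
    """Assign each triangle (a list of (x, y) vertices) to every
    screen-space pipeline whose render surface tiles its bounding box
    overlaps. The checkerboard tile-to-pipeline mapping is assumed."""
    buckets = {p: [] for p in range(num_pipelines)}
    for tri in triangles:
        xs, ys = [v[0] for v in tri], [v[1] for v in tri]
        owners = set()
        for ty in range(min(ys) // tile_size, max(ys) // tile_size + 1):
            for tx in range(min(xs) // tile_size, max(xs) // tile_size + 1):
                owners.add((tx + ty) % num_pipelines)  # checkerboard tiles
        for p in owners:
            buckets[p].append(tri)  # rendered by every overlapped pipeline
    return buckets
```

    A triangle that spans multiple tiles is handed to every pipeline owning one of those tiles, matching the overlap-based distribution in the abstract.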

    SELECTIVE PREFETCHING IN MULTITHREADED PROCESSING UNITS

    Publication No.: US20190155604A1

    Publication Date: 2019-05-23

    Application No.: US15818304

    Application Date: 2017-11-20

    Abstract: A processing unit includes a plurality of processing elements and one or more caches. A first thread executes a program that includes one or more prefetch instructions to prefetch information into a first cache. Prefetching is selectively enabled when executing the first thread on a first processing element dependent upon whether one or more second threads previously executed the program on the first processing element. The first thread is then dispatched to execute the program on the first processing element. In some cases, a dispatcher receives the first thread for dispatching to the first processing element. The dispatcher modifies the prefetch instruction to disable prefetching into the first cache in response to the one or more second threads having previously executed the program on the first processing element.
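
    A hedged sketch of the dispatcher's decision, with the thread modeled as a dict and all names assumed for illustration:

```python
def dispatch(thread, element, run_history):
    """Sketch of selective prefetching at dispatch: if the program has
    already run on this processing element, its cache is likely warm, so
    the thread's prefetch instruction is disabled (names assumed)."""
    if (thread["program"], element) in run_history:
        thread["prefetch_enabled"] = False  # modify the prefetch instruction
    run_history.add((thread["program"], element))
    return thread
```

    The first thread to run a program on an element keeps its prefetches; repeat runs skip them, avoiding redundant fetches into an already-populated cache.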
