Abstract:
A graphics processor capable of efficiently performing arithmetic operations and computing elementary functions is described. The graphics processor has at least one arithmetic logic unit (ALU) that can perform arithmetic operations and at least one elementary function unit that can compute elementary functions. The ALU(s) and elementary function unit(s) may be arranged such that they can operate in parallel to improve throughput. The graphics processor may also include fewer elementary function units than ALUs, e.g., four ALUs and a single elementary function unit. The four ALUs may perform an arithmetic operation on (1) four components of an attribute for one pixel or (2) one component of an attribute for four pixels. The single elementary function unit may operate on one component of one pixel at a time. The use of a single elementary function unit may reduce cost while still providing good performance.
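The two ALU operating modes and the serialized elementary function unit can be sketched as follows. This is only an illustrative model, not the patented design: the function names, the list-based pixel representation, and the one-result-per-step loop are assumptions made for the example.

```python
import math
from typing import Callable, List, Sequence

NUM_ALUS = 4  # the abstract's example: four ALUs and a single elementary function unit


def alu_pass_one_pixel(op: Callable[[float], float],
                       attribute: Sequence[float]) -> List[float]:
    """Mode (1): each of the four ALUs handles one component of one pixel's attribute."""
    if len(attribute) != NUM_ALUS:
        raise ValueError("expected a four-component attribute")
    return [op(component) for component in attribute]


def alu_pass_four_pixels(op: Callable[[float], float],
                         pixels: Sequence[Sequence[float]],
                         component: int) -> List[float]:
    """Mode (2): each of the four ALUs handles the same component of a different pixel."""
    if len(pixels) != NUM_ALUS:
        raise ValueError("expected four pixels")
    return [op(pixel[component]) for pixel in pixels]


def elementary_pass(fn: Callable[[float], float],
                    pixels: Sequence[Sequence[float]],
                    component: int) -> List[float]:
    """Single elementary function unit: one component of one pixel per step."""
    results = []
    for pixel in pixels:  # serialized, one pixel at a time
        results.append(fn(pixel[component]))
    return results


# Hypothetical data: four pixels, each with a four-component attribute.
pixels = [[1.0, 0.0, 0.0, 1.0],
          [4.0, 0.0, 0.0, 1.0],
          [9.0, 0.0, 0.0, 1.0],
          [16.0, 0.0, 0.0, 1.0]]
doubled = alu_pass_one_pixel(lambda x: 2.0 * x, pixels[1])
shifted = alu_pass_four_pixels(lambda x: x + 1.0, pixels, 0)
roots = elementary_pass(math.sqrt, pixels, 0)
```

In this model the ALU passes produce four results per step while the elementary function unit produces one, which reflects the cost and throughput trade-off the abstract describes.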
Abstract:
Methods, systems, and devices for graphics processing are described. The methods, systems, and devices may include or be associated with identifying a graphics instruction, determining that the graphics instruction is alias enabled for the device, partitioning an alias lookup table into one or more slots, allocating a slot of the alias lookup table based on the partitioning and determining that the graphics instruction is alias enabled, generating an alias instruction based on allocating the slot of the alias lookup table and determining that the graphics instruction is alias enabled, and processing the alias instruction.
Abstract:
Systems and techniques are disclosed for general purpose register dynamic allocation based on latency associated with instructions in processor threads. A streaming processor can include general purpose registers configured to store data associated with threads, and a thread scheduler configured to receive allocation information for the general purpose registers, the information describing general purpose registers that are to be assigned as persistent general purpose registers (pGPRs) and volatile general purpose registers (vGPRs). The plurality of general purpose registers can be allocated according to the received information. The streaming processor can include the general purpose registers allocated according to the received information, the allocation being based on execution latencies of instructions included in the threads.
Abstract:
Techniques are described for determining whether the data of a variable is the same for each of a plurality of graphics items. If the data is determined to be the same, the techniques store the data in a storage location of a specialized shared general purpose register that is associated with the variable.
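A minimal sketch of the uniformity check, assuming a dictionary-backed model of the registers. The `SharedGPR` class, the `store_variable` function, and the fallback to per-item registers are hypothetical names and behavior chosen for illustration; the abstract does not specify them.

```python
from typing import Dict, List, Optional, Sequence


class SharedGPR:
    """Hypothetical model of a shared GPR: one storage location per variable."""

    def __init__(self) -> None:
        self._slots: Dict[str, float] = {}

    def store(self, variable: str, value: float) -> None:
        self._slots[variable] = value

    def load(self, variable: str) -> Optional[float]:
        return self._slots.get(variable)


def store_variable(variable: str,
                   per_item_values: Sequence[float],
                   shared: SharedGPR,
                   per_item_regs: Dict[str, List[float]]) -> bool:
    """Store one copy in the shared GPR when every graphics item holds the same data.

    Returns True if the shared location was used. Falling back to one
    register per item is an assumption, not something the abstract states.
    """
    if per_item_values and all(v == per_item_values[0] for v in per_item_values):
        shared.store(variable, per_item_values[0])
        return True
    per_item_regs[variable] = list(per_item_values)
    return False


shared = SharedGPR()
per_item: Dict[str, List[float]] = {}
used_shared_for_scale = store_variable("scale", [0.5, 0.5, 0.5, 0.5], shared, per_item)
used_shared_for_depth = store_variable("depth", [0.1, 0.2, 0.3, 0.4], shared, per_item)
```

Here `scale` occupies a single shared location, while `depth`, which differs across items, keeps one value per item.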
Abstract:
A cache memory system includes a cache memory including a plurality of cache memory lines and a dirty buffer including a plurality of dirty masks. A cache controller is configured to allocate one of the dirty masks to each of the cache memory lines when a write to the respective cache memory line is not a full write to that cache memory line. Each of the dirty masks indicates dirty states of data units in one of the cache memory lines. The cache controller may include a dirty buffer index which stores identification (ID) information that associates the dirty masks with the cache memory lines to which the dirty masks are allocated. A cache line may include a fully dirty flag indicating when each byte in that cache line is dirty, so that a dirty mask does not need to be allocated for that cache line.
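The dirty-mask allocation can be sketched as below. This is a simplified model under stated assumptions: the class and method names are hypothetical, the data units are bytes, and promoting a line to fully dirty once its mask fills up (which frees the mask) is an added assumption rather than something the abstract describes.

```python
from typing import Dict, List


class DirtyTrackingCache:
    """Hypothetical model: masks are allocated only for partially written lines."""

    def __init__(self, num_lines: int, line_size: int, num_masks: int) -> None:
        self.line_size = line_size
        self.fully_dirty: List[bool] = [False] * num_lines  # per-line fully dirty flag
        self.masks: List[List[bool]] = [[False] * line_size for _ in range(num_masks)]
        self.free_masks: List[int] = list(range(num_masks))
        self.dirty_buffer_index: Dict[int, int] = {}  # line number -> mask ID

    def _release_mask(self, line: int) -> None:
        mask_id = self.dirty_buffer_index.pop(line, None)
        if mask_id is not None:
            self.masks[mask_id] = [False] * self.line_size
            self.free_masks.append(mask_id)

    def write(self, line: int, offset: int, length: int) -> None:
        if offset < 0 or length <= 0 or offset + length > self.line_size:
            raise ValueError("write falls outside the cache line")
        if length == self.line_size:
            # Full write: set the flag, no dirty mask needed.
            self._release_mask(line)
            self.fully_dirty[line] = True
            return
        if self.fully_dirty[line]:
            return  # every byte is already dirty
        mask_id = self.dirty_buffer_index.get(line)
        if mask_id is None:
            if not self.free_masks:
                # A real design would evict or write back; out of scope here.
                raise RuntimeError("no free dirty mask")
            mask_id = self.free_masks.pop()
            self.dirty_buffer_index[line] = mask_id
        mask = self.masks[mask_id]
        for i in range(offset, offset + length):
            mask[i] = True
        if all(mask):
            # Assumption: promote to fully dirty and free the mask.
            self._release_mask(line)
            self.fully_dirty[line] = True

    def dirty_bytes(self, line: int) -> List[bool]:
        if self.fully_dirty[line]:
            return [True] * self.line_size
        mask_id = self.dirty_buffer_index.get(line)
        if mask_id is None:
            return [False] * self.line_size
        return list(self.masks[mask_id])
```

Because masks are allocated only on partial writes, the dirty buffer can hold far fewer masks than there are cache lines.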
Abstract:
Techniques are described in which an indication is included to indicate that an intermediate value, generated as part of determining a final value, is at its last use and is not to be stored in a general purpose register (GPR). A processing unit avoids storing the intermediate value in the GPR based on the indication because the intermediate value is no longer needed for determining the final value.
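One way to picture the last-use indication is a tiny interpreter in which a flagged result is forwarded to the next instruction instead of being written to the GPR. The `Instr` record, its `last_use` flag, and the single-entry forwarding path are hypothetical modeling choices, not details taken from the disclosure.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Instr:
    dest: str
    op: Callable[[float, float], float]
    srcs: Tuple[str, str]
    last_use: bool = False  # hypothetical flag: result is consumed only by the next instruction


def run(program: List[Instr], gpr: Dict[str, float]) -> int:
    """Execute the program and return the number of GPR writes performed."""
    forwarded: Dict[str, float] = {}  # holds only the previous instruction's result
    gpr_writes = 0
    for ins in program:
        a, b = (forwarded[s] if s in forwarded else gpr[s] for s in ins.srcs)
        result = ins.op(a, b)
        forwarded = {ins.dest: result}
        if not ins.last_use:
            gpr[ins.dest] = result  # only values that are still needed reach the GPR
            gpr_writes += 1
    return gpr_writes


# Final value r = (a + b) * c; the intermediate t = a + b is never stored.
gpr = {"a": 1.0, "b": 2.0, "c": 3.0}
program = [
    Instr("t", operator.add, ("a", "b"), last_use=True),
    Instr("r", operator.mul, ("t", "c")),
]
writes = run(program, gpr)
```

In this sketch only the final value `r` is written to the register file; the intermediate `t` exists only on the forwarding path.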
Abstract:
Systems and methods related to a memory system including a cache memory are disclosed. The cache memory system includes a cache memory including a plurality of cache memory lines and a dirty buffer including a plurality of dirty masks. A cache controller is configured to allocate one of the dirty masks to each of the cache memory lines when a write to the respective cache memory line is not a full write to that cache memory line. Each of the dirty masks indicates dirty states of data units in one of the cache memory lines. The cache controller stores identification (ID) information that associates the dirty masks with the cache memory lines to which the dirty masks are allocated.
Abstract:
The present disclosure relates to methods and apparatus for graphics processing. For example, disclosed techniques facilitate improving bindless state processing at a graphics processor. Aspects of the present disclosure can receive, at a graphics processor, a shader program including a preamble section and a main instructions section. Aspects of the present disclosure can also execute, with a scalar processor dedicated to processing preamble sections, instructions of the preamble section to implement a bindless mechanism for loading constant data associated with the shader program. Additionally, aspects of the present disclosure can distribute the main instructions section and the constant data to a streaming processor for executing the shader program.
Abstract:
This disclosure provides systems, devices, apparatus, and methods, including computer programs encoded on storage media, for dynamic wave pairing. A graphics processor may allocate one or more GPU workloads to one or more wave slots of a plurality of wave slots. The graphics processor may select a first execution slot of a plurality of execution slots for executing the one or more GPU workloads. The selection may be based on one of a plurality of granularities. The graphics processor may execute, at the selected first execution slot, the one or more GPU workloads at the one of the plurality of granularities.
Abstract:
The present disclosure relates to methods and apparatus for compute processing. For example, disclosed techniques facilitate improving performance of matrix multiplication in a streaming processor. Aspects of the present disclosure can execute, with a load control unit, a first load instruction to load a set of input data of an input matrix from a first memory to a second memory. Aspects of the present disclosure can also execute, with the load control unit, a second load instruction to load a set of weight data of a weight matrix from the first memory to the second memory. Additionally, aspects of the present disclosure can perform, with an ALU component, a matrix multiplication operation using the set of input data and the set of weight data to generate an output matrix. Further, aspects of the present disclosure can store the output matrix at a general purpose register accessible to the ALU component.
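The sequence of two loads, a multiply, and a store to a GPR can be sketched as follows. The dictionary-based memories, the function names, and the plain triple-loop multiply are assumptions made for illustration; the abstract does not describe how the hardware performs each step.

```python
from typing import Dict, List

Matrix = List[List[float]]


def load(first_memory: Dict[str, Matrix],
         second_memory: Dict[str, Matrix],
         name: str) -> None:
    """Model of one load instruction: copy a matrix from the first memory to the second."""
    second_memory[name] = [row[:] for row in first_memory[name]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Model of the ALU step: a plain row-by-column matrix product."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def run_matmul(first_memory: Dict[str, Matrix]) -> Dict[str, Matrix]:
    second_memory: Dict[str, Matrix] = {}
    gpr: Dict[str, Matrix] = {}
    load(first_memory, second_memory, "input")    # first load instruction
    load(first_memory, second_memory, "weight")   # second load instruction
    gpr["output"] = matmul(second_memory["input"], second_memory["weight"])
    return gpr  # the output matrix stays in a register the ALU can access


first_memory = {"input": [[1, 2], [3, 4]], "weight": [[5, 6], [7, 8]]}
result = run_matmul(first_memory)
```

Keeping the output in `gpr` rather than writing it back to `first_memory` mirrors the abstract's final step, where the result remains accessible to the ALU component.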