Abstract:
Methods, systems, and devices for dependency detection of a graphical processor unit (GPU) workload at a device are described. The method relates to generating a resource packet for a first GPU workload of a set of GPU workloads, the resource packet including a list of resources, identifying a first resource from the list of resources, retrieving a GPU address from a first memory location associated with the first resource, determining whether a dependency of the first resource exists between the first GPU workload and a second GPU workload from the set of GPU workloads based on the retrieving of the GPU address, and processing, when the dependency exists, the first resource after waiting for a duration to lapse.
Abstract:
A graphics processing unit (GPU) may include a triangle setup engine (TSE) configured to determine coordinates of a triangle, rotate coordinates of the triangle based on an angle. To rotate the coordinates, the TSE generates coordinates of the triangle in a rotated domain, and determines coordinates of a bounding box in the rotated domain based on the coordinates of the triangle in the rotated domain. The TSE determines a first plurality of parallel scanlines in the rotated domain, and a second plurality of parallel scanlines in the rotated domain. The first and second pluralities of scanlines are perpendicular. The TSE determines whether the bounding box coordinates are located within two adjacent scanlines. If the bounding box coordinates are located within the two adjacent scanlines, the TSE removes the triangle from the scene.
Abstract:
A graphics processing unit (GPU) may perform three-dimensional (3D) graphics processing in accordance with a 3D graphics pipeline using a first plurality of graphics processing hardware units of the GPU. The GPU may further perform a two-dimensional (2D) graphics operation using a second plurality of graphics processing hardware units of the GPU not used in performing the 3D graphics processing and one or more graphics processing hardware units of the first plurality of graphics processing hardware units of the GPU.
Abstract:
A cache memory system includes a cache memory including a plurality of cache memory lines and a dirty buffer including a plurality of dirty masks. A cache controller is configured to allocate one of the dirty masks to each of the cache memory lines when a write to the respective cache memory line is not a full write to that cache memory line. Each of the dirty masks indicates dirty states of data units in one of the cache memory lines. The cache controller may include a dirty buffer index which stores an identification (ID) information that associates the dirty masks with the cache memory lines to which the dirty masks are allocated. A cache line may include a fully dirty flag indicating when each byte in that cache line is dirty, so that a dirty mask does not need to be allocated for that cache line.
Abstract:
Systems and methods related to a memory system including a cache memory are disclosed. The cache memory system includes a cache memory including a plurality of cache memory lines and a dirty buffer including a plurality of dirty masks. A cache controller is configured to allocate one of the dirty masks to each of the cache memory lines when a write to the respective cache memory line is not a full write to that cache memory line. Each of the dirty masks indicates dirty states of data units in one of the cache memory lines. The cache controller stores an identification (ID) information that associates the dirty masks with the cache memory lines to which the dirty masks are allocated.
Abstract:
A sliced graphics processing unit (GPU) architecture in processor-based devices is disclosed. In some aspects, a GPU based on a sliced GPU architecture includes multiple hardware slices. The GPU further includes a sliced low-resolution Z buffer (LRZ) that is communicatively coupled to each hardware slice of the plurality of hardware slices, and that comprises a plurality of LRZ regions. Each hardware slice is configured to store, in an LRZ region corresponding exclusively to the hardware slice among the plurality of LRZ regions, a pixel tile assigned to the hardware slice.
Abstract:
A graphics processing unit (GPU) may include a triangle setup engine (TSE) configured to determine coordinates of a triangle, rotate coordinates of the triangle based on an angle. To rotate the coordinates, the TSE generates coordinates of the triangle in a rotated domain, and determines coordinates of a bounding box in the rotated domain based on the coordinates of the triangle in the rotated domain. The TSE determines a first plurality of parallel scanlines in the rotated domain, and a second plurality of parallel scanlines in the rotated domain. The first and second pluralities of scanlines are perpendicular. The TSE determines whether the bounding box coordinates are located within two adjacent scanlines. If the bounding box coordinates are located within the two adjacent scanlines, the TSE removes the triangle from the scene.
Abstract:
This disclosure describes an adaptive memory address scanning technique that defines an address scanning pattern, to be used for a particular surface, based on one or more properties of the surface. In addition, a number, shape, and arrangement of sub-primitives of a surface to process in parallel may be determined. In one example of the disclosure, a memory accessing method for graphics processing comprises, determining, by a graphics processing unit (GPU), properties of a surface, determining, by the GPU, a memory address scanning technique based on the determined properties of the surface, and performing, by the GPU, at least one of a read or a write of data associated with the surface in a memory based on the determined memory address scanning technique.
Abstract:
A sliced graphics processing unit (GPU) architecture in processor-based devices is disclosed. In some aspects, a GPU based on a sliced GPU architecture includes multiple hardware slices. The GPU further includes a command processor (CP) circuit and an unslice primitive controller (PC_US). Upon receiving a graphics instruction from a central processing unit (CPU), the CP circuit determines a graphics workload, and transmits the graphics workload to the PC_US. The PC_US then partitions the graphics workload into multiple subbatches and distributes each subbatch to a PC_S of a hardware slice for processing.
Abstract:
A cache memory may receive, from a client, a request for a long cache line of data. The cache memory may receive, from a memory, the requested long cache line of data. The cache memory may store the requested long cache line of data into a plurality of data stores across a plurality of memory banks as a plurality of short cache lines of data distributed across the plurality of data stores in the cache memory. The cache memory may also store a plurality of tags associated with the plurality of short cache lines of data into one of a plurality of tag stores in the plurality of memory banks.