Abstract:
A data processing device is provided that includes an array of working memory banks and an associated processing engine. The working memory bank array is configured with at least one independently activatable memory bank. A dirty data counter (DDC) is associated with the independently activatable memory bank and is configured to reflect a count of dirty data migrated from that bank upon its selective deactivation. The DDC is configured to selectively decrement the count of dirty data upon reactivation of the independently activatable memory bank in connection with a transient state. In the transient state, each dirty data access by the processing engine to the reactivated memory bank is also conducted with respect to another memory bank of the array. When dirty data is found in the other memory bank, the count of dirty data is decremented.
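A minimal Python sketch of the counting scheme described above, assuming a two-bank array; the class and method names (Bank, DDCBankArray, access_dirty) are illustrative, not terminology from the abstract beyond the DDC itself.

```python
# Behavioral sketch of the dirty data counter (DDC) scheme described above.
# Class and method names are illustrative assumptions, not taken from the source.

class Bank:
    def __init__(self, name):
        self.name = name
        self.active = True
        self.lines = {}          # address -> (data, dirty_flag)

class DDCBankArray:
    def __init__(self):
        self.banks = {"A": Bank("A"), "B": Bank("B")}
        self.ddc = {"A": 0, "B": 0}            # dirty data counter per bank
        self.transient = {"A": False, "B": False}

    def deactivate(self, name, spill_to):
        """Migrate dirty lines to another bank and record their count in the DDC."""
        bank, peer = self.banks[name], self.banks[spill_to]
        dirty = {a: v for a, v in bank.lines.items() if v[1]}
        peer.lines.update(dirty)
        self.ddc[name] = len(dirty)
        bank.lines.clear()
        bank.active = False

    def reactivate(self, name):
        """Re-enable the bank; a nonzero DDC puts it in the transient state."""
        self.banks[name].active = True
        self.transient[name] = self.ddc[name] > 0

    def access_dirty(self, name, addr, peer_name):
        """During the transient state a dirty access also probes the peer bank;
        finding migrated dirty data there decrements the DDC."""
        bank, peer = self.banks[name], self.banks[peer_name]
        if self.transient[name] and addr in peer.lines and peer.lines[addr][1]:
            data = peer.lines.pop(addr)
            bank.lines[addr] = data
            self.ddc[name] -= 1
            if self.ddc[name] == 0:
                self.transient[name] = False   # all migrated dirty data reclaimed
            return data[0]
        return bank.lines.get(addr, (None, False))[0]

arr = DDCBankArray()
arr.banks["A"].lines[0x10] = ("stack_frame", True)   # a dirty line in bank A
arr.deactivate("A", spill_to="B")
arr.reactivate("A")                                  # DDC["A"] == 1 -> transient state
arr.access_dirty("A", 0x10, peer_name="B")           # dirty hit in B; DDC drops to 0
```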
Abstract:
A method of storing stack data in a cache hierarchy is provided. The cache hierarchy comprises a data cache and a stack filter cache. Responsive to a request to access a stack data block, the method stores the stack data block in the stack filter cache, wherein the stack filter cache is configured to store any requested stack data block.
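A hedged Python sketch of the placement decision described above, assuming simple dictionary-backed caches; the SimpleCache and CacheHierarchy names, the capacities, and the FIFO eviction are illustrative choices, not the claimed design.

```python
# Sketch of routing stack accesses into a dedicated stack filter cache.

class SimpleCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}          # block address -> data (FIFO eviction for brevity)

    def insert(self, addr, data):
        if len(self.store) >= self.capacity:
            self.store.pop(next(iter(self.store)))   # evict the oldest entry
        self.store[addr] = data

    def lookup(self, addr):
        return self.store.get(addr)

class CacheHierarchy:
    def __init__(self):
        self.data_cache = SimpleCache(capacity=1024)
        self.stack_filter_cache = SimpleCache(capacity=64)

    def access(self, addr, data, is_stack):
        """Any requested stack data block is stored in the stack filter cache;
        other requests go to the ordinary data cache."""
        target = self.stack_filter_cache if is_stack else self.data_cache
        hit = target.lookup(addr)
        if hit is None:
            target.insert(addr, data)     # fill on miss
        return hit

hier = CacheHierarchy()
hier.access(0x7FFF_F000, b"frame", is_stack=True)    # fills the stack filter cache
hier.access(0x0040_0000, b"heap",  is_stack=False)   # fills the ordinary data cache
```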
Abstract:
Permute instructions for register-based lookups are described. In accordance with the described techniques, a processor is configured to perform a register-based lookup by retrieving a first result from a first lookup table based on a subset of bits included in an index of a destination register, retrieving a second result from a second lookup table based on the same subset of bits, selecting the first result or the second result based on a bit in the index that is excluded from the subset of bits, and overwriting data included in the index of the destination register using the selected one of the first result or the second result.
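A small Python sketch of the lane-wise selection described above; the 3-bit subset, the choice of bit 3 as the excluded selector bit, and the 8-entry tables are assumptions made for illustration.

```python
# Sketch of the register-based lookup: low bits address both tables, one
# excluded bit picks which table's result overwrites the lane.

def permute_lookup(dest_register, table0, table1):
    """For each index held in the destination register, fetch a result from both
    lookup tables using the bit subset, pick one result with the excluded bit,
    and overwrite that lane with the selected result."""
    out = []
    for index in dest_register:
        subset = index & 0b111          # subset of bits used to address each table
        selector = (index >> 3) & 0b1   # bit excluded from the subset
        first = table0[subset]          # result retrieved from the first table
        second = table1[subset]         # result retrieved from the second table
        out.append(second if selector else first)
    return out

# Example: indexes 0-7 select from table0, indexes 8-15 select from table1.
table0 = list(range(100, 108))
table1 = list(range(200, 208))
print(permute_lookup([0, 9, 3, 12], table0, table1))   # [100, 201, 103, 204]
```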
Abstract:
A parallel processing (PP) level coherence directory, also referred to as a Processing In-Memory Probe Filter (PimPF), is added to a coherence directory controller. When the coherence directory controller receives a broadcast PIM command from a host, or a PIM command that is directed to multiple memory banks in parallel, the PimPF accelerates processing of the PIM command by maintaining a directory for cache coherence that is separate from existing system level directories in the coherence directory controller. The PimPF maintains a directory according to address signatures that define the memory addresses affected by a broadcast PIM command. Two implementations are described: a lightweight implementation that accelerates PIM loads into registers, and a heavyweight implementation that accelerates both PIM loads into registers and PIM stores into memory.
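A rough Python sketch of the probe-filter bookkeeping described above; the signature format, the class name PimProbeFilter, and the exact load/store responses are assumptions about one plausible behavior, not the claimed implementation.

```python
# Sketch of a PIM probe filter (PimPF) keeping a directory, separate from the
# system-level directories, keyed by the address signatures of broadcast PIM
# commands.

class PimProbeFilter:
    def __init__(self):
        # signature -> set of cache IDs that may hold copies of the
        # addresses covered by a broadcast PIM command
        self.directory = {}

    def record(self, signature, cache_id):
        self.directory.setdefault(signature, set()).add(cache_id)

    def pim_load(self, signature):
        """Lightweight path: a broadcast PIM load into registers only needs the
        caches named in the PimPF entry to be probed, not every system directory."""
        return sorted(self.directory.get(signature, set()))

    def pim_store(self, signature):
        """Heavyweight path: a PIM store into memory also drops the tracked
        entry so stale cached copies are not consulted afterwards."""
        return sorted(self.directory.pop(signature, set()))

pf = PimProbeFilter()
pf.record(signature=0xAB, cache_id="L2_0")
pf.record(signature=0xAB, cache_id="L2_3")
print(pf.pim_load(0xAB))    # only L2_0 and L2_3 need to be probed
```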
Abstract:
Methods and systems for load balancing in a neural network system using metadata are disclosed. Any one or a combination of one or more kernels, one or more neurons, and one or more layers of the neural network system are tagged with metadata. A scheduler detects whether there are neurons that are available to execute. The scheduler uses the metadata to schedule and load balance computations across the available compute resources.
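A minimal Python sketch of how metadata tags could drive the scheduling step; the metadata fields (layer, cost), the readiness flag, and the greedy least-loaded policy are assumptions for illustration, not the claimed method.

```python
# Sketch of metadata-driven scheduling and load balancing across compute units.

def schedule(neurons, compute_units):
    """Assign each neuron that is available to execute to a compute unit,
    balancing estimated cost (taken from its metadata) across the units."""
    load = {cu: 0 for cu in compute_units}
    placement = {}
    ready = [n for n in neurons if n["ready"]]                # available to execute
    for neuron in sorted(ready, key=lambda n: -n["metadata"]["cost"]):
        target = min(load, key=load.get)                      # least-loaded unit
        placement[neuron["id"]] = target
        load[target] += neuron["metadata"]["cost"]
    return placement

neurons = [
    {"id": "n0", "ready": True,  "metadata": {"layer": 0, "cost": 4}},
    {"id": "n1", "ready": True,  "metadata": {"layer": 0, "cost": 1}},
    {"id": "n2", "ready": False, "metadata": {"layer": 1, "cost": 8}},
]
print(schedule(neurons, ["cu0", "cu1"]))   # {'n0': 'cu0', 'n1': 'cu1'}
```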
Abstract:
A processing system includes a first set of one or more processing units including a first processing unit, a second set of one or more processing units including a second processing unit, and a memory having an address space shared by the first and second sets. The processing system further includes a distributed coherence directory subsystem having a first coherence directory to support a first subset of one or more address regions of the address space and a second coherence directory to support a second subset of one or more address regions of the address space. In some implementations, the first coherence directory is implemented in the system so as to have a lower access latency for the first set, whereas the second coherence directory is implemented in the system so as to have a lower access latency for the second set.
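A short Python sketch of steering coherence lookups to the directory that covers a given address region; the fixed 1 GiB region size and the directory labels are illustrative assumptions about placement near each processing-unit set.

```python
# Sketch of a distributed coherence directory split by address region.

REGION_BITS = 30                      # assume 1 GiB address regions

class DistributedDirectory:
    def __init__(self, regions_for_dir0, regions_for_dir1):
        # Each directory supports a subset of the shared address space and is
        # placed so it has lower access latency for the set that uses it most.
        self.home = {}
        self.home.update({r: "dir0_near_set1" for r in regions_for_dir0})
        self.home.update({r: "dir1_near_set2" for r in regions_for_dir1})

    def directory_for(self, address):
        region = address >> REGION_BITS
        return self.home[region]

dirs = DistributedDirectory(regions_for_dir0={0, 1}, regions_for_dir1={2, 3})
print(dirs.directory_for(0x1234_5678))       # region 0 -> dir0_near_set1
print(dirs.directory_for(0xC000_0000))       # region 3 -> dir1_near_set2
```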
Abstract:
A processor distributes memory timing parameters and data among different memory modules based upon memory access patterns. The memory access patterns indicate different types, or classes, of data for an executing workload, with each class associated with different memory access characteristics, such as different row buffer hit rate levels, different frequencies of access, different criticalities, and the like. The processor assigns each memory module to a data class and sets the memory timing parameters for each memory module according to the module's assigned data class, thereby tailoring the memory timing parameters for efficient access of the corresponding data.
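A brief Python sketch of mapping each memory module to the timing profile of its assigned data class; the class names and the (tRCD, tRP, tRAS) values are placeholder assumptions, not values from the source.

```python
# Sketch of per-module timing assignment based on observed data classes.

TIMING_PROFILES = {
    # data class observed from access patterns -> (tRCD, tRP, tRAS) in cycles
    "high_row_buffer_hit": (10, 10, 28),   # open-row hits dominate, so activate
                                           # timings are exercised less often
    "frequent_random":     (8,  8,  24),   # tightened for frequent row misses
    "latency_critical":    (7,  7,  22),   # fastest profile for critical data
}

def assign_timings(modules, classification):
    """Set each memory module's timing parameters according to its data class."""
    return {m: TIMING_PROFILES[classification[m]] for m in modules}

classification = {"DIMM0": "latency_critical", "DIMM1": "high_row_buffer_hit"}
print(assign_timings(["DIMM0", "DIMM1"], classification))
```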
Abstract:
Systems, apparatuses, and methods for implementing memory request priority assignment techniques for parallel processors are disclosed. A system includes at least a parallel processor coupled to a memory subsystem, where the parallel processor includes at least a plurality of compute units for executing wavefronts in lock-step. The parallel processor assigns priorities to memory requests of wavefronts on a per-work-item basis by indexing into a first priority vector, with the index generated based on lane-specific information. If a given event is detected, a second priority vector is generated by applying a given priority promotion vector to the first priority vector. Then, for subsequent wavefronts, memory requests are assigned priorities by indexing into the second priority vector with lane-specific information. The use of priority vectors to assign priorities to memory requests helps to reduce the memory divergence problem experienced by different work-items of a wavefront.
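A compact Python sketch of the priority-vector indexing described above, assuming a 16-wide wavefront; the initial per-lane priorities and the promotion values are illustrative assumptions.

```python
# Sketch of per-work-item priority assignment via priority vectors.

WAVEFRONT_WIDTH = 16

# First priority vector: priority of each memory request, indexed by lane ID.
priority_vector = [lane % 4 for lane in range(WAVEFRONT_WIDTH)]

# Priority promotion vector applied when a given event is detected.
promotion_vector = [2] * WAVEFRONT_WIDTH

def request_priority(vector, lane_id):
    """Index into the active priority vector with lane-specific information."""
    return vector[lane_id]

def promote(vector, promotion):
    """Build the second priority vector by applying the promotion vector."""
    return [p + b for p, b in zip(vector, promotion)]

print(request_priority(priority_vector, 5))                  # 1
second_vector = promote(priority_vector, promotion_vector)   # used for later wavefronts
print(request_priority(second_vector, 5))                    # 3
```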
Abstract:
A method of dynamic cache configuration includes determining, for a first clustering configuration, whether a current cache miss rate exceeds a miss rate threshold. The first clustering configuration includes a plurality of graphics processing unit (GPU) compute units clustered into a first plurality of compute unit clusters. The method further includes clustering, based on the current cache miss rate exceeding the miss rate threshold, the plurality of GPU compute units into a second clustering configuration having a second plurality of compute unit clusters fewer than the first plurality of compute unit clusters.
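A small Python sketch of the reclustering decision, assuming the second configuration halves the cluster count; the halving policy and the round-robin grouping of compute units are illustrative choices, not the claimed method.

```python
# Sketch of miss-rate-driven reclustering of GPU compute units.

def recluster(compute_units, num_clusters, miss_rate, miss_rate_threshold):
    """If the current miss rate exceeds the threshold, regroup the GPU compute
    units into fewer clusters (here, half as many); otherwise keep the current
    clustering configuration."""
    if miss_rate > miss_rate_threshold and num_clusters > 1:
        num_clusters = max(1, num_clusters // 2)
    clusters = [[] for _ in range(num_clusters)]
    for i, cu in enumerate(compute_units):
        clusters[i % num_clusters].append(cu)
    return clusters

cus = [f"CU{i}" for i in range(8)]
print(recluster(cus, num_clusters=4, miss_rate=0.35, miss_rate_threshold=0.2))
# -> 2 clusters of 4 compute units each
```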
Abstract:
A system and method for efficiently processing memory requests are described. A computing system includes multiple compute units, multiple caches of a memory hierarchy, and a communication fabric. A compute unit generates a memory access request that misses in a higher level cache, which sends a miss request to a lower level shared cache. While servicing the miss request, the lower level shared cache merges identification information of multiple memory access requests targeting the same cache line from multiple compute units into a merged memory access response. The lower level shared cache continues to insert information into the merged memory access response until it is ready to issue that response. An intermediate router in the communication fabric expands the merged memory access response into multiple memory access responses, which it sends to the corresponding compute units.
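A minimal Python sketch of merging and then expanding responses; the message dictionaries, the class names, and the point at which the merged response is considered ready to issue are assumptions made for illustration.

```python
# Sketch of merging request identifiers at a lower level shared cache and
# expanding the merged response at an intermediate router.

class LowerLevelCache:
    def __init__(self):
        self.pending = {}     # cache line address -> list of (compute_unit, request_id)

    def record_miss(self, line_addr, compute_unit, request_id):
        """Merge identification info of requests that target the same cache line."""
        self.pending.setdefault(line_addr, []).append((compute_unit, request_id))

    def issue_merged_response(self, line_addr, data):
        """Stop merging and issue one response carrying all collected identifiers."""
        return {"line": line_addr, "data": data, "targets": self.pending.pop(line_addr)}

def router_expand(merged_response):
    """Intermediate router turns the merged response into one response per compute unit."""
    return [
        {"line": merged_response["line"], "data": merged_response["data"],
         "compute_unit": cu, "request_id": rid}
        for cu, rid in merged_response["targets"]
    ]

llc = LowerLevelCache()
llc.record_miss(0x80, "CU0", 11)
llc.record_miss(0x80, "CU3", 42)
for resp in router_expand(llc.issue_merged_response(0x80, b"line-data")):
    print(resp)
```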