PROCESSING DEVICE WITH INDEPENDENTLY ACTIVATABLE WORKING MEMORY BANK AND METHODS
    Invention Application (Granted)

    Publication No.: US20140181411A1

    Publication Date: 2014-06-26

    Application No.: US13723294

    Filing Date: 2012-12-21

    CPC classification number: G06F12/0891 G06F12/0804 G06F2212/601 Y02D10/13

    Abstract: A data processing device is provided that includes an array of working memory banks and an associated processing engine. The working memory bank array is configured with at least one independently activatable memory bank. A dirty data counter (DDC) is associated with the independently activatable memory bank and is configured to reflect a count of dirty data migrated from the independently activatable memory bank upon selective deactivation of the independently activatable memory bank. The DDC is configured to selectively decrement the count of dirty data upon the reactivation of the independently activatable memory bank in connection with a transient state. In the transient state, each dirty data access by the processing engine to the reactivated memory bank is also conducted with respect to another memory bank of the array. Upon a condition that dirty data is found in the other memory bank, the count of dirty data is decremented.

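The dirty-data counting scheme in the abstract can be illustrated with a small Python model. This is only a sketch under assumed semantics: the class, field, and method names (`BankWithDDC`, `deactivate`, `access`) are invented for illustration and do not come from the patent.

```python
class BankWithDDC:
    """Illustrative model of an independently activatable memory bank
    whose dirty lines migrate to a peer bank on deactivation."""

    def __init__(self):
        self.lines = {}       # address -> (value, dirty_flag)
        self.active = True
        self.ddc = 0          # dirty data counter (DDC)
        self.transient = False

    def deactivate(self, peer):
        # Migrate dirty lines to the peer bank, counting each one in the DDC.
        for addr, (value, dirty) in self.lines.items():
            if dirty:
                peer.lines[addr] = (value, True)
                self.ddc += 1
        self.lines.clear()
        self.active = False

    def reactivate(self):
        # Re-enter service in a transient state: while the DDC is nonzero,
        # dirty accesses are also checked against the other bank.
        self.active = True
        self.transient = self.ddc > 0

    def access(self, addr, peer):
        if self.transient and addr in peer.lines and peer.lines[addr][1]:
            # Dirty data found in the other bank: reclaim it and decrement.
            value, _ = peer.lines.pop(addr)
            self.lines[addr] = (value, True)
            self.ddc -= 1
            self.transient = self.ddc > 0
            return value
        return self.lines.get(addr, (None, False))[0]
```

Once the DDC drains to zero, the bank leaves the transient state and accesses no longer consult the other bank.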

    Permute Instructions for Register-Based Lookups

    Publication No.: US20240220247A1

    Publication Date: 2024-07-04

    Application No.: US18148873

    Filing Date: 2022-12-30

    CPC classification number: G06F9/30127 G06F9/30134

    Abstract: Permute instructions for register-based lookups are described. In accordance with the described techniques, a processor is configured to perform a register-based lookup by retrieving a first result from a first lookup table based on a subset of bits included in an index of a destination register, retrieving a second result from a second lookup table based on the subset of bits included in the index of the destination register, selecting the first result or the second result based on a bit in the index of the destination register that is excluded from the subset of bits, and overwriting data included in the index of the destination register using the selected one of the first result or the second result.
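The selection logic the abstract describes can be sketched in Python. This is a behavioral model only, assuming a 4-bit table index and a single select bit; the function name, bit widths, and table sizes are illustrative, not an actual ISA encoding.

```python
def permute_lookup(index_byte, lut0, lut1):
    """Model of the two-table register lookup: the low bits of the index
    pick an entry in each 16-entry table; one bit excluded from that
    subset selects which table's result overwrites the destination."""
    subset = index_byte & 0x0F        # bits included in the lookup
    first = lut0[subset]              # result from the first lookup table
    second = lut1[subset]             # result from the second lookup table
    select = (index_byte >> 4) & 1    # bit excluded from the subset
    return second if select else first
```

In effect, the two tables behave as one larger table without requiring a wider per-table index.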

    Mechanism for reducing coherence directory controller overhead for near-memory compute elements

    Publication No.: US12008378B2

    Publication Date: 2024-06-11

    Application No.: US18132879

    Filing Date: 2023-04-10

    Abstract: A parallel processing (PP) level coherence directory, also referred to as a Processing In-Memory Probe Filter (PimPF), is added to a coherence directory controller. When the coherence directory controller receives a broadcast PIM command from a host, or a PIM command that is directed to multiple memory banks in parallel, the PimPF accelerates processing of the PIM command by maintaining a directory for cache coherence that is separate from existing system level directories in the coherence directory controller. The PimPF maintains a directory according to address signatures that define the memory addresses affected by a broadcast PIM command. Two implementations are described: a lightweight implementation that accelerates PIM loads into registers, and a heavyweight implementation that accelerates both PIM loads into registers and PIM stores into memory.
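The filtering idea can be modeled briefly in Python. This is a sketch under assumptions: the class name `PimProbeFilter`, the signature keys, and the policy of probing only on first contact with a signature are illustrative stand-ins for the mechanism the abstract describes.

```python
class PimProbeFilter:
    """Sketch of a PIM-level probe filter: per address signature, it
    remembers which caches were already probed for a broadcast PIM
    command, so repeat commands skip the system-level directory."""

    def __init__(self):
        self.signatures = {}   # address signature -> set of sharer cache IDs

    def on_pim_command(self, signature, sharers_from_system_directory):
        if signature in self.signatures:
            # Signature already tracked: no further probes are required.
            return set()
        # First PIM command touching this signature: record the sharers
        # and return the caches that must be probed.
        self.signatures[signature] = set(sharers_from_system_directory)
        return self.signatures[signature]
```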

    Distributed coherence directory subsystem with exclusive data regions

    Publication No.: US11726915B2

    Publication Date: 2023-08-15

    Application No.: US16821632

    Filing Date: 2020-03-17

    CPC classification number: G06F12/0824 G06F12/084

    Abstract: A processing system includes a first set of one or more processing units including a first processing unit, a second set of one or more processing units including a second processing unit, and a memory having an address space shared by the first and second sets. The processing system further includes a distributed coherence directory subsystem having a first coherence directory to support a first subset of one or more address regions of the address space and a second coherence directory to support a second subset of one or more address regions of the address space. In some implementations, the first coherence directory is implemented in the system so as to have a lower access latency for the first set, whereas the second coherence directory is implemented in the system so as to have a lower access latency for the second set.
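The region-to-directory mapping can be sketched as a simple routing function. This is a minimal illustration, assuming contiguous address regions; the map layout and directory names are invented for the example, not taken from the patent.

```python
def route_to_directory(addr, region_map, default_dir):
    """Sketch: route a coherence lookup to the directory that owns the
    address region, so each set of processing units sees low latency
    for the regions it mostly accesses."""
    for (start, end), directory in region_map.items():
        if start <= addr < end:
            return directory
    return default_dir
```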

    Memory request priority assignment techniques for parallel processors

    Publication No.: US11507522B2

    Publication Date: 2022-11-22

    Application No.: US16706421

    Filing Date: 2019-12-06

    Abstract: Systems, apparatuses, and methods for implementing memory request priority assignment techniques for parallel processors are disclosed. A system includes at least a parallel processor coupled to a memory subsystem, where the parallel processor includes at least a plurality of compute units for executing wavefronts in lock-step. The parallel processor assigns priorities to memory requests of wavefronts on a per-work-item basis by indexing into a first priority vector, with the index generated based on lane-specific information. If a given event is detected, a second priority vector is generated by applying a given priority promotion vector to the first priority vector. Then, for subsequent wavefronts, memory requests are assigned priorities by indexing into the second priority vector with lane-specific information. The use of priority vectors to assign priorities to memory requests helps to reduce the memory divergence problem experienced by different work-items of a wavefront.
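The priority-vector indexing and promotion steps can be sketched in Python. The encoding of lane-specific information and the additive promotion policy are assumptions made for the example, not details from the patent.

```python
def assign_priority(lane_info, priority_vector):
    """Per-work-item priority: lane-specific information indexes into
    a priority vector (illustrative index encoding)."""
    return priority_vector[lane_info % len(priority_vector)]

def promote(priority_vector, promotion_vector):
    """On a given event, apply the promotion vector to produce the
    second priority vector used by subsequent wavefronts."""
    return [p + boost for p, boost in zip(priority_vector, promotion_vector)]
```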

    Adaptive cache reconfiguration via clustering

    Publication No.: US11360891B2

    Publication Date: 2022-06-14

    Application No.: US16355168

    Filing Date: 2019-03-15

    Abstract: A method of dynamic cache configuration includes determining, for a first clustering configuration, whether a current cache miss rate exceeds a miss rate threshold. The first clustering configuration includes a plurality of graphics processing unit (GPU) compute units clustered into a first plurality of compute unit clusters. The method further includes clustering, based on the current cache miss rate exceeding the miss rate threshold, the plurality of GPU compute units into a second clustering configuration having a second plurality of compute unit clusters fewer than the first plurality of compute unit clusters.
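The reconfiguration rule reduces to a small decision function. The halving policy below is an illustrative way to produce "fewer clusters"; the abstract only requires that the second configuration have fewer clusters than the first.

```python
def reconfigure_clusters(num_clusters, miss_rate, miss_rate_threshold):
    """Sketch of the dynamic cache reconfiguration rule: when the
    current cache miss rate exceeds the threshold, recluster the GPU
    compute units into fewer, larger clusters."""
    if miss_rate > miss_rate_threshold and num_clusters > 1:
        return max(1, num_clusters // 2)
    return num_clusters
```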

    MEMORY ACCESS RESPONSE MERGING IN A MEMORY HIERARCHY

    Publication No.: US20220091980A1

    Publication Date: 2022-03-24

    Application No.: US17031706

    Filing Date: 2020-09-24

    Abstract: A system and method for efficiently processing memory requests are described. A computing system includes multiple compute units, multiple caches of a memory hierarchy and a communication fabric. A compute unit generates a memory access request that misses in a higher level cache, which sends a miss request to a lower level shared cache. During servicing of the miss request, the lower level cache merges identification information of multiple memory access requests targeting a same cache line from multiple compute units into a merged memory access response. The lower level shared cache continues to insert information into the merged memory access response until it is ready to issue the response. An intermediate router in the communication fabric splits the merged memory access response into multiple memory access responses and sends them to the corresponding compute units.
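The merge-then-fan-out flow can be modeled compactly in Python. This is a sketch under assumptions: the class name, the `(compute_unit_id, request_id)` identification tuple, and the `broadcast` method are invented for illustration.

```python
class MergedResponse:
    """Sketch: a lower-level shared cache merges the IDs of requests
    that target the same cache line; an intermediate fabric router
    later fans the single response out to each requesting compute unit."""

    def __init__(self, line_addr, data):
        self.line_addr = line_addr
        self.data = data
        self.requesters = []   # (compute_unit_id, request_id) pairs

    def merge(self, cu_id, req_id):
        # Insert identification information for one more merged request.
        self.requesters.append((cu_id, req_id))

    def broadcast(self):
        # The router splits the merged response into one response
        # per requesting compute unit.
        return [(cu, rid, self.data) for cu, rid in self.requesters]
```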
