Abstract:
Systems, apparatuses, and methods for performing coherence processing and memory cache processing in parallel are disclosed. A system includes a communication fabric and a plurality of dual-processing pipelines. Each dual-processing pipeline includes a coherence processing pipeline and a memory cache processing pipeline. The communication fabric forwards a transaction to a given dual-processing pipeline, with the communication fabric selecting the given dual-processing pipeline, from the plurality of dual-processing pipelines, based on a hash of the address of the transaction. The given dual-processing pipeline performs a duplicate tag lookup in parallel with a memory cache tag lookup for the transaction. By performing the duplicate tag lookup and the memory cache tag lookup in a parallel fashion rather than in a serial fashion, latency and power consumption are reduced while performance is enhanced.
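A minimal C sketch of the routing and lookup flow described above, assuming a power-of-two number of pipelines and an illustrative XOR-fold hash; the names (select_pipeline, process_transaction), the 64-byte line size, and the lookup stubs are assumptions for illustration, and the back-to-back calls only stand in for lookups that the hardware launches in the same cycle:

```c
#include <stdint.h>

#define NUM_PIPELINES 4u   /* assumed power of two so the hash can be masked */
#define LINE_SHIFT    6u   /* assumed 64-byte cache lines */

/* XOR-fold the cache-line address into a pipeline index (illustrative hash). */
static uint32_t select_pipeline(uint64_t addr)
{
    uint64_t line = addr >> LINE_SHIFT;
    uint32_t h = (uint32_t)(line ^ (line >> 16) ^ (line >> 32));
    return h & (NUM_PIPELINES - 1u);
}

typedef struct {
    int dup_tag_hit;    /* result of the coherence (duplicate tag) lookup */
    int mem_cache_hit;  /* result of the memory cache tag lookup */
} lookup_result_t;

/* Stubs standing in for the two tag structures of the selected pipeline. */
int duplicate_tag_lookup(uint32_t pipe, uint64_t addr);
int memory_cache_tag_lookup(uint32_t pipe, uint64_t addr);

/* The fabric routes the transaction by address hash; the selected dual
 * pipeline then performs both lookups for the transaction. */
lookup_result_t process_transaction(uint64_t addr)
{
    uint32_t pipe = select_pipeline(addr);
    lookup_result_t r;
    r.dup_tag_hit   = duplicate_tag_lookup(pipe, addr);
    r.mem_cache_hit = memory_cache_tag_lookup(pipe, addr);
    return r;
}
```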
Abstract:
Various systems and methods for ensuring real-time snoop latency are disclosed. A system includes a processor and a cache controller. The cache controller receives, via a channel, cache snoop requests from the processor, the snoop requests including latency-sensitive and non-latency-sensitive requests. Requests are not prioritized by type within the channel. The cache controller limits the number of non-latency-sensitive snoop requests that can be processed ahead of an incoming latency-sensitive snoop request. Limiting the number of non-latency-sensitive snoop requests that can be processed ahead of an incoming latency-sensitive snoop request includes the cache controller determining that the number of received non-latency-sensitive snoop requests has reached a predetermined value and responsively prioritizing latency-sensitive requests over non-latency-sensitive requests.
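A behavioral C sketch of the counting scheme, under assumed names and structure (snoop_arbiter_t, bulk_ahead_limit); the reset-on-prioritization step is one plausible reading, not necessarily how the disclosed controller behaves:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t bulk_ahead;       /* non-latency-sensitive snoops processed ahead */
    uint32_t bulk_ahead_limit; /* the "predetermined value" from the abstract */
} snoop_arbiter_t;

/* Returns true if a latency-sensitive snoop should be serviced next. */
bool prefer_latency_sensitive(snoop_arbiter_t *arb,
                              bool lat_sensitive_pending,
                              bool bulk_pending)
{
    if (lat_sensitive_pending && arb->bulk_ahead >= arb->bulk_ahead_limit) {
        arb->bulk_ahead = 0;   /* reset once the latency-sensitive snoop wins */
        return true;
    }
    if (bulk_pending) {
        if (lat_sensitive_pending)
            arb->bulk_ahead++; /* another bulk snoop slips ahead */
        return false;          /* service the non-latency-sensitive snoop */
    }
    return lat_sensitive_pending;
}
```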
Abstract:
Systems and methods for managing fast-to-slow links in a bus fabric. A pair of link interface units connects agents with a clock mismatch. Each link interface unit includes an asynchronous FIFO for storing transactions that are sent over the clock domain crossing. When the command for a new transaction is ready to be sent while data for the previous transaction is still being sent, the link interface unit prevents the last data beat of the previous transaction from being sent. Instead, after a delay of one or more clock cycles, the last data beat overlaps with the command of the new transaction.
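A rough C sketch of the overlap rule on the sending side, with invented state names (link_tx_state_t and its fields); it models only the decision to hold back the final data beat and later send it alongside the new command, not the asynchronous FIFO or the clock-domain crossing itself:

```c
#include <stdbool.h>

typedef struct {
    int  data_beats_left;   /* beats of the previous transaction still to send */
    bool next_cmd_ready;    /* command of the new transaction is ready */
    bool hold_last_beat;    /* last beat deferred to overlap with the command */
} link_tx_state_t;

/* One sending-side clock edge of the link interface unit. */
void link_tx_cycle(link_tx_state_t *s)
{
    if (s->data_beats_left == 1 && s->next_cmd_ready && !s->hold_last_beat) {
        /* Do not send the final beat yet; defer it by at least one cycle. */
        s->hold_last_beat = true;
        return;
    }
    if (s->hold_last_beat) {
        /* Send the deferred last beat together with the new command. */
        s->hold_last_beat = false;
        s->data_beats_left = 0;
        s->next_cmd_ready = false;   /* command issued this cycle */
        return;
    }
    if (s->data_beats_left > 0)
        s->data_beats_left--;        /* normal data beat */
}
```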
Abstract:
Processors and methods for preventing lower level prefetch units from stalling at page boundaries. An upper level prefetch unit closest to the processor core issues a preemptive request for a translation of the next page in a given prefetch stream. The upper level prefetch unit sends the translation to the lower level prefetch units prior to the lower level prefetch units reaching the end of the current page for the given prefetch stream. When the lower level prefetch units reach the boundary of the current page, instead of stopping, these prefetch units can continue to prefetch by jumping to the next physical page number provided in the translation.
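A simplified C sketch of a lower-level prefetch stream crossing a page boundary using a pre-supplied translation; the field names, the 4 KB page size, and the 64-byte line size are assumptions for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096u
#define LINE_SIZE 64u

typedef struct {
    uint64_t next_addr;        /* next physical prefetch address */
    uint64_t next_page_base;   /* physical base of the next page (pre-translated) */
    bool     next_page_valid;  /* translation received from the upper-level unit */
} prefetch_stream_t;

/* Advance the stream by one line; jump to the next physical page instead of
 * stalling when the current page ends. Returns false only if the stream
 * would have to stop because no translation was provided. */
bool advance_stream(prefetch_stream_t *s)
{
    uint64_t next = s->next_addr + LINE_SIZE;
    if ((next & (PAGE_SIZE - 1u)) == 0u) {   /* reached the page boundary */
        if (!s->next_page_valid)
            return false;                    /* would stall without translation */
        next = s->next_page_base;            /* continue into the next page */
        s->next_page_valid = false;          /* translation consumed */
    }
    s->next_addr = next;
    return true;
}
```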
Abstract:
Systems, processors, and methods for keeping uncacheable data coherent. A processor includes a multi-level cache hierarchy, and uncacheable load memory operations can be cached at any level of the cache hierarchy. If an uncacheable load misses in the L2 cache, then allocation of the uncacheable load will be restricted to a subset of the ways of the L2 cache. If an uncacheable store memory operation hits in the L1 cache, then the hit cache line can be updated with the data from the memory operation. If the uncacheable store misses in the L1 cache, then the uncacheable store is sent to a core interface unit. Multiple contiguous store misses are merged into larger blocks of data in the core interface unit before being sent to the L2 cache.
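A small C sketch of restricting uncacheable-load allocation to a subset of L2 ways; the way count, the two-way mask, and the free-way-bitmap interface are illustrative assumptions rather than details from the disclosure:

```c
#include <stdbool.h>
#include <stdint.h>

#define L2_WAYS              16u
#define UNCACHEABLE_WAY_MASK 0x3u   /* assumed: uncacheable fills limited to ways 0-1 */

/* Returns the way to allocate into, or -1 if no eligible way is available.
 * Uncacheable loads that missed in the L2 may only use the restricted subset;
 * cacheable fills may use any way. */
int pick_allocation_way(uint32_t free_way_mask, bool uncacheable)
{
    uint32_t eligible = uncacheable ? (free_way_mask & UNCACHEABLE_WAY_MASK)
                                    : free_way_mask;
    for (unsigned w = 0; w < L2_WAYS; w++) {
        if (eligible & (1u << w))
            return (int)w;
    }
    return -1;
}
```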
Abstract:
A memory controller circuit receives memory access requests from a network of a computer system. Entries are reserved for these requests in a retry queue circuit. An arbitration circuit of the memory controller circuit issues those requests to a tag pipeline circuit that determines whether the received memory access requests hit in a memory cache. As a memory access request passes through the tag pipeline circuit, it may require another pass through this pipeline, for example if resources needed to complete the memory access request, such as a snoop queue circuit or other storage circuits, are unavailable. The reservation that has been made in the retry queue circuit thus keeps the request from having to be returned to the network for resubmission to the memory controller circuit if initial processing of the memory access request cannot be completed.
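A C sketch, with hypothetical structures and names, of reserving a retry queue entry up front so that a request needing another tag-pipeline pass can be replayed locally instead of being returned to the network:

```c
#include <stdbool.h>
#include <stdint.h>

#define RETRY_QUEUE_DEPTH 32

typedef struct {
    uint64_t addr;
    bool     valid;
} retry_entry_t;

typedef struct {
    retry_entry_t entries[RETRY_QUEUE_DEPTH];
} retry_queue_t;

/* Reserve an entry when the request is accepted from the fabric.
 * Returns a reservation handle, or -1 if the request cannot be accepted. */
int reserve_retry_entry(retry_queue_t *q, uint64_t addr)
{
    for (int i = 0; i < RETRY_QUEUE_DEPTH; i++) {
        if (!q->entries[i].valid) {
            q->entries[i].valid = true;
            q->entries[i].addr  = addr;
            return i;
        }
    }
    return -1;
}

/* After a tag-pipeline pass: either complete and free the slot, or keep the
 * reservation so arbitration can reissue the request from here later. */
void finish_pass(retry_queue_t *q, int slot, bool needs_another_pass)
{
    if (!needs_another_pass)
        q->entries[slot].valid = false;   /* done: release the reservation */
    /* else: entry stays valid; the request replays without going back
     * to the network */
}
```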
Abstract:
A method and apparatus for evicting cache lines from a cache memory includes receiving a request from one of a plurality of processors. The cache memory is configured to store a plurality of cache lines, and a given cache line includes an identifier indicating a processor that performed a most recent access of the given cache line. The method further includes selecting a cache line for eviction from a group of least recently used cache lines, where each cache line of the group of least recently used cache lines occupies a priority position less than a predetermined value, and then evicting the selected cache line.
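An illustrative C sketch of selecting a victim from the group of least recently used lines, i.e. lines whose LRU priority position is below the predetermined value; the threshold, way count, and the requester-based tie-break are assumptions, not necessarily the disclosed policy:

```c
#include <stdint.h>

#define SET_WAYS            16
#define LRU_GROUP_THRESHOLD 4    /* stands in for the "predetermined value" */

typedef struct {
    uint8_t lru_position;    /* 0 = least recently used */
    uint8_t last_accessor;   /* ID of the processor that last accessed the line */
} cache_line_meta_t;

/* Choose a victim among lines with lru_position < LRU_GROUP_THRESHOLD.
 * Preferring a line last touched by the requesting processor is one possible
 * use of the identifier, shown here only as an example tie-break. */
int select_eviction_way(const cache_line_meta_t set[SET_WAYS], uint8_t requester_id)
{
    int victim = -1;
    for (int w = 0; w < SET_WAYS; w++) {
        if (set[w].lru_position >= LRU_GROUP_THRESHOLD)
            continue;                          /* not in the LRU group */
        if (victim < 0 || set[w].last_accessor == requester_id)
            victim = w;
    }
    return victim;   /* -1 only if no line falls inside the LRU group */
}
```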
Abstract:
Systems and methods for reducing leakage power in an L2 cache within a SoC. The L2 cache is partitioned into multiple banks, and each bank has its own separate power supply. An idle counter is maintained for each bank to count a number of cycles during which the bank has been inactive. The temperature and leaky factor of the SoC are used to select an operating point of the SoC. Based on the operating point, an idle counter threshold is set, with a high temperature and high leaky factor corresponding to a relatively low idle counter threshold, and with a low temperature and low leaky factor corresponding to a relatively high idle counter threshold. When a given idle counter exceeds the idle counter threshold, the voltage supplied to the corresponding bank is reduced to a voltage sufficient for retention of data but not for access.
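A C sketch of the per-bank idle counter and the operating-point-dependent threshold; the three operating points and the specific cycle counts are illustrative assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { OP_COOL_LOW_LEAK, OP_NOMINAL, OP_HOT_HIGH_LEAK } op_point_t;

/* Illustrative thresholds, in idle cycles: hotter/leakier operating points
 * use a lower threshold so idle banks drop to retention voltage sooner. */
static uint32_t idle_threshold(op_point_t op)
{
    switch (op) {
    case OP_HOT_HIGH_LEAK: return 256;
    case OP_NOMINAL:       return 1024;
    default:               return 4096;   /* cool and low leakage: wait longer */
    }
}

typedef struct {
    uint32_t idle_cycles;
    bool     in_retention;   /* supply lowered to the data-retention voltage */
} l2_bank_t;

/* Called once per cycle for each bank. */
void bank_tick(l2_bank_t *bank, bool bank_active, op_point_t op)
{
    if (bank_active) {
        bank->idle_cycles = 0;
        bank->in_retention = false;   /* full voltage restored before access */
        return;
    }
    if (!bank->in_retention && ++bank->idle_cycles > idle_threshold(op))
        bank->in_retention = true;    /* retention voltage: keeps data, no access */
}
```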
Abstract:
Systems, processors, and methods for sharing an agent's private cache with other agents within a SoC. Many agents in the SoC have a private cache in addition to the shared caches and memory of the SoC. If an agent's processor is shut down or operating at less than full capacity, the agent's private cache can be shared with other agents. When a requesting agent generates a memory request and the memory request misses in the memory cache, the memory cache can allocate the request in another agent's private cache rather than in the memory cache.
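A C sketch, using hypothetical structures, of choosing where to allocate on a memory-cache miss when another agent has lent its private cache; the lendable flag and the first-fit scan are assumptions for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_AGENTS 8

typedef struct {
    bool cache_lendable;   /* agent's processor is off or throttled; its
                              private cache may host other agents' lines */
} agent_t;

/* Returns the ID of an agent whose private cache can host the missing line,
 * or -1 to fall back to allocating in the memory cache (or not allocating). */
int pick_allocation_target(const agent_t agents[NUM_AGENTS], int requester_id)
{
    for (int a = 0; a < NUM_AGENTS; a++) {
        if (a == requester_id)
            continue;                   /* the requester already missed */
        if (agents[a].cache_lendable)
            return a;                   /* allocate in this agent's private cache */
    }
    return -1;
}
```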