Abstract:
According to one general aspect, an apparatus may include a cache pre-fetcher and a pre-fetch scheduler. The cache pre-fetcher may be configured to predict, based at least in part upon a virtual address, data to be retrieved from a memory system. The pre-fetch scheduler may be configured to convert the virtual address of the data to a physical address of the data, and request the data from one of a plurality of levels of the memory system. The memory system may include a plurality of levels, each level of the memory system configured to store data.
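The division of labor described above can be sketched in software. This is only an illustrative model, not the claimed hardware: the next-line predictor, the 4 KiB page size, the 64-byte line size, the dictionary page table, and the names `predict_next` and `PrefetchScheduler` are all assumptions made for the example.

```python
PAGE_SHIFT = 12   # assumed 4 KiB pages
LINE_BYTES = 64   # assumed cache-line size


def predict_next(virtual_address, stride=LINE_BYTES):
    """Hypothetical next-line predictor that works purely on the virtual address."""
    return virtual_address + stride


class PrefetchScheduler:
    def __init__(self, page_table, levels):
        self.page_table = page_table  # virtual page number -> physical frame number
        self.levels = levels          # list of (name, set of resident physical line addresses)

    def translate(self, virtual_address):
        vpn = virtual_address >> PAGE_SHIFT
        offset = virtual_address & ((1 << PAGE_SHIFT) - 1)
        pfn = self.page_table.get(vpn)
        return None if pfn is None else (pfn << PAGE_SHIFT) | offset

    def request(self, virtual_address):
        physical = self.translate(virtual_address)
        if physical is None:
            return None  # no translation available, so the pre-fetch is dropped
        line = physical & ~(LINE_BYTES - 1)
        # Walk the levels in order and pick the first one holding the line.
        for name, resident in self.levels:
            if line in resident:
                return name, physical
        return "memory", physical


scheduler = PrefetchScheduler({0x1: 0x80}, [("L1", set()), ("L2", {0x80040})])
target = scheduler.request(predict_next(0x1000))
```

Here the predictor never sees a physical address; only the scheduler translates and chooses a level, which mirrors the separation stated in the abstract.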
Abstract:
A system and a method to cascade execution of instructions in a load-store unit (LSU) of a central processing unit (CPU) to reduce latency associated with the instructions. First data stored in a cache is read by the LSU in response to a first memory load instruction of two immediately consecutive memory load instructions. Alignment, sign extension and/or endian operations are performed on the first data read from the cache in response to the first memory load instruction, and, in parallel, a memory-load address-forwarded result is selected based on a corrected alignment of the first data read in response to the first memory load instruction to provide a next address for a second of the two immediately consecutive memory load instructions. Second data stored in the cache is read by the LSU in response to the second memory load instruction based on the selected memory-load address-forwarded result.
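The data flow of two cascaded loads (for example, following a pointer) can be sketched as below. Software cannot reproduce the parallel timing that gives the latency benefit, so the sketch only shows which value feeds which step. The 16-byte line size, little-endian layout, 4-byte width, and the names `read_line` and `cascaded_loads` are assumptions for illustration.

```python
LINE_BYTES = 16  # assumed line size for this toy cache


def read_line(cache, address):
    """Return the cache line holding the address and the byte offset within it."""
    offset = address % LINE_BYTES
    return cache[address - offset], offset


def cascaded_loads(cache, first_address, width=4):
    line, offset = read_line(cache, first_address)
    raw = line[offset:offset + width]

    # Path A: full formatting (alignment, sign extension, endianness)
    # produces the architectural result of the first load.
    first_result = int.from_bytes(raw, "little", signed=True)

    # Path B (in hardware, concurrent with path A): the aligned bytes are
    # forwarded directly as the address of the second load.
    next_address = int.from_bytes(raw, "little", signed=False)

    line2, offset2 = read_line(cache, next_address)
    second_result = int.from_bytes(line2[offset2:offset2 + width], "little", signed=True)
    return first_result, second_result


# Address 4 holds the pointer value 24; address 24 holds the value -7.
cache = {
    0: bytes(4) + (24).to_bytes(4, "little") + bytes(8),
    16: bytes(8) + (-7).to_bytes(4, "little", signed=True) + bytes(4),
}
results = cascaded_loads(cache, 4)
```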
Abstract:
According to one general aspect, an apparatus may include a front end logic section comprising a main branch target buffer (BTB). The apparatus may also include a micro-BTB separate from the main BTB, and configured to produce prediction information associated with a branching instruction and mark prediction information as verified when one or more conditions are satisfied. The front end logic section is configured to be, at least partially, powered down when the data stored by the micro-BTB that results in the prediction information is marked as previously verified.
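A toy model of the verified-entry idea follows. The abstract does not say which conditions mark a prediction as verified, so the sketch assumes one: the same target has been confirmed a fixed number of times in a row. The class `MicroBTB`, the threshold, and `front_end_powered` are hypothetical names for illustration.

```python
class MicroBTB:
    def __init__(self, verify_threshold=2):
        self.entries = {}  # branch PC -> [predicted target, consecutive confirmations]
        self.verify_threshold = verify_threshold  # assumed verification condition

    def predict(self, pc):
        """Return (target, verified); target is None when there is no entry."""
        entry = self.entries.get(pc)
        if entry is None:
            return None, False
        target, confirmations = entry
        return target, confirmations >= self.verify_threshold

    def update(self, pc, actual_target):
        entry = self.entries.get(pc)
        if entry is not None and entry[0] == actual_target:
            entry[1] += 1  # prediction confirmed again
        else:
            self.entries[pc] = [actual_target, 0]  # new or mispredicted: start over


def front_end_powered(micro_btb, pc):
    """The main front end stays powered unless the micro-BTB entry is verified."""
    _, verified = micro_btb.predict(pc)
    return not verified


ubtb = MicroBTB()
for _ in range(3):
    ubtb.update(0x40, 0x80)
```

After the three updates the entry for `0x40` has two confirmations, so `front_end_powered(ubtb, 0x40)` returns `False`; a later update with a different target resets the entry and the front end is needed again.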
Abstract:
A Fill Buffer (FB) based data forwarding scheme stores a combination of Virtual Address (VA), TLB (Translation Look-aside Buffer) entry# or an indication of a location of a Page Table Entry (PTE) in the TLB, and TLB page size information in the FB and uses these values to expedite FB forwarding. Load (Ld) operations send their non-translated VA for an early comparison against the VA entries in the FB, and are then further qualified with the TLB entry# to determine a “hit.” This hit determination is fast and enables FB forwarding at higher frequencies without waiting for a comparison of Physical Addresses (PA) to conclude in the FB. A safety mechanism may detect a false hit in the FB and generate a late load cancel indication to cancel the earlier-started FB forwarding by ignoring the data obtained as a result of the Ld execution. The Ld is then re-executed later and tries to complete successfully with the correct data.
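The early-hit and late-cancel sequence can be sketched as follows. This is a simplified model under stated assumptions: 64-byte lines, a single fixed page size (so the stored page size information is omitted), and hypothetical names `FillBufferEntry`, `early_hit`, and `load`.

```python
from dataclasses import dataclass

LINE_SHIFT = 6  # assumed 64-byte lines


@dataclass
class FillBufferEntry:
    va_line: int    # virtual line address stored with the fill
    tlb_entry: int  # TLB entry number that translated the fill
    pa_line: int    # physical line address, known only to the slower check
    data: bytes


def early_hit(fill_buffer, va, tlb_entry):
    """Fast check: untranslated VA match, qualified by the TLB entry number."""
    for entry in fill_buffer:
        if entry.va_line == va >> LINE_SHIFT and entry.tlb_entry == tlb_entry:
            return entry
    return None


def load(fill_buffer, va, tlb_entry, pa):
    entry = early_hit(fill_buffer, va, tlb_entry)
    if entry is None:
        return "miss", None
    # Forwarding has already started by the time the PA comparison finishes.
    if entry.pa_line != pa >> LINE_SHIFT:
        return "cancel", None  # false hit: discard the data, replay the load later
    return "forward", entry.data


fb = [FillBufferEntry(va_line=0x1000 >> LINE_SHIFT, tlb_entry=3,
                      pa_line=0x8000 >> LINE_SHIFT, data=b"line")]
```

With this buffer, `load(fb, 0x1008, 3, 0x8008)` forwards the data, the same load with TLB entry 4 misses the early check, and a load whose physical address turns out to be `0x9008` passes the early check but is cancelled by the safety check.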
Abstract:
According to one general aspect, a load unit may include a load circuit configured to load at least one piece of data from a memory. The load unit may include an alignment circuit configured to align the data to generate aligned data. The load unit may also include a mathematical operation execution circuit configured to generate a resultant of a predetermined mathematical operation with the at least one piece of data as an operand. The load unit is configured to, if an active instruction is associated with the predetermined mathematical operation, bypass the alignment circuit and input the piece of data directly to the mathematical operation execution circuit.
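The bypass can be illustrated with a small model. The abstract does not name the predetermined mathematical operation; the sketch assumes a population count, chosen only because its result does not depend on where the bytes sit within the loaded word. The function names and the integer representation of the loaded word are likewise assumptions.

```python
def align(line_word, byte_offset, width):
    """Alignment path: shift the requested bytes down to bit 0 and mask them."""
    return (line_word >> (8 * byte_offset)) & ((1 << (8 * width)) - 1)


def popcount_unaligned(line_word, byte_offset, width):
    """Math path: count set bits in place, without shifting the data first."""
    mask = ((1 << (8 * width)) - 1) << (8 * byte_offset)
    return bin(line_word & mask).count("1")


def load_unit(line_word, byte_offset, width, math_op=False):
    if math_op:
        # Active instruction uses the predetermined operation: bypass alignment.
        return popcount_unaligned(line_word, byte_offset, width)
    return align(line_word, byte_offset, width)


word = 0x00FFA500  # byte 1 of this word is 0xA5
aligned = load_unit(word, byte_offset=1, width=1)
bit_count = load_unit(word, byte_offset=1, width=1, math_op=True)
```

In this example `aligned` is `0xA5` and `bit_count` is 4, the number of set bits in `0xA5`, obtained without passing through `align`.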
Abstract:
According to one general aspect, a method may include receiving, by a pre-fetch unit, a demand to access data stored at a memory address. The method may include determining if a first portion of the memory address matches a previously defined region of memory. The method may further include determining if a second portion of the memory address matches a previously detected pre-fetched address portion. The method may also include, if the first portion of the memory address matches the previously defined region of memory, and the second portion of the memory address matches the previously detected pre-fetched address portion, confirming that a pre-fetch pattern is associated with the memory address.
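The two-part match can be sketched as below. The split of the address into an upper region tag and a lower 12-bit offset is an assumption, as are the names `PatternTracker` and `confirm`; the abstract does not specify how the two portions are chosen.

```python
from dataclasses import dataclass, field

REGION_SHIFT = 12  # assumed: bits above the low 12 select the region


@dataclass
class PatternTracker:
    region: int                                            # previously defined region tag
    prefetched_offsets: set = field(default_factory=set)   # previously pre-fetched address portions

    def confirm(self, address: int) -> bool:
        first_portion = address >> REGION_SHIFT
        second_portion = address & ((1 << REGION_SHIFT) - 1)
        # The pattern is confirmed only when both portions match.
        return first_portion == self.region and second_portion in self.prefetched_offsets


tracker = PatternTracker(region=0x12, prefetched_offsets={0x040, 0x080})
```

A demand for address `0x12040` confirms the pattern; `0x13040` fails the region check and `0x120C0` fails the offset check.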