摘要:
An instruction buffer for a processor configured to execute multiple threads is disclosed. The instruction buffer is configured to receive instructions from a fetch unit and provide instructions to a selection unit. The instruction buffer includes one or more memory arrays comprising a plurality of entries configured to store instructions and/or other information (e.g., program counter addresses). One or more indicators are maintained by the processor and correspond to the plurality of threads. The one or more indicators are usable such that for instructions received by the instruction buffer, one or more of the plurality entries of a memory array can be determined as a write destination for the received instructions, and for instructions to be read from the instruction buffer (and sent to a selection unit), one or more entries can be determined as the correct source location from which to read.
摘要:
A system and method for balancing instruction loads between multiple execution units are disclosed. One or more execution units may be represented by a slot configured to accept instructions on behalf of the execution unit(s). A decode unit may assign instructions to a particular slot for subsequent scheduling for execution. Slot assignments may be made based on an instruction's type and/or on a history of previous slot assignments. A cumulative slot assignment history may be maintained in a bias counter, the value of which reflects the bias of previous slot assignments. Slot assignments may be determined based on the value of the bias counter, in order to balance the instruction load across all slots, and all execution units. The bias counter may reflect slot assignments made only within a desired historical window. A separate data structure may store data reflecting the actual slot assignments made during the desired historical window.
摘要:
A processor includes an instruction fetch unit configured to issue instructions for execution, where the instructions are selected from a number of threads, where each given instruction has a corresponding thread identifier, and where at least some of the instructions specify operand(s) via register identifiers. A register file stores operands usable by the instructions, and may include several banks, each corresponding to a register identifiers and including several entries corresponding to the several threads, wherein the entries are configured to store data values. In response to receiving a request to read a particular register identifier for a given thread identifier, the register file may be configured to decode the given thread identifier to retrieve entries from the banks that correspond to the given thread identifier. The register file may further select, from among the retrieved entries, a data value corresponding to the particular register identifier to be output.
摘要:
Sharing functional units within a multithreaded processor. In one embodiment, the multithreaded processor may include a multithreaded instruction source that may provide an instruction from each of a plurality of thread groups in a given cycle. A given thread group may include one or more instructions from one or more threads. The arbitration functionality may arbitrate between the plurality of thread groups for access to a functional unit such as a load store unit, for example, that may be shared between the thread groups.
摘要:
Systems and methods for efficient execution of operations in a multi-threaded processor. Each thread may include a blocking instruction. A blocking instruction blocks other threads from utilizing hardware resources for an appreciable amount of time. One example of a blocking type instruction is a Montgomery multiplication cryptographic instruction. Each thread can operate in a thread-based mode that allows the insertion of stall cycles during the execution of blocking instructions, during which other threads may utilize the previously blocked hardware resources. At times when multiple threads are scheduled to execute blocking instructions, the thread-based mode may be changed to increase throughput for these multiple threads. For example, the mode may be changed to disallow the insertion of stall cycles. Therefore, the time for sequential operation of the blocking instructions corresponding to the multiple threads may be reduced.
摘要:
A method and mechanism for managing access to a plurality of registers in a processing device are contemplated. A processing device includes multiple nodes coupled to a ring bus, each of which include one or more registers which may be accessed by processes executing within the device. Also coupled to the ring bus is a ring control unit which is configured to initiate transactions targeted to nodes on the ring bus. Each of the nodes are configured receive and process bus transaction with a fixed latency whether or not the first transaction is targeted to the receiving node. The ring control unit is configured to periodically convey idle transactions on the ring bus in order to allow nodes responding to indeterminate transactions to gain access to the bus.
摘要:
In one embodiment, a processor comprises a register file, register management logic coupled to the register file, and at least two sources of window swap operations coupled to the register management logic. The register management logic is configured to control an interface to the register file to switch register windows in the register file in response to one or more window swap operations. The sources of window swap operations and the register management logic are configured to cooperate according to an arbitration scheme to arbitrate between conflicting window swap operations to be performed using the interface. In one particular implementation, for example, block signals may be used from higher priority sources to lower priority sources to block issuance of window swap operations by the lower priority sources.
摘要:
A method to communicate data is disclosed which includes communicating a virtual address to a translation lookaside buffer (TLB) and translating the virtual address to a physical address of a computer memory. The method also includes loading the physical address translated by the TLB into a register within a processor and transmitting the data from the physical address to a destination computing device.
摘要:
In one embodiment, a processor comprises a plurality of pipeline stages and a first circuit operable at a first pipeline stage of the plurality of pipeline stages. The first circuit is configured to maintain a plurality of program counters (PCs), each of which corresponds to one of a plurality of threads that the processor is configured to have concurrently in process with respect to the plurality of pipeline stages. The first circuit is configured to provide a first PC to a second pipeline stage of the plurality of pipeline stages. The first PC is derived from one of the plurality of PCs that corresponds to a first thread of the plurality of threads, and a first instruction entering the second pipeline stage is from the first thread.
摘要:
A superscalar processor and method for executing fixed-point instructions within a superscalar processor are disclosed. The superscalar processor has a memory and multiple execution units, including a fixed point execution unit (FXU) and a non-fixed point execution unit (non-FXU). According to the present invention, a set of instructions to be executed are fetched from among a number of instructions stored within memory. A determination is then made if n instructions, the maximum number possible, can be dispatched to the multiple execution units during a first processor cycle if fixed point arithmetic and logical instructions are dispatched only to the FXU. If so, n instructions are dispatched to the multiple execution units for execution. In response to a determination that n instructions cannot be dispatched during the first processor cycle, a determination is made whether a fixed point instruction is available to be dispatched and whether dispatching the fixed point instruction to the non-FXU for execution will result in greater efficiency. In response to a determination that a fixed point instruction is not available to be dispatched or that dispatching the fixed point instruction to the non-FXU will not result in greater efficiency, dispatch of the fixed point instruction is delayed until a second processor cycle. However, in response to a determination that dispatching the fixed point instruction to the non-FXU will result in greater efficiency, the fixed point instruction is dispatched to the non-FXU and executed, thereby improving execution unit utilization.