Abstract:
A computer-readable storage medium, method and system for optimization-level aware branch prediction is described. A gear level is assigned to a set of application instructions that have been optimized. The gear level is also stored in a register of a branch prediction unit of a processor. Branch prediction is then performed by the processor based upon the gear level.
Abstract:
A combination of hardware and software collect profile data for asynchronous events, at code region granularity. An exemplary embodiment is directed to collecting metrics for prefetching events, which are asynchronous in nature. Instructions that belong to a code region are identified using one of several alternative techniques, causing a profile bit to be set for the instruction, as a marker. Each line of a data block that is prefetched is similarly marked. Events corresponding to the profile data being collected and resulting from instructions within the code region are then identified. Each time that one of the different types of events is identified, a corresponding counter is incremented. Following execution of the instructions within the code region, the profile data accumulated in the counters are collected, and the counters are reset for use with a new code region.
Abstract:
In one embodiment, the present invention includes a method for determining if data of a memory request by a first agent is in a memory region represented by a region indicator of a region table of the first agent, and transmitting a compressed address for the memory request to other agents of a system if the memory region is represented by the region indicator, otherwise transmitting a full address. Other embodiments are described and claimed.
Abstract:
An apparatus associated with identifying a critical thread based on information gathered during meeting point processing is provided. One embodiment of the apparatus may include logic to selectively update meeting point counts for threads upon determining that they have arrived at a meeting point. The embodiment may also include logic to periodically identify which thread in a set of threads is a critical thread. The critical thread may be the slowest thread and criticality may be determined by examining meeting point counts. The embodiment may also include logic to selectively manipulate a configurable attribute of the critical thread and/or core upon which the critical thread will run.
Abstract:
Methods and apparatus to provide leakage power estimation are described. In one embodiment, one or more sensed temperature values (108) and one or more voltage values (110) are utilized to determine the leakage power of an integrated circuit (IC) component. Other embodiments are also described.
Abstract:
A combination of hardware and software collect profile data for asynchronous events, at code region granularity. An exemplary embodiment is directed to collecting metrics for prefetching events, which are asynchronous in nature. Instructions that belong to a code region are identified using one of several alternative techniques, causing a profile bit to be set for the instruction, as a marker. Each line of a data block that is prefetched is similarly marked. Events corresponding to the profile data being collected and resulting from instructions within the code region are then identified. Each time that one of the different types of events is identified, a corresponding counter is incremented. Following execution of the instructions within the code region, the profile data accumulated in the counters are collected, and the counters are reset for use with a new code region.
Abstract:
Methods, apparatus, instructions and logic are disclosed providing double rounded combined floating-point multiply and add functionality as scalar or vector SIMD instructions or as fused micro-operations. Embodiments include detecting floating-point (FP) multiplication operations and subsequent FP operations specifying as source operands results of the FP multiplications. The FP multiplications and the subsequent FP operations are encoded as combined FP operations including rounding of the results of FP multiplication followed by the subsequent FP operations. The encoding of said combined FP operations may be stored and executed as part of an executable thread portion using fused-multiply-add hardware that includes overflow detection for the product of FP multipliers, first and second FP adders to add third operand addend mantissas and the products of the FP multipliers with different rounding inputs based on overflow, or no overflow, in the products of the FP multiplier. Final results are selected respectively using overflow detection.
Abstract:
A method and system to selectively move one or more of a plurality threads which are executing in parallel by a plurality of processing cores. In one embodiment, a thread may be moved from executing in one of the plurality of processing cores to executing in another of the plurality of processing cores, the moving based on a performance characteristic associated with the plurality of threads. In another embodiment of the invention, a power state of the plurality of processing cores may be changed to improve a power efficiency associated with the executing of the multiple threads.
Abstract:
Techniques are described for providing an enhanced cache coherency protocol for a multi-core processor that includes a Speculative Request For Ownership Without Data (SRFOWD) for a portion of cache memory. With a SRFOWD, only an acknowledgement message may be provided as an answer to a requesting core. The contents of the affected cache line are not required to be a part of the answer. The enhanced cache coherency protocol may assure that a valid copy of the current cache line exists in case of misspeculation by the requesting core. Thus, an owner of the current copy of the cache line may maintain a copy of the old contents of the cache line. The old contents of the cache line may be discarded if speculation by the requesting core turns out to be correct. Otherwise, in case of misspeculation by the requesting core, the old contents of the cache line may be set back to a valid state.
Abstract:
A method and apparatus for scaling frequency and operating voltage of at least one clock domain of a microprocessor. More particularly, embodiments of the invention relate to techniques to divide a microprocessor into clock domains and control the frequency and operating voltage of each clock domain independently of the others.