Abstract:
Systems and methods relate to efficient memory operations. A single instruction multiple data (SIMD) gather operation is implemented with a gather result buffer located within or in close proximity to memory, to receive or gather multiple data elements from multiple orthogonal locations in a memory, and once the gather result buffer is complete, the gathered data is transferred to a processor register. A SIMD copy operation is performed by executing two or more instructions for copying multiple data elements from multiple orthogonal source addresses to corresponding multiple destination addresses within the memory, without an intermediate copy to a processor register. Thus, the memory operations are performed in a background mode without direction by the processor.
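To illustrate the idea (not the patented hardware itself), a minimal Python sketch follows, assuming a simple memory model: element fetches fill a buffer that sits logically next to memory, and only the completed buffer is moved to the processor "register" in a single transfer. The names gather_near_memory and NUM_LANES are invented for this sketch.

```python
# Behavioral sketch of a gather result buffer located near memory.
NUM_LANES = 8

def gather_near_memory(memory, indices):
    assert len(indices) == NUM_LANES
    gather_result_buffer = [None] * NUM_LANES    # sits "next to" memory
    for lane, idx in enumerate(indices):         # filled in the background, element by element
        gather_result_buffer[lane] = memory[idx]
    # only once the buffer is complete is it transferred to the processor register
    simd_register = list(gather_result_buffer)
    return simd_register

memory = list(range(100, 200))
print(gather_near_memory(memory, [3, 17, 42, 5, 99, 0, 64, 8]))
```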
Abstract:
Systems and methods for implementing certain load instructions, such as vector load instructions, by cooperation of a main processor and a coprocessor. The load instructions which are identified by the main processor for offloading to the coprocessor are committed in the main processor without receiving corresponding load data. Post-commit, the load instructions are processed in the coprocessor, such that latencies incurred in fetching the load data are hidden from the main processor. By implementing an out-of-order load data buffer associated with an in-order instruction buffer, the coprocessor is also configured to avoid stalls due to long latencies which may be involved in fetching the load data from levels of the memory hierarchy, such as L2, L3, L4 caches, main memory, etc.
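A behavioral sketch of this division of labor, under assumed simplifications (Python objects standing in for hardware buffers; names such as LoadCoprocessor and retire_ready are invented here): the main processor commits and enqueues loads in order, load data may return in any order into a separate buffer, and results retire strictly in program order.

```python
from collections import deque

class LoadCoprocessor:
    def __init__(self):
        self.instr_buffer = deque()   # in-order instruction buffer
        self.data_buffer = {}         # out-of-order load data buffer, keyed by tag

    def offload(self, tag, address):
        self.instr_buffer.append((tag, address))  # main processor commits and moves on

    def data_returned(self, tag, value):
        self.data_buffer[tag] = value             # load data may arrive in any order

    def retire_ready(self):
        retired = []
        # retire strictly in program order, only once data has arrived
        while self.instr_buffer and self.instr_buffer[0][0] in self.data_buffer:
            tag, _ = self.instr_buffer.popleft()
            retired.append((tag, self.data_buffer.pop(tag)))
        return retired

cp = LoadCoprocessor()
cp.offload(0, 0x100); cp.offload(1, 0x200)
cp.data_returned(1, 42)          # younger load's data arrives first
print(cp.retire_ready())         # [] -- older load still outstanding
cp.data_returned(0, 7)
print(cp.retire_ready())         # [(0, 7), (1, 42)]
```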
Abstract:
Matrix multiply operations may use a reduced result matrix to increase the speed and accuracy of the operation. In one example, each higher precision row/column is decomposed into multiple component rows/columns of the base type that can be combined as weighted sums to form the original higher precision row/column. In another example, the decomposition may be independent for each input matrix and decompose to any multiple of the base type. In another example, the base type for each input matrix could be different. In another example, after decomposition, a matrix operation is performed (e.g., matrix multiply, convolutional layer, or another matrix operation) on decomposed base type input matrices to yield a result matrix that contains components of the higher precision results. The results may be combined to obtain higher-precision results.
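The decomposition can be sketched concretely with NumPy, under the assumption of a bfloat16-like base type emulated by truncating fp32 mantissas (the function names are illustrative, not from the disclosure): each higher precision matrix becomes a coarse component plus a residual, and the higher precision product is recovered as a sum of base-type products.

```python
import numpy as np

def to_base_type(x):
    # emulate a bf16-like base type by keeping only the top 7 mantissa bits of fp32
    as_int = x.astype(np.float32).view(np.uint32)
    return (as_int & 0xFFFF0000).view(np.float32)

def decompose(m):
    hi = to_base_type(m)
    lo = to_base_type(m.astype(np.float32) - hi)
    return hi, lo                        # m is approximately hi + lo

rng = np.random.default_rng(0)
a, b = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
a_hi, a_lo = decompose(a)
b_hi, b_lo = decompose(b)

# four base-type multiplies combined (unit weights here) recover the higher precision result
approx = a_hi @ b_hi + a_hi @ b_lo + a_lo @ b_hi + a_lo @ b_lo
print(np.max(np.abs(approx - a @ b)))    # small residual vs. the exact product
```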
Abstract:
Systems and methods relate to a mixed-width single instruction multiple data (SIMD) instruction which has at least a source vector operand comprising data elements of a first bit-width and a destination vector operand comprising data elements of a second bit-width, wherein the second bit-width is either half of or twice the first bit-width. Correspondingly, one of the source or destination vector operands is expressed as a pair of registers, a first register and a second register. The other vector operand is expressed as a single register. Data elements of the first register correspond to even-numbered data elements of the other vector operand expressed as a single register, and data elements of the second register correspond to odd-numbered data elements of the other vector operand expressed as a single register.
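A minimal sketch of the register-pairing convention, using Python lists as stand-in registers and an invented widening_multiply operation: the double-width destination is produced as a register pair whose first register holds results for even-numbered source elements and whose second register holds results for odd-numbered source elements.

```python
def widening_multiply(src_a, src_b):
    # src_a, src_b: single registers of narrow elements (same length)
    dest_first  = [a * b for a, b in zip(src_a[0::2], src_b[0::2])]  # even-numbered lanes
    dest_second = [a * b for a, b in zip(src_a[1::2], src_b[1::2])]  # odd-numbered lanes
    return dest_first, dest_second       # pair of wide-element registers

print(widening_multiply([1, 2, 3, 4], [10, 20, 30, 40]))
# ([10, 90], [40, 160])
```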
Abstract:
Systems and methods relate to performing data movement operations using single instruction multiple data (SIMD) instructions. A first SIMD instruction comprises a first input data vector having a number N (two or more) of data elements in corresponding N SIMD lanes and a control vector having N control elements in the corresponding N SIMD lanes. A first multi-stage cube network is controllable by the first SIMD instruction, and includes movement elements, with one movement element per SIMD lane, per stage. A movement element selects one of two data elements based on a corresponding control element and moves the data elements across the stages of the first multi-stage cube network by a zero distance or a power-of-two distance between adjacent stages to generate a first output data vector. A second multi-stage cube network can be used in conjunction with the first to generate all possible data movement operations of the input data vector.
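A behavioral sketch of a single such network follows, assuming N = 8 lanes and log2(N) stages (the function cube_network and its control layout are invented for illustration): at stage s, each lane's movement element either keeps its own element (zero distance) or takes the element from the lane at distance 2**s, as selected by one control bit.

```python
def cube_network(data, control_bits):
    n = len(data)
    stages = n.bit_length() - 1           # log2(n) stages
    out = list(data)
    for s in range(stages):
        nxt = list(out)
        for lane in range(n):
            partner = lane ^ (1 << s)     # power-of-two distance between adjacent stages
            if control_bits[s][lane]:
                nxt[lane] = out[partner]  # movement element picks the partner's element
            # else: zero-distance move (keep own element)
        out = nxt
    return out

data = list("ABCDEFGH")
ctrl = [[1] * 8, [0] * 8, [0] * 8]        # swap adjacent pairs in stage 0 only
print(cube_network(data, ctrl))           # ['B', 'A', 'D', 'C', 'F', 'E', 'H', 'G']
```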
Abstract:
A processor-implemented method for multiplication approximation includes receiving inputs to be processed using an artificial intelligence (AI) compute engine. The inputs have a first precision. The AI compute engine is configured for processing in a second precision different from the first precision. A first parameter for the inputs and a second parameter for the AI compute engine are defined. The first parameter and the second parameter respectively indicate a first portion of the first precision and a second portion of the second precision to use for computation by the AI compute engine. The inputs and the compute engine parameters are respectively adapted according to the first parameter and the second parameter to generate a first representation and a second representation. An approximation of an AI workload for the inputs is generated based on the first representation and the second representation.
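One way to picture "using only a portion of each precision" is the following toy sketch, with invented names and bit-widths (16-bit inputs, 8-bit engine parameters): each operand is truncated to its most significant bits before multiplying, so the product of the two reduced representations approximates the full-precision multiply.

```python
def truncate(value, total_bits, kept_bits):
    drop = total_bits - kept_bits
    return (value >> drop) << drop        # keep only the most-significant portion

def approx_multiply(x, w, x_bits=16, w_bits=8, x_kept=10, w_kept=6):
    x_rep = truncate(x, x_bits, x_kept)   # first representation (adapted input)
    w_rep = truncate(w, w_bits, w_kept)   # second representation (adapted parameter)
    return x_rep * w_rep

x, w = 40001, 201
print(x * w, approx_multiply(x, w))       # exact vs. approximated product
```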
Abstract:
An aspect of the disclosure relates to a data processing system, including: an input medium configured to include a first set of blocks of data including a first set of blocks of compressed data and a first set of metadata, respectively; an output medium configured to include a first set of blocks of decompressed data each having a predetermined number of decompressed elements; and a set of single instruction multiple data (SIMD) processors configured to: access the first set of blocks of data from the input medium, respectively; decompress the first set of blocks of compressed data to generate the first set of blocks of decompressed data based on the first set of metadata, respectively; and provide the first set of blocks of decompressed data to the output medium, respectively.
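As an illustration only, the sketch below assumes one concrete compression scheme (zero-value compression, where each block's metadata is a bitmask of non-zero positions); the abstract itself does not specify the scheme. Each block decompresses independently to a fixed number of elements, so one SIMD processor could handle one block.

```python
BLOCK_ELEMS = 8

def decompress_block(compressed_values, metadata_mask):
    out = [0] * BLOCK_ELEMS
    vi = 0
    for pos in range(BLOCK_ELEMS):
        if metadata_mask & (1 << pos):        # metadata says this slot holds a stored value
            out[pos] = compressed_values[vi]
            vi += 1
    return out

blocks = [([5, 7], 0b00010001), ([9], 0b10000000)]
print([decompress_block(vals, meta) for vals, meta in blocks])
# [[5, 0, 0, 0, 7, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 9]]
```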
Abstract:
Various embodiments include methods and devices for processing a neural network by an artificial intelligence (AI) processor. Embodiments may include receiving AI processor operating condition information, dynamically adjusting an AI quantization level for a segment of a neural network in response to the operating condition information, and processing the segment of the neural network using the adjusted AI quantization level.
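A toy sketch of the dynamic adjustment, with an invented operating condition (device temperature) and invented helper names: the bit-width chosen for a segment depends on the condition reading, and the segment's activations are then quantized at that level.

```python
import numpy as np

def select_bits(temperature_c):
    # hotter device -> more aggressive quantization for this segment (assumed policy)
    return 4 if temperature_c > 70 else 8

def quantize(x, bits):
    levels = 2 ** bits - 1
    scale = np.max(np.abs(x)) / levels
    return np.round(x / scale) * scale

segment_activations = np.linspace(-1.0, 1.0, 9)
bits = select_bits(temperature_c=75)
print(bits, quantize(segment_activations, bits))
```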
Abstract:
A clock gating system (CGS) includes a digital power estimator configured to generate indications of a predicted energy consumption per cycle of a clock signal and a maximum energy consumption per cycle of the clock signal. The CGS further includes a voltage-clock gate (VCG) circuit coupled to the digital power estimator. The VCG circuit is configured to gate and un-gate the clock signal based on the indications prior to occurrence of a voltage droop event and using hardware voltage model circuitry of the VCG circuit. The VCG circuit is further configured to gate the clock signal based on an undershoot phase associated with the voltage droop event and to un-gate the clock signal based on an overshoot phase associated with the voltage droop event.
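Reduced to a behavioral decision function (thresholds and names invented here, not taken from the disclosure), the gating policy might look like the sketch below: gate pre-emptively when predicted per-cycle energy approaches the maximum, keep the clock gated during the undershoot phase of a droop, and un-gate during the overshoot phase.

```python
def clock_enabled(predicted_energy, max_energy, droop_phase):
    if droop_phase == "undershoot":
        return False                          # gate the clock during undershoot
    if droop_phase == "overshoot":
        return True                           # un-gate the clock during overshoot
    # no droop event: gate proactively when predicted demand nears the maximum
    return predicted_energy < 0.9 * max_energy

for cycle in [(50, 100, None), (95, 100, None), (60, 100, "undershoot"), (60, 100, "overshoot")]:
    print(cycle, "->", "run" if clock_enabled(*cycle) else "gated")
```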
Abstract:
A clock domain crossing interface is described. The clock domain crossing interface includes a transmit clock domain and a receive clock domain using a different clock from the transmit clock domain. The clock domain crossing interface also includes a first-in-first-out (FIFO) buffer coupled between the transmit clock domain and the receive clock domain. The FIFO buffer is configured to store ordered transactions sent from the transmit clock domain to the receive clock domain. The clock domain crossing interface further includes a transmit clock domain event transfer block to notify the receive clock domain of a new transaction pushed onto the FIFO buffer in the transmit clock domain. The clock domain crossing interface also includes a receive clock domain event transfer block to notify the transmit clock domain of a new transaction pulled from the FIFO buffer in the receive clock domain.
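A behavioral sketch (not RTL, and with invented class and method names) of the described interface: a shared FIFO plus two event-transfer paths, one notifying the receive domain of each push and one notifying the transmit domain of each pull.

```python
from collections import deque

class ClockDomainCrossingFifo:
    def __init__(self):
        self.fifo = deque()
        self.push_events = 0     # tx -> rx: "a new transaction was pushed"
        self.pull_events = 0     # rx -> tx: "a transaction was pulled"

    def tx_push(self, transaction):          # runs in the transmit clock domain
        self.fifo.append(transaction)
        self.push_events += 1                # event transfer block notifies the receive domain

    def rx_pull(self):                       # runs in the receive clock domain
        if self.push_events == 0:
            return None                      # nothing announced yet
        self.push_events -= 1
        self.pull_events += 1                # event transfer block notifies the transmit domain
        return self.fifo.popleft()

cdc = ClockDomainCrossingFifo()
cdc.tx_push("write A"); cdc.tx_push("write B")
print(cdc.rx_pull(), cdc.rx_pull(), cdc.rx_pull())   # write A write B None
```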