SCATTER TO GATHER OPERATION
    1.
    Invention application

    Publication No.: US20170371657A1

    Publication Date: 2017-12-28

    Application No.: US15192992

    Filing Date: 2016-06-24

    Abstract: Systems and methods relate to efficient memory operations. A single instruction multiple data (SIMD) gather operation is implemented with a gather result buffer located within or in close proximity to memory, to receive or gather multiple data elements from multiple orthogonal locations in a memory, and once the gather result buffer is complete, the gathered data is transferred to a processor register. A SIMD copy operation is performed by executing two or more instructions for copying multiple data elements from multiple orthogonal source addresses to corresponding multiple destination addresses within the memory, without an intermediate copy to a processor register. Thus, the memory operations are performed in a background mode without direction by the processor.
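    The gather flow described in the abstract can be modeled in a few lines. The following is a minimal Python sketch under stated assumptions — `gather_to_register`, the buffer, and the "register" are illustrative names, not from the patent:

```python
# Software model of a gather-to-buffer operation (illustrative only):
# elements are collected from scattered ("orthogonal") memory locations
# into a result buffer located near memory, and only the completed
# buffer is transferred to the processor register in a single step.

def gather_to_register(memory, indices):
    """Gather memory[i] for each i in indices, then transfer the
    completed buffer to the register in one step."""
    result_buffer = [None] * len(indices)   # gather result buffer
    for lane, idx in enumerate(indices):    # fills in "background mode"
        result_buffer[lane] = memory[idx]
    register = list(result_buffer)          # transfer once complete
    return register

memory = [10, 20, 30, 40, 50, 60, 70, 80]
gathered = gather_to_register(memory, [7, 0, 3, 5])  # [80, 10, 40, 60]
```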

    COPROCESSOR FOR OUT-OF-ORDER LOADS
    2.
    Invention application
    COPROCESSOR FOR OUT-OF-ORDER LOADS (Granted)

    Publication No.: US20160092238A1

    Publication Date: 2016-03-31

    Application No.: US14499044

    Filing Date: 2014-09-26

    Abstract: Systems and methods for implementing certain load instructions, such as vector load instructions, by cooperation of a main processor and a coprocessor. The load instructions which are identified by the main processor for offloading to the coprocessor are committed in the main processor without receiving corresponding load data. Post-commit, the load instructions are processed in the coprocessor, such that latencies incurred in fetching the load data are hidden from the main processor. By implementing an out-of-order load data buffer associated with an in-order instruction buffer, the coprocessor is also configured to avoid stalls due to long latencies which may be involved in fetching the load data from levels of the memory hierarchy, such as L2, L3, L4 caches, main memory, etc.

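    The in-order instruction buffer paired with an out-of-order load data buffer can be sketched as a small software model. This is a hedged illustration, not the patented microarchitecture; the tag scheme and function name are assumptions:

```python
from collections import deque

def retire_in_order(program_order, data_arrivals):
    """Model of post-commit load handling in the coprocessor:
    program_order  -- load tags in commit (program) order
    data_arrivals  -- (tag, value) pairs arriving in arbitrary order
    Results retire in program order as soon as the head's data is in."""
    instr_buf = deque(program_order)   # in-order instruction buffer
    data_buf = {}                      # out-of-order load data buffer
    retired = []
    for tag, value in data_arrivals:
        data_buf[tag] = value          # data may arrive in any order
        while instr_buf and instr_buf[0] in data_buf:
            retired.append(data_buf.pop(instr_buf.popleft()))
    return retired

# loads committed as A, B, C but data returns as C, A, B
results = retire_in_order(["A", "B", "C"], [("C", 3), ("A", 1), ("B", 2)])
# results == [1, 2, 3]: in-order retirement despite out-of-order data
```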

    REDUCED RESULT MATRIX
    3.
    Invention application

    Publication No.: US20220035891A1

    Publication Date: 2022-02-03

    Application No.: US17390257

    Filing Date: 2021-07-30

    Abstract: Matrix multiply operations may use a reduced result matrix to increase the speed and accuracy of the operation. In one example, each higher-precision row/column is decomposed into multiple component rows/columns of the base type that can be combined as weighted sums to form the original higher-precision row/column. In another example, the decomposition may be independent for each input matrix and decompose to any multiple of the base type. In another example, the base type for each input matrix could be different. In another example, after decomposition, a matrix operation is performed (e.g., matrix multiply, convolutional layer, or possibly another matrix operation) on the decomposed base-type input matrices to yield a result matrix that contains components of the higher-precision results. The results may be combined to obtain higher-precision results.
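    The weighted-sum decomposition can be illustrated concretely. The sketch below assumes one particular decomposition (16-bit values split into two 8-bit base-type components, value = hi*256 + lo); the patent covers this only as one of many possible choices:

```python
# Hedged sketch: a higher-precision matrix multiply built from base-type
# multiplies. Each 16-bit value decomposes as value = hi*256 + lo, so
# A*B = (Ah*256 + Al)(Bh*256 + Bl) recombines component products with
# weights 65536, 256, and 1.

def decompose(matrix):
    """Split a matrix of 16-bit values into 8-bit hi/lo component matrices."""
    hi = [[v >> 8 for v in row] for row in matrix]
    lo = [[v & 0xFF for v in row] for row in matrix]
    return hi, lo

def matmul(a, b):
    """Plain reference matrix multiply."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def mixed_matmul(a, b):
    """Multiply via four base-type products, recombined as weighted sums."""
    a_hi, a_lo = decompose(a)
    b_hi, b_lo = decompose(b)
    hh, hl = matmul(a_hi, b_hi), matmul(a_hi, b_lo)
    lh, ll = matmul(a_lo, b_hi), matmul(a_lo, b_lo)
    n, m = len(hh), len(hh[0])
    return [[(hh[i][j] << 16) + ((hl[i][j] + lh[i][j]) << 8) + ll[i][j]
             for j in range(m)] for i in range(n)]

a = [[300, 500], [1000, 2]]
b = [[7, 1], [2, 400]]
assert mixed_matmul(a, b) == matmul(a, b)  # components recombine exactly
```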

    MIXED-WIDTH SIMD OPERATIONS USING EVEN/ODD REGISTER PAIRS FOR WIDE DATA ELEMENTS
    4.
    Invention application
    MIXED-WIDTH SIMD OPERATIONS USING EVEN/ODD REGISTER PAIRS FOR WIDE DATA ELEMENTS (Pending, published)

    Publication No.: US20170024209A1

    Publication Date: 2017-01-26

    Application No.: US14805456

    Filing Date: 2015-07-21

    Abstract: Systems and methods relate to a mixed-width single instruction multiple data (SIMD) instruction which has at least a source vector operand comprising data elements of a first bit-width and a destination vector operand comprising data elements of a second bit-width, wherein the second bit-width is either half of or twice the first bit-width. Correspondingly, one of the source or destination vector operands is expressed as a pair of registers, a first register and a second register. The other vector operand is expressed as a single register. Data elements of the first register correspond to even-numbered data elements of the other vector operand expressed as a single register, and data elements of the second register correspond to odd-numbered data elements of the other vector operand expressed as a single register.

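    A widening operation makes the even/odd pairing concrete. The sketch below is an assumed example (a widening square of 8-bit lanes into 16-bit lanes); the operation name and lane widths are illustrative, not from the patent:

```python
# Hedged model: a widening SIMD operation whose wide destination is an
# even/odd register pair. Even-numbered lanes of the narrow single-register
# operand map to the first register of the pair; odd-numbered lanes map
# to the second register.

def widening_square(narrow):
    """Square each 8-bit lane to a 16-bit result, expressed as a pair."""
    even_reg = [(x * x) & 0xFFFF for x in narrow[0::2]]  # even lanes
    odd_reg  = [(x * x) & 0xFFFF for x in narrow[1::2]]  # odd lanes
    return even_reg, odd_reg

src = [1, 2, 3, 4, 5, 6, 7, 8]
even_reg, odd_reg = widening_square(src)
# even_reg == [1, 9, 25, 49], odd_reg == [4, 16, 36, 64]
```

    Splitting wide results across an even/odd pair keeps each result lane aligned with the narrow lane it came from, which is the motivation the abstract describes for pairing registers rather than packing results contiguously.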

    SIMD INSTRUCTIONS FOR MULTI-STAGE CUBE NETWORKS
    5.
    Invention application
    SIMD INSTRUCTIONS FOR MULTI-STAGE CUBE NETWORKS (Pending, published)

    Publication No.: US20170024208A1

    Publication Date: 2017-01-26

    Application No.: US14804190

    Filing Date: 2015-07-20

    CPC classification number: G06F9/30032 G06F9/30036 G06F9/30072 G06F15/8053

    Abstract: Systems and methods relate to performing data movement operations using single instruction multiple data (SIMD) instructions. A first SIMD instruction comprises a first input data vector having a number N of two or more data elements in corresponding N SIMD lanes and a control vector having N control elements in the corresponding N SIMD lanes. A first multi-stage cube network is controllable by the first SIMD instruction, and includes movement elements, with one movement element per SIMD lane, per stage. A movement element selects between one of two data elements based on a corresponding control element and moves the data elements across the stages of the first multi-stage cube network by a zero distance or power-of-two distance between adjacent stages to generate a first output data vector. A second multi-stage cube network can be used in conjunction to generate all possible data movement operations of the input data vector.

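    The stage-by-stage movement can be simulated in software. The sketch below assumes one plausible encoding — control element i is an integer whose bit k selects, at stage k, between lane i's own element (distance 0) and the element at power-of-two distance 2**k — which is an illustrative assumption, not the patent's exact format:

```python
import math

def cube_network(data, control):
    """One pass through a log2(N)-stage cube network of N SIMD lanes.
    At stage k, the movement element in lane i keeps its element or takes
    the element from lane i XOR 2**k, per bit k of control[i]."""
    n = len(data)
    out = list(data)
    for k in range(int(math.log2(n))):
        stage = list(out)
        for lane in range(n):
            if (control[lane] >> k) & 1:            # control bit, stage k
                stage[lane] = out[lane ^ (1 << k)]  # partner at 2**k away
        out = stage
    return out

data = [10, 11, 12, 13, 14, 15, 16, 17]
# control[i] = i makes every lane fetch element 0: a broadcast
broadcast = cube_network(data, list(range(8)))  # [10] * 8
identity = cube_network(data, [0] * 8)          # unchanged
```

    A single such network realizes many but not all permutations; as the abstract notes, a second network in series covers all possible data movements of the input vector.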

    EFFICIENT MULTIPLICATION APPROXIMATION FOR ARTIFICIAL INTELLIGENT (AI) ENGINES
    6.

    Publication No.: US20250131248A1

    Publication Date: 2025-04-24

    Application No.: US18493649

    Filing Date: 2023-10-24

    Abstract: A processor-implemented method for multiplication approximation includes receiving inputs to be processed using an artificial intelligence (AI) compute engine. The inputs have a first precision. The AI compute engine is configured for processing in a second precision different from the first precision. A first parameter for the inputs and a second parameter for the AI compute engine are defined. The first parameter and the second parameter respectively indicate a first portion of the first precision and a second portion of the second precision to use for computation by the AI compute engine. The inputs and the compute engine parameters are respectively adapted according to the first parameter and the second parameter to generate a first representation and a second representation. An approximation of an AI workload for the inputs is generated based on the first representation and the second representation.
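    One simple way to read "use a portion of the precision" is truncating each operand to its top bits before multiplying. The sketch below illustrates that idea only; the parameter names and the truncation scheme are assumptions, not the claimed method:

```python
# Hedged sketch: approximate a*b using only the top a_bits of a and the
# top b_bits of b (standing in for the abstract's "first portion" and
# "second portion" of the two precisions). Full bit counts recover the
# exact product; fewer bits trade accuracy for cheaper multiplies.

def approx_mul(a, b, a_bits, b_bits, total_bits=16):
    a_shift = total_bits - a_bits
    b_shift = total_bits - b_bits
    a_rep = a >> a_shift               # "first representation"
    b_rep = b >> b_shift               # "second representation"
    return (a_rep * b_rep) << (a_shift + b_shift)

exact = 40000 * 30000
approx = approx_mul(40000, 30000, 8, 8)   # only top 8 of 16 bits each
error = abs(approx - exact) / exact       # small relative error
```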

    SINGLE INSTRUCTION MULTIPLE DATA (SIMD) SPARSE DECOMPRESSION WITH VARIABLE DENSITY
    7.

    Publication No.: US20240118902A1

    Publication Date: 2024-04-11

    Application No.: US18339797

    Filing Date: 2023-06-22

    CPC classification number: G06F9/3887 G06F9/30178

    Abstract: An aspect of the disclosure relates to a data processing system, including: an input medium configured to include a first set of blocks of data including a first set of blocks of compressed data and a first set of metadata, respectively; an output medium configured to include a first set of blocks of decompressed data each having a predetermined number of decompressed elements; and a set of single instruction multiple data (SIMD) processors configured to: access the first set of blocks of data from the input medium, respectively; decompress the first set of blocks of compressed data to generate the first set of blocks of decompressed data based on the first set of metadata, respectively; and provide the first set of blocks of decompressed data to the output medium, respectively.
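    The per-block decompression can be sketched with a common sparse format as a stand-in. The bitmask metadata below (1 = stored nonzero element, 0 = implicit zero) is an assumed encoding for illustration; the patent's metadata format is not specified in the abstract:

```python
# Hedged model: each compressed block carries only its nonzero elements
# plus a metadata bitmask; decompression expands it to a block with a
# predetermined number of elements, inserting zeros where the mask is 0.

def decompress_block(compressed, metadata, block_size):
    """Expand one compressed block to block_size decompressed elements."""
    out = []
    it = iter(compressed)
    for pos in range(block_size):
        if (metadata >> pos) & 1:     # stored element at this position
            out.append(next(it))
        else:                         # implicit zero
            out.append(0)
    return out

# variable density: blocks may hold different numbers of stored elements
blocks = [([5, 7], 0b00001001, 8), ([9], 0b00010000, 8)]
decompressed = [decompress_block(c, m, n) for c, m, n in blocks]
# [[5, 0, 0, 7, 0, 0, 0, 0], [0, 0, 0, 0, 9, 0, 0, 0]]
```

    Each block decompresses independently, which is what lets a set of SIMD processors handle the blocks in parallel, one block per processor.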

    PROACTIVE CLOCK GATING SYSTEM TO MITIGATE SUPPLY VOLTAGE DROOPS
    8.

    Publication No.: US20200081479A1

    Publication Date: 2020-03-12

    Application No.: US16563563

    Filing Date: 2019-09-06

    Abstract: A clock gating system (CGS) includes a digital power estimator configured to generate indications of a predicted energy consumption per cycle of a clock signal and a maximum energy consumption per cycle of the clock signal. The CGS further includes a voltage-clock gate (VCG) circuit coupled to the digital power estimator. The VCG circuit is configured to gate and un-gate the clock signal based on the indications prior to occurrence of a voltage droop event and using hardware voltage model circuitry of the VCG circuit. The VCG circuit is further configured to gate the clock signal based on an undershoot phase associated with the voltage droop event and to un-gate the clock signal based on an overshoot phase associated with the voltage droop event.
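    The proactive part of the scheme can be caricatured as a thresholding policy on the power estimator's outputs. This is a deliberately simplified sketch — the threshold fraction, names, and the reduction of the hardware voltage model to a single comparison are all assumptions:

```python
# Hedged policy sketch: gate (disable) the clock on any cycle whose
# predicted energy exceeds a fraction of the maximum energy per cycle,
# acting before the corresponding voltage droop can occur. The real VCG
# circuit also handles undershoot/overshoot phases, omitted here.

def clock_enables(predicted_energy, max_energy, threshold_frac=0.8):
    """Return per-cycle clock enables (True = clock runs) from the
    digital power estimator's predicted energy per cycle."""
    threshold = threshold_frac * max_energy
    return [energy <= threshold for energy in predicted_energy]

# illustrative per-cycle predictions from a digital power estimator
enables = clock_enables([10, 95, 40, 100, 60], max_energy=100)
# [True, False, True, False, True]: high-energy cycles are gated
```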

    AREA EFFICIENT ASYNCHRONOUS FIRST-IN-FIRST-OUT (FIFO) BUFFER FOR HIGH BANDWIDTH DATA TRANSFER USING EVENT TRANSFER BLOCKS
    9.

    Publication No.: US20250021498A1

    Publication Date: 2025-01-16

    Application No.: US18601341

    Filing Date: 2024-03-11

    Abstract: A clock domain crossing interface is described. The clock domain crossing interface includes a transmit clock domain and a receive clock domain using a different clock from the transmit clock domain. The clock domain crossing interface also includes a first-in-first-out (FIFO) buffer coupled between the transmit clock domain and the receive clock domain. The FIFO buffer stores ordered transactions sent from the transmit clock domain to the receive clock domain. The clock domain crossing interface further includes a transmit clock domain event transfer block to notify the receive clock domain of a new transaction pushed onto the FIFO buffer in the transmit clock domain. The clock domain crossing interface also includes a receive clock domain event transfer block to notify the transmit clock domain of a new transaction pulled from the FIFO buffer in the receive clock domain.
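    At a behavioral level (ignoring clocks, synchronizers, and Gray-code pointers entirely), the push/pull event pairing looks like this. The class and counters below are a software caricature for intuition, not the described hardware:

```python
from collections import deque

class CrossingFifo:
    """Behavioral sketch of the clock-crossing FIFO: a push event notifies
    the receive side of a new transaction; a pull event notifies the
    transmit side of a freed slot. Not cycle-accurate."""

    def __init__(self, depth):
        self.depth = depth
        self.buf = deque()
        self.push_events = 0   # transmit-domain event transfer block
        self.pull_events = 0   # receive-domain event transfer block

    def push(self, item):
        if len(self.buf) >= self.depth:
            return False              # full: transmit domain must wait
        self.buf.append(item)
        self.push_events += 1         # notify receive domain
        return True

    def pull(self):
        if not self.buf:
            return None               # empty: nothing to receive
        self.pull_events += 1         # notify transmit domain
        return self.buf.popleft()     # ordering preserved (FIFO)
```

    Counting events on both sides is what lets each domain track occupancy without directly sampling the other domain's pointers, which is the usual motivation for event-based crossing schemes.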
