-
1.
Publication No.: US20240013040A1
Publication Date: 2024-01-11
Application No.: US18474464
Application Date: 2023-09-26
Applicant: Intel Corporation
IPC Class: G06N3/063, G06N3/048, G06N3/0464
CPC Class: G06N3/063, G06N3/048, G06N3/0464
Abstract: A drain module may drain activations in an output tensor of a convolution from a PE array that performs the convolution. The drain module may extract activations generated in a collection of PE columns. The activations generated in the PE columns of the collection may be concatenated: activations generated in the first PE column of the collection may be followed by activations generated in the second PE column, and so on. The activations in the output tensor may be rearranged into activation vectors. Each activation vector may include activations in different output channels of the deep learning operation, and the activations in each activation vector may have the same (X, Y) coordinate in the output tensor. The drain module may determine a memory address for an activation based on the activation's (X, Y, Z) coordinate in the output tensor and write the activation to that memory address.
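To make the coordinate-to-address mapping concrete, here is a minimal Python sketch; the Z-major (channel-innermost) layout, the function name, and all parameters are illustrative assumptions, not details taken from the publication.

```python
# Sketch of a (X, Y, Z)-to-address mapping (assumption: Z-major layout in
# which all output channels of one (X, Y) position are stored contiguously).

def activation_address(x: int, y: int, z: int,
                       width: int, num_channels: int,
                       base: int = 0, elem_bytes: int = 1) -> int:
    """Return the byte address of the activation at (x, y, z)."""
    # Rows first, then columns, with the channel (Z) dimension innermost,
    # so an activation vector (all channels at one (X, Y)) is contiguous.
    linear = (y * width + x) * num_channels + z
    return base + linear * elem_bytes

# Example: activation at (x=3, y=2, z=5) in a 16-wide tensor with 64 channels.
addr = activation_address(3, 2, 5, width=16, num_channels=64)
```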
-
2.
Publication No.: US20230229507A1
Publication Date: 2023-07-20
Application No.: US18180415
Application Date: 2023-03-08
Applicant: Intel Corporation
CPC Class: G06F9/5027, G06N3/04, H04L41/16
Abstract: Computations in processing elements (PEs) executing a deep neural network are scheduled by a computation scheduler based on sparsity in the input data of the computations to reduce voltage droops. Each PE may perform a computation on an input operand and a weight operand. The computation scheduler may predict the workload of a PE for a computation based on a combined sparsity bitmap, which may be generated from the sparsity bitmap of the input operand and the sparsity bitmap of the weight operand. The computation scheduler can schedule the start of each PE's computation based on the predicted workloads: the PE with the highest workload may be instructed to start first, and the other PEs to start later. In some embodiments, the computations in the PEs may end in the same clock cycle.
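The bitmap-based prediction lends itself to a short sketch. The Python below is an assumed illustration: a bit set to 1 marks a nonzero element, the combined bitmap is the bitwise AND of the two operand bitmaps, and its popcount approximates the PE's effectual work.

```python
# Illustrative workload prediction from sparsity bitmaps (all names assumed).

def combined_bitmap(input_bitmap: int, weight_bitmap: int) -> int:
    # A multiply is only needed where both the activation and the weight
    # are nonzero, so the combined bitmap is a bitwise AND.
    return input_bitmap & weight_bitmap

def predicted_workload(input_bitmap: int, weight_bitmap: int) -> int:
    # The popcount of the combined bitmap approximates the number of
    # effectual multiply-accumulates the PE must perform.
    return bin(combined_bitmap(input_bitmap, weight_bitmap)).count("1")

# Scheduling idea: the PE with the heaviest predicted workload starts first;
# lighter PEs are delayed so all computations can end in the same cycle.
bitmaps = [(0b1011, 0b1110), (0b0001, 0b1111), (0b1111, 0b1111)]
order = sorted(range(len(bitmaps)),
               key=lambda i: predicted_workload(*bitmaps[i]), reverse=True)
```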
-
3.
Publication No.: US20220188638A1
Publication Date: 2022-06-16
Application No.: US17684764
Application Date: 2022-03-02
Applicant: Intel Corporation
Abstract: An apparatus for convolution operations is provided. The apparatus includes a PE array, a datastore, writing modules, reading modules, and a controlling module. The PE array performs MAC operations. The datastore includes databanks, each of which stores data to be used by a column of the PE array. The writing modules transfer data from a memory to the datastore, and the reading modules transfer data from the datastore to the PE array; each reading module may transfer data to a particular column of the PE array. The controlling module can determine the rounds of a convolution operation, where each round includes MAC operations based on a weight. The controlling module controls the writing modules and reading modules so that the same data in a databank can be reused across multiple rounds. For different rounds, the controlling module can give a reading module access to different databanks.
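As a rough illustration of the reuse scheme, the toy Python below (all names, and the round-robin bank assignment in particular, are assumed) shows a controller pointing reading modules at different databanks in different rounds, so data staged once can be reused instead of refetched.

```python
# Toy sketch of per-round databank reuse; the real controlling module is
# hardware, and this rotation policy is only an assumed example.

datastore = {bank_id: f"data_for_bank_{bank_id}" for bank_id in range(4)}

def feed_pe_column(column: int, data: str):
    print(f"column {column} <- {data}")

def run_convolution(num_rounds: int, num_readers: int):
    for rnd in range(num_rounds):
        for reader in range(num_readers):
            # Each round, the controller can give a reading module access
            # to a different databank, so data written once by a writing
            # module serves multiple rounds.
            bank_id = (reader + rnd) % len(datastore)
            feed_pe_column(reader, datastore[bank_id])

run_convolution(num_rounds=2, num_readers=4)
```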
-
4.
Publication No.: US20240111830A1
Publication Date: 2024-04-04
Application No.: US18534035
Application Date: 2023-12-08
Applicant: Intel Corporation
Inventors: Umer Iftikhar Cheema, Robert Simofi, Deepak Abraham Mathaikutty, Arnab Raha, Dinakar Kondru
CPC Class: G06F17/17, G06F1/0307
Abstract: A non-linear activation function in a neural network may be approximated by one or more linear functions. The input range may be divided into input segments, each of which corresponds to a different exponent in the input range of the activation function and includes the input data elements having that exponent. Target accuracies may be assigned to the identified exponents based on a statistical analysis of the input data elements. The target accuracy of an input segment may be used to determine one or more linear functions that approximate the activation function for that segment, such that the error of the approximation stays within the target accuracy. The parameters of the linear functions may be stored in a look-up table (LUT). During execution of the neural network, the LUT may be used to execute the activation function.
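A small sketch of the exponent-keyed approximation, assuming one (slope, intercept) pair per power-of-two segment and using sigmoid as a stand-in for the unnamed activation function (positive inputs only, for brevity):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def build_lut(exponents):
    """Fit one linear function per exponent segment [2**e, 2**(e+1))."""
    lut = {}
    for e in exponents:
        lo, hi = 2.0 ** e, 2.0 ** (e + 1)
        slope = (sigmoid(hi) - sigmoid(lo)) / (hi - lo)
        intercept = sigmoid(lo) - slope * lo
        lut[e] = (slope, intercept)
    return lut

def approx(x: float, lut):
    # math.frexp gives x = m * 2**e with m in [0.5, 1), so the segment
    # exponent floor(log2(x)) is e - 1 (valid for positive x).
    e = math.frexp(x)[1] - 1
    slope, intercept = lut[e]
    return slope * x + intercept

lut = build_lut(range(-4, 4))
print(approx(1.3, lut), sigmoid(1.3))   # approximation vs. exact value
```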
-
5.
Publication No.: US20220188075A1
Publication Date: 2022-06-16
Application No.: US17688131
Application Date: 2022-03-07
Applicant: Intel Corporation
Inventors: Arnab Raha, Mark A. Anders, Raymond Jit-Hung Sung, Debabrata Mohapatra, Deepak Abraham Mathaikutty, Ram K. Krishnamurthy, Himanshu Kaul
Abstract: An FPMAC operation has two operands: an input operand and a weight operand. The operands may have a format of FP16, BF16, or INT8. Each operand is split into two portions, which are stored in separate storage units. The operands are then transferred to the register files of a PE, with each register file storing the bits of an operand sequentially. The PE performs the FPMAC operation on the operands. The PE may include an FPMAC unit configured to compute an individual partial sum of the PE, and an FP adder to accumulate the individual partial sum with other data, such as an output from another PE or from another PE array. The FP adder may be fused with the FPMAC unit in a single circuit that can perform speculative alignment and has separate critical paths for alignment and normalization.
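Splitting an operand into two separately stored portions can be sketched in a few lines; the byte-level FP16 split below is an assumed illustration of the idea, not the circuit's actual storage scheme.

```python
import struct

# Assumed illustration: a 16-bit operand is split into two byte-sized
# portions for separate storage units, then reassembled for the PE
# (FP16 shown; BF16 and INT8 would split the same way).

def split_fp16(value: float) -> tuple[int, int]:
    raw = struct.unpack("<H", struct.pack("<e", value))[0]  # FP16 bit pattern
    return raw & 0xFF, (raw >> 8) & 0xFF    # low byte, high byte

def join_fp16(lo: int, hi: int) -> float:
    raw = (hi << 8) | lo
    return struct.unpack("<e", struct.pack("<H", raw))[0]

lo, hi = split_fp16(1.5)         # the two portions go to separate units
assert join_fp16(lo, hi) == 1.5  # the PE sees the reassembled operand
```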
-
6.
Publication No.: US20240160695A1
Publication Date: 2024-05-16
Application No.: US18392618
Application Date: 2023-12-21
Applicant: Intel Corporation
CPC Class: G06F17/17, G06F1/0356
Abstract: A non-linear activation function may be approximated by linear functions. The input range of the activation function may be divided into input segments, and one or more input segments may be selected based on a statistical analysis of the input data elements in the input range. A parameter of a first linear function that approximates the activation function for at least part of a selected input segment may be stored in a first portion of a first look-up table (LUT); this portion of the first LUT is dedicated to a first group of post-processing engines (PPEs). A parameter of a second linear function that approximates the activation function for at least part of an unselected input segment may be stored in a shared pool of LUT entries, which comprises a second portion of the first LUT and a portion of a second LUT and is shared by multiple groups of PPEs.
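The two-level lookup, with a dedicated per-group portion and a shared fallback pool, might look like the following assumed Python sketch.

```python
# Sketch of the dedicated-plus-shared LUT organization (all structures,
# keys, and values here are assumed for illustration).

dedicated = {                     # per-group portion of the first LUT
    0: {-1: (0.24, 0.50)},        # group 0: segment key -> (slope, intercept)
    1: {-1: (0.24, 0.50)},
}
shared_pool = {                   # rest of LUT 1 plus part of LUT 2,
    2: (0.02, 0.88),              # shared by all PPE groups
}

def lookup(group: int, segment: int):
    # Selected (frequently hit) segments resolve in the group's dedicated
    # entries; unselected segments fall back to the shared pool.
    return dedicated[group].get(segment) or shared_pool[segment]

print(lookup(0, -1))   # dedicated hit
print(lookup(1, 2))    # shared-pool fallback
```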
-
7.
Publication No.: US20230394312A1
Publication Date: 2023-12-07
Application No.: US18453715
Application Date: 2023-08-22
Applicant: Intel Corporation
IPC Class: G06N3/082, G06N3/0464
CPC Class: G06N3/082, G06N3/0464
Abstract: Activations (e.g., output activations) or weights of intermediate layers of a deep neural network (DNN) can be pruned to increase sparsity and reduce the amount of computation required in those layers or subsequent layers. A pruning threshold may be determined, e.g., through an iterative process, and activations or weights whose absolute values fall below the threshold may be set to zero. A first pruning threshold may be used to prune an output tensor or kernel of a layer, and the resulting loss in DNN accuracy may be measured. A second pruning threshold may then be determined based on the first pruning threshold and the accuracy loss. The DNN may be modified by adding a pruning operation to the layer, which prunes the layer's output tensors or kernels based on the second pruning threshold.
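One plausible form of the iterative process is a binary search over thresholds against an accuracy budget; the sketch below, including its toy accuracy proxy, is an assumption rather than the publication's procedure.

```python
# Assumed sketch of an iterative pruning-threshold search.

def prune(tensor, threshold):
    return [0.0 if abs(v) < threshold else v for v in tensor]

def find_threshold(tensor, evaluate, baseline, max_loss,
                   lo=0.0, hi=1.0, iters=10):
    # Search for the largest threshold whose accuracy loss stays within
    # max_loss, deriving each new threshold from the previous one and
    # its measured loss.
    for _ in range(iters):
        mid = (lo + hi) / 2
        loss = baseline - evaluate(prune(tensor, mid))
        if loss <= max_loss:
            lo = mid    # within budget: try pruning more aggressively
        else:
            hi = mid    # too much accuracy loss: back off
    return lo

weights = [0.01, -0.3, 0.05, 0.8, -0.02, 0.4]
total = sum(abs(w) for w in weights)
evaluate = lambda t: sum(abs(v) for v in t) / total   # toy accuracy proxy
print(find_threshold(weights, evaluate, baseline=1.0, max_loss=0.05))
```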
-
8.
Publication No.: US20230229917A1
Publication Date: 2023-07-20
Application No.: US18184101
Application Date: 2023-03-15
Applicant: Intel Corporation
CPC Class: G06N3/08, G06F7/5443
Abstract: A compute block can perform hybrid multiply-accumulate (MAC) operations. The compute block may include a weight compression module and a processing element (PE) array. The weight compression module may select a first group of one or more weights and a second group of one or more weights from a weight tensor of a deep neural network (DNN) layer. A weight in the first group is quantized to a power-of-two value; a weight in the second group is quantized to an integer. The integer and the exponent of the power-of-two value may be stored in memory in lieu of the weights' original values. A PE in the PE array includes a shifter configured to shift an activation of the layer by the exponent of the power-of-two value and a multiplier configured to multiply the integer with another activation of the layer.
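The shifter/multiplier pairing reduces to one line each in a sketch; the framing below (power-of-two weight applied by a left shift, integer weight by a multiply) is assumed for illustration.

```python
# Assumed sketch of the hybrid MAC: one activation is scaled by a
# power-of-two weight via the shifter path, another by an integer
# weight via the multiplier path.

def hybrid_mac(acc: int,
               activation_a: int, exponent: int,     # weight = 2**exponent
               activation_b: int, integer_weight: int) -> int:
    acc += activation_a << exponent        # shifter path: x * 2**e
    acc += activation_b * integer_weight   # multiplier path: x * w
    return acc

# Only the exponent (here 3, for weight 8) and the integer (here 5) are
# stored, in lieu of the original weight values.
print(hybrid_mac(0, activation_a=7, exponent=3,
                 activation_b=2, integer_weight=5))   # 7*8 + 2*5 = 66
```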
-
9.
Publication No.: US20230073661A1
Publication Date: 2023-03-09
Application No.: US18055315
Application Date: 2022-11-14
Applicant: Intel Corporation
Abstract: A DNN (deep neural network) accelerator may accelerate deep learning operations, such as convolutions in frontend layers, through a scheduler that loads the data to be processed. The DNN accelerator may store, in a memory, the input tensor of a convolutional layer of the DNN. The convolutional layer may be the first layer, or a layer arranged before one or more other convolutional layers in the DNN, so that data processed by the layer can be efficiently reused across data load rounds. The input tensor includes one or more channels, each comprising activations arranged in rows and columns. The DNN accelerator may read at least a portion of the input tensor from the memory into a datastore comprising a number of databanks, and may provide a vector of one or more activations to a processing element for operations such as multiplications on the vector.
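A toy sketch of the datastore path, with channel rows striped across databanks and a short activation vector handed to a PE, might look as follows (the striping policy and every name here are assumed).

```python
# Toy sketch: stage an input tensor into databanks, then draw an
# activation vector from a bank for a processing element.

input_tensor = [[[c * 100 + r * 10 + col for col in range(4)]
                 for r in range(4)] for c in range(2)]   # [channel][row][col]

def load_datastore(tensor, num_banks=2):
    # Stripe the rows of every channel across the databanks.
    banks = [[] for _ in range(num_banks)]
    for channel in tensor:
        for i, row in enumerate(channel):
            banks[i % num_banks].append(row)
    return banks

def read_vector(banks, bank_id, entry, length=4):
    # A vector of activations handed to one processing element.
    return banks[bank_id][entry][:length]

banks = load_datastore(input_tensor)
print(read_vector(banks, bank_id=0, entry=1))
```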
-
10.
Publication No.: US20220075659A1
Publication Date: 2022-03-10
Application No.: US17530156
Application Date: 2021-11-18
Applicant: Intel Corporation
Inventors: Debabrata Mohapatra, Arnab Raha, Deepak Abraham Mathaikutty, Raymond Jit-Hung Sung, Cormac Michael Brick
Abstract: Disclosed are a system and method of performing an artificial intelligence (AI) inference, including: programming an AI accelerator circuit to solve an AI problem with a plurality of layer-specific register file (RF) size allocations, wherein the AI accelerator circuit comprises processing elements (PEs) with respective associated RFs, wherein each RF is divided into K sub-banks of size B bytes (B and K being integers), wherein the RFs include circuitry to individually allocate a sub-bank to one of input feature (IF), output feature (OF), or filter weight (FL), and wherein programming the plurality of layer-specific RF size allocations comprises accounting for sparse data within each layer; and causing the AI accelerator circuit to execute the AI problem, including applying the layer-specific RF size allocations at run time.
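For illustration, a layer-specific sub-bank allocation that accounts for sparsity could be sketched as below; the density-weighted heuristic is an assumption, not the claimed method.

```python
# Assumed sketch: split K sub-banks of B bytes among IF, OF, and FL per
# layer, shrinking the footprint of sparse (compressed) tensors first.

K, B = 8, 64   # sub-banks per RF, bytes per sub-bank

def allocate_rf(if_bytes: int, of_bytes: int, fl_bytes: int,
                if_density: float, fl_density: float) -> dict:
    # Sparse tensors are stored compressed, so their effective demand
    # scales with density before sub-banks are apportioned.
    demand = {"IF": if_bytes * if_density,
              "OF": of_bytes,
              "FL": fl_bytes * fl_density}
    total = sum(demand.values())
    alloc = {k: max(1, round(K * v / total)) for k, v in demand.items()}
    # Trim or pad so exactly K sub-banks are assigned.
    while sum(alloc.values()) > K:
        alloc[max(alloc, key=alloc.get)] -= 1
    while sum(alloc.values()) < K:
        alloc[min(alloc, key=alloc.get)] += 1
    return alloc

# Layer with highly sparse weights: FL receives fewer sub-banks.
print(allocate_rf(2048, 1024, 2048, if_density=0.9, fl_density=0.3))
```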