-
Publication No.: US11561833B1
Publication Date: 2023-01-24
Application No.: US16021866
Filing Date: 2018-06-28
Applicant: Amazon Technologies, Inc.
Inventor: Richard John Heaton , Randy Renfu Huang , Drazen Borkovic , Jindrich Zejda
Abstract: Techniques for operating a computing system to perform neural network operations are disclosed. In one example, a method comprises receiving a neural network model, determining a sequence of neural network operations based on data dependencies in the neural network model, and determining a set of instructions to map the sequence of neural network operations to the processing resources of a neural network processor. The method further comprises determining, based on a set of memory access operations included in the set of instructions, a first set of memory references associated with a first location of an external memory to store input data and a second set of memory references associated with a second location of the external memory to store output data, and generating an instruction file including the set of instructions, the first set of memory references, and the second set of memory references.
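The flow this abstract describes (order operations by data dependency, map them to instructions, collect input and output memory references, and emit an instruction file) can be sketched as below. This is a minimal illustration under assumed data structures, not the patented implementation; the graph encoding, instruction tuples, and addresses are all invented.

```python
from collections import deque

def topo_order(ops, deps):
    """Order operations by data dependency (Kahn's algorithm)."""
    indegree = {op: len(deps.get(op, [])) for op in ops}
    succs = {op: [] for op in ops}
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)
    ready = deque(op for op in ops if indegree[op] == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for s in succs[op]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    return order

def compile_model(ops, deps, input_addr=0x1000, output_addr=0x8000):
    """Map the op sequence to instructions plus external-memory references."""
    instructions = [("EXEC", op) for op in topo_order(ops, deps)]
    return {
        "instructions": instructions,
        "input_refs": [("LOAD", input_addr)],    # first location: input data
        "output_refs": [("STORE", output_addr)], # second location: output data
    }

print(compile_model(["conv", "relu", "fc"],
                    {"relu": ["conv"], "fc": ["relu"]}))
```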
-
Publication No.: US11442794B1
Publication Date: 2022-09-13
Application No.: US16585575
Filing Date: 2019-09-27
Applicant: Amazon Technologies, Inc.
Inventor: Drazen Borkovic
Abstract: Techniques for synchronizing operations of execution engines of an integrated circuit device are disclosed. A description of a plurality of operations to be performed by the execution engines may be obtained. The plurality of operations may be connected through a plurality of edges. A dependency vector may be generated for each operation of the plurality of operations. The dependency vector of a corresponding operation may include a set of values that are calculated based on the set of values of one or more dependency vectors calculated for one or more immediately preceding operations of the plurality of operations. An event register of a plurality of event registers may be assigned, for each edge of one or more of the plurality of edges, to the corresponding edge based on the dependency vector generated for a start operation associated with the corresponding edge.
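A hedged sketch of how dependency vectors might be computed and then used to decide which cross-engine edges need an event register. The data model (operations as (engine, index) pairs listed in execution order) and the redundancy test are assumptions made for illustration, not the patent's definitions.

```python
def dependency_vectors(ops, edges, engines):
    """ops: (engine, index) pairs in execution order; edges: (src, dst) pairs."""
    preds = {op: [] for op in ops}
    for s, d in edges:
        preds[d].append(s)
    vec = {}
    for op in ops:
        engine, idx = op
        v = {e: -1 for e in engines}
        v[engine] = idx                  # each op advances its own component
        for p in preds[op]:              # fold in immediately preceding ops
            for e in engines:
                v[e] = max(v[e], vec[p][e])
        vec[op] = v
    return vec

def assign_event_registers(edges, vec, num_registers):
    assignment, nxt = {}, 0
    for s, d in edges:
        if s[0] == d[0]:
            continue                     # same engine: program order suffices
        if vec[d][s[0]] > s[1]:
            continue                     # already implied by another path
        assignment[(s, d)] = nxt % num_registers  # round-robin over registers
        nxt += 1
    return assignment

ops = [("pe", 0), ("pe", 1), ("act", 0)]
edges = [(("pe", 0), ("pe", 1)), (("pe", 1), ("act", 0))]
vec = dependency_vectors(ops, edges, engines=["pe", "act"])
print(assign_event_registers(edges, vec, num_registers=4))
```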
-
Publication No.: US11221979B1
Publication Date: 2022-01-11
Application No.: US17247016
Filing Date: 2020-11-24
Applicant: Amazon Technologies, Inc.
Inventor: Drazen Borkovic
Abstract: Synchronization of a plurality of aggregate DMA transfers on a large number of DMA queues can be achieved using a small number of semaphores. One or more semaphores from M semaphores can be assigned to each aggregate DMA transfer using round-robin or another suitable method. Each aggregate DMA transfer can comprise N DMA transfers, where M is smaller than N. Each DMA transfer can be assigned to one of the assigned one or more semaphores from the M semaphores. Each DMA engine of N DMA engines can increment the assigned semaphore after performing a respective DMA transfer of the N DMA transfers. A computational engine waiting on completion of a certain aggregate DMA transfer can perform an operation based upon the one or more assigned semaphores reaching respective threshold values.
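The semaphore scheme can be mimicked in software, with threads standing in for DMA engines: N transfers increment M counters assigned round-robin, and the waiting compute thread proceeds once every counter reaches its threshold. A toy model only; the counts and the threading stand-ins are illustrative.

```python
import threading

M, N = 2, 8                      # M semaphores, N DMA transfers, M < N
counts = [0] * M
cond = threading.Condition()
assigned = [i % M for i in range(N)]                 # round-robin assignment
thresholds = [assigned.count(s) for s in range(M)]   # completions per semaphore

def dma_transfer(i):
    with cond:                   # each DMA engine increments its semaphore
        counts[assigned[i]] += 1
        cond.notify_all()

def wait_for_aggregate():
    with cond:                   # compute engine waits on threshold values
        cond.wait_for(lambda: all(counts[s] >= thresholds[s] for s in range(M)))
    print("aggregate DMA transfer complete")

workers = [threading.Thread(target=dma_transfer, args=(i,)) for i in range(N)]
waiter = threading.Thread(target=wait_for_aggregate)
waiter.start()
for w in workers:
    w.start()
for w in workers:
    w.join()
waiter.join()
```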
-
Publication No.: US20210158132A1
Publication Date: 2021-05-27
Application No.: US16698461
Filing Date: 2019-11-27
Applicant: Amazon Technologies, Inc.
Inventor: Jeffrey T. Huynh , Ron Diamant , Hongbin Zheng , Yizhi Liu , Animesh Jain , Yida Wang , Vinod Sharma , Richard John Heaton , Randy Renfu Huang , Sundeep Amirineni , Drazen Borkovic
Abstract: A computer-implemented method includes receiving a neural network model for implementation using a processing element array, where the neural network model includes a convolution operation on a set of input feature maps and a set of filters. The method also includes determining, based on the neural network model, that the convolution operation utilizes less than a threshold number of rows in the processing element array for applying a set of filter elements to the set of input feature maps, where the set of filter elements includes one filter element in each filter of the set of filters. The method further includes generating, for the convolution operation and based on the neural network model, a first instruction and a second instruction for execution by respective rows in the processing element array, where the first instruction and the second instruction use different filter elements of a filter in the set of filters.
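One way to picture the instruction split: when the input feature maps occupy too few rows, emit several matmul instructions that apply different filter elements on separate row groups of the array. The PE-array geometry, threshold, and instruction tuples below are invented for illustration and are not the claimed method.

```python
def plan_conv_instructions(num_input_maps, filter_h, filter_w,
                           pe_rows=128, threshold=0.5):
    rows_used = num_input_maps           # one PE row per input feature map
    if rows_used >= threshold * pe_rows:
        # enough rows: one instruction applies a single filter element per pass
        return [("MATMUL", {"filter_elem": (0, 0), "rows": rows_used})]
    # under-utilized: issue instructions for different filter elements so they
    # run on separate row groups of the array concurrently
    split = pe_rows // rows_used
    elems = [(r, c) for r in range(filter_h) for c in range(filter_w)]
    return [("MATMUL", {"filter_elem": e, "row_offset": k * rows_used,
                        "rows": rows_used})
            for k, e in enumerate(elems[:split])]

for instr in plan_conv_instructions(3, 3, 3, pe_rows=16):
    print(instr)
```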
-
Publication No.: US20190179795A1
Publication Date: 2019-06-13
Application No.: US15839157
Filing Date: 2017-12-12
Applicant: Amazon Technologies, Inc.
Inventor: Randy Huang , Ron Diamant , Jindrich Zejda , Drazen Borkovic
Abstract: Provided are systems, methods, and integrated circuit neural network processors that can execute a fast context switch between one neural network and another. In various implementations, a neural network processor can include a plurality of memory banks storing a first set of weight values for a first neural network. When the neural network processor receives first input data, the neural network processor can compute a first result using the first set of weight values and the first input data. While computing the first result, the neural network processor can store, in the memory banks, a second set of weight values for a second neural network. When the neural network processor receives second input data, the neural network processor can compute a second result using the second set of weight values and the second input data, where the computation occurs upon completion of computation of the first result.
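The double-buffered context switch can be sketched with two weight banks and an overlapped load, roughly as below. The bank layout, the thread standing in for a DMA load, and the dot-product stand-in for inference are assumptions, not the hardware design.

```python
import threading

banks = {0: None, 1: None}       # two memory-bank groups for weight sets

def load_weights(bank, weights):
    banks[bank] = weights        # in hardware this would be a DMA into banks

def run(inputs_a, inputs_b, weights_a, weights_b):
    load_weights(0, weights_a)
    loader = threading.Thread(target=load_weights, args=(1, weights_b))
    loader.start()               # overlap: load net B while computing net A
    result_a = sum(w * x for w, x in zip(banks[0], inputs_a))
    loader.join()                # net B's weights are ready when net A is done
    result_b = sum(w * x for w, x in zip(banks[1], inputs_b))
    return result_a, result_b

print(run([1, 2], [3, 4], weights_a=[0.5, 0.5], weights_b=[1.0, -1.0]))
```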
-
Publication No.: US12210438B1
Publication Date: 2025-01-28
Application No.: US17947949
Filing Date: 2022-09-19
Applicant: Amazon Technologies, Inc.
Inventor: Samuel Jacob , Drazen Borkovic , Yu Zhou , Mohammad El-Shabani
Abstract: Techniques are disclosed for setting a breakpoint for debugging a neural network. User input is received by a debugger program executable by a host processor indicating a target layer of a neural network at which to halt execution of the neural network. The neural network includes a first set of instructions to be executed by a first execution engine and a second set of instructions to be executed by a second execution engine. A first halt point is set within the first set of instructions and a second halt point is set within the second set of instructions. It is then determined that operation of the first execution engine and the second execution engine has halted. It is then determined that the first execution engine has reached the first halt point. The second execution engine is then caused to move through instructions until reaching the second halt point.
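The two-engine breakpoint protocol might look like the following sketch. The Engine class, its stall and halt mechanics, and the program counters are invented stand-ins for the accelerator's debug hooks, used only to show the sequencing the abstract describes.

```python
class Engine:
    """Invented stand-in for an execution engine with debug halt support."""
    def __init__(self, n_instructions, stall_at=None):
        self.n, self.pc = n_instructions, 0
        self.halt_at, self.stall_at = None, stall_at

    def run(self):
        # runs until its halt point, or earlier if stalled on a dependency
        while self.pc < self.n and self.pc != self.halt_at:
            if self.pc == self.stall_at:
                return
            self.pc += 1

    def step(self):
        self.pc += 1             # debugger-forced single step

def break_at_layer(eng1, eng2, halt1, halt2):
    eng1.halt_at, eng2.halt_at = halt1, halt2   # set both halt points
    eng1.run()
    eng2.run()                   # determine that both engines have halted
    assert eng1.pc == halt1      # first engine reached its halt point
    while eng2.pc != halt2:      # move the second engine to its halt point
        eng2.step()
    print(f"halted at pc {eng1.pc} and pc {eng2.pc}")

break_at_layer(Engine(10), Engine(12, stall_at=3), halt1=4, halt2=6)
```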
-
Publication No.: US12045611B1
Publication Date: 2024-07-23
Application No.: US18231024
Filing Date: 2023-08-07
Applicant: Amazon Technologies, Inc.
Inventor: Ron Diamant , Hongbin Zheng , Drazen Borkovic , Haichen Li
Abstract: In one example, a method comprises: receiving input codes, wherein the input codes represent a computational dataflow graph; traversing the computational dataflow graph to identify single-entry-single-exit (SESE) subgraphs of the computational dataflow graph, wherein each SESE subgraph has a sequence of nodes comprising a root node and a child node and representing a sequence of element-wise operators, wherein the root node receives a single input tensor, and wherein the child node outputs a single output tensor; determining a merged operator for each SESE subgraph; and generating executable instructions for the computational dataflow graph to be executed by a hardware accelerator having a first execution unit and a second execution unit, wherein the executable instructions comprise first executable instructions for the merged operators targeted at the first execution unit, and second executable instructions for other operators of the computational dataflow graph targeted at the second execution unit.
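A simplified take on the SESE merge: collapse straight-line chains of element-wise nodes into one fused operator, then route fused versus other operators to different execution units. The operator set, graph encoding, and unit names are assumptions; real SESE detection handles richer region shapes than the linear chains shown here.

```python
ELEMENTWISE = {"add", "mul", "relu", "exp"}

def merge_chains(nodes, succ):
    """nodes: {name: op_type}; succ: {name: [children]}."""
    preds = {n: 0 for n in nodes}
    for n, cs in succ.items():
        for c in cs:
            preds[c] += 1
    merged, used = [], set()
    for n in nodes:
        if n in used or nodes[n] not in ELEMENTWISE or preds[n] > 1:
            continue
        chain, cur = [n], n      # grow a single-entry-single-exit chain
        while (len(succ.get(cur, [])) == 1
               and nodes[succ[cur][0]] in ELEMENTWISE
               and preds[succ[cur][0]] == 1):
            cur = succ[cur][0]
            chain.append(cur)
        used.update(chain)
        merged.append(chain)
    return merged

nodes = {"a": "matmul", "b": "add", "c": "relu", "d": "exp"}
succ = {"a": ["b"], "b": ["c"], "c": ["d"]}
for chain in merge_chains(nodes, succ):
    print(f"fuse {chain} -> merged operator on the element-wise unit")
```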
-
Publication No.: US20230359876A1
Publication Date: 2023-11-09
Application No.: US18352768
Filing Date: 2023-07-14
Applicant: Amazon Technologies, Inc.
Inventor: Jeffrey T. Huynh , Ron Diamant , Hongbin Zheng , Yizhi Liu , Animesh Jain , Yida Wang , Vinod Sharma , Richard John Heaton , Randy Renfu Huang , Sundeep Amirineni , Drazen Borkovic
Abstract: Generating instructions for programming a processing element array to implement a convolution operation can include determining that the convolution operation under-utilizes the processing element array. The convolution operation involves using the processing element array to perform a series of matrix multiplications between a set of filters and a set of input matrices. Each filter comprises a weight matrix. Each input matrix is assigned to a respective row in the processing element array. Under-utilization can be determined through detecting that less than a threshold number of rows would be used concurrently. In response to determining that the convolution operation under-utilizes the processing element array, instructions can be added for modifying the convolution operation to increase the number of rows used concurrently. The added instructions are executable to cause at least one input matrix to be processed in parallel across more rows compared to processing without modifying the convolution operation.
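A rough view of the modification step: when fewer than a threshold number of rows would be used concurrently, give each input matrix several rows instead of one so more of the array works in parallel. The numbers and the row-plan encoding below are illustrative assumptions, not the claimed instruction generation.

```python
def row_plan(num_input_matrices, pe_rows=128, threshold_rows=64):
    if num_input_matrices >= threshold_rows:
        # enough concurrent rows: one row per input matrix, no modification
        return [{"input": i, "rows": 1} for i in range(num_input_matrices)]
    # under-utilized: spread each input matrix across several rows, each row
    # applying a different filter element, so more rows work concurrently
    per_input = pe_rows // num_input_matrices
    return [{"input": i, "rows": per_input} for i in range(num_input_matrices)]

print(row_plan(num_input_matrices=3))    # each input matrix gets 42 rows
```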
-
Publication No.: US11610102B1
Publication Date: 2023-03-21
Application No.: US16698425
Filing Date: 2019-11-27
Applicant: Amazon Technologies, Inc.
Inventor: Jindrich Zejda , Drazen Borkovic
Abstract: Techniques for time-based memory allocation for a neural network inference are disclosed. A description of a neural network comprising a plurality of operations to be executed across a set of accelerators is received. A plurality of interconnect times at a plurality of partition points within the neural network are calculated. Each of the plurality of interconnect times corresponds to a duration of time for transferring an output feature map from one of the set of accelerators to another of the set of accelerators to be used as an input feature map. A partitioning scheme that divides the plurality of operations into a set of subgraphs is determined based on the plurality of interconnect times. Each of the set of subgraphs is assigned to a different accelerator of the set of accelerators in accordance with the partitioning scheme.
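The interconnect-time calculation and the partitioning could be modeled as below: estimate the transfer time at each candidate partition point from the feature-map size, then cut the layer sequence at the cheapest points. The bandwidth figure, byte counts, and greedy cut selection are assumptions made for illustration.

```python
def interconnect_time(feature_map_bytes, bandwidth_gbps=8.0):
    """Seconds to move one output feature map between accelerators."""
    return feature_map_bytes / (bandwidth_gbps * 1e9)

def partition(layer_output_bytes, num_accelerators):
    """layer_output_bytes[i] is the output feature-map size after layer i."""
    by_cost = sorted(range(len(layer_output_bytes)),
                     key=lambda i: interconnect_time(layer_output_bytes[i]))
    cuts = sorted(by_cost[:num_accelerators - 1])   # cheapest partition points
    subgraphs, start = [], 0
    for c in cuts:
        subgraphs.append((start, c))                # layers start..c inclusive
        start = c + 1
    subgraphs.append((start, len(layer_output_bytes)))
    return subgraphs

sizes = [64e6, 4e6, 32e6, 2e6, 16e6]                # bytes per layer output
print(partition(sizes, num_accelerators=3))         # cuts land on small maps
```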
-
Publication No.: US11354130B1
Publication Date: 2022-06-07
Application No.: US16824404
Filing Date: 2020-03-19
Applicant: Amazon Technologies, Inc.
Inventor: Drazen Borkovic
Abstract: Techniques for detecting a data race condition between multiple execution engines of an integrated circuit device are provided. Computations and data movements involving execution engines of an integrated circuit may be described with a flow graph, where graph nodes represent computation or data movement operations and graph edges represent dependencies between the operations. When a graph has incorrect dependencies, data races may result. To detect data race conditions, compiler-generated vector clocks that track the relationships of operations performed by various execution engines may be used to determine concurrent operations between nodes of different execution engines, and memory access patterns for the operations may be compared to determine if the concurrent operations access the same memory address.
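Vector-clock concurrency checking is a standard technique, so the compiler-side race check can be sketched directly: compute a clock per operation, call two operations concurrent when neither clock dominates the other, and flag concurrent cross-engine operations whose memory accesses overlap. The data model below (operations as (engine, index) pairs with explicit cross-engine edges) is an assumption.

```python
def vector_clocks(ops, edges, engines):
    """ops: (engine, index) pairs in a valid order; edges: (src, dst) pairs."""
    preds = {op: [] for op in ops}
    for s, d in edges:
        preds[d].append(s)
    clock, last = {}, {e: None for e in engines}
    for op in ops:
        eng, _ = op
        v = {e: 0 for e in engines}
        sources = preds[op] + ([last[eng]] if last[eng] is not None else [])
        for p in sources:                # join predecessor clocks
            for e in engines:
                v[e] = max(v[e], clock[p][e])
        v[eng] += 1                      # tick this engine's component
        clock[op], last[eng] = v, op
    return clock

def concurrent(a, b, clock, engines):
    before = all(clock[a][e] <= clock[b][e] for e in engines)
    after = all(clock[b][e] <= clock[a][e] for e in engines)
    return not before and not after     # no ordering in either direction

def races(ops, edges, engines, mem):
    clock = vector_clocks(ops, edges, engines)
    return [(a, b) for i, a in enumerate(ops) for b in ops[i + 1:]
            if a[0] != b[0] and concurrent(a, b, clock, engines)
            and mem[a] & mem[b]]        # overlapping memory addresses

ops = [("pe", 0), ("act", 0)]
print(races(ops, edges=[], engines=["pe", "act"],
            mem={("pe", 0): {0x100}, ("act", 0): {0x100}}))
```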