MACHINE LEARNING RUNTIME LIBRARY FOR NEURAL NETWORK ACCELERATION

    公开(公告)号:US20190114533A1

    公开(公告)日:2019-04-18

    申请号:US15785679

    申请日:2017-10-17

    申请人: Xilinx, Inc.

    摘要: Embodiments herein describe techniques for interfacing a neural network application with a neural network accelerator using a library. The neural network application may execute on a host computing system while the neural network accelerator executes on a massively parallel hardware system, e.g., a FPGA. The library operates a pipeline for submitting the tasks received from the neural network application to the neural network accelerator. In one embodiment, the pipeline includes a pre-processing stage, an FPGA execution stage, and a post-processing stage which each correspond to different threads. When receiving a task from the neural network application, the library generates a packet that includes the information required for the different stages in the pipeline to perform the tasks. Because the stages correspond to different threads, the library can process multiple packets in parallel which can increase the utilization of the neural network accelerator on the hardware system.

    HARDWARE ACCELERATION OF MACHINE LEARNING DESIGNS

    公开(公告)号:US20230401480A1

    公开(公告)日:2023-12-14

    申请号:US17806906

    申请日:2022-06-14

    申请人: Xilinx, Inc.

    IPC分类号: G06N20/00

    CPC分类号: G06N20/00

    摘要: Hardware acceleration of machine learning (ML) designs includes translating an ML primitive into an intermediate representation. The intermediate representation is subdivided to specify a functional compute block. The functional compute block is sized according to a compute node primitive adapted for implementing the ML primitive on target hardware. An overlay is generated for the ML primitive, at least in part, by mapping the functional compute block to the compute node primitive. The overlay is synthesizable to implement the ML primitive on the target hardware. The overlay can be scheduled for operation within the target hardware as part of an ML design including the ML primitive.

    Neural network processing system having host controlled kernel acclerators

    公开(公告)号:US11568218B2

    公开(公告)日:2023-01-31

    申请号:US15786288

    申请日:2017-10-17

    申请人: Xilinx, Inc.

    IPC分类号: G06N3/063 G06N3/04

    摘要: A disclosed neural network processing system includes a host computer system, a RAMs coupled to the host computer system, and neural network accelerators coupled to the RAMs, respectively. The host computer system is configured with software that when executed causes the host computer system to write input data and work requests to the RAMS. Each work request specifies a subset of neural network operations to perform and memory locations in a RAM of the input data and parameters. A graph of dependencies among neural network operations is built and additional dependencies added. The operations are partitioned into coarse grain tasks and fine grain subtasks for optimal scheduling for parallel execution. The subtasks are scheduled to accelerator kernels of matching capabilities. Each neural network accelerator is configured to read a work request from the respective RAM and perform the subset of neural network operations on the input data using the parameters.

    Image preprocessing for generalized image processing

    公开(公告)号:US11386644B2

    公开(公告)日:2022-07-12

    申请号:US15786267

    申请日:2017-10-17

    申请人: Xilinx, Inc.

    摘要: An example preprocessor circuit includes: a first buffer configured to store rows of image data and output a row thereof; a second buffer, coupled to the first buffer, including storage locations to store respective image samples of the row output by the first buffer; shift registers; an interconnect network including connections, each connection coupling a respective one of the shift registers to more than one of the storage locations, one or more of the storage locations being coupled to more than one of the connections; and a control circuit configured to load the shift registers with the image samples based on the connections and shift the shift registers to output streams of image samples.

    Circuit arrangements and methods for traversing input feature maps

    公开(公告)号:US11106968B1

    公开(公告)日:2021-08-31

    申请号:US15989075

    申请日:2018-05-24

    申请人: Xilinx, Inc.

    IPC分类号: G06N3/04

    摘要: A circuit arrangement includes a buffer, a height traversal circuit configured to generate a sequence of IFM height values in response to first control signals, a width traversal circuit configured to generate a sequence of IFM width values in response to second control signals, a control circuit, and an address generation circuit. The control circuit is configured to input an OFM height, an OFM width, a kernel height, and a kernel width; generate the first control signals at times based on the OFM height and the kernel height; and generate the second control signals at times based on the OFM width and the kernel width. The address generation circuit is configured to generate a sequence of addresses based on the sequences of IFM height values and IFM width values, provide the sequence of addresses to the buffer, and enable reading from the buffer.

    IMAGE PREPROCESSING FOR GENERALIZED IMAGE PROCESSING

    公开(公告)号:US20190114499A1

    公开(公告)日:2019-04-18

    申请号:US15786267

    申请日:2017-10-17

    申请人: Xilinx, Inc.

    摘要: An example preprocessor circuit for formatting image data into a plurality of streams of image samples includes: a first buffer configured to store a plurality of rows of the image data and output a row of the plurality of rows; a second buffer, coupled to the first buffer, including a plurality of storage locations to store a respective plurality of image samples of the row output by the first buffer; a plurality of shift registers; an interconnect network including a plurality of connections, each connection coupling a respective one of the plurality of shift registers to more than one of the plurality of storage locations, one or more of the plurality of storage locations being coupled to more than one of the plurality of connections; and a control circuit configured to load the plurality of shift registers with the plurality of image samples based on the plurality of connections and shift the plurality of shift registers to output the plurality of streams of image samples.

    Neural network processing system having multiple processors and a neural network accelerator

    公开(公告)号:US11222256B2

    公开(公告)日:2022-01-11

    申请号:US15785685

    申请日:2017-10-17

    申请人: Xilinx, Inc.

    IPC分类号: G06N3/063 G06N3/08 G06N3/04

    摘要: At least one neural network accelerator performs operations of a first subset of layers of a neural network on an input data set, generates an intermediate data set, and stores the intermediate data set in a shared memory queue in a shared memory. A first processor element of a host computer system provides input data to the neural network accelerator and signals the neural network accelerator to perform the operations of the first subset of layers of the neural network on the input data set. A second processor element of the host computer system reads the intermediate data set from the shared memory queue, performs operations of a second subset of layers of the neural network on the intermediate data set, and generates an output data set while the neural network accelerator is performing the operations of the first subset of layers of the neural network on another input data set.