METHODS AND MODULES FOR ACCELERATING INFERENCE VIA DISTRIBUTED DEVICES
Abstract:
Methods and modules for accelerating inference computations in transformer models using edge devices include partitioning inputs for each layer and synchronizing between transformer layers. A method includes receiving a transformer input, partitioning the transformer input into two or more first-stage divisions, processing each first-stage division into a processed first-stage division, and combining the processed first-stage divisions into a first output. A module includes a computing device that partitions a transformer input into two or more divisions, transmits each division, and receives processed divisions, as well as two or more transformer processing units, each of which receives a division from the computing device, processes the division into a processed division, and sends the processed division back to the computing device.
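The sketch below is a minimal, hypothetical illustration of the partition-process-combine flow the abstract describes; it is not the patented implementation. The function names (partition_input, process_division, combine_divisions), the choice of splitting along the sequence dimension, and the use of a single linear projection to stand in for per-device transformer work are all assumptions introduced for illustration.

```python
# Hypothetical sketch of the partition / process / combine pattern.
# All names and the split axis are illustrative assumptions, not from the patent text.
import numpy as np


def partition_input(x: np.ndarray, num_divisions: int) -> list[np.ndarray]:
    """Split a transformer layer input into first-stage divisions along the sequence axis."""
    return np.array_split(x, num_divisions, axis=0)


def process_division(division: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Stand-in for the work a transformer processing unit performs on one division
    (here, a single linear projection)."""
    return division @ weight


def combine_divisions(processed: list[np.ndarray]) -> np.ndarray:
    """Reassemble the processed divisions into the first output before the next layer."""
    return np.concatenate(processed, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 16))   # (sequence, hidden) transformer input
    w = rng.standard_normal((16, 16))  # shared layer weight

    divisions = partition_input(x, num_divisions=2)           # partition step
    processed = [process_division(d, w) for d in divisions]   # per-device step
    output = combine_divisions(processed)                     # combine step

    # Sanity check: the partitioned computation matches the unpartitioned one.
    assert np.allclose(output, x @ w)
```

In this toy setup the per-device calls run sequentially; in the distributed arrangement the abstract describes, each division would be transmitted to a separate transformer processing unit and the computing device would synchronize the returned results between layers.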