Publication No.: US20250060998A1
Publication Date: 2025-02-20
Application No.: US18452326
Filing Date: 2023-08-18
Applicant: Microsoft Technology Licensing, LLC
Inventor: Amar PHANISHAYEE, Ankit, Deepak NARAYANAN, Mihail Gavril TARTA
IPC: G06F9/50
Abstract: Systems and methods for optimizing thread allocation in a model serving system include estimating a batch size for inference requests. An optimal configuration is then determined that defines the number of inference instances, the number of threads per inference instance, and the sub-batch size per instance for processing a batch of inference requests of the estimated batch size using intra-operator parallelism, such that average per-batch latency is minimized. The optimal configuration is determined with reference to a plurality of predetermined model profiles that define single-inference average batch latencies for different combinations of thread counts and batch sizes. These profiles serve as input to a dynamic programming algorithm that identifies configurations minimizing the average per-batch latency.
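The configuration search described in the abstract can be illustrated with a minimal sketch. The profile table, function name, and latency values below are assumptions for illustration only; for simplicity the sketch uses exhaustive enumeration over candidate splits rather than the patent's dynamic programming formulation, and it assumes instances run concurrently so per-batch latency is the profiled latency of one instance.

```python
import math

# Hypothetical stand-in for the patent's predetermined model profiles:
# single-inference average batch latency (ms), keyed by
# (thread_count, batch_size). Values are illustrative only.
PROFILE = {
    (1, 1): 10.0, (1, 2): 18.0, (1, 4): 34.0, (1, 8): 66.0,
    (2, 1): 6.0,  (2, 2): 11.0, (2, 4): 20.0, (2, 8): 38.0,
    (4, 1): 4.0,  (4, 2): 7.0,  (4, 4): 13.0, (4, 8): 24.0,
}

def best_config(total_threads, batch_size):
    """Return (instances, threads_per_instance, sub_batch, latency_ms)
    minimizing average per-batch latency, assuming the inference
    instances process their sub-batches in parallel."""
    best = None
    for instances in range(1, batch_size + 1):
        sub_batch = math.ceil(batch_size / instances)
        threads_avail = total_threads // instances
        if threads_avail == 0:
            break  # not enough threads to give each instance one
        # Largest profiled thread count fitting the per-instance budget.
        candidates = [tc for (tc, bs) in PROFILE
                      if bs == sub_batch and tc <= threads_avail]
        if not candidates:
            continue  # no profile entry for this sub-batch size
        t = max(candidates)
        latency = PROFILE[(t, sub_batch)]  # instances run concurrently
        if best is None or latency < best[3]:
            best = (instances, t, sub_batch, latency)
    return best

print(best_config(8, 8))  # → (8, 1, 1, 10.0)
```

With this toy profile, splitting an 8-request batch across 8 single-threaded instances beats giving 4 threads to one instance, because single-request latency dominates the profiled intra-operator scaling; a real profile could favor a different split.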