AUTOMATIC LATENCY OPTIMIZATION FOR CPU-BASED DNN SERVING

    Publication Number: US20250060998A1

    Publication Date: 2025-02-20

    Application Number: US18452326

    Filing Date: 2023-08-18

Abstract: Systems and methods for optimizing thread allocation in a model serving system include estimating a batch size for inference requests. An optimal configuration is then determined that minimizes average per-batch latency, defining a number of inference instances, a number of threads per inference instance, and a sub-batch size per inference instance for processing a batch of inference requests of the estimated batch size using intra-operator parallelism. The optimal configuration is determined with reference to a plurality of predetermined model profiles that define single-instance average batch latencies for different combinations of thread counts and batch sizes, the predetermined model profiles being used as input to a dynamic programming algorithm that identifies optimal configurations minimizing the average per-batch latency.
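The abstract describes a dynamic program that, given offline-measured latency profiles, partitions a core budget and a batch across concurrent inference instances. Below is a minimal sketch of one way such a dynamic program could work, assuming the profiles take the form of a table mapping (thread count, sub-batch size) pairs to measured single-instance latencies, and that a batch's latency is the maximum over its concurrently running instances. All function names, the profile format, and the toy latency model are illustrative assumptions, not taken from the patent.

```python
from functools import lru_cache


def optimal_config(profile, total_cores, batch_size):
    """Dynamic program over (remaining cores, remaining requests) states.

    profile: dict mapping (threads, sub_batch) -> measured average latency
             for a single inference instance (hypothetical format).
    Returns (latency, plan) where plan is a tuple of (threads, sub_batch)
    assignments, one per inference instance.
    """
    INF = float("inf")

    @lru_cache(maxsize=None)
    def dp(cores, remaining):
        # No requests left: zero latency, empty plan.
        if remaining == 0:
            return (0.0, ())
        best = (INF, ())
        # Try assigning one more instance with each profiled configuration.
        for (threads, sub_batch), latency in profile.items():
            if threads <= cores and sub_batch <= remaining:
                rest_latency, rest_plan = dp(cores - threads,
                                             remaining - sub_batch)
                # Instances run concurrently, so the batch finishes when
                # the slowest instance finishes.
                candidate = max(latency, rest_latency)
                if candidate < best[0]:
                    best = (candidate, ((threads, sub_batch),) + rest_plan)
        return best

    return dp(total_cores, batch_size)


if __name__ == "__main__":
    # Toy profile: latency grows with sub-batch size and shrinks with
    # thread count, plus a per-thread synchronization overhead.
    profile = {
        (t, s): 10.0 * s / t + 2.0 * t
        for t in (1, 2, 4)
        for s in (1, 2, 4, 8)
    }
    latency, plan = optimal_config(profile, total_cores=8, batch_size=8)
    print(f"latency={latency:.1f} ms, plan={plan}")
```

In this sketch the profile table plays the role of the patent's predetermined model profiles: because per-configuration latencies are measured offline rather than modeled analytically, the dynamic program only has to search over feasible partitions of cores and requests.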
