Publication No.: US20250037018A1
Publication Date: 2025-01-30
Application No.: US18658919
Filing Date: 2024-05-08
Applicant: Apple Inc.
Inventors: Minsik CHO , Keivan ALIZADEH VAHID , Qichen FU , Saurabh ADYA , Carlo Eduardo Cabanero DEL MUNDO , Mohammad RASTEGARI , Devang K. NAIK , Peter ZATLOUKAL
IPC: G06N20/00
Abstract: The subject technology provides memory-efficient differentiable weight clustering for large language model compression. An apparatus determines a tensor including an attention map between learned weights of a trained machine learning model and corresponding centroids. The apparatus also determines a compressed attention table and a plurality of index lists during compression of the trained machine learning model based on a uniquification of the attention map and sharding of an associated index list. The apparatus determines whether the tensor exists at a destination device during compression of the trained machine learning model using a marshaling layer. The apparatus refrains from copying the tensor to the destination device when the tensor exists at the destination device, or copies the tensor to the destination device when the tensor does not exist at the destination device. The apparatus deploys a compressed machine learning model based on the compression of the trained machine learning model.
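The abstract describes two memory-saving steps: marshaling a tensor to a destination device only when it is not already there, and uniquifying the attention map into a compressed attention table with sharded index lists. The sketch below is a minimal, hypothetical illustration of those two ideas using PyTorch; the helper names (`marshal_to_device`, `uniquify_and_shard`), the cache keying scheme, and the use of `torch.unique`/`torch.chunk` are assumptions for illustration and are not taken from the patent itself.

```python
import torch

# Hypothetical cache of tensors already marshaled to a device.
# Keying by (id of source tensor, destination device) is an assumption.
_device_cache: dict[tuple[int, torch.device], torch.Tensor] = {}


def marshal_to_device(tensor: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Return `tensor` on `device`, copying only if it is not already there."""
    if tensor.device == device:
        return tensor  # already resident at the destination; nothing to copy
    key = (id(tensor), device)
    cached = _device_cache.get(key)
    if cached is not None:
        return cached  # tensor already exists at the destination; refrain from copying
    moved = tensor.to(device)  # tensor absent at the destination; copy it across
    _device_cache[key] = moved
    return moved


def uniquify_and_shard(attention_map: torch.Tensor, num_shards: int):
    """Hypothetical uniquification of an attention map into a compressed
    attention table plus a sharded index list."""
    table, index = torch.unique(attention_map, dim=0, return_inverse=True)
    shards = torch.chunk(index, num_shards)  # shard the associated index list
    return table, shards
```

In this reading, the full attention map never needs to be duplicated across devices: only the much smaller unique-row table and the integer index shards are moved, and repeated moves of the same tensor are skipped by the cache check.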