MEMORY-EFFICIENT DIFFERENTIABLE WEIGHT CLUSTERING FOR LARGE LANGUAGE MODEL COMPRESSION

    Publication Number: US20250037018A1

    Publication Date: 2025-01-30

    Application Number: US18658919

    Application Date: 2024-05-08

    Applicant: Apple Inc.

Abstract: The subject technology provides memory-efficient differentiable weight clustering for large language model compression. An apparatus determines a tensor including an attention map between learned weights of a trained machine learning model and corresponding centroids. The apparatus also determines a compressed attention table and a plurality of index lists during compression of the trained machine learning model based on a uniquification of the attention map and sharding of an associated index list. The apparatus determines, using a marshaling layer, whether the tensor exists at a destination device during compression of the trained machine learning model. The apparatus refrains from copying the tensor to the destination device when the tensor exists at the destination device, or copies the tensor to the destination device when the tensor does not exist at the destination device. The apparatus deploys a compressed machine learning model based on the compression of the trained machine learning model.
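The abstract describes three cooperating mechanisms: a soft attention map between weights and centroids, memory reduction through uniquification of that map plus sharding of the associated index list, and a marshaling step that skips redundant cross-device copies. The following is a minimal PyTorch-style sketch of how such pieces might fit together; all function names, shapes, and the device-check logic are illustrative assumptions, not the patented implementation.

```python
import torch

def attention_map(weights, centroids, temperature=1.0):
    # Soft assignment of each weight to each centroid, computed from
    # negative distances; shapes are illustrative ([N] weights, [K] centroids).
    dists = torch.cdist(weights.view(-1, 1), centroids.view(-1, 1))  # [N, K]
    return torch.softmax(-dists / temperature, dim=-1)

def uniquify_and_shard(attn, num_shards=4):
    # Keep only the unique attention rows (a compressed attention table) and
    # an index list mapping every weight back to its unique row; the index
    # list is then sharded to bound peak memory held at any one time.
    table, indices = torch.unique(attn, dim=0, return_inverse=True)
    shards = torch.chunk(indices, num_shards)
    return table, shards

def marshal(tensor, destination):
    # Hypothetical marshaling step: copy the tensor to the destination device
    # only if it is not already resident there.
    if tensor.device == torch.device(destination):
        return tensor              # already at destination; refrain from copying
    return tensor.to(destination)  # otherwise copy across devices
```

In this sketch, `uniquify_and_shard` stands in for the claimed compression of the attention table, and `marshal` stands in for the claimed copy-avoidance check; a production system would apply these inside the differentiable clustering loop rather than as standalone calls.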
