MEMORY-EFFICIENT DIFFERENTIABLE WEIGHT CLUSTERING FOR LARGE LANGUAGE MODEL COMPRESSION

    Publication No.: US20250037018A1

    Publication Date: 2025-01-30

    Application No.: US18658919

    Application Date: 2024-05-08

    Applicant: Apple Inc.

    Abstract: The subject technology provides memory-efficient differentiable weight clustering for large language model compression. An apparatus determines a tensor including an attention map between learned weights of a trained machine learning model and corresponding centroids. The apparatus also determines a compressed attention table and a plurality of index lists during compression of the trained machine learning model based on a uniquification of the attention map and sharding of an associated index list. The apparatus determines, using a marshaling layer, whether the tensor exists at a destination device during compression of the trained machine learning model. The apparatus refrains from copying the tensor to the destination device when the tensor exists at the destination device, or copies the tensor to the destination device when the tensor does not exist at the destination device. The apparatus deploys a compressed machine learning model based on the compression of the trained machine learning model.
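The abstract describes two memory-saving mechanisms: deduplicating ("uniquifying") the weight-to-centroid attention map into a compact table plus sharded index lists, and a marshaling layer that copies a tensor to a destination device only when it is not already there. A minimal sketch of those two ideas, with hypothetical function names and NumPy standing in for the patent's tensor machinery:

```python
import numpy as np

def uniquify_attention_map(attention_map: np.ndarray):
    """Deduplicate the rows of a weight-to-centroid attention map.

    Returns a compressed attention table holding each distinct row once,
    plus an index list mapping every weight back to its row in the table.
    """
    table, index_list = np.unique(attention_map, axis=0, return_inverse=True)
    return table, index_list

def shard_index_list(index_list: np.ndarray, num_shards: int):
    """Split the index list into shards, e.g. one per device."""
    return np.array_split(index_list, num_shards)

def maybe_copy(tensor_id: str, tensor: np.ndarray, destination: dict):
    """Marshaling-layer check (sketch): copy the tensor to the
    destination only if it is not already resident there."""
    if tensor_id not in destination:
        destination[tensor_id] = tensor.copy()
    return destination[tensor_id]
```

Because many weights attend to centroids in identical patterns, the deduplicated table plus small integer index lists can be far smaller than the dense attention map, and the marshaling check avoids redundant host-device transfers.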

    CUSTOMIZABLE CHIP FOR AI APPLICATIONS

    Publication No.: US20220343135A1

    Publication Date: 2022-10-27

    Application No.: US17860031

    Application Date: 2022-07-07

    Applicant: Apple Inc.

    Abstract: In one embodiment, a computing device includes an input sensor providing an input data; a programmable logic device (PLD) implementing a convolutional neural network (CNN), wherein: each compute block of the PLD corresponds to one of a multiple of convolutional layers of the CNN, each compute block of the PLD is placed in proximity to at least two memory blocks, a first one of the memory blocks serves as a buffer for the corresponding layer of the CNN, and a second one of the memory blocks stores model-specific parameters for the corresponding layer of the CNN.
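The second abstract describes a spatial layout on the programmable logic device: one compute block per convolutional layer, each placed next to at least two memory blocks, one serving as the layer's activation buffer and the other holding that layer's model-specific parameters. A small data-model sketch of that layout (all names are hypothetical, not from the patent):

```python
from dataclasses import dataclass

@dataclass
class MemoryBlock:
    role: str        # "buffer" (layer activations) or "parameters" (weights)
    size_bytes: int

@dataclass
class ComputeBlock:
    layer_index: int          # which convolutional layer this block implements
    buffer: MemoryBlock       # adjacent memory block: per-layer activation buffer
    parameters: MemoryBlock   # adjacent memory block: per-layer model parameters

def build_pld_layout(layer_param_sizes, buffer_size=4096):
    """Sketch of the layout in the abstract: one compute block per CNN
    layer, each paired with a local buffer and a local parameter store."""
    return [
        ComputeBlock(
            layer_index=i,
            buffer=MemoryBlock("buffer", buffer_size),
            parameters=MemoryBlock("parameters", size),
        )
        for i, size in enumerate(layer_param_sizes)
    ]
```

Keeping each layer's buffer and parameters physically adjacent to its compute block shortens data paths, which is the stated point of the near-memory placement.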
