Publication number: US20230306233A1
Publication date: 2023-09-28
Application number: US18103428
Filing date: 2023-01-30
Applicant: QUALCOMM Incorporated
Inventor: Marinus Willem VAN BAALEN, Brian KAHNE, Eric Wayne MAHURIN, Tijmen Pieter Frederik BLANKEVOORT, Andrey KUZMIN, Andrii SKLIAR, Markus NAGEL
IPC: G06N3/04
CPC classification number: G06N3/04
Abstract: A processor-implemented method includes bit shifting a binary representation of a neural network parameter. The neural network parameter has fewer bits, b, than a number of hardware bits, B, supported by hardware that processes the neural network parameter. The bit shifting effectively multiplies the neural network parameter by 2^(B-b). The method also includes dividing a quantization scale by 2^(B-b) to obtain an updated quantization scale. The method further includes quantizing the bit shifted binary representation with the updated quantization scale to obtain a value for the neural network parameter.
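
The arithmetic in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function name `shift_to_hardware_width`, the example values, and the reading that the final value is the product of the shifted integer and the updated scale are all assumptions made for illustration. It only shows that a left shift by B-b bits followed by dividing the scale by 2^(B-b) leaves the represented value unchanged.

```python
def shift_to_hardware_width(q: int, scale: float, b: int, B: int):
    """Hypothetical helper: place a b-bit integer code in a B-bit container.

    q     -- integer code of the parameter (b bits wide)
    scale -- quantization scale associated with q
    b, B  -- parameter bit width and hardware bit width, with b <= B
    """
    if not 0 < b <= B:
        raise ValueError("expected 0 < b <= B")
    shift = B - b
    shifted = q << shift                  # same as multiplying q by 2**(B - b)
    updated_scale = scale / (2 ** shift)  # compensate by dividing the scale
    return shifted, updated_scale


# Example values (assumed): a 4-bit code handled on 8-bit hardware.
q, scale = 5, 0.1
shifted, updated_scale = shift_to_hardware_width(q, scale, b=4, B=8)

# The represented value is the same before and after the shift.
original_value = q * scale
recovered_value = shifted * updated_scale
```

Here `shifted` is `5 << 4`, and `recovered_value` matches `original_value`, because the factor 2^(B-b) introduced by the shift is cancelled by the updated scale.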