-
公开(公告)号:US20240153495A1
公开(公告)日:2024-05-09
申请号:US18494984
申请日:2023-10-26
Applicant: Google LLC
Inventor: Weiran Wang , Ding Zhao , Shaojin Ding , Hao Zhang , Shuo-yiin Chang , David Johannes Rybach , Tara N. Sainath , Yanzhang He , Ian McGraw , Shankar Kumar
IPC: G10L15/06 , G06F40/284 , G10L15/26
CPC classification number: G10L15/063 , G06F40/284 , G10L15/26
Abstract: A method includes receiving a training dataset that includes one or more spoken training utterances for training an automatic speech recognition (ASR) model. Each spoken training utterance in the training dataset paired with a corresponding transcription and a corresponding target sequence of auxiliary tokens. For each spoken training utterance, the method includes generating a speech recognition hypothesis for a corresponding spoken training utterance, determining a speech recognition loss based on the speech recognition hypothesis and the corresponding transcription, generating a predicted auxiliary token for the corresponding spoken training utterance, and determining an auxiliary task loss based on the predicted auxiliary token and the corresponding target sequence of auxiliary tokens. The method also includes the ASR model jointly on the speech recognition loss and the auxiliary task loss determined for each spoken training utterance.
-
公开(公告)号:US20240347043A1
公开(公告)日:2024-10-17
申请号:US18632237
申请日:2024-04-10
Applicant: Google LLC
Inventor: David Qiu , David Rim , Shaojin Ding , Yanzhang He
IPC: G10L15/06
CPC classification number: G10L15/063
Abstract: A method includes obtaining a plurality of training samples, determining a minimum integer fixed-bit width representing a maximum quantization of an automatic speech recognition (ASR) model, and training the ASR model on the plurality of training samples using a quantity of random noise. The ASR model includes a plurality of weights that each include a respective float value. The quantity of random noise is based on the minimum integer fixed-bit value. After training the ASR model, the method also includes selecting a target integer fixed-bit width greater than or equal to the minimum integer fixed-bit width, and for each respective weight of the plurality of weights, quantizing the respective weight from the respective float value to a respective integer associated with a value of the selected target integer fixed-bit width. The operations also include providing the quantized trained ASR model to a user device.
-
公开(公告)号:US20230326461A1
公开(公告)日:2023-10-12
申请号:US18182925
申请日:2023-03-13
Applicant: Google LLC
Inventor: Shaojin Ding , Yangzhang He , Xin Wang , Weiran Wang , Trevor Strohman , Tara N. Sainath , Rohit Parkash Prabhavalkar , Robert David , Rina Panigrahy , Rami Botros , Qiao Liang , Ian Mcgraw , Ding Zhao , Dongseong Hwang
CPC classification number: G10L15/32 , G10L15/16 , G10L15/22 , G10L2015/223
Abstract: An automated speech recognition (ASR) model includes a first encoder, a first encoder, a second encoder, and a second decoder. The first encoder receives, as input, a sequence of acoustic frames, and generates, at each of a plurality of output steps, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The first decoder receives, as input, the first higher order feature representation generated by the first encoder, and generates a first probability distribution over possible speech recognition hypotheses. The second encoder receives, as input, the first higher order feature representation generated by the first encoder, and generates a second higher order feature representation for a corresponding first higher order feature frame. The second decoder receives, as input, the second higher order feature representation generated by the second encoder, and generates a second probability distribution over possible speech recognition hypotheses.
-
公开(公告)号:US20230298591A1
公开(公告)日:2023-09-21
申请号:US18123060
申请日:2023-03-17
Applicant: Google LLC
Inventor: Shaojin Ding , Rajeev Rikhye , Qiao Liang , Yanzhang He , Quan Wang , Arun Narayanan , Tom O'Malley , Ian McGraw
Abstract: A computer-implemented method includes receiving a sequence of acoustic frames corresponding to an utterance and generating a reference speaker embedding for the utterance. The method also includes receiving a target speaker embedding for a target speaker and generating feature-wise linear modulation (FiLM) parameters including a scaling vector and a shifting vector based on the target speaker embedding. The method also includes generating an affine transformation output that scales and shifts the reference speaker embedding based on the FiLM parameters. The method also includes generating a classification output indicating whether the utterance was spoken by the target speaker based on the affine transformation output.
-
公开(公告)号:US20230298569A1
公开(公告)日:2023-09-21
申请号:US18186774
申请日:2023-03-20
Applicant: Google LLC
Inventor: Shaojin Ding , Oleg Rybakov , Phoenix Meadowlark , Shivani Agrawal , Yanzhang He , Lukasz Lew
CPC classification number: G10L15/063 , G10L15/16
Abstract: A method for training a model includes obtaining a plurality of training samples. Each respective training sample of the plurality of training samples includes a respective speech utterance and a respective textual utterance representing a transcription of the respective speech utterance. The method includes training, using quantization aware training with native integer operations, an automatic speech recognition (ASR) model on the plurality of training samples. The method also includes quantizing the trained ASR model to an integer target fixed-bit width. The quantized trained ASR model includes a plurality of weights. Each weight of the plurality of weights includes an integer with the target fixed-bit width. The method includes providing the quantized trained ASR model to a user device.
-
-
-
-