-
Publication No.: US20240029719A1
Publication Date: 2024-01-25
Application No.: US18340093
Filing Date: 2023-06-23
Applicant: Google LLC
Inventor: Shaan Jagdeep Patrick Bijwadia , Shuo-yiin Chang , Bo Li , Yanzhang He , Tara N. Sainath , Chao Zhang
CPC classification number: G10L15/16 , G10L15/063 , G10L25/93
Abstract: A single E2E multitask model includes a speech recognition model and an endpointer model. The speech recognition model includes an audio encoder configured to encode a sequence of audio frames into corresponding higher-order feature representations, and a decoder configured to generate probability distributions over possible speech recognition hypotheses for the sequence of audio frames based on the higher-order feature representations. The endpointer model is configured to switch between a voice activity detection (VAD) mode and an end-of-query (EOQ) detection mode. During the VAD mode, the endpointer model receives input audio frames and determines, for each input audio frame, whether the input audio frame includes speech. During the EOQ detection mode, the endpointer model receives latent representations for the sequence of audio frames output from the audio encoder and determines, for each latent representation, whether the latent representation includes final silence.
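The dual-mode endpointer described above can be illustrated with a minimal pure-Python sketch. The class, thresholds, and scoring functions below are hypothetical stand-ins for the patented neural model, chosen only to show the mode split: VAD mode scores raw audio frames for speech, while EOQ mode scores encoder latents for final silence.

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class Endpointer:
    """Sketch of a dual-mode endpointer (all scoring logic hypothetical)."""

    def __init__(self, speech_threshold=0.5, silence_threshold=0.5):
        self.speech_threshold = speech_threshold
        self.silence_threshold = silence_threshold

    def vad_step(self, frame_score):
        # VAD mode: per input audio frame, decide whether it contains speech.
        return _sigmoid(frame_score) > self.speech_threshold

    def eoq_step(self, latent):
        # EOQ mode: per encoder latent representation, decide whether it
        # signals final silence (low average activation in this toy version).
        score = _sigmoid(sum(latent) / len(latent))
        return score < self.silence_threshold
```

In a real system both heads would be learned jointly with the recognizer; here the thresholds simply make the two decision paths explicit.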
-
Publication No.: US20210142174A1
Publication Date: 2021-05-13
Application No.: US17152918
Filing Date: 2021-01-20
Applicant: Google LLC
Inventor: Shuo-yiin Chang , Bo Li , Gabor Simko , Maria Carolina Parada San Martin , Sean Matthew Shannon
Abstract: A method for training an endpointer model uses both short-form speech utterances and long-form speech utterances. The method includes providing a short-form speech utterance as input to a shared neural network configured to learn shared hidden representations suitable for both voice activity detection (VAD) and end-of-query (EOQ) detection. The method also includes generating, using a VAD classifier, a sequence of predicted VAD labels and determining a VAD loss by comparing the sequence of predicted VAD labels to a corresponding sequence of reference VAD labels. The method also includes generating, using an EOQ classifier, a sequence of predicted EOQ labels and determining an EOQ loss by comparing the sequence of predicted EOQ labels to a corresponding sequence of reference EOQ labels. The method also includes training, using a cross-entropy criterion, the endpointer model based on the VAD loss and the EOQ loss.
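The training objective described above (a cross-entropy VAD loss plus a cross-entropy EOQ loss over the shared network's two classifier heads) can be sketched as follows; the weighting scheme and function names are illustrative assumptions, not taken from the patent:

```python
import math

def cross_entropy(pred_probs, labels):
    # Framewise binary cross-entropy between predicted label probabilities
    # and reference labels, averaged over the sequence.
    eps = 1e-9
    total = 0.0
    for p, y in zip(pred_probs, labels):
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(labels)

def multitask_loss(vad_probs, vad_labels, eoq_probs, eoq_labels,
                   vad_weight=1.0, eoq_weight=1.0):
    # Total training loss: weighted sum of the VAD loss and the EOQ loss,
    # both computed against their respective reference label sequences.
    return (vad_weight * cross_entropy(vad_probs, vad_labels)
            + eoq_weight * cross_entropy(eoq_probs, eoq_labels))
```

Perfect predictions drive both terms to zero; uncertain predictions are penalized on both tasks at once, which is what lets the shared layers learn representations useful for VAD and EOQ simultaneously.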
-
Publication No.: US20240290320A1
Publication Date: 2024-08-29
Application No.: US18585020
Filing Date: 2024-02-22
Applicant: Google LLC
Inventor: Wenqian Huang , Hao Zhang , Shankar Kumar , Shuo-yiin Chang , Tara N. Sainath
CPC classification number: G10L15/063 , G06F40/30 , G10L15/26
Abstract: A joint segmenting and ASR model includes an encoder to receive a sequence of acoustic frames and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame. The model also includes a decoder to generate, based on the higher order feature representation at each of the plurality of output steps, a probability distribution over possible speech recognition hypotheses and an indication of whether the corresponding output step corresponds to an end of segment (EOS). The model is trained on a set of training samples, each training sample including audio data characterizing multiple segments of long-form speech and a corresponding transcription of the long-form speech, the corresponding transcription annotated with ground-truth EOS labels obtained via distillation from a language model teacher that receives the corresponding transcription as input and injects the ground-truth EOS labels into the corresponding transcription between semantically complete segments.
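The label-preparation step above (a teacher injects EOS labels between semantically complete segments of a long-form transcription) reduces to simple string assembly once the teacher has produced the segments. A minimal sketch, where the segment list is assumed to come from some hypothetical teacher model and the token string is an arbitrary choice:

```python
def inject_eos(segments, eos_token="<eos>"):
    # Rebuild one long-form transcription from semantically complete
    # segments, inserting the ground-truth EOS label between them.
    return f" {eos_token} ".join(segments)
```

The annotated transcriptions then serve as training targets, so the student ASR model learns to emit the EOS indication at segment boundaries.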
-
Publication No.: US20200335091A1
Publication Date: 2020-10-22
Application No.: US16809403
Filing Date: 2020-03-04
Applicant: Google LLC
Inventor: Shuo-yiin Chang , Rohit Prakash Prabhavalkar , Gabor Simko , Tara N. Sainath , Bo Li , Yanzhang He
Abstract: A method includes receiving audio data of an utterance and processing the audio data to obtain, as output from a speech recognition model configured to jointly perform speech decoding and endpointing of utterances: partial speech recognition results for the utterance; and an endpoint indication indicating when the utterance has ended. While processing the audio data, the method also includes detecting, based on the endpoint indication, the end of the utterance. In response to detecting the end of the utterance, the method also includes terminating the processing of any subsequent audio data received after the end of the utterance was detected.
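The control flow described above (stream frames through a model that emits both partial hypotheses and an endpoint indication, and stop processing as soon as the endpoint fires) can be sketched as a short loop. The `step_fn` callable is a hypothetical stand-in for the joint decoding/endpointing model:

```python
def decode_until_endpoint(frames, step_fn):
    # step_fn(frame) -> (partial_hypothesis, endpoint_detected), standing in
    # for one streaming step of the joint ASR/endpointing model.
    hypothesis = ""
    for frame in frames:
        hypothesis, endpoint = step_fn(frame)
        if endpoint:
            break  # terminate: subsequent audio after the endpoint is ignored
    return hypothesis
```

Because endpointing is folded into the recognizer itself, no separate endpointer has to be consulted between decoding steps.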
-
Publication No.: US12094453B2
Publication Date: 2024-09-17
Application No.: US17447285
Filing Date: 2021-09-09
Applicant: Google LLC
Inventor: Jiahui Yu , Chung-cheng Chiu , Bo Li , Shuo-yiin Chang , Tara Sainath , Wei Han , Anmol Gulati , Yanzhang He , Arun Narayanan , Yonghui Wu , Ruoming Pang
IPC: G10L15/06 , G10L15/16 , G10L15/187 , G10L15/22 , G10L15/30
CPC classification number: G10L15/063 , G10L15/16 , G10L15/22 , G10L15/30 , G10L15/187
Abstract: A computer-implemented method of training a streaming speech recognition model that includes receiving, as input to the streaming speech recognition model, a sequence of acoustic frames. The streaming speech recognition model is configured to learn an alignment probability between the sequence of acoustic frames and an output sequence of vocabulary tokens. The vocabulary tokens include a plurality of label tokens and a blank token. At each output step, the method includes determining a first probability of emitting one of the label tokens and determining a second probability of emitting the blank token. The method also includes generating the alignment probability at a sequence level based on the first probability and the second probability. The method also includes applying a tuning parameter to the alignment probability at the sequence level to maximize the first probability of emitting one of the label tokens.
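The tuning-parameter idea above (up-weighting label emission relative to blank emission so the streaming model emits tokens sooner) can be illustrated with a toy per-step loss. This is loosely modeled on FastEmit-style regularization and is not the patent's exact sequence-level formulation; the function and parameter names are assumptions:

```python
import math

def tuned_step_loss(p_label, p_blank, lam=0.01):
    # Toy transducer-style step loss with a tuning parameter `lam`:
    # the extra term penalizes low label-emission probability, nudging
    # the model toward emitting label tokens instead of blanks.
    base = -math.log(p_label + p_blank)
    emit_bonus = -lam * math.log(p_label)
    return base + emit_bonus
```

With the same total probability mass, a step that puts more mass on label tokens incurs a strictly lower loss, which is the intended bias toward earlier emission.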
-
Publication No.: US20240304181A1
Publication Date: 2024-09-12
Application No.: US18598523
Filing Date: 2024-03-07
Applicant: Google LLC
Inventor: Guru Prakash Arumugam , Shuo-yiin Chang , Shaan Jagdeep Patrick Bijwadia , Weiran Wang , Quan Wang , Rohit Prakash Prabhavalkar , Tara N. Sainath
IPC: G10L15/06
CPC classification number: G10L15/063
Abstract: A method includes receiving a plurality of training samples spanning multiple different domains. Each corresponding training sample includes audio data characterizing an utterance paired with a corresponding transcription of the utterance. The method also includes re-labeling each corresponding training sample of the plurality of training samples by annotating the corresponding transcription of the utterance with one or more speaker tags. Each speaker tag indicates a respective segment of the transcription for speech that was spoken by a particular type of speaker. The method also includes training a multi-domain speech recognition model on the re-labeled training samples to teach the multi-domain speech recognition model to learn to share parameters for recognizing speech across each of the multiple different domains.
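The re-labeling step above can be sketched as a small annotation helper. The tag syntax and the segment source (some hypothetical upstream pass that attributes each transcription segment to a speaker type) are illustrative assumptions:

```python
def relabel_with_speaker_tags(segments):
    # segments: list of (speaker_type, text) pairs from a hypothetical
    # annotation pass; returns a single transcription in which each
    # segment is prefixed by a tag naming the type of speaker.
    return " ".join(f"<spk:{spk}> {text}" for spk, text in segments)
```

Training on transcriptions re-labeled this way lets one model serve dictation, voice-command, and captioning domains while sharing its recognition parameters.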
-
Publication No.: US20240185844A1
Publication Date: 2024-06-06
Application No.: US18489970
Filing Date: 2023-10-19
Applicant: Google LLC
Inventor: Shuo-yiin Chang
IPC: G10L15/18 , G10L15/183
CPC classification number: G10L15/18 , G10L15/183
Abstract: A method includes receiving a sequence of acoustic frames characterizing an input utterance and generating a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames by an audio encoder of an automatic speech recognition (ASR) model. The method also includes generating a context embedding corresponding to one or more previous transcriptions output by the ASR model by a context encoder of the ASR model and generating, by a prediction network of the ASR model, a dense representation based on a sequence of non-blank symbols output by a final Softmax layer. The method also includes generating, by a joint network of the ASR model, a probability distribution over possible speech recognition hypotheses based on the context embedding generated by the context encoder, the higher order feature representation generated by the audio encoder, and the dense representation generated by the prediction network.
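The joint network above fuses three inputs: the context embedding, the audio encoder's higher order feature representation, and the prediction network's dense representation. A minimal pure-Python sketch of one such fusion step, using plain concatenation plus a single linear layer and softmax as an assumed (not patented) combination scheme:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def joint_step(context_emb, audio_feat, dense_rep, weights):
    # Concatenate the three input vectors and apply one linear layer
    # followed by softmax; each row of `weights` scores one vocab entry.
    fused = context_emb + audio_feat + dense_rep  # list concatenation
    logits = [sum(w * x for w, x in zip(row, fused)) for row in weights]
    return softmax(logits)
```

The output is a probability distribution over the hypothesis vocabulary; conditioning on previous-transcription context is what distinguishes this joiner from a standard transducer joint network.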
-
Publication No.: US20230306958A1
Publication Date: 2023-09-28
Application No.: US18188632
Filing Date: 2023-03-23
Applicant: Google LLC
Inventor: Chao Zhang , Bo Li , Tara N. Sainath , Trevor Strohman , Sepand Mavandadi , Shuo-yiin Chang , Parisa Haghani
CPC classification number: G10L15/005 , G10L15/16 , G10L15/063
Abstract: A method includes receiving a sequence of acoustic frames as input to an automatic speech recognition (ASR) model. The method also includes generating, by a first encoder, a first higher order feature representation for a corresponding acoustic frame. The method also includes generating, by a second encoder, a second higher order feature representation for a corresponding first higher order feature representation. The method also includes generating, by a language identification (ID) predictor, a language prediction representation based on a concatenation of the first higher order feature representation and the second higher order feature representation. The method also includes generating, by a first decoder, a first probability distribution over possible speech recognition hypotheses based on a concatenation of the second higher order feature representation and the language prediction representation.
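The dataflow in the abstract above (cascaded encoders feeding a language ID predictor, whose output conditions the decoder) can be made concrete with a small sketch in which each component is a callable stand-in; the function names and the use of list concatenation for the two concatenations are assumptions:

```python
def cascaded_lang_id_step(frame, enc1, enc2, lang_predictor, decoder):
    # Each argument after `frame` is a callable standing in for one
    # network component of the cascaded model.
    h1 = enc1(frame)                # first higher order feature representation
    h2 = enc2(h1)                   # second higher order feature representation
    lang = lang_predictor(h1 + h2)  # concat of h1 and h2 -> language prediction
    return decoder(h2 + lang)       # concat of h2 and lang -> hypothesis dist.
```

Feeding the language prediction into the decoder lets a single multilingual model adapt its recognition output to the predicted language per utterance.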
-
Publication No.: US12211509B2
Publication Date: 2025-01-28
Application No.: US17821160
Filing Date: 2022-08-19
Applicant: Google LLC
Inventor: Chao Zhang , Bo Li , Zhiyun Lu , Tara N. Sainath , Shuo-yiin Chang
Abstract: A speech recognition model includes an encoder network, a prediction network, and a joint network. The encoder network is configured to receive a sequence of acoustic frames characterizing an input utterance; and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The prediction network is configured to: receive a sequence of non-blank symbols output by a final Softmax layer; and generate, at each of the plurality of output steps, a dense representation. The joint network is configured to generate, at each of the plurality of output steps based on the higher order feature representation and the dense representation, a probability distribution over possible speech recognition hypotheses. The joint network includes a stack of gating and bilinear pooling to fuse the dense representation and the higher order feature representation.
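The joint network's fusion mechanism above (gating stacked with bilinear pooling over the dense representation and the higher order feature representation) can be sketched in miniature. The gate placement, weight shapes, and names below are illustrative assumptions, not the patented parameterization:

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_bilinear_fuse(acoustic, dense, w_gate, w_bilinear):
    # Gating: a sigmoid gate computed from both inputs modulates the
    # dense representation elementwise.
    gate = [_sigmoid(sum(wg * x for wg, x in zip(row, acoustic + dense)))
            for row in w_gate]
    gated = [g * d for g, d in zip(gate, dense)]
    # Bilinear pooling: output dim k is acoustic^T W_k gated, capturing
    # multiplicative interactions between the two representations.
    return [sum(acoustic[i] * w_k[i][j] * gated[j]
                for i in range(len(acoustic)) for j in range(len(gated)))
            for w_k in w_bilinear]
```

Compared with plain concatenation, the bilinear term lets every pair of acoustic and dense features interact, while the gate controls how much of the dense representation participates.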
-
Publication No.: US20240296840A1
Publication Date: 2024-09-05
Application No.: US18592590
Filing Date: 2024-03-01
Applicant: Google LLC
Inventor: Shaan Jagdeep Patrick Bijwadia , Shuo-yiin Chang , Tara N. Sainath , Weiran Wang , Zhong Meng
IPC: G10L15/197 , G10L15/02 , G10L15/06
CPC classification number: G10L15/197 , G10L15/02 , G10L15/063
Abstract: A joint auxiliary task and ASR model includes an encoder to receive a sequence of acoustic frames and generate, at each of a plurality of output steps, a higher-order feature representation for a corresponding acoustic frame. The model also includes a multi-output HAT decoder to generate, at each of the plurality of output steps, a probability distribution over possible speech recognition hypotheses and an indication of whether the output step corresponds to an auxiliary token associated with a particular auxiliary task. The model is trained by a JEIT training process based on: a paired training data set including paired audio data and transcriptions, the transcriptions annotated with ground-truth auxiliary tokens associated with the particular auxiliary task; and an unpaired training data set including textual utterances not paired with any corresponding audio data, the textual utterances annotated with the ground-truth auxiliary tokens associated with the particular auxiliary task.
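The training setup above mixes a paired (audio plus transcription) objective with a text-only objective over the unpaired set. A minimal sketch of combining the two, where the averaging and the `text_weight` parameter are assumptions rather than the patent's exact JEIT formulation:

```python
def jeit_loss(paired_losses, unpaired_losses, text_weight=0.1):
    # Combine the per-example losses from the paired audio-text set with
    # the losses from the text-only set, down-weighting the latter.
    paired = sum(paired_losses) / len(paired_losses)
    unpaired = sum(unpaired_losses) / len(unpaired_losses)
    return paired + text_weight * unpaired
```

Because both data sets carry the same ground-truth auxiliary tokens, the auxiliary task is supervised even on utterances that have no audio.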