-
Publication number: US20240029719A1
Publication date: 2024-01-25
Application number: US18340093
Filing date: 2023-06-23
Applicant: Google LLC
Inventor: Shaan Jagdeep Patrick Bijwadia, Shuo-yiin Chang, Bo Li, Yanzhang He, Tara N. Sainath, Chao Zhang
CPC classification number: G10L15/16, G10L15/063, G10L25/93
Abstract: A single E2E multitask model includes a speech recognition model and an endpointer model. The speech recognition model includes an audio encoder configured to encode a sequence of audio frames into corresponding higher-order feature representations, and a decoder configured to generate probability distributions over possible speech recognition hypotheses for the sequence of audio frames based on the higher-order feature representations. The endpointer model is configured to operate between a VAD mode and an EOQ detection mode. During the VAD mode, the endpointer model receives input audio frames, and determines, for each input audio frame, whether the input audio frame includes speech. During the EOQ detection mode, the endpointer model receives latent representations for the sequence of audio frames output from the audio encoder, and determines, for each of the latent representations, whether the latent representation includes final silence.
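The two endpointer modes described in this abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Mode`, `toy_encoder`, and `endpointer_step`, the energy threshold, and the one-dimensional latent are hypothetical stand-ins, not the patented model. The sketch only shows the routing: raw frames are classified directly in VAD mode, while in EOQ mode the decision is made on the encoder's latent output.

```python
from enum import Enum
from typing import List, Sequence

Frame = Sequence[float]


class Mode(Enum):
    VAD = "vad"  # voice activity detection on raw audio frames
    EOQ = "eoq"  # end-of-query detection on encoder latents


def toy_encoder(frame: Frame) -> List[float]:
    # Hypothetical stand-in for the audio encoder:
    # a one-dimensional latent holding the mean absolute amplitude.
    return [sum(abs(x) for x in frame) / len(frame)]


def endpointer_step(mode: Mode, frame: Frame, threshold: float = 0.1) -> bool:
    """VAD mode: True if the raw frame contains speech.
    EOQ mode: True if the latent looks like silence (a toy proxy for final silence)."""
    if mode is Mode.VAD:
        # VAD mode operates on the input audio frame itself.
        energy = sum(x * x for x in frame) / len(frame)
        return energy > threshold
    # EOQ mode operates on the latent representation from the encoder.
    latent = toy_encoder(frame)
    return latent[0] < threshold


speech_detected = endpointer_step(Mode.VAD, [0.5, -0.5])
silence_detected = endpointer_step(Mode.EOQ, [0.0, 0.01])
```

A real endpointer would use learned classifiers in both modes; the thresholds above only make the control flow concrete.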
-
Publication number: US20230306958A1
Publication date: 2023-09-28
Application number: US18188632
Filing date: 2023-03-23
Applicant: Google LLC
Inventor: Chao Zhang, Bo Li, Tara N. Sainath, Trevor Strohman, Sepand Mavandadi, Shuo-yiin Chang, Parisa Haghani
CPC classification number: G10L15/005, G10L15/16, G10L15/063
Abstract: A method includes receiving a sequence of acoustic frames as input to an automatic speech recognition (ASR) model. The method also includes generating, by a first encoder, a first higher order feature representation for a corresponding acoustic frame. The method also includes generating, by a second encoder, a second higher order feature representation for a corresponding first higher order feature representation. The method also includes generating, by a language identification (ID) predictor, a language prediction representation based on a concatenation of the first higher order feature representation and the second higher order feature representation. The method also includes generating, by a first decoder, a first probability distribution over possible speech recognition hypotheses based on a concatenation of the second higher order feature representation and the language prediction representation.
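The data flow in this abstract (two cascaded encoders, a language ID predictor fed by their concatenated outputs, and a decoder fed by the second encoder output concatenated with the language prediction) can be traced with a small sketch. All function bodies below are hypothetical toy arithmetic chosen only so the code runs; the language prediction is shown as a probability vector for simplicity, which is an assumption rather than something the abstract states.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def first_encoder(frame: Vector) -> Vector:
    # Toy stand-in: first higher order feature representation.
    return [2.0 * x for x in frame]


def second_encoder(h1: Vector) -> Vector:
    # Toy stand-in: second representation computed from the first.
    return [x + 1.0 for x in h1]


def lid_predictor(features: Vector, num_languages: int = 2) -> Vector:
    # Toy stand-in: language prediction from the concatenated encoder outputs.
    total = sum(features)
    return softmax([total * (k + 1) for k in range(num_languages)])


def first_decoder(features: Vector, vocab_size: int = 3) -> Vector:
    # Toy stand-in: distribution over hypotheses.
    total = sum(features)
    return softmax([total / (k + 1) for k in range(vocab_size)])


def asr_step(frame: Vector) -> Tuple[Vector, Vector]:
    h1 = first_encoder(frame)
    h2 = second_encoder(h1)
    lang = lid_predictor(h1 + h2)     # list "+" concatenates h1 and h2
    probs = first_decoder(h2 + lang)  # concatenates h2 and the language prediction
    return lang, probs


lang, probs = asr_step([0.1, 0.2])
```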
-
Publication number: US12211509B2
Publication date: 2025-01-28
Application number: US17821160
Filing date: 2022-08-19
Applicant: Google LLC
Inventor: Chao Zhang, Bo Li, Zhiyun Lu, Tara N. Sainath, Shuo-yiin Chang
Abstract: A speech recognition model includes an encoder network, a prediction network, and a joint network. The encoder network is configured to receive a sequence of acoustic frames characterizing an input utterance; and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The prediction network is configured to: receive a sequence of non-blank symbols output by a final Softmax layer; and generate, at each of the plurality of output steps, a dense representation. The joint network is configured to generate, at each of the plurality of output steps based on the higher order feature representation and the dense representation, a probability distribution over possible speech recognition hypotheses. The joint network includes a stack of gating and bilinear pooling to fuse the dense representation and the higher order feature representation.
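The abstract does not give formulas for the gating and bilinear pooling in the joint network, so the following is only one plausible sketch under stated assumptions: the gate is a sigmoid over the concatenated inputs that interpolates between the two vectors, the bilinear term is an elementwise product of two linear projections, and the two results are summed. The names `gated_fusion`, `bilinear_pooling`, and `joint_fuse`, the weight matrices, and the requirement that both inputs share one dimension are all hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(enc: Vector, pred: Vector, w_gate: Matrix) -> Vector:
    # Assumed gate: sigmoid over the concatenation, interpolating enc and pred.
    gate = [sigmoid(z) for z in matvec(w_gate, enc + pred)]
    return [g * e + (1.0 - g) * p for g, e, p in zip(gate, enc, pred)]


def bilinear_pooling(enc: Vector, pred: Vector, u: Matrix, v: Matrix) -> Vector:
    # Assumed low-rank bilinear form: elementwise product of two projections.
    return [a * b for a, b in zip(matvec(u, enc), matvec(v, pred))]


def joint_fuse(enc: Vector, pred: Vector, w_gate: Matrix, u: Matrix, v: Matrix) -> Vector:
    # Assumed combination: sum of the gated and bilinear terms.
    gated = gated_fusion(enc, pred, w_gate)
    pooled = bilinear_pooling(enc, pred, u, v)
    return [g + b for g, b in zip(gated, pooled)]


identity = [[1.0, 0.0], [0.0, 1.0]]
zero_gate = [[0.0] * 4, [0.0] * 4]  # zero weights give a gate of exactly 0.5
fused = joint_fuse([1.0, 2.0], [3.0, 4.0], zero_gate, identity, identity)
```

In a full model the fused vector would pass through an output projection and the final Softmax layer to yield the probability distribution over hypotheses.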
-
Publication number: US20240135923A1
Publication date: 2024-04-25
Application number: US18485271
Filing date: 2023-10-11
Applicant: Google LLC
Inventor: Chao Zhang, Bo Li, Tara N. Sainath, Trevor Strohman, Shuo-yiin Chang
IPC: G10L15/197, G10L15/00, G10L15/02
CPC classification number: G10L15/197, G10L15/005, G10L15/02
Abstract: A method includes receiving a sequence of acoustic frames as input to a multilingual automated speech recognition (ASR) model configured to recognize speech in a plurality of different supported languages and generating, by an audio encoder of the multilingual ASR, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by a language identification (LID) predictor of the multilingual ASR, a language prediction representation for a corresponding higher order feature representation. The method also includes generating, by a decoder of the multilingual ASR, a probability distribution over possible speech recognition results based on the corresponding higher order feature representation, a sequence of non-blank symbols, and a corresponding language prediction representation. The decoder includes a monolingual output layer having a plurality of output nodes each sharing a plurality of language-specific wordpiece models.
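One way to read the shared output layer in this abstract is that a single set of output nodes is reused across languages, with each node index mapping to a different wordpiece depending on the language. The sketch below illustrates that reading only; the table `WORDPIECE_TABLES`, the function `decode_node`, the toy vocabularies, and the argmax-based language choice are all hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Tuple

# Hypothetical language-specific wordpiece vocabularies that share
# the same three output node indices.
WORDPIECE_TABLES: Dict[str, List[str]] = {
    "en": ["_the", "_cat", "s"],
    "fr": ["_le", "_chat", "s"],
}


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


def decode_node(
    node_probs: List[float], lid_probs: List[float], languages: List[str]
) -> Tuple[str, str]:
    # Pick the most likely language from the LID prediction (assumed argmax),
    # then interpret the winning shared node with that language's table.
    language = languages[argmax(lid_probs)]
    node = argmax(node_probs)
    return language, WORDPIECE_TABLES[language][node]


result = decode_node([0.1, 0.7, 0.2], [0.2, 0.8], ["en", "fr"])
```

Under this reading the output layer stays the size of one monolingual vocabulary regardless of how many languages are supported.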