Unified End-To-End Speech Recognition And Endpointing Using A Switch Connection

    Publication No.: US20240029719A1

    Publication Date: 2024-01-25

    Application No.: US18340093

    Filing Date: 2023-06-23

    Applicant: Google LLC

    CPC classification number: G10L15/16 G10L15/063 G10L25/93

    Abstract: A single E2E multitask model includes a speech recognition model and an endpointer model. The speech recognition model includes an audio encoder configured to encode a sequence of audio frames into corresponding higher-order feature representations, and a decoder configured to generate probability distributions over possible speech recognition hypotheses for the sequence of audio frames based on the higher-order feature representations. The endpointer model is configured to switch between a voice activity detection (VAD) mode and an end-of-query (EOQ) detection mode. During the VAD mode, the endpointer model receives input audio frames and determines, for each input audio frame, whether the input audio frame includes speech. During the EOQ detection mode, the endpointer model receives latent representations for the sequence of audio frames output from the audio encoder and determines, for each latent representation, whether the latent representation includes final silence.
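
    The switch connection can be pictured as a small module whose input source is selected by the active mode. Below is a minimal PyTorch sketch; the class name, dimensions, and single-layer classifier are illustrative assumptions, not the patent's implementation.

    ```python
    import torch
    import torch.nn as nn

    class SwitchedEndpointer(nn.Module):
        """Endpointer fed via a switch: raw frames (VAD) or encoder latents (EOQ)."""

        def __init__(self, frame_dim=80, latent_dim=512, hidden_dim=128):
            super().__init__()
            self.frame_proj = nn.Linear(frame_dim, hidden_dim)    # VAD-mode input
            self.latent_proj = nn.Linear(latent_dim, hidden_dim)  # EOQ-mode input
            self.classifier = nn.Linear(hidden_dim, 1)            # per-step score

        def forward(self, frames, encoder_latents, mode="vad"):
            # The "switch": select which representation feeds the endpointer.
            if mode == "vad":
                x = self.frame_proj(frames)            # detect speech vs. non-speech
            else:
                x = self.latent_proj(encoder_latents)  # detect final silence
            return torch.sigmoid(self.classifier(x))
    ```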

    Unified Endpointer Using Multitask and Multidomain Learning

    Publication No.: US20210142174A1

    Publication Date: 2021-05-13

    Application No.: US17152918

    Filing Date: 2021-01-20

    Applicant: Google LLC

    Abstract: A method for training an endpointer model includes obtaining short-form speech utterances and long-form speech utterances. The method also includes providing a short-form speech utterance as input to a shared neural network, the shared neural network configured to learn shared hidden representations suitable for both voice activity detection (VAD) and end-of-query (EOQ) detection. The method also includes generating, using a VAD classifier, a sequence of predicted VAD labels and determining a VAD loss by comparing the sequence of predicted VAD labels to a corresponding sequence of reference VAD labels. The method also includes generating, using an EOQ classifier, a sequence of predicted EOQ labels and determining an EOQ loss by comparing the sequence of predicted EOQ labels to a corresponding sequence of reference EOQ labels. The method also includes training, using a cross-entropy criterion, the endpointer model based on the VAD loss and the EOQ loss.
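
    The objective reduces to two per-frame cross-entropy terms computed over the shared hidden representations. A minimal sketch follows, assuming simple classifier heads and a tunable weighting; the shapes and weighting scheme are illustrative, not the patent's specifics.

    ```python
    import torch.nn.functional as F

    def endpointer_loss(shared_hidden, vad_head, eoq_head,
                        vad_labels, eoq_labels, eoq_weight=1.0):
        """Cross-entropy over per-frame VAD and EOQ label sequences.

        shared_hidden: (batch, time, dim) output of the shared network.
        *_labels: (batch, time) integer reference labels.
        """
        vad_logits = vad_head(shared_hidden)  # (batch, time, num_vad_classes)
        eoq_logits = eoq_head(shared_hidden)  # (batch, time, num_eoq_classes)
        vad_loss = F.cross_entropy(vad_logits.flatten(0, 1), vad_labels.flatten())
        eoq_loss = F.cross_entropy(eoq_logits.flatten(0, 1), eoq_labels.flatten())
        return vad_loss + eoq_weight * eoq_loss
    ```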

    Semantic Segmentation With Language Models For Long-Form Automatic Speech Recognition

    Publication No.: US20240290320A1

    Publication Date: 2024-08-29

    Application No.: US18585020

    Filing Date: 2024-02-22

    Applicant: Google LLC

    CPC classification number: G10L15/063 G06F40/30 G10L15/26

    Abstract: A joint segmenting and ASR model includes an encoder to receive a sequence of acoustic frames and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame. The model also includes a decoder to generate, at each of the plurality of output steps and based on the higher order feature representation: a probability distribution over possible speech recognition hypotheses; and an indication of whether the corresponding output step corresponds to an end of segment (EOS). The model is trained on a set of training samples, each training sample including audio data characterizing multiple segments of long-form speech and a corresponding transcription of the long-form speech, the corresponding transcription annotated with ground-truth EOS labels obtained via distillation from a language model teacher that receives the corresponding transcription as input and injects the ground-truth EOS labels into the corresponding transcription between semantically complete segments.
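
    The annotation step amounts to splicing EOS tokens into the transcription at the boundaries the language model teacher identifies. A minimal sketch, assuming the teacher's output has been reduced to a list of word-index boundaries (a hypothetical interface):

    ```python
    EOS = "<eos>"

    def inject_eos_labels(transcript: str, boundaries: list[int]) -> str:
        """Insert ground-truth EOS tokens after semantically complete segments.

        `boundaries` holds word indices after which the teacher judged a
        segment to be semantically complete.
        """
        words = transcript.split()
        out = []
        for i, word in enumerate(words):
            out.append(word)
            if i in boundaries:
                out.append(EOS)
        return " ".join(out)

    # Teacher marked boundaries after words 2 and 5:
    print(inject_eos_labels("turn it off play some jazz", [2, 5]))
    # -> "turn it off <eos> play some jazz <eos>"
    ```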

    Joint Endpointing And Automatic Speech Recognition

    Publication No.: US20200335091A1

    Publication Date: 2020-10-22

    Application No.: US16809403

    Filing Date: 2020-03-04

    Applicant: Google LLC

    Abstract: A method includes receiving audio data of an utterance and processing the audio data to obtain, as output from a speech recognition model configured to jointly perform speech decoding and endpointing of utterances: partial speech recognition results for the utterance; and an endpoint indication indicating when the utterance has ended. While processing the audio data, the method also includes detecting, based on the endpoint indication, the end of the utterance. In response to detecting the end of the utterance, the method also includes terminating the processing of any subsequent audio data received after the end of the utterance was detected.
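
    The decode loop this implies is simple: consume frames, keep the latest partial result, and stop as soon as the endpoint indication fires. A minimal sketch, where the `model.step` interface is a hypothetical stand-in for the joint ASR/endpointing model:

    ```python
    def recognize(model, audio_stream):
        """Stream frames until the model emits an endpoint indication."""
        state = model.initial_state()
        hypothesis = []
        for frame in audio_stream:
            partial_result, endpoint, state = model.step(frame, state)
            hypothesis = partial_result  # latest partial recognition result
            if endpoint:                 # end of utterance detected
                break                    # terminate; ignore subsequent audio
        return hypothesis
    ```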

    CONNECTING DIFFERENT ASR APPLICATION DOMAINS WITH SPEAKER-TAGS

    Publication No.: US20240304181A1

    Publication Date: 2024-09-12

    Application No.: US18598523

    Filing Date: 2024-03-07

    Applicant: Google LLC

    CPC classification number: G10L15/063

    Abstract: A method includes receiving a plurality of training samples spanning multiple different domains. Each corresponding training sample includes audio data characterizing an utterance paired with a corresponding transcription of the utterance. The method also includes re-labeling each corresponding training sample of the plurality of training samples by annotating the corresponding transcription of the utterance with one or more speaker tags. Each speaker tag indicates a respective segment of the transcription for speech that was spoken by a particular type of speaker. The method also includes training a multi-domain speech recognition model on the re-labeled training samples to teach the multi-domain speech recognition model to learn to share parameters for recognizing speech across each of the multiple different domains.
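
    The re-labeling step can be illustrated as plain transcript annotation. A minimal sketch, assuming pre-segmented transcripts and a two-speaker tag vocabulary (both illustrative assumptions):

    ```python
    def relabel(segments: list[tuple[str, str]]) -> str:
        """Annotate each transcript segment with the type of speaker who spoke it.

        `segments` pairs a speaker type (e.g. "user", "agent") with its text.
        """
        tags = {"user": "<spk:user>", "agent": "<spk:agent>"}
        return " ".join(f"{tags[speaker]} {text}" for speaker, text in segments)

    print(relabel([("user", "what's the weather"), ("agent", "it is sunny")]))
    # -> "<spk:user> what's the weather <spk:agent> it is sunny"
    ```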

    CONTEXT-AWARE END-TO-END ASR FUSION OF CONTEXT, ACOUSTIC AND TEXT REPRESENTATIONS

    Publication No.: US20240185844A1

    Publication Date: 2024-06-06

    Application No.: US18489970

    Filing Date: 2023-10-19

    Applicant: Google LLC

    Inventor: Shuo-yiin Chang

    CPC classification number: G10L15/18 G10L15/183

    Abstract: A method includes receiving a sequence of acoustic frames characterizing an input utterance and generating, by an audio encoder of an automatic speech recognition (ASR) model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by a context encoder of the ASR model, a context embedding corresponding to one or more previous transcriptions output by the ASR model, and generating, by a prediction network of the ASR model, a dense representation based on a sequence of non-blank symbols output by a final Softmax layer. The method also includes generating, by a joint network of the ASR model, a probability distribution over possible speech recognition hypotheses based on the context embedding generated by the context encoder, the higher order feature representation generated by the audio encoder, and the dense representation generated by the prediction network.
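
    One way to realize the three-way fusion is to concatenate the context embedding with the acoustic and text representations inside the joint network. A minimal PyTorch sketch; the dimensions and the concatenation-based fusion are assumptions, since the abstract does not fix the fusion mechanism:

    ```python
    import torch
    import torch.nn as nn

    class ContextAwareJoint(nn.Module):
        def __init__(self, audio_dim=512, ctx_dim=256, pred_dim=640, vocab=4096):
            super().__init__()
            self.proj = nn.Linear(audio_dim + ctx_dim + pred_dim, 640)
            self.out = nn.Linear(640, vocab)

        def forward(self, audio_rep, context_emb, dense_rep):
            # Fuse acoustic, context, and text (prediction-network) inputs.
            fused = torch.cat([audio_rep, context_emb, dense_rep], dim=-1)
            return torch.log_softmax(self.out(torch.tanh(self.proj(fused))), dim=-1)
    ```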

    Streaming End-to-end Multilingual Speech Recognition with Joint Language Identification

    Publication No.: US20230306958A1

    Publication Date: 2023-09-28

    Application No.: US18188632

    Filing Date: 2023-03-23

    Applicant: Google LLC

    CPC classification number: G10L15/005 G10L15/16 G10L15/063

    Abstract: A method includes receiving a sequence of acoustic frames as input to an automatic speech recognition (ASR) model. The method also includes generating, by a first encoder, a first higher order feature representation for a corresponding acoustic frame. The method also includes generating, by a second encoder, a second higher order feature representation for a corresponding first higher order feature representation. The method also includes generating, by a language identification (ID) predictor, a language prediction representation based on a concatenation of the first higher order feature representation and the second higher order feature representation. The method also includes generating, by a first decoder, a first probability distribution over possible speech recognition hypotheses based on a concatenation of the second higher order feature representation and the language prediction representation.
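
    The dataflow can be sketched as two cascaded encoders whose outputs are concatenated for language identification, with the language prediction then concatenated into the decoder input. In the minimal PyTorch sketch below, single linear layers stand in for the real encoder and decoder stacks:

    ```python
    import torch
    import torch.nn as nn

    class MultilingualASRWithLangID(nn.Module):
        def __init__(self, feat_dim=80, enc_dim=512, num_langs=12, vocab=4096):
            super().__init__()
            self.encoder1 = nn.Linear(feat_dim, enc_dim)  # first encoder
            self.encoder2 = nn.Linear(enc_dim, enc_dim)   # second encoder
            self.lang_id = nn.Linear(2 * enc_dim, num_langs)
            self.decoder = nn.Linear(enc_dim + num_langs, vocab)

        def forward(self, frames):
            h1 = torch.relu(self.encoder1(frames))  # 1st higher order features
            h2 = torch.relu(self.encoder2(h1))      # 2nd higher order features
            lang = torch.softmax(self.lang_id(torch.cat([h1, h2], -1)), -1)
            return torch.log_softmax(self.decoder(torch.cat([h2, lang], -1)), -1)
    ```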

    Fusion of acoustic and text representations in RNN-T

    Publication No.: US12211509B2

    Publication Date: 2025-01-28

    Application No.: US17821160

    Filing Date: 2022-08-19

    Applicant: Google LLC

    Abstract: A speech recognition model includes an encoder network, a prediction network, and a joint network. The encoder network is configured to receive a sequence of acoustic frames characterizing an input utterance; and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The prediction network is configured to: receive a sequence of non-blank symbols output by a final Softmax layer; and generate, at each of the plurality of output steps, a dense representation. The joint network is configured to generate, at each of the plurality of output steps based on the higher order feature representation and the dense representation, a probability distribution over possible speech recognition hypotheses. The joint network includes a stack of gating and bilinear pooling to fuse the dense representation and the higher order feature representation.
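
    One common realization of such a fusion stack is to gate each representation on the other and then combine them with low-rank factorized bilinear pooling. The PyTorch sketch below assumes that low-rank factorization; the patent's exact gating and pooling layout may differ:

    ```python
    import torch
    import torch.nn as nn

    class GatedBilinearJoint(nn.Module):
        def __init__(self, enc_dim=512, pred_dim=640, rank=256, vocab=4096):
            super().__init__()
            self.gate_enc = nn.Linear(enc_dim + pred_dim, enc_dim)
            self.gate_pred = nn.Linear(enc_dim + pred_dim, pred_dim)
            self.u = nn.Linear(enc_dim, rank)   # bilinear factor, acoustic side
            self.v = nn.Linear(pred_dim, rank)  # bilinear factor, text side
            self.out = nn.Linear(rank, vocab)

        def forward(self, enc, pred):
            both = torch.cat([enc, pred], dim=-1)
            enc = enc * torch.sigmoid(self.gate_enc(both))     # gate acoustics
            pred = pred * torch.sigmoid(self.gate_pred(both))  # gate text
            fused = self.u(enc) * self.v(pred)  # low-rank bilinear pooling
            return torch.log_softmax(self.out(fused), dim=-1)
    ```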

    Text Injection For Training Auxiliary Tasks In Speech Recognition Models

    Publication No.: US20240296840A1

    Publication Date: 2024-09-05

    Application No.: US18592590

    Filing Date: 2024-03-01

    Applicant: Google LLC

    CPC classification number: G10L15/197 G10L15/02 G10L15/063

    Abstract: A joint auxiliary task and ASR model includes an encoder to receive a sequence of acoustic frames and generate, at each of a plurality of output steps, a higher-order feature representation for a corresponding acoustic frame. The model also includes a multi-output hybrid autoregressive transducer (HAT) decoder to generate, at each of the plurality of output steps: a probability distribution over possible speech recognition hypotheses; and an indication of whether the output step corresponds to an auxiliary token associated with a particular auxiliary task. The model is trained by a joint end-to-end and internal language model training (JEIT) process based on: a paired training data set including paired audio data and transcriptions, the transcriptions annotated with ground-truth auxiliary tokens associated with the particular auxiliary task; and an unpaired training data set including textual utterances not paired with any corresponding audio data, the textual utterances annotated with the ground-truth auxiliary tokens associated with the particular auxiliary task.
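
    The training process can be pictured as a weighted sum of a transducer loss on the paired set and a text-only loss through the decoder's internal language model on the unpaired set. A minimal sketch, where `transducer_loss`, `internal_lm_loss`, and the mixing weight are hypothetical stand-ins:

    ```python
    def jeit_loss(model, paired_batch, unpaired_batch, text_weight=0.3):
        # Paired branch: transducer loss on (audio, annotated transcript).
        audio, transcript = paired_batch  # transcript carries auxiliary tokens
        asr_loss = model.transducer_loss(audio, transcript)
        # Unpaired branch: text-only loss via the decoder's internal LM, on
        # textual utterances annotated with the same auxiliary tokens.
        ilm_loss = model.internal_lm_loss(unpaired_batch)
        return asr_loss + text_weight * ilm_loss
    ```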
