Patent search ap:("Google LLC") AND inv:"Han Lu" Page 1

1.

发明申请
Reducing Streaming ASR Model Delay With Self Alignment 有权

公开(公告)号：US20240371379A1

公开(公告)日：2024-11-07

申请号：US18775561

申请日：2024-07-17

Applicant: Google LLC

Inventor： Jaeyoung Kim , Han Lu , Anshuman Tripathi , Qian Zhang , Hasim Sak

IPC: G10L15/26 , G10L15/16

Abstract: A streaming speech recognition model includes an audio encoder configured to receive a sequence of acoustic frames and generate a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The streaming speech recognition model also includes a label encoder configured to receive a sequence of non-blank symbols output by a final softmax layer and generate a dense representation. The streaming speech recognition model also includes a joint network configured to receive the higher order feature representation generated by the audio encoder and the dense representation generated by the label encoder and generate a probability distribution over possible speech recognition hypotheses. Here, the streaming speech recognition model is trained using self-alignment to reduce prediction delay by encouraging an alignment path that is one frame left from a reference forced-alignment frame.

2.

发明公开
Semi-Supervised Training Scheme For Speech Recognition 审中-公开

公开(公告)号：US20240203406A1

公开(公告)日：2024-06-20

申请号：US18065685

申请日：2022-12-14

Applicant: Google LLC

Inventor： Soheil Khorram , Anshuman Tripathi , Kim Jaeyoung , Han Lu , Qian Zhang , Hasim Sak

IPC: G10L15/183 , G10L15/06 , G10L15/22

CPC classification number: G10L15/183 , G10L15/063 , G10L15/22

Abstract: A method includes receiving a sequence of acoustic frames extracted from unlabeled audio samples that correspond to spoken utterances not paired with any corresponding transcriptions. The method also includes generating, using a supervised audio encoder, a target higher order feature representation for a corresponding acoustic frame. The method also includes augmenting the sequence of acoustic frames and generating, as output form an unsupervised audio encoder, a predicted higher order feature representation for a corresponding augmented acoustic frame in the sequence of augmented acoustic frames. The method also includes determining an unsupervised loss term based on the target higher order feature representation and the predicted higher order feature representation and updating parameters of the speech recognition model based on the unsupervised loss term.

3.

发明公开
CLUSTERING AND MINING ACCENTED SPEECH FOR INCLUSIVE AND FAIR SPEECH RECOGNITION 审中-公开

公开(公告)号：US20240290322A1

公开(公告)日：2024-08-29

申请号：US18587860

申请日：2024-02-26

Applicant: Google LLC

Inventor： JAEYOUNG Kim , Han Lu , Soheil Khorram , Anshuman Tripathi , Qian Zhang , Hasim Sak

IPC: G10L15/06

CPC classification number: G10L15/063

Abstract: A method of training an accent recognition model includes receiving a corpus of training utterances spoken across various accents, each training utterance in the corpus including training audio features characterizing the training utterance, and executing a training process to train the accent recognition model on the corpus of training utterances to teach the accent recognition model to learn how to predict accent representations from the training audio features. The accent recognition model includes one or more strided convolution layers, a stack of multi-headed attention layers, and a pooling layer configured to generate a corresponding accent representation.

4.

发明授权
Transformer transducer: one model unifying streaming and non-streaming speech recognition 有权

公开(公告)号：US11741947B2

公开(公告)日：2023-08-29

申请号：US17210465

申请日：2021-03-23

Applicant: Google LLC

Inventor： Anshuman Tripathi , Hasim Sak , Han Lu , Qian Zhang , Jaeyoung Kim

IPC: G10L15/16 , G06N3/04 , G06N3/088 , G10L15/06 , G10L15/197 , G10L15/22 , G10L15/30

CPC classification number: G10L15/16 , G06N3/04 , G06N3/088 , G10L15/063 , G10L15/197 , G10L15/22 , G10L15/30

Abstract: A transformer-transducer model for unifying streaming and non-streaming speech recognition includes an audio encoder, a label encoder, and a joint network. The audio encoder receives a sequence of acoustic frames, and generates, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame. The label encoder receives a sequence of non-blank symbols output by a final softmax layer, and generates, at each of the plurality of time steps, a dense representation. The joint network receives the higher order feature representation and the dense representation at each of the plurality of time steps, and generates a probability distribution over possible speech recognition hypothesis. The audio encoder of the model further includes a neural network having an initial stack of transformer layers trained with zero look ahead audio context, and a final stack of transformer layers trained with a variable look ahead audio context.

5.

发明申请
Reducing Streaming ASR Model Delay With Self Alignment 有权

公开(公告)号：US20220310097A1

公开(公告)日：2022-09-29

申请号：US17644377

申请日：2021-12-15

Applicant: Google LLC

Inventor： Jaeyoung Kim , Han Lu , Anshuman Tripathi , Qian Zhang , Hasim Sak

IPC: G10L15/26 , G10L15/16

Abstract: A streaming speech recognition model includes an audio encoder configured to receive a sequence of acoustic frames and generate a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The streaming speech recognition model also includes a label encoder configured to receive a sequence of non-blank symbols output by a final softmax layer and generate a dense representation. The streaming speech recognition model also includes a joint network configured to receive the higher order feature representation generated by the audio encoder and the dense representation generated by the label encoder and generate a probability distribution over possible speech recognition hypotheses. Here, the streaming speech recognition model is trained using self-alignment to reduce prediction delay by encouraging an alignment path that is one frame left from a reference forced-alignment frame.

6.

发明申请
End-To-End Multi-Talker Overlapping Speech Recognition 有权

公开(公告)号：US20210343273A1

公开(公告)日：2021-11-04

申请号：US16865075

申请日：2020-05-01

Applicant: Google LLC

Inventor： Anshuman Tripathi , Han Lu , Hasim Sak

IPC: G10L15/06 , G10L15/16 , G10L15/04 , G06N3/08 , G06N20/00

Abstract: A method for training a speech recognition model with a loss function includes receiving an audio signal including a first segment corresponding to audio spoken by a first speaker, a second segment corresponding to audio spoken by a second speaker, and an overlapping region where the first segment overlaps the second segment. The overlapping region includes a known start time and a known end time. The method also includes generating a respective masked audio embedding for each of the first and second speakers. The method also includes applying a masking loss after the known end time to the respective masked audio embedding for the first speaker when the first speaker was speaking prior to the known start time, or applying the masking loss prior to the known start time when the first speaker was speaking after the known end time.

7.

发明公开
EVALUATION-BASED SPEAKER CHANGE DETECTION EVALUATION METRICS 审中-公开

公开(公告)号：US20240135934A1

公开(公告)日：2024-04-25

申请号：US18483492

申请日：2023-10-09

Applicant: Google LLC

Inventor： Guanlong Zhao , Quan Wang , Han Lu , Yiling Huang , Jason Pelecanos

IPC: G10L17/06 , G10L17/02 , G10L17/04

CPC classification number: G10L17/06 , G10L17/02 , G10L17/04

Abstract: A method includes obtaining a multi-utterance training sample that includes audio data characterizing utterances spoken by two or more different speakers and obtaining ground-truth speaker change intervals indicating time intervals in the audio data where speaker changes among the two or more different speakers occur. The method also includes processing the audio data to generate a sequence of predicted speaker change tokens using a sequence transduction model. For each corresponding predicted speaker change token, the method includes labeling the corresponding predicted speaker change token as correct when the predicted speaker change token overlaps with one of the ground-truth speaker change intervals. The method also includes determining a precision metric of the sequence transduction model based on a number of the predicted speaker change tokens labeled as correct and a total number of the predicted speaker change tokens in the sequence of predicted speaker change tokens.

8.

发明授权
Contrastive Siamese network for semi-supervised speech recognition 有权

公开(公告)号：US11961515B2

公开(公告)日：2024-04-16

申请号：US17644337

申请日：2021-12-14

Applicant: Google LLC

Inventor： Jaeyoung Kim , Soheil Khorram , Hasim Sak , Anshuman Tripathi , Han Lu , Qian Zhang

IPC: G10L15/16 , G06N3/088 , G10L15/18

CPC classification number: G10L15/16 , G06N3/088 , G10L15/1815

Abstract: A method includes receiving a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions. At a target branch of a contrastive Siamese network, the method also includes generating a sequence of encoder outputs for the plurality of unlabeled audio samples and modifying time characteristics of the encoder outputs to generate a sequence of target branch outputs. At an augmentation branch of a contrastive Siamese network, the method also includes performing augmentation on the unlabeled audio samples, generating a sequence of augmented encoder outputs for the augmented unlabeled audio samples, and generating predictions of the sequence of target branch outputs generated at the target branch. The method also includes determining an unsupervised loss term based on target branch outputs and predictions of the sequence of target branch outputs. The method also includes updating parameters of the audio encoder based on the unsupervised loss term.

9.

发明公开
ONE MODEL UNIFYING STREAMING AND NON-STREAMING SPEECH RECOGNITION 审中-公开

公开(公告)号：US20230368779A1

公开(公告)日：2023-11-16

申请号：US18357225

申请日：2023-07-24

Applicant: Google LLC

Inventor： Anshuman Tripathi , Hasim Sak , Han Lu , Qian Zhang , Jaeyoung Kim

IPC: G10L15/16 , G06N3/088 , G10L15/06 , G10L15/22 , G10L15/30 , G06N3/04 , G10L15/197

CPC classification number: G10L15/16 , G06N3/088 , G10L15/063 , G10L15/22 , G10L15/30 , G06N3/04 , G10L15/197

Abstract: A transformer-transducer model for unifying streaming and non-streaming speech recognition includes an audio encoder, a label encoder, and a joint network. The audio encoder receives a sequence of acoustic frames, and generates, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame. The label encoder receives a sequence of non-blank symbols output by a final softmax layer, and generates, at each of the plurality of time steps, a dense representation. The joint network receives the higher order feature representation and the dense representation at each of the plurality of time steps, and generates a probability distribution over possible speech recognition hypothesis. The audio encoder of the model further includes a neural network having an initial stack of transformer layers trained with zero look ahead audio context, and a final stack of transformer layers trained with a variable look ahead audio context.

10.

发明授权
End-to-end multi-talker overlapping speech recognition 有权

公开(公告)号：US11521595B2

公开(公告)日：2022-12-06

申请号：US16865075

申请日：2020-05-01

Applicant: Google LLC

Inventor： Anshuman Tripathi , Han Lu , Hasim Sak

IPC: G10L15/06 , G06N20/00 , G06N3/08 , G10L15/04 , G10L15/16

Abstract: A method for training a speech recognition model with a loss function includes receiving an audio signal including a first segment corresponding to audio spoken by a first speaker, a second segment corresponding to audio spoken by a second speaker, and an overlapping region where the first segment overlaps the second segment. The overlapping region includes a known start time and a known end time. The method also includes generating a respective masked audio embedding for each of the first and second speakers. The method also includes applying a masking loss after the known end time to the respective masked audio embedding for the first speaker when the first speaker was speaking prior to the known start time, or applying the masking loss prior to the known start time when the first speaker was speaking after the known end time.

Search Results

Country/Region

Patent validity

Application date

Publication (announcement) day

applicant

The country/region where the applicant is located

Inventor

IPC

IPC Department

IPC class

IPC subclass

IPC group

IPC team

Appearance classification