Unified End-To-End Speech Recognition And Endpointing Using A Switch Connection

    Publication No.: US20240029719A1

    Publication Date: 2024-01-25

    Application No.: US18340093

    Filing Date: 2023-06-23

    Applicant: Google LLC

    CPC classification number: G10L15/16 G10L15/063 G10L25/93

    Abstract: A single E2E multitask model includes a speech recognition model and an endpointer model. The speech recognition model includes an audio encoder configured to encode a sequence of audio frames into corresponding higher-order feature representations, and a decoder configured to generate probability distributions over possible speech recognition hypotheses for the sequence of audio frames based on the higher-order feature representations. The endpointer model is configured to switch between a voice activity detection (VAD) mode and an end-of-query (EOQ) detection mode. During the VAD mode, the endpointer model receives input audio frames and determines, for each input audio frame, whether the input audio frame includes speech. During the EOQ detection mode, the endpointer model receives latent representations for the sequence of audio frames output from the audio encoder and determines, for each latent representation, whether the latent representation includes final silence.
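
    The switch connection can be pictured as a small module whose input source is selected by the active mode. Below is a minimal PyTorch sketch; the class name, dimensions, and single-layer classifier are illustrative assumptions, not the patent's implementation.

    ```python
    import torch
    import torch.nn as nn

    class SwitchedEndpointer(nn.Module):
        """Endpointer fed via a switch: raw frames (VAD) or encoder latents (EOQ)."""

        def __init__(self, frame_dim=80, latent_dim=512, hidden_dim=128):
            super().__init__()
            self.frame_proj = nn.Linear(frame_dim, hidden_dim)    # VAD-mode input
            self.latent_proj = nn.Linear(latent_dim, hidden_dim)  # EOQ-mode input
            self.classifier = nn.Linear(hidden_dim, 1)            # per-step score

        def forward(self, frames, encoder_latents, mode="vad"):
            # The "switch": select which representation feeds the endpointer.
            if mode == "vad":
                x = self.frame_proj(frames)            # detect speech vs. non-speech
            else:
                x = self.latent_proj(encoder_latents)  # detect final silence
            return torch.sigmoid(self.classifier(x))
    ```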

    Unified Endpointer Using Multitask and Multidomain Learning

    Publication No.: US20210142174A1

    Publication Date: 2021-05-13

    Application No.: US17152918

    Filing Date: 2021-01-20

    Applicant: Google LLC

    Abstract: A method for training an endpointer model includes obtaining short-form speech utterances and long-form speech utterances. The method also includes providing a short-form speech utterance as input to a shared neural network, the shared neural network configured to learn shared hidden representations suitable for both voice activity detection (VAD) and end-of-query (EOQ) detection. The method also includes generating, using a VAD classifier, a sequence of predicted VAD labels and determining a VAD loss by comparing the sequence of predicted VAD labels to a corresponding sequence of reference VAD labels. The method also includes generating, using an EOQ classifier, a sequence of predicted EOQ labels and determining an EOQ loss by comparing the sequence of predicted EOQ labels to a corresponding sequence of reference EOQ labels. The method also includes training, using a cross-entropy criterion, the endpointer model based on the VAD loss and the EOQ loss.
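
    The objective reduces to two per-frame cross-entropy terms computed over the shared hidden representations. A minimal sketch follows, assuming simple classifier heads and a tunable weighting; the shapes and weighting scheme are illustrative, not the patent's specifics.

    ```python
    import torch.nn.functional as F

    def endpointer_loss(shared_hidden, vad_head, eoq_head,
                        vad_labels, eoq_labels, eoq_weight=1.0):
        """Cross-entropy over per-frame VAD and EOQ label sequences.

        shared_hidden: (batch, time, dim) output of the shared network.
        *_labels: (batch, time) integer reference labels.
        """
        vad_logits = vad_head(shared_hidden)  # (batch, time, num_vad_classes)
        eoq_logits = eoq_head(shared_hidden)  # (batch, time, num_eoq_classes)
        vad_loss = F.cross_entropy(vad_logits.flatten(0, 1), vad_labels.flatten())
        eoq_loss = F.cross_entropy(eoq_logits.flatten(0, 1), eoq_labels.flatten())
        return vad_loss + eoq_weight * eoq_loss
    ```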

    Semantic Segmentation With Language Models For Long-Form Automatic Speech Recognition

    Publication No.: US20240290320A1

    Publication Date: 2024-08-29

    Application No.: US18585020

    Filing Date: 2024-02-22

    Applicant: Google LLC

    CPC classification number: G10L15/063 G06F40/30 G10L15/26

    Abstract: A joint segmenting and ASR model includes an encoder to receive a sequence of acoustic frames and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame. The model also includes a decoder to generate, at each of the plurality of output steps and based on the higher order feature representation: a probability distribution over possible speech recognition hypotheses; and an indication of whether the corresponding output step corresponds to an end of segment (EOS). The model is trained on a set of training samples, each training sample including audio data characterizing multiple segments of long-form speech and a corresponding transcription of the long-form speech, the corresponding transcription annotated with ground-truth EOS labels obtained via distillation from a language model teacher that receives the corresponding transcription as input and injects the ground-truth EOS labels into the corresponding transcription between semantically complete segments.
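
    The annotation step amounts to splicing EOS tokens into the transcription at the boundaries the language model teacher identifies. A minimal sketch, assuming the teacher's output has been reduced to a list of word-index boundaries (a hypothetical interface):

    ```python
    EOS = "<eos>"

    def inject_eos_labels(transcript: str, boundaries: list[int]) -> str:
        """Insert ground-truth EOS tokens after semantically complete segments.

        `boundaries` holds word indices after which the teacher judged a
        segment to be semantically complete.
        """
        words = transcript.split()
        out = []
        for i, word in enumerate(words):
            out.append(word)
            if i in boundaries:
                out.append(EOS)
        return " ".join(out)

    # Teacher marked boundaries after words 2 and 5:
    print(inject_eos_labels("turn it off play some jazz", [2, 5]))
    # -> "turn it off <eos> play some jazz <eos>"
    ```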

    Joint Endpointing And Automatic Speech Recognition

    Publication No.: US20200335091A1

    Publication Date: 2020-10-22

    Application No.: US16809403

    Filing Date: 2020-03-04

    Applicant: Google LLC

    Abstract: A method includes receiving audio data of an utterance and processing the audio data to obtain, as output from a speech recognition model configured to jointly perform speech decoding and endpointing of utterances: partial speech recognition results for the utterance; and an endpoint indication indicating when the utterance has ended. While processing the audio data, the method also includes detecting, based on the endpoint indication, the end of the utterance. In response to detecting the end of the utterance, the method also includes terminating the processing of any subsequent audio data received after the end of the utterance was detected.
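
    The decode loop this implies is simple: consume frames, keep the latest partial result, and stop as soon as the endpoint indication fires. A minimal sketch, where the `model.step` interface is a hypothetical stand-in for the joint ASR/endpointing model:

    ```python
    def recognize(model, audio_stream):
        """Stream frames until the model emits an endpoint indication."""
        state = model.initial_state()
        hypothesis = []
        for frame in audio_stream:
            partial_result, endpoint, state = model.step(frame, state)
            hypothesis = partial_result  # latest partial recognition result
            if endpoint:                 # end of utterance detected
                break                    # terminate; ignore subsequent audio
        return hypothesis
    ```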

    CONNECTING DIFFERENT ASR APPLICATION DOMAINS WITH SPEAKER-TAGS

    Publication No.: US20240304181A1

    Publication Date: 2024-09-12

    Application No.: US18598523

    Filing Date: 2024-03-07

    Applicant: Google LLC

    CPC classification number: G10L15/063

    Abstract: A method includes receiving a plurality of training samples spanning multiple different domains. Each corresponding training sample includes audio data characterizing an utterance paired with a corresponding transcription of the utterance. The method also includes re-labeling each corresponding training sample of the plurality of training samples by annotating the corresponding transcription of the utterance with one or more speaker tags. Each speaker tag indicates a respective segment of the transcription for speech that was spoken by a particular type of speaker. The method also includes training a multi-domain speech recognition model on the re-labeled training samples to teach the multi-domain speech recognition model to learn to share parameters for recognizing speech across each of the multiple different domains.
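
    The re-labeling step can be illustrated as plain transcript annotation. A minimal sketch, assuming pre-segmented transcripts and a two-speaker tag vocabulary (both illustrative assumptions):

    ```python
    def relabel(segments: list[tuple[str, str]]) -> str:
        """Annotate each transcript segment with the type of speaker who spoke it.

        `segments` pairs a speaker type (e.g. "user", "agent") with its text.
        """
        tags = {"user": "<spk:user>", "agent": "<spk:agent>"}
        return " ".join(f"{tags[speaker]} {text}" for speaker, text in segments)

    print(relabel([("user", "what's the weather"), ("agent", "it is sunny")]))
    # -> "<spk:user> what's the weather <spk:agent> it is sunny"
    ```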

    CONTEXT-AWARE END-TO-END ASR FUSION OF CONTEXT, ACOUSTIC AND TEXT REPRESENTATIONS

    Publication No.: US20240185844A1

    Publication Date: 2024-06-06

    Application No.: US18489970

    Filing Date: 2023-10-19

    Applicant: Google LLC

    Inventor: Shuo-yiin Chang

    CPC classification number: G10L15/18 G10L15/183

    Abstract: A method includes receiving a sequence of acoustic frames characterizing an input utterance and generating, by an audio encoder of an automatic speech recognition (ASR) model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by a context encoder of the ASR model, a context embedding corresponding to one or more previous transcriptions output by the ASR model, and generating, by a prediction network of the ASR model, a dense representation based on a sequence of non-blank symbols output by a final Softmax layer. The method also includes generating, by a joint network of the ASR model, a probability distribution over possible speech recognition hypotheses based on the context embedding generated by the context encoder, the higher order feature representation generated by the audio encoder, and the dense representation generated by the prediction network.
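
    One way to realize the three-way fusion is to concatenate the context embedding with the acoustic and text representations inside the joint network. A minimal PyTorch sketch; the dimensions and the concatenation-based fusion are assumptions, since the abstract does not fix the fusion mechanism:

    ```python
    import torch
    import torch.nn as nn

    class ContextAwareJoint(nn.Module):
        def __init__(self, audio_dim=512, ctx_dim=256, pred_dim=640, vocab=4096):
            super().__init__()
            self.proj = nn.Linear(audio_dim + ctx_dim + pred_dim, 640)
            self.out = nn.Linear(640, vocab)

        def forward(self, audio_rep, context_emb, dense_rep):
            # Fuse acoustic, context, and text (prediction-network) inputs.
            fused = torch.cat([audio_rep, context_emb, dense_rep], dim=-1)
            return torch.log_softmax(self.out(torch.tanh(self.proj(fused))), dim=-1)
    ```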

    Streaming End-to-end Multilingual Speech Recognition with Joint Language Identification

    Publication No.: US20230306958A1

    Publication Date: 2023-09-28

    Application No.: US18188632

    Filing Date: 2023-03-23

    Applicant: Google LLC

    CPC classification number: G10L15/005 G10L15/16 G10L15/063

    Abstract: A method includes receiving a sequence of acoustic frames as input to an automatic speech recognition (ASR) model. The method also includes generating, by a first encoder, a first higher order feature representation for a corresponding acoustic frame. The method also includes generating, by a second encoder, a second higher order feature representation for a corresponding first higher order feature representation. The method also includes generating, by a language identification (ID) predictor, a language prediction representation based on a concatenation of the first higher order feature representation and the second higher order feature representation. The method also includes generating, by a first decoder, a first probability distribution over possible speech recognition hypotheses based on a concatenation of the second higher order feature representation and the language prediction representation.
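
    The dataflow can be sketched as two cascaded encoders whose outputs are concatenated for language identification, with the language prediction then concatenated into the decoder input. In the minimal PyTorch sketch below, single linear layers stand in for the real encoder and decoder stacks:

    ```python
    import torch
    import torch.nn as nn

    class MultilingualASRWithLangID(nn.Module):
        def __init__(self, feat_dim=80, enc_dim=512, num_langs=12, vocab=4096):
            super().__init__()
            self.encoder1 = nn.Linear(feat_dim, enc_dim)  # first encoder
            self.encoder2 = nn.Linear(enc_dim, enc_dim)   # second encoder
            self.lang_id = nn.Linear(2 * enc_dim, num_langs)
            self.decoder = nn.Linear(enc_dim + num_langs, vocab)

        def forward(self, frames):
            h1 = torch.relu(self.encoder1(frames))  # 1st higher order features
            h2 = torch.relu(self.encoder2(h1))      # 2nd higher order features
            lang = torch.softmax(self.lang_id(torch.cat([h1, h2], -1)), -1)
            return torch.log_softmax(self.decoder(torch.cat([h2, lang], -1)), -1)
    ```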

    Fusion of acoustic and text representations in RNN-T

    Publication No.: US12211509B2

    Publication Date: 2025-01-28

    Application No.: US17821160

    Filing Date: 2022-08-19

    Applicant: Google LLC

    Abstract: A speech recognition model includes an encoder network, a prediction network, and a joint network. The encoder network is configured to receive a sequence of acoustic frames characterizing an input utterance; and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The prediction network is configured to: receive a sequence of non-blank symbols output by a final Softmax layer; and generate, at each of the plurality of output steps, a dense representation. The joint network is configured to generate, at each of the plurality of output steps based on the higher order feature representation and the dense representation, a probability distribution over possible speech recognition hypotheses. The joint network includes a stack of gating and bilinear pooling to fuse the dense representation and the higher order feature representation.
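
    One common realization of such a fusion stack is to gate each representation on the other and then combine them with low-rank factorized bilinear pooling. The PyTorch sketch below assumes that low-rank factorization; the patent's exact gating and pooling layout may differ:

    ```python
    import torch
    import torch.nn as nn

    class GatedBilinearJoint(nn.Module):
        def __init__(self, enc_dim=512, pred_dim=640, rank=256, vocab=4096):
            super().__init__()
            self.gate_enc = nn.Linear(enc_dim + pred_dim, enc_dim)
            self.gate_pred = nn.Linear(enc_dim + pred_dim, pred_dim)
            self.u = nn.Linear(enc_dim, rank)   # bilinear factor, acoustic side
            self.v = nn.Linear(pred_dim, rank)  # bilinear factor, text side
            self.out = nn.Linear(rank, vocab)

        def forward(self, enc, pred):
            both = torch.cat([enc, pred], dim=-1)
            enc = enc * torch.sigmoid(self.gate_enc(both))     # gate acoustics
            pred = pred * torch.sigmoid(self.gate_pred(both))  # gate text
            fused = self.u(enc) * self.v(pred)  # low-rank bilinear pooling
            return torch.log_softmax(self.out(fused), dim=-1)
    ```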

    Text Injection For Training Auxiliary Tasks In Speech Recognition Models

    Publication No.: US20240296840A1

    Publication Date: 2024-09-05

    Application No.: US18592590

    Filing Date: 2024-03-01

    Applicant: Google LLC

    CPC classification number: G10L15/197 G10L15/02 G10L15/063

    Abstract: A joint auxiliary task and ASR model includes an encoder to receive a sequence of acoustic frames and generate, at each of a plurality of output steps, a higher-order feature representation for a corresponding acoustic frame. The model also includes a multi-output hybrid autoregressive transducer (HAT) decoder to generate, at each of the plurality of output steps: a probability distribution over possible speech recognition hypotheses; and an indication of whether the output step corresponds to an auxiliary token associated with a particular auxiliary task. The model is trained by a joint end-to-end and internal language model training (JEIT) process based on: a paired training data set including paired audio data and transcriptions, the transcriptions annotated with ground-truth auxiliary tokens associated with the particular auxiliary task; and an unpaired training data set including textual utterances not paired with any corresponding audio data, the textual utterances annotated with the ground-truth auxiliary tokens associated with the particular auxiliary task.
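
    The training process can be pictured as a weighted sum of a transducer loss on the paired set and a text-only loss through the decoder's internal language model on the unpaired set. A minimal sketch, where `transducer_loss`, `internal_lm_loss`, and the mixing weight are hypothetical stand-ins:

    ```python
    def jeit_loss(model, paired_batch, unpaired_batch, text_weight=0.3):
        # Paired branch: transducer loss on (audio, annotated transcript).
        audio, transcript = paired_batch  # transcript carries auxiliary tokens
        asr_loss = model.transducer_loss(audio, transcript)
        # Unpaired branch: text-only loss via the decoder's internal LM, on
        # textual utterances annotated with the same auxiliary tokens.
        ilm_loss = model.internal_lm_loss(unpaired_batch)
        return asr_loss + text_weight * ilm_loss
    ```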
