Unified End-To-End Speech Recognition And Endpointing Using A Switch Connection

    Publication Number: US20240029719A1

    Publication Date: 2024-01-25

    Application Number: US18340093

    Filing Date: 2023-06-23

    Applicant: Google LLC

    CPC classification number: G10L15/16 G10L15/063 G10L25/93

    Abstract: A single E2E multitask model includes a speech recognition model and an endpointer model. The speech recognition model includes an audio encoder configured to encode a sequence of audio frames into corresponding higher-order feature representations, and a decoder configured to generate probability distributions over possible speech recognition hypotheses for the sequence of audio frames based on the higher-order feature representations. The endpointer model is configured to operate in either a voice activity detection (VAD) mode or an end-of-query (EOQ) detection mode. During the VAD mode, the endpointer model receives input audio frames and determines, for each input audio frame, whether the input audio frame includes speech. During the EOQ detection mode, the endpointer model receives latent representations for the sequence of audio frames output from the audio encoder and determines, for each latent representation, whether the latent representation includes final silence.
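    The sketch below shows one way such a unified model could be wired, assuming PyTorch; the class name UnifiedASREndpointer, the LSTM encoder, and all dimensions are illustrative assumptions rather than the patent's actual implementation. The switch simply selects whether the endpointer head consumes raw audio frames (VAD mode) or the encoder's latent representations (EOQ mode).

import torch
import torch.nn as nn

class UnifiedASREndpointer(nn.Module):
    """Illustrative multitask model: one audio encoder feeds both an ASR
    decoder and an endpointer that switches between VAD and EOQ modes."""

    def __init__(self, feat_dim=80, enc_dim=256, vocab_size=1024):
        super().__init__()
        # Shared audio encoder: audio frames -> higher-order feature representations.
        self.encoder = nn.LSTM(feat_dim, enc_dim, num_layers=2, batch_first=True)
        # ASR decoder head: higher-order features -> per-frame hypothesis scores.
        self.decoder = nn.Linear(enc_dim, vocab_size)
        # Endpointer heads, one per mode.
        self.vad_head = nn.Linear(feat_dim, 1)  # VAD: raw frames -> speech present?
        self.eoq_head = nn.Linear(enc_dim, 1)   # EOQ: encoder latents -> final silence?

    def forward(self, frames, mode="vad"):
        # frames: (batch, time, feat_dim)
        latents, _ = self.encoder(frames)
        asr_logits = self.decoder(latents)
        if mode == "vad":
            # VAD mode: the switch routes raw input frames to the endpointer.
            endpoint = torch.sigmoid(self.vad_head(frames))
        else:
            # EOQ mode: the switch routes the encoder's latent representations.
            endpoint = torch.sigmoid(self.eoq_head(latents))
        return asr_logits, endpoint

model = UnifiedASREndpointer()
audio = torch.randn(1, 50, 80)            # 50 frames of 80-dim features (assumed)
logits, vad = model(audio, mode="vad")    # per-frame speech probabilities
_, eoq = model(audio, mode="eoq")         # per-frame final-silence probabilities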

    Streaming End-to-end Multilingual Speech Recognition with Joint Language Identification

    Publication Number: US20230306958A1

    Publication Date: 2023-09-28

    Application Number: US18188632

    Filing Date: 2023-03-23

    Applicant: Google LLC

    CPC classification number: G10L15/005 G10L15/16 G10L15/063

    Abstract: A method includes receiving a sequence of acoustic frames as input to an automatic speech recognition (ASR) model. The method also includes generating, by a first encoder, a first higher order feature representation for a corresponding acoustic frame. The method also includes generating, by a second encoder, a second higher order feature representation for a corresponding first higher order feature representation. The method also includes generating, by a language identification (ID) predictor, a language prediction representation based on a concatenation of the first higher order feature representation and the second higher order feature representation. The method also includes generating, by a first decoder, a first probability distribution over possible speech recognition hypotheses based on a concatenation of the second higher order feature representation and the language prediction representation.
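    A minimal sketch of the cascaded-encoder arrangement described above, assuming PyTorch; the class StreamingMultilingualASR, the LSTM encoders, and the layer sizes are assumptions for illustration, not the claimed implementation. The language ID predictor consumes the concatenation of both encoders' outputs, and the decoder consumes the second encoder output concatenated with the language prediction representation.

import torch
import torch.nn as nn

class StreamingMultilingualASR(nn.Module):
    """Illustrative cascade: two encoders, a language-ID predictor over the
    concatenated encoder outputs, and a decoder conditioned on both the
    second encoder output and the language prediction representation."""

    def __init__(self, feat_dim=80, enc_dim=256, num_langs=12, vocab_size=4096):
        super().__init__()
        self.encoder1 = nn.LSTM(feat_dim, enc_dim, batch_first=True)  # first encoder
        self.encoder2 = nn.LSTM(enc_dim, enc_dim, batch_first=True)   # second encoder
        # Language ID predictor over [first ; second] higher-order features.
        self.lid_predictor = nn.Linear(2 * enc_dim, num_langs)
        # First decoder over [second encoder output ; language prediction].
        self.decoder = nn.Linear(enc_dim + num_langs, vocab_size)

    def forward(self, frames):
        h1, _ = self.encoder1(frames)                  # first higher-order features
        h2, _ = self.encoder2(h1)                      # second higher-order features
        lid = self.lid_predictor(torch.cat([h1, h2], dim=-1)).softmax(dim=-1)
        logits = self.decoder(torch.cat([h2, lid], dim=-1)).log_softmax(dim=-1)
        return logits, lid                             # hypotheses + language posteriors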

    Fusion of acoustic and text representations in RNN-T

    Publication Number: US12211509B2

    Publication Date: 2025-01-28

    Application Number: US17821160

    Filing Date: 2022-08-19

    Applicant: Google LLC

    Abstract: A speech recognition model includes an encoder network, a prediction network, and a joint network. The encoder network is configured to receive a sequence of acoustic frames characterizing an input utterance; and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The prediction network is configured to: receive a sequence of non-blank symbols output by a final Softmax layer; and generate, at each of the plurality of output steps, a dense representation. The joint network is configured to generate, at each of the plurality of output steps based on the higher order feature representation and the dense representation, a probability distribution over possible speech recognition hypotheses. The joint network includes a stack of gating and bilinear pooling layers to fuse the dense representation and the higher order feature representation.
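    A hedged sketch of a joint network that fuses the acoustic and text representations with gating followed by low-rank bilinear pooling, assuming PyTorch; the class GatedBilinearJoint, the projection layers, and the dimensions are illustrative assumptions, and the exact stacking in the patent may differ.

import torch
import torch.nn as nn

class GatedBilinearJoint(nn.Module):
    """Illustrative RNN-T joint network fusing the encoder's acoustic
    representation with the prediction network's dense text representation
    via a gating layer followed by low-rank bilinear pooling."""

    def __init__(self, enc_dim=512, pred_dim=640, joint_dim=640, vocab_size=4096):
        super().__init__()
        self.pred_proj = nn.Linear(pred_dim, enc_dim)   # align text dims to acoustic
        self.gate = nn.Linear(2 * enc_dim, enc_dim)     # per-dimension mixing gate
        # Low-rank bilinear pooling: elementwise product of two projections.
        self.bilinear_a = nn.Linear(enc_dim, joint_dim)
        self.bilinear_b = nn.Linear(enc_dim, joint_dim)
        self.output = nn.Linear(joint_dim, vocab_size + 1)  # + blank symbol

    def forward(self, h_enc, h_pred):
        # h_enc: (..., enc_dim) acoustic; h_pred: (..., pred_dim) dense text.
        h_txt = self.pred_proj(h_pred)
        g = torch.sigmoid(self.gate(torch.cat([h_enc, h_txt], dim=-1)))
        gated = g * h_enc + (1.0 - g) * h_txt           # gated fusion of the streams
        pooled = torch.tanh(self.bilinear_a(gated)) * torch.tanh(self.bilinear_b(h_enc))
        return self.output(pooled)                      # joint logits per output step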

    Universal Monolingual Output Layer for Multilingual Speech Recognition

    Publication Number: US20240135923A1

    Publication Date: 2024-04-25

    Application Number: US18485271

    Filing Date: 2023-10-11

    Applicant: Google LLC

    CPC classification number: G10L15/197 G10L15/005 G10L15/02

    Abstract: A method includes receiving a sequence of acoustic frames as input to a multilingual automated speech recognition (ASR) model configured to recognize speech in a plurality of different supported languages, and generating, by an audio encoder of the multilingual ASR model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by a language identification (LID) predictor of the multilingual ASR model, a language prediction representation for a corresponding higher order feature representation. The method also includes generating, by a decoder of the multilingual ASR model, a probability distribution over possible speech recognition results based on the corresponding higher order feature representation, a sequence of non-blank symbols, and a corresponding language prediction representation. The decoder includes a monolingual output layer having a plurality of output nodes, each sharing a plurality of language-specific wordpiece models.
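    The sketch below shows one plausible reading of a shared, monolingual-sized output layer whose node indices are mapped to language-specific wordpieces by the predicted language, assuming PyTorch; the class UniversalMonolingualOutput and the wordpiece lookup tables are hypothetical stand-ins, not the patent's data structures.

import torch
import torch.nn as nn

class UniversalMonolingualOutput(nn.Module):
    """Illustrative decoder output layer of monolingual size: each output node
    is shared across languages, and a per-language wordpiece table maps the
    winning node index to that language's wordpiece at decode time."""

    def __init__(self, hidden_dim=640, mono_vocab=4096, langs=("en", "es", "fr")):
        super().__init__()
        self.output = nn.Linear(hidden_dim, mono_vocab)   # single shared output layer
        # Hypothetical lookup tables: node index -> language-specific wordpiece.
        self.wordpieces = {
            lang: [f"{lang}_wp_{i}" for i in range(mono_vocab)] for lang in langs
        }
        self.langs = list(langs)

    def step(self, decoder_state, lang_prediction):
        # Single unbatched decode step.
        # decoder_state: (hidden_dim,); lang_prediction: (num_langs,) posterior.
        logits = self.output(decoder_state)               # scores over shared nodes
        lang = self.langs[int(lang_prediction.argmax())]  # predicted language
        node = int(logits.argmax())                       # greedy node choice
        return logits, self.wordpieces[lang][node]        # node -> wordpiece for lang

layer = UniversalMonolingualOutput()
state = torch.randn(640)                                  # decoder state (assumed size)
lid = torch.tensor([0.1, 0.7, 0.2])                       # LID predictor output
_, wordpiece = layer.step(state, lid)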
