INTERNAL LANGUAGE MODEL FOR E2E MODELS

    Publication No.: US20220139380A1

    Publication Date: 2022-05-05

    Application No.: US17154956

    Filing Date: 2021-01-21

    Abstract: A computer device is provided that includes one or more processors configured to receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source-domain, and receive an external language model that has been trained with training data from a target-domain. The one or more processors are configured to perform an inference of the probability of an output token sequence given a sequence of input speech features. Performing the inference includes computing an E2E model score, computing an external language model score, and computing an estimated internal language model score for the E2E model. The estimated internal language model score is computed by removing a contribution of an intrinsic acoustic model. The processor is further configured to compute an integrated score based at least on the E2E model score, the external language model score, and the estimated internal language model score.
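
    The integration step lends itself to a short illustration. Below is a minimal Python sketch of the scoring arithmetic, assuming per-sequence log-probabilities from the E2E model, the external LM, and the estimated internal LM are already available (one common way to estimate the internal LM score is to run the E2E decoder with the acoustic/encoder contribution removed); the function name and interpolation weights are illustrative, not from the disclosure.

        # Minimal sketch of internal-LM-aware score integration (assumed names).
        def integrated_score(e2e_logp: float, ext_lm_logp: float, ilm_logp: float,
                             lambda_ext: float = 0.6, lambda_ilm: float = 0.3) -> float:
            # log P_E2E(y|x) + lambda_ext * log P_ext(y) - lambda_ilm * log P_ILM(y):
            # subtracting the estimated internal LM score removes the source-domain
            # linguistic prior before the target-domain external LM is added.
            return e2e_logp + lambda_ext * ext_lm_logp - lambda_ilm * ilm_logp

        # Usage, e.g. rescoring a beam of candidate token sequences:
        # best = max(beam, key=lambda h: integrated_score(h.e2e, h.ext, h.ilm))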

    HYPOTHESIS STITCHER FOR SPEECH RECOGNITION OF LONG-FORM AUDIO

    Publication No.: US20240185859A1

    Publication Date: 2024-06-06

    Application No.: US18440912

    Filing Date: 2024-02-13

    CPC classification number: G10L17/02 G10L15/22 G10L15/26 G10L19/022 G10L21/0272

    Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting an audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific stitchers or multi-speaker stitchers, and may further support multiple options for differing hypothesis configurations.
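
    As a concrete illustration of the merge-and-stitch flow, here is a small Python sketch; the "<wc>" token spelling and the stitcher interface are assumptions for illustration, since the disclosure specifies only that a WC symbol is inserted and a network-based stitcher consolidates the result.

        WC = "<wc>"  # window-change symbol; spelling assumed for illustration

        def merge_hypotheses(segment_hypotheses):
            """Merge per-segment ASR hypotheses into one token stream, inserting
            the window-change (WC) symbol at every segment boundary."""
            merged = []
            for i, hypothesis in enumerate(segment_hypotheses):
                if i > 0:
                    merged.append(WC)
                merged.extend(hypothesis.split())
            return merged

        # The WC-delimited stream is then consolidated by the network-based
        # stitcher, e.g. final = stitcher.decode(merge_hypotheses(hypotheses)),
        # where `stitcher` is a hypothetical sequence-to-sequence model.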

    HYPOTHESIS STITCHER FOR SPEECH RECOGNITION OF LONG-FORM AUDIO

    Publication No.: US20220199091A1

    Publication Date: 2022-06-23

    Application No.: US17127938

    Filing Date: 2020-12-18

    Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting an audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific stitchers or multi-speaker stitchers, and may further support multiple options for differing hypothesis configurations.

    SPEAKER ADAPTATION FOR ATTENTION-BASED ENCODER-DECODER

    Publication No.: US20210065683A1

    Publication Date: 2021-03-04

    Application No.: US16675515

    Filing Date: 2019-11-06

    Abstract: Embodiments are associated with a speaker-independent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-independent attention-based encoder-decoder model associated with a first output distribution; a speaker-dependent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-dependent attention-based encoder-decoder model associated with a second output distribution; training of the speaker-dependent attention-based encoder-decoder model to classify output tokens based on input speech frames of a target speaker while simultaneously training the speaker-dependent attention-based encoder-decoder model to maintain a similarity between the first output distribution and the second output distribution; and performing automatic speech recognition on speech frames of the target speaker using the trained speaker-dependent attention-based encoder-decoder model.
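
    A common way to "maintain a similarity" between two output distributions is a KL-divergence regularizer, so a plausible reading of the training objective can be sketched in Python/PyTorch as follows; the loss form, the interpolation weight rho, and the decision to freeze the speaker-independent model are assumptions for illustration.

        import torch.nn.functional as F

        def adaptation_loss(sd_logits, si_logits, targets, rho=0.5):
            """Cross-entropy on the target speaker's frames plus a KL term that
            keeps the speaker-dependent (SD) output distribution close to the
            frozen speaker-independent (SI) one."""
            ce = F.cross_entropy(sd_logits, targets)
            kl = F.kl_div(F.log_softmax(sd_logits, dim=-1),
                          F.softmax(si_logits.detach(), dim=-1),
                          reduction="batchmean")
            return (1.0 - rho) * ce + rho * kl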

    TRAINING AND USING A TRANSCRIPT GENERATION MODEL ON A MULTI-SPEAKER AUDIO STREAM

    Publication No.: US20230215439A1

    Publication Date: 2023-07-06

    Application No.: US17566861

    Filing Date: 2021-12-31

    CPC classification number: G10L17/04 G10L15/06 G10L15/26

    Abstract: The disclosure herein describes using a transcript generation model for generating a transcript from a multi-speaker audio stream. Audio data including overlapping speech of a plurality of speakers is obtained, and a set of frame embeddings is generated from audio data frames of the obtained audio data using an audio data encoder. A set of words and channel change (CC) symbols is generated from the set of frame embeddings using a transcript generation model. The CC symbols are included between pairs of adjacent words that are spoken by different people at the same time. The set of words and CC symbols is transformed into a plurality of transcript lines, wherein words of the set of words are sorted into transcript lines based on the CC symbols, and a multi-speaker transcript is generated based on the plurality of transcript lines. The inclusion of CC symbols by the model enables efficient, accurate multi-speaker transcription.
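
    The transformation from a flat decoded stream into transcript lines can be illustrated with a short Python sketch; the "<cc>" spelling and the round-robin channel-switching rule are assumptions for illustration, since the abstract specifies only that CC symbols separate adjacent words spoken by different people.

        CC = "<cc>"  # channel-change symbol; spelling assumed for illustration

        def to_transcript_lines(tokens, num_channels=2):
            """Assign decoded words to channels, switching channel at each CC
            symbol, then render one transcript line per channel."""
            channels = [[] for _ in range(num_channels)]
            current = 0
            for token in tokens:
                if token == CC:
                    current = (current + 1) % num_channels
                else:
                    channels[current].append(token)
            return [" ".join(channel) for channel in channels]

        # to_transcript_lines(["hi", "<cc>", "hello", "<cc>", "there"])
        # -> ["hi there", "hello"]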

    DYNAMIC GRADIENT AGGREGATION FOR TRAINING NEURAL NETWORKS

    Publication No.: US20220036178A1

    Publication Date: 2022-02-03

    Application No.: US16945715

    Filing Date: 2020-07-31

    Abstract: The disclosure herein describes training a global model based on a plurality of data sets. The global model is applied to each data set of the plurality of data sets and a plurality of gradients is generated based on that application. At least one gradient quality metric is determined for each gradient of the plurality of gradients. Based on the determined gradient quality metrics of the plurality of gradients, a plurality of weight factors is calculated. The plurality of gradients is transformed into a plurality of weighted gradients based on the calculated plurality of weight factors and a global gradient is generated based on the plurality of weighted gradients. The global model is updated based on the global gradient, wherein the updated global model, when applied to a data set, performs a task based on the data set and provides model output based on performing the task.
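
    The weighting-and-aggregation step can be sketched in a few lines of Python; the softmax weighting and the use of negative loss as the gradient quality metric are illustrative assumptions, since the disclosure requires only that weight factors be derived from some per-gradient quality metric.

        import numpy as np

        def aggregate_gradients(gradients, quality):
            """Transform per-data-set gradients into weighted gradients and sum
            them into a global gradient; `quality` holds one quality metric per
            gradient (higher is better)."""
            q = np.asarray(quality, dtype=np.float64)
            weights = np.exp(q - q.max())   # softmax weighting (assumed choice)
            weights /= weights.sum()
            return sum(w * g for w, g in zip(weights, gradients))

        # e.g. quality = [-loss for loss in per_dataset_losses]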

    SEQUENCE-TO-SEQUENCE SPEECH RECOGNITION WITH LATENCY THRESHOLD

    Publication No.: US20210312923A1

    Publication Date: 2021-10-07

    Application No.: US16841542

    Filing Date: 2020-04-06

    Abstract: A computing system is provided, including one or more processors configured to receive an audio input. The one or more processors may generate a text transcription of the audio input at a sequence-to-sequence speech recognition model, which may assign a respective plurality of external-model text tokens to a plurality of frames included in the audio input. Each external-model text token may have an external-model alignment within the audio input. Based on the audio input, the one or more processors may generate a plurality of hidden states. Based on the plurality of hidden states, the one or more processors may generate a plurality of output text tokens. Each output text token may have a corresponding output alignment within the audio input. For each output text token, a latency between the output alignment and the external-model alignment may be below a predetermined latency threshold. The one or more processors may output the text transcription.
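
    The latency constraint itself is simple to state in code. The Python sketch below checks it for one hypothesis, treating alignments as frame indices; the frame-based units and the threshold value are illustrative assumptions.

        def meets_latency_threshold(output_alignments, external_alignments,
                                    threshold_frames=24):
            """Return True if every output token's alignment trails the
            corresponding external-model token's alignment by less than the
            predetermined latency threshold."""
            return all(out - ext < threshold_frames
                       for out, ext in zip(output_alignments, external_alignments))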

    TRAINING AND USING A TRANSCRIPT GENERATION MODEL ON A MULTI-SPEAKER AUDIO STREAM

    Publication No.: US20240257815A1

    Publication Date: 2024-08-01

    Application No.: US18632277

    Filing Date: 2024-04-10

    CPC classification number: G10L17/04 G10L15/06 G10L15/26

    Abstract: The disclosure herein describes using a transcript generation model for generating a transcript from a multi-speaker audio stream. Audio data including overlapping speech of a plurality of speakers is obtained, and a set of frame embeddings is generated from audio data frames of the obtained audio data using an audio data encoder. A set of words and channel change (CC) symbols is generated from the set of frame embeddings using a transcript generation model. The CC symbols are included between pairs of adjacent words that are spoken by different people at the same time. The set of words and CC symbols is transformed into a plurality of transcript lines, wherein words of the set of words are sorted into transcript lines based on the CC symbols, and a multi-speaker transcript is generated based on the plurality of transcript lines. The inclusion of CC symbols by the model enables efficient, accurate multi-speaker transcription.

    ON-DEVICE STREAMING INVERSE TEXT NORMALIZATION (ITN)

    Publication No.: US20230289536A1

    Publication Date: 2023-09-14

    Application No.: US17693267

    Filing Date: 2022-03-11

    CPC classification number: G06F40/56 G06F40/284 G10L15/08

    Abstract: Solutions for on-device streaming inverse text normalization (ITN) include: receiving a stream of tokens, each token representing an element of human speech; tagging, by a tagger that can work in a streaming manner (e.g., a neural network), the stream of tokens with one or more tags of a plurality of tags to produce a tagged stream of tokens, each tag of the plurality of tags representing a different normalization category of a plurality of normalization categories; based on at least a first tag representing a first normalization category, converting, by a first language converter of a plurality of category-specific natural language converters (e.g., weighted finite state transducers, WFSTs), at least one token of the tagged stream of tokens, from a first lexical language form, to a first natural language form; and based on at least the first natural language form, outputting a natural language representation of the stream of tokens.
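
    The tag-then-convert pipeline can be illustrated with a toy Python sketch; the rule-based tagger and dictionary converter below merely stand in for the neural tagger and the category-specific WFSTs named in the abstract, and all names are assumptions.

        _DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
                   "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

        def tag(token):
            """Toy stand-in for the streaming tagger: label digit words CARDINAL."""
            return "CARDINAL" if token in _DIGITS else "PLAIN"

        def convert(tokens):
            """Stream tokens through the (single, toy) category-specific
            converter, replacing tagged spoken forms with written forms."""
            out = []
            for token in tokens:
                out.append(_DIGITS[token] if tag(token) == "CARDINAL" else token)
            return " ".join(out)

        # convert("call me at five five five".split()) -> "call me at 5 5 5"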

    HYPOTHESIS STITCHER FOR SPEECH RECOGNITION OF LONG-FORM AUDIO

    Publication No.: US20230154468A1

    Publication Date: 2023-05-18

    Application No.: US18157070

    Filing Date: 2023-01-19

    CPC classification number: G10L17/02 G10L15/22 G10L15/26 G10L19/022 G10L21/0272

    Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting an audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific stitchers or multi-speaker stitchers, and may further support multiple options for differing hypothesis configurations.
