-
Publication No.: US20220139380A1
Publication Date: 2022-05-05
Application No.: US17154956
Filing Date: 2021-01-21
Applicant: Microsoft Technology Licensing, LLC
Inventor: Zhong MENG, Sarangarajan PARTHASARATHY, Xie SUN, Yashesh GAUR, Naoyuki KANDA, Liang LU, Xie CHEN, Rui ZHAO, Jinyu LI, Yifan GONG
IPC: G10L15/16, G06N3/04, G10L15/06, G10L15/01, G10L15/183
Abstract: A computer device is provided that includes one or more processors configured to receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source-domain, and receive an external language model that has been trained with training data from a target-domain. The one or more processors are configured to perform an inference of the probability of an output token sequence given a sequence of input speech features. Performing the inference includes computing an E2E model score, computing an external language model score, and computing an estimated internal language model score for the E2E model. The estimated internal language model score is computed by removing a contribution of an intrinsic acoustic model. The one or more processors are further configured to compute an integrated score based at least on the E2E model score, the external language model score, and the estimated internal language model score.
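The score integration described in this abstract can be illustrated with a minimal sketch. It assumes the three scores are log-probabilities combined log-linearly, with the estimated internal language model score subtracted; the abstract only says the integrated score is "based at least on" the three scores, so the subtraction, the weight values, and the names `integrated_score` and `pick_best` are assumptions made here for illustration.

```python
def integrated_score(e2e_logp, ext_lm_logp, int_lm_logp,
                     ext_weight=0.6, int_weight=0.3):
    """Combine three log-probability scores for one hypothesis.

    Assumption: a log-linear combination in which the external LM score is
    added and the estimated internal LM score is subtracted. The weights are
    hypothetical tuning parameters, not values from the abstract.
    """
    return e2e_logp + ext_weight * ext_lm_logp - int_weight * int_lm_logp


def pick_best(hypotheses, ext_weight=0.6, int_weight=0.3):
    """Return the text of the hypothesis with the highest integrated score.

    Each hypothesis is a tuple: (text, e2e_logp, ext_lm_logp, int_lm_logp).
    """
    best = max(
        hypotheses,
        key=lambda h: integrated_score(h[1], h[2], h[3], ext_weight, int_weight),
    )
    return best[0]
```

In this sketch a hypothesis that the source-domain E2E model slightly prefers can lose to one that the target-domain external language model scores much higher.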
-
Publication No.: US20240185859A1
Publication Date: 2024-06-06
Application No.: US18440912
Filing Date: 2024-02-13
Applicant: Microsoft Technology Licensing, LLC
Inventor: Naoyuki KANDA, Xuankai CHANG, Yashesh GAUR, Xiaofei WANG, Zhong MENG, Takuya YOSHIOKA
IPC: G10L17/02, G10L15/22, G10L15/26, G10L19/022, G10L21/0272
CPC classification number: G10L17/02, G10L15/22, G10L15/26, G10L19/022, G10L21/0272
Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific stitchers or multi-speaker stitchers, and may further support multiple options for differing hypothesis configurations.
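The merging and symbol-insertion steps in this abstract can be sketched as follows. This covers only the preparation of the stitcher input; the network-based stitcher itself is not shown. The spelling `<WC>`, the placement of one symbol between consecutive windows, and the function name `merge_with_window_changes` are assumptions for illustration.

```python
WC = "<WC>"  # hypothetical spelling of the window change (WC) symbol


def merge_with_window_changes(window_hypotheses):
    """Concatenate short-segment hypotheses into one token sequence.

    `window_hypotheses` is a list of token lists, one per audio segment.
    Assumption: a single WC symbol is inserted between consecutive windows.
    """
    merged = []
    for index, tokens in enumerate(window_hypotheses):
        if index > 0:
            merged.append(WC)  # mark the boundary between two windows
        merged.extend(tokens)
    return merged
```

For example, two overlapping windows recognized as `["hello", "world"]` and `["world", "again"]` would be passed to the stitcher as a single sequence with `<WC>` between them, leaving it to the stitcher to resolve the repeated word.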
-
Publication No.: US20220199091A1
Publication Date: 2022-06-23
Application No.: US17127938
Filing Date: 2020-12-18
Applicant: Microsoft Technology Licensing, LLC
Inventor: Naoyuki KANDA, Xuankai CHANG, Yashesh GAUR, Xiaofei WANG, Zhong MENG, Takuya YOSHIOKA
IPC: G10L17/02, G10L15/22, G10L15/26, G10L19/022, G10L21/0272
Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific stitchers or multi-speaker stitchers, and may further support multiple options for differing hypothesis configurations.
-
Publication No.: US20210065683A1
Publication Date: 2021-03-04
Application No.: US16675515
Filing Date: 2019-11-06
Applicant: Microsoft Technology Licensing, LLC
Inventor: Zhong MENG, Yashesh GAUR, Jinyu LI, Yifan GONG
IPC: G10L15/065, G10L15/22, G10L19/00, G10L15/06
Abstract: Embodiments are associated with a speaker-independent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-independent attention-based encoder-decoder model associated with a first output distribution, a speaker-dependent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-dependent attention-based encoder-decoder model associated with a second output distribution, training of the speaker-dependent attention-based encoder-decoder model to classify output tokens based on input speech frames of a target speaker and simultaneously training the speaker-dependent attention-based encoder-decoder model to maintain a similarity between the first output distribution and the second output distribution, and performing automatic speech recognition on speech frames of the target speaker using the trained speaker-dependent attention-based encoder-decoder model.
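The joint training objective in this abstract can be illustrated with a minimal per-token sketch. It assumes that "similarity" between the two output distributions is measured with a Kullback-Leibler divergence and that the two terms are interpolated with a weight `rho`; neither choice is stated in the abstract, and the name `adaptation_loss` is hypothetical.

```python
import math


def adaptation_loss(si_probs, sd_probs, target_index, rho=0.5):
    """Per-token loss for adapting a speaker-dependent (SD) model.

    si_probs: output distribution of the speaker-independent model.
    sd_probs: output distribution of the speaker-dependent model.
    target_index: index of the correct output token.
    rho: hypothetical interpolation weight between the two terms.

    Assumption: similarity is enforced by penalizing KL(si || sd).
    All entries of sd_probs are assumed to be strictly positive.
    """
    # Classification term: cross-entropy of the SD model on the target token.
    cross_entropy = -math.log(sd_probs[target_index])
    # Regularization term: divergence of the SD output from the SI output.
    kl = sum(p * math.log(p / q) for p, q in zip(si_probs, sd_probs) if p > 0)
    return (1 - rho) * cross_entropy + rho * kl
```

With `rho=0` the sketch reduces to plain fine-tuning on the target speaker; larger values of `rho` keep the adapted model closer to the speaker-independent one.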
-
Publication No.: US20230215439A1
Publication Date: 2023-07-06
Application No.: US17566861
Filing Date: 2021-12-31
Applicant: Microsoft Technology Licensing, LLC
Inventor: Naoyuki KANDA, Takuya YOSHIOKA, Zhuo CHEN, Jinyu LI, Yashesh GAUR, Zhong MENG, Xiaofei WANG, Xiong XIAO
Abstract: The disclosure herein describes using a transcript generation model for generating a transcript from a multi-speaker audio stream. Audio data including overlapping speech of a plurality of speakers is obtained and a set of frame embeddings are generated from audio data frames of the obtained audio data using an audio data encoder. A set of words and channel change (CC) symbols are generated from the set of frame embeddings using a transcript generation model. The CC symbols are included between pairs of adjacent words that are spoken by different people at the same time. The set of words and CC symbols are transformed into a plurality of transcript lines, wherein words of the set of words are sorted into transcript lines based on the CC symbols, and a multi-speaker transcript is generated based on the plurality of transcript lines. The inclusion of CC symbols by the model enables efficient, accurate multi-speaker transcription.
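The step that sorts words into transcript lines based on the CC symbols can be sketched as below. The sketch assumes two virtual channels and that each CC symbol switches to the next channel in round-robin order; the symbol spelling `<cc>`, the channel count, and the function name `split_by_channel_change` are assumptions for illustration, not details from the abstract.

```python
CC = "<cc>"  # hypothetical spelling of the channel change (CC) symbol


def split_by_channel_change(tokens, num_channels=2):
    """Sort a serialized token sequence into per-channel transcript lines.

    Assumption: each CC symbol moves the write position to the next channel,
    wrapping around after the last one.
    """
    lines = [[] for _ in range(num_channels)]
    channel = 0
    for token in tokens:
        if token == CC:
            channel = (channel + 1) % num_channels  # switch transcript line
        else:
            lines[channel].append(token)
    return [" ".join(words) for words in lines]
```

For the serialized output `how are <cc> hi <cc> you <cc> there`, this sketch yields the two lines `how are you` and `hi there`.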
-
Publication No.: US20220036178A1
Publication Date: 2022-02-03
Application No.: US16945715
Filing Date: 2020-07-31
Applicant: Microsoft Technology Licensing, LLC
Inventor: Dimitrios B. DIMITRIADIS, Kenichi KUMATANI, Robert Peter GMYR, Masaki ITAGAKI, Yashesh GAUR, Nanshan ZENG, Xuedong HUANG
Abstract: The disclosure herein describes training a global model based on a plurality of data sets. The global model is applied to each data set of the plurality of data sets and a plurality of gradients is generated based on that application. At least one gradient quality metric is determined for each gradient of the plurality of gradients. Based on the determined gradient quality metrics of the plurality of gradients, a plurality of weight factors is calculated. The plurality of gradients is transformed into a plurality of weighted gradients based on the calculated plurality of weight factors and a global gradient is generated based on the plurality of weighted gradients. The global model is updated based on the global gradient, wherein the updated global model, when applied to a data set, performs a task based on the data set and provides model output based on performing the task.
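The aggregation steps in this abstract (quality metrics, weight factors, weighted gradients, global gradient, model update) can be sketched in a few lines. The abstract does not say how weight factors are derived from the quality metrics; the softmax used here, the plain gradient-descent update, and all function names are assumptions for illustration.

```python
import math


def weight_factors(quality_metrics, temperature=1.0):
    """Turn per-gradient quality metrics into weights that sum to 1.

    Assumption: a softmax over the metrics, so higher quality means a
    larger weight. `temperature` is a hypothetical smoothing parameter.
    """
    top = max(quality_metrics)
    exps = [math.exp((q - top) / temperature) for q in quality_metrics]
    total = sum(exps)
    return [e / total for e in exps]


def global_gradient(gradients, quality_metrics):
    """Weighted sum of per-data-set gradients (each a list of floats)."""
    weights = weight_factors(quality_metrics)
    dim = len(gradients[0])
    return [sum(w * g[i] for w, g in zip(weights, gradients)) for i in range(dim)]


def apply_update(params, gradient, learning_rate=0.1):
    """Plain gradient-descent step on the global model parameters."""
    return [p - learning_rate * g for p, g in zip(params, gradient)]
```

When all quality metrics are equal, the sketch reduces to averaging the gradients; a gradient with a markedly lower metric contributes correspondingly less to the global gradient.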
-
Publication No.: US20210312923A1
Publication Date: 2021-10-07
Application No.: US16841542
Filing Date: 2020-04-06
Applicant: Microsoft Technology Licensing, LLC
Inventor: Yashesh GAUR, Jinyu LI, Liang LU, Hirofumi INAGUMA, Yifan GONG
Abstract: A computing system includes one or more processors configured to receive an audio input. The one or more processors may generate a text transcription of the audio input at a sequence-to-sequence speech recognition model, which may assign a respective plurality of external-model text tokens to a plurality of frames included in the audio input. Each external-model text token may have an external-model alignment within the audio input. Based on the audio input, the one or more processors may generate a plurality of hidden states. Based on the plurality of hidden states, the one or more processors may generate a plurality of output text tokens. Each output text token may have a corresponding output alignment within the audio input. For each output text token, a latency between the output alignment and the external-model alignment may be below a predetermined latency threshold. The one or more processors may output the text transcription.
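The latency condition at the end of this abstract can be expressed as a simple check. The sketch assumes alignments are given as frame indices and that latency is the output alignment minus the external-model alignment for the same token; both conventions, and the function names, are assumptions for illustration. It only verifies the condition and does not show how a model would be trained to satisfy it.

```python
def token_latencies(output_frames, external_frames):
    """Per-token latency in frames.

    Assumption: both lists have one frame index per token, in the same
    token order, and latency = output alignment - external-model alignment.
    """
    return [out - ext for out, ext in zip(output_frames, external_frames)]


def within_latency(output_frames, external_frames, max_frames):
    """True if every token's latency is below the threshold `max_frames`."""
    return all(
        latency < max_frames
        for latency in token_latencies(output_frames, external_frames)
    )
```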
-
Publication No.: US20240257815A1
Publication Date: 2024-08-01
Application No.: US18632277
Filing Date: 2024-04-10
Applicant: Microsoft Technology Licensing, LLC
Inventor: Naoyuki KANDA, Takuya YOSHIOKA, Zhuo CHEN, Jinyu LI, Yashesh GAUR, Zhong MENG, Xiaofei WANG, Xiong XIAO
Abstract: The disclosure herein describes using a transcript generation model for generating a transcript from a multi-speaker audio stream. Audio data including overlapping speech of a plurality of speakers is obtained and a set of frame embeddings are generated from audio data frames of the obtained audio data using an audio data encoder. A set of words and channel change (CC) symbols are generated from the set of frame embeddings using a transcript generation model. The CC symbols are included between pairs of adjacent words that are spoken by different people at the same time. The set of words and CC symbols are transformed into a plurality of transcript lines, wherein words of the set of words are sorted into transcript lines based on the CC symbols, and a multi-speaker transcript is generated based on the plurality of transcript lines. The inclusion of CC symbols by the model enables efficient, accurate multi-speaker transcription.
-
Publication No.: US20230289536A1
Publication Date: 2023-09-14
Application No.: US17693267
Filing Date: 2022-03-11
Applicant: Microsoft Technology Licensing, LLC
Inventor: Yashesh GAUR, Nicholas KIBRE, Issac J. ALPHONSO, Jian XUE, Jinyu LI, Piyush BEHRE, Shawn CHANG
IPC: G06F40/56, G06F40/284, G10L15/08
CPC classification number: G06F40/56, G06F40/284, G10L15/08
Abstract: Solutions for on-device streaming inverse text normalization (ITN) include: receiving a stream of tokens, each token representing an element of human speech; tagging, by a tagger that can work in a streaming manner (e.g., a neural network), the stream of tokens with one or more tags of a plurality of tags to produce a tagged stream of tokens, each tag of the plurality of tags representing a different normalization category of a plurality of normalization categories; based on at least a first tag representing a first normalization category, converting, by a first language converter of a plurality of category-specific natural language converters (e.g., weighted finite state transducers, WFSTs), at least one token of the tagged stream of tokens, from a first lexical language form, to a first natural language form; and based on at least the first natural language form, outputting a natural language representation of the stream of tokens.
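The tag-then-convert flow in this abstract can be illustrated with a toy sketch. A lookup-based tagger stands in for the streaming (e.g., neural) tagger, and a small Python function stands in for a category-specific converter such as a WFST; the single `cardinal` category, the word tables, and all function names are assumptions for illustration.

```python
from itertools import groupby

# Toy word tables; a real converter would cover far more of the language.
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}


def tag(token):
    """Toy stand-in for the tagger: one normalization category or 'plain'."""
    return "cardinal" if token in UNITS or token in TENS else "plain"


def convert_cardinal(tokens):
    """Toy stand-in for a category-specific converter.

    It simply sums word values, so it is only sensible for patterns such as
    'twenty three' or a single number word.
    """
    return str(sum(UNITS.get(t, 0) + TENS.get(t, 0) for t in tokens))


CONVERTERS = {"cardinal": convert_cardinal}


def streaming_itn(token_stream):
    """Yield normalized tokens while consuming the input stream lazily."""
    for category, group in groupby(token_stream, key=tag):
        tokens = list(group)
        if category in CONVERTERS:
            yield CONVERTERS[category](tokens)  # lexical form -> written form
        else:
            yield from tokens  # untagged tokens pass through unchanged
```

Because `groupby` reads its input lazily, the generator emits each plain run as soon as it is read and each tagged span as soon as the following token shows that the span has ended, which loosely mirrors the streaming behavior the abstract describes.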
-
Publication No.: US20230154468A1
Publication Date: 2023-05-18
Application No.: US18157070
Filing Date: 2023-01-19
Applicant: Microsoft Technology Licensing, LLC
Inventor: Naoyuki KANDA, Xuankai CHANG, Yashesh GAUR, Xiaofei WANG, Zhong MENG, Takuya YOSHIOKA
IPC: G10L17/02, G10L15/22, G10L15/26, G10L19/022, G10L21/0272
CPC classification number: G10L17/02, G10L15/22, G10L15/26, G10L19/022, G10L21/0272
Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific stitchers or multi-speaker stitchers, and may further support multiple options for differing hypothesis configurations.