Patent search: ap:("Microsoft Technology Licensing, LLC") AND inv:"Yashesh GAUR"

1.

Invention application
INTERNAL LANGUAGE MODEL FOR E2E MODELS (Granted)

Publication (announcement) number: US20220139380A1

Publication (announcement) date: 2022-05-05

Application number: US17154956

Application date: 2021-01-21

Applicant: Microsoft Technology Licensing, LLC

Inventor: Zhong MENG, Sarangarajan PARTHASARATHY, Xie SUN, Yashesh GAUR, Naoyuki KANDA, Liang LU, Xie CHEN, Rui ZHAO, Jinyu LI, Yifan GONG

IPC: G10L15/16, G06N3/04, G10L15/06, G10L15/01, G10L15/183

Abstract: A computer device is provided that includes one or more processors configured to receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source-domain, and receive an external language model that has been trained with training data from a target-domain. The one or more processors are configured to perform an inference of the probability of an output token sequence given a sequence of input speech features. Performing the inference includes computing an E2E model score, computing an external language model score, and computing an estimated internal language model score for the E2E model. The estimated internal language model score is computed by removing a contribution of an intrinsic acoustic model. The processor is further configured to compute an integrated score based at least on the E2E model score, the external language model score, and the estimated internal language model score.
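Illustrative note: the score combination in this abstract can be sketched as below. The log-linear form, the function names `integrated_score` and `pick_best`, and the weights `lambda_ext` and `lambda_ilm` are assumptions made for illustration; the abstract only states that the integrated score is based at least on the three scores.

```python
def integrated_score(e2e_logp, ext_lm_logp, ilm_logp, lambda_ext=0.5, lambda_ilm=0.3):
    """Combine three log-scores for one hypothesis.

    Assumed form: add the external LM score and subtract the estimated
    internal LM score, each scaled by a hypothetical tuning weight.
    """
    return e2e_logp + lambda_ext * ext_lm_logp - lambda_ilm * ilm_logp


def pick_best(hypotheses, lambda_ext=0.5, lambda_ilm=0.3):
    """Return the hypothesis dict with the highest integrated score."""
    return max(
        hypotheses,
        key=lambda h: integrated_score(
            h["e2e"], h["ext_lm"], h["ilm"], lambda_ext, lambda_ilm
        ),
    )


if __name__ == "__main__":
    candidates = [
        {"text": "a", "e2e": -4.0, "ext_lm": -2.0, "ilm": -3.0},
        {"text": "b", "e2e": -3.8, "ext_lm": -6.0, "ilm": -1.0},
    ]
    print(pick_best(candidates, 0.5, 0.5)["text"])
```

In this toy input, hypothesis "b" has the better raw E2E score, but "a" wins once the target-domain language model is added and the source-domain internal language model estimate is subtracted.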

2.

Invention publication
HYPOTHESIS STITCHER FOR SPEECH RECOGNITION OF LONG-FORM AUDIO (Under examination, published)

Publication (announcement) number: US20240185859A1

Publication (announcement) date: 2024-06-06

Application number: US18440912

Application date: 2024-02-13

Applicant: Microsoft Technology Licensing, LLC

Inventor: Naoyuki KANDA, Xuankai CHANG, Yashesh GAUR, Xiaofei WANG, Zhong MENG, Takuya YOSHIOKA

IPC: G10L17/02, G10L15/22, G10L15/26, G10L19/022, G10L21/0272

CPC classification number: G10L17/02, G10L15/22, G10L15/26, G10L19/022, G10L21/0272

Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific stitchers or multi-speaker stitchers, and may further support multiple options for differing hypothesis configurations.
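Illustrative note: the segmenting and merging steps listed in this abstract can be sketched as below. The overlapping fixed-length windows, the `<WC>` spelling of the window change symbol, and the function names are assumptions for illustration. The network-based stitcher that consolidates the merged set is not modeled here.

```python
WC = "<WC>"  # assumed spelling of the window change symbol


def segment(samples, window, hop):
    """Split a sequence into overlapping fixed-length windows.

    Trailing samples that do not line up with a hop are dropped in this sketch.
    """
    last_start = max(len(samples) - window, 0)
    return [samples[i:i + window] for i in range(0, last_start + 1, hop)]


def merge_with_wc(short_hypotheses):
    """Concatenate per-segment hypotheses, inserting WC between segments."""
    merged = []
    for index, hypothesis in enumerate(short_hypotheses):
        if index > 0:
            merged.append(WC)
        merged.extend(hypothesis)
    return merged


if __name__ == "__main__":
    print(segment(list(range(10)), window=4, hop=3))
    print(merge_with_wc([["hello", "world"], ["world", "again"]]))
```

The merged token list, with its WC markers, is what a stitcher model would take as input to produce one consolidated hypothesis.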

3.

Invention application
HYPOTHESIS STITCHER FOR SPEECH RECOGNITION OF LONG-FORM AUDIO (Granted)

Publication (announcement) number: US20220199091A1

Publication (announcement) date: 2022-06-23

Application number: US17127938

Application date: 2020-12-18

Applicant: Microsoft Technology Licensing, LLC

Inventor: Naoyuki KANDA, Xuankai CHANG, Yashesh GAUR, Xiaofei WANG, Zhong MENG, Takuya YOSHIOKA

IPC: G10L17/02, G10L15/22, G10L15/26, G10L19/022, G10L21/0272

Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific stitchers or multi-speaker stitchers, and may further support multiple options for differing hypothesis configurations.

4.

Invention application
SPEAKER ADAPTATION FOR ATTENTION-BASED ENCODER-DECODER (Granted)

Publication (announcement) number: US20210065683A1

Publication (announcement) date: 2021-03-04

Application number: US16675515

Application date: 2019-11-06

Applicant: Microsoft Technology Licensing, LLC

Inventor: Zhong MENG, Yashesh GAUR, Jinyu LI, Yifan GONG

IPC: G10L15/065, G10L15/22, G10L19/00, G10L15/06

Abstract: Embodiments are associated with a speaker-independent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-independent attention-based encoder-decoder model associated with a first output distribution, a speaker-dependent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-dependent attention-based encoder-decoder model associated with a second output distribution, training of the second attention-based encoder-decoder model to classify output tokens based on input speech frames of a target speaker and simultaneously training the speaker-dependent attention-based encoder-decoder model to maintain a similarity between the first output distribution and the second output distribution, and performing automatic speech recognition on speech frames of the target speaker using the trained speaker-dependent attention-based encoder-decoder model.
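Illustrative note: one common way to train an adapted model while keeping its output distribution close to that of the original model is to add a divergence penalty to the task loss. The sketch below assumes a KL divergence penalty and a hypothetical interpolation weight `rho`; the abstract only says that a similarity between the two output distributions is maintained, so the exact measure is an assumption.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability of the target token index."""
    return -math.log(probs[target])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def adaptation_loss(si_probs, sd_probs, target, rho=0.5):
    """Assumed per-token loss: (1 - rho) * CE(sd, target) + rho * KL(si || sd).

    si_probs: output distribution of the speaker-independent model.
    sd_probs: output distribution of the speaker-dependent model being trained.
    """
    task = cross_entropy(sd_probs, target)
    penalty = kl_divergence(si_probs, sd_probs)
    return (1 - rho) * task + rho * penalty


if __name__ == "__main__":
    print(adaptation_loss([0.7, 0.3], [0.9, 0.1], target=0, rho=0.5))
```

With `rho` at 0 the loss reduces to plain fine-tuning on the target speaker; larger values pull the adapted model back toward the speaker-independent one.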

5.

Invention publication
TRAINING AND USING A TRANSCRIPT GENERATION MODEL ON A MULTI-SPEAKER AUDIO STREAM (Under examination, published)

Publication (announcement) number: US20230215439A1

Publication (announcement) date: 2023-07-06

Application number: US17566861

Application date: 2021-12-31

Applicant: Microsoft Technology Licensing, LLC

Inventor: Naoyuki KANDA, Takuya YOSHIOKA, Zhuo CHEN, Jinyu LI, Yashesh GAUR, Zhong MENG, Xiaofei WANG, Xiong XIAO

IPC: G10L17/04, G10L15/06, G10L15/26

CPC classification number: G10L17/04, G10L15/06, G10L15/26

Abstract: The disclosure herein describes using a transcript generation model for generating a transcript from a multi-speaker audio stream. Audio data including overlapping speech of a plurality of speakers is obtained and a set of frame embeddings are generated from audio data frames of the obtained audio data using an audio data encoder. A set of words and channel change (CC) symbols are generated from the set of frame embeddings using a transcript generation model. The CC symbols are included between pairs of adjacent words that are spoken by different people at the same time. The set of words and CC symbols are transformed into a plurality of transcript lines, wherein words of the set of words are sorted into transcript lines based on the CC symbols, and a multi-speaker transcript is generated based on the plurality of transcript lines. The inclusion of CC symbols by the model enables efficient, accurate multi-speaker transcription.
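Illustrative note: the step that turns a stream of words and CC symbols into transcript lines can be sketched as below. The `<cc>` token spelling, the two-line default, and the rule that each CC symbol advances to the next line in round-robin order are assumptions for illustration, not details taken from the abstract.

```python
CC = "<cc>"  # assumed spelling of the channel change symbol


def split_by_channel_change(tokens, num_lines=2):
    """Sort words into transcript lines, switching line at every CC symbol."""
    lines = [[] for _ in range(num_lines)]
    current = 0
    for token in tokens:
        if token == CC:
            # Assumed rule: a CC symbol moves to the next line, wrapping around.
            current = (current + 1) % num_lines
        else:
            lines[current].append(token)
    return [" ".join(line) for line in lines]


if __name__ == "__main__":
    stream = ["hello", "how", CC, "hi", CC, "are", "you", CC, "fine"]
    for line in split_by_channel_change(stream):
        print(line)
```

For the sample stream, the first line collects "hello how are you" and the second collects "hi fine", which separates the overlapping speech into two readable lines.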

6.

Invention application
DYNAMIC GRADIENT AGGREGATION FOR TRAINING NEURAL NETWORKS (Granted)

Publication (announcement) number: US20220036178A1

Publication (announcement) date: 2022-02-03

Application number: US16945715

Application date: 2020-07-31

Applicant: Microsoft Technology Licensing, LLC

Inventor: Dimitrios B. DIMITRIADIS, Kenichi KUMATANI, Robert Peter GMYR, Masaki ITAGAKI, Yashesh GAUR, Nanshan ZENG, Xuedong HUANG

IPC: G06N3/08, G06N3/04

Abstract: The disclosure herein describes training a global model based on a plurality of data sets. The global model is applied to each data set of the plurality of data sets and a plurality of gradients is generated based on that application. At least one gradient quality metric is determined for each gradient of the plurality of gradients. Based on the determined gradient quality metrics of the plurality of gradients, a plurality of weight factors is calculated. The plurality of gradients is transformed into a plurality of weighted gradients based on the calculated plurality of weight factors and a global gradient is generated based on the plurality of weighted gradients. The global model is updated based on the global gradient, wherein the updated global model, when applied to a data set, performs a task based on the data set and provides model output based on performing the task.
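Illustrative note: the weighting and aggregation steps in this abstract can be sketched as below. Using a softmax over the quality metrics to obtain the weight factors, treating a higher metric as better, and the plain gradient-descent update are all assumptions for illustration; the abstract does not specify the metric or the weighting function.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(gradients, quality_metrics):
    """Weight each per-dataset gradient by its quality and sum them.

    Assumption: a higher quality metric should give a larger weight factor.
    """
    weights = softmax(quality_metrics)
    size = len(gradients[0])
    return [sum(w * g[i] for w, g in zip(weights, gradients)) for i in range(size)]


def update(params, global_gradient, learning_rate=0.1):
    """Apply one gradient-descent step to the global model parameters."""
    return [p - learning_rate * g for p, g in zip(params, global_gradient)]


if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0, 4.0]]
    global_grad = aggregate(grads, quality_metrics=[0.0, 0.0])
    print(global_grad, update([1.0, 1.0], global_grad))
```

With equal quality metrics the aggregation reduces to a plain average; as one metric grows relative to the others, the global gradient approaches that data set's gradient.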

7.

Invention application
SEQUENCE-TO-SEQUENCE SPEECH RECOGNITION WITH LATENCY THRESHOLD (Granted)

Publication (announcement) number: US20210312923A1

Publication (announcement) date: 2021-10-07

Application number: US16841542

Application date: 2020-04-06

Applicant: Microsoft Technology Licensing, LLC

Inventor: Yashesh GAUR, Jinyu LI, Liang LU, Hirofumi INAGUMA, Yifan GONG

IPC: G10L15/26, G10L15/16

Abstract: A computing system including one or more processors configured to receive an audio input. The one or more processors may generate a text transcription of the audio input at a sequence-to-sequence speech recognition model, which may assign a respective plurality of external-model text tokens to a plurality of frames included in the audio input. Each external-model text token may have an external-model alignment within the audio input. Based on the audio input, the one or more processors may generate a plurality of hidden states. Based on the plurality of hidden states, the one or more processors may generate a plurality of output text tokens. Each output text token may have a corresponding output alignment within the audio input. For each output text token, a latency between the output alignment and the external-model alignment may be below a predetermined latency threshold. The one or more processors may output the text transcription.
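Illustrative note: the per-token latency condition in this abstract can be checked as sketched below. Representing each alignment as a frame index, pairing tokens one-to-one, and the function names are assumptions for illustration. The sketch only verifies the condition; it does not show how a model is made to satisfy it.

```python
def token_latencies(output_alignment, external_alignment):
    """Frames by which each output token trails its external-model alignment."""
    return [out - ref for out, ref in zip(output_alignment, external_alignment)]


def within_latency_threshold(output_alignment, external_alignment, threshold):
    """True if every token's latency is below the threshold (in frames)."""
    latencies = token_latencies(output_alignment, external_alignment)
    return all(latency < threshold for latency in latencies)


if __name__ == "__main__":
    output_frames = [12, 30, 47]    # hypothetical emission frame per output token
    external_frames = [10, 25, 40]  # hypothetical reference frame per token
    print(token_latencies(output_frames, external_frames))
    print(within_latency_threshold(output_frames, external_frames, threshold=8))
```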

8.

Invention publication
TRAINING AND USING A TRANSCRIPT GENERATION MODEL ON A MULTI-SPEAKER AUDIO STREAM (Under examination, published)

Publication (announcement) number: US20240257815A1

Publication (announcement) date: 2024-08-01

Application number: US18632277

Application date: 2024-04-10

Applicant: Microsoft Technology Licensing, LLC

Inventor: Naoyuki KANDA, Takuya YOSHIOKA, Zhuo CHEN, Jinyu LI, Yashesh GAUR, Zhong MENG, Xiaofei WANG, Xiong XIAO

IPC: G10L17/04, G10L15/06, G10L15/26

CPC classification number: G10L17/04, G10L15/06, G10L15/26

Abstract: The disclosure herein describes using a transcript generation model for generating a transcript from a multi-speaker audio stream. Audio data including overlapping speech of a plurality of speakers is obtained and a set of frame embeddings are generated from audio data frames of obtained audio data using an audio data encoder. A set of words and channel change (CC) symbols are generated from the set of frame embeddings using a transcript generation model. The CC symbols are included between pairs of adjacent words that are spoken by different people at the same time. The set of words and CC symbols are transformed into a plurality of transcript lines, wherein words of the set of words are sorted into transcript lines based on CC symbols, and a multi-speaker transcript is generated based on the plurality of transcript lines. The inclusion of CC symbols by the model enables efficient, accurate multi-speaker transcription.

9.

Invention publication
ON-DEVICE STREAMING INVERSE TEXT NORMALIZATION (ITN) (Under examination, published)

Publication (announcement) number: US20230289536A1

Publication (announcement) date: 2023-09-14

Application number: US17693267

Application date: 2022-03-11

Applicant: Microsoft Technology Licensing, LLC

Inventor: Yashesh GAUR, Nicholas KIBRE, Issac J. ALPHONSO, Jian XUE, Jinyu LI, Piyush BEHRE, Shawn CHANG

IPC: G06F40/56, G06F40/284, G10L15/08

CPC classification number: G06F40/56, G06F40/284, G10L15/08

Abstract: Solutions for on-device streaming inverse text normalization (ITN) include: receiving a stream of tokens, each token representing an element of human speech; tagging, by a tagger that can work in a streaming manner (e.g., a neural network), the stream of tokens with one or more tags of a plurality of tags to produce a tagged stream of tokens, each tag of the plurality of tags representing a different normalization category of a plurality of normalization categories; based on at least a first tag representing a first normalization category, converting, by a first language converter of a plurality of category-specific natural language converters (e.g., weighted finite state transducers, WFSTs), at least one token of the tagged stream of tokens, from a first lexical language form, to a first natural language form; and based on at least the first natural language form, outputting a natural language representation of the stream of tokens.
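Illustrative note: the tag-then-convert flow in this abstract can be sketched as below. The rule-based tagger and the dictionary lookups are toy stand-ins for the neural tagger and the WFST converters mentioned as examples in the abstract, and the category names `CARDINAL`, `ORDINAL`, and `O` are assumptions for illustration.

```python
CARDINALS = {"one": "1", "two": "2", "three": "3"}
ORDINALS = {"first": "1st", "second": "2nd", "third": "3rd"}

# One converter per normalization category (toy lookups, not real WFSTs).
CONVERTERS = {"CARDINAL": CARDINALS.get, "ORDINAL": ORDINALS.get}


def tag_stream(tokens):
    """Yield (token, tag) pairs one token at a time (toy rule-based tagger)."""
    for token in tokens:
        if token in CARDINALS:
            yield token, "CARDINAL"
        elif token in ORDINALS:
            yield token, "ORDINAL"
        else:
            yield token, "O"  # assumed tag meaning "no normalization needed"


def normalize_stream(tokens):
    """Route each tagged token to its category converter; pass others through."""
    for token, tag in tag_stream(tokens):
        convert = CONVERTERS.get(tag)
        yield convert(token) if convert else token


if __name__ == "__main__":
    spoken = iter("take the second exit in two miles".split())
    print(" ".join(normalize_stream(spoken)))
```

Both functions are generators, so each token is tagged and converted as it arrives rather than after the whole utterance is available.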

10.

Invention publication
HYPOTHESIS STITCHER FOR SPEECH RECOGNITION OF LONG-FORM AUDIO (Under examination, published)

Publication (announcement) number: US20230154468A1

Publication (announcement) date: 2023-05-18

Application number: US18157070

Application date: 2023-01-19

Applicant: Microsoft Technology Licensing, LLC

Inventor: Naoyuki KANDA, Xuankai CHANG, Yashesh GAUR, Xiaofei WANG, Zhong MENG, Takuya YOSHIOKA

IPC: G10L17/02, G10L15/22, G10L15/26, G10L19/022, G10L21/0272

CPC classification number: G10L17/02, G10L15/22, G10L15/26, G10L19/022, G10L21/0272

Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific stitchers or multi-speaker stitchers, and may further support multiple options for differing hypothesis configurations.
