-
Publication No.: US20240362453A1
Publication Date: 2024-10-31
Application No.: US18766038
Filing Date: 2024-07-08
Applicant: Google LLC
Inventor: Anmol Gulati, Weikeng Qin, Zhengdong Zhang, Ruoming Pang, Niki Parmar, Jiahui Yu, Wei Han, Chung-Cheng Chiu, Yu Zhang, Yonghui Wu, Shibo Wang
Abstract: Systems and methods can utilize a conformer model to process a data set for various data processing tasks, including, but not limited to, speech recognition, sound separation, protein synthesis determination, video or other image set analysis, and natural language processing. The conformer model can use feed-forward blocks, a self-attention block, and a convolution block to process data to learn global interactions and relative-offset-based local correlations of the input data.
-
Publication No.: US12079703B2
Publication Date: 2024-09-03
Application No.: US17139525
Filing Date: 2020-12-31
Applicant: Google LLC
Inventor: Anmol Gulati, Ruoming Pang, Niki Parmar, Jiahui Yu, Wei Han, Chung-Cheng Chiu, Yu Zhang, Yonghui Wu, Shibo Wang, Weikeng Qin, Zhengdong Zhang
Abstract: Systems and methods can utilize a conformer model to process a data set for various data processing tasks, including, but not limited to, speech recognition, sound separation, protein synthesis determination, video or other image set analysis, and natural language processing. The conformer model can use feed-forward blocks, a self-attention block, and a convolution block to process data to learn global interactions and relative-offset-based local correlations of the input data.
-
Publication No.: US20240282294A1
Publication Date: 2024-08-22
Application No.: US18651296
Filing Date: 2024-04-30
Applicant: Google LLC
Inventor: Qingqing Huang, Daniel Sung-Joon Park, Aren Jansen, Timo Immanuel Denk, Yue Li, Ravi Ganti, Dan Ellis, Tao Wang, Wei Han, Joonseok Lee
CPC classification number: G10L15/063, G10L15/16
Abstract: A corpus of textual data is generated with a machine-learned text generation model. The corpus of textual data includes a plurality of sentences. Each sentence is descriptive of a type of audio. For each of a plurality of audio recordings, the audio recording is processed with a machine-learned audio classification model to obtain training data including the audio recording and one or more sentences of the plurality of sentences closest to the audio recording within a joint audio-text embedding space of the machine-learned audio classification model. The sentence(s) are processed with a machine-learned generation model to obtain an intermediate representation of the one or more sentences. The intermediate representation is processed with a machine-learned cascaded diffusion model to obtain audio data. The machine-learned cascaded diffusion model is trained based on a difference between the audio data and the audio recording.
-
Publication No.: US20220415042A1
Publication Date: 2022-12-29
Application No.: US17901224
Filing Date: 2022-09-01
Applicant: Google LLC
Inventor: Jonathon Shlens, Vijay Vasudevan, Jiquan Ngiam, Wei Han, Zhifeng Chen, Brandon Chauloon Yang, Benjamin James Caine, Zhengdong Zhang, Christoph Sprunk, Ouais Alsharif, Junhua Mao, Chen Wu
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for processing data generated by a sensing system that rotationally senses an environment. In one aspect, a method comprises partitioning a predetermined period of time into a plurality of sub-periods, wherein the predetermined period of time is a period of time for which data generated by the sensing system constitutes a complete rotational sensing of the environment; for each sub-period: receiving current data generated by the sensing system during the sub-period and characterizing a respective partial scene of the environment; processing the current data using an object detection neural network to generate a current object detection output that is specific to the respective partial scene of the environment.
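Illustrative sketch (not part of the patent record): the partitioning step described in this abstract could look roughly like the following. The point format `(timestamp, measurement)`, the function names, and the `detector` callable are all assumptions introduced for illustration; the abstract does not say how sub-periods are sized, so equal widths are assumed here.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical sample format: (timestamp in seconds within one rotation, measurement).
Point = Tuple[float, float]


def partition_period(period: float, num_sub_periods: int) -> List[Tuple[float, float]]:
    """Split [0, period) into equal, contiguous sub-periods (assumed equal width)."""
    if period <= 0 or num_sub_periods <= 0:
        raise ValueError("period and num_sub_periods must be positive")
    width = period / num_sub_periods
    return [(i * width, (i + 1) * width) for i in range(num_sub_periods)]


def detect_per_sub_period(
    points: Sequence[Point],
    period: float,
    num_sub_periods: int,
    detector: Callable[[List[Point]], object],
) -> List[object]:
    """Run the detector once per sub-period, on that sub-period's data only."""
    outputs = []
    for start, end in partition_period(period, num_sub_periods):
        # Data received during this sub-period covers only a partial scene.
        current = [p for p in points if start <= p[0] < end]
        outputs.append(detector(current))
    return outputs
```

Each call to `detector` sees only one slice of the rotation, so an output is available before the full sweep completes. A real system would pass a neural network here; any callable works for the sketch.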
-
Publication No.: US20250037426A1
Publication Date: 2025-01-30
Application No.: US18716912
Filing Date: 2022-12-09
Applicant: Google LLC
Inventor: Bowen Zhang, Jiahui Yu, Christopher Fifty, Wei Han, Andrew M. Dai, Ruoming Pang, Fei Sha
IPC: G06V10/764, G06V10/774
Abstract: A method includes obtaining video datasets each including pairs of a training video and a ground-truth action classification of the training video. The method also includes generating an action recognition model that includes a shared encoder model and action classification heads. A number of the action classification heads may be equal to a number of the video datasets, and each action classification head may be configured to, based on an output of the shared encoder model, classify training videos sampled from a corresponding video dataset. The method also includes determining, by the action recognition model and for each training video sampled from the video datasets, an inferred action classification. The method further includes determining a loss value based on the inferred action classifications and the ground-truth action classifications, and adjusting parameters of the action recognition model based on the loss value.
-
Publication No.: US20220122586A1
Publication Date: 2022-04-21
Application No.: US17447285
Filing Date: 2021-09-09
Applicant: Google LLC
Inventor: Jiahui Yu, Chung-cheng Chiu, Bo Li, Shuo-yiin Chang, Tara Sainath, Wei Han, Anmol Gulati, Yanzhang He, Arun Narayanan, Yonghui Wu, Ruoming Pang
Abstract: A computer-implemented method of training a streaming speech recognition model includes receiving, as input to the streaming speech recognition model, a sequence of acoustic frames. The streaming speech recognition model is configured to learn an alignment probability between the sequence of acoustic frames and an output sequence of vocabulary tokens. The vocabulary tokens include a plurality of label tokens and a blank token. At each output step, the method includes determining a first probability of emitting one of the label tokens and determining a second probability of emitting the blank token. The method also includes generating the alignment probability at a sequence level based on the first probability and the second probability. The method also includes applying a tuning parameter to the alignment probability at the sequence level to maximize the first probability of emitting one of the label tokens.
-
Publication No.: US20240104352A1
Publication Date: 2024-03-28
Application No.: US18012391
Filing Date: 2022-07-28
Applicant: Google LLC
Inventor: Yu Zhang, Yu-An Chung, Wei Han, Chung-Cheng Chiu, Weikeng Qin, Ruoming Pang, Yonghui Wu
IPC: G06N3/0455
CPC classification number: G06N3/0455
Abstract: Provided are improved end-to-end self-supervised pre-training frameworks that leverage a combination of contrastive and masked modeling loss terms. In particular, the present disclosure provides a framework that combines contrastive learning and masked modeling, where the former trains the model to discretize input data (e.g., continuous signals such as continuous speech signals) into a finite set of discriminative tokens, and the latter trains the model to learn contextualized representations via solving a masked prediction task consuming the discretized tokens. In contrast to certain existing masked modeling-based pre-training frameworks which rely on an iterative re-clustering and re-training process or other existing frameworks which concatenate two separately trained modules, the proposed framework can enable a model to be optimized in an end-to-end fashion by solving the two self-supervised tasks (the contrastive task and masked modeling) simultaneously.
-
Publication No.: US11804212B2
Publication Date: 2023-10-31
Application No.: US17348118
Filing Date: 2021-06-15
Applicant: Google LLC
Inventor: Thibault Doutre, Wei Han, Min Ma, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Arun Narayanan, Ananya Misra, Yu Zhang, Liangliang Cao
CPC classification number: G10L15/063, G06N3/045, G10L15/083, G10L15/18
Abstract: A method for training a streaming automatic speech recognition student model includes receiving a plurality of unlabeled student training utterances. The method also includes, for each unlabeled student training utterance, generating a transcription corresponding to the respective unlabeled student training utterance using a plurality of non-streaming automated speech recognition (ASR) teacher models. The method further includes distilling a streaming ASR student model from the plurality of non-streaming ASR teacher models by training the streaming ASR student model using the plurality of unlabeled student training utterances paired with the corresponding transcriptions generated by the plurality of non-streaming ASR teacher models.
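Illustrative sketch (not part of the patent record): the pseudo-labeling step in this abstract pairs each unlabeled utterance with a transcription produced by several teacher models. The abstract does not say how the teachers' outputs are combined, so the majority vote below is an assumption, as are the `Teacher` type and the function name. Training the streaming student on the resulting pairs is not shown.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical teacher interface: maps an utterance (here just a string ID) to a transcript.
Teacher = Callable[[str], str]


def pseudo_label(
    utterances: Sequence[str], teachers: Sequence[Teacher]
) -> List[Tuple[str, str]]:
    """Pair each unlabeled utterance with a transcript chosen from the teachers' outputs."""
    if not teachers:
        raise ValueError("at least one teacher model is required")
    pairs = []
    for utt in utterances:
        hypotheses = [teacher(utt) for teacher in teachers]
        # Assumption: majority vote across teachers. On ties, Counter keeps
        # first-encountered order, so the earliest teacher's transcript wins.
        transcript, _ = Counter(hypotheses).most_common(1)[0]
        pairs.append((utt, transcript))
    return pairs
```

The returned `(utterance, transcript)` pairs play the role of the training set for the student model; other combination rules (for example, picking the most confident teacher) would fit the same interface.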
-
Publication No.: US11774596B2
Publication Date: 2023-10-03
Application No.: US17901224
Filing Date: 2022-09-01
Applicant: Google LLC
Inventor: Jonathon Shlens, Vijay Vasudevan, Jiquan Ngiam, Wei Han, Zhifeng Chen, Brandon Chauloon Yang, Benjamin James Caine, Zhengdong Zhang, Christoph Sprunk, Ouais Alsharif, Junhua Mao, Chen Wu
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for processing data generated by a sensing system that rotationally senses an environment. In one aspect, a method comprises partitioning a predetermined period of time into a plurality of sub-periods, wherein the predetermined period of time is a period of time for which data generated by the sensing system constitutes a complete rotational sensing of the environment; for each sub-period: receiving current data generated by the sensing system during the sub-period and characterizing a respective partial scene of the environment; processing the current data using an object detection neural network to generate a current object detection output that is specific to the respective partial scene of the environment.
-
Publication No.: US20240013777A1
Publication Date: 2024-01-11
Application No.: US18320458
Filing Date: 2023-05-19
Applicant: Google LLC
Inventor: Zhiyun Lu, Yu Zhang, Wei Han, Yongqiang Wang, Parisa Haghani, Zhehuai Chen
CPC classification number: G10L15/16, G10L15/063
Abstract: A method includes obtaining a corpus of unlabeled training data including a plurality of spoken utterances, each corresponding spoken utterance of the plurality of spoken utterances including audio data characterizing the corresponding spoken utterance. The method also includes receiving a target domain. The method also includes selecting, using a contrastive data selection model, a subset of the utterances from the corpus of unlabeled training data that correspond to the target domain. The method includes training an automatic speech recognition (ASR) model on the subset of utterances.
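Illustrative sketch (not part of the patent record): one common form of contrastive data selection scores each item by how much more likely it is under a target-domain model than under a general model, then keeps the top-scoring items. The abstract does not describe the selection model's internals, so everything below is an assumption: utterances are represented as sequences of hypothetical discrete tokens, and the two models are add-one-smoothed unigram models.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def train_unigram(corpus: Sequence[Sequence[str]], vocab: Sequence[str]) -> Dict[str, float]:
    """Add-one-smoothed unigram log-probabilities over a fixed vocabulary."""
    counts = Counter(tok for seq in corpus for tok in seq)
    total = sum(counts[t] for t in vocab) + len(vocab)
    return {t: math.log((counts[t] + 1) / total) for t in vocab}


def contrastive_score(
    seq: Sequence[str], target_lm: Dict[str, float], general_lm: Dict[str, float]
) -> float:
    """Length-normalized log-likelihood difference: target model minus general model."""
    if not seq:
        return 0.0
    return sum(target_lm[t] - general_lm[t] for t in seq) / len(seq)


def select_subset(
    corpus: Sequence[Sequence[str]],
    target_lm: Dict[str, float],
    general_lm: Dict[str, float],
    k: int,
) -> List[Sequence[str]]:
    """Keep the k sequences that look most like the target domain."""
    ranked = sorted(
        corpus,
        key=lambda s: contrastive_score(s, target_lm, general_lm),
        reverse=True,
    )
    return ranked[:k]
```

The selected subset would then be used to train the ASR model. Tokens outside `vocab` raise a `KeyError` in this sketch; a fuller version would map them to an unknown-token entry.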