Diffusion Models for Generation of Audio Data Based on Descriptive Textual Prompts

    Publication number: US20240282294A1

    Publication date: 2024-08-22

    Application number: US18651296

    Filing date: 2024-04-30

    Applicant: Google LLC

    CPC classification number: G10L15/063 G10L15/16

    Abstract: A corpus of textual data is generated with a machine-learned text generation model. The corpus of textual data includes a plurality of sentences. Each sentence is descriptive of a type of audio. For each of a plurality of audio recordings, the audio recording is processed with a machine-learned audio classification model to obtain training data including the audio recording and one or more sentences of the plurality of sentences closest to the audio recording within a joint audio-text embedding space of the machine-learned audio classification model. The sentence(s) are processed with a machine-learned generation model to obtain an intermediate representation of the one or more sentences. The intermediate representation is processed with a machine-learned cascaded diffusion model to obtain audio data. The machine-learned cascaded diffusion model is trained based on a difference between the audio data and the audio recording.
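
The pairing step in the abstract above (matching each audio recording to its closest sentences in a joint audio-text embedding space) can be pictured as a nearest-neighbour lookup. The following is a minimal sketch, not the patented implementation: the cosine-similarity metric, the helper names `cosine` and `nearest_sentences`, and the toy three-dimensional embeddings are all assumptions made for illustration. The abstract says only that the sentences are the "closest" ones and does not name a distance measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_sentences(audio_emb, sentence_embs, k=1):
    """Return the k sentences whose embeddings are most similar to audio_emb."""
    ranked = sorted(
        sentence_embs,
        key=lambda s: cosine(audio_emb, sentence_embs[s]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical embeddings; in practice both would come from the same
# joint audio-text model.
sentence_embs = {
    "a dog barking in the distance": [0.9, 0.1, 0.0],
    "rain falling on a tin roof": [0.0, 0.8, 0.6],
    "a crowd cheering": [0.1, 0.2, 0.9],
}
audio_emb = [0.05, 0.7, 0.7]

# Each (recording, captions) pair would become one training example.
captions = nearest_sentences(audio_emb, sentence_embs, k=2)
print(captions)
```

Under these assumptions, each recording plus its retrieved sentences forms one training pair for the downstream diffusion model.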

    Co-Training of Action Recognition Machine Learning Models

    Publication number: US20250037426A1

    Publication date: 2025-01-30

    Application number: US18716912

    Filing date: 2022-12-09

    Applicant: Google LLC

    Abstract: A method includes obtaining video datasets each including pairs of a training video and a ground-truth action classification of the training video. The method also includes generating an action recognition model that includes a shared encoder model and action classification heads. A number of the action classification heads may be equal to a number of the video datasets, and each action classification head may be configured to, based on an output of the shared encoder model, classify training videos sampled from a corresponding video dataset. The method also includes determining, by the action recognition model and for each training video sampled from the video datasets, an inferred action classification. The method further includes determining a loss value based on the inferred action classifications and the ground-truth action classifications, and adjusting parameters of the action recognition model based on the loss value.
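
The shared-encoder arrangement in the abstract above can be sketched as a routing rule: every video passes through the same encoder, and the head that scores it is chosen by the dataset it was sampled from. The code below is a simplified illustration under stated assumptions, not the claimed method: the linear encoder with ReLU, the cross-entropy loss, the plain averaging across datasets, and all names (`co_training_loss`, `encoder_W`, `heads`) are hypothetical choices. The abstract specifies only that a loss is computed from inferred and ground-truth classifications.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def co_training_loss(encoder_W, heads, batches):
    """Mean cross-entropy over examples drawn from several datasets.

    encoder_W: weights of the shared encoder (assumed linear + ReLU).
    heads:     dataset name -> weight matrix of that dataset's head.
    batches:   dataset name -> list of (features, label) pairs.
    """
    total, count = 0.0, 0
    for name, examples in batches.items():
        for features, label in examples:
            shared = [max(0.0, h) for h in matvec(encoder_W, features)]
            probs = softmax(matvec(heads[name], shared))  # dataset-specific head
            total += -math.log(probs[label])
            count += 1
    return total / count

# Two hypothetical datasets with different label spaces (2 and 3 classes).
encoder_W = [[1.0, 0.0], [0.0, 1.0]]
heads = {
    "dataset_a": [[0.0, 0.0], [0.0, 0.0]],
    "dataset_b": [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
}
batches = {
    "dataset_a": [([1.0, 2.0], 0)],
    "dataset_b": [([0.5, 0.5], 2)],
}
loss = co_training_loss(encoder_W, heads, batches)
```

Because the single loss value depends on the shared encoder through every head, a gradient step on it would update the encoder with signal from all datasets at once, which is the point of co-training.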

    Fast Emit Low-latency Streaming ASR with Sequence-level Emission Regularization

    Publication number: US20220122586A1

    Publication date: 2022-04-21

    Application number: US17447285

    Filing date: 2021-09-09

    Applicant: Google LLC

    Abstract: A computer-implemented method of training a streaming speech recognition model includes receiving, as input to the streaming speech recognition model, a sequence of acoustic frames. The streaming speech recognition model is configured to learn an alignment probability between the sequence of acoustic frames and an output sequence of vocabulary tokens. The vocabulary tokens include a plurality of label tokens and a blank token. At each output step, the method includes determining a first probability of emitting one of the label tokens and determining a second probability of emitting the blank token. The method also includes generating the alignment probability at a sequence level based on the first probability and the second probability. The method also includes applying a tuning parameter to the alignment probability at the sequence level to maximize the first probability of emitting one of the label tokens.
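
One way to picture the tuning parameter in the abstract above is as a weight that favours label emissions over blank emissions during training. The sketch below is an assumption, not the formulation in the filing: it scales the gradient on label transitions by `1 + lam` and leaves the gradient on blank transitions unchanged. The function name `regularize_emission_grads`, the parameter name `lam`, and the toy gradient values are hypothetical.

```python
def regularize_emission_grads(label_grads, blank_grads, lam=0.01):
    """Boost label-emission gradients relative to blank-emission gradients.

    label_grads: per-step gradients of the loss w.r.t. label-token scores.
    blank_grads: per-step gradients of the loss w.r.t. the blank-token score.
    lam:         tuning parameter (assumed non-negative); 0 disables the effect.
    """
    if lam < 0:
        raise ValueError("lam is assumed to be non-negative")
    scaled_labels = [g * (1.0 + lam) for g in label_grads]
    return scaled_labels, list(blank_grads)  # blank gradients are untouched

# Toy per-step gradients for a three-step output sequence.
label_grads = [1.0, 2.0, -0.5]
blank_grads = [0.3, -0.2, 0.1]
new_label, new_blank = regularize_emission_grads(label_grads, blank_grads, lam=0.1)
```

Under this assumption, a larger `lam` pushes the model to emit labels earlier rather than delaying them behind blank tokens, which is the latency effect the title refers to.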

    Contrastive Learning and Masked Modeling for End-To-End Self-Supervised Pre-Training

    Publication number: US20240104352A1

    Publication date: 2024-03-28

    Application number: US18012391

    Filing date: 2022-07-28

    Applicant: Google LLC

    CPC classification number: G06N3/0455

    Abstract: Provided are improved end-to-end self-supervised pre-training frameworks that leverage a combination of contrastive and masked modeling loss terms. In particular, the present disclosure provides a framework that combines contrastive learning and masked modeling, where the former trains the model to discretize input data (e.g., continuous signals such as continuous speech signals) into a finite set of discriminative tokens, and the latter trains the model to learn contextualized representations by solving a masked prediction task that consumes the discretized tokens. In contrast to certain existing masked modeling-based pre-training frameworks, which rely on an iterative re-clustering and re-training process, or other existing frameworks, which concatenate two separately trained modules, the proposed framework can enable a model to be optimized in an end-to-end fashion by solving the two self-supervised tasks (the contrastive task and masked modeling) simultaneously.
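
Solving the two self-supervised tasks "simultaneously", as the abstract above puts it, amounts to optimizing one objective that sums both loss terms. The following is a minimal sketch under stated assumptions: the contrastive term is written as a softmax over similarity scores with the positive candidate at index 0, the masked term as cross-entropy over discrete token ids, and the two are combined with a hypothetical `weight`. None of these specifics, nor the names `joint_loss`, `similarities`, `logits`, and `token_id`, come from the abstract.

```python
import math

def neg_log_softmax(scores, index):
    """Negative log-probability of scores[index] under a softmax."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[index]

def joint_loss(masked_frames, weight=1.0):
    """Average of (contrastive + weight * masked-prediction) over masked frames.

    Each frame is a dict with:
      similarities: scores against candidate quantized targets; index 0 is
                    the true target, the rest are distractors (assumption).
      logits:       predicted scores over the discrete token vocabulary.
      token_id:     the discretized token assigned to this frame.
    """
    total = 0.0
    for frame in masked_frames:
        contrastive = neg_log_softmax(frame["similarities"], 0)
        masked = neg_log_softmax(frame["logits"], frame["token_id"])
        total += contrastive + weight * masked
    return total / len(masked_frames)

# One masked frame with uninformative scores: 2 candidates, 4-token vocabulary.
frames = [{"similarities": [0.0, 0.0], "logits": [0.0, 0.0, 0.0, 0.0], "token_id": 1}]
loss = joint_loss(frames)
```

Because both terms feed a single scalar, one backward pass would update the quantizer and the context network together, in contrast to the iterative re-clustering or separately trained modules that the abstract mentions.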
