Supervised and unsupervised training with contrastive loss over sequences

    Publication Number: US12230249B2

    Publication Date: 2025-02-18

    Application Number: US17655903

    Filing Date: 2022-03-22

    Applicant: Google LLC

    Abstract: A method includes receiving audio data corresponding to an utterance and generating a pair of positive audio data examples. Here, each positive audio data example includes a respective augmented copy of the received audio data. For each respective positive audio data example, the method includes generating a respective sequence of encoder outputs and projecting the respective sequence of encoder outputs for the positive data example into a contrastive loss space. The method also includes determining an L2 distance between each corresponding encoder output in the projected sequences of encoder outputs for the positive audio data examples and determining a per-utterance consistency loss by averaging the L2 distances. The method also includes generating corresponding speech recognition results for each respective positive audio data example. The method also includes updating parameters of the speech recognition model based on a respective supervised loss term and the per-utterance consistency loss.
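    The per-utterance consistency loss described in the abstract can be sketched as follows. This is a minimal illustration, not the claimed implementation: the projection network and encoder are assumed to have already produced two aligned sequences of vectors, one per augmented (positive) copy of the utterance.

    ```python
    import numpy as np

    def per_utterance_consistency_loss(proj_a, proj_b):
        """Average L2 distance between corresponding projected encoder outputs.

        proj_a, proj_b: arrays of shape (num_frames, proj_dim), the projected
        encoder-output sequences for the two positive examples of one utterance.
        """
        # L2 distance between each corresponding pair of projected outputs.
        distances = np.linalg.norm(proj_a - proj_b, axis=-1)
        # The per-utterance loss averages the per-frame distances.
        return distances.mean()
    ```

    In training, this term would be added to the supervised loss (e.g., the transducer or CTC loss on the recognition results) before updating the model's parameters.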

    RESIDUAL ADAPTERS FOR FEW-SHOT TEXT-TO-SPEECH SPEAKER ADAPTATION

    Publication Number: US20240135915A1

    Publication Date: 2024-04-25

    Application Number: US18493770

    Filing Date: 2023-10-23

    Applicant: Google LLC

    CPC classification number: G10L13/027

    Abstract: A method for residual adapters for few-shot text-to-speech speaker adaptation includes obtaining a text-to-speech (TTS) model configured to convert text into representations of synthetic speech, the TTS model pre-trained on an initial training data set. The method further includes augmenting the TTS model with a stack of residual adapters. The method includes receiving an adaptation training data set including one or more spoken utterances spoken by a target speaker, each spoken utterance in the adaptation training data set paired with corresponding input text associated with a transcription of the spoken utterance. The method also includes adapting, using the adaptation training data set, the TTS model augmented with the stack of residual adapters to learn how to synthesize speech in a voice of the target speaker by optimizing the stack of residual adapters while parameters of the TTS model are frozen.
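    The key idea, a small trainable module added residually on top of a frozen pre-trained layer, can be sketched as below. The bottleneck shape and zero-initialized up-projection are common conventions for adapters, assumed here for illustration rather than taken from the patent.

    ```python
    import numpy as np

    class ResidualAdapter:
        """Small bottleneck module stacked on a frozen TTS layer.

        Only these weights are optimized during speaker adaptation; the
        pre-trained TTS parameters stay frozen.
        """
        def __init__(self, dim, bottleneck, rng):
            self.w_down = rng.standard_normal((dim, bottleneck)) * 0.01
            # Zero-initialized up-projection: the adapter starts as an
            # identity mapping, so adaptation begins from the frozen model.
            self.w_up = np.zeros((bottleneck, dim))

        def __call__(self, x):
            # Residual connection: adapter output is added to the frozen
            # layer's activation x.
            return x + np.tanh(x @ self.w_down) @ self.w_up
    ```

    Because the up-projection starts at zero, inserting the adapter stack leaves the pre-trained model's behavior unchanged until adaptation training begins.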

    Contrastive Learning and Masked Modeling for End-To-End Self-Supervised Pre-Training

    Publication Number: US20240104352A1

    Publication Date: 2024-03-28

    Application Number: US18012391

    Filing Date: 2022-07-28

    Applicant: Google LLC

    CPC classification number: G06N3/0455

    Abstract: Provided are improved end-to-end self-supervised pre-training frameworks that leverage a combination of contrastive and masked modeling loss terms. In particular, the present disclosure provides a framework that combines contrastive learning and masked modeling, where the former trains the model to discretize input data (e.g., continuous signals such as continuous speech signals) into a finite set of discriminative tokens, and the latter trains the model to learn contextualized representations via solving a masked prediction task consuming the discretized tokens. In contrast to certain existing masked modeling-based pre-training frameworks which rely on an iterative re-clustering and re-training process or other existing frameworks which concatenate two separately trained modules, the proposed framework can enable a model to be optimized in an end-to-end fashion by solving the two self-supervised tasks (the contrastive task and masked modeling) simultaneously.
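    The two pieces of the framework can be illustrated schematically: a discretization step that maps continuous frames to a finite token set, and a joint objective that sums the contrastive and masked-prediction losses so both tasks are solved in one end-to-end pass. Nearest-neighbor quantization and a simple weighted sum are illustrative assumptions, not the disclosed formulation.

    ```python
    import numpy as np

    def quantize(frames, codebook):
        """Map each continuous frame to its nearest codebook entry,
        producing a finite set of discrete token ids."""
        # Squared distance from every frame to every codebook vector.
        d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)

    def joint_loss(contrastive_loss, masked_prediction_loss, weight=1.0):
        """Single end-to-end objective: both self-supervised tasks are
        optimized simultaneously, with no separate training stages."""
        return contrastive_loss + weight * masked_prediction_loss
    ```

    The masked prediction task would then consume the token ids from `quantize` as its targets, while the contrastive task shapes the codebook itself.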

    END-TO-END SPEECH WAVEFORM GENERATION THROUGH DATA DENSITY GRADIENT ESTIMATION

    Publication Number: US20230252974A1

    Publication Date: 2023-08-10

    Application Number: US18010438

    Filing Date: 2021-09-02

    Applicant: Google LLC

    CPC classification number: G10L13/08 G10L21/0208

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating waveforms conditioned on phoneme sequences. In one aspect, a method comprises: obtaining a phoneme sequence; processing the phoneme sequence using an encoder neural network to generate a hidden representation of the phoneme sequence; generating, from the hidden representation, a conditioning input; initializing a current waveform output; and generating a final waveform output that defines an utterance of the phoneme sequence by a speaker by updating the current waveform output at each of a plurality of iterations, wherein each iteration corresponds to a respective noise level, and wherein the updating comprises, at each iteration: processing (i) the current waveform output and (ii) the conditioning input using a noise estimation neural network to generate a noise output; and updating the current waveform output using the noise output and the noise level for the iteration.
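    The iterative refinement loop in the abstract can be sketched as follows. The initialization from random noise and the simplified subtraction-style update are illustrative assumptions; the actual per-iteration update rule in the disclosure uses the noise output and noise level in a specific way not reproduced here.

    ```python
    import numpy as np

    def generate_waveform(conditioning, noise_levels, estimate_noise, length, rng):
        """Refine a waveform over a fixed schedule of noise levels.

        estimate_noise: stand-in for the noise estimation neural network;
        it takes the current waveform and the conditioning input and
        returns a predicted noise component of the same shape.
        """
        # Initialize the current waveform output (here: random noise).
        y = rng.standard_normal(length)
        for level in noise_levels:
            eps = estimate_noise(y, conditioning)  # noise output for this step
            y = y - level * eps                    # simplified update step
        return y
    ```

    Each pass through the loop corresponds to one iteration at one noise level, so the schedule length fixes the number of refinement steps regardless of the utterance.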

    Dynamic adjustment of delivery location based on user location

    Publication Number: US11687871B2

    Publication Date: 2023-06-27

    Application Number: US16865970

    Filing Date: 2020-05-04

    Applicant: Google LLC

    Inventor: Yu Zhang

    CPC classification number: G06Q10/08355

    Abstract: A user places an order on a merchant website associated with a merchant system via a user computing device. The user selects an option for delivery to the user computing device location within a delivery area during a delivery time window and authorizes a delivery system to log the location of the user computing device during and/or a period of time before the delivery time window. When the delivery time window arrives, the delivery system provides a delivery route to a delivery agent computing device. When the delivery agent arrives at the user computing device's location, the user receives an alert that the delivery agent has arrived and receives a package from the delivery agent. If the user does not remain within the delivery area, the user may cancel the order and the delivery, may reschedule the delivery, and/or may accept delivery of the order to a fixed shipping address.

    Two-Level Text-To-Speech Systems Using Synthetic Training Data

    Publication Number: US20230018384A1

    Publication Date: 2023-01-19

    Application Number: US17305809

    Filing Date: 2021-07-14

    Applicant: Google LLC

    Abstract: A method includes obtaining training data including a plurality of training audio signals and corresponding transcripts. Each training audio signal is spoken by a target speaker in a first accent/dialect. For each training audio signal of the training data, the method includes generating a training synthesized speech representation spoken by the target speaker in a second accent/dialect different than the first accent/dialect and training a text-to-speech (TTS) system based on the corresponding transcript and the training synthesized speech representation. The method also includes receiving an input text utterance to be synthesized into speech in the second accent/dialect. The method also includes obtaining conditioning inputs that include a speaker embedding and an accent/dialect identifier that identifies the second accent/dialect. The method also includes generating an output audio waveform corresponding to a synthesized speech representation of the input text utterance that clones the voice of the target speaker in the second accent/dialect.

    Building a Text-to-Speech System from a Small Amount of Speech Data

    Publication Number: US20220068256A1

    Publication Date: 2022-03-03

    Application Number: US17005974

    Filing Date: 2020-08-28

    Applicant: Google LLC

    Abstract: A method of building a text-to-speech (TTS) system from a small amount of speech data includes receiving a first plurality of recorded speech samples from an assortment of speakers and a second plurality of recorded speech samples from a target speaker where the assortment of speakers does not include the target speaker. The method further includes training a TTS model using the first plurality of recorded speech samples from the assortment of speakers. Here, the trained TTS model is configured to output synthetic speech as an audible representation of a text input. The method also includes re-training the trained TTS model using the second plurality of recorded speech samples from the target speaker combined with the first plurality of recorded speech samples from the assortment of speakers. Here, the re-trained TTS model is configured to output synthetic speech resembling speaking characteristics of the target speaker.
