Two-Level Speech Prosody Transfer
    Patent application

    Publication number: US20230064749A1

    Publication date: 2023-03-02

    Application number: US18054604

    Filing date: 2022-11-11

    Applicant: Google LLC

    Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.
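The two-stage flow described in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the function names, the `Voice` type, and the use of (pitch, energy) pairs as a stand-in for acoustic features are hypothetical and are not taken from the patent. The sketch only shows the data flow: a first model produces an intermediate representation that carries the prosody, an encoder pools it into a fixed-length utterance embedding, and a decoder combines the text, the embedding, and the target voice.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Frame = Tuple[float, float]  # (pitch, energy): toy stand-in for acoustic features


def first_tts_model(text: str, intended_prosody: str) -> List[Frame]:
    """Stage 1 (toy): intermediate representation that carries the prosody."""
    if not text.split():
        raise ValueError("input text utterance must contain at least one word")
    # Hypothetical prosody styles; the scale factor shapes pitch and energy.
    scale = {"neutral": 1.0, "excited": 1.5, "calm": 0.7}[intended_prosody]
    frames = []
    for i, word in enumerate(text.split()):
        pitch = scale * (100.0 + 5.0 * len(word))
        energy = scale * (1.0 + 0.1 * i)
        frames.append((pitch, energy))
    return frames


def encode_utterance(frames: Sequence[Frame]) -> Tuple[float, float]:
    """Stage 2 encoder (toy): pool frames into a fixed-length utterance embedding."""
    n = len(frames)
    return (sum(f[0] for f in frames) / n, sum(f[1] for f in frames) / n)


@dataclass
class Voice:
    name: str
    base_pitch: float  # toy speaker characteristic


def decode(text: str, embedding: Tuple[float, float], voice: Voice) -> List[Frame]:
    """Stage 2 decoder (toy): text + utterance embedding + target voice."""
    mean_pitch, mean_energy = embedding
    # Prosody enters as a ratio to a 100 Hz reference; the pitch register
    # comes from the target voice, not from the intermediate representation.
    ratio = mean_pitch / 100.0
    return [(voice.base_pitch * ratio, mean_energy) for _ in text.split()]


def synthesize(text: str, intended_prosody: str, voice: Voice) -> List[Frame]:
    """Run both levels: first model, then encoder and decoder of the second."""
    intermediate = first_tts_model(text, intended_prosody)
    embedding = encode_utterance(intermediate)
    return decode(text, embedding, voice)
```

In this sketch the same text rendered as "excited" and "calm" differs only through the utterance embedding, while swapping the `Voice` changes the pitch register without touching the prosody ratio.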

    Two-level speech prosody transfer
    Granted patent

    Publication number: US11514888B2

    Publication date: 2022-11-29

    Application number: US16992410

    Filing date: 2020-08-13

    Applicant: Google LLC

    Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.

    Training neural networks to generate structured embeddings

    Publication number: US11494695B2

    Publication date: 2022-11-08

    Application number: US16586223

    Filing date: 2019-09-27

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a machine learning model to generate embeddings of inputs to the machine learning model, the machine learning model having an encoder that generates the embeddings from the inputs and a decoder that generates outputs from the generated embeddings, wherein the embedding is partitioned into a sequence of embedding partitions that each includes one or more dimensions of the embedding, the operations comprising: for a first embedding partition in the sequence of embedding partitions: performing initial training to train the encoder and a decoder replica corresponding to the first embedding partition; for each particular embedding partition that is after the first embedding partition in the sequence of embedding partitions: performing incremental training to train the encoder and a decoder replica corresponding to the particular partition.
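The staged schedule in this abstract (initial training for the first embedding partition, incremental training for each later one, each with its own decoder replica) can be sketched as a small planning helper. The abstract does not say what each decoder replica sees during its stage; zeroing the dimensions of partitions that come later in the sequence is an assumption made here for illustration, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def partition_slices(sizes: Sequence[int]) -> List[Tuple[int, int]]:
    """Split embedding dimensions into consecutive (start, stop) partitions."""
    slices, start = [], 0
    for size in sizes:
        if size < 1:
            raise ValueError("each partition needs at least one dimension")
        slices.append((start, start + size))
        start += size
    return slices


def mask_embedding(
    embedding: Sequence[float], slices: Sequence[Tuple[int, int]], stage: int
) -> List[float]:
    """Assumed masking: zero every dimension after partition `stage`."""
    visible_stop = slices[stage][1]
    return [x if i < visible_stop else 0.0 for i, x in enumerate(embedding)]


def training_schedule(sizes: Sequence[int]) -> List[Dict[str, object]]:
    """One stage per partition: 'initial' for the first, 'incremental' after."""
    schedule = []
    for stage, (start, stop) in enumerate(partition_slices(sizes)):
        schedule.append(
            {
                "stage": stage,
                "mode": "initial" if stage == 0 else "incremental",
                "decoder_replica": stage,  # one replica per partition
                "partition_dims": list(range(start, stop)),
            }
        )
    return schedule
```

For example, `training_schedule([2, 3, 1])` yields three stages for a six-dimensional embedding, and `mask_embedding` at stage index 1 keeps the first five dimensions and zeroes the last.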

    Clockwork hierarchical variational encoder

    Publication number: US11264010B2

    Publication date: 2022-03-01

    Application number: US16678981

    Filing date: 2019-11-08

    Applicant: Google LLC

    Abstract: A method for providing a frame-based mel spectral representation of speech includes receiving a text utterance having at least one word, and selecting a mel spectral embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. For each phoneme, using the selected mel spectral embedding, the method also includes: predicting a duration of the corresponding phoneme by encoding linguistic features of the corresponding phoneme with a corresponding syllable embedding for the syllable that includes the corresponding phoneme; and generating a plurality of fixed-length predicted mel-frequency spectrogram frames based on the predicted duration for the corresponding phoneme. Each fixed-length predicted mel-frequency spectrogram frame represents mel-spectral information of the corresponding phoneme.
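The step from a predicted phoneme duration to a run of fixed-length frames can be sketched as follows. This is a minimal illustration, not the patented model: the duration predictor is replaced by a plain lookup table, the 5 ms frame length is an assumed value, and each frame is represented only by a (word index, syllable index, phoneme) label rather than by mel-spectral values.

```python
from typing import Dict, List, Tuple

# words -> syllables -> phonemes, mirroring the hierarchy in the abstract
Utterance = List[List[List[str]]]
FrameLabel = Tuple[int, int, str]  # (word index, syllable index, phoneme)


def expand_to_frames(
    utterance: Utterance,
    durations_ms: Dict[str, float],  # stand-in for the duration predictor
    frame_ms: float = 5.0,           # assumed fixed frame length
) -> List[FrameLabel]:
    """Emit one label per fixed-length frame, in utterance order."""
    frames: List[FrameLabel] = []
    for w, word in enumerate(utterance):
        for s, syllable in enumerate(word):
            for phoneme in syllable:
                # Each phoneme gets at least one frame.
                n = max(1, round(durations_ms[phoneme] / frame_ms))
                frames.extend((w, s, phoneme) for _ in range(n))
    return frames
```

With a one-word, one-syllable utterance `[[["h", "ai"]]]` and durations of 23 ms and 50 ms, the phonemes expand to 5 and 10 frames respectively.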

    TRAINING NEURAL NETWORKS TO GENERATE STRUCTURED EMBEDDINGS

    Publication number: US20210097427A1

    Publication date: 2021-04-01

    Application number: US16586223

    Filing date: 2019-09-27

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a machine learning model to generate embeddings of inputs to the machine learning model, the machine learning model having an encoder that generates the embeddings from the inputs and a decoder that generates outputs from the generated embeddings, wherein the embedding is partitioned into a sequence of embedding partitions that each includes one or more dimensions of the embedding, the operations comprising: for a first embedding partition in the sequence of embedding partitions: performing initial training to train the encoder and a decoder replica corresponding to the first embedding partition; for each particular embedding partition that is after the first embedding partition in the sequence of embedding partitions: performing incremental training to train the encoder and a decoder replica corresponding to the particular partition.

    TRAINING NEURAL NETWORKS TO GENERATE STRUCTURED EMBEDDINGS

    Publication number: US20230060886A1

    Publication date: 2023-03-02

    Application number: US18049995

    Filing date: 2022-10-26

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a machine learning model to generate embeddings of inputs to the machine learning model, the machine learning model having an encoder that generates the embeddings from the inputs and a decoder that generates outputs from the generated embeddings, wherein the embedding is partitioned into a sequence of embedding partitions that each includes one or more dimensions of the embedding, the operations comprising: for a first embedding partition in the sequence of embedding partitions: performing initial training to train the encoder and a decoder replica corresponding to the first embedding partition; for each particular embedding partition that is after the first embedding partition in the sequence of embedding partitions: performing incremental training to train the encoder and a decoder replica corresponding to the particular partition.
