-
Publication Number: US20230043916A1
Publication Date: 2023-02-09
Application Number: US17848831
Filing Date: 2022-06-24
Applicant: Amazon Technologies, Inc.
Inventor: Roberto Barra Chicote , Vatsal Aggarwal , Andrew Paul Breen , Javier Gonzalez Hernandez , Nishant Prateek
IPC: G10L13/10 , G06F40/30 , G10L13/033 , G10L13/047
Abstract: During text-to-speech processing, a speech model creates synthesized speech that corresponds to input data. The speech model may include an encoder for encoding the input data into a context vector and a decoder for decoding the context vector into spectrogram data. The speech model may further include a voice decoder that receives vocal characteristic data representing a desired vocal characteristic of synthesized speech. The voice decoder may process the vocal characteristic data to determine configuration data, such as weights, for use by the speech decoder.
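The abstract describes a voice decoder that turns vocal-characteristic data into configuration data (such as weights) consumed by the speech decoder. A minimal sketch of that pattern, in the spirit of a hypernetwork, follows; the module names, layer choices, and dimensions are illustrative assumptions rather than the patented design.

```python
# Minimal sketch (not the patented implementation): a "voice decoder" maps a
# vocal-characteristic embedding to weights and biases used by the speech
# decoder, hypernetwork-style. All names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VoiceDecoder(nn.Module):
    """Turns vocal-characteristic data into configuration data (linear-layer weights)."""
    def __init__(self, vocal_dim=16, ctx_dim=128, mel_dim=80):
        super().__init__()
        self.ctx_dim, self.mel_dim = ctx_dim, mel_dim
        self.to_weight = nn.Linear(vocal_dim, ctx_dim * mel_dim)
        self.to_bias = nn.Linear(vocal_dim, mel_dim)

    def forward(self, vocal_characteristics):
        w = self.to_weight(vocal_characteristics).view(self.mel_dim, self.ctx_dim)
        b = self.to_bias(vocal_characteristics)
        return w, b

class SpeechDecoder(nn.Module):
    """Decodes a context vector into a spectrogram frame using externally supplied weights."""
    def forward(self, context, weight, bias):
        return F.linear(context, weight, bias)

encoder = nn.GRU(input_size=64, hidden_size=128, batch_first=True)  # encodes the input data
voice_decoder, speech_decoder = VoiceDecoder(), SpeechDecoder()

tokens = torch.randn(1, 20, 64)           # stand-in for encoded text/phoneme features
_, context = encoder(tokens)              # context vector summarizing the input
w, b = voice_decoder(torch.randn(16))     # configuration data from vocal characteristics
frame = speech_decoder(context.squeeze(0), w, b)
print(frame.shape)                        # torch.Size([1, 80])
```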
-
Publication Number: US20210287656A1
Publication Date: 2021-09-16
Application Number: US16818542
Filing Date: 2020-03-13
Applicant: Amazon Technologies, Inc.
Inventor: Antonio Bonafonte , Panagiotis Agis Oikonomou Filandras , Bartosz Perz , Arent van Korlaar , Ioannis Douratsos , Jonas Felix Ananda Rohnke , Elena Sokolova , Andrew Paul Breen , Nikhil Sharma
IPC: G10L13/10 , G10L15/22 , G10L15/18 , G10L13/047 , G10L15/16
Abstract: A speech-processing system receives both text data and natural-understanding data (e.g., a domain, intent, and/or entity) related to a command represented in the text data. The system uses the natural-understanding data to vary vocal characteristics when determining spectrogram data corresponding to the text data.
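One plausible way to realize the described conditioning is to embed the natural-understanding labels (domain, intent) and feed them alongside the encoded text when predicting spectrogram frames. The sketch below assumes specific vocabularies, embedding sizes, and a GRU-based layout that are not taken from the patent.

```python
# Hedged sketch: conditioning spectrogram prediction on natural-understanding
# data (domain/intent). Vocabularies, embedding sizes, and the GRU layout are
# assumptions made for illustration only.
import torch
import torch.nn as nn

class NLUConditionedTTS(nn.Module):
    def __init__(self, n_domains=8, n_intents=32, text_dim=64, mel_dim=80):
        super().__init__()
        self.domain_emb = nn.Embedding(n_domains, 16)
        self.intent_emb = nn.Embedding(n_intents, 16)
        self.text_encoder = nn.GRU(text_dim, 128, batch_first=True)
        # The decoder sees text encodings concatenated with the NLU conditioning vector.
        self.decoder = nn.GRU(128 + 32, 128, batch_first=True)
        self.to_mel = nn.Linear(128, mel_dim)

    def forward(self, text_feats, domain_id, intent_id):
        enc, _ = self.text_encoder(text_feats)                  # (B, T, 128)
        cond = torch.cat([self.domain_emb(domain_id),
                          self.intent_emb(intent_id)], dim=-1)  # (B, 32)
        cond = cond.unsqueeze(1).expand(-1, enc.size(1), -1)    # broadcast over time
        dec, _ = self.decoder(torch.cat([enc, cond], dim=-1))
        return self.to_mel(dec)                                 # (B, T, 80) spectrogram data

model = NLUConditionedTTS()
mel = model(torch.randn(2, 30, 64), torch.tensor([1, 3]), torch.tensor([5, 7]))
print(mel.shape)  # torch.Size([2, 30, 80])
```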
-
Publication Number: US10692484B1
Publication Date: 2020-06-23
Application Number: US16007757
Filing Date: 2018-06-13
Applicant: Amazon Technologies, Inc.
Inventor: Thomas Edward Merritt , Adam Franciszek Nadolski , Nishant Prateek , Bartosz Putrycz , Roberto Barra Chicote , Vatsal Aggarwal , Andrew Paul Breen
IPC: G10L13/04 , G10L13/08 , G10L25/24 , G10L25/60 , G10L13/047
Abstract: A speech model is trained using multi-task learning. A first task may correspond to how well predicted audio matches training audio; a second task may correspond to a metric of perceived audio quality. The speech model may include, during training, layers related to the second task that are discarded at runtime.
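A hedged sketch of the described multi-task setup follows: one head predicts audio frames (matching training audio), a second head predicts a perceived-quality score, and the quality layers are simply not evaluated at runtime. Layer sizes, the quality target, and the loss weighting are assumptions.

```python
# Illustrative multi-task sketch: an audio head (task 1) plus a perceived-quality
# head (task 2) used only during training. Sizes, the quality target, and the
# 0.1 loss weight are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSpeechModel(nn.Module):
    def __init__(self, in_dim=64, hidden=128, mel_dim=80):
        super().__init__()
        self.backbone = nn.GRU(in_dim, hidden, batch_first=True)
        self.audio_head = nn.Linear(hidden, mel_dim)   # kept at runtime
        self.quality_head = nn.Linear(hidden, 1)       # training only

    def forward(self, x, with_quality=True):
        h, _ = self.backbone(x)
        audio = self.audio_head(h)
        quality = self.quality_head(h.mean(dim=1)) if with_quality else None
        return audio, quality

model = MultiTaskSpeechModel()
x, target_audio = torch.randn(4, 50, 64), torch.randn(4, 50, 80)
target_quality = torch.rand(4, 1)                      # e.g. a MOS-like score
audio, quality = model(x)
loss = F.mse_loss(audio, target_audio) + 0.1 * F.mse_loss(quality, target_quality)
loss.backward()

audio, _ = model(x, with_quality=False)                # runtime: quality layers discarded
```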
-
Publication Number: US11823655B2
Publication Date: 2023-11-21
Application Number: US17836330
Filing Date: 2022-06-09
Applicant: Amazon Technologies, Inc.
Inventor: Antonio Bonafonte , Panagiotis Agis Oikonomou Filandras , Bartosz Perz , Arent van Korlaar , Ioannis Douratsos , Jonas Felix Ananda Rohnke , Elena Sokolova , Andrew Paul Breen , Nikhil Sharma
IPC: G10L13/10 , G10L13/047 , G10L15/16 , G10L15/18 , G10L15/22
CPC classification number: G10L13/10 , G10L13/047 , G10L15/16 , G10L15/1815 , G10L15/22 , G10L2015/223
Abstract: A speech-processing system receives both text data and natural-understanding data (e.g., a domain, intent, and/or entity) related to a command represented in the text data. The system uses the natural-understanding data to vary vocal characteristics when determining spectrogram data corresponding to the text data.
-
Publication Number: US11545134B1
Publication Date: 2023-01-03
Application Number: US16709792
Filing Date: 2019-12-10
Applicant: Amazon Technologies, Inc.
Inventor: Marcello Federico , Robert Enyedi , Yaser Al-Onaizan , Roberto Barra-Chicote , Andrew Paul Breen , Ritwik Giri , Mehmet Umut Isik , Arvindh Krishnaswamy , Hassan Sawaf
IPC: G10L13/08 , G10L15/22 , G11B20/10 , G06F3/16 , G10L13/10 , G06F40/47 , G10L25/90 , G10L15/06 , G10L13/00 , G10L15/26 , G06V40/16
Abstract: Techniques for the generation of dubbed audio for an audio/visual file are described. An exemplary approach is to receive a request to generate dubbed speech for an audio/visual file and, in response to the request: extract speech segments associated with identified speakers from an audio track of the audio/visual file; translate the extracted speech segments into a target language; determine a machine learning model per identified speaker, the trained machine learning models to be used to generate spoken versions of the translated, extracted speech segments based on the identified speaker; generate, per translated, extracted speech segment, a spoken version using the trained machine learning model that corresponds to the identified speaker of that segment and prosody information for the extracted speech segments; and replace the extracted speech segments in the audio track of the audio/visual file with the spoken versions of the translated, extracted speech segments to generate a modified audio track.
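The sequence of steps in the abstract can be made concrete with a high-level pipeline sketch. The ASR, translation, and per-speaker TTS components are not specified in the abstract, so every helper below is a hypothetical stub; only the ordering of operations mirrors the description.

```python
# High-level pipeline sketch; every helper is a hypothetical stub standing in
# for a real diarization/ASR, translation, or per-speaker TTS model.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SpeechSegment:
    speaker_id: str
    start: float                # seconds into the audio track
    end: float
    text: str                   # source-language transcript
    prosody: Dict[str, float]   # prosody information carried into synthesis

def extract_speech_segments(track: dict) -> List[SpeechSegment]:
    return track["segments"]                    # stub: diarization + ASR would go here

def translate(text: str, target_language: str) -> str:
    return f"[{target_language}] {text}"        # stub: machine translation

class SpeakerTTS:
    def synthesize(self, text: str, prosody: Dict[str, float]) -> str:
        return f"<audio:{text}>"                # stub: model trained per identified speaker

def dub(track: dict, target_language: str, speaker_models: Dict[str, SpeakerTTS]) -> dict:
    for seg in extract_speech_segments(track):
        translated = translate(seg.text, target_language)
        dubbed = speaker_models[seg.speaker_id].synthesize(translated, seg.prosody)
        track["audio"][(seg.start, seg.end)] = dubbed   # replace the original segment
    return track                                        # modified audio track

track = {"audio": {}, "segments": [SpeechSegment("spk1", 0.0, 2.5, "hello", {"pitch": 1.0})]}
print(dub(track, "es", {"spk1": SpeakerTTS()})["audio"])
```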
-
Publication Number: US11017763B1
Publication Date: 2021-05-25
Application Number: US16712466
Filing Date: 2019-12-12
Applicant: Amazon Technologies, Inc.
Inventor: Vatsal Aggarwal , Nishant Prateek , Roberto Barra Chicote , Andrew Paul Breen
IPC: G10L15/22 , G10L15/26 , G10L13/08 , G10L13/047 , G10L13/033
Abstract: During text-to-speech processing, a sequence-to-sequence neural network model may process text data and determine corresponding spectrogram data. A normalizing flow component may then process this spectrogram data to predict corresponding phase data. An inverse Fourier transform may then be performed on the spectrogram and phase data to create an audio waveform that includes speech corresponding to the text.
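The final step of the described approach, combining magnitude-spectrogram data with predicted phase data and applying an inverse transform, can be illustrated with standard library calls. In the sketch below the normalizing-flow phase predictor is replaced by random phases purely for demonstration, and the dimensions are arbitrary.

```python
# Sketch of the final step only: combine magnitude-spectrogram data with phase
# data and apply an inverse STFT to obtain a waveform. Random phases stand in
# for the normalizing-flow prediction.
import torch

n_fft, hop, frames = 1024, 256, 200
magnitude = torch.rand(n_fft // 2 + 1, frames)                    # spectrogram data
phase = (torch.rand(n_fft // 2 + 1, frames) * 2 - 1) * torch.pi   # stand-in for predicted phase

complex_spec = torch.polar(magnitude, phase)                      # magnitude * exp(i * phase)
waveform = torch.istft(complex_spec, n_fft=n_fft, hop_length=hop,
                       window=torch.hann_window(n_fft))
print(waveform.shape)                                             # roughly (frames - 1) * hop samples
```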
-
Publication Number: US10706837B1
Publication Date: 2020-07-07
Application Number: US16007811
Filing Date: 2018-06-13
Applicant: Amazon Technologies, Inc.
Inventor: Roberto Barra Chicote , Adam Franciszek Nadolski , Thomas Edward Merritt , Bartosz Putrycz , Andrew Paul Breen
IPC: G10L13/033 , G10L13/04 , G10L13/10
Abstract: A speech model includes a sub-model corresponding to a vocal attribute. The speech model generates an output waveform using a sample model, which receives text data, and a conditioning model, which receives text metadata and produces a prosody output for use by the sample model. If, during training or runtime, a different vocal attribute is desired or needed, the sub-model is re-trained or switched to a different sub-model corresponding to the different vocal attribute.
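A rough sketch of the swap-a-sub-model idea: a conditioning model maps text metadata to a prosody output, a sample model consumes text features plus that prosody, and a sub-model keyed by vocal attribute can be re-trained or switched. The layer types, dimensions, and per-step outputs below are assumptions, not the patented architecture.

```python
# Rough sketch, not the patented design: conditioning model -> prosody output;
# sample model consumes text features + prosody; sub-model per vocal attribute
# can be switched out.
import torch
import torch.nn as nn

class ConditioningModel(nn.Module):
    def __init__(self, meta_dim=8, prosody_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(meta_dim, 32), nn.ReLU(),
                                 nn.Linear(32, prosody_dim))
    def forward(self, metadata):
        return self.net(metadata)                      # prosody output

class SampleModel(nn.Module):
    def __init__(self, text_dim=64, prosody_dim=16):
        super().__init__()
        self.rnn = nn.GRU(text_dim + prosody_dim, 128, batch_first=True)
        self.sub_models = nn.ModuleDict({              # one sub-model per vocal attribute
            "newscaster": nn.Linear(128, 1),
            "whisper": nn.Linear(128, 1),
        })
    def forward(self, text_feats, prosody, vocal_attribute):
        prosody = prosody.unsqueeze(1).expand(-1, text_feats.size(1), -1)
        h, _ = self.rnn(torch.cat([text_feats, prosody], dim=-1))
        return self.sub_models[vocal_attribute](h)     # per-step output (illustrative)

conditioning, sampler = ConditioningModel(), SampleModel()
prosody = conditioning(torch.randn(1, 8))              # from text metadata
audio_a = sampler(torch.randn(1, 40, 64), prosody, "newscaster")
audio_b = sampler(torch.randn(1, 40, 64), prosody, "whisper")   # switched sub-model
```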
-
Publication Number: US20240013770A1
Publication Date: 2024-01-11
Application Number: US18206301
Filing Date: 2023-06-06
Applicant: Amazon Technologies, Inc.
Inventor: Jaime Lorenzo Trueba , Thomas Renaud Drugman , Viacheslav Klimkov , Srikanth Ronanki , Thomas Edward Merritt , Andrew Paul Breen , Roberto Barra-Chicote
IPC: G10L13/047
CPC classification number: G10L13/047
Abstract: During text-to-speech processing, a speech model creates output audio data, including speech, that corresponds to input text data that includes a representation of the speech. A spectrogram estimator estimates a frequency spectrogram of the speech; the corresponding frequency-spectrogram data is used to condition the speech model. A plurality of acoustic features corresponding to different segments of the input text data, such as phonemes, syllable-level features, and/or word-level features, may be separately encoded into context vectors; the spectrogram estimator uses these separate context vectors to create the frequency spectrogram.
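The multi-resolution conditioning can be sketched as three separate encoders, one per feature level, whose context vectors are combined by a spectrogram estimator. The encoder types, dimensions, and single-frame output below are simplifying assumptions made for illustration.

```python
# Minimal sketch: phoneme-, syllable-, and word-level features are encoded into
# separate context vectors and combined to estimate a frequency spectrogram.
# Encoders, dimensions, and the single-frame output are assumptions.
import torch
import torch.nn as nn

class SpectrogramEstimator(nn.Module):
    def __init__(self, phone_dim=32, syll_dim=16, word_dim=16, mel_dim=80):
        super().__init__()
        self.phone_enc = nn.GRU(phone_dim, 64, batch_first=True)
        self.syll_enc = nn.GRU(syll_dim, 32, batch_first=True)
        self.word_enc = nn.GRU(word_dim, 32, batch_first=True)
        self.estimator = nn.Linear(64 + 32 + 32, mel_dim)

    def forward(self, phones, sylls, words):
        _, c_p = self.phone_enc(phones)    # each feature stream gets its own context vector
        _, c_s = self.syll_enc(sylls)
        _, c_w = self.word_enc(words)
        context = torch.cat([c_p, c_s, c_w], dim=-1).squeeze(0)   # (B, 128)
        return self.estimator(context)                            # estimated spectrogram frame

est = SpectrogramEstimator()
frame = est(torch.randn(2, 50, 32), torch.randn(2, 12, 16), torch.randn(2, 9, 16))
print(frame.shape)  # torch.Size([2, 80])
```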
-
Publication Number: US20230058658A1
Publication Date: 2023-02-23
Application Number: US17882691
Filing Date: 2022-08-08
Applicant: Amazon Technologies, Inc.
Inventor: Jaime Lorenzo Trueba , Thomas Renaud Drugman , Viacheslav Klimkov , Srikanth Ronanki , Thomas Edward Merritt , Andrew Paul Breen , Roberto Barra-Chicote
Abstract: During text-to-speech processing, a speech model creates output audio data, including speech, that corresponds to input text data that includes a representation of the speech. A spectrogram estimator estimates a frequency spectrogram of the speech; the corresponding frequency-spectrogram data is used to condition the speech model. A plurality of acoustic features corresponding to different segments of the input text data, such as phonemes, syllable-level features, and/or word-level features, may be separately encoded into context vectors; the spectrogram estimator uses these separate context vectors to create the frequency spectrogram.
-
Publication Number: US11410639B2
Publication Date: 2022-08-09
Application Number: US16922590
Filing Date: 2020-07-07
Applicant: Amazon Technologies, Inc.
Inventor: Jaime Lorenzo Trueba , Thomas Renaud Drugman , Viacheslav Klimkov , Srikanth Ronanki , Thomas Edward Merritt , Andrew Paul Breen , Roberto Barra-Chicote
Abstract: During text-to-speech processing, a speech model creates output audio data, including speech, that corresponds to input text data that includes a representation of the speech. A spectrogram estimator estimates a frequency spectrogram of the speech; the corresponding frequency-spectrogram data is used to condition the speech model. A plurality of acoustic features corresponding to different segments of the input text data, such as phonemes, syllable-level features, and/or word-level features, may be separately encoded into context vectors; the spectrogram estimator uses these separate context vectors to create the frequency spectrogram.
-