Method and system for text-to-speech synthesis of streaming text

    Publication Number: US12249313B2

    Publication Date: 2025-03-11

    Application Number: US17914010

    Filing Date: 2020-10-27

    Applicant: GOOGLE LLC

    Abstract: A method and system are disclosed for speech synthesis of streaming text. At a text-to-speech ("TTS") system, a real-time streaming text string having a starting point and an ending point may be received, and a first sub-string comprising a first portion of the text string received from an initial point to a first trigger point may be accumulated. The initial point is no earlier than the starting point and is prior to the first trigger point, and the first trigger point is no later than the ending point. A punctuation model of the TTS system may be applied to the first sub-string to generate a pre-processed first sub-string comprising the first sub-string with added grammatical punctuation as determined by the punctuation model. TTS synthesis processing may be applied to at least the pre-processed first sub-string to generate first synthesized speech, and audio playout of the first synthesized speech may be produced.
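
    The sketch below illustrates the streaming flow described in this abstract: accumulate a sub-string up to a trigger point, apply a punctuation model, then synthesize and emit audio. The punctuation model and TTS engine are stand-in callables rather than Google's implementation, and the word-count trigger heuristic is an illustrative assumption.

    from typing import Callable, Iterator, List

    def stream_tts(
        text_stream: Iterator[str],            # real-time streaming text chunks
        punctuate: Callable[[str], str],       # punctuation model (stand-in)
        synthesize: Callable[[str], bytes],    # TTS synthesis (stand-in)
        trigger_words: int = 8,                # hypothetical trigger-point heuristic
    ) -> Iterator[bytes]:
        """Accumulate sub-strings up to a trigger point, punctuate, then synthesize."""
        buffer: List[str] = []
        for chunk in text_stream:
            buffer.append(chunk)
            if len(buffer) >= trigger_words:       # trigger point reached
                sub_string = " ".join(buffer)
                buffer.clear()
                pre_processed = punctuate(sub_string)  # add grammatical punctuation
                yield synthesize(pre_processed)        # synthesized speech for playout
        if buffer:                                     # flush remainder at the ending point
            yield synthesize(punctuate(" ".join(buffer)))

    if __name__ == "__main__":
        words = iter("this is a real time streaming text string example".split())
        audio = stream_tts(
            words,
            punctuate=lambda s: s.capitalize() + ".",
            synthesize=lambda s: s.encode("utf-8"),    # placeholder "audio" bytes
        )
        for segment in audio:
            print(segment)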

    Robust Direct Speech-to-Speech Translation
    Invention Publication

    Publication Number: US20240273311A1

    Publication Date: 2024-08-15

    Application Number: US18626745

    Filing Date: 2024-04-04

    Applicant: Google LLC

    CPC classification number: G06F40/58 G10L13/02 G10L13/10 G10L19/16

    Abstract: A direct speech-to-speech translation (S2ST) model includes an encoder configured to receive an input speech representation that corresponds to an utterance spoken by a source speaker in a first language and encode the input speech representation into a hidden feature representation. The S2ST model also includes an attention module configured to generate a context vector that attends to the hidden feature representation encoded by the encoder. The S2ST model also includes a decoder configured to receive the context vector generated by the attention module and predict a phoneme representation that corresponds to a translation of the utterance in a second, different language. The S2ST model also includes a synthesizer configured to receive the context vector and the phoneme representation and generate a translated synthesized speech representation that corresponds to a translation of the utterance spoken in the second, different language.
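
    The following structural sketch maps the four components named in this abstract (encoder, attention module, decoder, synthesizer) onto a small PyTorch module. Layer types, sizes, and the single-query attention step are illustrative assumptions, not the patented architecture.

    import torch
    import torch.nn as nn

    class DirectS2ST(nn.Module):
        def __init__(self, feat_dim=80, hidden_dim=256, num_phonemes=100, mel_dim=80):
            super().__init__()
            # Encoder: input speech representation -> hidden feature representation
            self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
            # Attention module: produces a context vector over the hidden features
            self.attention = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
            self.query = nn.Parameter(torch.randn(1, 1, hidden_dim))
            # Decoder: context vector -> phoneme representation in the target language
            self.decoder = nn.Linear(hidden_dim, num_phonemes)
            # Synthesizer: context + phoneme representation -> translated speech frames
            self.synthesizer = nn.Linear(hidden_dim + num_phonemes, mel_dim)

        def forward(self, speech_features):            # (batch, time, feat_dim)
            hidden, _ = self.encoder(speech_features)  # hidden feature representation
            query = self.query.expand(speech_features.size(0), -1, -1)
            context, _ = self.attention(query, hidden, hidden)  # context vector
            phonemes = self.decoder(context)                     # phoneme logits
            mel = self.synthesizer(torch.cat([context, phonemes], dim=-1))
            return phonemes, mel       # phoneme prediction + synthesized speech frames

    if __name__ == "__main__":
        model = DirectS2ST()
        phonemes, mel = model(torch.randn(2, 120, 80))   # random stand-in features
        print(phonemes.shape, mel.shape)                 # (2, 1, 100) and (2, 1, 80)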

    Robust direct speech-to-speech translation

    Publication Number: US11960852B2

    Publication Date: 2024-04-16

    Application Number: US17644351

    Filing Date: 2021-12-15

    Applicant: Google LLC

    CPC classification number: G06F40/58 G10L13/02 G10L13/10 G10L19/16

    Abstract: A direct speech-to-speech translation (S2ST) model includes an encoder configured to receive an input speech representation that corresponds to an utterance spoken by a source speaker in a first language and encode the input speech representation into a hidden feature representation. The S2ST model also includes an attention module configured to generate a context vector that attends to the hidden feature representation encoded by the encoder. The S2ST model also includes a decoder configured to receive the context vector generated by the attention module and predict a phoneme representation that corresponds to a translation of the utterance in a second, different language. The S2ST model also includes a synthesizer configured to receive the context vector and the phoneme representation and generate a translated synthesized speech representation that corresponds to a translation of the utterance spoken in the second, different language.

    Robust Direct Speech-to-Speech Translation

    Publication Number: US20230013777A1

    Publication Date: 2023-01-19

    Application Number: US17644351

    Filing Date: 2021-12-15

    Applicant: Google LLC

    Abstract: A direct speech-to-speech translation (S2ST) model includes an encoder configured to receive an input speech representation that corresponds to an utterance spoken by a source speaker in a first language and encode the input speech representation into a hidden feature representation. The S2ST model also includes an attention module configured to generate a context vector that attends to the hidden feature representation encoded by the encoder. The S2ST model also includes a decoder configured to receive the context vector generated by the attention module and predict a phoneme representation that corresponds to a translation of the utterance in a second, different language. The S2ST model also includes a synthesizer configured to receive the context vector and the phoneme representation and generate a translated synthesized speech representation that corresponds to a translation of the utterance spoken in the second, different language.
