Audio synthesis method and apparatus, computer readable medium, and electronic device

    Publication No.: US12106746B2

    Publication date: 2024-10-01

    Application No.: US17703136

    Filing date: 2022-03-24

    Inventor: Shilun Lin

    Abstract: This application discloses a method, an apparatus, a computer readable medium, and an electronic device for audio synthesis. The method includes: acquiring mixed language text information comprising text characters corresponding to at least two language types; performing text coding processing on the mixed language text information based on the at least two language types, to obtain an intermediate semantic coding feature of the mixed language text information; acquiring a target tone feature corresponding to a target tone subject, and performing decoding processing on the intermediate semantic coding feature based on the target tone feature to obtain an acoustic feature; and performing acoustic coding processing on the acoustic feature to obtain audio corresponding to the mixed language text information.
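    The data flow named in the abstract (text coding, tone-conditioned decoding, acoustic coding) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the patent: the function names, the two-language inventory, the character-range language rule, and the arithmetic standing in for trained encoder, decoder, and vocoder models are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical two-language inventory; the abstract only requires "at least two".
LANG_IDS: Dict[str, int] = {"en": 0, "zh": 1}


def language_of(ch: str) -> str:
    """Toy rule: CJK Unified Ideographs count as "zh", everything else as "en"."""
    return "zh" if "\u4e00" <= ch <= "\u9fff" else "en"


def encode_text(text: str) -> List[Tuple[float, float]]:
    """Text coding step: one (character feature, language feature) pair per character."""
    return [
        (ord(ch) % 101 / 101.0, float(LANG_IDS[language_of(ch)]))
        for ch in text
    ]


def decode(semantic: List[Tuple[float, float]], tone: float) -> List[float]:
    """Decoding step: condition each semantic feature on the target tone feature."""
    return [char_feat + 0.5 * lang_feat + tone for char_feat, lang_feat in semantic]


def vocode(acoustic: List[float], samples_per_frame: int = 4) -> List[float]:
    """Acoustic coding step: expand each acoustic frame into waveform samples."""
    return [frame for frame in acoustic for _ in range(samples_per_frame)]


def synthesize(text: str, tone: float) -> List[float]:
    """Chain the three stages in the order the abstract lists them."""
    return vocode(decode(encode_text(text), tone))


if __name__ == "__main__":
    audio = synthesize("Hi 你好", tone=0.1)
    print(len(audio), "samples")
```

    In this sketch the tone feature is a single number added to every frame; a real system would presumably use a learned speaker embedding, but the stage boundaries are the point here.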

    SYSTEMS AND METHODS FOR RECONSTRUCTING VIDEO DATA USING CONTEXTUALLY-AWARE MULTI-MODAL GENERATION DURING SIGNAL LOSS

    Publication No.: US20240321260A1

    Publication date: 2024-09-26

    Application No.: US18126212

    Filing date: 2023-03-24

    IPC classification: G10L13/027 G10L13/08 H04N7/15

    Abstract: A device may receive video data that includes a text transcript, audio sequences, and image frames, and may detect a network fluctuation. The device may process the text transcript to generate a new phrase, and may generate a response phoneme based on the new phrase. The device may generate a text embedding based on the response phoneme, and may process the audio sequences to generate a target voice sequence. The device may generate an audio embedding based on the target voice sequence, and may process the image frames to generate a target image sequence. The device may generate an image embedding based on the target image sequence, and may combine the embeddings to generate an embedding input vector. The device may generate a final voice response and a final video based on the embedding input vector, and may provide the video data, the final voice response, and the final video.
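    The step of combining the three embeddings into one embedding input vector can be sketched as follows. The abstract does not say how the embeddings are computed or combined, so both choices here are assumptions: `embed` is a hypothetical mean-pooling stand-in for a trained embedding model, and `combine` uses plain concatenation.

```python
from typing import List, Sequence


def embed(values: Sequence[float], dim: int = 4) -> List[float]:
    """Hypothetical embedding: mean-pool the input into `dim` interleaved buckets."""
    buckets: List[List[float]] = [[] for _ in range(dim)]
    for i, value in enumerate(values):
        buckets[i % dim].append(value)
    # Empty buckets (input shorter than `dim`) fall back to 0.0.
    return [sum(b) / len(b) if b else 0.0 for b in buckets]


def combine(
    text_emb: Sequence[float],
    audio_emb: Sequence[float],
    image_emb: Sequence[float],
) -> List[float]:
    """Assumed combination rule: concatenate text, audio, and image embeddings."""
    return [*text_emb, *audio_emb, *image_emb]


if __name__ == "__main__":
    text_emb = embed([1, 2, 3, 4, 5, 6, 7, 8])   # stands in for response phonemes
    audio_emb = embed([0.1, 0.2, 0.3])           # stands in for the target voice sequence
    image_emb = embed([9.0])                     # stands in for the target image sequence
    vector = combine(text_emb, audio_emb, image_emb)
    print(len(vector))
```

    Concatenation keeps each modality in a fixed slice of the vector, which makes the sketch easy to inspect; the patent may well combine the embeddings differently.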