Invention Publication
- Patent Title: FUSED ACOUSTIC AND TEXT ENCODING FOR MULTIMODAL BILINGUAL PRETRAINING AND SPEECH TRANSLATION
- Application No.: US17533687
- Application Date: 2021-11-23
- Publication No.: US20230169281A1
- Publication Date: 2023-06-01
- Inventor: Renjie ZHENG, Junkun CHEN, Mingbo MA, Liang HUANG
- Applicant: Baidu USA, LLC
- Applicant Address: US CA Sunnyvale
- Assignee: Baidu USA LLC
- Current Assignee: Baidu USA LLC
- Current Assignee Address: US CA Sunnyvale
- Main IPC: G06F40/58
- IPC: G06F40/58; G10L15/06; G10L15/28

Abstract:
Representation learning for text and speech has improved many language-related tasks. However, existing methods only learn from one input modality, while a unified representation for both speech and text is needed for tasks such as end-to-end speech translation. Consequently, these methods cannot exploit various large-scale text and speech data and their performance is limited by the scarcity of parallel speech translation data. To address these problems, embodiments of a fused acoustic and text masked language model (FAT-MLM) are disclosed. FAT-MLM embodiments jointly learn a unified representation for both acoustic and text input from various types of corpora including parallel data for speech recognition and machine translation, and pure speech and text data. Within this cross-modal representation learning framework, an end-to-end model is further presented for fused acoustic and text speech translation. Experiments show that by fine-tuning from FAT-MLM, the speech translation model embodiments substantially improve translation quality.
Public/Granted literature
- US12050882B2 Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation (Public/Granted day: 2024-07-30)