FUSED ACOUSTIC AND TEXT ENCODING FOR MULTIMODAL BILINGUAL PRETRAINING AND SPEECH TRANSLATION

Invention Publication

US20230169281A1 FUSED ACOUSTIC AND TEXT ENCODING FOR MULTIMODAL BILINGUAL PRETRAINING AND SPEECH TRANSLATION (Status: Pending - Published)

Patent Title: FUSED ACOUSTIC AND TEXT ENCODING FOR MULTIMODAL BILINGUAL PRETRAINING AND SPEECH TRANSLATION
Application No.: US17533687

Application Date: 2021-11-23
Publication No.: US20230169281A1

Publication Date: 2023-06-01
Inventor: Renjie ZHENG, Junkun CHEN, Mingbo MA, Liang HUANG
Applicant: Baidu USA, LLC
Applicant Address: US CA Sunnyvale
Assignee: Baidu USA LLC
Current Assignee: Baidu USA LLC
Current Assignee Address: US CA Sunnyvale
Main IPC: G06F40/58
IPC: G06F40/58; G10L15/06; G10L15/28

Abstract:

Representation learning for text and speech has improved many language-related tasks. However, existing methods only learn from one input modality, while a unified representation for both speech and text is needed for tasks such as end-to-end speech translation. Consequently, these methods cannot exploit various large-scale text and speech data and their performance is limited by the scarcity of parallel speech translation data. To address these problems, embodiments of a fused acoustic and text masked language model (FAT-MLM) are disclosed. FAT-MLM embodiments jointly learn a unified representation for both acoustic and text input from various types of corpora including parallel data for speech recognition and machine translation, and pure speech and text data. Within this cross-modal representation learning framework, an end-to-end model is further presented for fused acoustic and text speech translation. Experiments show that by fine-tuning from FAT-MLM, the speech translation model embodiments substantially improve translation quality.

Public/Granted literature

US12050882B2 Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation (Public/Granted date: 2024-07-30)

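IPC Classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models are classified in G06N)
G06F40/00	Handling natural language data (speech analysis or synthesis, speech recognition: G10L)
G06F40/40	. Processing or translation of natural language (natural language analysis is classified in G06F40/20; semantic analysis in G06F40/30)
G06F40/58	.. Use of machine translation, e.g. for multilingual retrieval, for server-side translation for client devices, or for real-time translation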