-
1.
Publication number: US20220198144A1
Publication date: 2022-06-23
Application number: US17127734
Application date: 2020-12-18
Applicant: Google LLC
Inventor: Yinfei Yang , Ziyi Yang , Daniel Matthew Cer
IPC: G06F40/284 , G06N20/00 , G06N3/04
Abstract: The present disclosure provides a novel sentence-level representation learning method, Conditional Masked Language Modeling (CMLM), for training on large-scale unlabeled corpora. CMLM outperforms the previous state-of-the-art English sentence embedding models, including those trained with (semi-)supervised signals. For multilingual representation learning, it is shown that co-training CMLM with bitext retrieval and cross-lingual NLI fine-tuning achieves state-of-the-art performance. It is also shown that multilingual representations exhibit a language bias, and that principal component removal (PCR) can eliminate this bias by separating language identity information from semantics.
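The PCR step mentioned in the abstract can be illustrated with a short sketch: remove the top principal components of a set of sentence embeddings, which are assumed to carry language-identity rather than semantic information. This is a minimal NumPy illustration of the general technique, not the patent's specific implementation; the function name and component count are assumptions.

```python
import numpy as np

def remove_principal_components(embeddings, n_components=1):
    """Principal component removal (PCR) sketch: subtract each embedding's
    projection onto the top principal components of the collection."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # SVD of the centered matrix; rows of vt are the principal axes
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]                # (n_components, dim)
    projection = centered @ top.T @ top    # component along the top axes
    return centered - projection

# Toy usage: 100 random "sentence embeddings" of dimension 16
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))
debiased = remove_principal_components(emb, n_components=2)
```

After removal, the debiased embeddings have zero projection onto the discarded axes, so any signal concentrated there (here, hypothetically, language identity) is gone.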
-
2.
Publication number: US11769011B2
Publication date: 2023-09-26
Application number: US17127734
Application date: 2020-12-18
Applicant: Google LLC
Inventor: Yinfei Yang , Ziyi Yang , Daniel Matthew Cer
IPC: G06F40/284 , G06N3/04 , G06N20/00
CPC classification number: G06F40/284 , G06N3/04 , G06N20/00
Abstract: The present disclosure provides a novel sentence-level representation learning method, Conditional Masked Language Modeling (CMLM), for training on large-scale unlabeled corpora. CMLM outperforms the previous state-of-the-art English sentence embedding models, including those trained with (semi-)supervised signals. For multilingual representation learning, it is shown that co-training CMLM with bitext retrieval and cross-lingual natural language inference (NLI) fine-tuning achieves state-of-the-art performance. It is also shown that multilingual representations exhibit a language bias, and that principal component removal (PCR) can eliminate this bias by separating language identity information from semantics.
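The data preparation behind CMLM can be sketched as standard masked language modeling applied to one sentence of a pair, with the model (not shown here) predicting the masked tokens conditioned on a sentence-encoder embedding of the neighboring sentence. The mask token, mask rate, and function below are illustrative assumptions, not details from the patent.

```python
import random

MASK = "[MASK]"  # placeholder token; the actual vocabulary is model-specific

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Randomly replace tokens with [MASK]; return the masked sequence
    and an aligned label list marking which tokens must be recovered."""
    rng = rng or random.Random(0)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)    # prediction target for the model
        else:
            masked.append(tok)
            labels.append(None)   # not a prediction target
    return masked, labels

# Sentence pair: a CMLM-style model would embed `context` with a sentence
# encoder and condition the masked-token predictions on that embedding.
context = "the cat sat on the mat".split()
target = "it was a very comfortable mat".split()
masked_target, labels = mask_tokens(target)
```

Because the masked tokens can only be recovered well by using the conditioning sentence, the sentence encoder is pushed to produce embeddings that capture sentence-level meaning.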
-