Universal Language Segment Representations Learning with Conditional Masked Language Model

    Publication number: US20220198144A1

    Publication date: 2022-06-23

    Application number: US17127734

    Filing date: 2020-12-18

    Applicant: Google LLC

    Abstract: The present disclosure provides a novel sentence-level representation learning method, Conditional Masked Language Modeling (CMLM), for training on large-scale unlabeled corpora. CMLM outperforms previous state-of-the-art English sentence embedding models, including those trained with (semi-)supervised signals. For multilingual representation learning, it is shown that co-training CMLM with bitext retrieval and cross-lingual natural language inference (NLI) fine-tuning achieves state-of-the-art performance. It is also shown that multilingual representations exhibit language bias, and that principal component removal (PCR) can eliminate this bias by separating language-identity information from semantics.
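The abstract describes conditioning a masked language modeling objective on an adjacent sentence, so that the sentence embedding is forced to carry meaning useful for recovering masked tokens. A minimal sketch of how such a training example might be constructed (the function name, mask rate, and mask token are illustrative assumptions, not taken from the patent):

```python
import random

def make_cmlm_example(context_sentence, target_tokens,
                      mask_rate=0.15, mask_token="[MASK]"):
    """Hypothetical sketch of CMLM example construction: the model must
    recover masked tokens of the target sentence while conditioning on an
    embedding of the adjacent (context) sentence, which pushes that
    embedding to encode sentence-level semantics."""
    masked, labels = [], []
    for tok in target_tokens:
        if random.random() < mask_rate:
            masked.append(mask_token)   # hide this token
            labels.append(tok)          # token the model must predict
        else:
            masked.append(tok)
            labels.append(None)         # position is not scored
    return context_sentence, masked, labels

context, masked, labels = make_cmlm_example(
    "The cat sat on the mat.",
    ["It", "looked", "very", "comfortable", "."])
print(context, masked, labels)
```

In training, the context sentence would be encoded to a fixed-size vector that conditions the masked-token predictions; this sketch covers only the data-side masking step.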

    Universal language segment representations learning with conditional masked language model

    Publication number: US11769011B2

    Publication date: 2023-09-26

    Application number: US17127734

    Filing date: 2020-12-18

    Applicant: Google LLC

    CPC classification number: G06F40/284 G06N3/04 G06N20/00

    Abstract: The present disclosure provides a novel sentence-level representation learning method, Conditional Masked Language Modeling (CMLM), for training on large-scale unlabeled corpora. CMLM outperforms previous state-of-the-art English sentence embedding models, including those trained with (semi-)supervised signals. For multilingual representation learning, it is shown that co-training CMLM with bitext retrieval and cross-lingual natural language inference (NLI) fine-tuning achieves state-of-the-art performance. It is also shown that multilingual representations exhibit language bias, and that principal component removal (PCR) can eliminate this bias by separating language-identity information from semantics.
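Principal component removal, as named in the abstract, can be sketched with plain NumPy: subtract each embedding's projection onto the top principal components of the embedding set, on the assumption (stated in the abstract) that those directions carry language-identity rather than semantic information. The data and the number of removed components here are illustrative:

```python
import numpy as np

def remove_principal_components(x, k=1):
    """Principal component removal (PCR) sketch: center the embeddings and
    subtract the projection onto the top-k principal directions."""
    x = x - x.mean(axis=0, keepdims=True)            # center the set
    # SVD of the centered matrix; rows of vt are principal directions.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    top = vt[:k]                                     # (k, dim)
    return x - x @ top.T @ top                       # remove the projection

# Toy stand-in for multilingual sentence embeddings (8 vectors, 16 dims).
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
cleaned = remove_principal_components(emb, k=1)

# Sanity check: cleaned embeddings are orthogonal to the removed direction.
_, _, vt = np.linalg.svd(emb - emb.mean(axis=0), full_matrices=False)
print(np.abs(cleaned @ vt[0]).max())
```

The same idea appears in general-purpose embedding post-processing; here it serves to strip the shared language-bias direction while leaving the remaining semantic variance intact.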
