Universal Language Segment Representations Learning with Conditional Masked Language Model

    Publication No.: US20220198144A1

    Publication Date: 2022-06-23

    Application No.: US17127734

    Filing Date: 2020-12-18

    Applicant: Google LLC

    Abstract: The present disclosure provides a novel sentence-level representation learning method, Conditional Masked Language Modeling (CMLM), for training on large-scale unlabeled corpora. CMLM outperforms previous state-of-the-art English sentence embedding models, including those trained with (semi-)supervised signals. For multilingual representation learning, it is shown that co-training CMLM with bitext retrieval and cross-lingual NLI fine-tuning achieves state-of-the-art performance. It is also shown that multilingual representations exhibit language bias, and that principal component removal (PCR) can eliminate this bias by separating language identity information from semantics.
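
    The core idea of the CMLM objective, predicting masked tokens in one sentence conditioned on a dense encoding of an adjacent sentence, can be sketched in a few lines. The toy model below (random weights, a hypothetical mean-pool sentence encoder, additive conditioning) only illustrates the loss computation, not the patented architecture:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy vocabulary; [MASK] is a reserved token. All weights are random
    # stand-ins for a miniature illustrative model.
    vocab = ["[MASK]", "the", "cat", "sat", "dog", "ran", "home"]
    V, d = len(vocab), 8
    tok = {w: i for i, w in enumerate(vocab)}

    E = rng.normal(0, 0.1, (V, d))   # token embeddings
    P = rng.normal(0, 0.1, (d, d))   # projects the sentence vector
    W = rng.normal(0, 0.1, (d, V))   # output (vocabulary) projection

    def encode_sentence(words):
        """Mean-pool token embeddings into a fixed-size sentence vector."""
        return E[[tok[w] for w in words]].mean(axis=0)

    def cmlm_loss(context, masked, targets):
        """Cross-entropy on the masked positions of `masked`, conditioned
        on the embedding of the adjacent `context` sentence."""
        s = encode_sentence(context) @ P        # conditioning vector
        h = E[[tok[w] for w in masked]] + s     # inject context into each token
        logits = h @ W
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        idx = [i for i, w in enumerate(masked) if w == "[MASK]"]
        return -np.mean([np.log(probs[i, tok[t]]) for i, t in zip(idx, targets)])

    loss = cmlm_loss(["the", "dog", "ran"], ["the", "[MASK]", "sat"], ["cat"])
    ```

    In the method as described, gradients of this loss train both the sentence encoder and the masked-prediction head; the encoder's pooled output then serves as the sentence embedding.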

    SOFT KNOWLEDGE PROMPTS FOR LANGUAGE MODELS
    Invention Publication

    Publication No.: US20240273294A1

    Publication Date: 2024-08-15

    Application No.: US18166806

    Filing Date: 2023-02-09

    Applicant: Google LLC

    CPC classification number: G06F40/295 G06N3/0455 G06N3/084

    Abstract: The technology employs soft knowledge prompts (KPs) to inject relevant world knowledge into language models. This includes training KPs via self-supervised learning on data from one or more knowledge bases. KPs are task independent and can function as an external memory of the language models. KPs may be entity-centric, meaning that each prompt primarily encodes information about one entity from a given knowledge base. A method includes identifying a KP in response to a received input text, concatenating that KP to a sequence of word embeddings of the input text, applying the concatenated information to a trained language model, predicting an object entity name, computing a cross-entropy loss, and updating the identified KP based on the computed cross-entropy loss.
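
    The training loop described above, concatenating a knowledge prompt to the input embeddings, computing a cross-entropy loss against the object entity, and updating only the KP, can be sketched as follows. This is a minimal numpy illustration with a frozen linear read-out standing in for the language model; all shapes and names are assumptions, not the patented implementation:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    d, V, L = 16, 10, 4          # embed dim, entity vocab size, input length

    # Frozen "language model": here just a linear read-out over the
    # mean-pooled sequence. Only the knowledge prompt is trainable.
    W = rng.normal(0, 0.5, (d, V))
    word_embs = rng.normal(0, 0.5, (L, d))  # embeddings of the input text
    kp = rng.normal(0, 0.5, d)              # soft knowledge prompt (trainable)
    target = 3                              # index of the gold object entity

    def forward(kp):
        seq = np.vstack([kp, word_embs])    # concatenate KP to word embeddings
        h = seq.mean(axis=0)
        logits = h @ W
        p = np.exp(logits - logits.max())
        p /= p.sum()
        return p, -np.log(p[target]), seq.shape[0]

    losses = []
    for _ in range(50):                     # update ONLY the KP; W stays frozen
        p, loss, n = forward(kp)
        err = p.copy()
        err[target] -= 1.0                  # dloss/dlogits for softmax + CE
        kp -= 1.0 * (W @ err) / n           # chain rule through the mean pool
        losses.append(loss)
    ```

    Because the language model parameters are frozen, the KPs accumulate entity-specific knowledge and can be looked up later as an external memory at inference time.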

    Universal language segment representations learning with conditional masked language model

    Publication No.: US11769011B2

    Publication Date: 2023-09-26

    Application No.: US17127734

    Filing Date: 2020-12-18

    Applicant: Google LLC

    CPC classification number: G06F40/284 G06N3/04 G06N20/00

    Abstract: The present disclosure provides a novel sentence-level representation learning method, Conditional Masked Language Modeling (CMLM), for training on large-scale unlabeled corpora. CMLM outperforms previous state-of-the-art English sentence embedding models, including those trained with (semi-)supervised signals. For multilingual representation learning, it is shown that co-training CMLM with bitext retrieval and cross-lingual natural language inference (NLI) fine-tuning achieves state-of-the-art performance. It is also shown that multilingual representations exhibit language bias, and that principal component removal (PCR) can eliminate this bias by separating language identity information from semantics.
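
    Principal component removal as described here amounts to projecting out the top principal directions of the embedding matrix, which tend to encode language identity rather than semantics. A minimal numpy sketch on synthetic data, where two "languages" differ only by a fixed offset (an illustrative assumption, not the patented procedure):

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    # Toy multilingual embeddings: two "languages" share semantics but are
    # offset along one dominant direction (the language-identity component).
    n, d = 200, 12
    base = rng.normal(0, 1.0, (n, d))
    offset = np.zeros(d)
    offset[0] = 5.0                          # strong language-specific axis
    X = np.vstack([base[: n // 2] + offset,  # language A
                   base[n // 2:] - offset])  # language B

    def pcr(X, k=1):
        """Principal component removal: subtract the projection onto the
        top-k principal directions of the centered embeddings."""
        Xc = X - X.mean(axis=0)
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        top = Vt[:k]                         # top-k principal directions
        return Xc - Xc @ top.T @ top

    X_clean = pcr(X, k=1)
    ```

    After removal, the language-identity axis carries little variance, so nearest-neighbor retrieval across languages is driven by the remaining semantic dimensions.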
