-
1.
Publication No.: US20220198144A1
Publication Date: 2022-06-23
Application No.: US17127734
Filing Date: 2020-12-18
Applicant: Google LLC
Inventor: Yinfei Yang, Ziyi Yang, Daniel Matthew Cer
IPC: G06F40/284, G06N20/00, G06N3/04
Abstract: The present disclosure provides a novel sentence-level representation learning method, Conditional Masked Language Modeling (CMLM), for training on large-scale unlabeled corpora. CMLM outperforms the previous state-of-the-art English sentence embedding models, including those trained with (semi-)supervised signals. For multilingual representation learning, it is shown that co-training CMLM with bitext retrieval and cross-lingual natural language inference (NLI) fine-tuning achieves state-of-the-art performance. It is also shown that multilingual representations exhibit a language bias, and that principal component removal (PCR) can eliminate this bias by separating language-identity information from semantics.
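Below is a minimal, illustrative PyTorch sketch of the CMLM training signal: a vector encoding one sentence conditions masked-token prediction in an adjacent sentence. The toy vocabulary, dimensions, and mean-pooled encoder are assumptions for brevity, not the patented architecture.

```python
# A minimal sketch of Conditional Masked Language Modeling (CMLM);
# all names, sizes, and the simple encoder are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB, DIM, MASK_ID = 1000, 64, 0

class CMLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, DIM)
        # Sentence encoder: mean-pooled token embeddings -> sentence vector.
        self.proj = nn.Linear(DIM, DIM)
        # MLM head conditioned on the neighboring sentence's vector.
        self.head = nn.Linear(2 * DIM, VOCAB)

    def forward(self, sent_a, sent_b_masked):
        # Encode sentence A into a single vector (the learned representation).
        s = self.proj(self.tok(sent_a).mean(dim=1))          # (B, DIM)
        h = self.tok(sent_b_masked)                          # (B, T, DIM)
        # Condition every position of sentence B on sentence A's vector.
        cond = torch.cat([h, s.unsqueeze(1).expand_as(h)], dim=-1)
        return self.head(cond)                               # (B, T, VOCAB)

model = CMLM()
a = torch.randint(1, VOCAB, (8, 12))       # sentence A token ids
b = torch.randint(1, VOCAB, (8, 12))       # sentence B token ids
mask = torch.rand(b.shape) < 0.15          # mask ~15% of sentence B tokens
b_masked = b.masked_fill(mask, MASK_ID)
logits = model(a, b_masked)
# Cross-entropy only on masked positions; after training, the sentence
# vector `s` serves as the embedding.
loss = nn.functional.cross_entropy(logits[mask], b[mask])
loss.backward()
```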
-
2.
Publication No.: US20250045316A1
Publication Date: 2025-02-06
Application No.: US18788178
Filing Date: 2024-07-30
Applicant: Google LLC
Inventor: Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Iftekhar Naim, Yi Luan, Blair Yuxin Chen, Siddhartha Reddy Jonnalagadda, Ming-Wei Chang, Daniel Matthew Cer, Gustavo Adolfo Hernandez Abrego, Jeremy Robert Cole, Colin Hearne Evans, Yuzhe Zhao, Pranay Bhatia, Rajvi Kapadia, Riham Hassan Abdel-Moneim Mansour, Raphael Dominik Hoffman, Simon Kunio Tokumine, Scott Bradley Huffman, Stephen Zachary Karukas, Michael Yiupun Kwong, Shu Zheng, Yan Qiao, Lukas Rutishauser, Anand Rajan Iyer
Abstract: An example method includes providing, to a sequence model, (i) a plurality of few-shot prompts, wherein each prompt comprises a demonstration passage, a demonstration task, and a demonstration query, wherein the demonstration task describes a type of retrieval, and wherein the demonstration query is relevant to the demonstration task, and (ii) a plurality of passages sampled from a corpus of passages. The method also includes receiving, from the sequence model, for the plurality of passages, and based on the plurality of few-shot prompts, a respective plurality of predicted task-query pairs, the sequence model having been prompted to predict a task based on an input passage and predict an output query relevant to the predicted task. The method further includes generating a synthetic training dataset comprising the plurality of passages and the respective plurality of predicted task-query pairs. The method also includes providing the synthetic training dataset.
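The flow in this abstract can be sketched as follows, assuming a generic `generate(prompt) -> str` hook standing in for the sequence model; the prompt layout and the parsing rule are illustrative assumptions, not the patent's exact format.

```python
# A minimal sketch of few-shot task/query generation for synthetic training
# data; FEW_SHOT contents and the "Passage/Task/Query" layout are assumptions.
FEW_SHOT = [
    {"passage": "Photosynthesis converts light into chemical energy.",
     "task": "Retrieve a passage that answers the question.",
     "query": "How do plants make energy from sunlight?"},
]

def build_prompt(passage, shots=FEW_SHOT):
    demo = "\n\n".join(
        f"Passage: {s['passage']}\nTask: {s['task']}\nQuery: {s['query']}"
        for s in shots
    )
    # The model is asked to continue with a task and a query for the new passage.
    return f"{demo}\n\nPassage: {passage}\nTask:"

def predict_task_query(passage, generate):
    # `generate` is a placeholder for the sequence model's text completion.
    out = generate(build_prompt(passage))
    task, _, query = out.partition("Query:")
    return task.strip(), query.strip()

def synthetic_dataset(passages, generate):
    # Pair each sampled passage with its predicted (task, query).
    return [(p, *predict_task_query(p, generate)) for p in passages]
```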
-
3.
Publication No.: US20240273294A1
Publication Date: 2024-08-15
Application No.: US18166806
Filing Date: 2023-02-09
Applicant: Google LLC
Inventor: Siamak Shakeri, Cicero Nogueira dos Santos, Daniel Matthew Cer, Zhe Dong, Jianmo Ni, Yun-Hsuan Sung, John Nham
IPC: G06F40/295, G06N3/0455, G06N3/084
CPC classification number: G06F40/295, G06N3/0455, G06N3/084
Abstract: The technology employs soft knowledge prompts (KPs) to inject relevant world knowledge into language models. This includes training KPs via self-supervised learning on data from one or more knowledge bases. KPs are task-independent and can function as an external memory of the language models. KPs may be entity-centric, meaning that each prompt primarily encodes information about one entity from a given knowledge base. A method includes identifying a KP in response to a received input text, concatenating that KP to a sequence of word embeddings of the input text, applying the concatenated information to a trained language model, predicting an object entity name, computing a cross-entropy loss, and updating the identified KP based on the computed cross-entropy loss.
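A minimal sketch of one KP training step follows, assuming one KP vector per entity and a toy stand-in for the trained language model; the shapes, names, and entity-lookup rule are illustrative assumptions.

```python
# A minimal sketch of updating a soft knowledge prompt (KP) via cross-entropy;
# the toy "language model" (a linear head over a mean-pooled sequence) is an
# assumption standing in for the trained model in the abstract.
import torch
import torch.nn as nn

VOCAB, DIM, N_ENTITIES = 1000, 64, 500
kps = nn.Embedding(N_ENTITIES, DIM)          # external memory: one KP per entity
word_emb = nn.Embedding(VOCAB, DIM)
lm_head = nn.Linear(DIM, VOCAB)              # stands in for the trained LM
opt = torch.optim.Adam(kps.parameters(), lr=1e-3)   # only KPs are optimized

def train_step(token_ids, entity_id, target_id):
    # 1. Identify the KP for the entity mentioned in the input text.
    kp = kps(torch.tensor([entity_id]))                  # (1, DIM)
    # 2. Concatenate the KP to the sequence of word embeddings.
    h = torch.cat([kp, word_emb(token_ids)], dim=0)      # (1+T, DIM)
    # 3. Apply the (here: toy) language model; predict the object entity name.
    logits = lm_head(h.mean(dim=0))                      # (VOCAB,)
    # 4. Cross-entropy loss, then update only the identified KP.
    loss = nn.functional.cross_entropy(logits.unsqueeze(0),
                                       torch.tensor([target_id]))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

train_step(torch.randint(0, VOCAB, (10,)), entity_id=42, target_id=7)
```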
-
4.
Publication No.: US20240020546A1
Publication Date: 2024-01-18
Application No.: US17863840
Filing Date: 2022-07-13
Applicant: Google LLC
Inventor: Tu Thanh Vu, Daniel Matthew Cer, Noah Constant, Brian David Lester, Rami Al-Rfou
IPC: G06N5/02
CPC classification number: G06N5/022
Abstract: Systems and methods for prompt tuning can utilize previously learned prompts to initialize tuning of prompts for new tasks that may differ from the task associated with the previously learned prompt. The prompt utilized for initialization can be a generic prompt and/or a prompt selected based on a determined similarity between two or more task embeddings.
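A minimal sketch of the similarity-based selection follows, assuming task embeddings are already available (for example, averaged prompt-token vectors); the cosine-similarity rule and all names are illustrative assumptions.

```python
# A minimal sketch of initializing a new task's soft prompt from the
# previously learned prompt whose task embedding is most similar.
import torch

def pick_source_prompt(new_task_emb, library):
    """library: {task_name: (task_emb, learned_prompt)} from earlier tuning."""
    best, best_sim = None, -1.0
    for name, (emb, prompt) in library.items():
        sim = torch.nn.functional.cosine_similarity(new_task_emb, emb, dim=0)
        if sim > best_sim:
            best, best_sim = prompt, sim.item()
    # Clone so tuning the new task does not modify the stored prompt.
    return best.clone().requires_grad_(True)

# Toy usage: two previously tuned tasks; pick the closer one to initialize.
lib = {
    "qa":  (torch.randn(16), torch.randn(20, 64)),   # (task emb, prompt tokens)
    "nli": (torch.randn(16), torch.randn(20, 64)),
}
init_prompt = pick_source_prompt(torch.randn(16), lib)  # then tuned on the new task
```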
-
5.
Publication No.: US11769011B2
Publication Date: 2023-09-26
Application No.: US17127734
Filing Date: 2020-12-18
Applicant: Google LLC
Inventor: Yinfei Yang, Ziyi Yang, Daniel Matthew Cer
IPC: G06F40/284, G06N3/04, G06N20/00
CPC classification number: G06F40/284, G06N3/04, G06N20/00
Abstract: The present disclosure provides a novel sentence-level representation learning method, Conditional Masked Language Modeling (CMLM), for training on large-scale unlabeled corpora. CMLM outperforms the previous state-of-the-art English sentence embedding models, including those trained with (semi-)supervised signals. For multilingual representation learning, it is shown that co-training CMLM with bitext retrieval and cross-lingual natural language inference (NLI) fine-tuning achieves state-of-the-art performance. It is also shown that multilingual representations exhibit a language bias, and that principal component removal (PCR) can eliminate this bias by separating language-identity information from semantics.
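Both publications of this application mention principal component removal; a minimal NumPy sketch of that step is below, assuming the top principal components of the embedding matrix carry language-identity information. The number of removed components `k` is an illustrative choice.

```python
# A minimal sketch of principal component removal (PCR) over a matrix of
# multilingual sentence embeddings (one embedding per row).
import numpy as np

def pcr(embeddings, k=1):
    """Remove the top-k principal components from row-wise embeddings."""
    x = embeddings - embeddings.mean(axis=0, keepdims=True)
    # SVD of the centered matrix; rows of vt are principal directions.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    top = vt[:k]                       # (k, dim) directions to remove
    # Project out the language-identity subspace, keep the semantic residue.
    return x - x @ top.T @ top

cleaned = pcr(np.random.randn(100, 64), k=2)
```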