Patent search ap:("Adobe Inc.") AND inv:"Cesa Salaam" Page 1

1.

Published application
GENERATING SYNTHETIC CODE-SWITCHED DATA FOR TRAINING LANGUAGE MODELS (status: under examination, published)

Publication number: US20230259718A1

Publication date: 2023-08-17

Application number: US17651555

Filing date: 2022-02-17

Applicant: Adobe Inc.

Inventor: Cesa Salaam, Seunghyun Yoon, Trung Huu Bui, Franck Dernoncourt

IPC: G06F40/58, G06F40/47, G06N3/04, G06N3/08

CPC classification number: G06F40/58, G06F40/47, G06N3/0454, G06N3/08

Abstract: Techniques for training a language model for code switching content are disclosed. Such techniques include, in some embodiments, generating a dataset, which includes identifying one or more portions within textual content in a first language, the identified one or more portions each including one or more of offensive content or non-offensive content; translating the identified one or more salient portions to a second language; and reintegrating the translated one or more portions into the textual content to generate code-switched textual content. In some cases, the textual content in the first language includes offensive content and non-offensive content, the identified one or more portions include the offensive content, and the translated one or more portions include a translated version of the offensive content. In some embodiments, the code-switched textual content is at least part of a synthetic dataset usable to train a language model, such as a multilingual classification model.
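The three steps named in the abstract (identify portions, translate them, reintegrate them) can be illustrated with a minimal Python sketch. This is not the patented method: the lexicon `TOY_LEXICON`, the word-matching rule, and the function names `identify_portions`, `translate_portion`, and `code_switch` are all hypothetical stand-ins chosen for illustration, and a real system would presumably use a learned salience detector and a translation model.

```python
import re

# Hypothetical toy English-to-Spanish lexicon (assumption, not from the patent).
# A real pipeline would call a machine translation model instead.
TOY_LEXICON = {"awful": "horrible", "liar": "mentiroso"}


def identify_portions(text, salient_words):
    """Return (start, end) character spans of words found in salient_words."""
    spans = []
    for match in re.finditer(r"\w+", text):
        if match.group(0).lower() in salient_words:
            spans.append((match.start(), match.end()))
    return spans


def translate_portion(portion):
    """Translate one portion with the toy lexicon; unknown words pass through."""
    return TOY_LEXICON.get(portion.lower(), portion)


def code_switch(text, salient_words, translate=translate_portion):
    """Translate the identified spans and splice them back into the text."""
    pieces = []
    cursor = 0
    for start, end in identify_portions(text, salient_words):
        pieces.append(text[cursor:start])           # untouched first-language text
        pieces.append(translate(text[start:end]))   # translated portion
        cursor = end
    pieces.append(text[cursor:])                    # remaining tail
    return "".join(pieces)


def make_synthetic_dataset(labeled_texts, salient_words):
    """Map (text, label) pairs to code-switched (text, label) pairs."""
    return [(code_switch(text, salient_words), label) for text, label in labeled_texts]
```

Under these assumptions, `code_switch("You are an awful liar", {"awful", "liar"})` keeps the surrounding English and swaps in only the two translated words. The label attached to each example is carried over unchanged, which is what makes the output usable as synthetic training data for a classifier.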

2.

Granted patent
Generating synthetic code-switched data for training language models (status: in force)

Publication number: US12242820B2

Publication date: 2025-03-04

Application number: US17651555

Filing date: 2022-02-17

Applicant: Adobe Inc.

Inventor: Cesa Salaam, Seunghyun Yoon, Trung Huu Bui, Franck Dernoncourt

IPC: G10L15/22, G06F40/47, G06F40/58, G06N3/045, G06N3/08

Abstract: Techniques for training a language model for code switching content are disclosed. Such techniques include, in some embodiments, generating a dataset, which includes identifying one or more portions within textual content in a first language, the identified one or more portions each including one or more of offensive content or non-offensive content; translating the identified one or more salient portions to a second language; and reintegrating the translated one or more portions into the textual content to generate code-switched textual content. In some cases, the textual content in the first language includes offensive content and non-offensive content, the identified one or more portions include the offensive content, and the translated one or more portions include a translated version of the offensive content. In some embodiments, the code-switched textual content is at least part of a synthetic dataset usable to train a language model, such as a multilingual classification model.
