Multilingual model training using parallel corpora, crowdsourcing, and accurate monolingual models

    Publication number: US12236205B2

    Publication date: 2025-02-25

    Application number: US17131624

    Application date: 2020-12-22

    Abstract: A data processing system for generating training data for a multilingual NLP model implements obtaining a corpus including first and second content items. The first content items are English-language textual content, and the second content items are translations of the first content items in one or more non-English target languages. The system further implements selecting a first content item from the first content items, generating a plurality of candidate labels for the first content item by analyzing the first content item with a plurality of first English-language NLP models, selecting a first label from the plurality of candidate labels, generating first training data by associating the first label with the first content item, generating second training data by associating the first label with a second content item of the second content items, and training a pretrained multilingual NLP model with the first training data and the second training data.
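
The labeling flow in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the three toy "English-language models", the majority-vote selection rule, and the names `select_label` and `build_training_data` are all assumptions introduced here. The abstract states only that one label is selected from the candidates, not how.

```python
from collections import Counter


def select_label(candidate_labels):
    """Pick one label from the candidates (majority vote is an assumed rule)."""
    return Counter(candidate_labels).most_common(1)[0][0]


def build_training_data(parallel_corpus, english_models):
    """Label each English item, then reuse that label for its translations.

    parallel_corpus: list of (english_text, [translated_text, ...]) pairs.
    english_models: callables mapping English text to a label.
    """
    training_data = []
    for english_text, translations in parallel_corpus:
        candidates = [model(english_text) for model in english_models]
        label = select_label(candidates)
        training_data.append((english_text, label))    # first training data
        for translated in translations:
            training_data.append((translated, label))  # second training data
    return training_data


# Hypothetical stand-ins for accurate English-language NLP models.
def keyword_model(text):
    return "greeting" if "hello" in text.lower() else "other"


def length_model(text):
    return "greeting" if len(text.split()) <= 3 else "other"


def punctuation_model(text):
    return "greeting" if text.endswith("!") else "other"


corpus = [
    ("Hello there!", ["Bonjour !", "¡Hola!"]),
    ("The quarterly report is attached for your review.",
     ["Le rapport trimestriel est joint pour examen."]),
]
examples = build_training_data(
    corpus, [keyword_model, length_model, punctuation_model]
)
```

The resulting `examples` list pairs every English item and every translation with the same label, which is the form of data a pretrained multilingual model could then be fine-tuned on.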

    Multilingual content recommendation pipeline

    Publication number: US12124812B2

    Publication date: 2024-10-22

    Application number: US17510850

    Application date: 2021-10-26

    CPC classification number: G06F40/56 G06F40/284 G06F40/47

    Abstract: A data processing system implements obtaining first textual content in a first language from a first client device; determining that the first language is supported by a first machine learning model; obtaining a guard list of prohibited terms associated with the first language; determining that the first textual content does not include one or more prohibited terms based on the guard list; providing the first textual content as an input to the first machine learning model responsive to the first textual content not including the one or more prohibited terms; analyzing the first textual content with the first machine learning model to obtain a first content recommendation; obtaining a first content recommendation policy that identifies content associated with the first language that may not be provided as a content recommendation; determining that the first content recommendation is not prohibited; and providing the first content recommendation to the first client device.
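
The sequence of checks in this abstract maps naturally onto a guarded function. The sketch below is illustrative only: the function name `recommend`, the dictionary-based lookups, the whitespace tokenization, and the toy model are assumptions made here, not details taken from the patent.

```python
def recommend(text, language, models, guard_lists, policies):
    """Return a content recommendation, or None if any check fails."""
    # 1. Is the language supported by a model?
    model = models.get(language)
    if model is None:
        return None

    # 2. Guard list: reject input containing a prohibited term.
    #    Whitespace tokenization is a simplification for this sketch.
    tokens = set(text.lower().split())
    if tokens & guard_lists.get(language, set()):
        return None

    # 3. Run the model only on input that passed the guard list.
    recommendation = model(text)

    # 4. Policy: reject recommendations that may not be provided.
    if recommendation in policies.get(language, set()):
        return None
    return recommendation


# Hypothetical per-language configuration.
models = {
    "en": lambda text: (
        "party-invite-template" if "birthday" in text.lower()
        else "generic-template"
    )
}
guard_lists = {"en": {"badword"}}
policies = {"en": {"generic-template"}}

ok = recommend("Plan a birthday party", "en", models, guard_lists, policies)
blocked_input = recommend("badword birthday", "en", models, guard_lists, policies)
unsupported = recommend("Organiser une fête", "fr", models, guard_lists, policies)
blocked_output = recommend("Weekly status", "en", models, guard_lists, policies)
```

Only the first call returns a recommendation; the other three return `None` because of the guard list, the unsupported language, and the recommendation policy, respectively.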

    Image classification modeling while maintaining data privacy compliance

    Publication number: US12001514B2

    Publication date: 2024-06-04

    Application number: US18047324

    Application date: 2022-10-18

    CPC classification number: G06F18/217 G06F18/254 G06F21/6218 G06N20/00

    Abstract: The present disclosure relates to processing operations that execute image classification training for domain-specific traffic, where training operations are entirely compliant with data privacy regulations and policies. Image classification model training, as described herein, is configured to classify meaningful image categories in domain-specific scenarios where there is unknown data traffic and strict data compliance requirements that result in privacy-limited image data sets. Iterative image classification training satisfies data compliance requirements through a combination of online image classification training and offline image classification training. This results in tuned image recognition classifiers that have improved accuracy and efficiency over general image recognition classifiers when working with domain-specific data traffic. One or more image recognition classifiers are independently trained and tuned to detect an image class for image classification. Training of independent image recognition classifiers is also utilized for training and tuning of deeper learning models for image classification.

    SCALABLE RETRIEVAL SYSTEM FOR SUGGESTING TEXTUAL CONTENT

    Publication number: US20230161825A1

    Publication date: 2023-05-25

    Application number: US17530982

    Application date: 2021-11-19

    CPC classification number: G06F16/953 G06N20/00

    Abstract: A data processing system implements receiving query text for a search query for textual content recommendation. The query text includes one or more words indicating a type of textual content items being sought. The system implements analyzing the query text using a first machine learning (ML) model to obtain encoded query text, where the first ML model is trained to identify features within the query text and to generate the encoded query text by mapping the features to a hyper-dimensional latent space (HDLS). The system implements identifying one or more content items in a database of encoded content items mapped to the HDLS that satisfy the search query by comparing attributes of the encoded query text with attributes of the encoded content items to identify content items that are closest to the encoded query text within the HDLS, and causing the one or more content items to be displayed.
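
Retrieval of the "closest" encoded items can be illustrated with a nearest-neighbor search. In this sketch the word-count `encode` function is a toy stand-in for the trained ML model, the six-term vocabulary stands in for the latent space, and cosine similarity is an assumed distance measure; the abstract does not specify any of these.

```python
import math

# Toy feature vocabulary; a real system would use a learned latent space.
VOCAB = ["birthday", "invitation", "resume", "flyer", "party", "job"]


def encode(text):
    """Stand-in encoder: map text to a word-count vector over VOCAB."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, encoded_items, top_k=2):
    """Return the names of the top_k items closest to the encoded query."""
    q = encode(query)
    ranked = sorted(
        encoded_items, key=lambda item: cosine(q, item[1]), reverse=True
    )
    return [name for name, _ in ranked[:top_k]]


# Content items are encoded once, ahead of time, and stored with their vectors.
items = ["birthday party invitation", "job resume", "party flyer"]
database = [(name, encode(name)) for name in items]

results = search("birthday invitation", database, top_k=1)
# results == ["birthday party invitation"]
```

Encoding the content items ahead of time means only the query has to be encoded at search time. At larger scale, the linear scan in `search` would typically be replaced by an approximate nearest-neighbor index.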

    IMAGE CLASSIFICATION MODELING WHILE MAINTAINING DATA PRIVACY COMPLIANCE

    Publication number: US20200265153A1

    Publication date: 2020-08-20

    Application number: US16276908

    Application date: 2019-02-15

    Abstract: The present disclosure relates to processing operations that execute image classification training for domain-specific traffic, where training operations are entirely compliant with data privacy regulations and policies. Image classification model training, as described herein, is configured to classify meaningful image categories in domain-specific scenarios where there is unknown data traffic and strict data compliance requirements that result in privacy-limited image data sets. Iterative image classification training satisfies data compliance requirements through a combination of online image classification training and offline image classification training. This results in tuned image recognition classifiers that have improved accuracy and efficiency over general image recognition classifiers when working with domain-specific data traffic. One or more image recognition classifiers are independently trained and tuned to detect an image class for image classification. Training of independent image recognition classifiers is also utilized for training and tuning of deeper learning models for image classification.

    Method and system of retrieving assets from personalized asset libraries

    Publication number: US12242491B2

    Publication date: 2025-03-04

    Application number: US17716653

    Application date: 2022-04-08

    Abstract: A system and method for retrieving assets from a personalized asset library includes receiving a search query for searching for assets in one or more asset libraries, the one or more asset libraries including a personalized asset library; encoding the search query into embedding representations via a trained query representation machine-learning (ML) model; comparing, via a matching unit, the query embedding representations to a plurality of asset representations, each of the plurality of asset representations being a representation of one of the plurality of candidate assets; identifying, based on the comparison, at least one of the plurality of the candidate assets as a search result for the search query; and providing the identified plurality of candidate assets for display as the search result. The plurality of asset representations for the one or more assets in the personalized content library are generated automatically without human labeling.

    MULTILINGUAL SUPPORT FOR NATURAL LANGUAGE PROCESSING APPLICATIONS

    Publication number: US20230274096A1

    Publication date: 2023-08-31

    Application number: US17681250

    Application date: 2022-02-25

    CPC classification number: G06F40/49 G06F40/284 G06F40/242 G06F40/253 G06N20/00

    Abstract: A data processing system implements obtaining textual content in a first language from a first client device and segmenting the textual content into a plurality of first tokens. The system also implements translating the first tokens from the first language to a second language using a bilingual dictionary to obtain second tokens, extracting feature information from the second tokens to create a feature vector, providing the feature vector to a first natural language processing model trained to analyze textual input in the second language and to output contextual information indicating one or more topics or subject matter of the textual content, and providing the contextual information to a first machine learning model configured to analyze the contextual information and to identify one or more content items predicted to be relevant to the contextual information. The system further implements providing the information identifying the one or more content items to the first client device.
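
The token-level translation path in this abstract can be sketched as a short pipeline. Everything concrete below is an assumption made for illustration: the Spanish-to-English dictionary entries, the count-based feature vector, and the rule-based `topic_model` standing in for the trained second-language NLP model.

```python
# Hypothetical bilingual dictionary (first language: Spanish, second: English).
BILINGUAL_DICT = {
    "fiesta": "party",
    "cumpleaños": "birthday",
    "informe": "report",
    "ventas": "sales",
}
FEATURE_TERMS = ["party", "birthday", "report", "sales"]


def segment(text):
    """Split text into lowercase first-language tokens."""
    return text.lower().split()


def translate_tokens(tokens, dictionary):
    """Word-by-word lookup; tokens missing from the dictionary are dropped."""
    return [dictionary[t] for t in tokens if t in dictionary]


def feature_vector(tokens):
    """Count how often each feature term occurs in the translated tokens."""
    return [tokens.count(term) for term in FEATURE_TERMS]


def topic_model(vector):
    """Toy stand-in for the NLP model that outputs contextual information."""
    celebration = vector[0] + vector[1]
    business = vector[2] + vector[3]
    if celebration == 0 and business == 0:
        return "unknown"
    return "celebration" if celebration >= business else "business"


first_tokens = segment("Fiesta de cumpleaños")
second_tokens = translate_tokens(first_tokens, BILINGUAL_DICT)
topic = topic_model(feature_vector(second_tokens))
# second_tokens == ["party", "birthday"]; topic == "celebration"
```

Because translation happens per token through a dictionary rather than through a full machine-translation system, word order and grammar are not preserved. That is acceptable here because the downstream features are order-independent counts.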
