-
Publication No.: US12242568B2
Publication Date: 2025-03-04
Application No.: US17903798
Filing Date: 2022-09-06
Applicant: Oracle International Corporation
Inventor: Ariel Gedaliah Kobren , Swetasudha Panda , Michael Louis Wick , Qinlan Shen , Jason Anthony Peck
IPC: G06F18/214 , G06F40/56
Abstract: Techniques are disclosed for augmenting data sets used for training machine learning models and for generating predictions by trained machine learning models. These techniques may increase the number and diversity of examples in an initial training dataset of sentences by extracting subsets of words from the existing sentences. The techniques may conserve scarce sample data in few-shot situations by training a data generation model using general data obtained from a general data source.
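The word-subset extraction idea can be illustrated with a toy sketch (the function name, sampling strategy, and parameters below are illustrative assumptions, not the patented method):

```python
import random

def augment_by_word_subsets(sentences, n_new=2, keep_ratio=0.7, seed=0):
    """Grow a small training set by extracting subsets of words from
    existing sentences, increasing the number and diversity of examples."""
    rng = random.Random(seed)
    augmented = list(sentences)
    for sent in sentences:
        words = sent.split()
        for _ in range(n_new):
            # keep a random subset of word positions, in original order
            k = max(1, int(len(words) * keep_ratio))
            kept = sorted(rng.sample(range(len(words)), k))
            augmented.append(" ".join(words[i] for i in kept))
    return augmented
```

Each synthesized example uses only words already present in the source sentence, so no outside vocabulary is introduced.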
-
Publication No.: US20240289685A1
Publication Date: 2024-08-29
Application No.: US18176380
Filing Date: 2023-02-28
Applicant: Oracle International Corporation
Inventor: Michael Louis Wick , Ariel Kobren , Swetasudha Panda , John Sullivan
IPC: G06N20/00
CPC classification number: G06N20/00
Abstract: Machine learning model performance may be determined on unlabeled, out-of-distribution data. A source data set may be obtained for training a machine learning model. Unbiased estimates may be determined for baseline performance indicators of the machine learning model applied to a target dataset without ground truth labels, using importance sampling weights. Performance metrics may then be determined using the baseline performance indicators and provided.
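The core importance-sampling step can be sketched as follows. This is a minimal illustration assuming weights of the form w(x) = p_target(x) / p_source(x) are already available; the abstract does not specify how the weights are estimated:

```python
def importance_weighted_accuracy(preds, labels, weights):
    """Unbiased estimate of accuracy on a target distribution using
    labeled *source* data reweighted by importance sampling weights
    w(x) = p_target(x) / p_source(x)."""
    total = sum(weights)
    # each correct source prediction contributes its target-importance weight
    hit = sum(w for p, y, w in zip(preds, labels, weights) if p == y)
    return hit / total
```

Correct predictions on examples that are common in the target but rare in the source get large weights, so the estimate reflects target-distribution performance without target labels.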
-
Publication No.: US20230409969A1
Publication Date: 2023-12-21
Application No.: US18176374
Filing Date: 2023-02-28
Applicant: Oracle International Corporation
Inventor: Swetasudha Panda , Ariel Kobren , Michael Louis Wick , Qinlan Shen
IPC: G06N20/00
CPC classification number: G06N20/00
Abstract: Bias in a language model generated through fine-tuning of a pre-trained language model may be mitigated, whether the bias originates in the pre-trained language model or in the fine-tuning data. A pre-trained language model may be fine-tuned using downstream training data. Prior to tuning, elements within the downstream data may be identified that either match or serve as proxies for one or more identity elements associated with training bias sensitivity. Proxy elements may be identified using an analysis of the distributions of the downstream elements and the distributions of identity elements. Once the elements are identified, instances of the identified elements may be replaced in the downstream data with one or more masking elements to generate masked downstream data. A fine-tuned language model with reduced bias may then be generated from the pre-trained language model by tuning it using the masked downstream data.
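The masking step can be sketched in a few lines. The identity-term list and mask token below are hypothetical, and the proxy-detection step (analyzing distributions of downstream elements) is omitted:

```python
def mask_identity_terms(sentences, identity_terms, mask_token="[MASK]"):
    """Replace identity elements in downstream fine-tuning data with a
    masking element, producing masked downstream data."""
    vocab = {t.lower() for t in identity_terms}
    masked = []
    for sent in sentences:
        # substitute the mask token for any word matching an identity term
        masked.append(" ".join(
            mask_token if w.lower() in vocab else w
            for w in sent.split()))
    return masked
```

Fine-tuning on the masked data then prevents the model from learning spurious associations between the masked terms and downstream labels.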
-
Publication No.: US20230394371A1
Publication Date: 2023-12-07
Application No.: US18453929
Filing Date: 2023-08-22
Applicant: Oracle International Corporation
IPC: G06N20/00
Abstract: Fairness of a trained classifier may be ensured by generating a training data set from input data points of a multi-dimensional feature space according to parameters including an amount of label bias, a control for the discrepancy between the rarity of features, and an amount of selection bias. Unlabeled data points of the input data, comprising unobserved ground truths, are labeled according to the amount of label bias, and the input data is sampled according to the amount of selection bias and the control for the discrepancy between the rarity of features. The classifier is then trained using the sampled and labeled data points as well as additional unlabeled data points. The trained classifier is then usable to determine unbiased classifications of one or more labels for one or more other data sets.
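A hedged sketch of the kind of parameterized generator the abstract describes, with label bias flipping ground-truth labels for a rare group and selection bias dropping rare-group points (all parameter names and the two-feature setup are illustrative assumptions):

```python
import random

def generate_biased_dataset(n, label_bias=0.1, selection_bias=0.3,
                            rare_feature_rate=0.2, seed=0):
    """Generate (x, rare, y) points with controllable label bias and
    selection bias applied to the rare-feature group."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        rare = rng.random() < rare_feature_rate
        x = rng.random()
        y = 1 if x > 0.5 else 0                # unobserved ground truth
        if rare and rng.random() < label_bias:
            y = 1 - y                          # label bias on rare group
        if rare and rng.random() < selection_bias:
            continue                           # selection bias: drop point
        data.append((x, int(rare), y))
    return data
```

Sweeping these parameters yields a family of data sets on which a classifier's robustness to each kind of bias can be measured separately.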
-
Publication No.: US11488579B2
Publication Date: 2022-11-01
Application No.: US16890263
Filing Date: 2020-06-02
Applicant: Oracle International Corporation
IPC: G10L15/01 , G10L15/06 , G10L15/197
Abstract: A method of evaluating a language model using negative data may include accessing a first language model that is trained using a first training corpus, and accessing a second language model. The second language model may be configured to generate outputs that are less grammatical than outputs generated by the first language model. The method may also include training the second language model using a second training corpus, and generating output text from the second language model. The method may further include testing the first language model using the output text from the second language model.
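The final testing step can be sketched as a perplexity check on the second model's less-grammatical output: a well-calibrated first model should assign such text low probability, so higher perplexity here is better. The `logprob_fn` hook (returning a text's total log-probability under the first model) is a hypothetical stand-in:

```python
import math

def perplexity_on_negative(logprob_fn, negative_texts):
    """Evaluate a language model on output generated by a weaker,
    less-grammatical model: return its perplexity on that text."""
    total_lp, total_tokens = 0.0, 0
    for text in negative_texts:
        total_lp += logprob_fn(text)           # total log-prob of the text
        total_tokens += len(text.split())
    return math.exp(-total_lp / total_tokens)  # per-token perplexity
```

Comparing this value against the model's perplexity on held-out grammatical text gives a contrastive measure of how sharply the model separates good text from bad.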
-
Publication No.: US20210374361A1
Publication Date: 2021-12-02
Application No.: US16890097
Filing Date: 2020-06-02
Applicant: Oracle International Corporation
Inventor: Michael Louis Wick , Jean-Baptiste Frederic George Tristan , Adam Craig Pocock , Katherine Silverstein
IPC: G06F40/58
Abstract: A method for training a language model using negative data may include accessing a first training corpus comprising positive training data and accessing a second training corpus comprising negative training data. The method may further include training a first language model using at least the first training corpus, the second training corpus, and a maximum likelihood function. The maximum likelihood function may maximize the likelihood of the first language model predicting the positive training data while minimizing the likelihood of the first language model predicting the negative training data.
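The contrastive maximum-likelihood objective can be demonstrated on a toy unigram model trained by gradient ascent on softmax logits. This is a sketch of the objective only; the patent concerns full language models, not unigram distributions:

```python
import math
from collections import Counter

def train_contrastive_unigram(positive, negative, lr=0.1, steps=100):
    """Fit unigram probabilities by maximizing likelihood of positive
    (grammatical) tokens while minimizing likelihood of negative tokens."""
    vocab = sorted(set(positive) | set(negative))
    logits = {w: 0.0 for w in vocab}
    pos, neg = Counter(positive), Counter(negative)
    n_pos, n_neg = len(positive), len(negative)
    for _ in range(steps):
        z = sum(math.exp(v) for v in logits.values())
        # gradient of [log p(positive) - log p(negative)] w.r.t. each logit
        grads = {w: (pos[w] - n_pos * math.exp(logits[w]) / z)
                    - (neg[w] - n_neg * math.exp(logits[w]) / z)
                 for w in logits}
        for w in logits:
            logits[w] += lr * grads[w]
    z = sum(math.exp(v) for v in logits.values())
    return {w: math.exp(logits[w]) / z for w in vocab}
```

Tokens over-represented in the negative corpus are pushed toward low probability, which is exactly the effect the combined objective is meant to produce.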
-
Publication No.: US11017151B2
Publication Date: 2021-05-25
Application No.: US16833276
Filing Date: 2020-03-27
Applicant: Oracle International Corporation
IPC: H03M7/00 , G06F40/137 , G06N7/00 , H03M7/30 , G06N5/04 , G06F17/16 , G06F40/146
Abstract: A scalable hierarchical coreference method employs a homomorphic compression scheme that supports addition and partial subtraction to represent the data, and the evolving intermediate results of probabilistic inference, more efficiently. The method may encode the features underlying conditional random field models of coreference resolution so that cosine similarities can be computed efficiently. The method may be applied to compressing features and intermediate inference results for conditional random fields, and may allow compressed representations to be added and subtracted in a way that preserves the cosine similarities.
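A linear random projection illustrates the two properties the abstract combines: it is exactly additive (and subtractable), and it approximately preserves cosine similarity. This is a generic sketch in the spirit of the scheme, not the patented compression itself:

```python
import math
import random

def make_projector(dim_in, dim_out, seed=0):
    """Return a Gaussian random projection: linear, hence additive and
    subtractable, and approximately cosine-preserving."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(dim_out)]
    def project(v):
        return [sum(r[i] * v[i] for i in range(dim_in)) for r in rows]
    return project

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)
```

Because the map is linear, compressed feature vectors for merged coreference clusters can be maintained by adding (or partially subtracting) compressed children, without ever decompressing.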
-
Publication No.: US12106050B2
Publication Date: 2024-10-01
Application No.: US17589662
Filing Date: 2022-01-31
Applicant: Oracle International Corporation
Inventor: Swetasudha Panda , Ariel Kobren , Michael Louis Wick , Stephen Green
IPC: G06F40/279 , G06N20/00
CPC classification number: G06F40/279 , G06N20/00
Abstract: Debiasing pre-trained sentence encoders with probabilistic dropouts may be performed by various systems, services, or applications. A sentence may be received, where the words of the sentence may be provided as tokens to an encoder of a machine learning model. A token-wise correlation using semantic orientation may be computed to determine a bias score for each token in the input sentence. A dropout probability for each token in the input sentence may then be determined from its bias score. The machine learning model may be trained or tuned based on the dropout probabilities for the tokens in the input sentence.
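The mapping from bias scores to dropout probabilities can be sketched with a simple linear scaling (the cap and normalization below are illustrative assumptions; the patent's mapping may differ):

```python
def dropout_probs(tokens, bias_scores, max_p=0.5):
    """Map per-token bias scores to dropout probabilities: the higher a
    token's semantic-orientation bias, the more likely it is dropped
    during training or tuning."""
    # scale scores so the most biased token gets probability max_p
    top = max(bias_scores) or 1.0
    return {t: max_p * (s / top) for t, s in zip(tokens, bias_scores)}
```

During training, each token would then be masked out with its own probability, so heavily biased tokens contribute less to the learned representations.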
-
Publication No.: US20240168934A1
Publication Date: 2024-05-23
Application No.: US18426100
Filing Date: 2024-01-29
Applicant: Oracle International Corporation
IPC: G06F16/22 , G06F17/18 , G06F18/22 , G06F18/231
CPC classification number: G06F16/2228 , G06F17/18 , G06F18/22 , G06F18/231
Abstract: A first set and a second set are identified as operands for a set operation of a similarity analysis task iteration. Using respective minimum hash information arrays and contributor count arrays of the two sets, a minimum hash information array and contributor count array of a derived set resulting from the set operation is generated. An entry in the contributor count array of the derived set indicates the number of child sets of the derived set that meet a criterion with respect to a corresponding entry in the minimum hash information array of the derived set. The generated minimum hash information array and the contributor count array are stored as part of input for a subsequent iteration. After a termination criterion of the task is met, output of the task is stored.
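For a union operation, one plausible reading of the abstract is: each derived minimum-hash entry is the elementwise minimum of the children's entries, and the contributor count records how many children attain that minimum (the "criterion"). A hedged sketch under that reading, not the patented method:

```python
def union_minhash(mh_a, cnt_a, mh_b, cnt_b):
    """Combine two child sets' minimum-hash arrays and contributor
    count arrays into those of the derived (union) set."""
    mh, cnt = [], []
    for a, ca, b, cb in zip(mh_a, cnt_a, mh_b, cnt_b):
        if a < b:
            mh.append(a); cnt.append(ca)       # only left child contributes
        elif b < a:
            mh.append(b); cnt.append(cb)       # only right child contributes
        else:
            mh.append(a); cnt.append(ca + cb)  # tie: both children contribute
    return mh, cnt
```

The derived arrays can then feed the next iteration of the similarity analysis task as the abstract describes, without revisiting the original sets.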
-
Publication No.: US20230401285A1
Publication Date: 2023-12-14
Application No.: US17903796
Filing Date: 2022-09-06
Applicant: Oracle International Corporation
Inventor: Ariel Gedaliah Kobren , Swetasudha Panda , Michael Louis Wick , Qinlan Shen , Jason Anthony Peck
IPC: G06K9/62
CPC classification number: G06K9/6256 , G06K9/6262
Abstract: Techniques are disclosed for augmenting data sets used for training machine learning models and for generating predictions by trained machine learning models. The techniques generate synthesized data from sample data and train a machine learning model using the synthesized data to augment a sample data set. Embodiments selectively partition the sample data set and synthesized data into a training data and a validation data, which are used to generate and select machine learning models.
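One plausible selective-partition strategy is to draw validation data only from real samples, so model selection is not skewed by synthesis artifacts, while training mixes the remaining real data with synthesized examples. A sketch under that assumption (the abstract does not fix the partition rule):

```python
import random

def partition_for_validation(real, synthetic, val_frac=0.2, seed=0):
    """Split real samples into train/validation, then add synthesized
    examples to the training side only."""
    rng = random.Random(seed)
    shuffled = list(real)
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_frac))
    val = shuffled[:n_val]                      # real data only
    train = shuffled[n_val:] + list(synthetic)  # real remainder + synthetic
    return train, val
```

Candidate models trained on the augmented training split can then be compared on the all-real validation split to select the best one.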