Enforcing Fairness on Unlabeled Data to Improve Modeling Performance

    Publication Number: US20250068979A1

    Publication Date: 2025-02-27

    Application Number: US18942116

    Application Date: 2024-11-08

    Abstract: Fairness of a trained classifier may be ensured by generating a data set for training, the data set generated using input data points of a feature space including multiple dimensions and according to different parameters including an amount of label bias, a control for discrepancy between rarity of features, and an amount of selection bias. Unlabeled data points of the input data comprising unobserved ground truths are labeled according to the amount of label bias and the input data sampled according to the amount of selection bias and the control for the discrepancy between the rarity of features. The classifier is then trained using the sampled and labeled data points as well as additional unlabeled data points. The trained classifier is then usable to determine unbiased classifications of one or more labels for one or more other data sets.
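
    As a rough illustration of the workflow this abstract describes, the sketch below generates a synthetic training set from unlabeled points under a label-bias parameter, a selection-bias parameter, and a feature-rarity control, then trains a classifier on the sampled labeled points together with additional unlabeled points. The parameter names, the rarity rule, and the use of scikit-learn's self-training wrapper are illustrative assumptions, not the claimed method.

```python
# Hypothetical sketch: biased data-set generation plus semi-supervised training.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)

def make_biased_dataset(X, label_bias=0.1, selection_bias=0.3, rarity_gap=0.5):
    """Label points from an unobserved ground truth, flip a fraction of the
    labels (label bias), then keep points non-uniformly so the rarer feature
    group is under-sampled (selection bias / rarity discrepancy)."""
    true_y = (X[:, 0] + X[:, 1] > 0).astype(int)           # unobserved ground truth
    flipped = rng.random(len(X)) < label_bias               # amount of label bias
    y = np.where(flipped, 1 - true_y, true_y)
    rare_group = X[:, 0] > 1.0                              # a "rare" feature region
    keep_prob = np.where(rare_group, rarity_gap, 1.0) * (1.0 - selection_bias)
    keep = rng.random(len(X)) < keep_prob                   # amount of selection bias
    return X[keep], y[keep]

X = rng.normal(size=(5000, 2))                              # multi-dimensional feature space
X_labeled, y_labeled = make_biased_dataset(X[:4000])
X_unlabeled = X[4000:]                                      # additional unlabeled data points

# Train on the sampled, labeled points plus unlabeled points (label -1 marks unlabeled).
X_train = np.vstack([X_labeled, X_unlabeled])
y_train = np.concatenate([y_labeled, -np.ones(len(X_unlabeled), dtype=int)])
classifier = SelfTrainingClassifier(LogisticRegression()).fit(X_train, y_train)
```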

    Similarity analysis using enhanced MinHash

    Publication Number: US11921687B2

    Publication Date: 2024-03-05

    Application Number: US16436770

    Application Date: 2019-06-10

    CPC classification number: G06F16/2228 G06F17/18 G06F18/22 G06F18/231

    Abstract: A first set and a second set are identified as operands for a set operation of a similarity analysis task iteration. Using respective minimum hash information arrays and contributor count arrays of the two sets, a minimum hash information array and contributor count array of a derived set resulting from the set operation is generated. An entry in the contributor count array of the derived set indicates the number of child sets of the derived set that meet a criterion with respect to a corresponding entry in the minimum hash information array of the derived set. The generated minimum hash information array and the contributor count array are stored as part of input for a subsequent iteration. After a termination criterion of the task is met, output of the task is stored.
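
    The sketch below illustrates one way the per-slot bookkeeping described here could work for a union operation: each slot of the derived signature keeps the smaller of the two child minimum-hash values, and the contributor count records how many child sets attain that minimum. The array layout and the equality criterion are assumptions for illustration, not the claimed data structures.

```python
# Hypothetical sketch: merging MinHash signatures for a set union while
# tracking, per slot, how many child sets contribute the derived minimum.
def merge_union(minhash_a, counts_a, minhash_b, counts_b):
    """Return (minhash, counts) arrays for the set derived by the union."""
    merged_minhash, merged_counts = [], []
    for ha, ca, hb, cb in zip(minhash_a, counts_a, minhash_b, counts_b):
        if ha < hb:
            merged_minhash.append(ha); merged_counts.append(ca)
        elif hb < ha:
            merged_minhash.append(hb); merged_counts.append(cb)
        else:  # both child sets meet the criterion (share the minimum) for this slot
            merged_minhash.append(ha); merged_counts.append(ca + cb)
    return merged_minhash, merged_counts

# One iteration; the resulting arrays feed the next set operation of the task.
mh, ct = merge_union([3, 7, 2], [1, 1, 1], [3, 5, 9], [1, 1, 1])
# mh == [3, 5, 2], ct == [2, 1, 1]
```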

    Enforcing Fairness on Unlabeled Data to Improve Modeling Performance

    Publication Number: US20200372406A1

    Publication Date: 2020-11-26

    Application Number: US16781945

    Application Date: 2020-02-04

    Abstract: Fairness of a trained classifier may be ensured by generating a data set for training, the data set generated using input data points of a feature space including multiple dimensions and according to different parameters including an amount of label bias, a control for discrepancy between rarity of features, and an amount of selection bias. Unlabeled data points of the input data comprising unobserved ground truths are labeled according to the amount of label bias and the input data sampled according to the amount of selection bias and the control for the discrepancy between the rarity of features. The classifier is then trained using the sampled and labeled data points as well as additional unlabeled data points. The trained classifier is then usable to determine unbiased classifications of one or more labels for one or more other data sets.

    Enforcing fairness on unlabeled data to improve modeling performance

    Publication Number: US12175344B2

    Publication Date: 2024-12-24

    Application Number: US18453929

    Application Date: 2023-08-22

    Abstract: Fairness of a trained classifier may be ensured by generating a data set for training, the data set generated using input data points of a feature space including multiple dimensions and according to different parameters including an amount of label bias, a control for discrepancy between rarity of features, and an amount of selection bias. Unlabeled data points of the input data comprising unobserved ground truths are labeled according to the amount of label bias and the input data sampled according to the amount of selection bias and the control for the discrepancy between the rarity of features. The classifier is then trained using the sampled and labeled data points as well as additional unlabeled data points. The trained classifier is then usable to determine unbiased classifications of one or more labels for one or more other data sets.

    Debiasing Pre-trained Sentence Encoders With Probabilistic Dropouts

    Publication Number: US20240419900A1

    Publication Date: 2024-12-19

    Application Number: US18817147

    Application Date: 2024-08-27

    Abstract: Debiasing pre-trained sentence encoders with probabilistic dropouts may be performed by various systems, services, or applications. A sentence may be received, where the words of the sentence may be provided as tokens to an encoder of a machine learning model. A token-wise correlation using semantic orientation may be computed to determine a bias score for the tokens in the input sentence. A probability of dropout for tokens in the input sentence may be determined from the bias scores. The machine learning model may be trained or tuned based on the probabilities of dropout for the tokens in the input sentence.
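
    A hedged sketch of what token-level probabilistic dropout driven by a bias score might look like follows; the semantic-orientation score (absolute cosine similarity against a bias direction), the score-to-probability mapping, and the tensor shapes are illustrative assumptions rather than the patented procedure.

```python
# Hypothetical sketch: per-token dropout probabilities derived from bias scores.
import torch

def bias_scores(token_embeddings, bias_direction):
    """Token-wise semantic orientation: |cosine similarity| between each
    token embedding and a bias direction (e.g., a gender axis)."""
    sims = torch.nn.functional.cosine_similarity(
        token_embeddings, bias_direction.expand_as(token_embeddings), dim=-1)
    return sims.abs()

def dropout_probabilities(scores, max_p=0.5):
    """Map bias scores in [0, 1] to per-token dropout probabilities."""
    return max_p * scores

def token_dropout(token_embeddings, probs):
    """During training, zero whole token embeddings with their per-token
    probability so highly biased tokens contribute less often."""
    keep = (torch.rand(probs.shape) >= probs).float().unsqueeze(-1)
    return token_embeddings * keep

# Toy usage on a (batch, sequence, hidden) tensor of encoder token embeddings.
embeddings = torch.randn(2, 8, 16)
direction = torch.randn(16)
probs = dropout_probabilities(bias_scores(embeddings, direction))
dropped = token_dropout(embeddings, probs)
```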

    AUGMENTING DATA SETS FOR MACHINE LEARNING MODELS

    Publication Number: US20230032208A1

    Publication Date: 2023-02-02

    Application Number: US17389900

    Application Date: 2021-07-30

    Abstract: Techniques are disclosed for augmenting data sets used for training machine learning models and for generating predictions by trained machine learning models. These techniques may increase a number (and diversity) of examples within an initial training dataset of sentences by extracting a subset of words from the existing training dataset of sentences. The extracted subset includes no stopwords and fewer content words than found in the initial training dataset. The remaining words may be re-ordered. Using the extracted and re-ordered subset of words, the dataset generation model produces a second set of sentences that are different from the first set. The second set of sentences may be used to increase a number of examples in classes with few examples.
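
    The extraction step described above might look like the sketch below: stopwords are removed, a fraction of the remaining content words is kept and re-ordered, and the result would then be handed to a sentence-generation model to produce new examples for sparsely populated classes. The stopword list, keep ratio, and downstream generation model are assumptions, not the disclosed technique.

```python
# Hypothetical sketch: extract and re-order a content-word subset for augmentation.
import random

STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "to", "of",
             "and", "or", "in", "on", "for", "with", "that", "this", "because"}

def extract_keywords(sentence, keep_ratio=0.6, seed=0):
    """Drop stopwords, keep a fraction of the remaining content words,
    and shuffle their order."""
    rng = random.Random(seed)
    content_words = [w for w in sentence.split() if w.lower() not in STOPWORDS]
    k = max(1, int(len(content_words) * keep_ratio))
    subset = rng.sample(content_words, k)
    rng.shuffle(subset)
    return " ".join(subset)

# The extracted keywords would then prompt a text-generation model (e.g., a
# fine-tuned seq2seq model) to write new sentences for under-represented classes.
print(extract_keywords("The shipment was delayed because of a customs inspection"))
```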

    Debiasing Pre-trained Sentence Encoders With Probabilistic Dropouts

    Publication Number: US20220245339A1

    Publication Date: 2022-08-04

    Application Number: US17589662

    Application Date: 2022-01-31

    Abstract: Debiasing pre-trained sentence encoders with probabilistic dropouts may be performed by various systems, services, or applications. A sentence may be received, where the words of the sentence may be provided as tokens to an encoder of a machine learning model. A token-wise correlation using semantic orientation may be computed to determine a bias score for the tokens in the input sentence. A probability of dropout for tokens in the input sentence may be determined from the bias scores. The machine learning model may be trained or tuned based on the probabilities of dropout for the tokens in the input sentence.
