UNIFY95: META-LEARNING CONTAMINATION THRESHOLDS FROM UNIFIED ANOMALY SCORES

Publication No.: US20240095580A1

Publication Date: 2024-03-21

Application No.: US17994530

Filing Date: 2022-11-28

    CPC classification number: G06N20/00

    Abstract: Herein is a universal anomaly threshold based on several labeled datasets and transformation of anomaly scores from one or more anomaly detectors. In an embodiment, a computer meta-learns from each anomaly detection algorithm and each labeled dataset as follows. A respective anomaly detector based on the anomaly detection algorithm is trained based on the dataset. The anomaly detector infers respective anomaly scores for tuples in the dataset. The following are ensured in the anomaly scores from the anomaly detector: i) regularity that an anomaly score of zero cannot indicate an anomaly and ii) normality that an inclusive range of zero to one contains the anomaly scores from the anomaly detector. A respective anomaly threshold is calculated for the anomaly scores from the anomaly detector. After all meta-learning, a universal anomaly threshold is calculated as an average of the anomaly thresholds. An anomaly is detected based on the universal anomaly threshold.
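The score pipeline described above can be sketched in a few lines of numpy. This is a minimal illustration, not the patented method: min-max normalization stands in for whatever transformation enforces regularity and normality, and an accuracy-maximizing search stands in for the per-detector threshold calculation (the abstract fixes neither choice). The function names are hypothetical.

```python
import numpy as np

def normalize_scores(scores):
    """Min-max map raw scores into [0, 1] so that a score of zero cannot
    indicate an anomaly (regularity) and all scores lie in the inclusive
    range zero to one (normality)."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return np.zeros_like(scores) if hi == lo else (scores - lo) / (hi - lo)

def per_dataset_threshold(scores, labels):
    """Choose the threshold on normalized scores that maximizes accuracy
    against the dataset's anomaly labels (an assumed criterion)."""
    labels = np.asarray(labels, dtype=bool)
    best_t, best_acc = 0.5, -1.0
    for t in np.unique(scores):
        acc = np.mean((scores >= t) == labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def universal_threshold(score_sets, label_sets):
    """After meta-learning over all detector/dataset pairs, average the
    per-dataset thresholds into one universal anomaly threshold."""
    return float(np.mean([per_dataset_threshold(normalize_scores(s), l)
                          for s, l in zip(score_sets, label_sets)]))
```

At inference time, a tuple whose normalized score meets or exceeds the universal threshold would be flagged as an anomaly.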

    AUTOMATED DATASET DRIFT DETECTION
Invention Application

Publication No.: US20230139718A1

Publication Date: 2023-05-04

Application No.: US17513760

Filing Date: 2021-10-28

    Abstract: Herein are acceleration and increased reliability based on classification and scoring techniques for machine learning that compare two similar datasets of different ages to detect data drift without a predefined drift threshold. Various subsets are randomly sampled from the datasets. The subsets are combined in various ways to generate subsets of various age mixtures. In an embodiment, ages are permuted and drift is detected based on whether or not fitness scores indicate that an age binary classifier is confused. In an embodiment, an anomaly detector measures outlier scores of two subsets of different age mixtures. Drift is detected when the outlier scores diverge. In a two-arm bandit embodiment, iterations randomly alternate between both datasets based on respective probabilities that are adjusted by a bandit reward based on outlier scores from an anomaly detector. Drift is detected based on the probability of the younger dataset.
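The classifier-confusion embodiment can be sketched as follows. A nearest-centroid rule stands in for whatever binary age classifier an implementation would use, and the 0.1 band around 50% held-out accuracy is an assumed tuning constant; both are illustrative choices, not taken from the patent.

```python
import numpy as np

def drift_detected(old_data, new_data, n_rounds=20, confusion_band=0.1, seed=0):
    """Detect drift by checking whether a simple age classifier can tell
    old samples from new ones; held-out accuracy near 0.5 means the
    classifier is confused, so no drift is flagged."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_rounds):
        # randomly sample a subset of each age
        old = old_data[rng.choice(len(old_data), len(old_data) // 2, replace=False)]
        new = new_data[rng.choice(len(new_data), len(new_data) // 2, replace=False)]
        ho, hn = len(old) // 2, len(new) // 2
        # "train" on the first halves: one centroid per age
        c_old, c_new = old[:ho].mean(axis=0), new[:hn].mean(axis=0)
        # score the held-out halves, labeled by true age (0 = old, 1 = new)
        test = np.vstack([old[ho:], new[hn:]])
        truth = np.array([0] * (len(old) - ho) + [1] * (len(new) - hn))
        pred = (np.linalg.norm(test - c_new, axis=1)
                < np.linalg.norm(test - c_old, axis=1)).astype(int)
        accs.append(np.mean(pred == truth))
    return bool(abs(np.mean(accs) - 0.5) > confusion_band)
```

If the two datasets come from the same distribution, accuracy hovers near chance and no drift is reported; once the younger data shifts, the classifier separates the ages easily and drift is flagged.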

    EFFICIENT AND ACCURATE REGIONAL EXPLANATION TECHNIQUE FOR NLP MODELS

Publication No.: US20220309360A1

Publication Date: 2022-09-29

Application No.: US17212163

Filing Date: 2021-03-25

    Abstract: Herein are techniques for topic modeling and content perturbation that provide machine learning (ML) explainability (MLX) for natural language processing (NLP). A computer hosts an ML model that infers an original inference for each of many text documents that contain many distinct terms. To each text document (TD) is assigned, based on terms in the TD, a topic that contains a subset of the distinct terms. In a perturbed copy of each TD, a perturbed subset of the distinct terms is replaced. For the perturbed copy of each TD, the ML model infers a perturbed inference. For TDs of a topic, the computer detects that a difference between original inferences of the TDs of the topic and perturbed inferences of the TDs of the topic exceeds a threshold. Based on terms in the TDs of the topic, the topic is replaced with multiple, finer-grained new topics. After sufficient topic modeling, a regional explanation of the ML model is generated.
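A toy sketch of the perturbation loop above, under stated assumptions: topics are plain term sets, a document is assigned the topic with the greatest term overlap, and perturbation deletes the topic's terms (the patent leaves the topic model and perturbation scheme open). The recursive topic-splitting step is only noted in a comment.

```python
def regional_explanation(model, docs, topics, threshold=0.5):
    """Assign each document a topic by term overlap, perturb member
    documents by deleting the topic's terms, and report topics whose
    mean change in model inference exceeds the threshold -- per the
    abstract, such topics would be split into finer-grained new topics."""
    def assign(doc):
        words = set(doc.lower().split())
        return max(topics, key=lambda t: len(words & topics[t]))

    def perturb(doc, terms):
        return " ".join(w for w in doc.split() if w.lower() not in terms)

    impact = {}
    for name, terms in topics.items():
        members = [d for d in docs if assign(d) == name]
        if members:
            diffs = [abs(model(d) - model(perturb(d, terms))) for d in members]
            impact[name] = sum(diffs) / len(diffs)
    return {t: imp for t, imp in impact.items() if imp > threshold}
```

Topics that survive the filter are the ones whose terms actually move the model's inferences, which is the regional explanation the abstract describes.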

    FAST, APPROXIMATE CONDITIONAL DISTRIBUTION SAMPLING

Publication No.: US20220261400A1

Publication Date: 2022-08-18

Application No.: US17179265

Filing Date: 2021-02-18

    Abstract: Techniques are described for fast approximate conditional sampling by randomly sampling a dataset and then performing a nearest neighbor search on the pre-sampled dataset to reduce the data over which the nearest neighbor search must be performed and, according to an embodiment, to effectively reduce the number of nearest neighbors that are to be found within the random sample. Furthermore, KD-Tree-based stratified sampling is used to generate a representative sample of a dataset. KD-Tree-based stratified sampling may be used to identify the random sample for fast approximate conditional sampling, which reduces variance in the resulting data sample. As such, using KD-Tree-based stratified sampling to generate the random sample for fast approximate conditional sampling ensures that any nearest neighbor selected, for a target data instance, from the random sample is likely to be among the nearest neighbors of the target data instance within the unsampled dataset.
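The two stages described above can be sketched with numpy: a shallow median-split partitioning stands in for a full KD-Tree, one point is drawn per leaf to stratify the sample, and a brute-force nearest-neighbor search runs over only that small sample. Depth and `k` values are illustrative.

```python
import numpy as np

def kd_stratified_sample(data, depth, rng):
    """Recursively split on the median of alternating dimensions (a
    shallow KD-Tree) and draw one random point per leaf, so the sample
    covers every region of the dataset."""
    if depth == 0 or len(data) <= 1:
        return [data[rng.integers(len(data))]]
    axis = depth % data.shape[1]
    order = np.argsort(data[:, axis])
    mid = len(data) // 2
    return (kd_stratified_sample(data[order[:mid]], depth - 1, rng)
            + kd_stratified_sample(data[order[mid:]], depth - 1, rng))

def approx_conditional_sample(data, target, k=3, depth=4, seed=0):
    """Fast approximate conditional sampling: pre-sample the dataset via
    KD-Tree stratified sampling, then run the nearest-neighbor search
    only over that small representative sample."""
    rng = np.random.default_rng(seed)
    sample = np.array(kd_stratified_sample(np.asarray(data), depth, rng))
    dists = np.linalg.norm(sample - target, axis=1)
    return sample[np.argsort(dists)[:k]]
```

Because every leaf of the partition contributes a point, neighbors found in the sample are likely to be close to the target's true neighbors in the full dataset, while the search touches only 2^depth points instead of all of them.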

    POST-HOC EXPLANATION OF MACHINE LEARNING MODELS USING GENERATIVE ADVERSARIAL NETWORKS

Publication No.: US20220198277A1

Publication Date: 2022-06-23

Application No.: US17131387

Filing Date: 2020-12-22

Abstract: Herein are generative adversarial networks to ensure realistic local samples and surrogate models to provide machine learning (ML) explainability (MLX). Based on many features, an embodiment trains an ML model. The ML model infers an original inference from original values of those features. Based on the same features, a generator model is trained to generate realistic local samples that are distinct combinations of feature values for the features. A surrogate model is trained based on the generator model and based on the original inference by the ML model and/or the original feature values that the original inference is based on. Based on the surrogate model, the ML model is explained. The local samples may be weighted based on semantic similarity to the original feature values, which may facilitate training the surrogate model and/or ranking the relative importance of the features. Local sample weighting may be based on populating a random forest with the local samples.
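The surrogate-fitting stage can be illustrated with a LIME-style sketch. Several substitutions are made for brevity and are not from the patent: a caller-supplied sampler stands in for the trained GAN generator, an RBF kernel on Euclidean distance stands in for semantic-similarity weighting (the abstract mentions a random-forest-based alternative), and a weighted linear model stands in for the surrogate.

```python
import numpy as np

def explain_locally(black_box, x0, generator, n_samples=500, bandwidth=1.0, seed=0):
    """Draw local samples from a generator (stand-in for a trained GAN),
    weight them by similarity to x0, and fit a weighted linear surrogate
    whose coefficients rank the relative importance of the features."""
    rng = np.random.default_rng(seed)
    X = np.array([generator(rng) for _ in range(n_samples)])
    y = np.array([black_box(x) for x in X])
    # similarity weights: samples closer to x0 count more
    w = np.exp(-np.linalg.norm(X - x0, axis=1) ** 2 / (2 * bandwidth ** 2))
    # weighted least squares: fit y ~ X @ coef + intercept
    Xa = np.hstack([X, np.ones((len(X), 1))])
    sw = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(Xa * sw, y * np.sqrt(w), rcond=None)
    return coef[:-1]  # per-feature importance (intercept dropped)
```

The returned coefficients explain the black-box model around `x0`: large-magnitude entries mark the features whose local variation most changes the inference.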

    GENERALIZED EXPECTATION MAXIMIZATION

Publication No.: US20220027777A1

Publication Date: 2022-01-27

Application No.: US16935313

Filing Date: 2020-07-22

Abstract: Techniques are described that extend supervised machine-learning algorithms for use with semi-supervised training. Random labels are assigned to unlabeled training data, and the data is split into k partitions. During a label-training iteration, each of these k partitions is combined with the labeled training data, and the combination is used to train a single instance of the machine-learning model. Each of these trained models is then used to predict labels for data points in the k−1 partitions of previously-unlabeled training data that were not used to train the model. Thus, every data point in the previously-unlabeled training data obtains k−1 predicted labels. For each data point, these labels are aggregated to obtain a composite label prediction for the data point. After the labels are determined via one or more label-training iterations, a machine-learning model is trained on the data with the resulting composite label predictions and on the labeled data set.
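One label-training iteration from the abstract can be sketched as follows. A nearest-centroid classifier stands in for whichever supervised learner an implementation would extend, and majority vote is an assumed aggregation rule for the k−1 predictions per point; the abstract specifies neither.

```python
import numpy as np

def centroid_fit(X, y):
    """Train a nearest-centroid model: one mean vector per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def centroid_predict(model, X):
    """Predict the class whose centroid is nearest to each row of X."""
    classes = sorted(model)
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[np.argmin(d, axis=0)]

def composite_labels(Xl, yl, Xu, k=3, seed=0):
    """One label-training iteration: assign random labels to the
    unlabeled points Xu, split them into k partitions, train a model on
    the labeled data (Xl, yl) plus one partition, predict the other k-1
    partitions, and aggregate each point's k-1 predictions by vote."""
    rng = np.random.default_rng(seed)
    y_rand = rng.choice(np.unique(yl), size=len(Xu))
    parts = np.array_split(rng.permutation(len(Xu)), k)
    votes = [[] for _ in range(len(Xu))]
    for i in range(k):
        model = centroid_fit(np.vstack([Xl, Xu[parts[i]]]),
                             np.concatenate([yl, y_rand[parts[i]]]))
        for j in range(k):
            if j != i:
                for idx, p in zip(parts[j], centroid_predict(model, Xu[parts[j]])):
                    votes[idx].append(p)
    # composite label: majority vote over each point's k-1 predictions
    return np.array([max(set(v), key=v.count) for v in votes])
```

The composite labels would then feed the final training step, in which a model is trained on the labeled set together with the previously-unlabeled data carrying these composite labels.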
