-
Publication No.: US20230153394A1
Publication date: 2023-05-18
Application No.: US17528305
Application date: 2021-11-17
Applicant: Oracle International Corporation
Inventor: Ritesh Ahuja , Anatoly Yakovlev , Venkatanathan Varadarajan , Sandeep R. Agrawal , Hesam Fathi Moghadam , Sanjay Jinturkar , Nipun Agarwal
CPC classification number: G06K9/6227 , G06K9/6257 , G06K9/6265 , G06K9/6298 , G06N20/00
Abstract: Herein are timeseries preprocessing, model selection, and hyperparameter tuning techniques for forecasting development based on temporal statistics of a timeseries and a single feed-forward pass through a machine learning (ML) pipeline. In an embodiment, a computer hosts and operates the ML pipeline that automatically measures temporal statistic(s) of a timeseries. ML algorithm selection, cross validation, and hyperparameter tuning are based on the temporal statistics of the timeseries. The result from the ML pipeline is a rigorously trained and production-ready ML model that is validated to have increased accuracy for multiple prediction horizons. Based on the temporal statistics, efficiency is achieved by asymmetry of investment of computer resources in the tuning and training of the most promising ML algorithm(s). Compared to other approaches, this ML pipeline produces a more accurate ML model for a given amount of computer resources and consumes fewer computer resources to achieve a given accuracy.
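Illustrative sketch (not taken from the patent): the abstract does not say which temporal statistics are measured. Assuming one of them is a seasonal period estimated from the sample autocorrelation, a pipeline could measure it as below and use it later to size cross-validation folds. The function names and the toy series are hypothetical.

```python
from statistics import mean


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive integer lag."""
    mu = mean(series)
    denom = sum((x - mu) ** 2 for x in series)
    if denom == 0:
        return 0.0  # a constant series has no measurable correlation
    num = sum(
        (series[i] - mu) * (series[i + lag] - mu)
        for i in range(len(series) - lag)
    )
    return num / denom


def dominant_period(series, max_lag):
    """Return the lag in [2, max_lag] with the highest autocorrelation."""
    return max(range(2, max_lag + 1), key=lambda lag: autocorrelation(series, lag))


# Toy series that repeats every 4 steps (assumed data, for illustration only).
series = [0, 1, 2, 3] * 6
period = dominant_period(series, max_lag=8)
```

On this toy series the highest autocorrelation falls at lag 4, which matches the repetition length built into the data.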
-
12.
Publication No.: US20220138504A1
Publication date: 2022-05-05
Application No.: US17083536
Application date: 2020-10-29
Applicant: Oracle International Corporation
Inventor: Hesam Fathi Moghadam , Anatoly Yakovlev , Sandeep Agrawal , Venkatanathan Varadarajan , Robert Hopkins , Matteo Casserini , Milos Vasic , Sanjay Jinturkar , Nipun Agarwal
Abstract: In an embodiment based on computer(s), an ML model is trained to detect outliers. The ML model calculates anomaly scores that include a respective anomaly score for each item in a validation dataset. The anomaly scores are automatically organized by sorting and/or clustering. Based on the organized anomaly scores, a separation is measured that indicates fitness of the ML model. In an embodiment, a computer performs two-clustering of anomaly scores into a first organization that consists of a first normal cluster of anomaly scores and a first anomaly cluster of anomaly scores. The computer performs three-clustering of the same anomaly scores into a second organization that consists of a second normal cluster of anomaly scores, a second anomaly cluster of anomaly scores, and a middle cluster of anomaly scores. A distribution difference between the first organization and the second organization is measured. An ML model is processed based on the distribution difference.
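Illustrative sketch (not taken from the patent): the abstract does not define how the distribution difference between the two-cluster and three-cluster organizations is measured. The code below assumes a simple 1-D k-means and, as a stand-in measure, compares the size of the anomaly (highest-score) cluster in both organizations. `kmeans_1d`, `anomaly_stability`, and the score lists are hypothetical.

```python
def kmeans_1d(scores, k, iters=50):
    """Cluster 1-D scores into k groups (k >= 2), ordered from low to high."""
    ordered = sorted(scores)
    # Initialise centers at evenly spaced positions of the sorted scores.
    centers = [ordered[(len(ordered) - 1) * i // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for s in ordered:
            nearest = min(range(k), key=lambda c: abs(s - centers[c]))
            clusters[nearest].append(s)
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments no longer change
        centers = new_centers
    return clusters


def anomaly_stability(scores):
    """Ratio of anomaly-cluster sizes between two- and three-clustering.

    Both anomaly clusters are the top end of the sorted scores, so the
    smaller size over the larger size equals their Jaccard overlap.
    """
    a2 = len(kmeans_1d(scores, 2)[-1])
    a3 = len(kmeans_1d(scores, 3)[-1])
    return min(a2, a3) / max(a2, a3)


well_separated = [1, 2, 3, 4, 5, 6, 40, 42, 45]  # clear gap before anomalies
evenly_spread = list(range(1, 10))               # no natural gap
stable = anomaly_stability(well_separated)
unstable = anomaly_stability(evenly_spread)
```

Under this assumed measure, a value near 1 means the anomaly cluster is the same in both organizations, which is what a model that separates anomalies cleanly would tend to produce.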
-
Publication No.: US11238035B2
Publication date: 2022-02-01
Application No.: US16814855
Application date: 2020-03-10
Applicant: Oracle International Corporation
Inventor: Hamed Ahmadi , Jian Wen , Shrikumar Hariharasubrahmanian , Sanjay Jinturkar , Nipun Agarwal
IPC: G06F16/245 , G06F16/22
Abstract: Techniques are described herein for indexing personal information in columnar data storage format based files. In an embodiment, row groups of rows that comprise a plurality of columns are stored in a set of files. Each column of a row group is stored in a chunk of column pages in the set of files. A regular expression index that indexes a particular column in the set of files is stored for each row group. The regular expression index identifies column pages in the chunk of the particular column that include a particular column value that satisfies a regular expression specified in a query. The regular expression specified in the query is evaluated against the particular column using the regular expression index.
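Illustrative sketch (not taken from the patent): one way to picture a per-row-group regular expression index is a map from a named pattern to the set of column pages that contain at least one matching value, so a query scans only those pages. The in-memory page layout, the `email` pattern, and the function names are assumptions; the actual on-disk format is not described in the abstract.

```python
import re

# Hypothetical pattern registry; the email regex is deliberately simple.
PATTERNS = {"email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")}


def build_index(pages, patterns):
    """Map each pattern name to the page numbers holding a matching value."""
    index = {name: set() for name in patterns}
    for page_no, values in enumerate(pages):
        for name, rx in patterns.items():
            if any(rx.search(v) for v in values):
                index[name].add(page_no)
    return index


def query(pages, index, patterns, name):
    """Evaluate the named regex only on the pages the index points to."""
    rx = patterns[name]
    return [v for p in sorted(index[name]) for v in pages[p] if rx.search(v)]


# One column chunk of a row group, split into four pages (made-up values).
pages = [
    ["alice", "bob"],
    ["carol@example.com", "dave"],
    ["eve"],
    ["frank@example.org"],
]
index = build_index(pages, PATTERNS)
hits = query(pages, index, PATTERNS, "email")
```

Pages 0 and 2 are never scanned at query time because the index records no match for them.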
-
Publication No.: US20210390466A1
Publication date: 2021-12-16
Application No.: US17086204
Application date: 2020-10-30
Applicant: Oracle International Corporation
Inventor: Venkatanathan Varadarajan , Sandeep R. Agrawal , Hesam Fathi Moghadam , Anatoly Yakovlev , Ali Moharrer , Jingxiao Cai , Sanjay Jinturkar , Nipun Agarwal , Sam Idicula , Nikan Chavoshi
Abstract: A proxy-based automatic non-iterative machine learning (PANI-ML) pipeline is described, which predicts machine learning model configuration performance and outputs an automatically-configured machine learning model for a target training dataset. Techniques described herein use one or more proxy models—which implement a variety of machine learning algorithms and are pre-configured with tuned hyperparameters—to estimate relative performance of machine learning model configuration parameters at various stages of the PANI-ML pipeline. The PANI-ML pipeline implements a radically new approach of rapidly narrowing the search space for machine learning model configuration parameters by performing algorithm selection followed by algorithm-specific adaptive data reduction (i.e., row- and/or feature-wise dataset sampling), and then hyperparameter tuning. Furthermore, because of the one-pass nature of the PANI-ML pipeline and because each stage of the pipeline has convergence criteria by design, the whole PANI-ML pipeline has a novel convergence property that stops the configuration search after one pass.
-
Publication No.: US20220309360A1
Publication date: 2022-09-29
Application No.: US17212163
Application date: 2021-03-25
Applicant: Oracle International Corporation
Inventor: Zahra Zohrevand , Tayler Hetherington , Karoon Rashedi Nia , Yasha Pushak , Sanjay Jinturkar , Nipun Agarwal
Abstract: Herein are techniques for topic modeling and content perturbation that provide machine learning (ML) explainability (MLX) for natural language processing (NLP). A computer hosts an ML model that infers an original inference for each of many text documents that contain many distinct terms. To each text document (TD) is assigned, based on terms in the TD, a topic that contains a subset of the distinct terms. In a perturbed copy of each TD, a perturbed subset of the distinct terms is replaced. For the perturbed copy of each TD, the ML model infers a perturbed inference. For TDs of a topic, the computer detects that a difference between original inferences of the TDs of the topic and perturbed inferences of the TDs of the topic exceeds a threshold. Based on terms in the TDs of the topic, the topic is replaced with multiple, finer-grained new topics. After sufficient topic modeling, a regional explanation of the ML model is generated.
-
Publication No.: US11451670B2
Publication date: 2022-09-20
Application No.: US17123235
Application date: 2020-12-16
Applicant: Oracle International Corporation
Inventor: Hamed Ahmadi , Ali Moharrer , Venkatanathan Varadarajan , Vaseem Akram , Nishesh Rai , Reema Hingorani , Sanjay Jinturkar , Nipun Agarwal
Abstract: Herein are machine learning (ML) techniques for unsupervised training with a corpus of signaling system 7 (SS7) messages having a diversity of called and calling parties, operation codes (opcodes) and transaction types, numbering plans and nature of address indicators, and mobile country codes and network codes. In an embodiment, a computer stores SS7 messages that are not labeled as anomalous or non-anomalous. Each SS7 message contains an opcode and other fields. For each SS7 message, the opcode of the SS7 message is stored into a respective feature vector (FV) of many FVs that are based on respective unlabeled SS7 messages. The FVs contain many distinct opcodes. Based on the FVs that contain many distinct opcodes and that are based on respective unlabeled SS7 messages, an ML model, such as a reconstructive model (for example, an autoencoder), is trained without supervision to detect an anomalous SS7 message.
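Illustrative sketch (not taken from the patent): the abstract says each message's opcode is stored into a feature vector but does not give the encoding. Assuming a one-hot encoding over the distinct opcodes seen in the unlabeled corpus, the step could look like this. The message dictionaries, their `opcode` key, and the opcode values are made up for illustration; training the autoencoder itself is out of scope here.

```python
def build_vocab(messages):
    """Assign a column index to every distinct opcode in the corpus."""
    return {op: i for i, op in enumerate(sorted({m["opcode"] for m in messages}))}


def to_feature_vector(message, vocab):
    """One-hot encode the opcode of a single message."""
    fv = [0] * len(vocab)
    fv[vocab[message["opcode"]]] = 1
    return fv


# Unlabeled messages with arbitrary opcode values (assumed data).
messages = [{"opcode": 7}, {"opcode": 2}, {"opcode": 7}, {"opcode": 45}]
vocab = build_vocab(messages)
feature_vectors = [to_feature_vector(m, vocab) for m in messages]
```

No anomaly labels appear anywhere in this step, which is consistent with the unsupervised setting the abstract describes.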
-
Publication No.: US20220261400A1
Publication date: 2022-08-18
Application No.: US17179265
Application date: 2021-02-18
Applicant: Oracle International Corporation
Inventor: Yasha Pushak , Tayler Hetherington , Karoon Rashedi Nia , Zahra Zohrevand , Sanjay Jinturkar , Nipun Agarwal
IPC: G06F16/2458 , G06N20/00
Abstract: Techniques are described for fast approximate conditional sampling by randomly sampling a dataset and then performing a nearest neighbor search on the pre-sampled dataset to reduce the data over which the nearest neighbor search must be performed and, according to an embodiment, to effectively reduce the number of nearest neighbors that are to be found within the random sample. Furthermore, KD-Tree-based stratified sampling is used to generate a representative sample of a dataset. KD-Tree-based stratified sampling may be used to identify the random sample for fast approximate conditional sampling, which reduces variance in the resulting data sample. As such, using KD-Tree-based stratified sampling to generate the random sample for fast approximate conditional sampling ensures that any nearest neighbor selected, for a target data instance, from the random sample is likely to be among the nearest neighbors of the target data instance within the unsampled dataset.
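Illustrative sketch (not taken from the patent): the core idea of sampling first and searching for nearest neighbors only within the sample can be shown with a brute-force search. This sketch swaps in plain uniform random sampling where the patent describes KD-Tree-based stratified sampling. `approx_nearest`, its parameters, and the grid dataset are hypothetical.

```python
import math
import random


def approx_nearest(dataset, target, k, sample_size, seed=0):
    """Return the k points closest to `target` within a random pre-sample."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample = rng.sample(dataset, min(sample_size, len(dataset)))
    # Brute-force search, but only over the (smaller) pre-sampled data.
    return sorted(sample, key=lambda p: math.dist(p, target))[:k]


# A 20 x 20 grid of 2-D points (assumed data, for illustration only).
dataset = [(x, y) for x in range(20) for y in range(20)]
target = (10.2, 9.7)
neighbors = approx_nearest(dataset, target, k=5, sample_size=100)
exact = approx_nearest(dataset, target, k=1, sample_size=len(dataset))
```

With `sample_size` equal to the dataset size the search is exact. Smaller samples trade accuracy for a search over fewer points, which is the trade-off the stratified sampling in the abstract is meant to improve.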
-
Publication No.: US20220198277A1
Publication date: 2022-06-23
Application No.: US17131387
Application date: 2020-12-22
Applicant: Oracle International Corporation
Inventor: Karoon Rashedi Nia , Tayler Hetherington , Zahra Zohrevand , Yasha Pushak , Sanjay Jinturkar , Nipun Agarwal
Abstract: Herein are generative adversarial networks to ensure realistic local samples and surrogate models to provide machine learning (ML) explainability (MLX). Based on many features, an embodiment trains an ML model. The ML model inferences an original inference for original feature values respectively for many features. Based on the same features, a generator model is trained to generate realistic local samples that are distinct combinations of feature values for the features. A surrogate model is trained based on the generator model and based on the original inference by the ML model and/or the original feature values that the original inference is based on. Based on the surrogate model, the ML model is explained. The local samples may be weighted based on semantic similarity to the original feature values, which may facilitate training the surrogate model and/or ranking the relative importance of the features. Local sample weighting may be based on populating a random forest with the local samples.
-
19.
Publication No.: US20220121955A1
Publication date: 2022-04-21
Application No.: US17071285
Application date: 2020-10-15
Applicant: Oracle International Corporation
Inventor: Nikan Chavoshi , Anatoly Yakovlev , Hesam Fathi Moghadam , Venkatanathan Varadarajan , Sandeep Agrawal , Ali Moharrer , Jingxiao Cai , Sanjay Jinturkar , Nipun Agarwal
Abstract: Herein, a computer generates and evaluates many preprocessor configurations for a window preprocessor that transforms a training timeseries dataset for an ML model. With each preprocessor configuration, the window preprocessor is configured. The window preprocessor then converts the training timeseries dataset into a configuration-specific point-based dataset that is based on the preprocessor configuration. The ML model is trained based on the configuration-specific point-based dataset to calculate a score for the preprocessor configuration. Based on the scores of the many preprocessor configurations, an optimal preprocessor configuration is selected for finally configuring the window preprocessor, after which, the window preprocessor can optimally transform a new timeseries dataset such as in an offline or online production environment such as for real-time processing of a live streaming timeseries.
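Illustrative sketch (not taken from the patent): a window preprocessor can be pictured as a sliding window that turns a timeseries into (window, target) rows, with each candidate window size scored and the best one kept. The scoring here is a stand-in, not real model training: it predicts the target as the window mean and reports mean absolute error. All names and the toy series are hypothetical.

```python
def window_dataset(series, window, horizon=1):
    """Convert a timeseries into point-based rows: (window values, target)."""
    rows = []
    for i in range(len(series) - window - horizon + 1):
        rows.append((series[i:i + window], series[i + window + horizon - 1]))
    return rows


def score_config(series, window):
    """Stand-in score: mean absolute error of a window-mean predictor."""
    rows = window_dataset(series, window)
    errors = [abs(sum(w) / len(w) - target) for w, target in rows]
    return sum(errors) / len(errors)


def best_window(series, candidates):
    """Pick the candidate window size with the lowest error."""
    return min(candidates, key=lambda w: score_config(series, w))


series = [0, 1, 2, 3] * 6  # toy series repeating every 4 steps
rows = window_dataset([1, 2, 3, 4, 5], window=2)
best = best_window(series, candidates=[1, 2, 4])
```

In a fuller pipeline, `score_config` would train and validate the actual ML model on the configuration-specific dataset; only the selection loop around it would stay the same.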
-
Publication No.: US20240281455A1
Publication date: 2024-08-22
Application No.: US18444454
Application date: 2024-02-16
Applicant: Oracle International Corporation
Inventor: Youssef Mohamed Saied , Mohamed Ridha Chahed , Anatoly Yakovlev , Sandeep R. Agrawal , Sanjay Jinturkar , Nipun Agarwal
CPC classification number: G06F16/285 , G06F16/2282
Abstract: Disclosed is an improved approach to implement anomaly detection, where an ensemble detection mechanism is provided. An improvement is provided for the KNN algorithm where scaling is applied to permit efficient detection of multiple categories of anomalies. Further extensions are used to optimize local anomaly detection.
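Illustrative sketch (not taken from the patent): the abstract does not describe the scaling it applies to KNN or how the ensemble is formed. For orientation only, the baseline KNN anomaly score that such an approach would build on can be written as the mean distance from each point to its k nearest neighbors. `knn_scores` and the points are hypothetical.

```python
import math


def knn_scores(points, k):
    """Mean distance from each point to its k nearest other points."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Four clustered points and one far-away point (assumed data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_scores(points, k=2)
most_anomalous = max(range(len(points)), key=lambda i: scores[i])
```

A higher score means a point sits farther from its neighbors. This brute-force version compares every pair of points, so it is only suitable for small inputs.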