Distributed random forest training with a predictor trained to balance tasks

    Publication No.: US11625640B2

    Publication date: 2023-04-11

    Application No.: US16152578

    Filing date: 2018-10-05

    IPC classification: G06N20/00 G06N7/00 G06N5/00

    Abstract: In one embodiment, a device distributes sets of training records from a training dataset for a random forest-based classifier among a plurality of workers of a computing cluster. Each worker determines whether it can perform a node split operation locally on the random forest by comparing a number of training records at the worker to a predefined threshold. The device determines, for each of the split operations, a data size and entropy measure of the training records to be used for the split operation. The device applies a machine learning-based predictor to the determined data size and entropy measure of the training records to be used for the split operation, to predict its completion time. The device coordinates the workers of the computing cluster to perform the node split operations in parallel such that the node split operations in a given batch are grouped based on their predicted completion times.
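
The batching step in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical linear model (`predict_time`, with made-up coefficients) as a stand-in for the learned completion-time predictor, and a simple sort-then-chunk grouping; the patent does not state either choice, and the names `label_entropy`, `group_into_batches`, and `pending` are illustrative only.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of the class labels at a node."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def predict_time(n_records, entropy, w_size=0.002, w_entropy=0.5, bias=0.1):
    """Hypothetical linear stand-in for the learned predictor (coefficients are assumptions)."""
    return bias + w_size * n_records + w_entropy * entropy

def group_into_batches(splits, batch_size):
    """Sort pending splits by predicted time, then chunk so each batch holds similar durations."""
    ranked = sorted(
        splits,
        key=lambda s: predict_time(len(s["labels"]), label_entropy(s["labels"])),
    )
    return [ranked[i:i + batch_size] for i in range(0, len(ranked), batch_size)]

# Four pending node splits with different data sizes and label mixes.
pending = [
    {"node": "a", "labels": [0, 1] * 500},  # large, maximally mixed
    {"node": "b", "labels": [0] * 10},      # tiny, pure
    {"node": "c", "labels": [0, 1] * 400},  # large, maximally mixed
    {"node": "d", "labels": [1] * 20},      # tiny, pure
]
batches = group_into_batches(pending, batch_size=2)
batch_nodes = [[s["node"] for s in batch] for batch in batches]
print(batch_nodes)
```

Grouping splits of similar predicted duration means that workers in the same batch tend to finish at about the same time, so fewer workers sit idle waiting for a straggler.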

    DEVICE DETECTION IN NETWORK TELEMETRY WITH TLS FINGERPRINTING

    Publication No.: US20210152526A1

    Publication date: 2021-05-20

    Application No.: US16686364

    Filing date: 2019-11-18

    IPC classification: H04L29/06 H04L12/26

    Abstract: In one embodiment, a traffic analysis service obtains telemetry data regarding encrypted traffic associated with a particular device in the network, wherein the telemetry data comprises Transport Layer Security (TLS) features of the traffic. The service determines, based on the TLS features from the obtained telemetry data, a set of one or more TLS fingerprints for the traffic associated with the particular device. The service calculates a measure of similarity between the set of one or more TLS fingerprints for the traffic associated with the particular device and a set of one or more TLS fingerprints of traffic associated with a second device. The service determines, based on the measure of similarity, that the particular device and the second device were operated by the same user.
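
As a rough illustration of comparing fingerprint sets, the sketch below hashes a few TLS handshake fields into a fingerprint and scores two devices with Jaccard similarity. The abstract only says "a measure of similarity"; Jaccard, the SHA-256 hashing scheme, the chosen fields, and the 0.5 decision threshold are all assumptions made for this example.

```python
import hashlib

def tls_fingerprint(version, cipher_suites, extensions):
    """Hash selected handshake fields into a short fingerprint (illustrative scheme)."""
    canonical = "|".join([
        str(version),
        "-".join(map(str, cipher_suites)),
        "-".join(map(str, extensions)),
    ])
    return hashlib.sha256(canonical.encode("ascii")).hexdigest()[:16]

def jaccard(a, b):
    """Jaccard similarity of two fingerprint sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Two fingerprints shared by both devices, one unique to each.
shared = {
    tls_fingerprint(771, [4865, 4866], [0, 10, 11]),
    tls_fingerprint(771, [49195, 49199], [0, 23]),
}
device_a = shared | {tls_fingerprint(770, [47, 53], [0])}
device_b = shared | {tls_fingerprint(771, [4867], [0, 43])}

similarity = jaccard(device_a, device_b)  # 2 shared / 4 distinct
SAME_USER_THRESHOLD = 0.5                 # hypothetical cut-off
same_user = similarity >= SAME_USER_THRESHOLD
print(similarity, same_user)
```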

    Scalable training of random forests for high precise malware detection

    Publication No.: US10885469B2

    Publication date: 2021-01-05

    Application No.: US15722412

    Filing date: 2017-10-02

    Abstract: In one embodiment, a device trains a machine learning-based malware classifier using a first randomly selected subset of samples from a training dataset. The classifier comprises a random decision forest. The device identifies, using at least a portion of the training dataset as input to the malware classifier, a set of misclassified samples from the training dataset that the malware classifier misclassifies. The device retrains the malware classifier using a second randomly selected subset of samples from the training dataset and the identified set of misclassified samples. The device adjusts prediction labels of individual leaves of the random decision forest of the retrained malware classifier based in part on decision changes in the forest that result from assessing the entire training dataset with the classifier. The device sends the malware classifier with the adjusted prediction labels for deployment into a network.

    Bayesian tree aggregation in decision forests to increase detection of rare malware

    Publication No.: US10728271B2

    Publication date: 2020-07-28

    Application No.: US16437417

    Filing date: 2019-06-11

    Abstract: In one embodiment, a computing device provides a feature vector as input to a random decision forest comprising a plurality of decision trees trained using a training dataset, each decision tree being configured to output a classification label prediction for the input feature vector. For each of the decision trees, the computing device determines a conditional probability of the decision tree based on a true classification label and the classification label prediction from the decision tree for the input feature vector. The computing device generates weightings for the classification label predictions from the decision trees based on the determined conditional probabilities. The computing device applies a final classification label to the feature vector based on the weightings for the classification label predictions from the decision trees.
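
One possible reading of this weighting scheme is sketched below. It assumes each tree's conditional probability is estimated as P(prediction | true label) from held-out predictions with Laplace smoothing, and that the per-tree weightings are combined as a sum of log-likelihoods. The abstract does not specify the estimator or the combination rule, so `likelihood_tables`, `aggregate`, and the toy data are illustrative assumptions.

```python
import math
from collections import Counter

def likelihood_tables(tree_preds, true_labels, classes, alpha=1.0):
    """Per tree, estimate P(prediction | true label) with Laplace smoothing."""
    class_counts = Counter(true_labels)
    tables = []
    for preds in tree_preds:
        pair_counts = Counter(zip(true_labels, preds))
        tables.append({
            y: {
                p: (pair_counts[(y, p)] + alpha) / (class_counts[y] + alpha * len(classes))
                for p in classes
            }
            for y in classes
        })
    return tables

def aggregate(tables, votes, classes):
    """Pick the label whose summed log-likelihood over all tree votes is highest."""
    scores = {
        y: sum(math.log(table[y][vote]) for table, vote in zip(tables, votes))
        for y in classes
    }
    return max(classes, key=scores.get)

classes = ["benign", "malware"]
truth = ["benign"] * 6 + ["malware"] * 2
held_out_preds = [
    ["benign"] * 6 + ["malware"] * 2,  # tree 0 separates the classes
    ["benign"] * 8,                    # tree 1 always says benign
    ["benign"] * 8,                    # tree 2 always says benign
]
tables = likelihood_tables(held_out_preds, truth, classes)

votes = ["malware", "benign", "benign"]
majority_label = Counter(votes).most_common(1)[0][0]
bayes_label = aggregate(tables, votes, classes)
print(majority_label, bayes_label)
```

In this toy case a plain majority vote returns "benign", while the weighted aggregation returns "malware", because the two trees that always answer "benign" carry almost no evidence either way.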

    Multi-Modal Models for Detecting Malicious Emails

    Publication No.: US20240333733A1

    Publication date: 2024-10-03

    Application No.: US18127501

    Filing date: 2023-03-28

    IPC classification: H04L9/40 G06V10/82

    Abstract: In some aspects, the techniques described herein relate to a method for detecting malicious emails, the method including: receiving an email, wherein the email is associated with a markup payload; determining, based on the markup payload, text data associated with the email; determining, using the text data and a first machine learning model, a first representation of the email representing text associated with the email; rendering the email to generate image data that represents a rendering of the email; determining, using the image data and a second machine learning model, a second representation of the email that represents at least the rendering of the email; and determining a prediction for the email based on the first representation and the second representation, wherein the prediction represents whether the email is predicted to be malicious based on the first representation and the second representation.

    MALWARE DETECTION USING INVERSE IMBALANCE SUBSPACE SEARCHING

    Publication No.: US20220191244A1

    Publication date: 2022-06-16

    Application No.: US17117942

    Filing date: 2020-12-10

    IPC classification: H04L29/06

    Abstract: Inverse imbalance subspace searching techniques are used to detect potential malware among samples of network communication data. A large number of samples of network communication data, such as proxy log data and/or network flows, are received and analyzed by a malware detection system. A number of the samples are associated with known malware, while other unlabeled samples are either benign or may be associated with unknown malware. An inverse imbalance subspace search may be performed, in which the sample sets are divided into subsets based on random feature thresholds, and each subset is evaluated based on the ratio of known malware samples to unlabeled samples. Unlabeled samples within subsets having high malware sample ratios may be identified, aggregated, and processed as potential malware.
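
The search described here can be sketched as repeated random axis-aligned splits. The version below is a simplified illustration only: the uniform threshold sampling, the single-split subsets, the `min_ratio` cut-off of 2.0, and the toy two-feature data are assumptions and are not taken from the patent.

```python
import random

def inverse_imbalance_search(samples, is_malware, n_trials=200, min_ratio=2.0, seed=0):
    """Flag unlabeled samples that fall into randomly cut subsets dominated by known malware."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    flagged = set()
    for _ in range(n_trials):
        feature = rng.randrange(n_features)
        values = [s[feature] for s in samples]
        threshold = rng.uniform(min(values), max(values))
        # Evaluate both sides of the random threshold as candidate subsets.
        for upper_side in (True, False):
            subset = [i for i, v in enumerate(values) if (v > threshold) == upper_side]
            malware = sum(1 for i in subset if is_malware[i])
            unlabeled = [i for i in subset if not is_malware[i]]
            # Ratio of known malware to unlabeled samples in this subset.
            if unlabeled and malware / len(unlabeled) >= min_ratio:
                flagged.update(unlabeled)
    return flagged

# Indices 0-2 are known malware; 3-7 are unlabeled. Sample 7 sits among the malware.
samples = [
    (0.90, 0.10), (0.92, 0.08), (0.97, 0.03),
    (0.10, 0.90), (0.20, 0.80), (0.30, 0.70), (0.40, 0.60), (0.95, 0.05),
]
is_malware = [True, True, True, False, False, False, False, False]

suspects = inverse_imbalance_search(samples, is_malware)
print(sorted(suspects))
```

With this toy data, only the unlabeled sample that lies in the same region as the known malware can ever land in a subset whose malware-to-unlabeled ratio reaches the cut-off.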

    Device detection in network telemetry with TLS fingerprinting

    Publication No.: US11245675B2

    Publication date: 2022-02-08

    Application No.: US16686364

    Filing date: 2019-11-18

    IPC classification: H04L29/06 H04L12/26

    Abstract: In one embodiment, a traffic analysis service obtains telemetry data regarding encrypted traffic associated with a particular device in the network, wherein the telemetry data comprises Transport Layer Security (TLS) features of the traffic. The service determines, based on the TLS features from the obtained telemetry data, a set of one or more TLS fingerprints for the traffic associated with the particular device. The service calculates a measure of similarity between the set of one or more TLS fingerprints for the traffic associated with the particular device and a set of one or more TLS fingerprints of traffic associated with a second device. The service determines, based on the measure of similarity, that the particular device and the second device were operated by the same user.

    MULTIPLE INSTANCE LEARNING MODELS FOR CYBERSECURITY USING JAVASCRIPT OBJECT NOTATION (JSON) TRAINING DATA

    Publication No.: US20230376836A1

    Publication date: 2023-11-23

    Application No.: US17749740

    Filing date: 2022-05-20

    IPC classification: G06N20/00 H04L9/40

    CPC classification: G06N20/00 H04L63/1441

    Abstract: Techniques and architecture are described for converting tree structured data such as, for example, JavaScript Object Notation (JSON) data, into multiple feature vectors to train multiple instance learning (MIL) models for providing cybersecurity in networks. In particular, a data set is provided, wherein the data set comprises a sample configured as a hierarchal tree. The sample is converted into a set of path and value pairs, e.g., flattened into a set of path and value pairs, where the path is a sequence of field names and array indices encoding a position of a value. Each path and value pair of the set of path and value pairs is converted into a respective feature vector to form a set of feature vectors. The set of feature vectors is used to train a multiple instance learning (MIL) model, wherein each feature vector has a same, fixed length.
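
The flattening step can be shown with a short sketch. The `flatten` function follows the abstract's description of path and value pairs; the `pair_to_vector` encoding (hashing the path and the value into a small count vector of length `dim`) is a hypothetical choice made here only so that every vector has the same fixed length, and the sample document is invented.

```python
import hashlib
import json

def flatten(node, path=()):
    """Yield (path, value) pairs; a path is a tuple of field names and array indices."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from flatten(child, path + (index,))
    else:
        yield path, node

def pair_to_vector(path, value, dim=8):
    """Hash the path and the value into a fixed-length count vector (illustrative encoding)."""
    vector = [0.0] * dim
    tokens = ["path:" + "/".join(map(str, path)), "value:" + repr(value)]
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vector[int.from_bytes(digest[:4], "big") % dim] += 1.0
    return vector

sample = json.loads('{"host": "example.test", "ports": [80, 443], "tls": {"version": "1.3"}}')
pairs = list(flatten(sample))
bag = [pair_to_vector(path, value) for path, value in pairs]  # one instance per pair
print(pairs)
```

Each sample becomes a bag of equal-length vectors, one per path and value pair, which is the input shape a multiple instance learning model expects.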

    Malware detection using inverse imbalance subspace searching

    Publication No.: US11799904B2

    Publication date: 2023-10-24

    Application No.: US17117942

    Filing date: 2020-12-10

    IPC classification: H04L9/40

    Abstract: Inverse imbalance subspace searching techniques are used to detect potential malware among samples of network communication data. A large number of samples of network communication data, such as proxy log data and/or network flows, are received and analyzed by a malware detection system. A number of the samples are associated with known malware, while other unlabeled samples are either benign or may be associated with unknown malware. An inverse imbalance subspace search may be performed, in which the sample sets are divided into subsets based on random feature thresholds, and each subset is evaluated based on the ratio of known malware samples to unlabeled samples. Unlabeled samples within subsets having high malware sample ratios may be identified, aggregated, and processed as potential malware.