Abstract:
Techniques herein perform workload-balanced graph partitioning. Each graph partition is distributed to a respective computer. Each computer applies a workload-estimation function to its partition to calculate a numeric workload-value that indicates how much computation the partition needs. Each computer sends its numeric workload-value to a master computer. The master compares the highest and lowest numeric workload-values. If the difference exceeds a threshold, the master determines how much work overloaded-computers should offload to under-utilized computers. To each overloaded-computer, the master sends a directive with a balancing numeric workload-value that indicates how much computation to offload and an identifier of an under-utilized computer to receive the offload. Based on this directive and the workload-estimation function, an overloaded-computer selects a portion of its partition that corresponds to the balancing numeric workload-value, removes that portion from its partition, and transfers the portion to the under-utilized computer, which adds the portion to its partition.
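The master's decision step can be pictured with a short sketch. The Python below is only illustrative: the computer identifiers, the threshold value, and the one-to-one pairing of overloaded computers with under-utilized computers are assumptions made for demonstration, not the claimed implementation.

def plan_rebalancing(workloads, threshold):
    """workloads: dict mapping computer id -> numeric workload-value."""
    highest = max(workloads.values())
    lowest = min(workloads.values())
    if highest - lowest <= threshold:
        return []                                  # balanced enough; no directives sent
    mean = sum(workloads.values()) / len(workloads)
    overloaded = sorted((w for w in workloads if workloads[w] > mean),
                        key=workloads.get, reverse=True)
    underused = sorted((w for w in workloads if workloads[w] < mean),
                       key=workloads.get)
    directives = []
    for src, dst in zip(overloaded, underused):
        surplus = workloads[src] - mean            # computation the sender should shed
        deficit = mean - workloads[dst]            # computation the receiver can absorb
        # directive: (overloaded computer, balancing numeric workload-value, receiver id)
        directives.append((src, min(surplus, deficit), dst))
    return directives

print(plan_rebalancing({"w1": 120.0, "w2": 80.0, "w3": 40.0}, threshold=10.0))
# [('w1', 40.0, 'w3')]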
Abstract:
Techniques for generating and transferring bulk messages from one computing device to another computing device in a cluster are provided. Each computing device in a cluster is assigned a different set of nodes of a graph. A first computing device may be assigned a particular node that is neighbors with multiple other nodes that are assigned to one or more other computing devices in the cluster. When processing graph-related code at the first computing device, information about the neighbors may be required. The first computing device receives a bulk message from one of the other computing devices. The bulk message contains information about at least a subset of the neighbors. Therefore, the first computing device is not required to send multiple messages for information about the subset of neighbors. In fact, the first computing device is not required to send any message for the information.
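A minimal Python sketch of the bulk-message idea follows; the function and argument names (build_bulk_messages, owner_of, info_of) are hypothetical. The sending device groups, per destination device, the information about its own nodes that the destination will need when processing their neighbors, so a single bulk message replaces many per-neighbor requests.

from collections import defaultdict

def build_bulk_messages(owned_nodes, edges, owner_of, info_of):
    """owned_nodes: nodes assigned to this device.
    edges: dict node -> iterable of neighbor nodes.
    owner_of: dict node -> device id that owns the node.
    info_of: dict node -> information this device can share about its node.
    Returns one bulk message (dict of node -> info) per remote device that
    owns a neighbor of any locally owned node."""
    bulk = defaultdict(dict)
    for node in owned_nodes:
        for neighbor in edges.get(node, ()):
            dest = owner_of[neighbor]
            if dest != owner_of[node]:
                # the destination device will need info about `node` when it
                # processes `neighbor`, so add it to that device's bulk message
                bulk[dest][node] = info_of[node]
    return dict(bulk)

# Example: device "A" owns n1; n1 neighbors n2 and n3, both assigned to device "B".
messages = build_bulk_messages(
    owned_nodes=["n1"],
    edges={"n1": ["n2", "n3"]},
    owner_of={"n1": "A", "n2": "B", "n3": "B"},
    info_of={"n1": {"rank": 0.25}},
)
print(messages)  # {'B': {'n1': {'rank': 0.25}}}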
Abstract:
Techniques for storing and processing graph data in a database system are provided. Graph data (or a portion thereof) that is stored in persistent storage is loaded into memory to generate an instance of a particular graph. The instance is consistent as of a particular point in time. Graph analysis operations are performed on the instance. The instance may be used by multiple users to perform graph analysis operations. Subsequent changes to the graph are stored separate from the instance. Later, the changes may be applied to the instance (or a copy thereof) to refresh the instance.
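As a rough illustration (the class names and the change log are assumptions, not the database system's actual structures), the sketch below loads an instance that is consistent as of a point in time, stores subsequent changes separately, and applies only the newer changes when the instance is refreshed.

class GraphInstance:
    def __init__(self, edges, as_of):
        self.edges = set(edges)   # snapshot of the graph's edges
        self.as_of = as_of        # point in time the instance is consistent with

class GraphStore:
    def __init__(self, edges):
        self._edges = set(edges)
        self._version = 0
        self._changes = []        # changes made after an instance was loaded

    def load_instance(self):
        # graph analysis operations run against this snapshot, shared by many users
        return GraphInstance(self._edges, self._version)

    def apply_change(self, op, edge):
        # changes are stored separately from any loaded instance
        self._version += 1
        self._changes.append((self._version, op, edge))
        (self._edges.add if op == "add" else self._edges.discard)(edge)

    def refresh(self, instance):
        # apply only the changes made after the instance's point in time
        edges = set(instance.edges)
        for version, op, edge in self._changes:
            if version > instance.as_of:
                (edges.add if op == "add" else edges.discard)(edge)
        return GraphInstance(edges, self._version)

store = GraphStore({("a", "b")})
snapshot = store.load_instance()
store.apply_change("add", ("b", "c"))
print(sorted(snapshot.edges))                 # unchanged: [('a', 'b')]
print(sorted(store.refresh(snapshot).edges))  # refreshed: [('a', 'b'), ('b', 'c')]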
Abstract:
In a computer, each of multiple anomaly detectors infers an anomaly score for each of many tuples. For each tuple, a synthetic label is generated that indicates for each anomaly detector: the anomaly detector, the anomaly score inferred by the anomaly detector for the tuple and, for each of multiple contamination factors, the contamination factor and, based on the contamination factor, a binary class of the anomaly score. For each particular anomaly detector excluding a best anomaly detector, a similarity score is measured for each contamination factor. The similarity score indicates how similar, between the particular anomaly detector and the best anomaly detector, are the binary classes of labels with that contamination factor. For each contamination factor, a combined similarity score is calculated based on the similarity scores for the contamination factor. Based on a contamination factor that has the highest combined similarity score, the computer detects that an additional anomaly detector is inaccurate.
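The agreement computation can be sketched in Python as follows. The binarization rule (the top fraction of each detector's scores, equal to the contamination factor, is labeled anomalous), the similarity metric (fraction of matching binary labels), and the 0.8 cutoff for flagging a detector are illustrative assumptions.

import numpy as np

def binarize(scores, contamination):
    # binary class per tuple: the top `contamination` fraction of scores is anomalous
    threshold = np.quantile(scores, 1.0 - contamination)
    return scores >= threshold

def choose_contamination(scores_by_detector, best, contaminations):
    """scores_by_detector: dict detector name -> array of anomaly scores (one per tuple).
    Returns the contamination factor whose binary labels agree best, on average,
    with the best detector's labels."""
    combined = {}
    for c in contaminations:
        best_labels = binarize(scores_by_detector[best], c)
        similarities = [
            np.mean(binarize(scores, c) == best_labels)   # similarity score for this detector
            for name, scores in scores_by_detector.items() if name != best
        ]
        combined[c] = float(np.mean(similarities))        # combined similarity score
    return max(combined, key=combined.get)

rng = np.random.default_rng(0)
scores = {"best": rng.random(100), "d1": rng.random(100), "d2": rng.random(100)}
c = choose_contamination(scores, best="best", contaminations=[0.05, 0.1, 0.2])
best_labels = binarize(scores["best"], c)
# with the chosen contamination factor, a detector whose labels rarely match the
# best detector's can be flagged as inaccurate (0.8 is an arbitrary cutoff here)
inaccurate = [name for name, s in scores.items()
              if name != "best" and np.mean(binarize(s, c) == best_labels) < 0.8]
print(c, inaccurate)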
Abstract:
An estimator is provided that estimates the final graph size and the peak memory usage of a graph during loading, based on sampling the graph data and using machine learning (ML) techniques. A data sampler samples the data from files or databases and estimates statistics about the final graph. The sampler also samples information about property data. Given the sampled statistics gathered and estimated by the data sampler, a graph size estimator estimates how much memory the graph processing engine requires to load the graph. The final graph size represents how much memory will be used to keep the final graph structures in memory once loading is completed. The peak memory usage represents the upper bound on memory usage that the graph processing engine reaches during loading.
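A toy Python sketch of the two-stage idea is shown below; the choice of sampled statistics, the linear regressors, and the training numbers are purely illustrative stand-ins for the learned estimators.

import numpy as np
from sklearn.linear_model import LinearRegression

def sample_statistics(edge_sample, total_edges, avg_property_bytes):
    # vertex count is extrapolated from the distinct vertices seen in the sample
    scale = total_edges / len(edge_sample)
    est_vertices = len({v for e in edge_sample for v in e}) * scale
    return [est_vertices, float(total_edges), avg_property_bytes]

# Train the estimators offline on graphs whose true sizes were measured
# (the numbers below are synthetic and purely illustrative).
train_stats = np.array([[1e3, 5e3, 16], [1e4, 8e4, 16], [1e5, 1e6, 32]])
final_sizes = np.array([0.2, 2.5, 40.0])     # MB used once loading is completed
peak_usages = np.array([0.5, 6.0, 95.0])     # MB upper bound reached while loading
final_model = LinearRegression().fit(train_stats, final_sizes)
peak_model = LinearRegression().fit(train_stats, peak_usages)

stats = sample_statistics([("a", "b"), ("b", "c"), ("c", "a")],
                          total_edges=3000, avg_property_bytes=16)
print(final_model.predict([stats])[0], peak_model.predict([stats])[0])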
Abstract:
Herein are graph machine learning explainability (MLX) techniques for invalid traffic detection. In an embodiment, a computer generates a graph that contains: a) domain vertices that represent network domains that received requests and b) address vertices that respectively represent network addresses from which the requests originated. Based on the graph, domain embeddings are generated that respectively encode the domain vertices. Based on the domain embeddings, multidomain embeddings are generated that respectively encode the network addresses. The multidomain embeddings are organized into multiple clusters of multidomain embeddings. A particular cluster is detected as suspicious. In an embodiment, an unsupervised trained graph model generates the multidomain embeddings. Based on the clusters of multidomain embeddings, feature importances are unsupervised trained. Based on the feature importances, an explanation is automatically generated for why an object is or is not suspicious. The explained object may be a cluster or other batch of network addresses or a single network address.
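The aggregation and clustering steps can be sketched as follows; the random vectors stand in for domain embeddings produced by an unsupervised trained graph model, and KMeans is only one possible clustering choice.

import numpy as np
from sklearn.cluster import KMeans

def multidomain_embeddings(requests, domain_embedding):
    """requests: dict network address -> list of domains it sent requests to.
    domain_embedding: dict domain -> embedding vector (e.g. from an unsupervised
    graph model). Mean-pools the domain embeddings per network address."""
    addresses = sorted(requests)
    matrix = np.stack([
        np.mean([domain_embedding[d] for d in requests[addr]], axis=0)
        for addr in addresses
    ])
    return addresses, matrix

rng = np.random.default_rng(0)
domain_embedding = {d: rng.normal(size=8) for d in ["d1", "d2", "d3", "d4"]}
requests = {
    "10.0.0.1": ["d1", "d2"],
    "10.0.0.2": ["d1", "d2"],      # same access pattern as 10.0.0.1
    "10.0.0.3": ["d3", "d4"],
}
addresses, emb = multidomain_embeddings(requests, domain_embedding)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
# addresses that land in the same cluster share a multidomain access pattern;
# a cluster can then be inspected and explained as suspicious or not
print(dict(zip(addresses, clusters)))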
Abstract:
Techniques for selecting machine-learned (ML) models using diversity criteria are provided. In one technique, for each ML model of multiple ML models, output data is generated based on input data to the ML model. Multiple pairs of ML models are identified, where each ML model in the multiple pairs is from the multiple ML models. For each pair of ML models in the multiple pairs of ML models: (1) first output data that was previously generated by a first ML model in the pair is identified; (2) second output data that was previously generated by a second ML model in the pair is identified; (3) a diversity value that is based on the first and second output data is generated; and (4) the diversity value is added to a set of diversity values. A subset of the multiple ML models is selected based on the set of diversity values.
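A small Python sketch of the selection loop follows; the disagreement metric and the exhaustive search over size-k subsets are illustrative choices, not the only way to use the set of diversity values.

from itertools import combinations

def disagreement(outputs_a, outputs_b):
    # diversity value for one pair: fraction of inputs labeled differently
    return sum(a != b for a, b in zip(outputs_a, outputs_b)) / len(outputs_a)

def select_diverse(model_outputs, k):
    """model_outputs: dict model name -> list of outputs on shared input data.
    Evaluates every size-k subset and keeps the one with the largest summed
    pairwise diversity."""
    diversity = {
        pair: disagreement(model_outputs[pair[0]], model_outputs[pair[1]])
        for pair in combinations(model_outputs, 2)
    }
    best_subset, best_score = None, -1.0
    for subset in combinations(model_outputs, k):
        score = sum(diversity[pair] for pair in combinations(subset, 2))
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset

outputs = {"m1": [0, 0, 1, 1], "m2": [0, 0, 1, 1], "m3": [1, 0, 0, 1]}
print(select_diverse(outputs, k=2))   # prefers a pair that disagrees: ('m1', 'm3')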
Abstract:
From many features and many multidimensional points, a computer generates exploratory training configurations. Each point contains a value for each of the features. Each exploratory training configuration identifies a random subset of the features and a random subset of the points. A performance score is generated for each of the exploratory training configurations. A feature weight is generated for each of the features that is based on the performance scores of the exploratory training configurations whose random subset of features contains the feature. A point weight is generated for each of the points that is based on the performance scores of the exploratory training configurations whose random subset of the many points contains the point. A machine learning model is trained using an optimized training corpus that consists of a subset of the many features based on feature weight and a subset of the many points based on point weight.
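The weighting scheme can be sketched as below; the score function (variance of the selected sub-matrix) is a stand-in for training and validating a model on each exploratory configuration, and the 50% subset sizes are arbitrary.

import numpy as np

def explore(X, y, n_configs, score_fn, rng):
    n_points, n_features = X.shape
    feature_scores = [[] for _ in range(n_features)]
    point_scores = [[] for _ in range(n_points)]
    for _ in range(n_configs):
        # each exploratory configuration: a random subset of features and points
        feats = rng.choice(n_features, size=max(1, n_features // 2), replace=False)
        pts = rng.choice(n_points, size=max(1, n_points // 2), replace=False)
        score = score_fn(X[np.ix_(pts, feats)], y[pts])   # performance of this configuration
        for f in feats:
            feature_scores[f].append(score)
        for p in pts:
            point_scores[p].append(score)
    # a feature's (or point's) weight is based on the scores of the configurations
    # whose random subset contained it
    feature_weight = np.array([np.mean(s) if s else 0.0 for s in feature_scores])
    point_weight = np.array([np.mean(s) if s else 0.0 for s in point_scores])
    return feature_weight, point_weight

rng = np.random.default_rng(0)
X, y = rng.normal(size=(40, 6)), rng.integers(0, 2, size=40)
fw, pw = explore(X, y, n_configs=20, score_fn=lambda Xs, ys: float(Xs.var()), rng=rng)
top_features = np.argsort(fw)[-3:]    # optimized corpus: highest-weighted features
top_points = np.argsort(pw)[-20:]     # and highest-weighted points
print(top_features, top_points.shape)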
Abstract:
A graph processing engine is provided for executing a graph query comprising a parent query and a subquery nested within the parent query. The subquery is an existential subquery, uses a reference to one or more correlated variables from the parent query, is inlined in the parent query pattern matching, does not have a post-processing phase, does not contain any global aggregation operations, uses a reference to at most one non-correlated variable, and does not include any filters on a non-correlated variable. Executing the graph query comprises initiating execution of the parent query, responsive to the parent query matching the one or more correlated variables in an intermediate result set, executing the subquery by applying a neighbor pattern matching operator that checks for existence of an edge, and resuming execution of the parent query based on results of the neighbor pattern matching operation.
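A toy Python sketch of the execution flow follows; the query shape (persons that own something), the vertex and edge encodings, and the operator names are assumptions made for illustration.

def match_parent(vertices, label):
    # parent query pattern: all vertices with the requested label
    return [{"v": v} for v, props in vertices.items() if props["label"] == label]

def exists_neighbor(edges, vertex, edge_label):
    # neighbor pattern matching operator: stop at the first qualifying edge
    return any(lbl == edge_label for (src, lbl, dst) in edges if src == vertex)

def run_query(vertices, edges):
    results = []
    for row in match_parent(vertices, label="person"):
        # existential subquery correlated on row["v"], inlined as an edge-existence
        # check with no post-processing phase: EXISTS (v)-[:owns]->(?)
        if exists_neighbor(edges, row["v"], edge_label="owns"):
            results.append(row)          # resume the parent query with this match
    return results

vertices = {"alice": {"label": "person"}, "bob": {"label": "person"},
            "car": {"label": "vehicle"}}
edges = [("alice", "owns", "car")]
print(run_query(vertices, edges))        # [{'v': 'alice'}]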