Abstract:
Determine a first count of first records storing a first value in a first field, a second count of second records storing a second value in a second field, and a third count of third records storing a third value in a third field. Determine a count threshold using the first, second, and third counts; a dispersion measure based on the dispersion of values stored in the second field by the first records; and another dispersion measure based on the dispersion of values stored in the third field by the first records. Train a machine-learning model to determine a dispersion measure threshold based on the dispersion measure and the other dispersion measure. If the first count is greater than the count threshold and the dispersion measure is greater than the dispersion measure threshold, create a match index based on the first and second fields. Receive a prospective record storing the first value in the first field and the second value in the second field. Use the match index to identify a record storing the first value in the first field and the second value in the second field as matching the prospective record.
Abstract:
A system receives a record which includes a string and separates the string into a number of tokens, including a token and another token. The system identifies a pattern that includes an entity, another entity, and a number of entities that equals the number of tokens, and another pattern that includes the same number of entities as the number of tokens. The system determines a combined probability that combines a probability based on the number of entries in the entity's dictionary which stores the token, and another probability based on a number of character types in the other entity that match characters in the other token. If the combined probability associated with the pattern is greater than another combined probability associated with the other pattern, the system matches the record to a system record based on recognizing the token as the entity and the other token as the other entity.
Abstract:
Some embodiments of the present invention include a method for determining a dense subset from a group of records using a graphical representation of the group of records, the graphical representation having nodes and edges, a node associated with a record from the group of records, an edge connecting two nodes associated with two related records, wherein a node is associated with a weight corresponding to a number of edges connected to the node, wherein a record is added to the dense subset based on its associated node having a highest weight and a density that satisfies a density threshold, the density being based on the content of the dense subset, and wherein the content of the dense subset is to be processed as including duplicate records.
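The greedy selection described in this abstract can be sketched as follows. This is a minimal illustration, not the claimed method: the abstract does not define density, so the sketch assumes it is the fraction of possible node pairs in the subset that are connected by an edge. The names `dense_subset`, `density`, and `density_threshold` are hypothetical.

```python
from itertools import combinations


def density(nodes, adj):
    """Assumed density: connected pairs / possible pairs within `nodes`."""
    if len(nodes) < 2:
        return 1.0
    possible = len(nodes) * (len(nodes) - 1) / 2
    actual = sum(1 for a, b in combinations(nodes, 2) if b in adj[a])
    return actual / possible


def dense_subset(edges, density_threshold=0.8):
    """Greedily grow a subset, visiting nodes from highest weight down."""
    # Adjacency sets; each edge links two related records.
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    # A node's weight is the number of edges connected to it.
    # Ties are broken by name so the result is deterministic.
    candidates = sorted(adj, key=lambda n: (-len(adj[n]), n))

    subset = []
    for node in candidates:
        trial = subset + [node]
        # Keep the node only if the subset stays dense enough.
        if density(trial, adj) >= density_threshold:
            subset = trial
    return subset


edges = [("r1", "r2"), ("r1", "r3"), ("r2", "r3"), ("r3", "r4")]
result = dense_subset(edges)
```

Here `r1`, `r2`, and `r3` are fully connected, so they form the dense subset; adding `r4` would drop the assumed density to 4/6, below the 0.8 threshold.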
Abstract:
A system creates a graph of nodes connected by edges. Each node represents a corresponding value of a corresponding attribute and is associated with a count of the corresponding value. Each edge is associated with a count of the instances in which the values represented by the corresponding connected nodes are associated with each other. The system identifies each node associated with a first count as a first set of keys, and deletes each node associated with the first count. The system identifies each edge associated with a second count as a second set of keys, and deletes each edge associated with the second count. The system identifies each node associated with a third count as a third set of keys, and deletes each node associated with the third count. The system identifies each edge associated with a fourth count as a fourth set of keys, and deletes each edge associated with the fourth count. The system uses each set of keys to search and match records.
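A rough sketch of the alternating node and edge passes may help. It rests on several assumptions the abstract does not state: records are dictionaries of attribute to value, the counts visited are increasing integers (1, then 2), the same count is used for a node pass and the edge pass that follows it, and edges touching a deleted node are dropped with it. The names `build_graph` and `extract_key_sets` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_graph(records):
    """Count each (attribute, value) node and each co-occurring node pair."""
    node_counts = Counter()
    edge_counts = Counter()
    for record in records:
        nodes = sorted(record.items())  # (attribute, value) pairs
        node_counts.update(nodes)
        edge_counts.update(combinations(nodes, 2))
    return node_counts, edge_counts


def extract_key_sets(records, counts=(1, 2)):
    """Return [node keys, edge keys, node keys, edge keys, ...] per count."""
    node_counts, edge_counts = build_graph(records)
    key_sets = []
    for c in counts:
        # Nodes with exactly this count become keys and are deleted.
        node_keys = {n for n, k in node_counts.items() if k == c}
        for n in node_keys:
            del node_counts[n]
        # Assumption: an edge cannot outlive either of its end nodes.
        edge_counts = Counter({
            e: k for e, k in edge_counts.items()
            if e[0] not in node_keys and e[1] not in node_keys
        })
        # Remaining edges with exactly this count become keys and are deleted.
        edge_keys = {e for e, k in edge_counts.items() if k == c}
        for e in edge_keys:
            del edge_counts[e]
        key_sets.extend([node_keys, edge_keys])
    return key_sets


records = [
    {"first": "Ann", "city": "Oslo"},
    {"first": "Bob", "city": "Oslo"},
    {"first": "Bob", "city": "Rome"},
    {"first": "Bob", "city": "Oslo"},
]
key_sets = extract_key_sets(records)
```

In this toy data, the rare values `Ann` and `Rome` are selective on their own and become single-value keys, while the frequent values `Bob` and `Oslo` only become useful as a combined key.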
Abstract:
A system determines a first volume of out-calls of a first out-call type made by a software container that is executing an application during a time period. The system determines a second volume of out-calls of a second out-call type made by the software container. The system determines a first ratio of the first volume to a combined volume of out-calls of all out-call types made by the software container. The system determines a second ratio of the second volume to the combined volume of out-calls of all out-call types made by the software container. The system determines a measure by comparing the first ratio to a third ratio associated with the first out-call type, and by comparing the second ratio to a fourth ratio associated with the second out-call type. The system identifies any behavior or any application type associated with the application, based on the measure.
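The ratio comparison can be illustrated with a short sketch. The abstract does not say how the comparisons are combined into a measure, so this example assumes the measure is the sum of absolute differences between observed and reference ratios, and that the application type with the smallest measure is chosen. The out-call type names, the reference profiles, and the function names are all hypothetical.

```python
def out_call_ratios(volumes):
    """Ratio of each out-call type's volume to the combined volume."""
    total = sum(volumes.values())
    if total == 0:
        return {t: 0.0 for t in volumes}
    return {t: v / total for t, v in volumes.items()}


def deviation(observed, reference):
    """Assumed measure: sum of absolute ratio differences over all types."""
    types = set(observed) | set(reference)
    return sum(abs(observed.get(t, 0.0) - reference.get(t, 0.0)) for t in types)


def classify(volumes, profiles):
    """Pick the profile whose reference ratios are closest to the observed ones."""
    ratios = out_call_ratios(volumes)
    return min(profiles, key=lambda name: deviation(ratios, profiles[name]))


# Out-call volumes observed for one container during a time period.
volumes = {"dns": 10, "http": 80, "db": 10}

# Hypothetical reference ratios per application type.
profiles = {
    "web": {"dns": 0.1, "http": 0.8, "db": 0.1},
    "batch": {"dns": 0.05, "http": 0.05, "db": 0.9},
}

app_type = classify(volumes, profiles)
```

Using ratios rather than raw volumes makes the comparison independent of how busy the container was during the period.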
Abstract:
A system tokenizes values stored in a field by multiple records. The system creates a trie from the tokenized values, each branch in the trie labeled with one of the tokenized values, each node storing a count indicating the number of the multiple records associated with a tokenized value sequence beginning from a root of the trie. The system tokenizes a value stored in the field by a prospective record. Beginning from the root of the trie, the system identifies each node corresponding to a token value sequence for the prospective record's tokenized value. Beginning from the most recently identified node for the prospective record's token value sequence, the system identifies each extending node which stores a count that satisfies a threshold, each identified extending node corresponding to another token value sequence. The system uses the other token value sequence to identify one of the multiple records that matches the prospective record.
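A minimal sketch of the counted trie and the extension lookup follows. Two points are assumptions: tokenization is lowercase whitespace splitting, and a count "satisfies a threshold" when it is at most `max_count` (flip the comparison if the intended test is a minimum). `TrieNode`, `build_trie`, and `extending_sequences` are hypothetical names.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # token -> TrieNode
        self.count = 0      # records whose token sequence passes through here


def tokenize(value):
    # Assumed tokenization: lowercase, split on whitespace.
    return value.lower().split()


def build_trie(values):
    root = TrieNode()
    for value in values:
        node = root
        for token in tokenize(value):
            node = node.children.setdefault(token, TrieNode())
            node.count += 1
    return root


def extending_sequences(root, prospective_value, max_count):
    """Token sequences that extend the prospective value and are selective enough."""
    tokens = tokenize(prospective_value)
    node = root
    for token in tokens:
        node = node.children.get(token)
        if node is None:
            return []  # the prospective sequence is not in the trie

    results = []
    stack = [(node, tokens)]
    while stack:
        current, seq = stack.pop()
        for token, child in current.children.items():
            child_seq = seq + [token]
            if child.count <= max_count:
                results.append(child_seq)        # count satisfies the threshold
            else:
                stack.append((child, child_seq))  # too common, keep extending
    return results


values = ["Acme Corp East", "Acme Corp West", "Acme Corp West", "Acme Labs"]
root = build_trie(values)
sequences = extending_sequences(root, "Acme", max_count=2)
```

Each returned sequence could then serve as a lookup key for candidate records that match the prospective record.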
Abstract:
Systems and methods are provided for matching snippets of search results to clusters of objects. A system adds a data snippet of a search result to a cluster of objects. The system calculates a confidence score for the add based on the recency, a job title, an email address, and/or a phone number associated with the data snippet. The system stores the add in the customer accessible database if the confidence score is sufficiently high for the add to be stored in the customer accessible database. The system generates a notice for review if the confidence score is not sufficiently high for the add to be stored in the customer accessible database.
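The store-or-review decision can be sketched as a weighted score checked against a threshold. Everything numeric here is an assumption made for illustration: the weights, the linear recency decay over `max_age_days`, the field names, and the 0.7 threshold do not come from the abstract.

```python
def confidence_score(snippet, cluster, max_age_days=365):
    """Hypothetical score in [0, 1] from recency and matching contact fields."""
    # Recency decays linearly from 1.0 (today) to 0.0 (max_age_days or older).
    recency = max(0.0, 1.0 - snippet["age_days"] / max_age_days)
    score = 0.4 * recency
    # Each field that agrees with the cluster adds a fixed, assumed weight.
    for field, weight in (("job_title", 0.2), ("email", 0.2), ("phone", 0.2)):
        value = snippet.get(field)
        if value is not None and value == cluster.get(field):
            score += weight
    return score


def route_add(snippet, cluster, threshold=0.7):
    """Store the add when the score is high enough, else flag it for review."""
    if confidence_score(snippet, cluster) >= threshold:
        return "store"
    return "review"


cluster = {"job_title": "CTO", "email": "a@example.com", "phone": "555-0100"}
fresh = {"age_days": 0, "job_title": "CTO", "email": "a@example.com"}
stale = {"age_days": 400, "job_title": "CTO"}

fresh_decision = route_add(fresh, cluster)
stale_decision = route_add(stale, cluster)
```

The fresh snippet matches two fields and is stored, while the stale snippet earns no recency credit and is routed to review.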
Abstract:
A system and method are provided for inferring reporting relationships from contact records. Contact records from a single company are identified, and each record is ranked based on its title. A probabilistic analysis compares the number of contacts on the current level with the number of contacts on a lower level and guesses the reporting relationships between contacts on the different levels. If the confidence score of a guessed reporting relationship is high enough, the reporting relationship is accepted and the contact records are updated.