Abstract:
A system tokenizes values stored in a field by multiple records. The system creates a trie from the tokenized values, each branch in the trie labeled with one of the tokenized values, each node storing a count indicating the number of the multiple records associated with a tokenized value sequence beginning from a root of the trie. The system tokenizes a value stored in the field by a prospective record. Beginning from the root of the trie, the system identifies each node corresponding to a token value sequence for the prospective record's tokenized value. Beginning from the most recently identified node for the prospective record's token value sequence, the system identifies each extending node which stores a count that satisfies a threshold, each identified extending node corresponding to another token value sequence. The system uses the other token value sequence to identify one of the multiple records that matches the prospective record.
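The counted-trie lookup described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the names `TrieNode`, `tokenize`, `build_trie`, and `candidate_sequences` are hypothetical, whitespace tokenization is assumed, and "satisfies a threshold" is assumed to mean a count at or below `max_count` (that is, a sequence shared by few enough records to be selective).

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    count: int = 0                                # records whose token sequence passes through this node
    children: dict = field(default_factory=dict)  # token -> TrieNode


def tokenize(value):
    # Assumption: lowercase whitespace tokenization.
    return value.lower().split()


def build_trie(values):
    root = TrieNode()
    for value in values:
        node = root
        for token in tokenize(value):
            node = node.children.setdefault(token, TrieNode())
            node.count += 1
    return root


def candidate_sequences(root, value, max_count):
    """Follow the prospective value's tokens from the root, then extend
    from the last matched node to nodes whose count is <= max_count."""
    node, path = root, []
    for token in tokenize(value):
        child = node.children.get(token)
        if child is None:
            break
        node = child
        path.append(token)

    results = []
    stack = [(node, path)]
    while stack:
        current, seq = stack.pop()
        for token, child in current.children.items():
            if child.count <= max_count:
                results.append(seq + [token])          # selective enough: use as a key
            else:
                stack.append((child, seq + [token]))   # too common: keep extending
    return results
```

Each returned token sequence could then serve as a lookup key for the existing records that share it.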
Abstract:
A training dataset having training instances is determined. Each training instance comprises first and second records and a label indicating whether there is a match between the first and second records. A matching score vector is determined for each such training instance and comprises components storing match scores for features extracted from field values in the first and second records. Based on the matching score vectors and a match objective function, match score thresholds are determined for the extracted features. Match rule(s), each of which comprises predicate(s), are generated. Each predicate makes a prediction on whether two records match by comparing a match score derived from the two records against a match score threshold.
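One way to picture the threshold-learning step is the sketch below. It rests on stated assumptions: the match objective function is taken to be F1 (the abstract does not name one), each feature's threshold is chosen independently from the scores observed in training, and a rule is a conjunction of per-feature predicates. The names `f1`, `learn_thresholds`, and `make_rule` are hypothetical.

```python
def f1(predictions, labels):
    # F1 score over boolean predictions and labels.
    tp = sum(p and l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum((not p) and l for p, l in zip(predictions, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def learn_thresholds(score_vectors, labels):
    """For each feature, pick the observed score that maximizes F1
    when used as a 'score >= threshold' predicate."""
    thresholds = {}
    for feature in score_vectors[0]:
        candidates = sorted({vector[feature] for vector in score_vectors})
        thresholds[feature] = max(
            candidates,
            key=lambda t: f1([v[feature] >= t for v in score_vectors], labels),
        )
    return thresholds


def make_rule(thresholds):
    """Build a match rule: every predicate must hold (assumed conjunction)."""
    def rule(score_vector):
        return all(score_vector[f] >= t for f, t in thresholds.items())
    return rule
```

A rule built this way takes the matching score vector of two candidate records and returns whether they are predicted to match.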
Abstract:
In various embodiments, a system for synchronizing data is described. The system may store data associated with a plurality of data vendors. The system may synchronize the stored data with data received from a first data vendor. The received data may be parsed by identifying data values indicated by associated metadata, and modifying the data values based on a universal data format. The system may also receive synchronization requests from a user of the service. The synchronization requests may indicate requested data and a list of processing operations. The requested data may correspond to data received from multiple data vendors. The system may perform the list of processing operations and return the data. Accordingly, the system may manage data received from multiple data vendors even if the data vendors have different synchronization conditions and provide the data in different formats. The data may be analyzed and output together to a user.
Abstract:
Some embodiments of the present invention include a method for identifying duplicate records from a group of records in a database system. The method includes generating a cluster of records from a group of records based on one or more keys; splitting the cluster of records into multiple subsets of records with each subset of records having fewer records than the cluster of records, wherein the splitting of the cluster of records into multiple subsets of records is based on a number of records in the cluster of records exceeding a threshold; causing duplicate sets of records in each of the subsets of records to be identified, wherein a duplicate set of records includes one or more records, and wherein when a duplicate set of records includes two or more records, the two or more records are duplicates of one another; merging all of the duplicate sets of records identified from the multiple subsets of records forming a first group of duplicate sets of records; and forming a representative set of records based on selecting a representative record from each of the duplicate sets in the first group of duplicate sets of records.
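The cluster, split, de-duplicate, and merge steps above might look like the following sketch. All function names are hypothetical, the split is assumed to be simple fixed-size chunking, and the duplicate test and representative picker are supplied by the caller.

```python
from collections import defaultdict


def cluster_by_key(records, key):
    # Group records that share the same key value.
    clusters = defaultdict(list)
    for record in records:
        clusters[key(record)].append(record)
    return list(clusters.values())


def split_cluster(cluster, max_size):
    # Split only when the cluster exceeds the threshold (assumed: fixed-size chunks).
    if len(cluster) <= max_size:
        return [cluster]
    return [cluster[i:i + max_size] for i in range(0, len(cluster), max_size)]


def find_duplicate_sets(subset, is_duplicate):
    # Each record joins the first set whose first member it duplicates,
    # otherwise it starts a new single-record set.
    duplicate_sets = []
    for record in subset:
        for duplicate_set in duplicate_sets:
            if is_duplicate(duplicate_set[0], record):
                duplicate_set.append(record)
                break
        else:
            duplicate_sets.append([record])
    return duplicate_sets


def representative_records(records, key, is_duplicate, max_size, pick):
    merged = []  # the "first group" of duplicate sets from all subsets
    for cluster in cluster_by_key(records, key):
        for subset in split_cluster(cluster, max_size):
            merged.extend(find_duplicate_sets(subset, is_duplicate))
    return [pick(duplicate_set) for duplicate_set in merged]
```

Because duplicates that land in different subsets are not compared with each other in this pass, the much smaller representative set is a natural input for a further comparison round.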
Abstract:
Integrating third-party vendors' APIs is described. A system identifies a current call from a client computing system to an API associated with a third-party vendor, the current call including a configuration file for calling the API. The system determines whether a previous call was made to the API. The system determines whether part of the configuration file in the current call matches a corresponding part of a configuration file in the previous call, in response to a determination that a previous call was made to the API. The system uses a previously parsed configuration set, associated with the part of the configuration file in the current call, to configure a request in the current call and/or a response to the current call, in response to a determination that the configuration file in the current call matches the configuration file in the previous call.
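The reuse of a previously parsed configuration set can be illustrated with a small cache. This is a sketch under assumptions: configuration parts are taken to be JSON text, a part "matches" a previous call when its SHA-256 digest is identical, and the class name `ConfigCache` and its methods are hypothetical.

```python
import hashlib
import json


class ConfigCache:
    """Caches parsed configuration parts per API so that a repeated,
    unchanged part is not parsed again on later calls."""

    def __init__(self):
        self._parsed = {}     # (api_name, part_name, digest) -> parsed configuration set
        self.parse_count = 0  # how many times a part was actually parsed

    def _parse(self, part_text):
        self.parse_count += 1
        return json.loads(part_text)

    def get(self, api_name, part_name, part_text):
        digest = hashlib.sha256(part_text.encode("utf-8")).hexdigest()
        cache_key = (api_name, part_name, digest)
        if cache_key not in self._parsed:
            # No previous call with a matching part: parse and remember it.
            self._parsed[cache_key] = self._parse(part_text)
        return self._parsed[cache_key]
```

Keying on a digest of each part, rather than of the whole file, lets an unchanged request section be reused even when the response section of the configuration file differs between calls.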
Abstract:
Some embodiments of the present invention include a method for determining duplicate records in multiple objects and may include combining records associated with a first object with records associated with a second object to generate a third object, wherein the first object is related to the second object; performing de-duplication on the third object to generate a combined group of duplicate sets; and from the combined group of duplicate sets, identifying at least one duplicate set associated with both the first object and the second object based on the duplicate set having at least one record associated with the first object and at least one record associated with the second object.
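A compact sketch of the cross-object step follows. It assumes de-duplication is done by grouping on a caller-supplied key; the function name `cross_object_duplicates` and the source tags `"first"` and `"second"` are hypothetical.

```python
def cross_object_duplicates(first_records, second_records, key):
    """Combine records of two related objects, de-duplicate the combined
    object by key, and keep only duplicate sets that span both objects."""
    # The combined "third object": every record tagged with its source object.
    combined = [("first", r) for r in first_records]
    combined += [("second", r) for r in second_records]

    duplicate_sets = {}
    for source, record in combined:
        duplicate_sets.setdefault(key(record), []).append((source, record))

    return [
        duplicate_set
        for duplicate_set in duplicate_sets.values()
        if {source for source, _ in duplicate_set} == {"first", "second"}
    ]
```

For example, with leads as the first object and contacts as the second, only sets containing at least one lead and at least one contact are returned.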
Abstract:
A system identifies a first number of distinct values stored in a first field by a dataset of records. The system identifies a second number of distinct values stored in a second field by the dataset of records. The system creates a trie from values stored in a field by multiple records, the field corresponding to the first field or the second field, based on comparing the first number to the second number. The system associates a node in the trie with one of the multiple records, based on a value stored in the field by the record. The system identifies a branch sequence in the trie as a key for a prospective record, based on a prospective value stored in a corresponding field by the prospective record. The system uses the key for the prospective record to identify one of the multiple records that matches the prospective record.
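The field-selection and keying steps can be sketched as below. The abstract does not say which way the comparison goes, so this example assumes the field with more distinct values is chosen (it discriminates better between records). A character-level trie and the names `choose_field`, `build_field_trie`, and `lookup` are likewise assumptions for illustration.

```python
def choose_field(records, field_a, field_b):
    # Assumption: prefer the field with more distinct values.
    distinct_a = len({r[field_a] for r in records})
    distinct_b = len({r[field_b] for r in records})
    return field_a if distinct_a >= distinct_b else field_b


def _new_node():
    return {"children": {}, "records": []}


def build_field_trie(records, field):
    root = _new_node()
    for record in records:
        node = root
        for ch in record[field].lower():
            node = node["children"].setdefault(ch, _new_node())
        node["records"].append(record)  # associate the end node with the record
    return root


def lookup(root, prospective_value):
    """Use the prospective value's branch sequence as the key."""
    node = root
    for ch in prospective_value.lower():
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["records"]
```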
Abstract:
A method for linking records from different datasets based on record similarities is described. The method includes ingesting a first dataset, including a first set of records with a first set of fields, wherein the first dataset is associated with a first vendor and a first type of data, and a second dataset, including a second set of records with a second set of fields, wherein the second dataset is associated with a second vendor and a second type of data; determining that a first record from the first set of records is similar to a second record from the second set of records based on similarities between fields in the first and second set of fields; and linking the first and second records in response to determining that the first record is similar to the second record, wherein the first and second vendors are different and/or the first and second types of data are different.
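A minimal sketch of similarity-based linking is shown below. It assumes string fields compared with the standard library's `difflib.SequenceMatcher`, an averaged score across caller-specified field pairs, and a hypothetical threshold of 0.85; none of these choices come from the abstract.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    # Case-insensitive similarity ratio in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(first_record, second_record, field_pairs):
    # field_pairs maps a field in the first dataset to its counterpart in the second.
    scores = [
        field_similarity(str(first_record[f1]), str(second_record[f2]))
        for f1, f2 in field_pairs
    ]
    return sum(scores) / len(scores)


def link_records(first_dataset, second_dataset, field_pairs, threshold=0.85):
    """Return (first_id, second_id) links for sufficiently similar records."""
    links = []
    for first_record in first_dataset:
        for second_record in second_dataset:
            if record_similarity(first_record, second_record, field_pairs) >= threshold:
                links.append((first_record["id"], second_record["id"]))
    return links
```

The explicit field pairing is what allows datasets from different vendors, with differently named fields, to be compared at all.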
Abstract:
Systems and methods are provided for creating indices and loading key-value pairs for NoSQL databases. Attributes are created that correspond to records in a NoSQL database based on corresponding record fields. An index is created based on the attributes. A memory is loaded with attributes that correspond to a subset of the index as keys in a key-value pair and identifiers that correspond to records that correspond to the attributes as values in the key-value pair. The attributes that correspond to the subset of the index are sorted in the memory. Any duplicate attributes are identified from the sorted attributes in the memory. Any identifiers that correspond to any duplicate attributes also identify records in the NoSQL database to be evaluated as potential duplicate records.
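The sort-and-scan step for finding potential duplicates can be sketched as follows. The attribute recipe (last name plus ZIP code) and the names `make_attribute` and `potential_duplicates` are hypothetical, and an in-memory list stands in for the NoSQL index subset.

```python
from itertools import groupby


def make_attribute(record):
    # Assumed attribute: normalized last name joined with ZIP code.
    return f"{record['last_name'].strip().lower()}|{record['zip']}"


def potential_duplicates(records):
    """Load (attribute, identifier) pairs, sort them by attribute, and
    report identifiers that share the same attribute."""
    pairs = sorted((make_attribute(r), r["id"]) for r in records)
    duplicates = {}
    for attribute, group in groupby(pairs, key=lambda pair: pair[0]):
        identifiers = [identifier for _, identifier in group]
        if len(identifiers) > 1:
            duplicates[attribute] = identifiers
    return duplicates
```

Sorting places equal attributes next to each other, so a single linear pass is enough to collect the identifiers of records that should be evaluated as potential duplicates.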