OPTIMIZED SUBSET PROCESSING FOR DE-DUPLICATION

    Publication number: US20170242891A1

    Publication date: 2017-08-24

    Application number: US15052556

    Filing date: 2016-02-24

    CPC classification number: G06F16/24556 G06F7/32 G06F16/2455 G06F16/285

    Abstract: Some embodiments of the present invention include a method for identifying duplicate records from a group of records in a database system. The method includes generating a cluster of records from a group of records based on one or more keys; splitting the cluster of records into multiple subsets of records, with each subset of records having fewer records than the cluster of records, wherein the splitting of the cluster of records into multiple subsets of records is based on a number of records in the cluster of records exceeding a threshold; causing duplicate sets of records in each of the subsets of records to be identified, wherein a duplicate set of records includes one or more records, and wherein when a duplicate set of records includes two or more records, the two or more records are duplicates of one another; merging all of the duplicate sets of records identified from the multiple subsets of records, forming a first group of duplicate sets of records; and forming a representative set of records based on selecting a representative record from each of the duplicate sets in the first group of duplicate sets of records.
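
The steps in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the fixed-size chunking used to split a cluster, and the choice of the first record of each duplicate set as its representative are all assumptions made for the example.

```python
from collections import defaultdict


def cluster_by_key(records, key_fn):
    """Group records into clusters that share the same key."""
    clusters = defaultdict(list)
    for record in records:
        clusters[key_fn(record)].append(record)
    return list(clusters.values())


def split_cluster(cluster, threshold):
    """Split a cluster into smaller subsets only when it exceeds the threshold.

    Assumption: fixed-size chunks of at most `threshold` records.
    """
    if len(cluster) <= threshold:
        return [cluster]
    return [cluster[i:i + threshold] for i in range(0, len(cluster), threshold)]


def find_duplicate_sets(subset, is_dup):
    """Partition a subset into duplicate sets (a set may hold a single record)."""
    duplicate_sets = []
    for record in subset:
        for dup_set in duplicate_sets:
            if is_dup(dup_set[0], record):
                dup_set.append(record)
                break
        else:
            duplicate_sets.append([record])
    return duplicate_sets


def representatives(records, key_fn, is_dup, threshold):
    """Return one representative per duplicate set found across all subsets."""
    reps = []
    for cluster in cluster_by_key(records, key_fn):
        first_group = []  # merged duplicate sets from every subset of this cluster
        for subset in split_cluster(cluster, threshold):
            first_group.extend(find_duplicate_sets(subset, is_dup))
        # Assumption: the first record of each duplicate set represents it.
        reps.extend(dup_set[0] for dup_set in first_group)
    return reps
```

In this sketch, duplicates that land in different subsets each contribute a representative, so the representative set is smaller than the original cluster but can still contain duplicates of one another.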

    AUGMENTING MATCH INDICES
    4.
    Patent application

    Publication number: US20180165354A1

    Publication date: 2018-06-14

    Application number: US15590371

    Filing date: 2017-05-09

    CPC classification number: G06F16/31 G06F16/90335

    Abstract: System creates three tries based on values stored in first three fields by records. System associates node in third trie with record, based on value stored in third field by record. System associates node with first dispersion measure, based on values stored in first field by records associated with node, and with second dispersion measure, based on values stored in second field by records associated with node. System identifies branch sequence in third trie as key for prospective record, based on value stored in third field by prospective record. System uses key to identify a subset of records that match prospective record. If a count of the subset exceeds threshold, the system identifies other branch sequence in first trie or second trie as other key for prospective record, based on first dispersion measure and second dispersion measure. System uses the key and the other key to identify at least one record that matches prospective record.
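
The fallback logic described here (add a second key when the first key returns too many candidates) can be illustrated with a simplified sketch. Exact-match filtering on field values stands in for trie branch sequences, and the dispersion measure is assumed to be the ratio of distinct values to records; neither choice is taken from the patent, and all names are hypothetical.

```python
def dispersion(records, field):
    """Assumed dispersion measure: distinct values divided by record count."""
    if not records:
        return 0.0
    return len({r[field] for r in records}) / len(records)


def match(records, prospective, primary, alternates, threshold):
    """Find candidate matches, adding a second key if the first is too coarse."""
    # First key: the value of the primary field.
    subset = [r for r in records if r[primary] == prospective[primary]]
    if len(subset) <= threshold:
        return subset
    # Too many candidates: pick the alternate field whose values are most
    # dispersed within the subset, since it should narrow the subset the most.
    other = max(alternates, key=lambda f: dispersion(subset, f))
    return [r for r in subset if r[other] == prospective[other]]
```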

    OPTIMIZED MATCH KEYS FOR FIELDS WITH PREFIX STRUCTURE

    Publication number: US20180165294A1

    Publication date: 2018-06-14

    Application number: US15374924

    Filing date: 2016-12-09

    CPC classification number: G06F16/1727 G06F16/164 G06F16/9027

    Abstract: The system tokenizes values stored by records' fields, creates trie from tokenized values, each branch labeled with tokenized value, each node storing count indicating number of records associated with tokenized value sequence beginning from trie root. The system tokenizes value stored by record field, identifies nodes, beginning from trie root, corresponding to token value sequence associated with tokenized value, until node is identified that stores count that is less than node threshold. The system identifies branch sequence comprising each identified node as record's key, and associates key with node storing count less than node threshold, and record with key. The system tokenizes prospective value stored by prospective record's field, identifies nodes, beginning from trie root, corresponding to another token value sequence associated with tokenized prospective value, until another node is identified that stores another count that is less than node threshold. The system identifies other node's key as prospective record's key, identifies existing record that matches prospective record by using prospective record's key.
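
A minimal sketch of the count-bearing trie and the key derivation described above. Whitespace tokenization, the class and function names, and the behaviour when a token is missing from the trie (stop and use the prefix found so far) are assumptions made for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # token -> TrieNode
        self.count = 0      # records whose token sequence passes through here


def tokenize(value):
    """Assumption: lowercase, whitespace-separated tokens."""
    return value.lower().split()


def build_trie(values):
    """Build a trie in which each branch is labeled with one token."""
    root = TrieNode()
    for value in values:
        node = root
        for token in tokenize(value):
            node = node.children.setdefault(token, TrieNode())
            node.count += 1
    return root


def key_for(root, value, node_threshold):
    """Walk from the root until a node's count drops below the threshold.

    The branch sequence walked so far is returned as the key.
    """
    node = root
    key = []
    for token in tokenize(value):
        child = node.children.get(token)
        if child is None:
            break  # assumption: unseen token ends the walk
        key.append(token)
        node = child
        if node.count < node_threshold:
            break
    return tuple(key)
```

With the values "bank of america", "bank of china", "bank of india", and "acme corp" and a node threshold of 2, common prefixes such as "bank of" are too crowded to serve as keys, so the key extends to the first rare token, while "acme corp" is already distinguished by its first token. A prospective value is keyed the same way and looked up against existing records stored under that key.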

    LINKING RECORDS BETWEEN DATASETS TO AUGMENT QUERY RESULTS

    Publication number: US20190188313A1

    Publication date: 2019-06-20

    Application number: US15844311

    Filing date: 2017-12-15

    Abstract: A method for linking records from different datasets based on record similarities is described. The method includes ingesting a first dataset, including a first set of records with a first set of fields, wherein the first dataset is associated with a first vendor and a first type of data, and a second dataset, including a second set of records with a second set of fields, wherein the second dataset is associated with a second vendor and a second type of data; determining that a first record from the first set of records is similar to a second record from the second set of records based on similarities between fields in the first and second set of fields; and linking the first and second records in response to determining the similarity, wherein the first and second vendors are different and/or the first and second types of data are different.
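
One way to picture the field-based linking step is the sketch below. The abstract does not say how similarity is computed; here it is assumed to be the mean `difflib.SequenceMatcher` ratio over the fields the two records have in common, with a hypothetical cutoff of 0.9. All function names are illustrative.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def record_similarity(first, second):
    """Mean similarity over the fields present in both records."""
    shared = sorted(set(first) & set(second))
    if not shared:
        return 0.0
    return sum(field_similarity(first[f], second[f]) for f in shared) / len(shared)


def link_records(first_dataset, second_dataset, cutoff=0.9):
    """Return index pairs (i, j) of records judged similar enough to link."""
    links = []
    for i, first in enumerate(first_dataset):
        for j, second in enumerate(second_dataset):
            if record_similarity(first, second) >= cutoff:
                links.append((i, j))
    return links
```

The pairwise comparison is quadratic in the dataset sizes, so this form only suits small inputs.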

    METADATA DRIVEN DATASET MANAGEMENT
    7.
    Patent application

    Publication number: US20190163786A1

    Publication date: 2019-05-30

    Application number: US15828118

    Filing date: 2017-11-30

    Abstract: A method for configuring the operation of the software of a data as a service (DAAS) system during run time is described. The configuring includes at least one of configuring ingestion of a vendor dataset to produce an ingested dataset and which analysis operations to perform on the vendor dataset to produce an analyzed dataset, and the configuring also includes at least one of how to search the vendor dataset based on a search query from a customer to allow the customer to locate a new record from the vendor dataset and how to match records in the vendor dataset with a match query from the customer to provide an updated record to the customer.

    MATCH INDEX CREATION
    8.
    Patent application

    Publication number: US20180165281A1

    Publication date: 2018-06-14

    Application number: US15496905

    Filing date: 2017-04-25

    Abstract: A system identifies a first number of distinct values stored in a first field by a dataset of records. The system identifies a second number of distinct values stored in a second field by the dataset of records. The system creates a trie from values stored in a field by multiple records, the field corresponding to the first field or the second field, based on comparing the first number to the second number. The system associates a node in the trie with one of the multiple records, based on a value stored in the field by the record. The system identifies a branch sequence in the trie as a key for a prospective record, based on a prospective value stored in a corresponding field by the prospective record. The system uses the key for the prospective record to identify one of the multiple records that matches the prospective record.
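
The field selection and index lookup can be sketched as follows. The abstract says only that the field is chosen by comparing the two distinct-value counts; preferring the field with more distinct values is an assumption here, as are the character-level trie and the function names.

```python
def distinct_count(records, field):
    """Number of distinct values stored in a field across the dataset."""
    return len({r[field] for r in records})


def choose_field(records, first, second):
    """Assumption: index the field with more distinct values (more selective)."""
    if distinct_count(records, first) >= distinct_count(records, second):
        return first
    return second


def new_node():
    return {"children": {}, "records": []}


def build_index(records, field):
    """Build a character-level trie; each record hangs off its value's end node."""
    root = new_node()
    for record in records:
        node = root
        for ch in record[field].lower():
            node = node["children"].setdefault(ch, new_node())
        node["records"].append(record)
    return root


def lookup(root, value):
    """Follow the branch sequence for a prospective value and return matches."""
    node = root
    for ch in value.lower():
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["records"]
```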

    BULK DEDUPLICATION DETECTION
    9.
    Patent application

    Publication number: US20170242868A1

    Publication date: 2017-08-24

    Application number: US15052382

    Filing date: 2016-02-24

    CPC classification number: G06F17/30303 G06F7/32 G06F17/30489 G06F17/30598

    Abstract: Some embodiments of the present invention include a system and method for removing duplicate records from a group of records in a database system. The method includes generating a first cluster of records from the group of records, generating a second cluster of records from the group of records, identifying sets of duplicate records in the first cluster of records, and identifying sets of duplicate records in the second cluster of records. The method also includes merging at least two sets of duplicate records associated with both the first cluster and the second cluster of records to form a merged set of duplicate records. The merging is performed based on the at least two sets of duplicate records having a common record. Duplicate records in the group of records may then be removed by removing duplicate records from the merged set of duplicate records.
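
Merging duplicate sets that share a common record amounts to finding connected groups. The sketch below uses a union-find structure for this; that choice, and the function name, are assumptions for illustration rather than the method the patent specifies. Records are represented by hashable identifiers.

```python
def merge_duplicate_sets(duplicate_sets):
    """Merge any duplicate sets that have at least one record in common."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for dup_set in duplicate_sets:
        members = list(dup_set)
        for member in members:
            find(member)  # register every record, including singletons
        for other in members[1:]:
            union(members[0], other)

    groups = {}
    for record in list(parent):
        groups.setdefault(find(record), set()).add(record)
    return list(groups.values())
```

For example, if the first cluster yields the sets {1, 2} and {3}, and the second yields {2, 4} and {5, 6}, record 2 is common to two sets, so the merged result is {1, 2, 4}, {3}, and {5, 6}. Keeping one record from each merged set then removes the duplicates.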
