TRIE-BASED NORMALIZATION OF FIELD VALUES FOR MATCHING

    Publication number: US20190236178A1

    Publication date: 2019-08-01

    Application number: US15884732

    Application date: 2018-01-31

    CPC classification number: G06F16/2365 G06F16/24575 G06F16/2468

    Abstract: A system tokenizes values stored in a field by multiple records. The system creates a trie from the tokenized values, each branch in the trie labeled with one of the tokenized values, each node storing a count indicating the number of the multiple records associated with a tokenized value sequence beginning from a root of the trie. The system tokenizes a value stored in the field by a prospective record. Beginning from the root of the trie, the system identifies each node corresponding to a token value sequence for the prospective record's tokenized value. Beginning from the most recently identified node for the prospective record's token value sequence, the system identifies each extending node which stores a count that satisfies a threshold, each identified extending node corresponding to another token value sequence. The system uses the other token value sequence to identify one of the multiple records that matches the prospective record.
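
    Illustrative sketch (not taken from the patent): the steps in the abstract can be approximated as below, assuming lowercase whitespace tokenization and a simple "count >= threshold" test. The names Node, tokenize, build_trie, and normalized_sequences are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    count: int = 0  # number of records whose token sequence passes through this node
    children: dict = field(default_factory=dict)


def tokenize(value: str) -> list[str]:
    # Assumption: lowercase whitespace tokenization; the abstract does not specify one.
    return value.lower().split()


def build_trie(values: list[str]) -> Node:
    root = Node()
    for value in values:
        node = root
        for token in tokenize(value):
            node = node.children.setdefault(token, Node())
            node.count += 1
    return root


def normalized_sequences(root: Node, value: str, threshold: int) -> list[list[str]]:
    """Walk the prospective value's tokens, then extend through nodes whose count meets the threshold."""
    node, path = root, []
    for token in tokenize(value):
        if token not in node.children:
            return []  # the prospective sequence is not in the trie
        node = node.children[token]
        path.append(token)
    results, stack = [], [(node, path)]
    while stack:
        current, seq = stack.pop()
        for token, child in sorted(current.children.items()):
            if child.count >= threshold:
                results.append(seq + [token])
                stack.append((child, seq + [token]))
    return results


existing = ["Acme Corp", "Acme Corp", "Acme Corp Inc", "Acme Labs"]
trie = build_trie(existing)
print(normalized_sequences(trie, "ACME", threshold=2))  # prints [['acme', 'corp']]
```

    In this toy data, "acme corp" is the only extension shared by at least two records, so it is the candidate sequence used to look up matching records.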

    MACHINE LEARNT MATCH RULES
    2.
    Patent application

    Publication number: US20190236460A1

    Publication date: 2019-08-01

    Application number: US15882134

    Application date: 2018-01-29

    CPC classification number: G06N5/025 G06F16/951 G06N20/00

    Abstract: A training dataset having training instances is determined. Each training instance comprises a first record, a second record, and a label indicating whether there is a match between the first and second records. A matching score vector is determined for each such training instance, and comprises components storing match scores for extracted features from field values in the first and second records. Based on matching score vectors and a match objective function, match score thresholds are determined for the extracted features. Match rule(s), each of which comprises predicate(s), are generated. Each predicate makes a predication on whether two records match by comparing a match score derived from the two records against a match score threshold.
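
    Illustrative sketch (not taken from the patent): one way to read the abstract is that each feature gets a learned score threshold, and a rule is a conjunction of "score >= threshold" predicates. The objective used here (F1 per feature) is an assumption, since the abstract does not name the match objective function; all function names are hypothetical.

```python
def f1(labels, predictions):
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def learn_threshold(scores, labels):
    """Pick the observed score that maximizes F1 when used as a cut-off."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1(labels, [s >= t for s in scores]))


def learn_thresholds(score_vectors, labels, features):
    # One threshold per extracted feature (one component of the score vector).
    return {
        name: learn_threshold([vector[i] for vector in score_vectors], labels)
        for i, name in enumerate(features)
    }


def rule_matches(rule, scores):
    """A rule is a conjunction of predicates: every score must reach its threshold."""
    return all(scores[name] >= threshold for name, threshold in rule.items())


features = ["name", "email"]
score_vectors = [(0.9, 1.0), (0.8, 0.9), (0.4, 0.2), (0.7, 0.1)]
labels = [True, True, False, False]

rule = learn_thresholds(score_vectors, labels, features)
print(rule)  # prints {'name': 0.8, 'email': 0.9}
print(rule_matches(rule, {"name": 0.85, "email": 0.95}))  # prints True
```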

    Trie-based normalization of field values for matching

    Publication number: US11016959B2

    Publication date: 2021-05-25

    Application number: US15884732

    Application date: 2018-01-31

    Abstract: A system tokenizes values stored in a field by multiple records. The system creates a trie from the tokenized values, each branch in the trie labeled with one of the tokenized values, each node storing a count indicating the number of the multiple records associated with a tokenized value sequence beginning from a root of the trie. The system tokenizes a value stored in the field by a prospective record. Beginning from the root of the trie, the system identifies each node corresponding to a token value sequence for the prospective record's tokenized value. Beginning from the most recently identified node for the prospective record's token value sequence, the system identifies each extending node which stores a count that satisfies a threshold, each identified extending node corresponding to another token value sequence. The system uses the other token value sequence to identify one of the multiple records that matches the prospective record.

    Optimized subset processing for de-duplication

    Publication number: US10901996B2

    Publication date: 2021-01-26

    Application number: US15052556

    Application date: 2016-02-24

    Abstract: Some embodiments of the present invention include a method for identifying duplicate records from a group of records in a database system. The method includes generating a cluster of records from a group of records based on one or more keys; splitting the cluster of records into multiple subsets of records, with each subset of records having fewer records than the cluster of records, wherein the splitting of the cluster of records into multiple subsets of records is based on a number of records in the cluster of records exceeding a threshold; causing duplicate sets of records in each of the subsets of records to be identified, wherein a duplicate set of records includes one or more records, and wherein when a duplicate set of records includes two or more records, the two or more records are duplicates of one another; merging all of the duplicate sets of records identified from the multiple subsets of records, forming a first group of duplicate sets of records; and forming a representative set of records based on selecting a representative record from each of the duplicate sets in the first group of duplicate sets of records.
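
    Illustrative sketch (not taken from the patent): the cluster, split, de-duplicate, merge, and pick-representative steps could look roughly like this. The pairwise union-find grouping, fixed-size chunking, and "first record is the representative" choice are assumptions, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def duplicate_sets(records, is_duplicate):
    """Group records into duplicate sets (singletons allowed) by pairwise comparison."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if is_duplicate(records[i], records[j]):
            parent[find(j)] = find(i)
    groups = defaultdict(list)
    for i, record in enumerate(records):
        groups[find(i)].append(record)
    return list(groups.values())


def dedupe(records, key, is_duplicate, max_cluster_size):
    # 1. Cluster records by key.
    clusters = defaultdict(list)
    for record in records:
        clusters[key(record)].append(record)

    first_group = []
    for cluster in clusters.values():
        # 2. Split only clusters whose size exceeds the threshold.
        if len(cluster) > max_cluster_size:
            subsets = [cluster[i:i + max_cluster_size]
                       for i in range(0, len(cluster), max_cluster_size)]
        else:
            subsets = [cluster]
        # 3. Find duplicate sets per subset and merge them into one group.
        for subset in subsets:
            first_group.extend(duplicate_sets(subset, is_duplicate))

    # 4. One representative per duplicate set (assumption: the first record).
    representatives = [dup_set[0] for dup_set in first_group]
    return first_group, representatives


emails = ["ann@x.com", "Ann@x.com", "bob@x.com", "ANN@x.com", "bob@x.com"]
groups, reps = dedupe(
    emails,
    key=lambda e: e.split("@")[1],
    is_duplicate=lambda a, b: a.lower() == b.lower(),
    max_cluster_size=3,
)
print(reps)  # prints ['ann@x.com', 'bob@x.com', 'ANN@x.com', 'bob@x.com']
```

    Duplicates that land in different subsets (here "ann@x.com" and "ANN@x.com") survive as separate representatives, which is why the representative set is much smaller than the original cluster and can be compared again in a later pass.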

    INTEGRATING THIRD-PARTY VENDORS' APIs
    6.
    Patent application

    Publication number: US20190230169A1

    Publication date: 2019-07-25

    Application number: US15879083

    Application date: 2018-01-24

    Abstract: Integrating third-party vendors' APIs is described. A system identifies a current call from a client computing system to an API associated with a third-party vendor, the current call including a configuration file for calling the API. The system determines whether a previous call was made to the API. The system determines whether part of the configuration file in the current call matches a corresponding part of a configuration file in the previous call, in response to a determination that a previous call was made to the API. The system uses a previously parsed configuration set, associated with the part of the configuration file in the current call, to configure a request in the current call and/or a response to the current call, in response to a determination that the configuration file in the current call matches the configuration file in the previous call.

    CROSS OBJECTS DE-DUPLICATION
    7.
    Patent application

    Publication number: US20170286441A1

    Publication date: 2017-10-05

    Application number: US15085588

    Application date: 2016-03-30

    CPC classification number: G06F16/1748 G06F16/2365

    Abstract: Some embodiments of the present invention include a method for determining duplicate records in multiple objects and may include combining records associated with a first object with records associated with a second object to generate a third object, wherein the first object is related to the second object; performing de-duplication on the third object to generate a combined group of duplicate sets; and from the combined group of duplicate sets, identifying at least one duplicate set associated with both the first object and the second object based on the duplicate set having at least one record associated with the first object and at least one record associated with the second object.
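
    Illustrative sketch (not taken from the patent): combining two related objects, de-duplicating the combined set, and keeping only the duplicate sets that span both objects. Grouping by lowercased email stands in for the real de-duplication step, and the object and function names are hypothetical.

```python
from collections import defaultdict


def dedupe_by_email(tagged_records):
    # Stand-in de-duplication: records sharing a lowercased email form one duplicate set.
    groups = defaultdict(list)
    for source, record in tagged_records:
        groups[record["email"].lower()].append((source, record))
    return list(groups.values())


def cross_object_duplicates(first_object, second_object, dedupe):
    """Combine two objects, de-duplicate, and keep the sets that contain records from both."""
    combined = ([("first", r) for r in first_object]
                + [("second", r) for r in second_object])
    dup_sets = dedupe(combined)
    return [s for s in dup_sets if {source for source, _ in s} == {"first", "second"}]


leads = [{"email": "ann@x.com"}, {"email": "bob@x.com"}]
contacts = [{"email": "ANN@x.com"}, {"email": "cy@x.com"}]

for dup_set in cross_object_duplicates(leads, contacts, dedupe_by_email):
    print(dup_set)
# prints [('first', {'email': 'ann@x.com'}), ('second', {'email': 'ANN@x.com'})]
```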

    Match index creation
    8.
    Granted patent

    Publication number: US10817465B2

    Publication date: 2020-10-27

    Application number: US15496905

    Application date: 2017-04-25

    Abstract: A system identifies a first number of distinct values stored in a first field by a dataset of records. The system identifies a second number of distinct values stored in a second field by the dataset of records. The system creates a trie from values stored in a field by multiple records, the field corresponding to the first field or the second field, based on comparing the first number to the second number. The system associates a node in the trie with one of the multiple records, based on a value stored in the field by the record. The system identifies a branch sequence in the trie as a key for a prospective record, based on a prospective value stored in a corresponding field by the prospective record. The system uses the key for the prospective record to identify one of the multiple records that matches the prospective record.
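
    Illustrative sketch (not taken from the patent): choosing the index field by comparing distinct-value counts, building a trie over that field, and using the branch sequence followed by a prospective value as its key. The abstract does not say which way the comparison goes; preferring the field with more distinct values is an assumption, as are the character-level trie and all names.

```python
def choose_field(records, first_field, second_field):
    # Assumption: prefer the field with more distinct values (more selective keys).
    first_n = len({r[first_field] for r in records})
    second_n = len({r[second_field] for r in records})
    return first_field if first_n >= second_n else second_field


def build_index(records, field_name):
    root = {"children": {}, "records": []}
    for record_id, record in enumerate(records):
        node = root
        for ch in record[field_name].lower():
            node = node["children"].setdefault(ch, {"children": {}, "records": []})
        node["records"].append(record_id)  # associate the final node with the record
    return root


def lookup(root, value):
    """Follow the branch sequence spelled by the value; return (key, matching record ids)."""
    node, key = root, []
    for ch in value.lower():
        if ch not in node["children"]:
            return "".join(key), []
        node = node["children"][ch]
        key.append(ch)
    return "".join(key), node["records"]


records = [
    {"city": "Paris", "name": "Ann Lee"},
    {"city": "Paris", "name": "Bob Ray"},
    {"city": "Rome", "name": "Cy Poe"},
]
field_name = choose_field(records, "city", "name")
index = build_index(records, field_name)
print(field_name)                # prints name
print(lookup(index, "BOB RAY"))  # prints ('bob ray', [1])
```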

    Linking records between datasets to augment query results

    Publication number: US10810233B2

    Publication date: 2020-10-20

    Application number: US15844311

    Application date: 2017-12-15

    Abstract: A method for linking records from different datasets based on record similarities is described. The method includes ingesting a first dataset, including a first set of records with a first set of fields, wherein the first dataset is associated with a first vendor and a first type of data, and a second dataset, including a second set of records with a second set of fields, wherein the second dataset is associated with a second vendor and a second type of data; determining that a first record from the first set of records is similar to a second record from the second set of records based on similarities between fields in the first and second set of fields; and linking the first and second records in response to determining that the first record is similar to the second record, wherein the first and second vendors are different and/or the first and second types of data are different.
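
    Illustrative sketch (not taken from the patent): linking records across two datasets when their paired fields are similar enough. The similarity measure (difflib.SequenceMatcher averaged over field pairs), the 0.85 threshold, and the dataset layout are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(a: dict, b: dict, field_pairs) -> float:
    """Average string similarity over the paired fields of two records."""
    ratios = [
        SequenceMatcher(None, str(a[fa]).lower(), str(b[fb]).lower()).ratio()
        for fa, fb in field_pairs
    ]
    return sum(ratios) / len(ratios)


def link_records(first, second, field_pairs, threshold=0.85):
    # Only link across datasets that differ in vendor and/or type of data.
    if first["vendor"] == second["vendor"] and first["data_type"] == second["data_type"]:
        return []
    links = []
    for i, a in enumerate(first["records"]):
        for j, b in enumerate(second["records"]):
            if similarity(a, b, field_pairs) >= threshold:
                links.append((i, j))
    return links


crm = {
    "vendor": "vendor_a",
    "data_type": "accounts",
    "records": [{"company_name": "Acme Corp"}, {"company_name": "Globex"}],
}
news = {
    "vendor": "vendor_b",
    "data_type": "articles",
    "records": [{"org": "ACME Corp."}, {"org": "Initech"}],
}
print(link_records(crm, news, [("company_name", "org")]))  # prints [(0, 0)]
```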

    Method and system for creating indices and loading key-value pairs for NoSQL databases
    10.
    Granted patent (in force)

    Publication number: US09378263B2

    Publication date: 2016-06-28

    Application number: US13860220

    Application date: 2013-04-10

    CPC classification number: G06F17/30587 G06F17/30303 G06F17/30321

    Abstract: Systems and methods are provided for creating indices and loading key-value pairs for NoSQL databases. Attributes are created that correspond to records in a NoSQL database based on corresponding record fields. An index is created based on the attributes. A memory is loaded with attributes that correspond to a subset of the index as keys in a key-value pair and identifiers that correspond to records that correspond to the attributes as values in the key-value pair. The attributes that correspond to the subset of the index are sorted in the memory. Any duplicate attributes are identified from the sorted attributes in the memory. Any identifiers that correspond to any duplicate attributes also identify records in the NoSQL database to be evaluated as potential duplicate records.
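
    Illustrative sketch (not taken from the patent): attributes derived from record fields serve as keys, record identifiers as values; after sorting in memory, equal attributes sit next to each other and their identifiers name the potential duplicates. The attribute recipe (last name plus ZIP code) and all names are assumptions, and the sketch loads the whole index rather than a subset.

```python
from itertools import groupby
from operator import itemgetter


def make_attribute(record: dict) -> str:
    # Hypothetical attribute built from record fields.
    return f'{record["last_name"].lower()}|{record["zip"]}'


def potential_duplicates(records: dict) -> dict:
    """Map each repeated attribute to the identifiers of the records sharing it."""
    # Key-value pairs: attribute as key, record identifier as value.
    pairs = [(make_attribute(record), record_id) for record_id, record in records.items()]
    pairs.sort(key=itemgetter(0))  # sorting places equal attributes side by side
    result = {}
    for attribute, group in groupby(pairs, key=itemgetter(0)):
        ids = [record_id for _, record_id in group]
        if len(ids) > 1:
            result[attribute] = ids
    return result


records = {
    "r1": {"last_name": "Lee", "zip": "94105"},
    "r2": {"last_name": "LEE", "zip": "94105"},
    "r3": {"last_name": "Ray", "zip": "10001"},
}
print(potential_duplicates(records))  # prints {'lee|94105': ['r1', 'r2']}
```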

