- Patent title: Efficiently finding potential duplicate values in data
- Application number: US16791072
- Filing date: 2020-02-14
- Publication number: US11334603B2
- Publication date: 2022-05-17
- Inventors: Namit Kabra, Yannick Saillet
- Applicant: International Business Machines Corporation
- Applicant address: US NY Armonk
- Assignee: International Business Machines Corporation
- Current assignee: International Business Machines Corporation
- Current assignee address: US NY Armonk
- Agency: Shackelford, Bowen, McKinley & Norton, LLP
- Agent: Robert A. Voigt, Jr.
- Main classification: G06F7/00
- IPC classification: G06F7/00; G06F16/28; G06F16/25; G06F16/2455; G06F16/2457; G06F16/215
Abstract:
A method, system and computer program product for finding groups of potential duplicates in attribute values. Each attribute value of the attribute values is converted to a respective set of bigrams. All bigrams present in the attribute values may be determined. Bigrams present in the attribute values may be represented as bits. This may result in a bitmap representing the presence of the bigrams in the attribute values. The attribute values may be grouped using bitwise operations on the bitmap, where each group includes attribute values that are determined based on pairwise bigram-based similarity scores. The pairwise bigram-based similarity score reflects the number of common bigrams between two attribute values.
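The steps named in the abstract (bigram conversion, a bitmap of bigram presence, bitwise comparison, grouping) can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract says only that the score reflects the number of common bigrams, so the Jaccard ratio used here is an assumption, as are the lowercasing, the 0.6 threshold, the union-find grouping, and all function names.

```python
from itertools import combinations


def bigrams(value: str) -> set:
    """Return the set of character bigrams of a value (lowercased: an assumption)."""
    s = value.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def build_bitmap(values):
    """Map each value to an int whose set bits mark which bigrams it contains."""
    sets = [bigrams(v) for v in values]
    vocab = sorted(set().union(*sets))            # all bigrams present in the data
    bit_of = {bg: i for i, bg in enumerate(vocab)}
    return [sum(1 << bit_of[bg] for bg in s) for s in sets]


def similarity(a: int, b: int) -> float:
    """Jaccard score from bitwise AND/OR (assumed scoring, not from the source)."""
    union = a | b
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / bin(union).count("1")


def group_duplicates(values, threshold=0.6):
    """Group values whose pairwise score meets the (hypothetical) threshold."""
    masks = build_bitmap(values)
    parent = list(range(len(values)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(values)), 2):
        if similarity(masks[i], masks[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i, v in enumerate(values):
        groups.setdefault(find(i), []).append(v)
    # Only groups with more than one member are potential duplicates.
    return [g for g in groups.values() if len(g) > 1]


if __name__ == "__main__":
    sample = ["Jonathan Smith", "Jonathon Smith", "Acme Corp", "ACME Corp.", "Zebra"]
    print(group_duplicates(sample))
```

In this sketch each value becomes one integer, so the number of shared bigrams between two values is the population count of a single AND. This sketch compares every pair of values, which is quadratic in the number of values and may differ from how the patent achieves its efficiency.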
Publication/grant documents
- US20200183954A1 EFFICIENTLY FINDING POTENTIAL DUPLICATE VALUES IN DATA Publication/grant date: 2020-06-11