-
Publication number: US12001456B2
Publication date: 2024-06-04
Application number: US17654858
Application date: 2022-03-15
Applicant: International Business Machines Corporation
Inventor: Chun Hua Sun, Xu Bin Cai, Xiaobo Wang, Yi Wang, Wei Wang
CPC classification number: G06F16/285, G06F16/221
Abstract: Performing a mutual exclusion data class analysis is provided. A data class group of a plurality of data class groups that a matching data class is a member of is identified. The matching data class matches data in a plurality of rows of a column in a data asset. Data classes included in the data class group that the matching data class is a member of are identified. A mutual exclusion data class is filtered from the data class group to form a filtered data class group for the column. The filtered data class group is run against the column of the data asset, decreasing processing time and resource utilization of a computer.
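The filtering step in this abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the group names, class names, and the `MUTUAL_EXCLUSIONS` table are hypothetical, and the sketch assumes that mutual exclusion is declared per matching class.

```python
from typing import Dict, List, Set

# Hypothetical data class groups (group name -> member data classes).
CLASS_GROUPS: Dict[str, List[str]] = {
    "identifiers": ["us_ssn", "us_phone", "credit_card", "tax_id"],
    "locations": ["us_zip", "us_state"],
}

# Hypothetical table: classes that cannot co-occur with a given matching class.
MUTUAL_EXCLUSIONS: Dict[str, Set[str]] = {
    "us_ssn": {"us_phone", "credit_card"},
}


def find_group(matching_class: str) -> str:
    """Return the name of the group that contains the matching data class."""
    for name, members in CLASS_GROUPS.items():
        if matching_class in members:
            return name
    raise KeyError(f"{matching_class!r} is not in any data class group")


def filter_group(matching_class: str) -> List[str]:
    """Drop classes that are mutually exclusive with the matching class."""
    members = CLASS_GROUPS[find_group(matching_class)]
    excluded = MUTUAL_EXCLUSIONS.get(matching_class, set())
    return [c for c in members if c not in excluded]
```

Only the classes returned by `filter_group` would then be run against the column, which is where the reduction in processing time would come from.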
-
Publication number: US11886468B2
Publication date: 2024-01-30
Application number: US17541704
Application date: 2021-12-03
Applicant: International Business Machines Corporation
Inventor: Xu Bin Cai, Xiaobo Wang, Chun Hua Sun, Yi Wang, Wei Wang
CPC classification number: G06F16/285, G06F16/221, G06F16/2264, G06N20/00
Abstract: Systems and methods are provided for automated classification of data using fingerprints. In embodiments, a method includes: generating, by a computing device based on predetermined rules, a fingerprint of a data column in a data set to be classified, the fingerprint comprising dimensions, wherein each of the dimensions is assigned an attribute representing a characteristic of data in the data column; determining, by the computing device, that the fingerprint matches one or more target fingerprints by comparing the fingerprint to the target fingerprints, wherein each target fingerprint is associated with a class and includes dimensions, and each dimension is assigned an attribute representing a characteristic of data in the class; and assigning, by the computing device, one or more classes to the data column based on the one or more target fingerprints, thereby generating classified data.
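A rough Python sketch of fingerprint-based column classification follows. The two dimensions (`charset`, `length`), their attribute values, and the target fingerprints are assumptions chosen for illustration; the abstract does not specify them, and exact equality is only one possible matching rule.

```python
from typing import Dict, List

Fingerprint = Dict[str, str]  # dimension name -> attribute


def make_fingerprint(values: List[str]) -> Fingerprint:
    """Derive a fingerprint from column values using simple fixed rules."""
    if all(v.isdigit() for v in values):
        charset = "digits"
    elif all(v.isalpha() for v in values):
        charset = "letters"
    else:
        charset = "mixed"
    lengths = {len(v) for v in values}
    return {
        "charset": charset,
        "length": "fixed" if len(lengths) == 1 else "variable",
    }


# Hypothetical target fingerprints, one per data class.
TARGETS: Dict[str, Fingerprint] = {
    "us_zip": {"charset": "digits", "length": "fixed"},
    "first_name": {"charset": "letters", "length": "variable"},
}


def classify(values: List[str]) -> List[str]:
    """Return every class whose target fingerprint equals the column's."""
    fingerprint = make_fingerprint(values)
    return [cls for cls, target in TARGETS.items() if target == fingerprint]
```

For example, `classify(["10001", "94105"])` yields `["us_zip"]` under these assumed targets, while a column that matches no target fingerprint yields an empty list.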
-
Publication number: US20240320234A1
Publication date: 2024-09-26
Application number: US18125882
Application date: 2023-03-24
Applicant: International Business Machines Corporation
Inventor: Yi Yang Ren, Chun Hua Sun, Xu Bin Cai, Wei Wang, Jian Ling Shi, Chun Leng, Pin Lv
IPC: G06F16/25, G06F16/215
CPC classification number: G06F16/254, G06F16/215
Abstract: An approach is disclosed that receives a new ETL job. The job includes a number of intermediate database file descriptors corresponding to a plurality of intermediate database files that are used to accomplish the new ETL job. A new data lineage graph is created that pertains to the new ETL job. The new data lineage graph is compared to a number of existing data lineage graphs with each of the existing data lineage graphs corresponding to an existing ETL job. The approach substitutes existing database files found in the existing data lineage graphs for one or more intermediate database files found in the new data lineage graph. The new ETL job is then run by utilizing the substituted database files, the result being a new final database file.
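One way to picture the lineage comparison is sketched below in Python. The graph encoding (node name mapped to an operation and its inputs), the derivation signature, and the file path are all hypothetical; the abstract does not say how graphs are compared, so structural equality of derivations is an assumption.

```python
from typing import Dict, Tuple

# node name -> (operation, input node names); sources have no inputs.
Lineage = Dict[str, Tuple[str, Tuple[str, ...]]]


def signature(graph: Lineage, node: str) -> tuple:
    """Describe how a node is derived, independent of node names."""
    operation, inputs = graph[node]
    return (operation, tuple(signature(graph, i) for i in inputs))


def plan_substitutions(
    new_graph: Lineage,
    existing_graph: Lineage,
    existing_files: Dict[str, str],
) -> Dict[str, str]:
    """Map new intermediate nodes to already materialized existing files."""
    known = {
        signature(existing_graph, node): path
        for node, path in existing_files.items()
    }
    plan = {}
    for node in new_graph:
        sig = signature(new_graph, node)
        if sig in known:
            plan[node] = known[sig]
    return plan


# Hypothetical existing job, with one intermediate file kept on disk.
existing = {
    "src": ("source:orders", ()),
    "clean": ("drop_nulls", ("src",)),
    "agg": ("sum_by_region", ("clean",)),
}
existing_files = {"clean": "/tmp/job1/clean.parquet"}

# Hypothetical new job that repeats the first transformation.
new = {
    "raw": ("source:orders", ()),
    "t1": ("drop_nulls", ("raw",)),
    "t2": ("avg_by_region", ("t1",)),
}
```

Here `plan_substitutions(new, existing, existing_files)` maps `t1` to the existing file, so only `t2` would need to be computed. The signature is sensitive to input order, which a fuller implementation would likely normalize.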
-
Publication number: US20230385252A1
Publication date: 2023-11-30
Application number: US17824200
Application date: 2022-05-25
Applicant: International Business Machines Corporation
Inventor: Xu Bin Cai, Wei Wang, Chun Hua Sun, Chun Leng, Pin Lv, Yi Yang Ren, Jian Ling Shi, Yi Wang, Tao Zhuang
IPC: G06F16/215, G06F16/23, G06F16/242
CPC classification number: G06F16/215, G06F16/2365, G06F16/2448
Abstract: An approach is provided that retrieves fingerprint configuration sets corresponding to a received data source and uses the configuration sets to generate fingerprints that correspond to the data source. These fingerprints are compared to a number of fingerprints that are stored in a repository. If a match is found, then the data quality configuration set is retrieved from the repository and used to perform a data quality analysis. On the other hand, if a match is not found, then one of the configuration sets is selected to perform the data quality analysis on the received data source and the repository is updated so that the selected fingerprint configuration set corresponds to the received data source.
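The lookup-or-register flow in this abstract might look roughly like the Python sketch below. The fingerprint definition (a hash of column names), the configuration flags `ignore_case` and `ignore_order`, and the choice of the first configuration set on a miss are assumptions made for illustration, not details from the application.

```python
import hashlib
from typing import Dict, List


def fingerprint(columns: List[str], config: Dict[str, bool]) -> str:
    """Hash the column names as directed by one fingerprint configuration."""
    names = [c.lower() for c in columns] if config.get("ignore_case") else list(columns)
    if config.get("ignore_order"):
        names = sorted(names)
    return hashlib.sha256("|".join(names).encode("utf-8")).hexdigest()


def choose_quality_config(
    columns: List[str],
    config_sets: List[Dict[str, bool]],
    repository: Dict[str, str],
    default_quality: str,
) -> str:
    """Reuse a stored data quality configuration, or register a new one."""
    # Assumes config_sets is non-empty.
    fingerprints = [fingerprint(columns, cfg) for cfg in config_sets]
    for fp in fingerprints:
        if fp in repository:
            return repository[fp]  # match found: reuse the stored configuration
    # No match: select the first configuration set and record its fingerprint.
    repository[fingerprints[0]] = default_quality
    return default_quality
```

With a single configuration set that ignores case and order, a source with columns `["ID", "Name"]` and a later source with columns `["name", "id"]` resolve to the same repository entry.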
-
Publication number: US20240386032A1
Publication date: 2024-11-21
Application number: US18317475
Application date: 2023-05-15
Applicant: International Business Machines Corporation
Inventor: Chun Hua Sun, Xu Bin Cai, Chun Leng, Wei Wang, Yi Yang Ren, Jian Ling Shi, Pin Lv, Xin Yu Wang, Yi Wang, Tao Zhuang
Abstract: New data class generation is provided. A dimension score is generated for each respective dimension of a plurality of predefined dimensions as relating to column attributes of a data asset while performing a static reference data analysis of the data asset. The dimension scores of the respective dimensions are added together to obtain a total dimension score for the data asset. It is determined whether the total dimension score of the data asset is greater than a predefined minimum dimension score threshold level. The data asset is identified as new static reference data in response to determining that the total dimension score of the data asset is greater than the predefined minimum dimension score threshold level. A new data class is generated based on the new static reference data.
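The score-and-threshold test described here can be sketched in Python as follows. The three dimensions, their scoring rules, and the threshold value are hypothetical; the abstract only states that per-dimension scores are summed and compared with a predefined minimum.

```python
from typing import Callable, Dict, List


def score_cardinality(values: List[str]) -> float:
    """Reference data tends to have few distinct values."""
    return 1.0 if len(set(values)) <= 50 else 0.0


def score_uniqueness(values: List[str]) -> float:
    """Reference data tends not to repeat codes."""
    return 1.0 if values and len(set(values)) == len(values) else 0.0


def score_short_values(values: List[str]) -> float:
    """Reference codes tend to be short."""
    return 1.0 if all(len(v) <= 20 for v in values) else 0.0


# Hypothetical predefined dimensions.
DIMENSIONS: Dict[str, Callable[[List[str]], float]] = {
    "cardinality": score_cardinality,
    "uniqueness": score_uniqueness,
    "short_values": score_short_values,
}

MIN_TOTAL_SCORE = 2.0  # hypothetical threshold


def total_dimension_score(values: List[str]) -> float:
    """Add the per-dimension scores together."""
    return sum(scorer(values) for scorer in DIMENSIONS.values())


def is_static_reference_data(values: List[str]) -> bool:
    """True when the total score is greater than the minimum threshold."""
    return total_dimension_score(values) > MIN_TOTAL_SCORE
```

Under these assumed rules, a short list of country codes passes the threshold and would become a candidate for a new data class, while a column of long free-text comments does not.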
-
Publication number: US20230297596A1
Publication date: 2023-09-21
Application number: US17654858
Application date: 2022-03-15
Applicant: International Business Machines Corporation
Inventor: Chun Hua Sun, Xu Bin Cai, Xiaobo Wang, Yi Wang, Wei Wang
CPC classification number: G06F16/285, G06F16/221
Abstract: Performing a mutual exclusion data class analysis is provided. A data class group of a plurality of data class groups that a matching data class is a member of is identified. The matching data class matches data in a plurality of rows of a column in a data asset. Data classes included in the data class group that the matching data class is a member of are identified. A mutual exclusion data class is filtered from the data class group to form a filtered data class group for the column. The filtered data class group is run against the column of the data asset, decreasing processing time and resource utilization of a computer.