Publication No.: US20230418792A1
Publication Date: 2023-12-28
Application No.: US17851546
Filing Date: 2022-06-28
Inventors: Annmary Justine KOOMTHANAM, Suparna Bhattacharya, Aalap Tripathy, Sergey Serebryakov, Martin Foltin, Paolo Faraboschi
IPC Classification: G06F16/215, G06F16/25, G06F16/27, G06N20/00, G06K9/62
CPC Classification: G06F16/215, G06F16/254, G06F16/27, G06N20/00, G06K9/6256
Abstract: Systems and methods are provided for automatically constructing data lineage representations for distributed data processing pipelines. These data lineage representations (which are constructed and stored in a central repository shared by the multiple data processing sites) can be used to, among other things, clone the distributed data processing pipeline for quality assurance or debugging purposes. Examples of the presently disclosed technology are able to construct data lineage representations for distributed data processing pipelines by (1) generating a hash content value for universally identifying each data artifact of the distributed data processing pipeline across the multiple processing stages/processing sites of the distributed data processing pipeline; and (2) creating a data processing pipeline abstraction hierarchy for associating each data artifact with input and output events for given executions of given data processing stages (performed by the multiple data processing sites).
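
Illustrative sketch (not taken from the patent): the two steps named in the abstract, content hashing of artifacts and linking artifacts to the input and output events of stage executions, might look roughly like the Python below. The use of SHA-256, the `Execution` and `LineageRepository` classes, and the stage and site names are all hypothetical assumptions chosen for the example; the abstract specifies only a "hash content value" and an abstraction hierarchy.

```python
import hashlib
from dataclasses import dataclass, field


def content_hash(data: bytes) -> str:
    """Identify an artifact by its bytes, so every site derives the same ID.

    SHA-256 is an assumption; the abstract does not name a hash function.
    """
    return hashlib.sha256(data).hexdigest()


@dataclass
class Execution:
    """One run of a stage at one site, with its input and output events."""
    stage: str
    site: str
    inputs: list = field(default_factory=list)   # hashes of artifacts consumed
    outputs: list = field(default_factory=list)  # hashes of artifacts produced


@dataclass
class LineageRepository:
    """Hypothetical stand-in for the central repository shared by all sites."""
    executions: list = field(default_factory=list)

    def record(self, stage, site, inputs, outputs):
        """Store one execution, keyed by the content hashes of its artifacts."""
        execution = Execution(
            stage,
            site,
            [content_hash(d) for d in inputs],
            [content_hash(d) for d in outputs],
        )
        self.executions.append(execution)
        return execution

    def producers(self, artifact_hash):
        """Executions that emitted the artifact as an output."""
        return [e for e in self.executions if artifact_hash in e.outputs]

    def consumers(self, artifact_hash):
        """Executions that read the artifact as an input."""
        return [e for e in self.executions if artifact_hash in e.inputs]


repo = LineageRepository()
raw = b"sensor readings"
cleaned = b"cleaned readings"
repo.record("ingest", "site-A", inputs=[], outputs=[raw])
repo.record("clean", "site-B", inputs=[raw], outputs=[cleaned])

# The same bytes hash to the same ID at both sites, linking the two executions.
link = content_hash(raw)
print([e.stage for e in repo.producers(link)])
print([e.stage for e in repo.consumers(link)])
```

Because the identifier depends only on the artifact's content, two sites that never exchange metadata can still agree on which artifact an event refers to, which is what lets a shared repository stitch their executions into one lineage.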