CLEANING AND ORGANIZING SCHEMALESS SEMI-STRUCTURED DATA FOR EXTRACT, TRANSFORM, AND LOAD PROCESSING

    Publication No.: US20240193176A1

    Publication Date: 2024-06-13

    Application No.: US18581856

    Filing Date: 2024-02-20

    IPC Classification: G06F16/25, G06F16/21, G06F16/28

    Abstract: In some implementations, a system may obtain, from a first data repository, a first dataset that includes event data associated with a generic schema. The system may infer an event-specific schema that defines an organizational structure for the event data based on common attributes identified among a plurality of events included in the event data using one or more data analytics functions. The system may store, in a second data repository, a second dataset in which the event data is partitioned based on the organizational structure defined by the event-specific schema. The system may generate a third dataset that includes a subset of the event data included in the second dataset that satisfies one or more registration parameters related to an extract, transform, load (ETL) use case. The system may provide the third dataset to an ETL system configured to process the third dataset based on the ETL use case.
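
    The schema-inference and partitioning steps described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the event layout (`event_type` plus a free-form `payload`), the helper names `infer_event_schemas` and `partition_events`, and the `min_share` threshold used as a stand-in for the "data analytics functions" that identify common attributes.

```python
from collections import Counter, defaultdict


def infer_event_schemas(events, min_share=0.8):
    """Infer one attribute list per event type (hypothetical helper).

    An attribute counts as "common" when it appears in at least
    `min_share` of the payloads of that event type. The threshold is an
    illustrative assumption, not a value taken from the patent.
    """
    payloads_by_type = defaultdict(list)
    for event in events:
        payloads_by_type[event["event_type"]].append(event["payload"])

    schemas = {}
    for event_type, payloads in payloads_by_type.items():
        counts = Counter(key for payload in payloads for key in payload)
        schemas[event_type] = sorted(
            key for key, n in counts.items() if n / len(payloads) >= min_share
        )
    return schemas


def partition_events(events, schemas):
    """Partition events by type, keeping only the inferred attributes."""
    partitions = defaultdict(list)
    for event in events:
        fields = schemas[event["event_type"]]
        row = {name: event["payload"].get(name) for name in fields}
        partitions[event["event_type"]].append(row)
    return dict(partitions)


# First dataset: events that share only a generic envelope.
events = [
    {"event_type": "login", "payload": {"user": "a", "ip": "10.0.0.1"}},
    {"event_type": "login", "payload": {"user": "b", "ip": "10.0.0.2", "debug": True}},
    {"event_type": "purchase", "payload": {"user": "a", "amount": 12.5}},
]

schemas = infer_event_schemas(events)
partitions = partition_events(events, schemas)  # stands in for the second dataset
```

    In this sketch the rarely seen `debug` attribute is dropped from the `login` schema because it appears in only half of the login events. A production system would more likely write the partitions to a columnar store than keep them in memory.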

    CLEANING AND ORGANIZING SCHEMALESS SEMI-STRUCTURED DATA FOR EXTRACT, TRANSFORM, AND LOAD PROCESSING

    Publication No.: US20240012827A1

    Publication Date: 2024-01-11

    Application No.: US17810715

    Filing Date: 2022-07-05

    IPC Classification: G06F16/25, G06F16/21, G06F16/28

    Abstract: In some implementations, a system may obtain, from a first data repository, a first dataset that includes event data associated with a generic schema. The system may infer an event-specific schema that defines an organizational structure for the event data based on common attributes identified among a plurality of events included in the event data using one or more data analytics functions. The system may store, in a second data repository, a second dataset in which the event data is partitioned based on the organizational structure defined by the event-specific schema. The system may generate a third dataset that includes a subset of the event data included in the second dataset that satisfies one or more registration parameters related to an extract, transform, load (ETL) use case. The system may provide the third dataset to an ETL system configured to process the third dataset based on the ETL use case.
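
    The final step in the abstract, selecting the subset of partitioned event data that satisfies registration parameters for an ETL use case, might look like the following sketch. The `Registration` class, its fields, and the `select_for_etl` helper are hypothetical. The patent does not specify what form the registration parameters take, so this example assumes an event type, a list of required fields, and optional equality filters.

```python
from dataclasses import dataclass, field


@dataclass
class Registration:
    """Hypothetical registration parameters for one ETL use case."""
    event_type: str
    required_fields: tuple
    equals: dict = field(default_factory=dict)


def select_for_etl(partitions, registration):
    """Build the third dataset: rows of the registered event type that
    carry every required field and match every equality filter."""
    selected = []
    for row in partitions.get(registration.event_type, []):
        # Skip rows with a missing or null required field.
        if any(row.get(name) is None for name in registration.required_fields):
            continue
        # Skip rows that fail any equality filter.
        if any(row.get(key) != value for key, value in registration.equals.items()):
            continue
        # Project the row down to the fields the ETL use case registered for.
        selected.append({name: row[name] for name in registration.required_fields})
    return selected


# Second dataset: event data already partitioned by event-specific schema.
partitions = {
    "purchase": [
        {"user": "a", "amount": 12.5, "currency": "USD"},
        {"user": "b", "amount": None, "currency": "USD"},
        {"user": "c", "amount": 3.0, "currency": "EUR"},
    ],
}

registration = Registration(
    event_type="purchase",
    required_fields=("user", "amount"),
    equals={"currency": "USD"},
)

third_dataset = select_for_etl(partitions, registration)
```

    Here only the first purchase row survives: the second has a null `amount`, and the third fails the `currency` filter. The resulting list is what would be handed to the ETL system in this simplified model.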