-
Publication number: US11720548B1
Publication date: 2023-08-08
Application number: US17205949
Application date: 2021-03-18
Applicant: Amazon Technologies, Inc.
Inventor: Daniel Opincariu, Yangbae Park, Sanjay Mathew Thomas
IPC: G06F16/23, G06F16/245, G06F21/62
CPC classification number: G06F16/2379, G06F16/245, G06F21/6245
Abstract: Systems, devices, and methods are provided for implementing shadow data lakes. In at least one embodiment, a deletion workflow obtains a deletion request from a delete request cache service, gets attestation details from an attestation service, submits a job to scan one or more records from a source table of a data lake and publish the one or more records to a deleted records table of a shadow data lake, and causes deletion of the one or more records from the data lake.
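The scan, publish, and delete steps named in this abstract can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the class names, the `field_name` matching rule, and the shape of the attestation dictionary are all assumptions made for illustration, and the delete request cache service and attestation service are reduced to plain arguments.

```python
from dataclasses import dataclass, field


@dataclass
class DeletionRequest:
    request_id: str
    field_name: str  # hypothetical: column the request matches on
    value: str       # hypothetical: value identifying records to delete


@dataclass
class ShadowDataLake:
    source_table: list  # rows (dicts) in the data lake
    deleted_records_table: list = field(default_factory=list)  # shadow copy

    def process(self, request: DeletionRequest, attestation: dict) -> int:
        def matches(row):
            return row.get(request.field_name) == request.value

        # 1. Scan the source table for records covered by the request.
        hits = [row for row in self.source_table if matches(row)]
        # 2. Publish the matching records to the shadow lake's deleted records table.
        for row in hits:
            self.deleted_records_table.append(
                {**row, "request_id": request.request_id, "attestation": attestation}
            )
        # 3. Delete the records from the data lake itself.
        self.source_table[:] = [row for row in self.source_table if not matches(row)]
        return len(hits)


lake = ShadowDataLake(
    source_table=[
        {"user": "u1", "item": "a"},
        {"user": "u2", "item": "b"},
        {"user": "u1", "item": "c"},
    ]
)
deleted = lake.process(
    DeletionRequest("req-1", "user", "u1"),
    {"attested_by": "example-service"},  # placeholder attestation details
)
```

After the call, the two `u1` rows live only in `deleted_records_table`, tagged with the request identifier and the attestation details.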
-
Publication number: US11531666B1
Publication date: 2022-12-20
Application number: US16998922
Application date: 2020-08-20
Applicant: Amazon Technologies, Inc.
Inventor: Yangbae Park, Laxmi Siva Prasad Balaramaraju Jalumari, Daniel Opincariu, Fletcher Liverance, Zhuonan Song
Abstract: Methods, systems, and computer-readable media for indexing partitions using distributed Bloom filters are disclosed. A data indexing system generates a plurality of indices for a plurality of partitions in a distributed object store. The indices comprise a plurality of Bloom filters. An individual one of the Bloom filters corresponds to one or more fields of an individual one of the partitions. Using the Bloom filters, the data indexing system determines a first portion of the partitions that possibly comprise a value and a second portion of the partitions that do not comprise the value. Based (at least in part) on a scan of the first portion of the partitions and not the second portion of the partitions, the data indexing system determines one or more partitions of the first portion of the partitions that comprise the value.
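As an illustration of the pruning idea in this abstract, the sketch below builds one Bloom filter per partition for a single field, uses the filters to split partitions into "possibly contains" and "definitely does not contain", and then scans only the first group. The filter size, hash construction, and all function names are assumptions chosen for brevity, not details taken from the patent.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_index(partitions, field_name):
    """Build one Bloom filter per partition over the given field."""
    index = {}
    for name, rows in partitions.items():
        bloom = BloomFilter()
        for row in rows:
            bloom.add(row[field_name])
        index[name] = bloom
    return index


def find_partitions(partitions, index, field_name, value):
    # First portion: partitions whose filter says the value may be present.
    candidates = [name for name, bloom in index.items() if bloom.might_contain(value)]
    # Scan only the candidates to weed out Bloom filter false positives.
    confirmed = [
        name
        for name in candidates
        if any(row[field_name] == value for row in partitions[name])
    ]
    return candidates, confirmed


partitions = {
    "p1": [{"customer_id": "c-1"}, {"customer_id": "c-2"}],
    "p2": [{"customer_id": "c-3"}, {"customer_id": "c-4"}],
}
index = build_index(partitions, "customer_id")
candidates, confirmed = find_partitions(partitions, index, "customer_id", "c-3")
```

Because a Bloom filter never produces false negatives, every partition that truly holds the value is among the candidates; the follow-up scan only has to remove false positives.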
-
Publication number: US11816081B1
Publication date: 2023-11-14
Application number: US17205885
Application date: 2021-03-18
Applicant: Amazon Technologies, Inc.
Inventor: Daniel Opincariu, Zhuonan Song
IPC: G06F16/22, G06F16/27, G06F16/2458, G06F16/2453
CPC classification number: G06F16/2228, G06F16/2462, G06F16/24532, G06F16/278
Abstract: Systems, devices, and methods are provided for efficient query execution on distributed data sets, such as in the context of data lakes. In at least one embodiment, indexing information is used to identify candidate and non-candidate portions of a data set. Non-candidate portions may be irrelevant to the query. Indexing information can be encoded using Bloom filters.
-
Publication number: US12277134B1
Publication date: 2025-04-15
Application number: US18478274
Application date: 2023-09-29
Applicant: Amazon Technologies, Inc.
Inventor: Daniel Opincariu, Rajasuba Subramanian, Arnab Dutta, Deepan Chakravarthy Vijayarangam, Ranil Pavithran Muzhangathu, Anas Fattahi
Abstract: In a data lake, a control data object is defined. The control object defines the processes and relationships of processes associated with a data set in the data lake. The control has states that are tied to and adapt in response to state changes of the associated data set. A control can have a control type. The system automatically carries forward enabled processes from one data set version to the next data set version. The system uses the control definition to execute processes, such as compaction or data quality scans, on data sets in the data lake.
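One way to picture the "carry forward" behavior described in this abstract is the sketch below. The `Control` fields, the `"PENDING"` and `"COMPLETE"` state names, and the `carry_forward` function are hypothetical; the abstract does not specify the control schema or its states.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Control:
    control_type: str       # e.g. "compaction" or "data_quality_scan"
    dataset_version: int    # data set version the control is attached to
    enabled: bool = True
    state: str = "PENDING"  # hypothetical state name


def carry_forward(controls, new_version):
    """Copy every enabled control onto the next data set version."""
    return [
        replace(control, dataset_version=new_version, state="PENDING")
        for control in controls
        if control.enabled
    ]


v1_controls = [
    Control("compaction", dataset_version=1, state="COMPLETE"),
    Control("data_quality_scan", dataset_version=1, enabled=False),
]
v2_controls = carry_forward(v1_controls, new_version=2)
```

Here only the enabled compaction control is attached to version 2, and its state is reset so the process runs again on the new version.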
-
Publication number: US12072868B1
Publication date: 2024-08-27
Application number: US17224987
Application date: 2021-04-07
Applicant: Amazon Technologies, Inc.
Inventor: Daniel Opincariu, Sandeep Joshi
CPC classification number: G06F16/2379, G06F16/2228
Abstract: Systems and methods are disclosed to implement a data storage system that manages data retention for partitioned datasets. A received data retention policy specifies to selectively delete data from a dataset based on a set of data retention attributes. If the data retention attributes are part of the dataset's partition key, a first type of data deletion job is configured to selectively delete entire partitions of the dataset. Otherwise, the system will generate a retention attribute index for the dataset, which will be used by a second type of data deletion job to selectively delete individual records within the partitions. In embodiments, the retention attribute index is implemented as Bloom filters that track retention attribute values in each partition. Advantageously, the disclosed system is able to automatically configure, for any dataset schema, deletion jobs that avoid full scans of the dataset partitions.
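The branching logic in this abstract (partition-level deletion when the retention attributes are in the partition key, index-assisted record-level deletion otherwise) can be sketched as follows. All function names and job-type labels are hypothetical, and an exact Python `set` per partition stands in for the Bloom filter index purely to keep the example short.

```python
from datetime import date


def choose_job_type(partition_key, retention_attributes):
    """Pick the deletion job type based on the dataset schema."""
    if set(retention_attributes) <= set(partition_key):
        return "partition_delete"          # first type: drop whole partitions
    return "record_delete_with_index"      # second type: use attribute index


def drop_expired_partitions(partitions, cutoff):
    """First job type: partitions are keyed by date; drop those before cutoff."""
    return {day: rows for day, rows in partitions.items() if day >= cutoff}


def delete_expired_records(partitions, attribute_index, attribute, expired_value):
    """Second job type: rewrite only partitions the index flags as relevant.

    attribute_index maps partition name -> set of attribute values seen there
    (an exact set here; the abstract describes Bloom filters instead).
    """
    result = {}
    for name, rows in partitions.items():
        if expired_value not in attribute_index[name]:
            result[name] = rows  # index rules the partition out: no scan
        else:
            result[name] = [r for r in rows if r[attribute] != expired_value]
    return result


by_day = {date(2021, 1, 1): [{"id": 1}], date(2021, 6, 1): [{"id": 2}]}
kept_days = drop_expired_partitions(by_day, cutoff=date(2021, 3, 1))

by_region = {
    "us": [{"id": 1, "batch": "old"}, {"id": 2, "batch": "new"}],
    "eu": [{"id": 3, "batch": "new"}],
}
batch_index = {"us": {"old", "new"}, "eu": {"new"}}
kept_records = delete_expired_records(by_region, batch_index, "batch", "old")
```

In the second example the `eu` partition is returned untouched because the index shows it holds no expired value, which is the full-scan avoidance the abstract highlights.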