Indexing partitions using distributed Bloom filters

    Publication number: US11531666B1

    Publication date: 2022-12-20

    Application number: US16998922

    Filing date: 2020-08-20

    Abstract: Methods, systems, and computer-readable media for indexing partitions using distributed Bloom filters are disclosed. A data indexing system generates a plurality of indices for a plurality of partitions in a distributed object store. The indices comprise a plurality of Bloom filters. An individual one of the Bloom filters corresponds to one or more fields of an individual one of the partitions. Using the Bloom filters, the data indexing system determines a first portion of the partitions that possibly comprise a value and a second portion of the partitions that do not comprise the value. Based (at least in part) on a scan of the first portion of the partitions and not the second portion of the partitions, the data indexing system determines one or more partitions of the first portion of the partitions that comprise the value.
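The two-phase lookup described in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `BloomFilter` class, the `build_indices` and `find_partitions` helpers, the filter sizing, the SHA-256 based hashing, and the sample data are all assumptions made for the example. In-memory lists stand in for partitions of a distributed object store.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter backed by a Python int used as a bit array (illustrative sizing)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive num_hashes bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_indices(partitions, field):
    """Build one Bloom filter per partition over the given field."""
    indices = {}
    for name, records in partitions.items():
        bloom = BloomFilter()
        for record in records:
            bloom.add(record[field])
        indices[name] = bloom
    return indices


def find_partitions(partitions, indices, field, value):
    """Prune with the filters, then scan only the candidate partitions."""
    candidates = [name for name, bloom in indices.items() if bloom.might_contain(value)]
    return [
        name
        for name in candidates
        if any(record[field] == value for record in partitions[name])
    ]


# Hypothetical sample data: partition name -> list of records.
partitions = {
    "p1": [{"user": "alice"}, {"user": "bob"}],
    "p2": [{"user": "carol"}],
    "p3": [{"user": "dave"}, {"user": "erin"}],
}
indices = build_indices(partitions, "user")
matches = find_partitions(partitions, indices, "user", "carol")
```

A Bloom filter never reports a stored value as absent, so the pruning step cannot drop a partition that holds the value. It can report false positives, which is why the candidate partitions are still scanned to confirm.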

Data retention management for partitioned datasets

    Publication number: US12072868B1

    Publication date: 2024-08-27

    Application number: US17224987

    Filing date: 2021-04-07

    CPC classification number: G06F16/2379 G06F16/2228

    Abstract: Systems and methods are disclosed to implement a data storage system that manages data retention for partitioned datasets. A received data retention policy specifies that data be selectively deleted from a dataset based on a set of data retention attributes. If the data retention attributes are part of the dataset's partition key, a first type of data deletion job is configured to selectively delete entire partitions of the dataset. Otherwise, the system will generate a retention attribute index for the dataset, which will be used by a second type of data deletion job to selectively delete individual records within the partitions. In embodiments, the retention attribute index is implemented as Bloom filters that track retention attribute values in each partition. Advantageously, the disclosed system is able to automatically configure, for any dataset schema, deletion jobs that avoid full scans of the dataset partitions.
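The choice between the two deletion job types can be sketched as follows. This is a hedged illustration under stated assumptions, not the patented system: the `apply_retention` function, the job labels `"partition_delete"` and `"record_delete"`, the partition layout (a `key` dict plus a `records` list), and the small `BloomFilter` class are all hypothetical names introduced for the example. The policy is modelled as a mapping from retention attribute to the value whose data should be deleted.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter backed by a Python int (illustrative sizing)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def _positions(self, value):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def apply_retention(partitions, partition_key_attrs, policy):
    """Return (job_type, remaining_partitions) after applying the policy."""
    attrs = sorted(policy)
    target = tuple(policy[a] for a in attrs)

    if set(attrs) <= set(partition_key_attrs):
        # First job type: every retention attribute is in the partition key,
        # so matching partitions are dropped whole without reading records.
        kept = [
            p for p in partitions
            if tuple(p["key"][a] for a in attrs) != target
        ]
        return "partition_delete", kept

    # Second job type: build a retention attribute index, one filter per partition.
    index = []
    for p in partitions:
        bloom = BloomFilter()
        for record in p["records"]:
            bloom.add(tuple(record[a] for a in attrs))
        index.append(bloom)

    remaining = []
    for p, bloom in zip(partitions, index):
        if bloom.might_contain(target):
            # Only partitions that may hold the value are scanned and rewritten.
            records = [
                r for r in p["records"]
                if tuple(r[a] for a in attrs) != target
            ]
        else:
            records = p["records"]  # skipped: the filter rules the value out
        remaining.append({"key": p["key"], "records": records})
    return "record_delete", remaining


# Hypothetical dataset partitioned by "region".
dataset = [
    {"key": {"region": "eu"}, "records": [
        {"region": "eu", "customer": "a"},
        {"region": "eu", "customer": "b"},
    ]},
    {"key": {"region": "us"}, "records": [{"region": "us", "customer": "c"}]},
]
job1, kept1 = apply_retention(dataset, ["region"], {"region": "eu"})
job2, kept2 = apply_retention(dataset, ["region"], {"customer": "a"})
```

In a real system the index would presumably be built once and maintained as data arrives, rather than rebuilt for each job as in this sketch.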
