Abstract:
A probabilistic data structure is generated for efficient query processing using a histogram for unsorted data in a column of a columnar database. A bucket range size is determined for multiples buckets of a histogram of a column in a columnar database table. In at least some embodiments, the histogram may be a height-balanced histogram. A probabilistic data structure is generated to indicate for which particular buckets in the histogram there is a data value stored in the data block. When an indication of a query directed to the column for select data is received, the probabilistic data structure for each of the data blocks storing data for the column may be examined to determine particular ones of the data blocks which do not need to be read in order to service the query for the select data.
Abstract:
A database system may maintain a plurality of log records at a distributed storage system. Each of the plurality of log records may be associated with a respective change to a data page. Upon detection of a coalesce event for a particular data page, log records linked to the particular data page may be applied to generate the particular data page in its current state. Detecting the coalesce event may be a determination that the number of log records linked to the particular data page exceeds a threshold.
Abstract:
A distributed data warehouse system may maintain data blocks on behalf of clients, and may store primary and secondary copies of each data block on different disks or nodes in a cluster. The warehouse system may back up data blocks in a remote key-value backup storage system. A restore operation may retrieve data blocks from backup storage using their unique identifiers as keys (while incoming queries are serviced) in response to a failure or a query targeting data that was lost or corrupted. The order in which data blocks are restored may be dependent on the relative likelihood that they will be accessed in the near future (e.g., based on how recently or frequently they were accessed, written, or backed up; the values of one or more access counters associated with each data block; or how recently a database table containing data in each data block was loaded).
Abstract:
Data transformation workflows may be generated to transform data objects. A source data schema for a data object and a target data format or target data schema for a data object may be identified. A comparison of the source data schema and the target data format or schema may be made to determine what transformations can be performed to transform the data object into the target data format or schema. Code to execute the transformation operations may then be generated. The code may be stored for subsequent modification or execution.
Abstract:
A network-based services provider may reserve and provision primary resource instance capacity for a given service (e.g., enough compute instances, storage instances, or other virtual resource instances to implement the service) in one or more availability zones, and may designate contingency resource instance capacity for the service in another availability zone (without provisioning or reserving the contingency instances for the exclusive use of the service). For example, the service provider may provision resource instance(s) for a database engine head node in one availability zone and designate resource instance capacity for another database engine head node in another availability zone without instantiating the other database engine head node. While the service operates as expected using the primary resource instance capacity, the contingency resource capacity may be leased to other entities on a spot market. Leases for contingency instance capacity may be revoked when needed for the given service (e.g., during failover).
Abstract:
Data may be efficiently analyzed and compressed as part of a data compression service. A data compression request may be received from a client indicating data to be compressed. An analysis of the data or metadata associated with the data may be performed. In at least some embodiments, this analysis may be a rules-based analysis. Some embodiments may employ one or more machine learning techniques to historical compression data to update the rules-based analysis. One or more compression techniques may be selected out of a plurality of compression techniques to be applied to the data. Data compression candidates may then be generated according to the selected compression techniques. In some embodiments, a compression service restriction may be enforced. One of the data compression candidates may be selected and sent in a response.
Abstract:
A multi-column index is generated based on an interleaving of data bits for selectivity for efficient processing of data in a relational database system. Two or more columns may be identified for inclusion in the multi-column index for a relational database table. Based, at least in part, on the interleaving of data bits for selectivity from the identified columns, a multi-column index is generated for the relational database table that provides a respective index value for each entry in the relational database table. The entries of the relational database table may then be stored according to the index values of the multi-column index.
Abstract:
A log-structured data store may implement optimized log storage for asynchronous log updates. In some embodiments, log records may be received indicating updates to data stored for a storage client and indicating positions in a log record sequence. The log records themselves may not be guaranteed to be received according to the log record sequence. Received log records may be stored in a hot log portion of a block-based storage device according to an order in which they are received. Log records in the hot log portion may then be identified to be moved to a cold log portion of the block-based storage device in order to complete a next portion of the log record sequence. Log records may be modified, such as compressed, or coalesced, before being stored together in a data block of the cold log portion according to the log record sequence.
Abstract:
A network-based services provider may reserve and provision primary resource instance capacity for a given service (e.g., enough compute instances, storage instances, or other virtual resource instances to implement the service) in one or more availability zones, and may designate contingency resource instance capacity for the service in another availability zone (without provisioning or reserving the contingency instances for the exclusive use of the service). For example, the service provider may provision resource instance(s) for a database engine head node in one availability zone and designate resource instance capacity for another database engine head node in another availability zone without instantiating the other database engine head node. While the service operates as expected using the primary resource instance capacity, the contingency resource capacity may be leased to other entities on a spot market. Leases for contingency instance capacity may be revoked when needed for the given service (e.g., during failover).
Abstract:
Application program data stored in system memory may be selectively persisted. An indication may be provided to an application program that an application data object or a range of application data stored in system memory may be treated as persistent. Data backup may be enabled for the application data object or range of application data in the event of a system failure, copying the application data object or range of application data from system memory to non-volatile data storage. Upon recovery from a system failure, further data backup for the application data object or the range of application data may be disabled. In some embodiments, at least some of the application data object or range of application data may be recovered for the application program to access. Data backup for the application data object or the range of application data may also be re-enabled.