Abstract:
Resource management techniques for shared resources in a distributed system are described. Clients and servers may exchange messages according to an asynchronous messaging protocol that does not guarantee delivery or ordering of messages. A client may send a resource request message including a client timestamp and a measure of client resource demand. The server may allocate a grant of the resource to the client in a manner that prevents resource overload, and indicate the grant to the client via a message including a logical timestamp, the amount of resource granted, the client's original timestamp, and a grant expiration time. The client may acknowledge the grant and cooperatively use the resource in accordance with the grant's terms.
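A minimal sketch of the message exchange described above, assuming Python dataclasses for the request, grant, and acknowledgement messages; all field and function names here are illustrative rather than taken from the disclosure.

```python
import time
from dataclasses import dataclass

@dataclass
class ResourceRequest:
    client_id: str
    client_timestamp: float   # the client's local timestamp
    demand: int               # measure of client resource demand

@dataclass
class ResourceGrant:
    logical_timestamp: int    # server-assigned logical timestamp
    granted_amount: int       # amount of the resource granted
    client_timestamp: float   # the client's original timestamp, echoed back
    expiration: float         # grant expiration time

@dataclass
class GrantAcknowledgement:
    client_id: str
    logical_timestamp: int    # identifies which grant is being acknowledged

def allocate(request: ResourceRequest, available: int, ttl_seconds: float,
             next_logical_timestamp: int) -> ResourceGrant:
    """Grant no more than is currently available, preventing resource overload."""
    granted = min(request.demand, available)
    return ResourceGrant(
        logical_timestamp=next_logical_timestamp,
        granted_amount=granted,
        client_timestamp=request.client_timestamp,
        expiration=time.time() + ttl_seconds,
    )
```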
Abstract:
A system that implements a scalable data storage service may maintain tables in a non-relational data store on behalf of clients. The system may provide a Web services interface through which service requests are received, and an API usable to request that a table be created, deleted, or described; that an item be stored, retrieved, deleted, or its attributes modified; or that a table be queried (or scanned) with filtered items and/or their attributes returned. An asynchronous workflow may be invoked to create or delete a table. Items stored in tables may be partitioned and indexed using a simple or composite primary key. The system may not impose pre-defined limits on table size, and may employ a flexible schema. The service may provide a best-effort or committed throughput model. The system may automatically scale and/or re-partition tables in response to detecting workload changes, node failures, or other conditions or anomalies.
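As an illustration of partitioning and indexing by a simple or composite primary key, the following sketch hashes the key's hash attribute to a partition; the hashing scheme, attribute names, and fixed partition count are assumptions, not details from the abstract.

```python
import hashlib

def partition_for(hash_key, num_partitions=8):
    """Map the hash-key portion of a primary key to a partition."""
    digest = hashlib.md5(str(hash_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

def item_locator(item, hash_attr, range_attr=None, num_partitions=8):
    """With a simple key, the partition alone locates the item; with a
    composite key, the range attribute orders items within the partition."""
    partition = partition_for(item[hash_attr], num_partitions)
    range_value = item[range_attr] if range_attr else None
    return partition, range_value

# Example: a composite primary key of (customer_id, order_date).
print(item_locator({"customer_id": "c-42", "order_date": "2011-08-01"},
                   "customer_id", "order_date"))
```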
Abstract:
Disclosed are various embodiments for distributing data items within a plurality of nodes. A data item that is the subject of an update request is propagated from a master node to a plurality of slave nodes. The update of the data item is determined to be locality-based durable based at least in part on acknowledgements received from the slave nodes. Upon detection that the master node has failed, a new master candidate is determined via an election among the plurality of slave nodes.
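One plausible reading of the locality-based durability check and the election step, sketched below; the per-locality quorum rule and the tie-breaking criterion are assumptions made for illustration.

```python
from collections import Counter

def is_locality_durable(acks, node_locality, min_per_locality=1, min_localities=2):
    """acks: ids of slave nodes that acknowledged the update.
    node_locality: mapping of node id -> locality (e.g., data center).
    Here the update counts as locality-based durable once acknowledgements
    cover at least `min_localities` distinct localities (illustrative rule)."""
    per_locality = Counter(node_locality[node] for node in acks)
    covered = sum(1 for n in per_locality.values() if n >= min_per_locality)
    return covered >= min_localities

def elect_new_master(slave_states):
    """On master failure, pick the slave with the most recent applied update;
    ties are broken by node id (hypothetical criterion)."""
    return max(slave_states, key=lambda s: (s["last_applied"], s["node_id"]))

# Example: acknowledgements from two data centers satisfy the rule above.
locality = {"s1": "dc-east", "s2": "dc-east", "s3": "dc-west"}
print(is_locality_durable({"s1", "s3"}, locality))                 # True
print(elect_new_master([{"node_id": "s1", "last_applied": 17},
                        {"node_id": "s3", "last_applied": 19}]))   # s3 wins
```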
Abstract:
A system that implements a data storage service may store data on behalf of storage service clients. The system may maintain data in multiple replicas of various partitions that are stored on respective computing nodes in the system. The system may employ a single master failover protocol, usable when a replica attempts to become the master replica for a replica group of which it is a member. Attempting to become the master replica may include acquiring a lock associated with the replica group, and gathering state information from the other replicas in the group. The state information may indicate whether another replica supports the attempt (in which case it is included in a failover quorum) or stores more recent data or metadata than the replica attempting to become the master (in which case synchronization may be required). If the failover quorum includes enough replicas, the replica may become the master.
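A sketch of the failover attempt described above, assuming the group lock has already been acquired and state has been gathered from the other replicas; the return values and quorum size are illustrative.

```python
from dataclasses import dataclass

@dataclass
class ReplicaState:
    replica_id: str
    last_sequence: int   # most recent log sequence this replica holds
    supports: bool       # whether it supports the candidate's attempt

def attempt_failover(candidate_id, candidate_sequence, peer_states, quorum_size,
                     lock_acquired):
    """Return 'master', 'sync-needed', or 'failed' for a failover attempt."""
    if not lock_acquired:
        return "failed"            # could not acquire the replica group's lock
    quorum = {candidate_id} | {s.replica_id for s in peer_states if s.supports}
    if any(s.last_sequence > candidate_sequence for s in peer_states):
        return "sync-needed"       # a peer stores more recent data or metadata
    if len(quorum) >= quorum_size:
        return "master"            # the failover quorum includes enough replicas
    return "failed"

# Example: the candidate plus two supporting peers meet a quorum of three.
peers = [ReplicaState("r2", 40, True), ReplicaState("r3", 41, True)]
print(attempt_failover("r1", 41, peers, quorum_size=3, lock_acquired=True))
```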
Abstract:
A system that implements a data storage service may store data on behalf of storage service clients. The system may maintain data in multiple replicas of partitions that are stored on respective computing nodes in the system. A master replica for a replica group may increment a membership version indicator for the group, and may propagate metadata (including the membership version indicator) indicating a membership change for the group to other members of the group. Propagating the metadata may include sending a log record containing the metadata to the other replicas to be appended to their respective logs. Once the membership change becomes durable, it may be committed. A replica attempting to become the master of a replica group may determine that another replica in the group has observed a more recent membership version, in which case logs may be synchronized or snipped, or the attempt may be abandoned.
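A sketch of how a master might increment the membership version and propagate the change as a log record, with a majority rule standing in for the durability condition; the record format and helper names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    replica_id: str
    log: list = field(default_factory=list)

@dataclass
class ReplicaGroup:
    membership_version: int
    members: list

def propagate_membership_change(group, change):
    """The master increments the membership version and sends a log record
    carrying the change metadata to each member to append to its log; the
    change is committed only once it is durable (majority rule assumed)."""
    group.membership_version += 1
    record = {"type": "membership",
              "version": group.membership_version,
              "change": change}
    acks = 0
    for replica in group.members:
        replica.log.append(record)
        acks += 1
    return record if acks > len(group.members) // 2 else None

def can_become_master(candidate_version, observed_versions):
    """A candidate must synchronize (or snip) its log, or abandon the attempt,
    if any replica has observed a more recent membership version."""
    return all(v <= candidate_version for v in observed_versions)

group = ReplicaGroup(3, [Replica("r1"), Replica("r2"), Replica("r3")])
print(propagate_membership_change(group, {"add": "r4"}))
print(can_become_master(group.membership_version, [3, 4]))   # False
```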
Abstract:
At a client-side component of a storage group, a read descriptor generated in response to a read request directed to a first data store is received. The read descriptor includes a state transition indicator corresponding to a write that has been applied at the first data store. A write descriptor indicative of a write that depends on a result of the read request is generated at the client-side component. The read descriptor and the write descriptor are included in a commit request for a candidate transaction at the client-side component, and transmitted to a transaction manager.
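A sketch of the client-side assembly of a commit request from a read descriptor and a dependent write descriptor; the descriptor fields shown are assumptions, not the actual wire format.

```python
from dataclasses import dataclass

@dataclass
class ReadDescriptor:
    data_store: str
    state_transition: int   # indicator for the write last applied at the store

@dataclass
class WriteDescriptor:
    data_store: str
    key: str
    value: bytes            # a write that depends on the read's result

@dataclass
class CommitRequest:
    reads: list             # read descriptors the candidate transaction saw
    writes: list            # writes to apply if the transaction commits

def build_commit_request(read_descriptor, dependent_write):
    """Bundle the read descriptor and the dependent write descriptor into a
    commit request to be transmitted to the transaction manager."""
    return CommitRequest(reads=[read_descriptor], writes=[dependent_write])

# Example: a write whose value was computed from the preceding read.
rd = ReadDescriptor("store-1", state_transition=1042)
wd = WriteDescriptor("store-1", key="balance", value=b"150")
print(build_commit_request(rd, wd))
```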
Abstract:
A system that implements a data storage service may store data in multiple replicated partitions on respective storage nodes. The selection of the storage nodes (or storage devices thereof) on which to store the partition replicas may be performed by administrative components that are responsible for partition management and resource allocation for respective groups of storage nodes (e.g., based on a global view of resource capacity or usage), or the selection of particular storage devices of a storage node may be determined by the storage node itself (e.g., based on a local view of resource capacity or usage). Placement policies applied at the administrative layer or storage layer may be based on the percentage or amount of provisioned, reserved, or available storage or IOPS capacity on each storage device, and particular placements (or subsequent operations to move partition replicas) may result in an overall resource utilization that is well balanced.
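An illustrative placement heuristic in the spirit of the policies described above, scoring each storage device by its more constrained resource (storage or IOPS); the scoring function and field names are assumptions.

```python
def placement_score(device):
    """Score a device by its more constrained resource, storage or IOPS."""
    storage_util = device["provisioned_gb"] / device["capacity_gb"]
    iops_util = device["provisioned_iops"] / device["capacity_iops"]
    return max(storage_util, iops_util)

def choose_device(devices, needed_gb, needed_iops):
    """Pick the least-utilized device that still has enough available
    storage and IOPS capacity for the new partition replica."""
    fits = [d for d in devices
            if d["capacity_gb"] - d["provisioned_gb"] >= needed_gb
            and d["capacity_iops"] - d["provisioned_iops"] >= needed_iops]
    return min(fits, key=placement_score) if fits else None

devices = [
    {"name": "ssd-a", "capacity_gb": 800, "provisioned_gb": 600,
     "capacity_iops": 4000, "provisioned_iops": 1000},
    {"name": "ssd-b", "capacity_gb": 800, "provisioned_gb": 200,
     "capacity_iops": 4000, "provisioned_iops": 2000},
]
print(choose_device(devices, needed_gb=100, needed_iops=500)["name"])  # ssd-b
```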
Abstract:
A system that provides services to clients may receive and service requests, various ones of which may require different amounts of work. An admission control mechanism may manage requests based on tokens, each of which represents a fixed amount of work. The tokens may be added to a token bucket at a rate that depends on a target work throughput rate, as long as the number of tokens in the bucket does not exceed its maximum capacity. If at least a pre-determined minimum number of tokens is present in the bucket when a service request is received, it may be serviced. Servicing a request may include deducting an initial number of tokens from the bucket, determining that the amount of work performed in servicing the request differs from that represented by the initially deducted tokens, and deducting additional tokens from, or replacing tokens in, the bucket to reflect the difference.
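A minimal token-bucket sketch of this admission control mechanism, including the initial deduction and the later settlement that adds or removes tokens to reflect the work actually performed; parameter names are illustrative.

```python
import time

class TokenBucket:
    def __init__(self, rate, capacity, min_tokens=1):
        self.rate = rate              # tokens added per second (target work rate)
        self.capacity = capacity      # bucket never holds more than this
        self.min_tokens = min_tokens  # minimum required to admit a request
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now

    def try_admit(self, initial_cost):
        """Admit the request if at least min_tokens are present, deducting an
        initial estimate of its cost."""
        self._refill()
        if self.tokens < self.min_tokens:
            return False
        self.tokens -= initial_cost
        return True

    def settle(self, initial_cost, actual_cost):
        """After servicing, deduct more tokens (or put some back) so the net
        deduction reflects the work actually performed."""
        self.tokens = min(self.capacity,
                          self.tokens - (actual_cost - initial_cost))

bucket = TokenBucket(rate=100.0, capacity=200)
if bucket.try_admit(initial_cost=1):
    # ... service the request, then discover it actually cost 5 tokens ...
    bucket.settle(initial_cost=1, actual_cost=5)
```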
Abstract:
A system that implements a data storage service may store data on behalf of storage service clients. The system may maintain data in multiple replicas of partitions that are stored on respective computing nodes in the system. The system may split a data partition into two new partitions, and may split the replica group that stored the original partition into two new replica groups, each storing one of the new partitions. To split the replica group, the master replica may propagate membership changes to the other members of the replica group, first adding members to the original replica group and then splitting the expanded replica group into two new replica groups. Subsequent to the split, replicas may attempt to become the master for the original replica group or for a new replica group. If an attempt to become the master replica for the original replica group succeeds, the split may fail.
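A sketch of the split sequence, assuming numeric key ranges and an even division of the expanded replica group; both assumptions are illustrative only.

```python
def split_partition(key_range):
    """Split a partition's numeric key range into two contiguous halves."""
    low, high = key_range
    mid = (low + high) // 2
    return (low, mid), (mid, high)

def split_replica_group(original_members, new_members):
    """First membership change: add new members to the original group.
    Second membership change: split the expanded group into two new groups."""
    expanded = original_members + new_members
    half = len(expanded) // 2
    return expanded[:half], expanded[half:]

# Example: a 3-replica group gains 3 new replicas, then splits so that each
# resulting group stores one of the two new partitions.
print(split_partition((0, 1000)))
print(split_replica_group(["r1", "r2", "r3"], ["r4", "r5", "r6"]))
```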