Abstract:
Detecting replica faults within a replica group and dynamically scheduling replica healing operations are described. Status metadata for one or more replica groups may be accessed. Based, at least in part, on the status metadata, a number of available replicas for at least one replica group may be determined to be noncompliant with a healthy state definition for the replica group. One or more healing operations to restore the number of available replicas for the at least one replica group to the respective healthy state definition may be dynamically scheduled. In some embodiments, one or more resource constraints for performing healing operations and one or more resource requirements for each of the one or more healing operations may be used to order the one or more healing operations.
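As a rough illustration of the scheduling idea above, the following sketch (with hypothetical names such as ReplicaGroupStatus and HealingOperation, and a copy-bandwidth budget standing in for the resource constraint) finds groups that fall below a desired replica count and admits healing operations, most-degraded first, while their combined resource requirements fit the constraint.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ReplicaGroupStatus:
    group_id: str
    available_replicas: int

@dataclass
class HealingOperation:
    group_id: str
    replicas_to_restore: int     # resource requirement: replicas to re-create
    estimated_copy_bytes: int    # resource requirement: data to be copied

def schedule_healing(statuses: List[ReplicaGroupStatus],
                     desired_replicas: Dict[str, int],
                     replica_size_bytes: int,
                     copy_budget_bytes: int) -> List[HealingOperation]:
    """Find groups below their healthy state definition and order healing
    operations so the aggregate copy traffic stays within the constraint."""
    pending = []
    for status in statuses:
        missing = desired_replicas[status.group_id] - status.available_replicas
        if missing > 0:          # group is noncompliant with its healthy state
            pending.append(HealingOperation(
                group_id=status.group_id,
                replicas_to_restore=missing,
                estimated_copy_bytes=missing * replica_size_bytes))

    # Heal the most-degraded groups first, admitting operations while their
    # combined resource requirements fit under the resource constraint.
    pending.sort(key=lambda op: op.replicas_to_restore, reverse=True)
    scheduled, used = [], 0
    for op in pending:
        if used + op.estimated_copy_bytes <= copy_budget_bytes:
            scheduled.append(op)
            used += op.estimated_copy_bytes
    return scheduled
```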
Abstract:
Disclosed are various embodiments for distributing data items within a plurality of nodes. A data item that is subject to a data item update request is replicated from a master node to a plurality of slave nodes. The update of the data item is determined to be locality-based durable based at least in part on acknowledgements received from the slave nodes. Upon detection that the master node has failed, a new master candidate is determined via an election among the plurality of slave nodes.
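A minimal sketch of the described flow, assuming (for illustration only) that an update is locality-based durable once a slave outside the master's locality acknowledges it, and that the election picks the slave with the most recent applied update; neither rule is stated here as the claimed algorithm.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    node_id: str
    locality: str              # e.g. a data center or rack identifier
    last_applied_update: int   # sequence number of the latest applied update

def is_locality_durable(master: Node, acknowledging_slaves: List[Node]) -> bool:
    """Treat the update as durable once a slave in a different locality than
    the master has acknowledged it (illustrative rule)."""
    return any(slave.locality != master.locality for slave in acknowledging_slaves)

def elect_new_master(slaves: List[Node]) -> Optional[Node]:
    """On master failure, elect the slave that has applied the most recent
    update as the new master candidate (illustrative rule)."""
    return max(slaves, key=lambda s: s.last_applied_update, default=None)
```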
Abstract:
A system that implements a data storage service may store data on behalf of storage service clients. The system may maintain data in multiple replicas of various partitions that are stored on respective computing nodes in the system. The system may employ a single master failover protocol, usable when a replica attempts to become the master replica for a replica group of which it is a member. Attempting to become the master replica may include acquiring a lock associated with the replica group, and gathering state information from the other replicas in the group. The state information may indicate whether another replica supports the attempt (in which case it is included in a failover quorum) or stores more recent data or metadata than the replica attempting to become the master (in which case synchronization may be required). If the failover quorum includes enough replicas, the replica may become the master.
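The failover attempt might be sketched as follows, assuming hypothetical helpers (a lock_manager with acquire/release methods, and peers that report a small state record); the quorum size and the decision to abandon the attempt when a peer holds more recent data or metadata are illustrative simplifications of the described protocol.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PeerState:
    supports_attempt: bool     # whether the peer supports this failover attempt
    data_version: int          # recency of the peer's data
    metadata_version: int      # recency of the peer's metadata

def attempt_failover(my_data_version: int,
                     my_metadata_version: int,
                     peers: List,
                     lock_manager,
                     group_id: str,
                     quorum_size: int) -> bool:
    """Try to become master: take the group lock, gather state from the other
    replicas, and succeed only if the failover quorum is large enough."""
    if not lock_manager.acquire(group_id):       # lock associated with the group
        return False
    try:
        quorum = 1                               # this replica counts itself
        for peer in peers:
            state: PeerState = peer.report_state()
            if (state.data_version > my_data_version or
                    state.metadata_version > my_metadata_version):
                return False                     # peer is ahead: synchronize first
            if state.supports_attempt:
                quorum += 1                      # peer joins the failover quorum
        return quorum >= quorum_size             # enough replicas: become master
    finally:
        lock_manager.release(group_id)
```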
Abstract:
A system that implements a data storage service may store data on behalf of storage service clients. The system may maintain data in multiple replicas of partitions that are stored on respective computing nodes in the system. A master replica for a replica group may increment a membership version indicator for the group, and may propagate metadata (including the membership version indicator) indicating a membership change for the group to other members of the group. Propagating the metadata may include sending a log record containing the metadata to the other replicas to be appended to their respective logs. Once the membership change becomes durable, it may be committed. A replica attempting to become the master of a replica group may determine that another replica in the group has observed a more recent membership version, in which case logs may be synchronized or snipped, or the attempt may be abandoned.
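A minimal sketch of propagating a membership change as a replicated log record, assuming a simple in-memory log per replica and treating "durable" as "appended on a majority of replicas"; the record layout and durability rule are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MembershipChangeRecord:
    membership_version: int
    members: List[str]                 # replica ids in the new membership
    committed: bool = False

@dataclass
class Replica:
    replica_id: str
    log: List[MembershipChangeRecord] = field(default_factory=list)

    def append(self, record: MembershipChangeRecord) -> bool:
        """Append the log record and acknowledge it."""
        self.log.append(record)
        return True

def propagate_membership_change(master: Replica,
                                others: List[Replica],
                                new_members: List[str]) -> MembershipChangeRecord:
    # Increment the membership version indicator for the group.
    version = master.log[-1].membership_version + 1 if master.log else 1
    record = MembershipChangeRecord(version, new_members)
    master.append(record)
    # Send the log record to the other replicas to be appended to their logs.
    acks = 1 + sum(1 for replica in others if replica.append(record))
    # Once the change is durable (here: appended on a majority), commit it.
    if acks > (len(others) + 1) // 2:
        record.committed = True
    return record
```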
Abstract:
A system that implements a scalable data storage service may maintain tables in a data store on behalf of storage service clients. The service may maintain table data in multiple replicas of partitions that are stored on respective computing nodes in the system. In response to detecting an anomaly in the system, detecting a change in data volume on a partition or service request traffic directed to a partition, or receiving a service request from a client to split a partition, the data storage service may create additional copies of a partition replica using a physical copy mechanism. The data storage service may issue a split command defined in an API for the data store to divide the original and additional replicas into multiple replica groups, and to configure each replica group to maintain a respective portion of the table data that was stored in the partition before the split.
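One way to picture the split flow, with hypothetical physical_copy and configure_group helpers standing in for the physical copy mechanism and the split command defined in the data store's API; the halving of a numeric key range is an illustrative assumption.

```python
from typing import Callable, List, Tuple

def split_partition(replicas: List[str],
                    key_range: Tuple[int, int],
                    physical_copy: Callable[[str], str],
                    configure_group: Callable[[List[str], Tuple[int, int]], None]) -> None:
    # 1. Create an additional copy of each partition replica (physical copy).
    new_replicas = [physical_copy(replica) for replica in replicas]

    # 2. Divide the original partition's key range in half.
    low, high = key_range
    mid = (low + high) // 2

    # 3. Configure each replica group to maintain one portion of the table data.
    configure_group(replicas, (low, mid))        # original replicas keep the lower half
    configure_group(new_replicas, (mid, high))   # new copies take the upper half
```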
Abstract:
A system that implements a data storage service may store data on behalf of clients in multiple replicas on respective computing nodes. The system may employ an external service to select a master replica for a replica group. The master replica may service consistent read operations and/or write operations that are directed to the replica group (or to a data partition stored by the replica group). The master replica may employ a quorum based mechanism for performing replicated write operations, and a local lease mechanism for determining the replica authorized to perform consistent reads, even when the external service is unavailable. The master replica may propagate local leases to replica group members as replicated writes. If another replica assumes mastership for the replica group, it may not begin servicing consistent read operations that are directed to the replica group until the lease period for a current local lease expires.
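A sketch of the local lease check, assuming a lease record that names the authorized replica and carries an expiry time; the rule that a new master waits out the current lease before serving consistent reads follows the description above, while the time source and field names are assumptions.

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class LocalLease:
    holder_id: str        # replica currently authorized for consistent reads
    expires_at: float     # end of the lease period (seconds since the epoch)

def can_serve_consistent_read(replica_id: str, lease: LocalLease,
                              now: Optional[float] = None) -> bool:
    """Return True if this replica may service a consistent read right now."""
    now = time.time() if now is None else now
    if replica_id == lease.holder_id:
        return now < lease.expires_at     # the lease holder, within its lease
    # A replica that has just assumed mastership must wait for the current
    # local lease to expire before it begins servicing consistent reads.
    return now >= lease.expires_at
```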
Abstract:
A system that implements a data storage service may store data in multiple replicated partitions on respective storage nodes. The selection of the storage nodes (or storage devices thereof) on which to store the partition replicas may be performed by administrative components that are responsible for partition management and resource allocation for respective groups of storage nodes (e.g., based on a global view of resource capacity or usage), or the selection of particular storage devices of a storage node may be determined by the storage node itself (e.g., based on a local view of resource capacity or usage). Placement policies applied at the administrative layer or storage layer may be based on the percentage or amount of provisioned, reserved, or available storage or IOPS capacity on each storage device, and particular placements (or subsequent operations to move partition replicas) may result in an overall resource utilization that is well balanced.
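A placement decision of the kind described might look like the following sketch, which picks the storage device whose scarcer resource (storage or IOPS) is least utilized; the fields and the max-of-fractions utilization metric are illustrative assumptions, not the claimed policy.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StorageDevice:
    device_id: str
    storage_used_bytes: int
    storage_capacity_bytes: int
    iops_reserved: int
    iops_capacity: int

def utilization(device: StorageDevice) -> float:
    """Utilization of the scarcer of the two resources on this device."""
    storage_fraction = device.storage_used_bytes / device.storage_capacity_bytes
    iops_fraction = device.iops_reserved / device.iops_capacity
    return max(storage_fraction, iops_fraction)

def choose_device(candidates: List[StorageDevice]) -> StorageDevice:
    """Place the new partition replica where overall utilization stays best
    balanced: the device with the lowest current utilization."""
    return min(candidates, key=utilization)
```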
Abstract:
A system that provides services to clients may receive and service requests, various ones of which may require different amounts of work. An admission control mechanism may manage requests based on tokens, each of which represents a fixed amount of work. Tokens may be added to a token bucket at a rate that is dependent on a target work throughput rate, as long as the number of tokens in the bucket does not exceed its maximum capacity. If at least a pre-determined minimum number of tokens is present in the bucket when a service request is received, the request may be serviced. Servicing a request may include deducting an initial number of tokens from the bucket, determining that the amount of work performed in servicing the request is different than that represented by the initially deducted tokens, and deducting additional tokens from, or returning tokens to, the bucket to reflect the difference.
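The token-bucket behavior described above, including the post-hoc adjustment when actual work differs from the initial deduction, might be sketched as follows; the refill model, parameter names, and the choice to let the balance go negative are illustrative assumptions.

```python
import time

class TokenBucket:
    def __init__(self, refill_rate: float, capacity: float, min_tokens: float = 1.0):
        self.refill_rate = refill_rate        # tokens per second (target work rate)
        self.capacity = capacity              # bucket never holds more than this
        self.min_tokens = min_tokens          # minimum present to admit a request
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now

    def try_admit(self, initial_cost: float) -> bool:
        """Admit the request if at least min_tokens are present, deducting an
        initial number of tokens that represents the expected work."""
        self._refill()
        if self.tokens < self.min_tokens:
            return False
        self.tokens -= initial_cost           # may drive the balance negative
        return True

    def settle(self, initial_cost: float, actual_cost: float) -> None:
        """After servicing, adjust the bucket to reflect the work actually
        performed: return tokens if over-charged, deduct more if under-charged."""
        self.tokens = min(self.capacity, self.tokens + initial_cost - actual_cost)
```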
Abstract:
A system that implements a data storage service may store data on behalf of storage service clients. The system may maintain data in multiple replicas of partitions that are stored on respective computing nodes in the system. The system may split a data partition into two new partitions, and may split the replica group that stored the original partition into two new replica groups, each storing one of the new partitions. To split the replica group, the master replica may propagate membership changes to the other members of the replica group for adding members to the original replica group and for splitting the expanded replica group into two new replica groups. Subsequent to the split, replicas may attempt to become the master for the original replica group or for a new replica group. If an attempt to become master replica for the original replica group succeeds, the split may fail.
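A sketch of the two-step split: the master expands the original replica group's membership, then divides the expanded group into two new groups. The membership-change step is represented by a hypothetical propagate_membership_change callable, and the even split of the member list is an illustrative assumption.

```python
from typing import Callable, List, Tuple

def split_replica_group(master: str,
                        original_members: List[str],
                        added_members: List[str],
                        propagate_membership_change: Callable[[str, List[str]], None]
                        ) -> Tuple[List[str], List[str]]:
    # 1. Membership change: add members to the original replica group.
    expanded = original_members + added_members
    propagate_membership_change(master, expanded)

    # 2. Membership change: split the expanded group into two new replica
    #    groups, each of which will store one of the new partitions.
    group_a = expanded[: len(expanded) // 2]
    group_b = expanded[len(expanded) // 2:]
    propagate_membership_change(master, group_a)
    propagate_membership_change(master, group_b)
    return group_a, group_b
```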