Abstract:
Detecting replica faults within a replica group and dynamically scheduling replica healing operations are described. Status metadata for one or more replica groups may be accessed. Based, at least in part, on the status metadata, the number of available replicas for at least one replica group may be determined to be noncompliant with a healthy state definition for the replica group. One or more healing operations to restore the number of available replicas for the at least one replica group to the respective healthy state definition may be dynamically scheduled. In some embodiments, one or more resource constraints for performing healing operations and one or more resource requirements for each of the one or more healing operations may be used to order the one or more healing operations.
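The detect-then-schedule flow described above can be sketched as follows. This is a minimal illustration, not the claimed method: the `ReplicaGroup` and `HealingOp` types, the integer "healthy minimum", the per-replica cost, and the most-degraded-first ordering are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ReplicaGroup:
    name: str
    available: int        # replicas currently reported as available
    healthy_minimum: int  # assumed form of the "healthy state definition"


@dataclass
class HealingOp:
    group: str
    missing: int  # replicas that must be restored
    cost: int     # assumed resource requirement, in abstract units


def find_unhealthy(groups):
    """Return groups whose available replica count is below the healthy minimum."""
    return [g for g in groups if g.available < g.healthy_minimum]


def schedule_healing(groups, cost_per_replica, budget):
    """Order healing operations most-degraded first, within a resource budget."""
    ops = []
    for g in find_unhealthy(groups):
        missing = g.healthy_minimum - g.available
        ops.append(HealingOp(g.name, missing, missing * cost_per_replica))
    # Assumed policy: heal the groups missing the most replicas first.
    ops.sort(key=lambda op: (-op.missing, op.group))

    scheduled, deferred, remaining = [], [], budget
    for op in ops:
        if op.cost <= remaining:
            scheduled.append(op)
            remaining -= op.cost
        else:
            deferred.append(op)  # retried in a later scheduling pass
    return scheduled, deferred
```

With groups A (3 of 3 available), B (1 of 3) and C (2 of 3), a per-replica cost of 10 and a budget of 25, this sketch schedules the healing of B and defers C.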
Abstract:
A system that implements distributed storage may schedule and track control plane operations for performance at the distributed storage system. Information may be maintained for control plane events detected at the distributed storage system. Resource utilization for currently performing control plane operations and currently scheduled control plane operations of the distributed storage system may be determined. The information about detected control plane events may be analyzed to schedule control plane operations to be performed in response to detecting the control plane events. As part of scheduling control plane operations, resource constraints may be applied to the determined resource utilization for the distributed storage system.
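A rough sketch of constraint-aware scheduling of control plane operations follows. The dictionary shape of an operation, the `plan_for` callback that maps an event to an operation, and the per-resource `limits` are hypothetical names chosen for illustration only.

```python
from collections import Counter


def current_utilization(running, scheduled):
    """Sum per-resource usage over running and already-scheduled operations."""
    usage = Counter()
    for op in list(running) + list(scheduled):
        usage.update(op["needs"])  # adds each resource count to the total
    return usage


def schedule_from_events(events, running, scheduled, limits, plan_for):
    """Schedule one operation per event if it fits within the resource limits."""
    usage = current_utilization(running, scheduled)
    newly_scheduled = []
    for event in events:
        op = plan_for(event)  # hypothetical mapping from event to operation
        # A resource with no entry in `limits` is treated as unconstrained.
        fits = all(
            usage[res] + amount <= limits.get(res, float("inf"))
            for res, amount in op["needs"].items()
        )
        if fits:
            usage.update(op["needs"])
            newly_scheduled.append(op)
    return newly_scheduled
```

Operations that do not fit are simply skipped here; a fuller scheduler would presumably keep their events for a later pass.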
Abstract:
A hosted service may limit access to a table initially comprising one or more partitions. Access to the table may be limited to a provisioned capacity. A client of the service may request an increased capacity. A minimum number of partitions for providing the increased capacity may be determined. Proportions of the increased capacity may be allocated among members of successive generations of partitions, each proportion to be provided by a member of a generation or by its descendants. The proportions may be allocated to minimize the costs associated with splitting partitions based on the minimum number of partitions.
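One possible reading of the partition calculation is sketched below. It assumes a fixed maximum capacity per partition and an even spread of leaf partitions across the initial partitions; neither assumption comes from the abstract, and the function names are illustrative.

```python
import math
from fractions import Fraction


def min_partitions(requested_capacity, max_capacity_per_partition):
    """Smallest partition count that can serve the requested capacity."""
    return math.ceil(requested_capacity / max_capacity_per_partition)


def allocate_proportions(initial_partitions, needed_partitions):
    """Give each initial partition a leaf count and a share of total capacity.

    Each initial partition, together with its descendants, is responsible
    for a share proportional to the number of leaf partitions it ends up with.
    """
    total = max(initial_partitions, needed_partitions)
    base, extra = divmod(total, initial_partitions)
    # Spread leaves as evenly as possible; the first `extra` partitions get one more.
    leaves = [base + (1 if i < extra else 0) for i in range(initial_partitions)]
    return [(n, Fraction(n, total)) for n in leaves]
```

For example, a table with 2 partitions that needs 3 would split only the first partition: it and its descendants carry 2/3 of the capacity, and the unsplit partition carries 1/3.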
Abstract:
An automated system may be employed to detect, analyze, and recover from faults occurring in a distributed computing system. Faults may be recorded in a metadata store for verification and analysis by an automated fault management process. Diagnostic procedures may confirm detected faults. The automated fault management process may perform recovery workflows involving operations such as rebooting faulting devices and excommunicating unrecoverable computing nodes from affected clusters.
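The confirm-then-recover workflow could look roughly like the sketch below. The callbacks `diagnose`, `reboot`, and `excommunicate`, the `max_reboots` limit, and the fault record shape are hypothetical stand-ins for whatever a real system provides.

```python
def manage_fault(fault, diagnose, reboot, excommunicate, max_reboots=2):
    """Handle one recorded fault and return the list of actions taken.

    `diagnose(node)` is assumed to return True while the fault is present.
    """
    node = fault["node"]
    history = []

    # Verify the recorded fault before taking any recovery action.
    if not diagnose(node):
        history.append("dismissed")
        return history
    history.append("confirmed")

    # Try rebooting a bounded number of times, re-checking after each attempt.
    for _ in range(max_reboots):
        reboot(node)
        history.append("rebooted")
        if not diagnose(node):
            history.append("recovered")
            return history

    # The node is treated as unrecoverable: remove it from its cluster.
    excommunicate(node)
    history.append("excommunicated")
    return history
```

Returning the action history keeps the sketch easy to inspect; a real process would more likely write these transitions back to the metadata store.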