DYNAMIC REPLICA FAILURE DETECTION AND HEALING
First Claim
1. A system, comprising:
- a plurality of computing nodes, each comprising at least one processor and memory, wherein the plurality of computing nodes are configured to implement a data storage service, wherein the data storage service comprises:
one or more replica groups stored among the plurality of computing nodes, wherein each of the one or more replica groups maintains one or more replicas of data on behalf of one or more storage service clients, wherein each replica group of the one or more replica groups includes a healthy state definition for the replica group;
a replica group status sweeper, configured to identify replica groups with a number of available replicas not compliant with the healthy state definition for the respective replica group, wherein said identification is based, at least in part, on status metadata for the respective replica group; and
a dynamic heal scheduler, configured to schedule one or more replica healing operations to restore the number of available replicas for the identified replica groups to the respective healthy state definition for the identified replica groups based, at least in part, on one or more resource constraints for performing healing operations.
Abstract
Detecting replica faults within a replica group and dynamically scheduling replica healing operations are described. Status metadata for one or more replica groups may be accessed. Based, at least in part, on the status metadata, a number of available replicas for at least one replica group may be determined to be noncompliant with a healthy state definition for the replica group. One or more healing operations to restore the number of available replicas for the at least one replica group to the respective healthy state definition may be dynamically scheduled. In some embodiments, one or more resource constraints for performing healing operations and one or more resource requirements for each of the one or more healing operations may be used to order the one or more healing operations.
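The flow described in the abstract can be illustrated with a minimal sketch. It is not the claimed implementation: the names ReplicaGroup, HealOperation, sweep, and schedule_heals are hypothetical, the healthy state definition is assumed to be a simple target replica count, and the only resource constraint modeled is an assumed bandwidth budget in arbitrary units. The greedy ordering (groups missing the most replicas first) is likewise one possible choice, not one stated in the source.

```python
from dataclasses import dataclass


@dataclass
class ReplicaGroup:
    """Hypothetical status metadata for one replica group."""
    group_id: str
    available_replicas: int
    healthy_replica_count: int  # assumed form of the healthy state definition


@dataclass
class HealOperation:
    """Hypothetical healing operation with an assumed resource requirement."""
    group_id: str
    missing_replicas: int
    bandwidth_cost: int  # arbitrary units; an assumption for illustration


def sweep(groups):
    """Return groups whose available replica count is below the healthy count."""
    return [g for g in groups if g.available_replicas < g.healthy_replica_count]


def schedule_heals(unhealthy, bandwidth_budget, cost_per_replica=10):
    """Greedily schedule heals that fit within the assumed bandwidth budget.

    Groups missing the most replicas are considered first (ties broken by
    group_id). Operations that do not fit are deferred to a later pass.
    """
    candidates = []
    for g in unhealthy:
        missing = g.healthy_replica_count - g.available_replicas
        candidates.append(
            HealOperation(g.group_id, missing, missing * cost_per_replica)
        )
    candidates.sort(key=lambda op: (-op.missing_replicas, op.group_id))

    scheduled, deferred = [], []
    remaining = bandwidth_budget
    for op in candidates:
        if op.bandwidth_cost <= remaining:
            scheduled.append(op)
            remaining -= op.bandwidth_cost
        else:
            deferred.append(op)
    return scheduled, deferred
```

In this sketch, a deferred operation is simply left for the next sweep, which mirrors the idea of scheduling dynamically as resources become available.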
21 Claims
1. A system, comprising:
a plurality of computing nodes, each comprising at least one processor and memory, wherein the plurality of computing nodes are configured to implement a data storage service, wherein the data storage service comprises:
one or more replica groups stored among the plurality of computing nodes, wherein each of the one or more replica groups maintains one or more replicas of data on behalf of one or more storage service clients, wherein each replica group of the one or more replica groups includes a healthy state definition for the replica group;
a replica group status sweeper, configured to identify replica groups with a number of available replicas not compliant with the healthy state definition for the respective replica group, wherein said identification is based, at least in part, on status metadata for the respective replica group; and
a dynamic heal scheduler, configured to schedule one or more replica healing operations to restore the number of available replicas for the identified replica groups to the respective healthy state definition for the identified replica groups based, at least in part, on one or more resource constraints for performing healing operations.
Dependent claims: 2, 3
4. A method, comprising:
performing, by a plurality of computing devices:
accessing status metadata for one or more replica groups, wherein each of the one or more replica groups maintains one or more replicas of data, wherein each replica group of the one or more replica groups includes a healthy state definition for the replica group;
determining, based at least in part on the status metadata, that a number of available replicas for at least one replica group of the one or more replica groups is not compliant with the healthy state definition for the respective replica group; and
dynamically scheduling one or more replica healing operations to restore the number of available replicas for the at least one replica group to the respective healthy state definition for the at least one replica group based, at least in part, on one or more resource constraints for performing healing operations.
Dependent claims: 5, 6, 7, 8, 9, 10, 11
12. A non-transitory, computer-readable storage medium, storing program instructions that when executed by a plurality of computing devices implement a data storage service that implements:
accessing status metadata for one or more replica groups, wherein each of the one or more replica groups maintains one or more replicas of data stored among a plurality of compute nodes implemented by the plurality of computing devices on behalf of one or more storage service clients, wherein each replica group of the one or more replica groups includes a healthy state definition for the respective replica group;
determining, based at least in part on the status metadata, that a number of available replicas for at least one replica group of the one or more replica groups is not compliant with the healthy state definition for the respective replica group; and
dynamically scheduling one or more replica healing operations to restore the number of available replicas for the at least one replica group to the respective healthy state definition for the at least one replica group based, at least in part, on one or more resource constraints for performing healing operations.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20
21. (canceled)
Specification