Dynamic replica failure detection and healing
First Claim
1. A system, comprising:
- a plurality of compute nodes, each comprising at least one processor and memory, wherein the plurality of compute nodes implement a data store;
wherein the data store is configured to:
maintain a plurality of replicas of data on behalf of a client of the data store at different ones of the compute nodes as a replica group for the data;
obtain individual metadata for different replicas of the replica group to update status metadata stored for the replica group at one or more of the compute nodes remote from the different ones of the compute nodes that maintain the plurality of replicas;
access, by a replica group status sweeper remote from the different ones of the compute nodes and remote from the one or more compute nodes that store the status metadata, the updated status metadata for the replica group at the one or more compute nodes to evaluate the replica group for compliance with a healthy state definition of a number of replicas for the replica group based, at least in part, on the updated status metadata, wherein the evaluation determines that a number of available replicas for the replica group is not compliant with the healthy state definition; and
automatically restore the replica group such that the number of available replicas for the replica group is compliant with the healthy state definition for the replica group.
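The evaluation step recited in this claim can be illustrated with a minimal sketch. It assumes the status metadata is exposed as a mapping from a replica group identifier to per-replica availability flags, and that the healthy state definition is simply a required replica count. The names `ReplicaStatus`, `HealthyStateDefinition`, and `sweep` are hypothetical and do not appear in the patent; this is one possible reading, not the claimed implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ReplicaStatus:
    """Hypothetical per-replica entry in the status metadata."""
    node_id: str
    available: bool


@dataclass(frozen=True)
class HealthyStateDefinition:
    """Assumed form of a healthy state definition: a required replica count."""
    required_replicas: int


def evaluate_group(statuses: List[ReplicaStatus],
                   definition: HealthyStateDefinition) -> int:
    """Return how many replicas are missing (0 means the group is compliant)."""
    available = sum(1 for status in statuses if status.available)
    return max(0, definition.required_replicas - available)


def sweep(status_metadata: Dict[str, List[ReplicaStatus]],
          definitions: Dict[str, HealthyStateDefinition]) -> Dict[str, int]:
    """Scan every replica group and report the deficit of each non-compliant one."""
    deficits = {}
    for group_id, statuses in status_metadata.items():
        deficit = evaluate_group(statuses, definitions[group_id])
        if deficit > 0:
            deficits[group_id] = deficit
    return deficits
```

In this sketch the sweeper only reads the status metadata and reports deficits; restoring the missing replicas would be a separate step driven by the returned mapping.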
Abstract
Detecting replica faults within a replica group and dynamically scheduling replica healing operations are described. Status metadata for one or more replica groups may be accessed. Based, at least in part, on the status metadata, a number of available replicas for at least one replica group may be determined to be not compliant with a healthy state definition for the replica group. One or more healing operations to restore the number of available replicas for the at least one replica group to the respective healthy state definition may be dynamically scheduled. In some embodiments, one or more resource constraints for performing healing operations and one or more resource requirements for each of the one or more healing operations may be used to order the one or more healing operations.
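The ordering of healing operations by resource constraints and requirements might look like the following sketch. The abstract does not specify an ordering policy, so everything here is an assumption: a single bandwidth budget stands in for the resource constraints, each operation carries one bandwidth requirement, and groups missing the most replicas are healed first. `HealingOperation` and `schedule` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class HealingOperation:
    """Hypothetical healing operation for one replica group."""
    group_id: str
    missing_replicas: int   # distance from the healthy state definition
    bandwidth_needed: int   # assumed single resource requirement


def schedule(operations: List[HealingOperation],
             bandwidth_budget: int) -> Tuple[List[HealingOperation], List[HealingOperation]]:
    """Split operations into those that fit the budget now and those deferred.

    Assumed policy: most replicas missing first, then cheapest first,
    with the group id as a deterministic tie-breaker.
    """
    ordered = sorted(
        operations,
        key=lambda op: (-op.missing_replicas, op.bandwidth_needed, op.group_id),
    )
    run_now, deferred = [], []
    remaining = bandwidth_budget
    for op in ordered:
        if op.bandwidth_needed <= remaining:
            run_now.append(op)
            remaining -= op.bandwidth_needed
        else:
            deferred.append(op)
    return run_now, deferred
```

Deferred operations would be reconsidered on a later scheduling pass, once resources have been released by completed healing operations.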
73 Citations
20 Claims
8. A method, comprising:
performing, by a plurality of computing devices:
maintaining a plurality of replicas of data on behalf of a client of a data store at different ones of a plurality of compute nodes as a replica group for the data;
obtaining individual metadata for different replicas of the replica group to update status metadata stored for the replica group at one or more of the compute nodes remote from the different ones of the compute nodes;
accessing, by a replica group status sweeper that is remote from the different ones of the compute nodes and remote from the one or more compute nodes that store the status metadata, the updated status metadata for the replica group at the one or more compute nodes to evaluate the replica group for compliance with a healthy state definition of a number of replicas for the replica group based, at least in part, on the updated status metadata, wherein the evaluation determines that a number of available replicas for the replica group is not compliant with the healthy state definition; and
automatically restoring the replica group such that the number of available replicas for the replica group is compliant with the healthy state definition for the replica group.
15. A non-transitory, computer-readable storage medium, storing program instructions that when executed by a plurality of computing devices implement a data storage service that implements:
maintaining a plurality of replicas of data on behalf of a client of a data store at different ones of a plurality of compute nodes as a replica group for the data;
obtaining individual metadata for different replicas of the replica group to update status metadata stored for the replica group at one or more of the compute nodes remote from the different ones of the compute nodes;
accessing, by a replica group status sweeper that is remote from the different ones of the compute nodes and remote from the one or more compute nodes that store the status metadata, the updated status metadata for the replica group at the one or more compute nodes to evaluate the replica group for compliance with a healthy state definition of a number of replicas for the replica group based, at least in part, on the updated status metadata, wherein the evaluation determines that a number of available replicas for the replica group is not compliant with the healthy state definition; and
automatically restoring the replica group such that the number of available replicas for the replica group is compliant with the healthy state definition for the replica group.
Specification