Monitoring and Automated Recovery of Data Instances
Abstract
The monitoring and recovery of data instances, data stores, and other such components in a data environment can be performed automatically using a separate control environment. A monitoring component of the control plane can include a set of event processors for monitoring a workload of the data environment, where an event processor detecting a problem in the data plane can cause a recovery workflow to be generated in order to recover from the detected problem. The event processors can communicate with each other such that if one of the event processors becomes unavailable, the other event processors in the set are able to automatically redistribute responsibility for the workload.
25 Claims
1. A computer-implemented method of recovering from a failure in a data environment, comprising:
under control of one or more computer systems configured with executable instructions,
periodically sending a status request from at least one event processor in a control environment to each of a plurality of host managers in a data environment, each host manager responsible for monitoring a status of at least one data instance in the data environment;
analyzing, in the control environment, a response received from each host manager to determine whether a potential problem exists with one of the host managers or data instances in the data environment; and
when a potential problem is determined to exist, determining an appropriate recovery workflow to be executed for the potential problem and causing at least one task of the determined recovery workflow to be executed in the data environment.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 14

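The three steps of claim 1 (request status, analyze the responses, start a recovery workflow) can be illustrated with a minimal Python sketch. Every name here (`StatusResponse`, `analyze`, `RECOVERY_WORKFLOWS`, `poll_once`, the problem labels, and the task names) is a hypothetical chosen for illustration; the claim does not prescribe any particular data structure, transport, or workflow content.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# All names in this sketch are hypothetical; none come from the patent text.


@dataclass
class StatusResponse:
    host_manager_id: str
    instance_healthy: bool  # status of the data instance the host manager monitors


def analyze(response: Optional[StatusResponse]) -> Optional[str]:
    """Return a label for a potential problem, or None if none is detected."""
    if response is None:
        # No reply: the host manager itself may have a problem.
        return "host_manager_unreachable"
    if not response.instance_healthy:
        return "data_instance_unhealthy"
    return None


# Assumed mapping from a problem label to an ordered list of workflow tasks.
RECOVERY_WORKFLOWS: Dict[str, List[str]] = {
    "host_manager_unreachable": ["restart_host_manager", "verify_host"],
    "data_instance_unhealthy": ["restart_instance", "verify_instance"],
}


def poll_once(
    host_ids: List[str],
    send_status_request: Callable[[str], Optional[StatusResponse]],
    execute_task: Callable[[str, str], None],
) -> List[Tuple[str, str]]:
    """One monitoring pass by an event processor over its host managers."""
    started = []
    for host_id in host_ids:
        response = send_status_request(host_id)  # status request to a host manager
        problem = analyze(response)              # analysis in the control environment
        if problem is None:
            continue
        workflow = RECOVERY_WORKFLOWS[problem]   # choose a recovery workflow
        execute_task(host_id, workflow[0])       # cause at least one task to execute
        started.append((host_id, problem))
    return started
```

In this sketch the "periodically" part of the claim is left to the caller, which would invoke `poll_once` on a timer; the transport behind `send_status_request` and the executor behind `execute_task` are injected so the loop can be exercised without a real data environment.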
9. A computer-implemented method of monitoring components in a data environment, comprising:
under control of one or more computer systems configured with executable instructions,
determining a set of event processors in a control environment for monitoring a plurality of components in the data environment, the plurality of components each having an identifier over a range of identifiers;
allocating a portion of the range of identifiers to each of the set of event processors, each event processor being allocated a substantially equivalent portion of the range of identifiers for monitoring;
periodically sending a status message from each of the event processors to be received by the other event processors in the set indicating that the event processor sending the status message is active; and
in response to not receiving a status message from one of the event processors for at least a determined period of time, automatically reallocating the range of identifiers to the active event processors from which status messages were received, wherein each active event processor receives a different substantially equivalent portion of the range of identifiers based on the number of active event processors.
Dependent claims: 10, 11, 12, 13, 15, 16

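The allocation and reallocation steps of claim 9 can be sketched as follows. This is one possible reading, assuming integer identifiers, a half-open range, contiguous portions, and a timestamp per event processor recording its last status message; the function names `allocate` and `active_processors` are hypothetical, and the claim does not require contiguous splitting.

```python
from typing import Dict, List, Tuple


def allocate(
    range_start: int, range_end: int, processors: List[str]
) -> Dict[str, Tuple[int, int]]:
    """Split [range_start, range_end) into near-equal contiguous portions."""
    if not processors:
        raise ValueError("at least one active event processor is required")
    total = range_end - range_start
    base, extra = divmod(total, len(processors))
    allocation = {}
    cursor = range_start
    # Sorting gives every event processor the same view of who owns what.
    for index, name in enumerate(sorted(processors)):
        size = base + (1 if index < extra else 0)  # spread any remainder
        allocation[name] = (cursor, cursor + size)
        cursor += size
    return allocation


def active_processors(
    last_status: Dict[str, float], now: float, timeout: float
) -> List[str]:
    """Keep the processors whose last status message is newer than the timeout."""
    return [name for name, seen in last_status.items() if now - seen < timeout]


# Usage: three processors share identifiers 0..89; "ep-b" then goes silent.
initial = allocate(0, 90, ["ep-a", "ep-b", "ep-c"])
last_status = {"ep-a": 100.0, "ep-b": 70.0, "ep-c": 99.0}
survivors = active_processors(last_status, now=101.0, timeout=10.0)
reallocated = allocate(0, 90, survivors)
```

With these inputs each of the three processors initially covers 30 identifiers, and after `ep-b` misses the timeout the two remaining processors each cover 45, so every active processor's portion changes with the number of active processors.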
17. A system for recovering from a failure in a data environment, comprising:
at least one processor; and
memory including instructions that, when executed by the at least one processor, cause the system to:
periodically send a status request from at least one event processor in a control environment to each of a plurality of host managers in a data environment, each host manager responsible for monitoring a status of at least one data instance in the data environment;
analyze, in the control environment, a response received from each host manager to determine whether a potential problem exists with one of the host managers or data instances in the data environment; and
when a potential problem is determined to exist, determine an appropriate recovery workflow to be executed for the potential problem and causing at least one task of the determined recovery workflow to be executed in the data environment.
Dependent claims: 18, 19, 20, 21

22. A system for monitoring components in a data environment, comprising:
at least one processor; and
memory including instructions that, when executed by the at least one processor, cause the system to:
determine a set of event processors in a control environment for monitoring a plurality of components in the data environment, the plurality of components each having an identifier over a range of identifiers;
allocate a portion of the range of identifiers to each of the set of event processors, each event processor being allocated a substantially equivalent portion of the range of identifiers for monitoring;
periodically send a status message from each of the event processors to be received by the other event processors in the set indicating that the event processor sending the status message is active; and
in response to not receiving a status message from one of the event processors for at least a determined period of time, automatically reallocate the range of identifiers to the active event processors from which status messages were received, wherein each active event processor receives a different substantially equivalent portion of the range of identifiers based on the number of active event processors.
Dependent claims: 23, 24, 25

Specification