Monitoring and Automated Recovery of Data Instances

US 20100251002A1
Filed: 03/31/2009
Published: 09/30/2010
Est. Priority Date: 03/31/2009
Status: Active Grant

Abstract

The monitoring and recovery of data instances, data stores, and other such components in a data environment can be performed automatically using a separate control environment. A monitoring component of the control plane can include a set of event processors for monitoring a workload of the data environment, where an event processor detecting a problem in the data plane can cause a recovery workflow to be generated in order to recover from the detected problem. The event processors can communicate with each other such that if one of the event processors becomes unavailable, the other event processors in the set are able to automatically redistribute responsibility for the workload.

166 Citations

25 Claims

1. A computer-implemented method of recovering from a failure in a data environment, comprising:
- under control of one or more computer systems configured with executable instructions, periodically sending a status request from at least one event processor in a control environment to each of a plurality of host managers in a data environment, each host manager responsible for monitoring a status of at least one data instance in the data environment;
  
  analyzing, in the control environment, a response received from each host manager to determine whether a potential problem exists with one of the host managers or data instances in the data environment; and
  
  when a potential problem is determined to exist, determining an appropriate recovery workflow to be executed for the potential problem and causing at least one task of the determined recovery workflow to be executed in the data environment.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 14)
- - 2. A computer-implemented method according to claim 1, further comprising:
    - storing information for the determined problem to a job queue in the data plane; and
      
      detecting the information in the job queue before determining the appropriate recovery workflow.
  - 3. A computer-implemented method according to claim 1, wherein:
    - when the potential problem is a failure of a data instance, the recovery workflow includes automatically rebooting the data instance or re-provisioning another data instance.
  - 4. A computer-implemented method according to claim 1, wherein:
    - the potential problem is selected from a group of problems including at least one of failure of a data instance, a failure of a host device, a network outage, a data center outage, a data store error, an input/output (I/O) error, and a failure of a host manager.
  - 5. A computer-implemented method according to claim 1, further comprising:
    - when the potential problem is determined to involve rebooting of a host device in the data environment, determining an appropriate recovery workflow includes determining not to execute a recovery workflow.
  - 6. A computer-implemented method according to claim 1, further comprising:
    - when the potential problem is determined to involve a number of concurrent failures exceeding a specified threshold, determining an appropriate recovery workflow includes at least one of contacting an operator before the determined recovery workflow is executed or performing a staged recovery.
  - 7. A computer-implemented method according to claim 1, further comprising:
    - determining whether a potential problem exists when a response is not received from one of the host managers.
  - 8. A computer-implemented method according to claim 5, further comprising:
    - resending the status request at least once for a host manager when a response is not received before determining whether a potential problem exists.
  - 14. A computer-implemented method according to claim 8, further comprising:
    - when the newly started event processor is activated, reallocating the portion of the range of identifiers to each of the set of event processors.
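
The polling, retry, and workflow-selection steps recited in claims 1 through 8 can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `HostStatus` fields, the workflow labels, the default of one resend, and the concurrent-failure threshold of 2. Treating a missing response as a host manager problem is likewise only one possible reading of claim 7.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class HostStatus:
    """Hypothetical shape of a host manager's reply to a status request."""
    host_id: str
    instance_ok: bool
    host_rebooting: bool = False


# A sender returns a HostStatus, or None when no response arrives.
Sender = Callable[[str], Optional[HostStatus]]


def poll_host(send_request: Sender, host_id: str, retries: int = 1) -> Optional[HostStatus]:
    """Send a status request, resending up to `retries` times on no response."""
    for _ in range(retries + 1):
        status = send_request(host_id)
        if status is not None:
            return status
    return None


def choose_workflow(status: Optional[HostStatus]) -> Optional[str]:
    """Map one poll result to a workflow label, or None for no action."""
    if status is None:
        # Assumption: silence after the resends is treated as a host manager problem.
        return "recover_host_manager"
    if status.host_rebooting:
        # A rebooting host is left alone (compare claim 5).
        return None
    if not status.instance_ok:
        return "reboot_or_reprovision_instance"
    return None


def sweep(send_request: Sender, host_ids: List[str],
          max_concurrent_failures: int = 2) -> Dict[str, str]:
    """Poll every host once and return the planned workflow per failing host."""
    planned: Dict[str, str] = {}
    for host_id in host_ids:
        workflow = choose_workflow(poll_host(send_request, host_id))
        if workflow is not None:
            planned[host_id] = workflow
    if len(planned) > max_concurrent_failures:
        # Too many simultaneous failures: hold everything for an operator
        # instead of recovering automatically (compare claim 6).
        return {host_id: "hold_for_operator" for host_id in planned}
    return planned
```

A caller would pass `sweep` a function that actually contacts a host manager. In a test, a dictionary lookup such as `lambda h: responses.get(h)` is enough to exercise each branch.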

9. A computer-implemented method of monitoring components in a data environment, comprising:
- under control of one or more computer systems configured with executable instructions, determining a set of event processors in a control environment for monitoring a plurality of components in the data environment, the plurality of components each having an identifier over a range of identifiers;
  
  allocating a portion of the range of identifiers to each of the set of event processors, each event processor being allocated a substantially equivalent portion of the range of identifiers for monitoring;
  
  periodically sending a status message from each of the event processors to be received by the other event processors in the set indicating that the event processor sending the status message is active; and
  
  in response to not receiving a status message from one of the event processors for at least a determined period of time, automatically reallocating the range of identifiers to the active event processors from which status messages were received, wherein each active event processor receives a different substantially equivalent portion of the range of identifiers based on the number of active event processors.
- View Dependent Claims (10, 11, 12, 13, 15, 16)
- - 10. A computer-implemented method according to claim 9, wherein:
    - each event processor is operable to periodically send status messages to each component in the data environment having an identifier in the portion of the range of identifiers allocated to the event processor.
  - 11. A computer-implemented method according to claim 10, wherein:
    - each event processor is further operable to store information for a monitored component to a job queue when a potential problem is detected, the information being used to determine a recovery workflow to be executed for the monitored component in the data environment.
  - 12. A computer-implemented method according to claim 9, further comprising:
    - storing information for the event processor from which a status message was not received to a job queue in the control environment; and
      
      using the information to generate a workflow to restart the event processor or start a new event processor to the set of event processors.
  - 13. A computer-implemented method according to claim 12, further comprising:
    - causing the newly started event processor to send and receive heartbeats to other event processors in the set before activating the newly started event processor.
  - 15. A computer-implemented method according to claim 9, further comprising:
    - sorting the identifiers and allocating the sorted identifiers substantially uniformly across the set of event processors.
  - 16. A computer-implemented method according to claim 9, further comprising:
    - running each event processor at a portion of a processing capacity, wherein each event processor is able to accept a portion of the workload from an unavailable event processor without negatively impacting performance.
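
The allocation described in claims 9 and 15, in which identifiers are sorted and spread substantially uniformly across the event processors, might look roughly like the following Python sketch. The function name `allocate`, the identifier strings, and the choice of contiguous slices are illustrative assumptions, since the claims do not prescribe a particular partitioning scheme.

```python
from typing import Dict, List


def allocate(identifiers: List[str], processors: List[str]) -> Dict[str, List[str]]:
    """Split sorted identifiers into contiguous, near-equal slices, one per processor."""
    if not processors:
        raise ValueError("at least one active event processor is required")
    ordered = sorted(identifiers)
    # Every processor gets `base` identifiers; the first `extra` get one more.
    base, extra = divmod(len(ordered), len(processors))
    allocation: Dict[str, List[str]] = {}
    start = 0
    for index, processor in enumerate(sorted(processors)):
        size = base + (1 if index < extra else 0)
        allocation[processor] = ordered[start:start + size]
        start += size
    return allocation
```

Calling `allocate` again with a shorter processor list models the reallocation step: the same identifiers are divided among the processors that remain, so each one picks up part of the missing processor's share.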

17. A system for recovering from a failure in a data environment, comprising:
- at least one processor; and
  
  memory including instructions that, when executed by the at least one processor, cause the system to:
  
  periodically send a status request from at least one event processor in a control environment to each of a plurality of host managers in a data environment, each host manager responsible for monitoring a status of at least one data instance in the data environment;
  
  analyze, in the control environment, a response received from each host manager to determine whether a potential problem exists with one of the host managers or data instances in the data environment; and
  
  when a potential problem is determined to exist, determine an appropriate recovery workflow to be executed for the potential problem and cause at least one task of the determined recovery workflow to be executed in the data environment.
- View Dependent Claims (18, 19, 20, 21)
- - 18. A system according to claim 17, wherein the instructions, when executed by the at least one processor, further cause the system to:
    - store information for the determined problem to a job queue in the data plane; and
      
      detect the information in the job queue before determining the appropriate recovery workflow.
  - 19. A system according to claim 17, wherein:
    - when the potential problem is a failure of a data instance, a failure of a host device, a network outage, or a data center outage, the recovery workflow includes automatically rebooting the data instance or re-provisioning another data instance.
  - 20. A system according to claim 17, wherein the instructions, when executed by the at least one processor, further cause the system to:
    - determine whether a potential problem exists when a response is not received from one of the host managers.
  - 21. A system according to claim 17, wherein the instructions, when executed by the at least one processor, further cause the system to:
    - resend the status request at least once for a host manager when a response is not received before determining whether a potential problem exists.
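
The job queue hand-off in claim 18, where problem information is stored first and then detected before a workflow is chosen, can be sketched as below. This is a minimal in-memory illustration under stated assumptions: `ProblemRecord`, `JobQueue`, the problem-kind strings, and the `escalate_to_operator` fallback are all hypothetical names, and a real deployment would presumably use a durable queue.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Tuple


@dataclass(frozen=True)
class ProblemRecord:
    """Hypothetical record an event processor stores when it detects a problem."""
    component_id: str
    kind: str


# Hypothetical mapping from problem kind to workflow label. The four kinds
# mirror the list in claim 19, which all lead to reboot or re-provisioning.
WORKFLOWS: Dict[str, str] = {
    "instance_failure": "reboot_or_reprovision_instance",
    "host_failure": "reboot_or_reprovision_instance",
    "network_outage": "reboot_or_reprovision_instance",
    "data_center_outage": "reboot_or_reprovision_instance",
}


class JobQueue:
    """In-memory stand-in for the job queue."""

    def __init__(self) -> None:
        self._jobs: Deque[ProblemRecord] = deque()

    def put(self, record: ProblemRecord) -> None:
        self._jobs.append(record)

    def drain(self) -> List[ProblemRecord]:
        """Return all pending records in arrival order and empty the queue."""
        records = list(self._jobs)
        self._jobs.clear()
        return records


def plan_recoveries(queue: JobQueue) -> List[Tuple[str, str]]:
    """Detect queued problems, then pick a workflow label for each one."""
    plans: List[Tuple[str, str]] = []
    for record in queue.drain():
        # Assumption: unrecognized problem kinds are escalated, not auto-recovered.
        workflow = WORKFLOWS.get(record.kind, "escalate_to_operator")
        plans.append((record.component_id, workflow))
    return plans
```

Separating `put` from `plan_recoveries` keeps detection and workflow selection decoupled, which is the ordering the claim describes: the information sits in the queue until something reads it.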

22. A system for monitoring components in a data environment, comprising:
- at least one processor; and
  
  memory including instructions that, when executed by the at least one processor, cause the system to:
  
  determine a set of event processors in a control environment for monitoring a plurality of components in the data environment, the plurality of components each having an identifier over a range of identifiers;
  
  allocate a portion of the range of identifiers to each of the set of event processors, each event processor being allocated a substantially equivalent portion of the range of identifiers for monitoring;
  
  periodically send a status message from each of the event processors to be received by the other event processors in the set indicating that the event processor sending the status message is active; and
  
  in response to not receiving a status message from one of the event processors for at least a determined period of time, automatically reallocate the range of identifiers to the active event processors from which status messages were received, wherein each active event processor receives a different substantially equivalent portion of the range of identifiers based on the number of active event processors.
- View Dependent Claims (23, 24, 25)
- - 23. A system according to claim 22, wherein the instructions, when executed by the at least one processor, further cause the system to:
    - store information for the event processor from which a status message was not received to a job queue in the control environment; and
      
      use the information to generate a workflow to restart the event processor or start a new event processor to the set of processors.
  - 24. A system according to claim 22, wherein the instructions, when executed by the at least one processor, further cause the system to:
    - when the newly started event processor is activated, reallocate the portion of the range of identifiers to each of the set of event processors.
  - 25. A system according to claim 22, wherein the instructions, when executed by the at least one processor, further cause the system to:
    - sort the identifiers and allocate the sorted identifiers substantially uniformly across the set of event processors.
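
The heartbeat-driven reallocation in claims 22 through 25 can be sketched as follows. The sketch assumes that timestamps are passed in explicitly as seconds, that a processor counts as inactive once its last status message is at least `timeout_seconds` old, and that identifiers are divided into contiguous slices. `HeartbeatTracker` and `reallocate` are hypothetical names, not terms from the patent.

```python
from typing import Dict, List


class HeartbeatTracker:
    """Records the last status message time seen from each event processor."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._last_seen: Dict[str, float] = {}

    def record(self, processor_id: str, now: float) -> None:
        """Note that a status message from `processor_id` arrived at time `now`."""
        self._last_seen[processor_id] = now

    def active(self, now: float) -> List[str]:
        """Processors whose last status message is newer than the timeout."""
        return sorted(
            processor_id
            for processor_id, seen in self._last_seen.items()
            if now - seen < self.timeout_seconds
        )


def reallocate(identifiers: List[str], active: List[str]) -> Dict[str, List[str]]:
    """Give each active processor a contiguous, near-equal slice of the sorted identifiers."""
    if not active:
        raise ValueError("no active event processors to take over the workload")
    ordered = sorted(identifiers)
    total, count = len(ordered), len(active)
    return {
        processor_id: ordered[index * total // count:(index + 1) * total // count]
        for index, processor_id in enumerate(active)
    }
```

Running `reallocate(ids, tracker.active(now))` after every heartbeat interval covers both directions: a processor that falls silent drops out of the active list, and a newly started one joins it as soon as its first status message is recorded.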

Current Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors
Hunter, Barry B., Jr.; MacDonald McAlister, Grant Alexander; Sivasubramanian, Swaminathan; Pol, Parikshit S.

Granted Patent

US 8,060,792 B2
US Class Current

714/2
CPC Class Codes

G06F 11/0709   in a distributed system con...

G06F 11/0793   Remedial or corrective acti...

G06F 2209/505   Clust

G06F 9/5061   Partitioning or combining o...
