Migrating recovery modules in a distributed computing environment

US 7,743,126 B2
Filed: 06/28/2001
Issued: 06/22/2010
Est. Priority Date: 06/28/2001
Status: Expired due to Fees


Abstract

Systems and methods for implementing recovery processes on failed nodes in a distributed computing environment are described. In accordance with this scheme, one or more migratory recovery modules are launched into the network. The recovery modules migrate from node to node, determine the status of each node, and initiate recovery processes on failed nodes. In this way, scalable recovery processes may be implemented in distributed systems, even with incomplete network topology and membership information. In addition, the complexity and cost associated with manual status monitoring and recovery operations may be avoided.
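The migrate, check, and recover cycle summarized above can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `Node` and `RecoveryModule` classes, the dictionary of process states, and the fixed itinerary are illustrative assumptions, not the patented implementation. A real module would be transmitted between hosts rather than looping inside one process.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical network node: maps process name -> True if running."""
    name: str
    processes: dict = field(default_factory=dict)

    def failed_processes(self):
        return [p for p, running in self.processes.items() if not running]

    def restart(self, process):
        # Stand-in for a real recovery process (for example a restart request).
        self.processes[process] = True


@dataclass
class RecoveryModule:
    """Hypothetical migratory module that carries its own report log."""
    report: list = field(default_factory=list)

    def visit(self, node):
        failed = node.failed_processes()   # determine the node's status
        for process in failed:             # initiate recovery where needed
            node.restart(process)
        self.report.append((node.name, failed))

    def run(self, itinerary):
        # "Migration" is simulated by visiting the nodes in order.
        for node in itinerary:
            self.visit(node)
        return self.report


nodes = [
    Node("a", {"db": True, "web": True}),
    Node("b", {"db": False, "web": True}),
    Node("c", {"cache": False, "queue": False}),
]
report = RecoveryModule().run(nodes)
print(report)
```

The report the module accumulates plays the role of the node status messages that the claims describe being sent back to the network management module.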

26 Claims

1. A system for managing a plurality of distributed nodes of a network, comprising:
- a memory storing computer-readable instructions; and
  
  a processor coupled to the memory, operable to execute the instructions, and based at least in part on the execution of the instructions operable to perform operations comprising executing a network management module that causes the processor to launch migratory recovery modules into the network to monitor status of each of the network nodes;
  
  wherein each of the recovery modules is configured to:
  
  cause any given one of the network nodes to migrate the recovery module from the given network node to another one of the network nodes;
  
  cause any given one of the network nodes to determine a respective status of the given network node; and
  
  cause any given one of the network nodes to initiate a recovery process on the given network node in response to a determination that the given network node has one or more failed node processes
  
  wherein, in the executing, the network management module causes the processor to perform operations comprising, launching the recovery modules in order to determine the status of each of the network nodes, monitoring transmissions that are received from the recovery modules executing on respective ones of the network nodes in order to provide periodic monitoring of the status of each of the network nodes, and statistically identifying target ones of the network nodes that are needed to achieve a specified confidence level of network monitoring reliability, and launching the recovery modules into the network by transmitting respective ones of the recovery modules to the identified target network nodes.
- - 2. The system of claim 1, wherein at least one of the recovery modules comprises a respective routing component that is executable by a given one of the network nodes to cause the given network node to determine next hop addresses for migrating the recovery module from the given network node to a series of successive destination network nodes.
  - 3. The system of claim 2, wherein the routing component is executable by the given network node to cause the given network node to determine the next hop addresses based upon a routing table stored at the given network node.
  - 4. The system of claim 1, wherein at least one of the recovery modules is executable by a given one of the network nodes to cause the given network node to determine the status of the given network node by sending an inter-process communication to a node process executing on the given network node.
  - 5. The system of claim 1, wherein each of the recovery modules is configured to cause any given one of the network nodes to determine the status of the given network node in accordance with a heartbeat messaging protocol.
  - 6. The system of claim 1, wherein each of the recovery modules is executable by a given one of the network nodes to cause the given network node to perform operations comprising:
    - determining whether the given network node has one or more failed processes; and
      
      in response to a determination that the given network node has a failed process, initiating a recovery process on the given network node in accordance with a restart protocol.
  - 7. The system of claim 6, wherein each of the recovery modules is executable by the given network node to cause the given network node to respond to a determination that the given network node has a failed process by initiating a restart of the failed process by transmitting a request to a process execution service operating on the given network node.
  - 8. The system of claim 1, wherein each of the recovery modules is executable by a given one of the network nodes to cause the given network node to transmit a respective node status message to the network management module.
  - 9. The system of claim 8, wherein each of the node status messages comprises information obtained from a respective log file generated at a respective one of the network nodes having one or more failed node processes.
  - 20. The system of claim 1, wherein each of the recovery modules is a software object that is instantiatable by a respective operating environment on each of the network nodes.
  - 21. The computer-readable persistent storage medium of claim 20, wherein the operating environment on each of the network nodes provides each of the recovery modules with access to status monitoring resources, recovery resources, and native operative system resources that are available at each of the network nodes.
  - 22. The system of claim 1, wherein, upon migrating from a first one of the network nodes to a second one of the network nodes and being instantiated on the second network node, a given one of the recovery modules causes the second network node to determine a status of the second network node.
  - 23. The system of claim 22, wherein the given recovery module causes the second network node to initiate a recovery process on the second network node in response to a determination that the second network node has one or more failed node processes.
  - 24. The system of claim 22, wherein the given recovery module is configured to cause the second network node to migrate the given recovery module to a third one of the network nodes after determining the status of the second network node.
  - 25. The system of claim 1, wherein the network management module causes the processor to determine a number of the recovery modules needed to achieve a specified network monitoring service level, and to launch the determined number of recovery modules into the network to achieve the specified network monitoring service level.
  - 26. The system of claim 1, wherein, in the executing, the network management module causes the processor to monitor the number of network node failures reported by the recovery modules and causes the processor to launch more of the migratory recovery modules into the network as the number of reported failures increases.
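
Claims 2 and 3 describe a routing component that determines next hop addresses from a routing table stored at the current node. A minimal sketch of that idea follows, assuming each node's table is a plain dictionary from destination to neighbor. The `next_hop` and `migrate` functions, the `tables` layout, and the hop limit are all hypothetical choices made for illustration; the claims do not prescribe a table format.

```python
def next_hop(routing_table, destination):
    """Return the neighbor to forward to for `destination`, or None."""
    return routing_table.get(destination)


def migrate(tables, origin, destination, max_hops=16):
    """Follow per-node routing tables from origin to destination.

    `tables` maps node name -> {destination: next-hop neighbor}.
    Returns the list of nodes visited, or raises if no route exists.
    """
    path = [origin]
    current = origin
    while current != destination:
        if len(path) > max_hops:
            raise RuntimeError("hop limit exceeded (possible routing loop)")
        hop = next_hop(tables[current], destination)
        if hop is None:
            raise LookupError(f"no route from {current} to {destination}")
        path.append(hop)
        current = hop
    return path


# Hypothetical three-node line topology: a <-> b <-> c.
tables = {
    "a": {"c": "b", "b": "b"},
    "b": {"c": "c", "a": "a"},
    "c": {"a": "b", "b": "b"},
}
path = migrate(tables, "a", "c")
print(path)  # ['a', 'b', 'c']
```

Because each lookup uses only the table held at the current node, the module needs no global view of the topology, which matches the abstract's point about operating with incomplete membership information.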

10. A method for managing a plurality of distributed nodes of a network, comprising:
- (a) on a current one of the network nodes, determining a status of the current network node;
  
  (b) in response to a determination that the current network node has one or more failed node processes, initiating a recovery process on the current network node;
  
  (c) after initiating the recovery process, migrating from the current network node to a successive one of the network nodes;
  
  (d) repeating (a), (b), and (c) with the current network node corresponding to the successive network node for each of the nodes in the network; and
  
  (e) on a respective one of the network nodes:
  
  determining a number of the recovery modules needed to achieve a specified network monitoring service level;
  
  statistically identifying target ones of the network nodes to achieve a specified confidence level of network monitoring reliability; and
  
  transmitting the determined number of the recovery modules to the identified target network nodes.
- - 11. The method of claim 10, wherein migrating from one network node to another comprises determining a next hop address from an origin network node to a destination network node.
  - 12. The method of claim 11, wherein the next hop address is determined based upon a routing table stored at the origin network node.
  - 13. The method of claim 10, wherein the status of a network node is determined by sending an inter-process communication to a node process.
  - 14. The method of claim 10, wherein the status of a network node is determined in accordance with a heartbeat messaging protocol.
  - 15. The method of claim 10, wherein a recovery process is initiated on a network node having one or more failed node processes in accordance with a restart protocol.
  - 16. The method of claim 15, wherein a restart of a failed node process is initiated by transmitting a request to a process execution service operating on the failed network node.
  - 17. The method of claim 10, further comprising transmitting a node status message to a network management module operating at a network management network node.
  - 18. The method of claim 10, further comprising launching a plurality of recovery modules from a respective one of the network nodes into the network, wherein each of the recovery modules is configured to:
    - migrate from one recipient one of the network nodes to another;
      
      cause each of the recipient network nodes to determine the status of itself; and
      
      cause each of the recipient network nodes having one or more failed node processes to initiate a recovery process on itself.
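
Claims 14 through 16 combine a heartbeat messaging protocol for status determination with a restart request sent to a process execution service. The sketch below shows one way these pieces could fit together, under loud assumptions: the timeout rule, the `ProcessExecutionService` class, and the timestamp dictionary are hypothetical and are not taken from the patent.

```python
import time


def stale_processes(last_heartbeat, now, timeout):
    """Return names of processes whose last heartbeat is older than timeout."""
    return sorted(
        name for name, seen in last_heartbeat.items() if now - seen > timeout
    )


class ProcessExecutionService:
    """Hypothetical stand-in for a node-local service that restarts processes."""

    def __init__(self):
        self.requests = []

    def request_restart(self, name):
        # A real service would launch the process; here we only record it.
        self.requests.append(name)
        return True


def check_and_recover(last_heartbeat, service, timeout=5.0, now=None):
    """Treat processes with stale heartbeats as failed and request restarts."""
    now = time.monotonic() if now is None else now
    failed = stale_processes(last_heartbeat, now, timeout)
    for name in failed:
        service.request_restart(name)
    return failed


service = ProcessExecutionService()
heartbeats = {"db": 100.0, "web": 93.0, "cache": 99.5}  # last-seen timestamps
failed = check_and_recover(heartbeats, service, timeout=5.0, now=101.0)
print(failed)  # ['web']
```

Passing `now` explicitly keeps the example deterministic; in practice the current time would come from a monotonic clock on the node being checked.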

19. A computer-readable persistent storage medium comprising computer code for managing a plurality of distributed nodes of a network, the computer code comprising computer-readable instructions that, when executed by respective processors, cause the respective processors to implement a management module and recovery modules;
- wherein the management module is operable to cause at least one of the processors to perform operations comprising statistically identifying target ones of the network nodes that are needed to achieve a specified confidence level of network monitoring reliability, and launching the recovery modules into the network by transmitting respective ones of the recovery modules to the identified target network nodes;
  
  wherein each of the recovery modules is operable to cause at least one of the processors to perform operations comprising migrating the recovery module from one network node to a series of successive network nodes, determining a status of a current one of the network nodes to which the recovery module has migrated;
  
  in response to a determination that the current network node has one or more failed node processes, initiating a recovery process on the current network node; and
  
  after initiating the recovery process on the current network node, migrating from the current network node to a successive one of the network nodes.
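
The claims recite statistically identifying target nodes to achieve a specified confidence level, but they do not give a formula. One possible reading, offered purely as an assumption, is a sampling rule: if each node is failed independently with probability `failure_rate`, then visiting `n` randomly chosen nodes finds at least one failure with probability `1 - (1 - failure_rate)**n`. The `sample_size` and `pick_targets` functions below are hypothetical and sketch that rule only.

```python
import math
import random


def sample_size(failure_rate, confidence):
    """Smallest n with 1 - (1 - failure_rate)**n >= confidence."""
    if not (0 < failure_rate < 1 and 0 < confidence < 1):
        raise ValueError("failure_rate and confidence must be in (0, 1)")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


def pick_targets(nodes, failure_rate, confidence, seed=None):
    """Randomly choose target nodes; capped at the size of the network."""
    n = min(len(nodes), sample_size(failure_rate, confidence))
    return random.Random(seed).sample(nodes, n)


# Assumed inputs: 200 nodes, 5% expected failure rate, 95% confidence.
nodes = [f"node-{i}" for i in range(200)]
n = sample_size(0.05, 0.95)
targets = pick_targets(nodes, 0.05, 0.95, seed=1)
print(n, len(targets))  # 59 59
```

Under these assumed inputs the rule selects 59 of the 200 nodes, which illustrates why statistical targeting can be cheaper than launching a recovery module to every node.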

Current Assignee: Hewlett Packard Enterprise Development LP (Hewlett-Packard Enterprise Company)
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Russell, Lance W.
Primary Examiner(s): Dollinger; Tonia L
Assistant Examiner(s): BILGRAMI, ASGHAR H
Application Number: US09/895,235
Publication Number: US 20030005102A1
Time in Patent Office: 3,281 Days
Field of Search: 709/223, 709/224, 709/226, 714/4, 714/11, 379/9, 379/10, 379/230, 379/229
US Class Current: 709/223
CPC Class Codes:
G06F 11/0709   in a distributed system con...
G06F 11/0793   Remedial or corrective acti...
G06F 11/1438   Restarting or rejuvenating
G06F 11/3006   where the computing system ...
G06F 11/3055   Monitoring arrangements for...
G06F 11/3093   Configuration details there...
H04L 41/0213   Standardised network manage...
H04L 41/048   mobile agents
H04L 41/0661   by reconfiguring faulty ent...
H04L 43/0817   by checking functioning
