Migrating recovery modules in a distributed computing environment
Abstract
Systems and methods for implementing recovery processes on failed nodes in a distributed computing environment are described. In accordance with this scheme, one or more migratory recovery modules are launched into the network. The recovery modules migrate from node to node, determine the status of each node, and initiate recovery processes on failed nodes. In this way, scalable recovery processes may be implemented in distributed systems, even with incomplete network topology and membership information. In addition, the complexity and cost associated with manual status monitoring and recovery operations may be avoided.
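The migrate, check, and recover cycle described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the `Node` class, its `neighbors` list, the `restart` placeholder, and the `run_recovery_module` function are all hypothetical names introduced here, and "migration" is simulated in a single process rather than by moving code between machines. The sketch assumes each node knows only its own neighbors, which mirrors the point that complete topology information is not required.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a network node (not defined by the patent)."""
    name: str
    processes: dict                      # process name -> True if running
    neighbors: list = field(default_factory=list)

    def failed_processes(self):
        return [p for p, running in self.processes.items() if not running]

    def restart(self, process):
        # Placeholder recovery process: mark the process as running again.
        self.processes[process] = True


def run_recovery_module(network, start):
    """Visit every node reachable from `start`, recovering failed processes.

    Returns a list of (node name, recovered process names) in visit order.
    """
    report = []
    visited = set()
    pending = [start]
    while pending:
        name = pending.pop()
        if name in visited:
            continue
        visited.add(name)
        node = network[name]
        failed = node.failed_processes()     # determine the node's status
        for process in failed:               # initiate recovery if needed
            node.restart(process)
        report.append((name, failed))
        # Simulated migration: the next hops are learned from the current
        # node only, so no global membership list is needed up front.
        pending.extend(n for n in node.neighbors if n not in visited)
    return report


# Example network (assumed data, for illustration only).
network = {
    "a": Node("a", {"db": True}, ["b", "c"]),
    "b": Node("b", {"web": False}, ["a"]),
    "c": Node("c", {"cache": False, "web": True}, ["a"]),
}
report = run_recovery_module(network, "a")
```

After the call, every process in the example network is marked as running, and `report` records which processes were recovered on each visited node.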
40 Citations
26 Claims
1. A system for managing a plurality of distributed nodes of a network, comprising:
- a memory storing computer-readable instructions; and
a processor coupled to the memory, operable to execute the instructions, and based at least in part on the execution of the instructions operable to perform operations comprising executing a network management module that causes the processor to launch migratory recovery modules into the network to monitor status of each of the network nodes;
wherein each of the recovery modules is configured to:
cause any given one of the network nodes to migrate the recovery module from the given network node to another one of the network nodes;
cause any given one of the network nodes to determine a respective status of the given network node; and
cause any given one of the network nodes to initiate a recovery process on the given network node in response to a determination that the given network node has one or more failed node processes;
wherein, in the executing, the network management module causes the processor to perform operations comprising:
launching the recovery modules in order to determine the status of each of the network nodes;
monitoring transmissions that are received from the recovery modules executing on respective ones of the network nodes in order to provide periodic monitoring of the status of each of the network nodes;
statistically identifying target ones of the network nodes that are needed to achieve a specified confidence level of network monitoring reliability; and
launching the recovery modules into the network by transmitting respective ones of the recovery modules to the identified target network nodes. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 20, 21, 22, 23, 24, 25, 26)
10. A method for managing a plurality of distributed nodes of a network, comprising:
- (a) on a current one of the network nodes, determining a status of the current network node;
(b) in response to a determination that the current network node has one or more failed node processes, initiating a recovery process on the current network node;
(c) after initiating the recovery process, migrating from the current network node to a successive one of the network nodes;
(d) repeating (a), (b), and (c) with the current network node corresponding to the successive network node for each of the nodes in the network; and
(e) on a respective one of the network nodes:
determining a number of the recovery modules needed to achieve a specified network monitoring service level;
statistically identifying target ones of the network nodes to achieve a specified confidence level of network monitoring reliability; and
transmitting the determined number of the recovery modules to the identified target network nodes. - View Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18)
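Step (e) of the method can be illustrated with a rough sketch. The claims do not state which statistical method is used, so the model below is an assumption: if at least a fraction `failure_fraction` of the nodes has failed, then probing n randomly chosen nodes misses every failed node with probability of at most (1 - failure_fraction)^n, and n is chosen so that this miss probability does not exceed 1 - confidence. The function names `modules_needed` and `pick_targets` and all parameters are hypothetical.

```python
import math
import random


def modules_needed(confidence, failure_fraction):
    """Smallest n with (1 - failure_fraction) ** n <= 1 - confidence.

    Assumed model, not taken from the claims: nodes are probed
    independently and uniformly at random.
    """
    if not (0 < confidence < 1 and 0 < failure_fraction < 1):
        raise ValueError("confidence and failure_fraction must be in (0, 1)")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_fraction))


def pick_targets(nodes, confidence, failure_fraction, rng=None):
    """Randomly choose target nodes, one per recovery module to launch."""
    rng = rng or random.Random()
    nodes = list(nodes)
    # Never request more targets than there are nodes in the network.
    count = min(modules_needed(confidence, failure_fraction), len(nodes))
    # Sampling without replacement detects at least as reliably as the
    # with-replacement bound used to size the sample.
    return rng.sample(nodes, count)


# Example: 100 nodes, 95% confidence of hitting a failure that affects
# at least 10% of the nodes (assumed figures, for illustration only).
targets = pick_targets([f"node-{i}" for i in range(100)], 0.95, 0.10,
                       rng=random.Random(0))
```

Under these assumed figures the sample size is 29 targets. A different reliability model, such as one based on the probability of a module being lost in transit, would give a different count.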
19. A computer-readable persistent storage medium comprising computer code for managing a plurality of distributed nodes of a network, the computer code comprising computer-readable instructions that, when executed by respective processors, cause the respective processors to implement a management module and recovery modules;
- wherein the management module is operable to cause at least one of the processors to perform operations comprising statistically identifying target ones of the network nodes that are needed to achieve a specified confidence level of network monitoring reliability, and launching the recovery modules into the network by transmitting respective ones of the recovery modules to the identified target network nodes;
wherein each of the recovery modules is operable to cause at least one of the processors to perform operations comprising migrating the recovery module from one network node to a series of successive network nodes, determining a status of a current one of the network nodes to which the recovery module has migrated;
in response to a determination that the current network node has one or more failed node processes, initiating a recovery process on the current network node; and
after initiating the recovery process on the current network node, migrating from the current network node to a successive one of the network nodes.
Specification