Fault tolerant architecture for distributed computing systems
Abstract
Disclosed here is a fault-tolerant architecture suitable for use with any distributed computing system. The architecture may include any suitable number of supervisors, dependency managers, node managers, and other modules distributed across any suitable number of nodes. In one or more embodiments, supervisors may monitor the system using heartbeats from any suitable number of node managers and other modules. In one or more embodiments, supervisors may automatically recover failed modules in a distributed system by moving the modules and their dependencies to other nodes in the system. In one or more embodiments, supervisors may request a configuration package from one or more dependency managers when installing one or more modules on a node. In one or more embodiments, one or more modules may have any suitable number of redundant copies in the system, and the redundant copies may be stored on separate nodes.
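The heartbeat-based monitoring described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the `Heartbeat` fields, the status strings "running" and "failed", and the 5-second timeout are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Heartbeat:
    """Hypothetical heartbeat payload sent by a node manager."""
    node_id: str
    sent_at: float  # seconds; the clock source is an assumption
    # Module name -> status string ("running" or "failed" in this sketch).
    module_status: Dict[str, str] = field(default_factory=dict)


def failed_modules(hb: Heartbeat) -> List[str]:
    """Return the names of modules the heartbeat reports as not running."""
    return sorted(
        name for name, status in hb.module_status.items() if status != "running"
    )


def node_manager_silent(last_heartbeat_at: float, now: float,
                        timeout_s: float = 5.0) -> bool:
    """Treat a node manager as failed when no heartbeat arrived within timeout_s."""
    return (now - last_heartbeat_at) > timeout_s
```

A supervisor in this sketch would call `failed_modules` on each heartbeat it receives and `node_manager_silent` on a timer, which separates "a module failed" from "the node manager itself stopped reporting".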
20 Claims
1. A method comprising:
monitoring, by a computer comprising a processor executing a supervisor module, a heartbeat signal generated by a node manager monitoring one or more software modules stored on a node, wherein the heartbeat signal contains data indicating a status of each respective software module monitored by the node manager;
detecting, by the computer, a failed software module in the one or more software modules of the node based on the heartbeat signal received from the node manager of the node;
transmitting, by the computer, to the node manager of the node a command instructing the node to restore the failed software module, in response to detecting the failed software module;
determining, by the computer, whether the node manager successfully restored the module based on the heartbeat signal received from the node manager;
detecting, by the computer, a failure of the node manager monitoring the failed software module;
determining, by the computer, a failover node to execute the failed software module, wherein the failover node is associated with a failover node manager;
retrieving, by the computer, a configuration package associated with the failed software module from a dependency manager node;
transmitting, by the computer, the configuration package to the failover node manager, wherein the failover node manager attempts to install the failed software module on the failover node, and wherein the failover node manager attempts to restore the failed software module;
determining, by the computer, if the failover node manager successfully installs the failed software module on the failover node; and
determining, by the computer, if the failover node manager successfully restores the failed software module.
Dependent claims: 2, 3, 4, 5, 6, 7
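The sequence of steps in claim 1 can be sketched as a small in-memory simulation in Python. This is only an illustration under stated assumptions: `NodeManager`, `recover`, the status strings, and the `packages` dictionary standing in for the dependency manager node are hypothetical names, and the sketch also fails over when a local restore does not take effect, which the claim does not state.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NodeManager:
    """Hypothetical in-memory stand-in for a node manager."""
    node_id: str
    alive: bool = True
    can_restore: bool = True
    modules: Dict[str, str] = field(default_factory=dict)  # name -> status

    def heartbeat(self) -> Optional[Dict[str, str]]:
        # A failed node manager sends nothing; None models the missing heartbeat.
        return dict(self.modules) if self.alive else None

    def restore(self, module: str) -> None:
        # Restore only succeeds on a live manager that knows the module.
        if self.alive and self.can_restore and module in self.modules:
            self.modules[module] = "running"

    def install(self, module: str, package: dict) -> None:
        # The package content is ignored here; a real install would apply it.
        if self.alive:
            self.modules[module] = "installed"


def recover(module: str, home: NodeManager, candidates: List[NodeManager],
            packages: Dict[str, dict]) -> str:
    """Return the node_id running `module` after recovery, or raise."""
    # Command the original node manager to restore, then re-check its heartbeat.
    home.restore(module)
    hb = home.heartbeat()
    if hb is not None and hb.get(module) == "running":
        return home.node_id

    # Node manager failed (hb is None) or, by assumption, the restore did not
    # take effect: try each failover candidate in order.
    for failover in candidates:
        if failover.heartbeat() is None:
            continue  # skip candidates whose node manager is not reporting
        package = packages[module]  # stands in for the dependency manager node
        failover.install(module, package)
        if failover.heartbeat().get(module) != "installed":
            continue  # install was not confirmed by the heartbeat
        failover.restore(module)
        if failover.heartbeat().get(module) == "running":
            return failover.node_id
    raise RuntimeError(f"could not recover {module}")
```

Each success check reads the heartbeat rather than a return value from the command, mirroring the claim's wording that success is determined "based on the heartbeat signal".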
8. A fault-tolerant distributed computing system comprising:
one or more nodes comprising a processor transmitting a heartbeat signal to a first supervisor node and monitoring execution of one or more software modules installed on the one or more nodes;
one or more supervisor nodes comprising a processor monitoring one or more heartbeat signals received from the one or more nodes, and determining a status of each respective node based on each respective heartbeat signal;
a dependency manager node comprising non-transitory machine-readable storage media storing one or more machine-readable configuration package files; and
a failover node comprising a processor transmitting a heartbeat signal to the first supervisor node, wherein the failover node is configured to execute the one or more software modules,
wherein the processor of the one or more nodes is configured to attempt to restore a software module executed by the one or more nodes to a status quo configuration responsive to receiving a command to restore the one or more software modules from the first supervisor node,
wherein a processor of the dependency manager node transmits a configuration package file of the one or more machine-readable configuration package files to the first supervisor node in response to receiving from the first supervisor node a request identifying the configuration package file,
wherein the configuration package file is associated with a software module detected as a failure according to the node status of the heartbeat signal of the respective node executing the failed software module.
Dependent claims: 9, 10, 11, 12, 13
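The request and response between the first supervisor node and the dependency manager node in claim 8 might look like the following Python sketch. The class and function names, the file-name style package identifiers, and the module-to-package mapping are hypothetical; the claim does not specify how a package file is identified.

```python
from typing import Dict, Optional


class DependencyManager:
    """Hypothetical dependency manager holding configuration package files."""

    def __init__(self, packages: Dict[str, bytes]) -> None:
        self._packages = dict(packages)  # package id -> file contents

    def handle_request(self, package_id: str) -> Optional[bytes]:
        # Respond to a request identifying one package file; None if unknown.
        return self._packages.get(package_id)


def packages_for_failures(node_status: Dict[str, str],
                          module_to_package: Dict[str, str],
                          dm: DependencyManager) -> Dict[str, bytes]:
    """Fetch the package for every module a heartbeat reports as failed."""
    result: Dict[str, bytes] = {}
    for module, status in node_status.items():
        if status != "failed":
            continue
        package = dm.handle_request(module_to_package[module])
        if package is not None:
            result[module] = package
    return result
```

Only modules reported as failed trigger a request, so a healthy node generates no traffic to the dependency manager in this sketch.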
14. A system comprising:
a computer executing a supervisor module, wherein the computer is configured to:
monitor a heartbeat signal generated by a node manager monitoring one or more software modules stored on a node, wherein the heartbeat signal contains data indicating a status of each respective software module monitored by the node manager;
detect a failed software module in the one or more software modules of the node based on the heartbeat signal received from the node manager of the node;
transmit to the node manager of the node a command instructing the node to restore the failed software module, in response to detecting the failed software module;
determine whether the node manager successfully restored the module based on the heartbeat signal received from the node manager;
detect a failure of the node manager monitoring the failed software module;
determine a failover node to execute the failed software module, wherein the failover node is associated with a failover node manager;
retrieve a configuration package associated with the failed software module from a dependency manager node;
transmit the configuration package to the failover node manager, wherein the failover node manager attempts to install the failed software module on the failover node, wherein the failover node manager attempts to restore the failed software module;
determine if the failover node manager successfully installs the failed software module on the failover node; and
determine if the failover node manager successfully restores the failed software module.
Dependent claims: 15, 16, 17, 18, 19, 20
Specification