Fault tolerant architecture for distributed computing systems
First Claim
1. A computer-implemented method comprising:
monitoring, by a computer comprising a processor executing a supervisor module, a heartbeat signal generated by a node manager monitoring one or more software modules stored on a node, wherein the heartbeat signal contains binary data indicating a status of each respective software module monitored by the node manager;
detecting, by the computer, a failed software module in the one or more software modules of the node based on the heartbeat signal received from the node manager of the node;
automatically transmitting, by the computer, to the node manager of the node a command instructing the node to restore the failed software module, in response to detecting the failed software module;
determining, by the computer, whether the node manager successfully restored the module based on the heartbeat signal received from the node manager;
determining, by the computer, a failover node to execute the failed software module when the node manager does not restore the failed software module within a threshold number of attempts;
retrieving, by the computer, a configuration package associated with the failed software module from a dependency manager node;
transmitting, by the computer, the configuration package to a failover node manager associated with the failover node, wherein the failover node manager attempts to install the module on the failover node, and wherein the failover node manager attempts to restore the failed software module;
determining, by the computer, if the failover node manager successfully installed the failed software module on the failover node; and
determining, by the computer, if the failover node manager successfully restored the failed software module.
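The sequence recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the class `FakeNodeManager`, the functions `detect_failed` and `recover`, the default threshold of three attempts, and the rule of picking the first other candidate as the failover node are all hypothetical, and the dependency manager is reduced to a plain dictionary of configuration packages.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class FakeNodeManager:
    """Hypothetical in-memory stand-in for a node manager (not from the patent)."""
    name: str
    modules: Dict[str, bool] = field(default_factory=dict)  # module -> running?
    restartable: Set[str] = field(default_factory=set)      # modules a restore can revive

    def heartbeat(self) -> Dict[str, bool]:
        """Report the status of every module installed on this node."""
        return dict(self.modules)

    def restore(self, module: str) -> None:
        """Try to restart a module; succeeds only if it is restartable here."""
        if module in self.restartable:
            self.modules[module] = True

    def install(self, module: str, package: dict) -> None:
        """Install a module from a configuration package, initially stopped."""
        self.modules[module] = False
        self.restartable.add(module)


def detect_failed(heartbeat: Dict[str, bool]) -> List[str]:
    """Return the modules that the heartbeat reports as not running."""
    return [m for m, running in heartbeat.items() if not running]


def recover(module: str, node: FakeNodeManager, candidates: List[FakeNodeManager],
            packages: Dict[str, dict], max_attempts: int = 3) -> str:
    """Restore locally up to max_attempts times, then fail over to another node."""
    for _ in range(max_attempts):
        node.restore(module)                      # command the node manager
        if node.heartbeat().get(module, False):   # confirm via the next heartbeat
            return f"restored on {node.name}"

    # Assumption: the first candidate other than the failing node is chosen.
    # next() raises StopIteration if no such candidate exists.
    failover = next(c for c in candidates if c is not node)
    package = packages[module]                    # dependency manager lookup
    failover.install(module, package)
    if module not in failover.heartbeat():        # did the install succeed?
        return f"install failed on {failover.name}"
    failover.restore(module)
    if failover.heartbeat().get(module, False):   # did the restore succeed?
        return f"failed over to {failover.name}"
    return f"restore failed on {failover.name}"
```

In this sketch a module that cannot be restarted on its original node ends up installed and running on the failover node, while a restartable module never triggers the dependency manager lookup.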
Abstract
Disclosed here is a fault tolerant architecture suitable for use with any distributed computing system. A fault tolerant architecture may include any suitable number of supervisors, dependency managers, node managers, and other modules distributed across any suitable number of nodes. In one or more embodiments, supervisors may monitor the system using any suitable number of heartbeats from any suitable number of node managers and other modules. In one or more embodiments, supervisors may automatically recover failed modules in a distributed system by moving the modules and their dependencies to other nodes in the system. In one or more embodiments, supervisors may request a configuration package from one or more dependency managers when installing one or more modules on a node. In one or more embodiments, one or more modules may have any suitable number of redundant copies in the system, where redundant copies of modules in the system may be stored in separate nodes.
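The last sentence of the abstract, that redundant copies of a module may be stored in separate nodes, can be illustrated with a short placement sketch. The function `place_replicas` and its least-loaded-first rule are assumptions made for illustration; the abstract does not say how nodes are selected.

```python
from typing import Dict, List


def place_replicas(copies: int, nodes: List[str], load: Dict[str, int]) -> List[str]:
    """Pick a distinct node for each redundant copy of a module.

    Assumption: nodes with the fewest modules already installed are preferred.
    """
    unique_nodes = list(dict.fromkeys(nodes))  # drop duplicate names, keep order
    if copies > len(unique_nodes):
        raise ValueError("not enough nodes to keep every copy on a separate node")
    # sorted() is stable, so equally loaded nodes keep their input order.
    ranked = sorted(unique_nodes, key=lambda n: load.get(n, 0))
    return ranked[:copies]
```

Because each node name appears at most once in the result, no two copies share a node, so the loss of a single node leaves the remaining copies available.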
14 Claims
7. A fault-tolerant distributed computing system comprising:
one or more nodes comprising a processor transmitting a heartbeat signal to a supervisor node and monitoring execution of one or more software modules installed on the one or more nodes;
one or more supervisor nodes comprising a processor monitoring one or more heartbeat signals received from the one or more nodes, and determining a status of each respective node based on each respective heartbeat signal, wherein the processor of the one or more nodes is configured to attempt to restore a software module executed by the one or more nodes to a status quo configuration responsive to receiving a command to restore the software module from the supervisor node;
a dependency manager node comprising non-transitory machine-readable storage media storing one or more machine-readable configuration package files; and
the processor of the one or more supervisor nodes determines a number of attempts to restore the software module executed by the one or more nodes, and wherein the processor of the one or more supervisor nodes automatically retrieves from the dependency manager node a configuration package file associated with the software module responsive to determining the number of attempts exceeds a threshold number of attempts to restore the software module.
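The claims describe a heartbeat signal carrying binary data that indicates the status of each monitored software module, from which a supervisor determines status. One possible encoding is sketched below. The one-bit-per-module layout, the big-endian byte order, and the names `encode_heartbeat` and `decode_heartbeat` are assumptions; the claims do not specify a wire format.

```python
from typing import Dict, List


def encode_heartbeat(module_order: List[str], running: Dict[str, bool]) -> bytes:
    """Node manager side: pack one status bit per module (1 = running)."""
    bits = 0
    for i, name in enumerate(module_order):
        if running.get(name, False):
            bits |= 1 << i
    length = max(1, (len(module_order) + 7) // 8)  # whole bytes, at least one
    return bits.to_bytes(length, "big")


def decode_heartbeat(module_order: List[str], payload: bytes) -> Dict[str, bool]:
    """Supervisor side: recover each module's status from the payload."""
    bits = int.from_bytes(payload, "big")
    return {name: bool((bits >> i) & 1) for i, name in enumerate(module_order)}
```

Both sides must agree on `module_order` for the bit positions to be meaningful; in this sketch that ordering is simply passed to both functions.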
Specification