Fault tolerant architecture for distributed computing systems

US 9,201,744 B2
Filed: 12/02/2014
Issued: 12/01/2015
Est. Priority Date: 12/02/2013
Status: Active Grant

2 Assignments

0 Petitions

Abstract

Disclosed here is a fault tolerant architecture suitable for use with any distributed computing system. A fault tolerant architecture may include any suitable number of supervisors, dependency managers, node managers, and other modules distributed across any suitable number of nodes. In one or more embodiments, supervisors may monitor the system using any suitable number of heartbeats from any suitable number of node managers and other modules. In one or more embodiments, supervisors may automatically recover failed modules in a distributed system by moving the modules and their dependencies to other nodes in the system. In one or more embodiments, supervisors may request a configuration package from one or more dependency managers installing one or more modules on a node. In one or more embodiments, one or more modules may have any suitable number of redundant copies in the system, where redundant copies of modules in the system may be stored in separate nodes.
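
The heartbeat monitoring described in the abstract, together with the per-module binary status recited in claim 1, might be sketched as follows. This is only an illustration under stated assumptions: the one-bit-per-module layout, the fixed module order in `MODULES`, the two-byte payload, and the timeout-based detection of silent nodes are all hypothetical and are not specified by the patent.

```python
import time

# Hypothetical: node manager and supervisor agree on this module order in advance.
MODULES = ["search", "index", "analytics"]


def encode_heartbeat(status):
    """Pack per-module up/down flags into a 2-byte bitmask (bit i = MODULES[i])."""
    mask = 0
    for i, name in enumerate(MODULES):
        if status.get(name, False):
            mask |= 1 << i
    return mask.to_bytes(2, "big")


def failed_modules(payload):
    """Decode a heartbeat payload and list the modules whose bit is cleared."""
    mask = int.from_bytes(payload, "big")
    return [name for i, name in enumerate(MODULES) if not mask & (1 << i)]


class HeartbeatTracker:
    """Supervisor-side record of when each node manager was last heard from."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # assumed detection rule: silence longer than this
        self.last_seen = {}

    def record(self, node_id, now=None):
        self.last_seen[node_id] = time.monotonic() if now is None else now

    def silent_nodes(self, now=None):
        now = time.monotonic() if now is None else now
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)
```

A supervisor built this way would call `failed_modules` on every heartbeat it receives and `silent_nodes` periodically; the first covers a failed module on a live node, the second covers a node manager that has stopped reporting altogether.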

98 Citations

14 Claims

1. A computer-implemented method comprising:
- monitoring, by a computer comprising a processor executing a supervisor module, a heartbeat signal generated by a node manager monitoring one or more software modules stored on a node, wherein the heartbeat signal contains binary data indicating a status of each respective software module monitored by the node manager;
  
  detecting, by the computer, a failed software module in the one or more software modules of the node based on the heartbeat signal received from the node manager of the node;
  
  automatically transmitting, by the computer, to the node manager of the node a command instructing the node to restore the failed software module, in response to detecting the failed software module;
  
  determining, by the computer, whether the node manager successfully restored the module based on the heartbeat signal received from the node manager;
  
  determining, by the computer, a failover node to execute the failed software module when the node manager does not restore the failed software module within a threshold number of attempts;
  
  retrieving, by the computer, a configuration package associated with the failed software module from a dependency manager node;
  
  transmitting, by the computer, the configuration package to a failover node manager associated with the failover node, wherein the failover node manager attempts to install the module on the failover node, and wherein the failover node manager attempts to restore the failed software module;
  
  determining, by the computer, if the failover node manager successfully installed the failed software module on the failover node; and
  
  determining, by the computer, if the failover node manager successfully restored the failed software module.
  - 2. The method according to claim 1, further comprising:
    - determining, by the computer, a next failover node to execute the failed software module when the failover node manager fails to install the failed software module on the failover node or when the failover node manager fails to restore the failed software module within a threshold number of attempts;
      
      transmitting, by the computer, the configuration package to a next failover node manager associated with the next failover node;
      
      determining, by the computer, if the next failover node manager successfully installs the failed software module on the next failover node; and
      
      determining, by the computer, if the next failover node manager successfully restores the failed software module.
  - 3. The method according to claim 2, further comprising generating, by the computer, a module failure alert after one or more next failover node managers exceed a global threshold number of attempts to restore the failed software module, wherein the computer sequentially determines a next failover node until the global threshold of attempts to restore the failed software module is met.
  - 4. The method according to claim 1, further comprising:
    - detecting, by the computer, a failure of the node manager monitoring the failed software module;
      
      determining, by the computer, a failover node to execute the module, wherein the failover node is associated with a failover node manager;
      
      retrieving, by the computer, a configuration package associated with the failed software module from a dependency manager node;
      
      transmitting, by the computer, the configuration package to the failover node manager, wherein the failover node manager attempts to install the failed software module on the failover node, and wherein the failover node manager attempts to restore the failed software module;
      
      determining, by the computer, if the failover node manager successfully installs the failed software module on the failover node; and
      
      determining, by the computer, if the failover node manager successfully restores the failed software module.
  - 5. The method according to claim 4, further comprising:
    - determining, by the computer, that the node is a failed node when the node is not functioning according to a status quo;
      
      determining, by the computer, one or more modules executed by the failed node to be migrated off of the failed node and restored at one or more new nodes;
      
      retrieving, by the computer, a configuration package for each of the one or more modules executed by the failed node from the dependency manager node; and
      
      transmitting, by the computer, each configuration package to the one or more new nodes.
  - 6. The method according to claim 5, further comprising:
    - determining, by the computer, a next new node having a set of available resources capable of installing and executing a module in the one or more modules migrated off of the failed node;
      
      instructing, by the computer, a new node manager of a new node storing the module to unload the module; and
      
      transmitting, by the computer, the configuration package to the next new node.
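
The control flow of claims 1 through 3 (restore in place, fail over after a threshold number of attempts, move on to the next failover node, and raise an alert once a global threshold is reached) could be sketched roughly as below. `FakeNodeManager`, `recover`, `fetch_package`, and both limit parameters are hypothetical names introduced for illustration. The sketch also assumes that the global threshold counts every restore attempt on every node, which is one possible reading of claim 3 rather than something the claims state.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeNodeManager:
    """Hypothetical in-process stand-in for a remote node manager."""
    name: str
    can_install: bool = True
    restores_after: Optional[int] = None  # succeed on the Nth attempt; None = never
    attempts: int = 0
    installed: list = field(default_factory=list)

    def install(self, package):
        if self.can_install:
            self.installed.append(package)
        return self.can_install

    def restore(self, module):
        self.attempts += 1
        return self.restores_after is not None and self.attempts >= self.restores_after


def recover(module, home, failovers, fetch_package, per_node_limit=3, global_limit=9):
    """Return the name of the node now running `module`, or None to signal an alert."""
    total = 0
    # Claim 1: command the module's own node manager to restore it first.
    for _ in range(per_node_limit):
        total += 1
        if home.restore(module):
            return home.name
    # Claim 1: threshold exceeded, so retrieve the configuration package once.
    package = fetch_package(module)
    # Claims 2-3: walk failover nodes in order until one works or the budget runs out.
    for node in failovers:
        if not node.install(package):
            continue  # install failed: determine the next failover node
        for _ in range(per_node_limit):
            if total >= global_limit:
                return None  # caller would generate the module failure alert
            total += 1
            if node.restore(module):
                return node.name
    return None
```

For example, with a home node that never recovers, a first failover node that cannot install the package, and a second that restores on its second attempt, `recover` returns the second failover node's name after five restore attempts in total.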

7. A fault-tolerant distributed computing system comprising:
- one or more nodes comprising a processor transmitting a heartbeat signal to a supervisor node and monitoring execution of one or more software modules installed on the one or more nodes;
  
  one or more supervisor nodes comprising a processor monitoring one or more heartbeat signals received from the one or more nodes, and determining a status of each respective node based on each respective heartbeat signal, wherein the processor of the one or more nodes is configured to attempt to restore a software module executed by the one or more nodes to a status quo configuration responsive to receiving a command to restore the software module from the supervisor node;
  
  a dependency manager node comprising non-transitory machine-readable storage media storing one or more machine-readable configuration package files; and
  
  the processor of the one or more supervisor nodes determines a number of attempts to restore the software module executed by the one or more nodes, and wherein the processor of the one or more supervisor nodes automatically retrieves from the dependency manager node a configuration package file associated with the software module responsive to determining the number of attempts exceeds a threshold number of attempts to restore the software module.
  - 8. The system according to claim 7, further comprising a failover node comprising a processor transmitting a heartbeat signal to the supervisor node, wherein the failover node is configured to execute the software module.
  - 9. The system according to claim 8, wherein the processor of the one or more supervisor nodes transmits the one or more configuration package files to the failover node in response to receiving the one or more configuration package files from the dependency manager node, and instructs the processor of the failover node to attempt to restore the software module.
  - 10. The system according to claim 7, wherein a processor of the dependency manager node transmits a configuration package file of the one or more machine-readable configuration package files to the supervisor node in response to receiving from the supervisor node a request identifying the configuration package file.
  - 11. The system according to claim 10, wherein the configuration package file is associated with a software module detected as a failure according to the node status of the heartbeat signal of the respective node executing the failed software module.
  - 12. The system according to claim 11, wherein the one or more supervisor nodes transmits a resource-shifting command to a failover node responsive to determining that a node status of the failover node indicates that the failover node has insufficient resources to restore the failed software module.
  - 13. The system according to claim 12, wherein a processor of the failover node automatically uninstalls an installed software module from the failover node in response to the resource-shifting command, and wherein the failover node attempts to install and restore the failed software module received from the supervisor node.
  - 14. The system according to claim 10, further comprising a redundant node comprising a non-transitory machine-readable storage medium storing a redundant copy of a software module of the one or more software modules, and a processor configured to automatically attempt to execute the redundant copy of the software module responsive to receiving a command to restore the software module from the supervisor node instructing the redundant node to attempt to execute the redundant copy of the software module.
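
The resource-shifting behaviour of claims 12 and 13 (the supervisor finds that a failover node lacks resources, the node uninstalls an installed module, then installs the failed one) might look something like the sketch below. Everything concrete here is an assumption made for illustration: the `FailoverNode` class, the abstract integer "resource units", and the smallest-first eviction order are hypothetical, since the claims do not say how resources are measured or which module is uninstalled.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverNode:
    capacity: int  # hypothetical abstract resource units
    installed: dict = field(default_factory=dict)  # module name -> units in use

    def free(self):
        return self.capacity - sum(self.installed.values())

    def handle_resource_shift(self, needed):
        """Uninstall modules until `needed` units are free; return the names removed."""
        removed = []
        # Assumed policy: evict the smallest modules first.
        for name in sorted(self.installed, key=self.installed.get):
            if self.free() >= needed:
                break
            del self.installed[name]
            removed.append(name)
        return removed


def place_on_failover(node, module, cost):
    """Supervisor side: send a resource-shifting command only when the node is short."""
    removed = []
    if node.free() < cost:
        removed = node.handle_resource_shift(cost)
    if node.free() < cost:
        return False, removed  # still insufficient; a next node would be needed
    node.installed[module] = cost
    return True, removed
```

In a fuller design the evicted modules would themselves need a new home, which corresponds to the migration steps in claims 5 and 6; this sketch only reports which modules were removed.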

Current Assignee: Finch Computing LLC (Qbase, LLC)
Original Assignee: Qbase, LLC
Inventors: Lightner, Scott; Weckesser, Franz
Primary Examiner(s): KUDIRKA, JOSEPH R
Application Number: US14/557,951
Publication Number: US 20150154079A1
Time in Patent Office: 364 Days

Field of Search
US Class Current: 1/1
CPC Class Codes:
G06F 11/0709   in a distributed system con...
G06F 11/0757   by exceeding a time limit, ...
G06F 11/1438   Restarting or rejuvenating
G06F 11/1662   the resynchronized componen...
G06F 11/2023   Failover techniques
G06F 11/2025   using centralised failover ...
G06F 11/2028   eliminating a faulty proces...
G06F 11/203   using migration
G06F 11/2041   with more than one idle spa...
G06F 2201/805   Real-time
