Proactive Failure Recovery Model for Distributed Computing

US 20160034362A1
Filed: 07/29/2014
Published: 02/04/2016
Est. Priority Date: 07/29/2014
Status: Active Grant

First Claim

1. A computer-implemented method, comprising:

building a virtual tree-like computing structure of a plurality of computing nodes;

for each computing node of the virtual tree-like computing structure, performing, by a hardware processor, a node failure prediction model to calculate a mean time between failure (MTBF) associated with the computing node;

determining whether to perform a checkpoint of the computing node based on a comparison between the calculated MTBF and a maximum and minimum threshold;

migrating a process from the computing node to a different computing node acting as a recovery node; and

resuming execution of the process on the different computing node.

1 Assignment

0 Petitions

Abstract

This disclosure generally describes methods and systems, including computer-implemented methods, computer-program products, and computer systems, for providing a proactive failure recovery model for distributed computing. One computer-implemented method includes: building a virtual tree-like computing structure of a plurality of computing nodes; for each computing node of the virtual tree-like computing structure, performing, by a hardware processor, a node failure prediction model to calculate a mean time between failure (MTBF) associated with the computing node; determining whether to perform a checkpoint of the computing node based on a comparison between the calculated MTBF and a maximum and minimum threshold; migrating a process from the computing node to a different computing node acting as a recovery node; and resuming execution of the process on the different computing node.
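
The abstract refers to a node failure prediction model that yields an MTBF per node, but this record does not disclose the model itself. As a rough illustration only, the sketch below uses the textbook definition of MTBF (observed operating time divided by the number of failures). The names `FailureLog` and `estimate_mtbf` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass
class FailureLog:
    """Hypothetical per-node observation window (not from the patent)."""
    observed_hours: float  # total time the node was observed in operation
    failures: int          # failures seen in that window, e.g. network or storage


def estimate_mtbf(log: FailureLog) -> float:
    """Textbook MTBF: operating time divided by failure count.

    Assumption: a node with no recorded failures is treated as having
    an unbounded MTBF rather than raising an error.
    """
    if log.failures == 0:
        return float("inf")
    return log.observed_hours / log.failures
```

For example, `estimate_mtbf(FailureLog(observed_hours=1000.0, failures=4))` evaluates to `250.0` hours. A real prediction model would likely weigh recent failures or failure types differently; that is outside what this record states.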

34 Citations

20 Claims

1. A computer-implemented method, comprising:
- building a virtual tree-like computing structure of a plurality of computing nodes;
  
  for each computing node of the virtual tree-like computing structure, performing, by a hardware processor, a node failure prediction model to calculate a mean time between failure (MTBF) associated with the computing node;
  
  determining whether to perform a checkpoint of the computing node based on a comparison between the calculated MTBF and a maximum and minimum threshold;
  
  migrating a process from the computing node to a different computing node acting as a recovery node; and
  
  resuming execution of the process on the different computing node.
- Dependent claims (2, 3, 4, 5, 6, 7)
- - 2. The method of claim 1, further comprising:
    - collecting at least a computing power and node location parameter value for each computing node;
      
      dividing the computing nodes into collections based on their node location parameter; and
      
      sorting the nodes within each collection based on the computing power parameter.
  - 3. The method of claim 2, further comprising:
    - identifying the upper-limit and lower-limit to determine levels of the sorted computing nodes;
      
      sorting the computing nodes within each collection into horizontal levels based on the computing power parameter and the upper-limit and lower-limit;
      
      recording the horizontal level placement and a vertical placement into a node-record-information table associated with each computing node; and
      
      populating each node-record-information table with a designated recovery node.
  - 4. The method of claim 3, wherein the upper-limit and lower-limit are determined from a cross plot of the collected computing power and node location parameters for each computing node and the vertical placement is determined based at least on the node location parameter for each computing node.
  - 5. The method of claim 1, wherein the MTBF is calculated based at least upon a network or data storage failure.
  - 6. The method of claim 1, further comprising:
    - creating a checkpoint when the MTBF of the computing node is less than the lower-limit; and
      
      updating the lower-limit associated with the computing node to equal the MTBF.
  - 7. The method of claim 6, further comprising:
    - determining that a failure of the computing node has occurred; and
      
      using the last checkpoint taken for the computing node as a process state.
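
Claims 2 and 3 above describe how the tree-like structure is assembled: nodes are grouped by location, sorted by computing power, placed into horizontal levels using an upper-limit and lower-limit, and recorded in a node-record-information table with a designated recovery node. A minimal sketch of those steps follows. It assumes three horizontal levels, a vertical placement equal to the index of the location group, and a recovery node chosen as the next node in the same sorted group. None of these specifics are stated in the claims, and all names (`NodeRecord`, `build_structure`) are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class NodeRecord:
    """Hypothetical stand-in for one node-record-information table entry."""
    name: str
    location: str
    power: float
    level: int = 0                       # horizontal level
    vertical: int = 0                    # vertical placement
    recovery_node: Optional[str] = None  # designated recovery node


def build_structure(
    nodes: List[Tuple[str, str, float]],
    lower_limit: float,
    upper_limit: float,
) -> Dict[str, NodeRecord]:
    # Divide the nodes into collections by their location parameter.
    collections: Dict[str, List[NodeRecord]] = defaultdict(list)
    for name, location, power in nodes:
        collections[location].append(NodeRecord(name, location, power))

    table: Dict[str, NodeRecord] = {}
    for vertical, location in enumerate(sorted(collections)):
        # Sort each collection by computing power, strongest first.
        members = sorted(collections[location], key=lambda r: r.power, reverse=True)
        for i, rec in enumerate(members):
            rec.vertical = vertical
            # Assumption: three levels split by the two limits.
            if rec.power >= upper_limit:
                rec.level = 0
            elif rec.power >= lower_limit:
                rec.level = 1
            else:
                rec.level = 2
            # Assumption: the next node in the sorted collection (wrapping
            # around) acts as the recovery node; a lone node has none.
            if len(members) > 1:
                rec.recovery_node = members[(i + 1) % len(members)].name
            table[rec.name] = rec
    return table
```

Claim 4 derives the limits from a cross plot of computing power against location; here they are simply passed in as arguments, since the record gives no procedure for computing them.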

8. A non-transitory, computer-readable medium storing computer-readable instructions, the instructions executable by a computer and configured to:
- build a virtual tree-like computing structure of a plurality of computing nodes;
  
  for each computing node of the virtual tree-like computing structure, perform a node failure prediction model to calculate a mean time between failure (MTBF) associated with the computing node;
  
  determine whether to perform a checkpoint of the computing node based on a comparison between the calculated MTBF and a maximum and minimum threshold;
  
  migrate a process from the computing node to a different computing node acting as a recovery node; and
  
  resume execution of the process on the different computing node.
- Dependent claims (9, 10, 11, 12, 13, 14)
- - 9. The medium of claim 8, further including instructions to:
    - collect at least a computing power and node location parameter value for each computing node;
      
      divide the computing nodes into collections based on their node location parameter; and
      
      sort the nodes within each collection based on the computing power parameter.
  - 10. The medium of claim 9, further including instructions to:
    - identify the upper-limit and lower-limit to determine levels of the sorted computing nodes;
      
      sort the computing nodes within each collection into horizontal levels based on the computing power parameter and the upper-limit and lower-limit;
      
      record the horizontal level placement and a vertical placement into a node-record-information table associated with each computing node; and
      
      populate each node-record-information table with a designated recovery node.
  - 11. The medium of claim 10, wherein the upper-limit and lower-limit are determined from a cross plot of the collected computing power and node location parameters for each computing node and the vertical placement is determined based at least on the node location parameter for each computing node.
  - 12. The medium of claim 8, wherein the MTBF is calculated based at least upon a network or data storage failure.
  - 13. The medium of claim 8, further including instructions to:
    - create a checkpoint when the MTBF of the computing node is less than the lower-limit; and
      
      update the lower-limit associated with the computing node to equal the MTBF.
  - 14. The medium of claim 13, further including instructions to:
    - determine that a failure of the computing node has occurred; and
      
      use the last checkpoint taken for the computing node as a process state.
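
The checkpoint rule in claims 6 and 13 can be read as a ratcheting threshold: a checkpoint is taken only when the calculated MTBF drops below the node's current lower-limit, and the lower-limit is then lowered to that MTBF. The sketch below illustrates that reading. `NodeState` and `maybe_checkpoint` are hypothetical names, and the checkpoint itself is reduced to a counter. How the maximum threshold from the independent claims is used is not spelled out in this record, so it is left out here.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Hypothetical per-node state for the checkpoint decision."""
    lower_limit: float   # current minimum MTBF threshold for this node
    checkpoints: int = 0  # stand-in for actually saving process state


def maybe_checkpoint(state: NodeState, mtbf: float) -> bool:
    """Checkpoint when MTBF falls below the lower-limit, then ratchet it down.

    Returns True if a checkpoint was taken.
    """
    if mtbf < state.lower_limit:
        state.checkpoints += 1     # a real system would persist state here
        state.lower_limit = mtbf   # the lower-limit is updated to equal the MTBF
        return True
    return False
```

Under this reading, a node whose predicted MTBF keeps falling is checkpointed at each new low, while an MTBF that recovers or holds steady triggers no further checkpoints.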

15. A computer system, comprising:
- at least one hardware processor interoperably coupled with a memory storage and configured to:
  
  build a virtual tree-like computing structure of a plurality of computing nodes;
  
  for each computing node of the virtual tree-like computing structure, perform a node failure prediction model to calculate a mean time between failure (MTBF) associated with the computing node;
  
  determine whether to perform a checkpoint of the computing node based on a comparison between the calculated MTBF and a maximum and minimum threshold;
  
  migrate a process from the computing node to a different computing node acting as a recovery node; and
  
  resume execution of the process on the different computing node.
- Dependent claims (16, 17, 18, 19, 20)
- - 16. The system of claim 15, further configured to:
    - collect at least a computing power and node location parameter value for each computing node;
      
      divide the computing nodes into collections based on their node location parameter; and
      
      sort the nodes within each collection based on the computing power parameter.
  - 17. The system of claim 16, further configured to:
    - identify the upper-limit and lower-limit to determine levels of the sorted computing nodes;
      
      sort the computing nodes within each collection into horizontal levels based on the computing power parameter and the upper-limit and lower-limit;
      
      record the horizontal level placement and a vertical placement into a node-record-information table associated with each computing node; and
      
      populate each node-record-information table with a designated recovery node.
  - 18. The system of claim 17, wherein the upper-limit and lower-limit are determined from a cross plot of the collected computing power and node location parameters for each computing node and the vertical placement is determined based at least on the node location parameter for each computing node.
  - 19. The system of claim 15, wherein the MTBF is calculated based at least upon a network or data storage failure.
  - 20. The system of claim 15, further configured to:
    - create a checkpoint when the MTBF of the computing node is less than the lower-limit;
      
      update the lower-limit associated with the computing node to equal the MTBF;
      
      determine that a failure of the computing node has occurred; and
      
      use the last checkpoint taken for the computing node as a process state.
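
The recovery path in the claims above (detect a failure, take the last checkpoint as the process state, migrate to the recovery node, resume) might look like the following sketch. It models nodes as in-memory objects and process state as a dictionary, which is an assumption made purely for illustration. `Node` and `recover` are hypothetical names, and real migration would involve transferring state over a network.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Node:
    """Hypothetical in-memory model of a computing node."""
    name: str
    recovery_node: Optional[str] = None
    last_checkpoint: Optional[Dict[str, Any]] = None
    running: Dict[str, Any] = field(default_factory=dict)  # empty if idle


def recover(cluster: Dict[str, Node], failed: str) -> str:
    """Resume the failed node's process on its designated recovery node.

    Returns the name of the node now running the process.
    """
    node = cluster[failed]
    if node.last_checkpoint is None:
        raise RuntimeError(f"no checkpoint available for {failed}")
    if node.recovery_node is None:
        raise RuntimeError(f"no recovery node designated for {failed}")

    target = cluster[node.recovery_node]
    # The last checkpoint becomes the process state on the recovery node.
    target.running = dict(node.last_checkpoint)
    node.running = {}
    return target.name
```

Any work done after the last checkpoint is lost in this sketch, which is why the timing of checkpoints relative to the predicted MTBF matters.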

Current Assignee
Saudi Arabian Oil Company (Government of Saudi Arabia)
Original Assignee
Saudi Arabian Oil Company (Government of Saudi Arabia)
Inventors
Al-Wahabi, Khalid S.

Granted Patent

US 9,348,710 B2
US Class Current

1/1
CPC Class Codes

G06F 11/0721   within a central processing...

G06F 11/0757   by exceeding a time limit, ...

G06F 11/1407   Checkpointing the instructi...

G06F 11/1438   Restarting or rejuvenating

G06F 11/1461   Backup scheduling policy

G06F 11/1471   involving logging of persis...

G06F 11/203   using migration

G06F 11/34   Recording or statistical ev...
