FAILURE RECOVERY RESOLUTION IN TRANSPLANTING HIGH PERFORMANCE DATA INTENSIVE ALGORITHMS FROM CLUSTER TO CLOUD

US 20150149814A1
Filed: 11/26/2014
Published: 05/28/2015
Est. Priority Date: 11/27/2013
Status: Active Grant



Abstract

A method of providing failure recovery capabilities to a cloud environment for scientific HPC applications. An HPC application with MPI implementation extends the class of MPI programs to embed the HPC application with various degrees of fault tolerance. An MPI fault tolerance mechanism realizes a recover-and-continue solution. If an error occurs, only failed processes re-spawn, the remaining living processes remain in their original processors/nodes, and system recovery costs are thus minimized.
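The recover-and-continue idea in the abstract can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: the `Process` class, the `recover_and_continue` function, and the VM names are hypothetical and do not come from the patent, and no real MPI library is involved. It shows that only the failed rank is moved to a spare node while the surviving ranks stay where they are.

```python
from dataclasses import dataclass


@dataclass
class Process:
    """Hypothetical stand-in for one MPI process (not a real MPI object)."""
    rank: int
    node: str
    alive: bool = True
    restarts: int = 0


def recover_and_continue(processes, spare_nodes):
    """Re-spawn only the failed processes; survivors keep their node.

    Returns the list of ranks that were re-spawned.
    """
    spares = list(spare_nodes)
    respawned = []
    for proc in processes:
        if proc.alive:
            continue  # living processes remain on their original node
        if not spares:
            raise RuntimeError(f"no spare node available for rank {proc.rank}")
        proc.node = spares.pop(0)  # place the failed rank on a spare node
        proc.alive = True
        proc.restarts += 1
        respawned.append(proc.rank)
    return respawned


# Three ranks, one of which has failed.
group = [Process(0, "vm-a"), Process(1, "vm-b"), Process(2, "vm-c")]
group[1].alive = False
respawned_ranks = recover_and_continue(group, ["vm-d"])
```

In this toy run one process is re-spawned, whereas restarting the whole job would relaunch all three, which is the cost difference the abstract refers to.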


10 Claims

1. A method of performing fault tolerance at an infrastructure as a service (IaaS) layer on a cloud computing platform having network resources, comprising:
- a component collecting system distributed data of the cloud computing platform using a message passing interface (MPI);
  
  the component establishing long-term transmission control protocol (TCP) interconnections of the cloud computing platform using a remote procedure call (RPC);
  
  the component automatically detecting a failure of one of the network resources; and
  
  the component recovering the failure by adding a new network resource in place of the failed network resource using combined MPI and RPC functionalities.
  - 2. The method as specified in claim 1, wherein the failure is detected in a cluster of the network resources comprising a plurality of MPI nodes in an MPI communication group, comprising the steps of:
    - the component calling the MPI nodes;
      
      the component delivering information indicative of a failed MPI node to an MPI master node in order to spawn a new MPI node; and
      
      the MPI master node broadcasting the information of the failed MPI node to all MPI nodes in the MPI communication group such that each MPI node is updated with the information.
  - 3. The method as specified in claim 2, further comprising the steps of:
    - the component determining a new MPI node as a new MPI communicator according to information at the MPI master node;
      
      the component establishing a new connection with RPC to the new MPI node;
      
      the component spawning a new communicator on the new MPI node; and
      
      the component updating the new communicator with group member information and parallel processing information.
  - 4. A method as specified in claim 3, further comprising:
    - the component establishing checkpoints during parallel processing periodically; and
      
      the component saving data of each checkpoint on a cloud storage.
  - 5. The method as specified in claim 4, further comprising:
    - the component updating the spawned new MPI node with current checkpoint data from the cloud storage; and
      
      the component updating all the MPI members with the current checkpoint data from the cloud storage.
  - 6. The method as specified in claim 4, wherein:
    - the cloud storage has a definition in MPI; and
      
      the cloud storage is one of the MPI members such that all the MPI nodes recognize the cloud storage and can copy data to/from the cloud storage.
  - 7. The method as specified in claim 2, further comprising the steps of:
    - defining a threshold time T allowing the component to determine whether or not an MPI node has failed;
      
      wherein when the master MPI node determines no response from an MPI node, the master MPI node waits a time length of time T,
      
      wherein if the MPI node with no response is recovered and responds to the master MPI node correctly within the time T, no new MPI node is spawned,
      
      wherein if the MPI node with no response is not recovered within time T, the component spawns a new MPI node to replace the failed MPI node.
  - 8. The method as specified in claim 7 wherein a time T_opt represents a time to establish the new MPI node, spawn the new MPI node, update the new MPI node information, and update the new MPI node with checkpoint data, wherein:
    - if T > T_opt, the master MPI node will not wait for the time T, and instead, the component spawns the new MPI node to replace the failed MPI node.
  - 9. The method as specified in claim 8, wherein if T <= T_opt, the master MPI node waits until time T to decide if the non-responsive MPI node has failed.
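The timing rule in claims 7 to 9 can be read as a small decision function. The sketch below is one possible reading, not the patented implementation: the function name `should_spawn_replacement` and its parameters `silent_for`, `threshold_t`, and `t_opt` are hypothetical, all times are assumed to be in seconds, and the function is assumed to be called only while the node is still unresponsive.

```python
def should_spawn_replacement(silent_for, threshold_t, t_opt):
    """Decide whether to spawn a replacement for a silent MPI node.

    silent_for  -- how long the node has been unresponsive (hypothetical input)
    threshold_t -- the threshold time T of claim 7
    t_opt       -- the time T_opt of claim 8 needed to establish, spawn, and
                   update a new node, including loading checkpoint data
    """
    if threshold_t > t_opt:
        # Claim 8: waiting for T costs more than replacing the node,
        # so the replacement is spawned without waiting.
        return True
    # Claim 9: T <= T_opt, so wait until T has elapsed before declaring failure.
    return silent_for >= threshold_t


# T = 30 s exceeds T_opt = 10 s: spawn right away.
spawn_now = should_spawn_replacement(silent_for=1, threshold_t=30, t_opt=10)
# T = 5 s is within T_opt = 10 s and only 2 s have passed: keep waiting.
keep_waiting = should_spawn_replacement(silent_for=2, threshold_t=5, t_opt=10)
# Same thresholds, but the node has now been silent for the full 5 s.
timed_out = should_spawn_replacement(silent_for=5, threshold_t=5, t_opt=10)
```

If the node answers before T elapses, the caller simply stops asking, which corresponds to the "no new MPI node is spawned" branch of claim 7.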

10. A method of performing failure recovery in a parallel cloud high performance computing (HPC) system having nodes, including the steps of:
- a component pinging and establishing connections with a plurality of virtual machines (VMs) having communicators, building a communication group that includes the communicators, and determining if the VMs are up and available;
  
  a message passing interface (MPI) process sending node numbers, node names, a folder path on which an MPI process can run, and file names with application instructions;
  
  a remote procedure call (RPC) initializing independent, long-term transmission control protocol (TCP) connections;
  
  the MPI process returning an error code to the component if a communication failure occurs in one of the communicators;
  
  the component spawning a new communicator if there is a failure in one of the communicators to replace the failed communicator;
  
  the RPC re-initializing independent, long-term TCP connections; and
  
  the MPI process loading checkpoints from storage.
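
The recovery steps of claim 10 can be sketched as an ordered sequence of events. This is an illustrative simulation only: `recover_group`, the event labels, the VM names, and the checkpoint dictionary keyed by step number are all assumptions made for the example, and plain Python values stand in for the MPI process, the RPC layer, and the storage.

```python
def recover_group(group, failed_vm, spare_vm, checkpoints):
    """Simulate one recovery pass over a communication group.

    group       -- list of VM names, one communicator per VM (hypothetical)
    failed_vm   -- the VM whose communicator reported a failure
    spare_vm    -- the VM on which the new communicator is spawned
    checkpoints -- dict mapping a step number to saved state (hypothetical)
    """
    if failed_vm not in group:
        raise ValueError(f"{failed_vm} is not a member of the group")
    if not checkpoints:
        raise ValueError("no checkpoint available to load")

    # The MPI process reports the communication failure to the component.
    events = [("error_code", failed_vm)]

    # The component spawns a new communicator in place of the failed one.
    new_group = [spare_vm if vm == failed_vm else vm for vm in group]
    events.append(("spawn", spare_vm))

    # The RPC layer re-initializes the long-term TCP connections.
    events.append(("reconnect", len(new_group)))

    # The MPI process loads the most recent checkpoint from storage.
    latest_step = max(checkpoints)
    events.append(("load_checkpoint", latest_step))
    return new_group, checkpoints[latest_step], events


new_group, state, events = recover_group(
    group=["vm-0", "vm-1", "vm-2"],
    failed_vm="vm-1",
    spare_vm="vm-3",
    checkpoints={10: "state@10", 20: "state@20"},
)
```

The replacement keeps the failed communicator's position in the group, so the other members do not need to be renumbered in this sketch.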


Current Assignee
Futurewei Technologies Incorporated (Huawei Investment & Holding Co., Ltd.)
Original Assignee
Futurewei Technologies Incorporated (Huawei Investment & Holding Co., Ltd.)
Inventors
Wei, Zhulin; Ren, Da Qi

Granted Patent

US 9,626,261 B2
US Class Current

714/4.11
CPC Class Codes

G06F 11/0703   Error or fault processing n...

G06F 11/1438   Restarting or rejuvenating

G06F 11/1448   Management of the data invo...

G06F 11/1484   involving virtual machines

G06F 11/2028   eliminating a faulty proces...

G06F 11/203   using migration

G06F 2201/84   Using snapshots, i.e. a log...

G06F 2201/85   Active fault masking withou...

H04L 43/10   Active monitoring, e.g. hea...

H04L 67/1001   for accessing one among a p...

H04L 67/1095   Replication or mirroring of...
