FAILURE RECOVERY RESOLUTION IN TRANSPLANTING HIGH PERFORMANCE DATA INTENSIVE ALGORITHMS FROM CLUSTER TO CLOUD
Abstract
A method of providing failure recovery capabilities to a cloud environment for scientific HPC applications. An HPC application with an MPI implementation extends the class of MPI programs to embed the HPC application with various degrees of fault tolerance. An MPI fault tolerance mechanism realizes a recover-and-continue solution. If an error occurs, only the failed processes are re-spawned; the remaining living processes stay on their original processors/nodes, so system recovery costs are minimized.
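The recover-and-continue idea can be illustrated with a toy placement model. The sketch below is not the patented implementation and does not call MPI; the function `recover_and_continue`, the rank-to-node dictionary, and the node names are all hypothetical, chosen only to show that surviving processes keep their placement while the failed ones are re-spawned on a replacement node.

```python
def recover_and_continue(placement, failed_node, replacement_node):
    """Return a new rank -> node placement and the list of re-spawned ranks.

    placement is a dict mapping process rank to node name (a hypothetical
    model, not an MPI data structure). Only ranks that lived on failed_node
    are moved; every other rank keeps its original node.
    """
    respawned = sorted(r for r, n in placement.items() if n == failed_node)
    new_placement = {
        rank: (replacement_node if node == failed_node else node)
        for rank, node in placement.items()
    }
    return new_placement, respawned


# Four processes on three hypothetical VMs; "vm-b" fails and "vm-d" replaces it.
placement = {0: "vm-a", 1: "vm-a", 2: "vm-b", 3: "vm-c"}
new_placement, respawned = recover_and_continue(placement, "vm-b", "vm-d")
print(respawned)       # ranks that had to be restarted
print(new_placement)   # survivors are unchanged
```

In this model the recovery cost scales with the number of processes on the failed node rather than with the size of the whole job, which is the saving the abstract describes.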
10 Claims
1. A method of performing fault tolerance at an infrastructure as a service (IaaS) layer on a cloud computing platform having network resources, comprising:
a component collecting system distributed data of the cloud computing platform using a message passing interface (MPI);
the component establishing long-term transmission control protocol (TCP) interconnections of the cloud computing platform using a remote procedure call (RPC);
the component automatically detecting a failure of one of the network resources; and
the component recovering the failure by adding a new network resource in place of the failed network resource using combined MPI and RPC functionalities.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A method of performing failure recovery in a parallel cloud high performance computing (HPC) system having nodes, including the steps of:
a component pinging and establishing connections with a plurality of virtual machines (VMs) having communicators, building a communication group that includes the communicators, and determining if the VMs are up and available;
a message passing interface (MPI) process sending node numbers, node names, a folder path on which an MPI process can run, and file names with application instructions;
a remote procedure call (RPC) initializing independent, long-term transmission control protocol (TCP) connections;
the MPI process returning an error code to the component if a communication failure occurs in one of the communicators;
the component spawning a new communicator if there is a failure in one of the communicators to replace the failed communicator;
the RPC re-initializing independent, long-term TCP connections; and
the MPI process loading checkpoints from storage.
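The sequence of steps in claim 10 can be walked through with a small simulation. This is a minimal sketch under stated assumptions, not the claimed implementation: it uses no MPI or RPC library, and `SimulatedVM`, `recover`, the `COMM_FAILURE` code, the VM names, and the JSON checkpoint format are all hypothetical stand-ins. The re-initialization of TCP connections is only marked by a comment.

```python
import json
import os
import tempfile

OK = 0
COMM_FAILURE = 1  # hypothetical error code; real MPI libraries define their own


class SimulatedVM:
    """Stand-in for a virtual machine hosting one communicator."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up

    def ping(self):
        return self.up

    def communicate(self):
        # Return an error code rather than raising, mirroring the step in
        # which the MPI process returns an error code to the component.
        return OK if self.up else COMM_FAILURE


def save_checkpoint(path, state):
    with open(path, "w") as f:
        json.dump(state, f)


def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)


def recover(group, spare_names, checkpoint_path):
    """Replace each failed member of group, then reload the checkpoint.

    Returns (new_group, replaced_names, restored_state).
    """
    spares = list(spare_names)
    new_group, replaced = [], []
    for vm in group:
        if vm.communicate() == COMM_FAILURE:
            replaced.append(vm.name)
            vm = SimulatedVM(spares.pop(0))  # spawn a replacement communicator
        new_group.append(vm)
    # The claim re-initializes long-term TCP connections here; omitted in this toy.
    state = load_checkpoint(checkpoint_path)
    return new_group, replaced, state


# Build the communication group from the VMs that answer a ping.
vms = [SimulatedVM("vm-0"), SimulatedVM("vm-1"), SimulatedVM("vm-2")]
group = [vm for vm in vms if vm.ping()]

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
save_checkpoint(ckpt, {"iteration": 42})

vms[1].up = False  # simulate a node failure after the checkpoint was written
group, replaced, state = recover(group, ["vm-3"], ckpt)
print([vm.name for vm in group], replaced, state)
```

Only the failed member is swapped out; the other group members are the same objects as before, and the computation resumes from the last stored checkpoint rather than from the beginning.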
Specification