Method and apparatus for providing fault-tolerance in parallel-processing systems
Abstract
A system that provides fault tolerance in a parallel processing system. During operation, the system executes a parallel computing application in parallel across a subset of computing nodes within the parallel processing system. During this process, the system monitors telemetry signals within the parallel processing system. The system analyzes the monitored telemetry signals to determine if the probability that the parallel processing system will fail is increasing. If so, the system increases the frequency at which the parallel computing application is checkpointed, wherein a checkpoint includes the state of the parallel computing application at each computing node within the parallel processing system.
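The control loop described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation. It assumes that some external analysis has already reduced the telemetry signals to a numeric failure-risk estimate, and it treats "the probability of failure is increasing" as "the last few estimates are strictly rising". All names (`failure_risk_is_increasing`, `AdaptiveCheckpointer`), the window size, the halving factor, and the interval bounds are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


def failure_risk_is_increasing(risk_history: Sequence[float], window: int = 3) -> bool:
    """Return True if the last `window` risk estimates are strictly increasing.

    Assumption: risk estimates are derived from telemetry elsewhere; the
    source does not specify how the failure probability is computed.
    """
    if len(risk_history) < window:
        return False
    recent = risk_history[-window:]
    return all(a < b for a, b in zip(recent, recent[1:]))


@dataclass
class AdaptiveCheckpointer:
    """Shortens the checkpoint interval while failure risk keeps rising."""

    interval_s: float = 600.0      # hypothetical starting interval (seconds)
    min_interval_s: float = 30.0   # hypothetical lower bound on the interval
    speedup: float = 2.0           # hypothetical factor applied per rising trend
    risk_history: List[float] = field(default_factory=list)

    def observe(self, risk: float) -> float:
        """Record a new risk estimate and return the current interval."""
        self.risk_history.append(risk)
        if failure_risk_is_increasing(self.risk_history):
            # A shorter interval means a higher checkpoint frequency.
            self.interval_s = max(self.min_interval_s, self.interval_s / self.speedup)
        return self.interval_s

    def checkpoint(self, node_states: Dict[str, bytes]) -> Dict[str, bytes]:
        """Snapshot the application state of every computing node."""
        return dict(node_states)
```

In this sketch, each call to `observe` feeds in one new risk estimate. The interval shrinks only while the estimates keep rising, and never drops below `min_interval_s`.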
20 Claims
1. A method for providing fault-tolerance in a parallel-processing system, comprising:
executing a parallel-computing application in parallel across a subset of computing nodes within the parallel-processing system;
monitoring telemetry signals within the parallel-processing system;
analyzing the monitored telemetry signals to determine if the probability that the parallel-processing system will fail is increasing; and
if so, increasing the frequency at which the parallel-computing application is checkpointed, wherein a checkpoint includes the state of the parallel-computing application at each computing node within the parallel-processing system.
10. A computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for providing fault-tolerance in a parallel-processing system, the method comprising:
executing a parallel-computing application in parallel across a subset of computing nodes within the parallel-processing system;
monitoring telemetry signals within the parallel-processing system;
analyzing the monitored telemetry signals to determine if the probability that the parallel-processing system will fail is increasing; and
if so, increasing the frequency at which the parallel-computing application is checkpointed, wherein a checkpoint includes the state of the parallel-computing application at each computing node within the parallel-processing system.
19. An apparatus that provides fault-tolerance in a parallel-processing system, comprising:
an execution mechanism configured to execute a parallel-computing application in parallel across a subset of computing nodes within the parallel-processing system;
a health-monitoring mechanism configured to monitor telemetry signals within the parallel-processing system;
a checkpointing mechanism configured to:
analyze the monitored telemetry signals to determine if the probability that the parallel-processing system will fail is increasing; and
if so, to increase the frequency at which the parallel-computing application is checkpointed, wherein a checkpoint includes the state of the parallel-computing application at each computing node within the parallel-processing system.
Specification