Method and apparatus for providing fault-tolerance in parallel-processing systems

  • US 20070220298A1
  • Filed: 03/20/2006
  • Published: 09/20/2007
  • Est. Priority Date: 03/20/2006
  • Status: Active Grant
First Claim

1. A method for providing fault-tolerance in a parallel-processing system, comprising:

  • executing a parallel-computing application in parallel across a subset of computing nodes within the parallel-processing system;

  • monitoring telemetry signals within the parallel-processing system;

  • analyzing the monitored telemetry signals to determine if the probability that the parallel-processing system will fail is increasing; and

  • if so, increasing the frequency at which the parallel-computing application is checkpointed, wherein a checkpoint includes the state of the parallel-computing application at each computing node within the parallel-processing system.
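
The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the class `AdaptiveCheckpointer`, the functions `estimate_risk` and `take_checkpoint`, the threshold-ratio risk proxy, the three-sample trend test, and all default values are hypothetical choices made only for illustration. The claim itself does not specify how the failure probability is estimated or by how much the checkpoint frequency is increased.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict


def estimate_risk(telemetry: Dict[str, float], limits: Dict[str, float]) -> float:
    """Toy failure-risk proxy (an assumption): mean of signal/limit ratios in [0, 1]."""
    ratios = [min(1.0, max(0.0, telemetry[name] / limits[name])) for name in limits]
    return sum(ratios) / len(ratios)


def take_checkpoint(node_states: Dict[str, dict]) -> Dict[str, dict]:
    """Copy the application state held at every node into one checkpoint."""
    return {node: dict(state) for node, state in node_states.items()}


@dataclass
class AdaptiveCheckpointer:
    """Hypothetical controller that checkpoints more often when risk is rising."""

    interval_s: float = 600.0     # current time between checkpoints (assumed default)
    min_interval_s: float = 60.0  # assumed floor on the interval
    shrink_factor: float = 0.5    # assumed multiplier applied when risk is rising
    window: int = 3               # number of recent risk estimates compared
    _history: Deque[float] = field(default_factory=deque, init=False, repr=False)

    def observe(self, risk: float) -> float:
        """Record one risk estimate and return the (possibly shortened) interval."""
        self._history.append(risk)
        if len(self._history) > self.window:
            self._history.popleft()
        if self._risk_is_increasing():
            # A shorter interval means a higher checkpoint frequency.
            self.interval_s = max(self.min_interval_s,
                                  self.interval_s * self.shrink_factor)
        return self.interval_s

    def _risk_is_increasing(self) -> bool:
        """True when the last `window` estimates are strictly increasing."""
        h = list(self._history)
        return len(h) == self.window and all(a < b for a, b in zip(h, h[1:]))
```

In use, a monitoring loop would call `estimate_risk` on each telemetry sample, pass the result to `observe`, and call `take_checkpoint` whenever the returned interval has elapsed. With the defaults above, three strictly rising estimates halve the interval from 600 s to 300 s, and a flat estimate leaves it unchanged.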
