TEMPLATE BASED PARALLEL CHECKPOINTING IN A MASSIVELY PARALLEL COMPUTER SYSTEM

US 20080195892A1
Filed: 04/16/2008
Published: 08/14/2008
Est. Priority Date: 04/14/2005
Status: Active Grant

Abstract

A method and apparatus for a template based parallel checkpoint save for a massively parallel supercomputer system using a parallel variation of the rsync protocol and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a previously produced template checkpoint file that resides in storage. Embodiments herein greatly decrease the amount of data that must be transmitted and stored, for faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks with their corresponding checksums from all nodes in the system. The data blocks may be compressed using conventional lossless data compression algorithms to further reduce the overall checkpoint size.
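
The checkpoint layout described in the abstract (a set of unique data blocks keyed by checksum, optionally compressed, plus per-node references) can be illustrated with a minimal sketch. The block size, the use of SHA-256 as the checksum, zlib as the lossless compressor, and the names `split_blocks` and `build_checkpoint` are all assumptions made for illustration; none of them comes from the patent text.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed block size; the source does not specify one


def split_blocks(image, size=BLOCK_SIZE):
    """Cut a node's memory image into fixed-size blocks."""
    return [image[i:i + size] for i in range(0, len(image), size)]


def build_checkpoint(node_images):
    """Return (store, templates).

    store maps checksum -> compressed block, stored once for all nodes.
    templates maps node id -> ordered list of checksums (references).
    """
    store = {}
    templates = {}
    for node_id, image in node_images.items():
        refs = []
        for block in split_blocks(image):
            digest = hashlib.sha256(block).hexdigest()
            if digest not in store:
                # Only blocks not seen on any node so far are stored.
                store[digest] = zlib.compress(block)
            refs.append(digest)
        templates[node_id] = refs
    return store, templates
```

In this sketch, two nodes whose images share a block contribute that block to `store` only once, which is the source of the reduction in stored data that the abstract describes.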

56 Citations

9 Claims

1. A computer implemented method for checkpointing a massively parallel computer system comprising the steps of:
- a) a checkpoint server broadcasting a list of data block checksums from a previous checkpoint to all compute nodes arranged in a cluster; and
  
  b) each compute node searching its own memory image for checksum matches using an rsync protocol rolling checksum algorithm;
  
  wherein each node performs the steps of:
  
  1) producing a template of new data blocks with checksums that didn't exist in the previous checkpoint;
  
  2) producing a template of references to the original data blocks that did exist in the previous checkpoint;
  
  3) sending its new data block checksum template to an adjacent node in the cluster of nodes;
  
  4) comparing checksums to find common data blocks between all adjacent nodes as well as its own data blocks;
  
  5) informing adjacent nodes to replace a reference to a common data block with a reference to a data block on another node;
  
  c) the checkpoint server then collecting reference templates from the compute nodes and storing them in the checkpoint server; and
  
  d) collecting new unique data blocks and storing them to the checkpoint server.
- Dependent claims (6, 7, 8, 9):
  - 6. The computer implemented method of claim 1 wherein the step of storing the data blocks stores the data on auxiliary storage servers instead of the checkpoint server.
  - 7. The computer implemented method of claim 1 wherein the step of storing the data blocks uses the I/O nodes to store the data to unique storage servers over network lines to the I/O nodes rather than funneling the data through a network line to the checkpoint server.
  - 8. The computer implemented method of claim 1 further comprising the step of sharing a subset of the reference template from a first application to a second application.
  - 9. The computer implemented method of claim 1 further comprising the steps of:
    - a) the checkpoint server sends a reference template of data block checksums from a previous checkpoint to a plurality of compute nodes in a cluster to restore the memory of the massively parallel computer system;
      
      b) the checkpoint server broadcasts data blocks to the nodes;
      
      c) each node copies broadcast data according to its own template of data block checksums; and
      
      d) the checkpoint server broadcasts a start message.
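
The restore sequence of claim 9 can be sketched as a small in-process simulation. The `Node` class, the `restore` function, the fixed block size, and the modelling of a broadcast as a loop over all nodes are hypothetical choices for illustration; the claim does not prescribe any of them.

```python
class Node:
    """A compute node that rebuilds its memory from broadcast blocks."""

    def __init__(self, template, block_size):
        self.block_size = block_size
        self.memory = bytearray(len(template) * block_size)
        self.started = False
        # checksum -> every slot in this node's memory that needs the block
        self.slots = {}
        for slot, checksum in enumerate(template):
            self.slots.setdefault(checksum, []).append(slot)

    def on_block(self, checksum, data):
        # Step c): copy broadcast data only where the template asks for it.
        for slot in self.slots.get(checksum, ()):
            start = slot * self.block_size
            self.memory[start:start + self.block_size] = data

    def on_start(self):
        self.started = True


def restore(store, templates, block_size):
    # Step a): each node receives its reference template.
    nodes = {nid: Node(tpl, block_size) for nid, tpl in templates.items()}
    # Step b): every stored block is broadcast to all nodes.
    for checksum, data in store.items():
        for node in nodes.values():
            node.on_block(checksum, data)
    # Step d): the start message is broadcast.
    for node in nodes.values():
        node.on_start()
    return nodes
```

Because every block is broadcast once and each node filters by its own template, a block shared by many nodes is transmitted a single time in this model rather than once per node.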

2-5. (canceled)
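
Steps b), 1) and 2) of claim 1 can be illustrated with a simplified, single-node sketch of an rsync-style rolling checksum search. The 16-bit weak checksum follows the commonly described rsync scheme; SHA-256 stands in for the strong checksum, and the function names, the data layout of the two templates, and the choice to record new data as literal byte runs are assumptions made for this example rather than details from the claims.

```python
import hashlib

M = 1 << 16  # modulus for the two halves of the weak checksum


def weak_checksum(block):
    """rsync-style weak checksum of a block, returned as the pair (a, b)."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window forward by one byte in constant time."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b


def strong_checksum(block):
    return hashlib.sha256(block).digest()


def checksum_list(prev_blocks):
    """Server side: the checksum list that would be broadcast."""
    return [(weak_checksum(blk), strong_checksum(blk)) for blk in prev_blocks]


def index_checksums(checksums):
    """Node side: weak checksum -> {strong checksum: previous block id}."""
    index = {}
    for block_id, (weak, strong) in enumerate(checksums):
        index.setdefault(weak, {})[strong] = block_id
    return index


def scan_image(image, index, n):
    """Return (refs, new_data) for one node's memory image.

    refs:     (offset, previous block id) for blocks found in the old checkpoint
    new_data: (offset, bytes) for data absent from the old checkpoint
    """
    refs, new_data = [], []
    pos = lit_start = 0
    a, b = weak_checksum(image[:n]) if len(image) >= n else (0, 0)
    while pos + n <= len(image):
        match = None
        candidates = index.get((a, b))
        if candidates:  # weak hit: confirm with the strong checksum
            match = candidates.get(strong_checksum(image[pos:pos + n]))
        if match is not None:
            if lit_start < pos:
                new_data.append((lit_start, image[lit_start:pos]))
            refs.append((pos, match))
            pos += n
            lit_start = pos
            if pos + n <= len(image):
                a, b = weak_checksum(image[pos:pos + n])
        else:
            if pos + n < len(image):
                a, b = roll(a, b, image[pos], image[pos + n], n)
            pos += 1
    if lit_start < len(image):
        new_data.append((lit_start, image[lit_start:]))
    return refs, new_data
```

For example, with previous blocks `AAAA`, `BBBB`, `CCCC` and a 4-byte window, scanning the image `xxBBBByAAAA` finds the old blocks at offsets 2 and 7 even though they are no longer block-aligned, and reports `xx` and `y` as new data. The neighbor exchange of steps 3) to 5) is not modelled here.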

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Archer, Charles Jens; Inglett, Todd Alan
Granted Patent: US 7,487,393 B2
US Class Current: 714/20
CPC Class Codes:

G06F 11/1438 Restarting or rejuvenating

G06F 11/1451 by selection of backup cont...
