TEMPLATE BASED PARALLEL CHECKPOINTING IN A MASSIVELY PARALLEL COMPUTER SYSTEM
First Claim
1. A computer implemented method for checkpointing a massively parallel computer system comprising the steps of:
a) a checkpoint server broadcasting a list of data block checksums from a previous checkpoint to all compute nodes arranged in a cluster; and
b) each compute node searching its own memory image for checksum matches using an rsync protocol rolling checksum algorithm;
wherein each node performs the steps of:
1) producing a template of new data blocks with checksums that didn't exist in the previous checkpoint;
2) producing a template of references to the original data blocks that did exist in the previous checkpoint;
3) sending its new data block checksum template to an adjacent node in the cluster of nodes;
4) comparing checksums to find common data blocks between all adjacent nodes as well as its own data blocks;
5) informing adjacent nodes to replace a reference to a common data block with a reference to a data block on another node;
c) the checkpoint server then collecting reference templates from the compute nodes and storing them in the checkpoint server; and
d) collecting new unique data blocks and storing them to the checkpoint server.
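Steps b), 1) and 2) of the claim can be illustrated with a minimal single-node sketch. It is not the patented implementation: the block size, the use of SHA-1 as the strong checksum, and every function name below are assumptions made for illustration. The weak checksum follows the well-known rsync rolling form, so sliding the window by one byte costs constant time.

```python
import hashlib

BLOCK = 4        # hypothetical block size; a real system would use far larger blocks
MOD = 1 << 16

def weak_sum(data):
    """rsync-style weak checksum of a window, returned as the pair (a, b)."""
    a = sum(data) % MOD
    b = sum((len(data) - i) * x for i, x in enumerate(data)) % MOD
    return a, b

def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - n * out_byte + a) % MOD
    return a, b

def strong_sum(data):
    # Assumption: SHA-1 stands in for whatever strong checksum is actually used.
    return hashlib.sha1(data).hexdigest()

def index_previous(blocks):
    """Checksum list the server would broadcast: weak -> {strong: block_id}."""
    index = {}
    for block_id, blk in enumerate(blocks):
        index.setdefault(weak_sum(blk), {})[strong_sum(blk)] = block_id
    return index

def build_templates(image, index):
    """Scan one node's memory image.

    Returns (new_blocks, layout): new_blocks maps strong checksum -> bytes for
    data absent from the previous checkpoint; layout is an ordered list of
    ("ref", block_id) and ("new", strong checksum) entries.
    """
    new_blocks, layout, literal = {}, [], bytearray()

    def flush():
        # Emit the pending unmatched bytes as one new data block.
        if literal:
            digest = strong_sum(bytes(literal))
            new_blocks[digest] = bytes(literal)
            layout.append(("new", digest))
            literal.clear()

    pos, sums = 0, None
    while pos < len(image):
        window = image[pos:pos + BLOCK]
        block_id = None
        if len(window) == BLOCK:
            if sums is None:
                sums = weak_sum(window)
            candidates = index.get(sums)
            if candidates:  # pay for the strong checksum only on a weak match
                block_id = candidates.get(strong_sum(window))
        if block_id is not None:
            flush()
            layout.append(("ref", block_id))
            pos += BLOCK
            sums = None
            continue
        literal.append(image[pos])
        if len(literal) == BLOCK:
            flush()
        if pos + BLOCK < len(image):
            sums = roll(*sums, image[pos], image[pos + BLOCK], BLOCK)
        pos += 1
    flush()
    return new_blocks, layout

def restore(layout, new_blocks, previous):
    """Rebuild the image from the two templates and the previous blocks."""
    return b"".join(previous[k] if kind == "ref" else new_blocks[k]
                    for kind, k in layout)
```

In this sketch a match found at any byte offset becomes a reference to a previous block, and unmatched bytes are grouped into new blocks of at most BLOCK bytes. The neighbor exchange in steps 3) to 5) is not modeled here.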
Abstract
A method and apparatus for a template-based parallel checkpoint save for a massively parallel supercomputer system using a parallel variation of the rsync protocol and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a previously produced template checkpoint file that resides in storage. Embodiments herein greatly decrease the amount of data that must be transmitted and stored, for faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high-speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks with their corresponding checksums from all nodes in the system. The data blocks may be compressed using conventional lossless data compression algorithms to further reduce the overall checkpoint size.
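The storage layout described in the abstract, a set of data blocks keyed by checksum and optionally compressed, can be sketched as follows. This is only an illustration under stated assumptions: the class BlockStore, the functions save_checkpoint and restore_node, the fixed block size, SHA-1 as the checksum, and zlib as the lossless compressor are all hypothetical choices, not details taken from the patent.

```python
import hashlib
import zlib

class BlockStore:
    """Hypothetical server-side store holding one compressed copy per unique block."""

    def __init__(self):
        self.blocks = {}  # checksum -> zlib-compressed block bytes

    def put(self, data):
        # Assumption: SHA-1 identifies a block; identical blocks are stored once.
        digest = hashlib.sha1(data).hexdigest()
        if digest not in self.blocks:
            self.blocks[digest] = zlib.compress(data)
        return digest

    def get(self, digest):
        return zlib.decompress(self.blocks[digest])

def save_checkpoint(store, node_images, block_size):
    """Return {node: [checksum, ...]}, a per-node template of block references."""
    templates = {}
    for node, image in node_images.items():
        templates[node] = [store.put(image[i:i + block_size])
                           for i in range(0, len(image), block_size)]
    return templates

def restore_node(store, template):
    """Rebuild one node's image from its reference template."""
    return b"".join(store.get(digest) for digest in template)

# Usage: two nodes share the block b"BBBB", so only three blocks are stored.
images = {"n0": b"AAAABBBB", "n1": b"BBBBCCCC"}
store = BlockStore()
templates = save_checkpoint(store, images, block_size=4)
```

Because each node's template holds only checksums, a block that appears on several nodes costs one stored copy plus one reference per occurrence.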
9 Claims
2-5. (canceled)
Specification