Method and apparatus for template based parallel checkpointing
Abstract
A method and apparatus for a template based parallel checkpoint save for a massively parallel super computer system using a parallel checksum algorithm such as rsync. In preferred embodiments, the checkpoint data for each node is compared to a template checkpoint file that resides in the storage and that was previously produced. Embodiments herein greatly decrease the amount of data that must be transmitted and stored for faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks with their corresponding checksums from all nodes in the system. The data blocks may be compressed using conventional non-lossy data compression algorithms to further reduce the overall checkpoint size.
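The rolling checksum referred to in the abstract and claims can be illustrated with a minimal sketch. The text does not specify the checksum, so this assumes the two-sum weak checksum commonly associated with rsync; the modulus `M` and the function names `weak_checksum` and `roll` are illustrative choices, not taken from the patent. The point of the technique is that sliding the window by one byte costs constant time instead of a full recomputation.

```python
M = 1 << 16  # assumed modulus, as in the rsync-style weak checksum


def weak_checksum(block):
    """Compute the two sums (a, b) for a block from scratch.

    a is the plain byte sum; b weights the first byte by len(block),
    the next by len(block) - 1, and so on down to 1.
    """
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide a window of length n one byte to the right.

    out_byte leaves the window on the left, in_byte enters on the right.
    """
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b
```

Rolling from offset `k` to `k + 1` with `roll(a, b, data[k], data[k + n], n)` is intended to give the same pair as calling `weak_checksum(data[k + 1:k + 1 + n])` directly.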
23 Claims
1. A parallel computer system comprising:
a) a plurality of compute nodes;
b) a plurality of I/O nodes coupled to the compute nodes;
c) a network that connects the compute nodes and the I/O nodes that supports a broadcast communication with the compute nodes; and
d) a checkpoint server that collects a parallel checkpoint of the state of the compute nodes using a rolling checksum algorithm.
Dependent claims: 2, 3, 4, 5, 6
7. A computer implemented method for checkpointing a parallel computer system comprising the steps of:
a) a checkpoint server broadcasting a template of data block checksums from a previous checkpoint to all compute nodes arranged in a cluster; and
b) each compute node searching its own memory image for checksum matches.
Dependent claims: 8, 9, 10, 11, 12, 13, 14
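The two steps of claim 7 can be sketched as follows, as one possible reading rather than the claimed implementation. The sketch assumes an rsync-style weak checksum to find candidate blocks and a SHA-1 hash to confirm them; the block size `BLOCK`, the template layout, and the names `make_template` and `find_matches` are hypothetical. A node slides a window over its memory image and emits either a reference to a template block or the literal bytes that matched nothing.

```python
import hashlib

M = 1 << 16   # assumed modulus for the weak checksum
BLOCK = 8     # hypothetical block size, kept tiny for illustration


def weak(block):
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def strong(block):
    return hashlib.sha1(block).digest()


def make_template(reference, block=BLOCK):
    """Build the template from a previous checkpoint.

    Maps weak checksum -> {strong hash: block index} for every full block.
    """
    template = {}
    for idx in range(len(reference) // block):
        chunk = reference[idx * block:(idx + 1) * block]
        template.setdefault(weak(chunk), {})[strong(chunk)] = idx
    return template


def find_matches(image, template, block=BLOCK):
    """Scan a node's memory image for blocks already in the template.

    Returns a list of ("ref", block_index) and ("data", literal_bytes).
    """
    ops, literal = [], bytearray()
    pos, n = 0, len(image)
    a, b = weak(image[:block]) if n >= block else (0, 0)
    while pos + block <= n:
        candidates = template.get((a, b))
        idx = candidates.get(strong(image[pos:pos + block])) if candidates else None
        if idx is not None:
            if literal:                      # flush unmatched bytes first
                ops.append(("data", bytes(literal)))
                literal.clear()
            ops.append(("ref", idx))
            pos += block
            if pos + block <= n:             # restart the window after a match
                a, b = weak(image[pos:pos + block])
        else:
            literal.append(image[pos])
            if pos + block < n:              # roll the window by one byte
                out_b, in_b = image[pos], image[pos + block]
                a = (a - out_b + in_b) % M
                b = (b - block * out_b + a) % M
            pos += 1
    literal.extend(image[pos:])              # tail shorter than one block
    if literal:
        ops.append(("data", bytes(literal)))
    return ops
```

Under these assumptions, only the `("data", ...)` entries would need to travel back to the checkpoint server; matched blocks are represented by their index in the template.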
15. A computer implemented method for restoring a memory of a parallel computer system with a stored checkpoint comprising the steps of:
a) a checkpoint server sends a reference template of data block checksums from a previous checkpoint to a plurality of compute nodes in a cluster;
b) the checkpoint server broadcasts data blocks to the nodes;
c) each node copies broadcast data according to its own template of data block checksums; and
d) the checkpoint server broadcasts a start message.
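The restore sequence of claim 15 can be simulated in a few lines. This is a sketch under stated assumptions: each node's template is modelled as an ordered list of block checksums (`layout`), SHA-1 stands in for the checksum, and the `Node` class, its `on_block` and `on_start` methods, and the `restore` function are hypothetical names. The broadcast is modelled as a loop that shows every stored block to every node once.

```python
import hashlib


def checksum(block):
    # SHA-1 is an assumption; the claim only speaks of data block checksums.
    return hashlib.sha1(block).hexdigest()


class Node:
    def __init__(self, layout, block_size):
        # layout[i] is the checksum of the block belonging at
        # offset i * block_size in this node's memory.
        self.layout = layout
        self.block_size = block_size
        self.memory = bytearray(len(layout) * block_size)
        self.missing = set(range(len(layout)))
        self.running = False

    def on_block(self, data):
        """Copy a broadcast block into every slot whose checksum matches."""
        key = checksum(data)
        for i, wanted in enumerate(self.layout):
            if wanted == key and i in self.missing:
                start = i * self.block_size
                self.memory[start:start + self.block_size] = data
                self.missing.discard(i)

    def on_start(self):
        """Handle the start message once all slots are filled."""
        if self.missing:
            raise RuntimeError("restore incomplete")
        self.running = True


def restore(nodes, stored_blocks):
    for data in stored_blocks:   # each block is broadcast once
        for node in nodes:       # every node sees the same broadcast
            node.on_block(data)
    for node in nodes:           # start message after all blocks are sent
        node.on_start()
```

Because a block shared by many nodes is sent only once, the volume of broadcast data in this model grows with the number of distinct blocks rather than with the number of nodes.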
16. A program product comprising:
(A) a checkpoint server that collects a parallel checkpoint of the state of a cluster of compute nodes on a parallel computer system using a rolling checksum algorithm on each node; and
(B) computer-readable signal bearing media bearing the checkpoint server.
Dependent claims: 17, 18, 19, 20, 21, 22, 23
Specification