Checkpoint restart method and apparatus utilizing multiple log memories
Abstract
Log memories are provided for recording the update history of a main memory. Each CPU records the update history of the main memory in one of the log memories and, at checkpoint acquisition, writes its context and the content of its cache memory to the main memory. When a CPU finishes its checkpoint processing, its recording of the update history is switched to the other log memory, the one the CPUs are not currently using to record the update history of the main memory. That CPU restarts normal processing without waiting for the other CPUs to finish checkpoint acquisition.
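The switching behavior described in the abstract can be sketched roughly as follows. This is only an illustrative toy model, not the patented implementation: the class `DualLogSystem`, its method names, the dictionary-based main memory, and the choice to log the checkpoint write-back itself are all assumptions made for the example.

```python
class DualLogSystem:
    """Toy model (hypothetical): one main memory, two log memories, several CPUs."""

    def __init__(self, num_cpus):
        self.main_memory = {}
        self.logs = [[], []]                # the two log memories
        self.active_log = [0] * num_cpus    # index of the log each CPU records to

    def write(self, cpu, addr, value):
        # Record the previous value (update history) before changing main memory.
        log = self.logs[self.active_log[cpu]]
        log.append((addr, self.main_memory.get(addr)))
        self.main_memory[addr] = value

    def checkpoint(self, cpu, context, cache):
        # Write the cache content and the CPU context to main memory.
        # Assumption: these writes are logged like any other update.
        for addr, value in cache.items():
            self.write(cpu, addr, value)
        self.write(cpu, ("context", cpu), context)
        # Switch this CPU to the other log memory; it does not wait for other CPUs.
        self.active_log[cpu] = 1 - self.active_log[cpu]


system = DualLogSystem(num_cpus=2)
system.write(0, "x", 1)
system.write(1, "y", 2)
system.checkpoint(0, context={"pc": 10}, cache={"x": 5})
system.write(0, "z", 3)  # CPU 0 has resumed; this update is recorded in log 1
```

In this sketch CPU 0 resumes normal writes against the second log while CPU 1 is still recording to the first one, which is the point the abstract makes about not waiting for the other CPUs.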
21 Claims
1. A checkpoint controlling apparatus for use in a multiprocessor system capable of recovering from a fault during execution of application programs, wherein the multiprocessor system has at least two processor modules, each having a processing unit and a cache memory, a shared memory for use by the processor modules to store data related to the execution of the application programs, and log memory means for recording log data as states of data in both the processor unit and the cache memory for the processor modules before updating the shared memory, and means for storing states of data in both the processor unit and the cache memory for the processor modules in the shared memory at intervals, and wherein each of the processor modules executes checkpoint processing independently, the checkpoint controlling apparatus comprising:
means for selecting one of a first portion and a second portion of the log memory means;
means for storing log data in the selected portion of the log memory means; and
means for switching to the other portion of the log memory means to store log data for the first processor module after the first processor module has completed execution of checkpoint processing.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A method of recording log data in log memories having a shared memory shared by processor modules in a multiprocessor system, the log memories including at least first and second log memories for recording log data of the shared memory, the method comprising the steps of:
storing a context of each of the processor modules and a content of each cache memory in the shared memory at each checkpoint processing by each of the processor modules;
switching from the log memory that the processor module uses to the other log memory that the processor module has not used when the checkpoint processing by the processor module is finished;
determining whether a processor module that finished a checkpoint processing is the last of the processor modules to complete checkpoint processing; and
clearing the content of the log memory used by the processor modules if it is determined that the last processor module has completed checkpoint processing.
Dependent claims: 10, 11, 12, 13
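The four steps of claim 9 might be illustrated with a small sketch like the one below. It is a hedged example under stated assumptions, not the claimed method itself: `CheckpointController`, `log_update`, `finish_checkpoint`, and the `pending` set used to detect the last module are hypothetical names, each module is assumed to checkpoint exactly once per round, and the write-back of cache content is left unlogged for brevity.

```python
class CheckpointController:
    """Hypothetical sketch of the claim 9 steps with two log memories."""

    def __init__(self, num_modules):
        self.shared_memory = {}
        self.logs = [[], []]                      # first and second log memories
        self.active = [0] * num_modules           # log currently used by each module
        self.pending = set(range(num_modules))    # modules not yet done this round

    def log_update(self, module, addr, value):
        # Record the prior value in the module's current log, then update memory.
        prior = self.shared_memory.get(addr)
        self.logs[self.active[module]].append((addr, prior))
        self.shared_memory[addr] = value

    def finish_checkpoint(self, module, context, cache):
        old_log = self.active[module]
        # Step 1: store cache content and context in the shared memory.
        self.shared_memory.update(cache)
        self.shared_memory[("context", module)] = context
        # Step 2: switch to the log memory this module has not been using.
        self.active[module] = 1 - old_log
        # Step 3: determine whether this module is the last one to finish.
        self.pending.discard(module)
        is_last = not self.pending
        # Step 4: if so, clear the log memory that was in use and start a new round.
        if is_last:
            self.logs[old_log].clear()
            self.pending = set(range(len(self.active)))
        return is_last


ctl = CheckpointController(num_modules=2)
ctl.log_update(0, "a", 1)
ctl.log_update(1, "b", 2)
first_done = ctl.finish_checkpoint(0, {"pc": 1}, {"a": 1})
ctl.log_update(0, "c", 3)  # module 0 already records to the other log
second_done = ctl.finish_checkpoint(1, {"pc": 2}, {"b": 2})
```

Here the old log is cleared only when the second module finishes, while the entry module 0 wrote after its own checkpoint survives in the other log.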
14. A checkpoint controlling method for use in a fault tolerant multiprocessor system capable of recovering from a fault during execution of application programs, wherein the fault tolerant multiprocessor system has a plurality of log memories, at least two processor modules each having a processor unit and a cache memory, a shared memory, and means for storing data in the shared memory at intervals, and wherein each of the processor modules executes checkpoint processing independently, the checkpoint controlling method comprising the steps of:
selecting one of a first log memory and a second log memory of the plurality of log memories of the fault tolerant multiprocessor system;
storing log data in the selected one of the first and the second log memories, the log data including a context of the processing unit and a content of the cache memory for the processor modules executing checkpoint processing;
determining whether all of the processor modules have completed execution of checkpoint processing; and
clearing the selected one of the first and the second log memories when it is determined that all of the processor modules have completed execution of checkpoint processing.
Dependent claims: 15, 16, 17
18. A checkpoint controlling method for use in a fault tolerant multiprocessor system capable of recovering from a fault during execution of application programs, wherein the fault tolerant multiprocessor system has a plurality of log memories, at least two processor modules each having a processor unit and a cache memory, a shared memory, and means for storing data in the shared memory at intervals, and wherein each of the processor modules repeatedly executes a checkpoint processing independently, the checkpoint controlling method comprising the steps of:
selecting one of a first log memory and a second log memory of the plurality of log memories of the fault tolerant multiprocessor system;
storing log data in the selected one of the first and the second log memories, the log data including a context of the processing unit and a content of the cache memory for the processor modules executing checkpoint processing;
determining whether all of the processor modules have completed execution of checkpoint processing; and
clearing the selected one of the first and the second log memories when it is determined that all of the processor modules have completed an iteration of the checkpoint processing.
19. A checkpoint controlling method for use in a fault tolerant multiprocessor system capable of recovering from a fault during execution of application programs, wherein the fault tolerant multiprocessor system has a plurality of log memories, at least two processor modules each having a processor unit and a cache memory, a shared memory, and means for storing data in the shared memory at intervals, and wherein each of the processor modules executes checkpoint processing independently, the checkpoint controlling method comprising the steps of:
(a) selecting a first log memory of the plurality of log memories of the fault tolerant multiprocessor system;
(b) determining whether a first one of the processor modules has completed execution of checkpoint processing;
(c) storing log data for the first processor module in the selected first log memory, the log data including a context of the processing unit and a content of the cache memory for the first processor module at a time prior to execution of the checkpoint processing;
(d) switching to a second log memory different than the first log memory, selected during step (a);
(e) determining whether a second one of the processor modules has completed execution of checkpoint processing; and
(f) storing log data for the second processor module in the first log memory, the log data including a context of the processing unit and a content of the cache memory for the second processor module at a time prior to execution of the checkpoint processing.
20. A checkpoint controlling system for use in a fault tolerant multiprocessor system capable of recovering from a fault during execution of application programs, wherein the fault tolerant multiprocessor system has a plurality of log memories, at least two processor modules each having a processor unit and a cache memory, a shared memory, and means for storing data in the shared memory at intervals, and wherein each of the processor modules executes checkpoint processing independently, the checkpoint controlling system comprising:
means for selecting one of a first log memory and a second log memory of the plurality of log memories of the fault tolerant multiprocessor system;
means for storing log data in the selected one of the first and the second log memories, the log data including a context of the processing unit and a content of the cache memory for the processor modules executing checkpoint processing;
means for determining whether all of the processor modules have completed execution of checkpoint processing; and
means for clearing the selected one of the first and the second log memories when it is determined that all of the processor modules have completed execution of checkpoint processing.
21. A fault tolerant multiprocessor system capable of recovering from a fault during execution of application programs, comprising:
a plurality of log memories;
at least two processor modules, each having a processor unit and a cache memory, wherein each of the processor modules executes checkpoint processing independently;
a shared memory;
means for storing data in the shared memory at intervals;
means for selecting one of a first log memory and a second log memory of the plurality of log memories;
means for storing log data in the selected one of the first and the second log memories, the log data including a context of the processing unit and a content of the cache memory for the processor modules executing checkpoint processing;
means for determining whether all of the processor modules have completed execution of checkpoint processing; and
means for clearing the selected one of the first and the second log memories when it is determined that all of the processor modules have completed execution of checkpoint processing.
Specification