Fault-tolerant computer system with auto-restart after power-fall
Abstract
A fault-tolerant computer system employs a power supply system including a battery backup so that upon AC power failure the system can execute an orderly shutdown, saving state to disk. A restart procedure restores the state existing at the time of power failure if the AC power has been restored by the time the shutdown is completed. This powerfail/autorestart procedure may be implemented in a fault-tolerant multiprocessor configuration having multiple identical CPUs executing the same instruction stream, with multiple, identical memory modules in the address space of the CPUs storing duplicates of the same data. The system detects faults in the CPUs and memory modules, and places a faulty unit offline while continuing to operate using the good units. The multiple CPUs are loosely synchronized, as by detecting events such as memory references and stalling any CPU ahead of others until all execute the function simultaneously; interrupts can be synchronized by ensuring that all CPUs implement the interrupt at the same point in their instruction stream. Memory references via the separate CPU-to-memory busses are voted at the three separate ports of each of the memory modules. I/O functions are implemented using two identical I/O busses, each of which is separately coupled to only one of the memory modules. A number of I/O processors are coupled to both I/O busses. I/O devices are accessed through a pair of identical (redundant) I/O processors, but only one is designated to actively control a given device; in case of failure of one I/O processor, however, an I/O device can be accessed by the other one without system shutdown.
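As an illustration of the voting described above, the following is a minimal sketch of a two-out-of-three vote over the requests arriving at the three ports of one memory module. The abstract does not state the voting rule, so the majority rule, the function name `vote`, and the tuple encoding of a memory reference are all assumptions made for this example.

```python
from collections import Counter


def vote(requests):
    """Return (winning_request, indexes_of_dissenting_cpus).

    `requests` holds one hashable memory reference per CPU port,
    e.g. ("write", 0x1000, 42). Assumed rule: two matching
    requests out of three win; the odd one out is reported as
    a candidate for being placed offline.
    """
    if len(requests) != 3:
        raise ValueError("expected one request per CPU port (3 total)")
    value, count = Counter(requests).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority among CPU requests")
    dissenting = [i for i, r in enumerate(requests) if r != value]
    return value, dissenting
```

A caller could treat a non-empty list of dissenting indexes as the trigger for marking that CPU faulty while the remaining units keep running.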
16 Claims
1. A method of operating a computer system having a central processing unit (CPU), memory including volatile memory and non-volatile memory, a main power supply, a backup power supply, and a plurality of devices peripheral to said CPU, said method comprising the steps of:
(a) executing processes in the central processing unit (CPU) from memory, while the main power supply provides power to said computer system;
(b) detecting a failure of said main power supply and, in response thereto, providing power to said computer system from the backup power supply and executing a shutdown procedure in said CPU, said shutdown procedure including first warning said processes of an impending shutdown of the computer system, said processes responding to said warning in a manner varying from process to process, and then copying state information of said computer system from said memory to said non-volatile storage, wherein said state information includes state information of the processes and state information of the devices;
(c) after completing said shutdown procedure, if said power supply has been restored, automatically initiating a restart procedure;
(d) said restart procedure including reading said stored state from said non-volatile storage and restarting said processes and continuing executing without rebooting;
(e) or, if said power supply has not been restored within a predetermined period of time after completion of said shutdown procedure, automatically shutting down said backup power and ceasing execution by said CPU.
Dependent claims: 2, 3, 4, 5
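Steps (b) through (e) of claim 1 can be read as a small control flow. The sketch below simulates it under stated assumptions: processes and devices are plain dictionaries, `power_restored` is a hypothetical callable that reports whether main power is back after a given number of seconds, and time is simulated rather than measured. None of these names come from the patent.

```python
def powerfail_cycle(processes, devices, power_restored, wait_limit_s, poll_s=1.0):
    """Walk steps (b)-(e) once; return (outcome, saved_state, log)."""
    log = []
    # (b) First warn every process; each responds in its own way.
    for proc in processes:
        log.append(("warn", proc["name"]))
        proc["on_warning"](proc)
    # (b) Then copy process and device state to non-volatile storage
    # (a dictionary stands in for the disk here).
    saved = {
        "processes": {p["name"]: dict(p["state"]) for p in processes},
        "devices": {d["name"]: dict(d["state"]) for d in devices},
    }
    log.append(("state_saved",))
    # (c)/(e) After shutdown completes, either restart or give up.
    waited = 0.0
    while True:
        if power_restored(waited):
            # (d) The restart would read `saved` back and resume
            # the processes without rebooting.
            log.append(("restart",))
            return "restarted", saved, log
        if waited >= wait_limit_s:
            # (e) Predetermined period elapsed: backup power off.
            log.append(("backup_off",))
            return "halted", saved, log
        waited += poll_s
```

The per-process `on_warning` callback is one way to model "responding to said warning in a manner varying from process to process": one process might flush its buffers while another ignores the warning.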
6. A method of operating a computer system having a central processing unit (CPU), a main power supply, and a backup power supply, said method comprising the steps of:
(a) detecting a failure of the main power supply for said computer system and, in response thereto, providing power to said computer system from the backup power supply, and executing a shutdown process in the CPU;
(b) continuing said shutdown process to completion using said backup power supply even if said main power supply is restored before said completion;
(c) after said shutdown process is completed, beginning a restart process for said CPU if said main power supply is restored;
(d) automatically terminating said restart process if another power failure occurs before expiration of a selected time period, said restart process continuing to completion if said another failure occurs after expiration of said selected time period; and
(e) within a predetermined period of time after said shutdown process is completed, automatically turning off said backup power supply if said main power supply has not been restored.
Dependent claims: 7, 8, 9, 10
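The timing rule in step (d) of claim 6 reduces to a single comparison. The sketch below is illustrative only: the function name, the string results, and the treatment of a failure that lands exactly on the boundary (counted here as not "before expiration") are assumptions, since the claim does not address that case.

```python
def restart_outcome(second_failure_at_s, selected_period_s):
    """Decide what happens to a restart that is already under way.

    second_failure_at_s: seconds after the restart began at which
    another power failure occurred, or None if none occurred.
    """
    if second_failure_at_s is not None and second_failure_at_s < selected_period_s:
        # Failure before the selected period expires: abandon the restart.
        return "terminate_restart"
    # No second failure, or it came after the period expired:
    # the restart runs to completion.
    return "complete_restart"
```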
11. A method of operating a computer system having a central processing unit (CPU), memory including volatile memory and non-volatile memory, a main power supply, a backup power supply, and a plurality of devices peripheral to said CPU, said method comprising the steps of:
(a) executing a process in the CPU, the process including instructions to store data in non-volatile memory, and using volatile memory to temporarily store the data which said process has instructed to be written to the non-volatile memory;
(b) detecting a failure of the main power supply for said computer system;
(c) entering a shutdown procedure using the backup power supply including writing to the non-volatile memory all data which is temporarily stored in said volatile memory prior to being written to said non-volatile memory and preventing further writes to said non-volatile memory from being temporarily stored in volatile memory, said shutdown procedure including copying a state of said process from volatile memory to non-volatile memory; and
(d) completing said shutdown procedure even if said main power supply is restored during execution of said shutdown procedure.
Dependent claims: 12
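Step (c) of claim 11 describes flushing buffered writes and then preventing further buffering. One way to picture this is a write buffer that switches to write-through mode at shutdown. The class below is a minimal sketch under that assumption; the class name, its attributes, and the use of dictionaries in place of real volatile and non-volatile memory are hypothetical.

```python
class BufferedStore:
    """Toy model of writes buffered in volatile memory before reaching disk."""

    def __init__(self):
        self.nonvolatile = {}     # stands in for non-volatile memory
        self.pending = {}         # volatile buffer of not-yet-written data
        self.saved_state = None   # process state copied at shutdown
        self.write_through = False

    def write(self, key, value):
        if self.write_through:
            # After shutdown begins, writes bypass the volatile buffer.
            self.nonvolatile[key] = value
        else:
            self.pending[key] = value

    def enter_shutdown(self, process_state):
        # Write out everything still held in the volatile buffer.
        self.nonvolatile.update(self.pending)
        self.pending.clear()
        # Prevent further writes from being buffered in volatile memory.
        self.write_through = True
        # Copy the process state to non-volatile storage.
        self.saved_state = dict(process_state)
```

Step (d) would correspond to never aborting `enter_shutdown` once it has started, even if main power returns part way through.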
13. A method of operating a computer system having a central processing unit (CPU), memory including volatile memory and non-volatile memory, a main power supply, a backup power supply, and at least one device peripheral to said CPU controlled by said CPU during normal power operation from the main power supply, said method comprising the steps of:
(a) executing code by the CPU from the memory in normal operation, said code corresponding to processes being executed by the CPU, said execution of code including sending signals to said device and receiving signals from said device so as to control operation of said device;
(b) detecting occurrence of failure of the main power supply for said computer system, and continuing execution of code by said CPU using the backup power supply;
(c) after detecting said failure, initiating execution of a shutdown procedure by said CPU, including sending a sequence of signals between said CPU and said device, while continuing execution of said shutdown procedure by the CPU to save the current state of processes being executed, the sequence of signals including:
  (i) a first signal from said CPU to said device indicating powerfail;
  (ii) a second signal from said CPU to said device indicating halt of further device operations;
  (iii) a third signal from said device to said CPU indicating the amount of memory needed by the device to save state;
  (iv) a fourth signal from said CPU to said device including an address in said memory to save the state of said device;
(d) storing in said non-volatile memory the data written by said device to said address in memory; and
(e) shutting down said backup power supply and ceasing execution of code by said CPU.
Dependent claims: 14
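The four-signal exchange in step (c) of claim 13, followed by the copy in step (d), can be sketched as a short handshake. Everything below is an assumption made for illustration: the `Device` class, its method names, the use of a `bytearray` as main memory, and the caller-supplied `next_free` address are not taken from the patent.

```python
class Device:
    """Hypothetical peripheral that can dump its state into main memory."""

    def __init__(self, state: bytes):
        self.state = state
        self.notified = False
        self.halted = False

    def on_powerfail(self):          # signal (i): CPU -> device
        self.notified = True

    def on_halt(self):               # signal (ii): CPU -> device
        self.halted = True

    def state_size(self) -> int:     # signal (iii): device -> CPU
        return len(self.state)

    def save_state(self, memory: bytearray, address: int):  # signal (iv)
        memory[address:address + len(self.state)] = self.state


def shutdown_handshake(device, memory, next_free):
    """Run signals (i)-(iv), then return the bytes destined for disk."""
    trace = []
    device.on_powerfail()
    trace.append("powerfail")
    device.on_halt()
    trace.append("halt")
    size = device.state_size()
    trace.append(("size", size))
    if next_free + size > len(memory):
        raise MemoryError("not enough memory to hold device state")
    device.save_state(memory, next_free)
    trace.append(("save_at", next_free))
    # (d) Copy what the device wrote to non-volatile storage.
    return bytes(memory[next_free:next_free + size]), trace
```

Asking the device for its size before handing it an address lets the CPU reserve exactly the memory the device needs, which is the ordering the claim recites.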
15. A method of operating a computer system having a central processing unit (CPU), memory including volatile memory and non-volatile memory, a main power supply, and a backup power supply, said method comprising the steps of:
(a) executing code by the CPU from the memory in normal operation while power for said computer system is supplied by the main power supply, said execution including controlling processes;
(b) detecting the occurrence of failure of said main power supply, and continuing execution of code by said CPU using the backup power supply;
(c) after detecting said failure, initiating execution of a shutdown procedure by said CPU, including issuing a sequence of signals from said CPU to said processes controlled by said CPU during normal operation immediately prior to said power failure, while continuing execution of said shutdown procedure by the CPU to save state of said processes being executed, the signals to said processes including:
  (i) "signal power failure" (SIGPWR) with code "power failure quiesce" (PFQUIESCE) during shutdown followed by "signal power failure" (SIGPWR) with code "power failure restart" (PFRESTART), or
  (ii) "signal terminated" (SIGTERM) with code "power failure quiesce" (PFQUIESCE) followed by "signal kill" (SIGKILL);
(d) storing on said non-volatile memory said state; and
(e) shutting down said backup power supply and ceasing execution of code by said CPU.
Dependent claims: 16
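The two alternative signal sequences in step (c) of claim 15 can be written out as data. The claim does not say how a sequence is chosen for a given process, so the `handles_sigpwr` flag below is an assumption, as is the function name. Signals are represented as plain strings rather than values from an operating system's signal table, so the sketch runs on any platform.

```python
def signal_sequence(handles_sigpwr: bool):
    """Return the ordered (signal, code) pairs sent to one process.

    handles_sigpwr is a hypothetical per-process flag: True selects
    alternative (i), False selects alternative (ii).
    """
    if handles_sigpwr:
        # (i) Quiesce during shutdown, later told to resume.
        return [("SIGPWR", "PFQUIESCE"), ("SIGPWR", "PFRESTART")]
    # (ii) Asked to terminate, then killed.
    return [("SIGTERM", "PFQUIESCE"), ("SIGKILL", None)]
```

On this reading, a process that receives sequence (i) survives the power failure, while a process that receives sequence (ii) is ended during shutdown.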
Specification