Method and apparatus for failure recovery in a multi-processor computer system
Abstract
Multiple instances of operating systems execute cooperatively in a single multiprocessor computer wherein all processors and resources are electrically connected together. The single physical machine with multiple physical processors and resources is subdivided by software into multiple partitions, each with the ability to run a distinct copy, or instance, of an operating system. At different times, different operating system instances may be loaded on a given partition. Resources, such as CPUs and memory, can be dynamically assigned to different partitions and used by instances of operating systems running within the machine by modifying the configuration. The partitions themselves can also be changed without rebooting the system by modifying the configuration tree. The system makes use of a failure protocol that results in the transfer of processing resources controlled by an instance that experiences a failure to new destination instances on other partitions. For CPUs, destination instance IDs are stored in an array that is accessed upon occurrence of a failure to determine where the CPUs will be assigned. The secondary CPUs then dump their processing contexts, and each invokes a migration routine to transfer its control to the new instance. A destination instance may be a backup instance for the instance experiencing the failure, having no processing functions prior to the failure. Thus, the processing activities of the failed instance may be resumed quickly by the backup instance.
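The idea of reassigning resources by modifying a configuration tree can be illustrated with a minimal sketch. Everything named here (`Node`, `owner`, `assign`, `resources_of`, and the resource and partition names) is hypothetical and chosen only for illustration; the abstract does not describe the tree's actual layout.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """One node of a hypothetical configuration tree."""
    name: str
    owner: Optional[str] = None  # partition that currently owns this resource
    children: List["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        """Yield this node and all descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def assign(root: Node, resource: str, partition: str) -> None:
    """Reassign a resource by editing its owner field in the tree (no reboot modeled)."""
    for node in root.walk():
        if node.name == resource:
            node.owner = partition
            return
    raise KeyError(resource)


def resources_of(root: Node, partition: str) -> List[str]:
    """List the names of all resources currently owned by a partition."""
    return [n.name for n in root.walk() if n.owner == partition]


# Assumed example machine: three CPUs and two memory regions in two partitions.
machine = Node("machine", children=[
    Node("cpu0", owner="partition0"),
    Node("cpu1", owner="partition0"),
    Node("cpu2", owner="partition1"),
    Node("mem0", owner="partition0"),
    Node("mem1", owner="partition1"),
])

# Move cpu1 from partition0 to partition1 purely by changing the configuration.
assign(machine, "cpu1", "partition1")
```

In this sketch a partition is nothing more than the set of nodes that carry its name in `owner`, so changing a partition's composition is a single field update rather than a restart.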
376 Citations
27 Claims
1. A computer system having a plurality of system resources including processors, memory and I/O circuitry, the computer system comprising:
an interconnection mechanism for electrically interconnecting the processors, memory and I/O circuitry so that each processor has electrical access to all of the memory and at least some of the I/O circuitry;
a software mechanism for dividing the system resources into a plurality of partitions;
at least one operating system instance running in each of two or more of the plurality of partitions; and
a failure recovery apparatus that detects the event of a system failure within a first operating system instance, and initiates a transfer of control of a processing resource from the first operating system instance to a second operating system instance.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A computer system having a plurality of system resources including processors, memory and I/O circuitry, the computer system comprising:
an interconnection mechanism for electrically interconnecting the processors, memory and I/O circuitry so that each processor has electrical access to all of the memory and at least some of the I/O circuitry;
a software mechanism for dividing the system resources into a plurality of partitions, each partition of the plurality of partitions having associated with it at least one processor;
at least one operating system instance running in each of two or more of the plurality of partitions; and
means for detecting a failure within a first operating system instance that controls a primary processor and at least one secondary processor, and for transferring control of said secondary processor from the first operating system instance to a second operating system instance.
Dependent claims: 15, 16, 17, 18
19. In a computer system having a plurality of system resources including processors, memory and I/O circuitry, an interconnection mechanism for electrically interconnecting the processors, memory and I/O circuitry so that each processor has electrical access to all of the memory and at least some of the I/O circuitry, a software mechanism for dividing the system resources into a plurality of partitions, and at least one operating system instance running in each of two or more of the plurality of partitions, a method of responding to a detected failure within a first operating system instance, the method comprising:
storing an indication of a destination operating system instance to which control of a first processing resource of the first operating system instance is to be transferred upon a failure within the first operating system instance;
determining, upon the occurrence of said failure, the destination operating system instance for the first processing resource; and
changing an indicia of control for the first processing resource to transfer control of the first processing resource from the first operating system instance to the destination operating system instance.
Dependent claims: 20, 21, 22, 23, 24, 25, 26, 27
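The three steps of claim 19 (store a destination, determine it upon failure, change the indicia of control) can be sketched as follows. This is an illustrative model under stated assumptions, not the patented implementation: the `Cpu` class, its `owner` field standing in for the indicia of control, the list-based destination table, and the choice to leave the primary CPU in place are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cpu:
    cpu_id: int
    owner: int                    # assumed "indicia of control": ID of the owning instance
    is_primary: bool = False
    context_dumped: bool = False  # set once the CPU has saved its processing context


def handle_failure(cpus: List[Cpu],
                   destinations: List[Optional[int]],
                   failed_instance: int) -> List[int]:
    """Transfer CPUs of a failed instance to their stored destination instances.

    `destinations[cpu_id]` is the stored indication (step 1). Returns the IDs
    of the CPUs that were moved.
    """
    moved = []
    for cpu in cpus:
        # Assumption: only secondary CPUs of the failed instance migrate; the
        # listing above does not say what happens to the primary CPU.
        if cpu.owner != failed_instance or cpu.is_primary:
            continue
        dest = destinations[cpu.cpu_id]  # step 2: determine the destination
        if dest is None:
            continue                     # no destination stored for this CPU
        cpu.context_dumped = True        # dump processing context before leaving
        cpu.owner = dest                 # step 3: change the indicia of control
        moved.append(cpu.cpu_id)
    return moved


# Assumed example: instance 1 owns CPUs 0-2 (CPU 0 is primary), instance 2 owns CPU 3.
cpus = [Cpu(0, owner=1, is_primary=True), Cpu(1, owner=1), Cpu(2, owner=1), Cpu(3, owner=2)]

# Step 1: destination instance IDs stored ahead of time, indexed by CPU ID.
destinations = [None, 2, 3, None]

moved = handle_failure(cpus, destinations, failed_instance=1)
```

Because the destinations are stored before any failure occurs, the recovery path reduces to a table lookup and an ownership update. A destination such as instance 3 here could be a backup instance that holds no CPUs until the failure.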
Specification