Fault monitor for restarting failed instances of the fault monitor
Abstract
A computer system having a fault-tolerance framework in an extendable computer architecture. The computer system is formed of clusters of nodes where each node includes computer hardware and operating system software for executing jobs that implement the services provided by the computer system. Jobs are distributed across the nodes under control of a hierarchical resource management unit. The resource management unit includes hierarchical monitors that monitor and control the allocation of resources. In the resource management unit, a first monitor, at a first level, monitors and allocates elements below the first level. A second monitor, at a second level, monitors and allocates elements at the first level. The framework is extendable from the hierarchy of the first and second levels to higher levels where monitors at higher levels each monitor lower level elements in a hierarchical tree. If a failure occurs down the hierarchy, a higher level monitor restarts an element at a lower level. If a failure occurs up the hierarchy, a lower level monitor restarts an element at a higher level. Each of the monitors includes termination code that causes an element to terminate if duplicate elements have been restarted for the same job. The termination code in one embodiment includes suicide code whereby an element will self-destruct when the element detects that it is an unnecessary duplicate element.
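The termination ("suicide") code described in the abstract can be illustrated with a minimal sketch. The abstract does not say how an element detects that it is a duplicate, so the shared `Registry`, its `claim` method, and the `Element.check_duplicate` method below are all hypothetical names introduced only for illustration; the sketch assumes the first instance to claim a job becomes its owner.

```python
class Registry:
    """Hypothetical shared record of which element instance owns each job."""

    def __init__(self):
        self.owner = {}  # job id -> instance id of the owning element

    def claim(self, job, instance_id):
        # The first claimant becomes the owner; later claimants are told
        # who the existing owner is (assumption, not from the source).
        return self.owner.setdefault(job, instance_id)


class Element:
    """An element restarted by a monitor to run a particular job."""

    def __init__(self, job, instance_id, registry):
        self.job = job
        self.instance_id = instance_id
        self.registry = registry
        self.alive = True

    def check_duplicate(self):
        """Suicide code: terminate this element if another instance
        already owns the same job. Returns True if still alive."""
        owner = self.registry.claim(self.job, self.instance_id)
        if owner != self.instance_id:
            self.alive = False  # self-destruct as an unnecessary duplicate
        return self.alive
```

In this sketch, if a higher-level and a lower-level monitor both restart an element for the same job, the second instance to call `check_duplicate` finds the job already claimed and terminates itself, leaving exactly one element per job.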
72 Claims
1. A fault tolerant computer system for executing one or more jobs on one or more nodes, comprising,
a hierarchy of monitors for monitoring operations in the computer system including,
one or more first monitors for monitoring first operations and, for any particular one of said first operations that fails, for restarting another instance of said particular one of said first operations,
one or more second monitors for monitoring said first monitors and, if any particular one of said first monitors fails, for restarting another instance of said particular one of said first monitors.
70. In a fault tolerant computer system operating to execute one or more jobs on one or more nodes where the computer system includes a hierarchy of monitors for monitoring operations in the computer system, the method comprising,
monitoring first operations with one or more first monitors and, for any particular one of said first operations that fails, restarting another instance of said particular one of said first operations,
monitoring said first monitors with one or more second monitors and, if any particular one of said first monitors fails, restarting another instance of said particular one of said first monitors.
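The two-level monitoring recited in claims 1 and 70 can be sketched as a small in-memory simulation. This is an illustrative sketch only, not the patented implementation: the class names `Operation`, `FirstMonitor`, and `SecondMonitor`, the `alive` flag used to model failure, and the `poll` method are all assumptions, and "restarting another instance" is modeled simply as constructing a fresh object with the same name.

```python
class Operation:
    """A first operation (for example, a job); `alive` models failure."""

    def __init__(self, name):
        self.name = name
        self.alive = True


class FirstMonitor:
    """Monitors first operations and restarts another instance of any
    operation that has failed."""

    def __init__(self, name, operations):
        self.name = name
        self.operations = operations
        self.alive = True

    def poll(self):
        restarted = []
        for i, op in enumerate(self.operations):
            if not op.alive:
                # Replace the failed operation with a new instance.
                self.operations[i] = Operation(op.name)
                restarted.append(op.name)
        return restarted


class SecondMonitor:
    """Monitors first monitors and restarts another instance of any
    first monitor that has failed."""

    def __init__(self, first_monitors):
        self.first_monitors = first_monitors

    def poll(self):
        restarted = []
        for i, mon in enumerate(self.first_monitors):
            if not mon.alive:
                # The new monitor instance takes over the same operations.
                self.first_monitors[i] = FirstMonitor(mon.name, mon.operations)
                restarted.append(mon.name)
        return restarted
```

Each `poll` call returns the names of the elements it restarted, so a caller can check that a failed operation is replaced by its first monitor and that a failed first monitor is replaced by the second monitor.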
Specification