Active failure detection
First Claim
1. A method for detecting a failure in a fault-tolerant computer system that includes a first input/output processor and a second input/output processor coupled to a data communication system, the method comprising the steps of:
processing a first data communication at the first input/output processor;
applying a timing criterion to a category of the first data communication processed by the first input/output processor to produce a first timing result;
processing a second data communication at the second input/output processor;
applying the timing criterion to the category of the second data communication processed by the second input/output processor to produce a second timing result;
determining a relationship between the first timing result and the second timing result; and
detecting whether a failure has occurred based on the determined relationship.
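The steps of the claim can be illustrated with a minimal sketch. The claim does not say what the timing criterion is or how the relationship is evaluated, so everything concrete here is an assumption made for illustration: the `IOProcessor` class, the `detect_failure` function, the criterion "a communication of this category was processed within the last `window` seconds", and the rule that a processor is suspected when its timing result disagrees with its peer's.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class IOProcessor:
    """Hypothetical stand-in for one input/output processor."""
    name: str
    last_seen: Dict[str, float] = field(default_factory=dict)

    def process(self, category: str, timestamp: float) -> None:
        # Record when a communication of this category was last processed.
        self.last_seen[category] = timestamp

    def timing_result(self, category: str, now: float, window: float) -> bool:
        # Assumed timing criterion: was this category processed within
        # the last `window` seconds?
        seen = self.last_seen.get(category)
        return seen is not None and (now - seen) <= window


def detect_failure(first: IOProcessor, second: IOProcessor,
                   category: str, now: float, window: float) -> Optional[str]:
    """Return the name of the suspected processor, or None."""
    r1 = first.timing_result(category, now, window)
    r2 = second.timing_result(category, now, window)
    if r1 == r2:
        # Assumed relationship: agreeing results indicate no processor
        # failure for this category (both silent is not resolved here).
        return None
    # The processor that missed traffic its peer saw is the suspect.
    return second.name if r1 else first.name
```

For example, if both processors see a heartbeat at time 10 and only the first sees another at time 20, a check at time 22 with a 5-second window flags the second processor.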
Abstract
Failures in a fault-tolerant computer system that includes two or more input/output processors connected to a data communication system are detected by monitoring data communication. The computer system is able to detect failures associated with a primary input/output processor, as well as with standby input/output processors, and is also able to discriminate between failures of the input/output processors and communication failures in the data communication network itself. In addition to using heartbeat-like transmissions, various other categories of data communication are also used to detect failures. The system is able to detect failures when the input/output processors are on a common network segment, allowing the processors to monitor identical data traffic, as well as when the processors are on different segments where, as a result of filtering behavior of network elements such as active hubs, the processors may not be able to monitor identical data traffic.
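One way to picture the discrimination described in the abstract is the sketch below. The abstract does not give the decision rules, so the `Verdict` names, the `classify` function, and the `COMPARABLE_ACROSS_SEGMENTS` set are all hypothetical. The assumed logic: if neither processor saw the expected traffic, the network is suspected; if only one missed it, that processor is suspected; and on different segments only categories assumed to pass through filtering elements are compared.

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    FIRST_IOP_SUSPECT = "first_iop_suspect"
    SECOND_IOP_SUSPECT = "second_iop_suspect"
    NETWORK_SUSPECT = "network_suspect"
    INCONCLUSIVE = "inconclusive"


# Assumption: categories that reach every segment despite filtering hubs.
COMPARABLE_ACROSS_SEGMENTS = {"broadcast", "heartbeat"}


def classify(first_saw: bool, second_saw: bool,
             category: str, same_segment: bool) -> Verdict:
    """Classify one observation pair under the assumed rules above."""
    if not same_segment and category not in COMPARABLE_ACROSS_SEGMENTS:
        # Filtering may legitimately hide this traffic from one processor.
        return Verdict.INCONCLUSIVE
    if first_saw and second_saw:
        return Verdict.OK
    if not first_saw and not second_saw:
        # Both processors are silent: more likely a network fault.
        return Verdict.NETWORK_SUSPECT
    if first_saw:
        return Verdict.SECOND_IOP_SUSPECT
    return Verdict.FIRST_IOP_SUSPECT
```

Treating filtered categories as inconclusive rather than as failures reflects the abstract's point that processors on different segments may not see identical traffic.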
24 Claims
1. A method for detecting a failure in a fault-tolerant computer system that includes a first input/output processor and a second input/output processor coupled to a data communication system, the method comprising the steps of:
processing a first data communication at the first input/output processor;
applying a timing criterion to a category of the first data communication processed by the first input/output processor to produce a first timing result;
processing a second data communication at the second input/output processor;
applying the timing criterion to the category of the second data communication processed by the second input/output processor to produce a second timing result;
determining a relationship between the first timing result and the second timing result; and
detecting whether a failure has occurred based on the determined relationship.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A fault-tolerant computer system coupled to a data communication system comprising:
a first input/output processor configured to process a category of data communications and to apply a timing criterion to the category of data communications to produce a first timing result; and
a second input/output processor configured to process the category of data communications and to apply a timing criterion to the category of data communications to produce a second timing result;
wherein the computer system is configured to determine a relationship between the timing results and to determine whether a failure has occurred based on the relationship.
Dependent claims: 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
Specification