Fault recovery in a distributed processing system
First Claim
1. In an arrangement comprising a plurality of processors interconnected for message communication and each having a logical identity defining functions performed by that processor with respect to said arrangement, a fault recovery method comprising each of said processors repeatedly broadcasting heartbeat messages to others of said processors, which heartbeat messages each define the logical identity of the processor broadcasting the heartbeat message, at least one of said processors maintaining an associated status table defining the logical identities of others of said processors based on heartbeat messages received therefrom and said at least one of said processors, upon failing to receive heartbeat messages defining one of said logical identities defined in said status table, initiating performance of the functions defined by said one of said logical identities.
Abstract
A fault recovery method for a distributed processing system where a message called a heartbeat is broadcast among the processors once during each major processing cycle. The heartbeat message indicates the physical and logical identity of the transmitting processor with respect to the system arrangement as well as the processor's present operational state. By monitoring the heartbeats from other processors, spare processors can autonomously take over the functions of failed processors without being required to consult or obtain the approval of an executive processor. The new physical location of a replaced processor will be automatically recorded by the other processors. The method has application to duplex standby and resource pool configurations as well as sparing arrangements.
147 Citations
37 Claims
-
1. In an arrangement comprising a plurality of processors interconnected for message communication and each having a logical identity defining functions performed by that processor with respect to said arrangement, a fault recovery method comprising
each of said processors repeatedly broadcasting heartbeat messages to others of said processors, which heartbeat messages each define the logical identity of the processor broadcasting the heartbeat message, at least one of said processors maintaining an associated status table defining the logical identities of others of said processors based on heartbeat messages received therefrom and said at least one of said processors, upon failing to receive heartbeat messages defining one of said logical identities defined in said status table, initiating performance of the functions defined by said one of said logical identities.
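The steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the class name `Monitor`, the `timeout` parameter, the use of explicit timestamps, and the identity names `"call-control"` and `"billing"` are all assumptions introduced for illustration. The status table maps each logical identity to the time its last heartbeat arrived, and a silent identity is "taken over" by recording it in a set.

```python
from dataclasses import dataclass, field


@dataclass
class Monitor:
    """One processor's view of its peers, keyed by logical identity (hypothetical)."""
    timeout: float                                   # assumed: silence longer than this counts as failure
    last_seen: dict = field(default_factory=dict)    # status table: logical identity -> last heartbeat time
    assumed: set = field(default_factory=set)        # logical identities this processor has taken over

    def on_heartbeat(self, logical_id: str, now: float) -> None:
        """Record a received heartbeat that defines `logical_id`."""
        self.last_seen[logical_id] = now

    def check(self, now: float) -> list:
        """Return identities whose heartbeats have stopped and mark them as taken over."""
        missing = [lid for lid, seen in self.last_seen.items()
                   if now - seen > self.timeout and lid not in self.assumed]
        for lid in missing:
            # Stand-in for "initiating performance of the functions" of that identity.
            self.assumed.add(lid)
        return missing


monitor = Monitor(timeout=3.0)
monitor.on_heartbeat("call-control", now=0.0)
monitor.on_heartbeat("billing", now=0.0)
monitor.on_heartbeat("billing", now=2.0)
print(monitor.check(now=4.0))  # prints ['call-control']
```

Passing the current time in explicitly, rather than reading a clock, keeps the sketch deterministic; a real system would presumably tie the check to its processing cycle.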
-
9. In an arrangement comprising a plurality of processors interconnected for message communication and each having a physical identity with respect to said arrangement and a logical identity defining functions performed by that processor with respect to said arrangement, a fault recovery method comprising
each of said processors repeatedly broadcasting heartbeat messages to others of said processors, which heartbeat messages each define the physical identity and the logical identity of the processor broadcasting the heartbeat message, each of said processors maintaining an associated status table defining the physical and logical identities of others of said processors based on heartbeat messages received therefrom, and a given one of said processors, upon failing to receive heartbeat messages defining one of the logical identities defined in the status table associated with said given processor, initiating performance of the functions defined by said one of said logical identities.
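Claim 9 adds the physical identity to each heartbeat, which lets a receiving processor notice when a logical identity reappears at a different physical location. The following is a rough sketch under assumptions: the class `StatusTable`, the integer physical identities, and the `"billing"` identity are hypothetical and do not come from the patent text.

```python
class StatusTable:
    """Maps each logical identity to the physical identity last heard broadcasting it."""

    def __init__(self):
        self.location = {}  # logical identity -> physical identity

    def on_heartbeat(self, physical_id: int, logical_id: str) -> bool:
        """Record the heartbeat; return True if the logical identity has moved."""
        previous = self.location.get(logical_id)
        moved = previous not in (None, physical_id)
        self.location[logical_id] = physical_id
        return moved


table = StatusTable()
table.on_heartbeat(physical_id=3, logical_id="billing")   # first sighting
moved = table.on_heartbeat(physical_id=7, logical_id="billing")
print(moved, table.location["billing"])  # prints True 7
```

Because the table is rebuilt from whatever heartbeats arrive, no separate notification is needed when a spare takes over an identity; the next heartbeat carries the new location.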
-
12. In an arrangement comprising a resource pool of processors and at least one other processor interconnected for message communication, a method for use by said other processor for selecting a processor from said resource pool comprising
each of said resource pool of processors repeatedly transmitting heartbeat messages to said other processor, which heartbeat messages each define a present processor state of the processor transmitting the heartbeat message, said other processor maintaining based on heartbeat messages received from said resource pool of processors, a status table defining a present processor state of each of said resource pool of processors and said other processor selecting a processor from said resource pool based on the processor state defined by said status table for said selected processor.
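The resource pool selection in claim 12 can be sketched as follows. The state names `"idle"` and `"busy"`, the class `PoolSelector`, the processor names, and the rule of picking the first idle processor in sorted order are assumptions made for illustration; the claim only requires that selection be based on the state recorded in the status table.

```python
from typing import Optional

IDLE, BUSY = "idle", "busy"  # hypothetical processor states


class PoolSelector:
    """The 'other processor': tracks pool member states from their heartbeats."""

    def __init__(self):
        self.table = {}  # status table: processor id -> present processor state

    def on_heartbeat(self, proc_id: str, state: str) -> None:
        """Update the status table from a pool member's heartbeat."""
        self.table[proc_id] = state

    def select(self) -> Optional[str]:
        """Pick a pool processor whose recorded state is idle, or None if there is none."""
        for proc_id in sorted(self.table):
            if self.table[proc_id] == IDLE:
                return proc_id
        return None


selector = PoolSelector()
selector.on_heartbeat("pool-1", BUSY)
selector.on_heartbeat("pool-2", IDLE)
selector.on_heartbeat("pool-3", IDLE)
print(selector.select())  # prints pool-2
```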
-
16. In an arrangement comprising a plurality of processors interconnected for message communication and including N active processors and at least one spare processor, N being a positive integer greater than one, each of said plurality of processors having a logical identity defining functions performed by that processor with respect to said arrangement, a method of recovering from a failure of any one of said N active processors comprising
each of said plurality of processors repeatedly broadcasting heartbeat messages to others of said plurality of processors, each of said plurality of processors monitoring the receipt of heartbeat messages from others of said plurality of processors, said one processor terminating its broadcasting of heartbeat messages, and said spare processor, upon failing to receive heartbeat messages from said one processor, initiating performance of the functions defined by the logical identity of said one processor.
-
18. In an arrangement comprising a plurality of processors interconnected for message communication and including N active processors, at least one primary spare processor, and at least one secondary spare processor, N being a positive integer greater than one, each of said plurality of processors having a logical identity defining functions performed by that processor with respect to said arrangement, a method of recovering from a failure of any one of said N active processors comprising
each of said plurality of processors repeatedly broadcasting heartbeat messages to others of said plurality of processors, each of said plurality of processors monitoring the receipt of heartbeat messages from others of said plurality of processors, said one processor terminating its broadcasting of heartbeat messages, said primary spare processor, upon failing to receive heartbeat messages from said one processor, initiating performance of the functions defined by the logical identity of said one processor, and upon said primary spare processor initiating performance of the functions defined by the logical identity of said one processor, said secondary spare processor initiating performance of the functions defined by the logical identity of said primary spare processor.
-
19. In an arrangement comprising a plurality of processors interconnected for message communication and including at least one active processor, at least one primary spare processor, and at least one secondary spare processor, each of said plurality of processors having a logical identity defining functions performed by that processor with respect to said arrangement, a method of recovering from a failure of said one active processor comprising
each of said plurality of processors repeatedly broadcasting heartbeat messages to others of said plurality of processors, which heartbeat messages each define the logical identity of the processor broadcasting the heartbeat message, each of said plurality of processors monitoring the receipt of heartbeat messages from others of said plurality of processors, said one active processor terminating its broadcasting of heartbeat messages, upon failing to receive heartbeat messages from said one active processor, said primary spare processor terminating its broadcasting of heartbeat messages defining the logical identity of said primary spare processor and initiating performance of the functions defined by the logical identity of said one active processor, said secondary spare processor, upon failing to receive heartbeat messages defining the logical identity of said primary spare processor, initiating performance of the functions defined by the logical identity of said primary spare processor.
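The cascade in claim 19 can be illustrated with a small tick-based simulation. Everything named here is an assumption for illustration: the `Processor` class, the `BACKS_UP` table, the identity strings, the `timeout` of two ticks, and the idealized broadcast in which every live processor hears every heartbeat in the same tick. When the active processor goes silent, the primary spare adopts the active identity and so stops broadcasting its own; the secondary spare then notices the missing primary-spare heartbeats and adopts that identity in turn.

```python
# Hypothetical rule: which logical identity each spare identity backs up.
BACKS_UP = {"primary-spare": "active", "secondary-spare": "primary-spare"}


class Processor:
    def __init__(self, name, logical_id, timeout=2):
        self.name = name
        self.logical_id = logical_id
        self.timeout = timeout      # assumed: ticks of silence before takeover
        self.alive = True
        self.last_heard = 0         # tick at which the watched identity was last heard

    def heartbeat(self):
        """The heartbeat defines the sender's current logical identity."""
        return self.logical_id

    def observe(self, heard, tick):
        """Monitor heartbeats; take over the watched identity if it falls silent."""
        watched = BACKS_UP.get(self.logical_id)
        if not self.alive or watched is None:
            return
        if watched in heard:
            self.last_heard = tick
        elif tick - self.last_heard >= self.timeout:
            self.logical_id = watched   # stop broadcasting the old identity, adopt the new one
            self.last_heard = tick


def run(processors, fail_name, fail_tick, ticks):
    """Simulate `ticks` broadcast rounds; `fail_name` stops broadcasting at `fail_tick`."""
    for tick in range(1, ticks + 1):
        for p in processors:
            if p.name == fail_name and tick == fail_tick:
                p.alive = False
        heard = {p.heartbeat() for p in processors if p.alive}
        for p in processors:
            p.observe(heard, tick)
    return {p.name: (p.logical_id if p.alive else None) for p in processors}


procs = [Processor("A", "active"),
         Processor("P", "primary-spare"),
         Processor("S", "secondary-spare")]
print(run(procs, fail_name="A", fail_tick=3, ticks=8))
# prints {'A': None, 'P': 'active', 'S': 'primary-spare'}
```

Deriving the watched identity from the current logical identity means that once the secondary spare becomes the primary spare, it starts watching the active identity without any extra bookkeeping.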
-
20. In an arrangement comprising a plurality of processors interconnected for message communication and each having a logical identity defining the functions performed by that processor with respect to said arrangement, a fault recovery method comprising
each of said processors repeatedly broadcasting heartbeat messages to others of said processors, said heartbeat messages each defining the logical identity of the processor broadcasting the heartbeat message, any one of said processors terminating the broadcasting of its heartbeat messages and another of said processors, upon failing to receive heartbeat messages from said any one of said processors, initiating performance of the functions defined by the logical identity of said any one of said processors.
-
23. A distributed processing arrangement comprising a plurality of processors interconnected for message communication and each having a logical identity defining functions performed by that processor with respect to said arrangement, wherein each of said processors comprises
means for repeatedly broadcasting heartbeat messages to others of said processors, which heartbeat messages each define the logical identity of said each processor, and wherein at least one of said processors further comprises means for maintaining a status table defining the logical identities of others of said processors based on heartbeat messages received therefrom, and means responsive to a failure to receive heartbeat messages defining one of said logical identities defined in said status table, for initiating performance of the functions defined by said one of said logical identities.
-
31. A distributed processing arrangement comprising a plurality of processors interconnected for message communication and each having a physical identity with respect to said arrangement and a logical identity defining functions performed by that processor with respect to said arrangement, wherein each of said processors comprises
means for repeatedly broadcasting heartbeat messages to others of said processors, which heartbeat messages each define the physical identity and the logical identity of the processor broadcasting the heartbeat message, and means for maintaining an associated status table defining the physical and logical identities of others of said processors based on heartbeat messages received therefrom, and wherein a given one of said processors further comprises means responsive to a failure to receive heartbeat messages defining one of the logical identities defined in the status table associated with said given processor, for initiating performance of the functions defined by said one of said logical identities.
-
34. A distributed processing arrangement comprising a resource pool of processors and at least one other processor interconnected for message communication,
each of said resource pool of processors comprising means for repeatedly transmitting heartbeat messages to said other processor, which heartbeat messages each define a present processor state of the processor transmitting the heartbeat message, and said other processor comprising means for maintaining based on heartbeat messages received from said resource pool of processors, a status table defining a present processor state of each of said resource pool of processors, and means for selecting a processor from said resource pool based on the processor state defined by said status table for the selected processor.
-
35. A distributed processing arrangement comprising a plurality of processors interconnected for message communication and each having a logical identity defining functions performed by that processor with respect to said arrangement, wherein each of said processors comprises
means for repeatedly broadcasting heartbeat messages to others of said processors, said heartbeat messages each defining the logical identity of the processor broadcasting the heartbeat message, means responsive to a termination in receiving heartbeat messages from any one of said processors, for initiating performance of the functions defined by the logical identity of said any one of said processors.
Specification