Fast cluster failure detection
Abstract
A method and system for fast failure detection in a distributed computer system. The method includes executing a distributed computer system having a plurality of clusters comprising at least a first cluster, a second cluster and a third cluster, and initializing failure detection by creating a connected cluster list in each of the plurality of clusters, wherein, for each one of the plurality of clusters, the respective connected cluster list describes the others of the plurality of clusters with which that cluster is communicatively connected. A status update message is sent upon a change in connectivity between the plurality of clusters, and an updated connected cluster list is generated in each of the plurality of clusters in accordance with the status update message. The method then determines whether the change in connectivity results from a cluster failure by examining the updated connected cluster list in each of the plurality of clusters.
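The bookkeeping described in the abstract can be pictured with a short Python sketch. It is illustrative only and is not taken from the specification: the names `StatusUpdate`, `ClusterView`, `initialize` and `apply` are hypothetical, and the sketch assumes that each cluster keeps a local copy of every cluster's connected cluster list, which is one reading of the claim language about a cluster editing "a connected cluster list of the second cluster".

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StatusUpdate:
    """Hypothetical message: `reporter` gained or lost contact with `peer`."""
    reporter: str
    peer: str
    connected: bool


@dataclass
class ClusterView:
    """One cluster's local copy of every cluster's connected cluster list."""
    name: str
    lists: dict[str, set[str]] = field(default_factory=dict)

    def initialize(self, topology: dict[str, set[str]]) -> None:
        # Copy the sets so later updates never mutate the caller's topology.
        self.lists = {cluster: set(peers) for cluster, peers in topology.items()}

    def apply(self, update: StatusUpdate) -> None:
        # Produce the updated connected cluster list for the reporting cluster.
        peers = self.lists.setdefault(update.reporter, set())
        if update.connected:
            peers.add(update.peer)
        else:
            peers.discard(update.peer)


# Assumed three-cluster topology in which every cluster sees the other two.
topology = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"}}
view_c = ClusterView("C")
view_c.initialize(topology)
view_c.apply(StatusUpdate(reporter="B", peer="A", connected=False))
print(sorted(view_c.lists["B"]))  # ['C']
```

Storing the lists as sets keeps both the add and the remove idempotent, so a repeated status update leaves the view unchanged.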
18 Claims
1. A method for fast failure detection in a distributed computer system, comprising:
executing a distributed computer system having a plurality of clusters comprising at least a first cluster, a second cluster and a third cluster;
initializing failure detection by creating a connected cluster list in each of the plurality of clusters, wherein for each one of the plurality of clusters, a respective connected cluster list describes others of the plurality of clusters said each one is communicatively connected with;
sending a status update message upon a change in connectivity between the plurality of clusters;
generating an updated connected cluster list in each of the plurality of clusters in accordance with the status update message; and
determining whether the change in connectivity is a result of a cluster failure by examining the updated connected cluster list in each of the plurality of clusters;
wherein upon receiving a loss of communication status update message from the second cluster, the third cluster removes the first cluster from a connected cluster list of the second cluster, and wherein the third cluster checks a connected cluster list of the third cluster to determine whether the third cluster is connected to another cluster to which the first cluster is also connected.
Dependent claims: 2, 3, 4, 5, 6, 7

8. A computer readable storage medium having stored thereon computer executable instructions that, if executed by a computer system, cause the computer system to perform a method comprising:
executing a distributed computer system having a plurality of clusters comprising at least a first cluster, a second cluster and a third cluster;
initializing failure detection by creating a connected cluster list in each of the plurality of clusters, wherein for each one of the plurality of clusters, a respective connected cluster list describes others of the plurality of clusters said each one is communicatively connected with;
sending a status update message upon a change in connectivity between the plurality of clusters;
generating an updated connected cluster list in each of the plurality of clusters in accordance with the status update message; and
determining whether the change in connectivity is a result of a cluster failure by examining the updated connected cluster list in each of the plurality of clusters;
wherein upon receiving a loss of communication status update message from the second cluster, the third cluster removes the first cluster from a connected cluster list of the second cluster, and wherein the third cluster checks a connected cluster list of the third cluster to determine whether the third cluster is connected to another cluster to which the first cluster is also connected.
Dependent claims: 9, 10, 11, 12, 13

14. A server computer system, comprising:
a computer system having a plurality of clusters comprising at least a first cluster, a second cluster and a third cluster;
a processor coupled to a computer readable storage media and executing computer readable code which causes the computer system to implement a failure detection agent that functions by:
initializing failure detection by creating a connected cluster list, wherein for each one of the plurality of clusters, a respective connected cluster list describes others of the plurality of clusters said each one is communicatively connected with;
sending a status update message upon a change in connectivity between the plurality of clusters;
generating an updated connected cluster list in accordance with the status update message; and
determining whether the change in connectivity is a result of a cluster failure by examining the updated connected cluster list for each of the plurality of clusters;
wherein upon receiving a loss of communication status update message from the second cluster, the third cluster removes the first cluster from a connected cluster list of the second cluster, and wherein the third cluster checks a connected cluster list of the third cluster to determine whether the third cluster is connected to another cluster to which the first cluster is also connected.
Dependent claims: 15, 16, 17, 18

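The closing "wherein" clause shared by claims 1, 8 and 14 describes a two-step check performed by the third cluster. The Python sketch below is one possible reading of that clause, not the patented implementation. The function name `looks_like_cluster_failure` and its parameters are hypothetical. The decision rule is also an assumption: the claims state which lists are examined but not how the result is interpreted, so the sketch treats the lost cluster as failed only when neither the checking cluster nor any cluster it is connected to still lists it.

```python
def looks_like_cluster_failure(lists, me, reporter, lost):
    """Return True if, from `me`'s view, no connected cluster still sees `lost`.

    `lists` maps each cluster name to its connected cluster list (a set).
    """
    # Step 1: remove the lost cluster from our copy of the reporter's list.
    lists.setdefault(reporter, set()).discard(lost)

    # Step 2: check our own list for another cluster still connected to `lost`.
    # Assumption: our own direct link to `lost` also counts as evidence it is up.
    candidates = lists.get(me, set()) | {me}
    witnesses = {
        peer for peer in candidates
        if peer != lost and lost in lists.get(peer, set())
    }
    return not witnesses


# Hypothetical four-cluster system; C has already lost its own link to A.
lists = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"B", "D"},
    "D": {"A", "B", "C"},
}
first = looks_like_cluster_failure(lists, me="C", reporter="B", lost="A")
second = looks_like_cluster_failure(lists, me="C", reporter="D", lost="A")
print(first, second)  # False True
```

In this example the first report leaves D as a cluster that still sees A, so the loss is not yet attributed to a failure of A. Once D reports the same loss, no cluster connected to C lists A any longer and the sketch flags a likely cluster failure.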
Specification