FAST CLUSTER FAILURE DETECTION
First Claim
1. A method for fast failure detection in a distributed computer system, comprising:
executing a distributed computer system having a plurality of clusters comprising at least a first cluster, a second cluster and the third cluster;
initializing failure detection by creating a connected cluster list in each of the plurality of clusters, wherein for each one of the plurality of clusters, a respective connected cluster list describes others of the plurality of clusters said each one is communicatively connected with;
sending a status update message upon a change in connectivity between the plurality of clusters;
generating an updated connected cluster list in each of the plurality of clusters in accordance with the status update message; and
determining whether the change in connectivity is a result of a cluster failure by examining the updated connected cluster list in each of the plurality of clusters.
Abstract
A method and system for fast failure detection in a distributed computer system. The method includes executing a distributed computer system having a plurality of clusters comprising at least a first cluster, a second cluster and the third cluster, and initializing failure detection by creating a connected cluster list in each of the plurality of clusters, wherein for each one of the plurality of clusters, a respective connected cluster list describes others of the plurality of clusters said each one is communicatively connected with. A status update message is sent upon changes in connectivity between the plurality of clusters, and an updated connected cluster list is generated in each of the plurality of clusters in accordance with the status update message. The method then determines whether the change in connectivity results from a cluster failure by examining the updated connected cluster list in each of the plurality of clusters.
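The steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `StatusUpdate`, `Cluster`, `lose_link`, `receive`, `classify` and `make_clusters` are hypothetical, and the decision rule (a cluster is treated as failed only when no other cluster still lists it as connected) is an assumption made for illustration, since the text above says only that the updated lists are examined. The sketch also assumes every cluster starts fully connected to every other cluster.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StatusUpdate:
    """Hypothetical message carrying the sender's connected cluster list."""
    sender: str
    connected: frozenset


@dataclass
class Cluster:
    name: str
    connected: set = field(default_factory=set)  # this cluster's own list
    views: dict = field(default_factory=dict)    # lists last reported by peers

    def initialize(self, all_names):
        """Create the connected cluster list, assuming a full mesh at start."""
        self.connected = set(all_names) - {self.name}
        self.views = {p: set(all_names) - {p} for p in self.connected}

    def lose_link(self, peer):
        """Record a connectivity change and return the update to send."""
        self.connected.discard(peer)
        return StatusUpdate(self.name, frozenset(self.connected))

    def receive(self, msg):
        """Regenerate the stored list for the cluster that sent the update."""
        self.views[msg.sender] = set(msg.connected)

    def classify(self, suspect):
        """Assumed rule: a cluster failed only if no other cluster lists it."""
        lists = [self.connected]
        lists += [v for p, v in self.views.items() if p != suspect]
        if any(suspect in peers for peers in lists):
            return "link failure"
        return "cluster failure"


def make_clusters(names=("A", "B", "C")):
    """Build the clusters and initialize failure detection in each one."""
    clusters = {n: Cluster(n) for n in names}
    for c in clusters.values():
        c.initialize(names)
    return clusters


# Case 1: only the A-C link drops; B still lists C as connected.
net = make_clusters()
net["B"].receive(net["A"].lose_link("C"))
link_case = net["A"].classify("C")

# Case 2: C goes down; A and B both lose it and exchange status updates.
net = make_clusters()
update_a = net["A"].lose_link("C")
update_b = net["B"].lose_link("C")
net["B"].receive(update_a)
net["A"].receive(update_b)
failure_case = (net["A"].classify("C"), net["B"].classify("C"))
```

In the first case cluster A sees that B still reports C as connected, so under the assumed rule it attributes the change to a link problem. In the second case neither surviving list contains C, so both A and B attribute the change to a cluster failure.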
20 Claims
1. A method for fast failure detection in a distributed computer system, comprising:
executing a distributed computer system having a plurality of clusters comprising at least a first cluster, a second cluster and the third cluster;
initializing failure detection by creating a connected cluster list in each of the plurality of clusters, wherein for each one of the plurality of clusters, a respective connected cluster list describes others of the plurality of clusters said each one is communicatively connected with;
sending a status update message upon a change in connectivity between the plurality of clusters;
generating an updated connected cluster list in each of the plurality of clusters in accordance with the status update message; and
determining whether the change in connectivity is a result of a cluster failure by examining the updated connected cluster list in each of the plurality of clusters.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 17, 18, 19, 20
9. A computer readable storage medium having stored thereon, computer executable instructions that, if executed by a computer system cause the computer system to perform a method comprising:
executing a distributed computer system having a plurality of clusters comprising at least a first cluster, a second cluster and the third cluster;
initializing failure detection by creating a connected cluster list in each of the plurality of clusters, wherein for each one of the plurality of clusters, a respective connected cluster list describes others of the plurality of clusters said each one is communicatively connected with;
sending a status update message upon a change in connectivity between the plurality of clusters;
generating an updated connected cluster list in each of the plurality of clusters in accordance with the status update message; and
determining whether the change in connectivity is a result of a cluster failure by examining the updated connected cluster list in each of the plurality of clusters.
Dependent claims: 10, 11, 12, 13, 14, 15
16. A server computer system, comprising:
a computer system having a processor coupled to a computer readable storage media and executing computer readable code which causes the computer system to implement a failure detection agent that functions by:
initializing failure detection by creating a connected cluster list, wherein for each one of the plurality of clusters, a respective connected cluster list describes others of the plurality of clusters said each one is communicatively connected with;
sending a status update message upon a change in connectivity between the plurality of clusters;
generating an updated connected cluster list in accordance with the status update message; and
determining whether the change in connectivity is a result of a cluster failure by examining the updated connected cluster list for each of the plurality of clusters.
Specification