Topology manager for failure detection in a distributed computing system
First Claim
1. A computer-implemented method comprising:
receiving, by a topology manager of a distributed computing system, notification that a destination computing node in the distributed computing system is not responding to a communication request, the topology manager being implemented on a data partition of the distributed computing system, the distributed computing system comprising a plurality of computing nodes, the plurality of nodes comprising the destination computing node;
determining, by the topology manager, that the destination computing node is dead and/or has a loss of communication with one or more other computing nodes in the plurality of computing nodes by querying at least a subset of other computing nodes of the plurality of computing nodes regarding liveness of the destination computing node and receiving confirmation from a quorum of the queried computing nodes;
retiring, by the topology manager in response to the determining, the destination computing node, the retiring causing the destination computing node to become a retired computing node; and
causing, by the topology manager, a load balancing of replicas of data partitions in the distributed computing system to compensate for loss of the retired computing node, the load balancing comprising re-assigning one or more of the replicas of data partitions among one or more surviving computing nodes in the plurality of computing nodes.
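The determining step of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function name `confirm_unreachable`, the `probe` callback, and the default quorum size (a strict majority of the queried nodes) are all assumptions made for illustration; the claim itself does not fix how large the quorum must be.

```python
from typing import Callable, Iterable, Optional


def confirm_unreachable(
    suspect: str,
    peers: Iterable[str],
    probe: Callable[[str, str], bool],
    quorum: Optional[int] = None,
) -> bool:
    """Return True if a quorum of queried peers cannot reach the suspect.

    probe(peer, suspect) is a hypothetical callback that returns True when
    `peer` can still communicate with `suspect`.
    """
    # Query every other node; the suspect never votes on itself.
    queried = [p for p in peers if p != suspect]
    if not queried:
        return False  # nobody to ask, so nothing can be confirmed

    # Assumption: default quorum is a strict majority of the queried nodes.
    if quorum is None:
        quorum = len(queried) // 2 + 1

    confirmations = sum(1 for p in queried if not probe(p, suspect))
    return confirmations >= quorum
```

In a real system `probe` would be a network call with a timeout, and a peer that fails to answer the query would need its own handling; the sketch ignores both.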
1 Assignment
0 Petitions
Abstract
A topology manager implemented on a data partition of a distributed computing system can be notified that a destination computing node in the distributed computing system is not responding to a communication request. Upon determining that the destination computing node is dead and/or has lost communication with one or more, and optionally a majority, of the other computing nodes in a plurality of computing nodes of the distributed computing system, the topology manager can retire the destination computing node and cause a load balancing of replicas of data partitions in the distributed computing system to compensate for loss of the retired computing node.
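The retire-and-rebalance step described in the abstract might look like the following Python sketch. It is an illustration under stated assumptions, not the method disclosed in the specification: the function name `retire_and_rebalance`, the node-to-partitions mapping, the least-loaded placement rule, and the constraint that a node holds at most one replica of a given partition are all hypothetical choices.

```python
from typing import Dict, Set


def retire_and_rebalance(
    assignments: Dict[str, Set[str]], retired: str
) -> Dict[str, Set[str]]:
    """Drop the retired node and re-assign its replicas among the survivors.

    `assignments` maps a node name to the set of partitions it holds a
    replica of. The input mapping is not modified.
    """
    survivors = {
        node: set(parts) for node, parts in assignments.items() if node != retired
    }
    if not survivors:
        raise ValueError("no surviving nodes to take over replicas")

    # Re-home each orphaned replica, in a deterministic order.
    for partition in sorted(assignments.get(retired, ())):
        # Assumption: a node never holds two replicas of the same partition.
        candidates = [n for n in survivors if partition not in survivors[n]]
        if not candidates:
            continue  # every survivor already has it; replica count drops
        # Assumption: place on the least-loaded survivor, ties broken by name.
        target = min(candidates, key=lambda n: (len(survivors[n]), n))
        survivors[target].add(partition)
    return survivors
```

For example, with nodes `n1` to `n4` where `n1` holds replicas of `p1` and `p2`, retiring `n1` moves those two replicas to whichever survivors are least loaded and do not already hold them.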
10 Citations
17 Claims
1. A computer-implemented method comprising:
receiving, by a topology manager of a distributed computing system, notification that a destination computing node in the distributed computing system is not responding to a communication request, the topology manager being implemented on a data partition of the distributed computing system, the distributed computing system comprising a plurality of computing nodes, the plurality of nodes comprising the destination computing node;
determining, by the topology manager, that the destination computing node is dead and/or has a loss of communication with one or more other computing nodes in the plurality of computing nodes by querying at least a subset of other computing nodes of the plurality of computing nodes regarding liveness of the destination computing node and receiving confirmation from a quorum of the queried computing nodes;
retiring, by the topology manager in response to the determining, the destination computing node, the retiring causing the destination computing node to become a retired computing node; and
causing, by the topology manager, a load balancing of replicas of data partitions in the distributed computing system to compensate for loss of the retired computing node, the load balancing comprising re-assigning one or more of the replicas of data partitions among one or more surviving computing nodes in the plurality of computing nodes.
Dependent claims: 2, 3, 4, 5, 6
7. A computer program product comprising a non-transitory machine readable medium storing instructions that, when executed by at least one programmable processor, cause the at least one programmable processor to perform operations comprising:
receiving, by a topology manager of a distributed computing system, notification that a destination computing node in the distributed computing system is not responding to a communication request, the topology manager being implemented on a data partition of the distributed computing system, the distributed computing system comprising a plurality of computing nodes, the plurality of nodes comprising the destination computing node;
determining, by the topology manager, that the destination computing node is dead and/or has a loss of communication with one or more other computing nodes in the plurality of computing nodes by querying at least a subset of other computing nodes of the plurality of computing nodes regarding liveness of the destination computing node and receiving confirmation from a quorum of the queried computing nodes;
retiring, by the topology manager in response to the determining, the destination computing node, the retiring causing the destination computing node to become a retired computing node; and
causing, by the topology manager, a load balancing of replicas of data partitions in the distributed computing system to compensate for loss of the retired computing node, the load balancing comprising re-assigning one or more of the replicas of data partitions among one or more surviving computing nodes in the plurality of computing nodes.
Dependent claims: 8, 9, 10, 11
12. A system comprising:
computer hardware configured to perform operations comprising:
receiving, by a topology manager of a distributed computing system, notification that a destination computing node in the distributed computing system is not responding to a communication request, the topology manager being implemented on a data partition of the distributed computing system, the distributed computing system comprising a plurality of computing nodes, the plurality of nodes comprising the destination computing node;
determining, by the topology manager, that the destination computing node is dead and/or has a loss of communication with one or more other computing nodes in the plurality of computing nodes by querying at least a subset of other computing nodes of the plurality of computing nodes regarding liveness of the destination computing node and receiving confirmation from a quorum of the queried computing nodes;
retiring, by the topology manager in response to the determining, the destination computing node, the retiring causing the destination computing node to become a retired computing node; and
causing, by the topology manager, a load balancing of replicas of data partitions in the distributed computing system to compensate for loss of the retired computing node, the load balancing comprising re-assigning one or more of the replicas of data partitions among one or more surviving computing nodes in the plurality of computing nodes.
Dependent claims: 13, 14, 15, 16, 17
Specification