Topology manager for failure detection in a distributed computing system

US 10,341,168 B2
Filed: 04/18/2017
Issued: 07/02/2019
Est. Priority Date: 04/18/2017
Status: Active Grant

Abstract

A topology manager implemented on a data partition of a distributed computing system can be notified that a destination computing node in the distributed computing system is not responding to a communication request. Upon determining that the destination computing node is dead and/or has a loss of communication with one or more, and optionally a majority of other computing nodes in a plurality of computing nodes of the distributed computing system, the topology manager can retire the destination computing node and cause a load balancing of replicas of data partitions in the distributed computing system to compensate for loss of the retired computing node.
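The determination and retirement steps summarized above can be illustrated with a minimal sketch. It is not the patented implementation: the names `confirm_dead`, `handle_unresponsive`, and `is_reachable`, the state labels `"alive"` and `"retired"`, and the choice of a strict majority as the quorum are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable

# Hypothetical probe: is_reachable(peer, target) returns True when `peer`
# reports that it can still communicate with `target`.
Probe = Callable[[str, str], bool]


def confirm_dead(target: str, peers: Iterable[str], is_reachable: Probe) -> bool:
    """Return True if a strict majority of the queried peers cannot reach `target`.

    Treating the quorum as a strict majority of the queried peers is an assumption.
    """
    queried = [p for p in peers if p != target]
    if not queried:
        return False  # nobody to ask, so no quorum can confirm the failure
    dead_votes = sum(1 for p in queried if not is_reachable(p, target))
    return dead_votes > len(queried) // 2


def handle_unresponsive(target: str, nodes: Dict[str, str], is_reachable: Probe) -> bool:
    """React to a notification that `target` did not answer a request.

    `nodes` maps node name to a state label. Returns True if the node was retired.
    """
    peers = [n for n, state in nodes.items() if state == "alive" and n != target]
    if confirm_dead(target, peers, is_reachable):
        nodes[target] = "retired"  # a load balancing of replicas would follow here
        return True
    return False
```

In this sketch a single dissenting peer does not block retirement, but a tie or a minority of failure reports leaves the node in place.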

17 Claims

1. A computer-implemented method comprising:
- receiving, by a topology manager of a distributed computing system, notification that a destination computing node in the distributed computing system is not responding to a communication request, the topology manager being implemented on a data partition of the distributed computing system, the distributed computing system comprising a plurality of computing nodes, the plurality of nodes comprising the destination computing node;
  
  determining, by the topology manager, that the destination computing node is dead and/or has a loss of communication with one or more other computing nodes in the plurality of computing nodes by querying at least a subset of other computing nodes of the plurality of computing nodes regarding liveness of the destination computing node and receiving confirmation from a quorum of the queried computing nodes;
  
  retiring, by the topology manager in response to the determining, the destination computing node, the retiring causing the destination computing node to become a retired computing node; and
  
  causing, by the topology manager, a load balancing of replicas of data partitions in the distributed computing system to compensate for loss of the retired computing node, the load balancing comprising re-assigning one or more of the replicas of data partitions among one or more surviving computing nodes in the plurality of computing nodes.
  - 2. A computer-implemented method as in claim 1, wherein the determining further comprises the topology manager identifying that the destination computing node has not sent a status message to the topology manager in longer than a preset number of messaging periods.
  - 3. A computer-implemented method as in claim 1, wherein the notification is received from a source computing node of the plurality of computing nodes, the source computing node having sent the communication request to the destination computing node.
  - 4. A computer-implemented method as in claim 1, wherein the notification is received from a client machine accessing the distributed computing system, the client machine having sent the communication request to the destination computing node.
  - 5. A computer-implemented method as in claim 1, wherein the topology manager stores information about computing nodes in the plurality of computing nodes and a current state of these computing nodes.
  - 6. A computer-implemented method as in claim 1, wherein the one or more other computing nodes in the plurality of computing nodes comprises a majority of the plurality of computing nodes.
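The status-message check of claim 2 and the stored node state of claim 5 could be sketched as follows. This is only an illustration under stated assumptions: `NodeRecord`, `missed_too_many`, the state labels, and the use of seconds as the time unit are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass
class NodeRecord:
    """What a topology manager might keep per node (field names are assumptions)."""
    state: str             # e.g. "alive" or "retired"
    last_status_at: float  # arrival time of the last status message, in seconds


def missed_too_many(record: NodeRecord, now: float,
                    period_s: float, max_missed_periods: int) -> bool:
    """Return True if the node has been silent for longer than the preset
    number of messaging periods."""
    return (now - record.last_status_at) > max_missed_periods * period_s
```

With a 10-second messaging period and a limit of 3 periods, a node last heard from at t = 100 s is flagged only after t = 130 s has passed.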

7. A computer program product comprising a non-transitory machine readable medium storing instructions that, when executed by at least one programmable processor, cause the at least one programmable processor to perform operations comprising:
- receiving, by a topology manager of a distributed computing system, notification that a destination computing node in the distributed computing system is not responding to a communication request, the topology manager being implemented on a data partition of the distributed computing system, the distributed computing system comprising a plurality of computing nodes, the plurality of nodes comprising the destination computing node;
  
  determining, by the topology manager, that the destination computing node is dead and/or has a loss of communication with one or more other computing nodes in the plurality of computing nodes by querying at least a subset of other computing nodes of the plurality of computing nodes regarding liveness of the destination computing node and receiving confirmation from a quorum of the queried computing nodes;
  
  retiring, by the topology manager in response to the determining, the destination computing node, the retiring causing the destination computing node to become a retired computing node; and
  
  causing, by the topology manager, a load balancing of replicas of data partitions in the distributed computing system to compensate for loss of the retired computing node, the load balancing comprising re-assigning one or more of the replicas of data partitions among one or more surviving computing nodes in the plurality of computing nodes.
  - 8. A computer program product as in claim 7, wherein the determining further comprises the topology manager identifying that the destination computing node has not sent a status message to the topology manager in longer than a preset number of messaging periods.
  - 9. A computer program product as in claim 7, wherein the notification is received from a source computing node of the plurality of computing nodes, the source computing node having sent the communication request to the destination computing node.
  - 10. A computer program product as in claim 7, wherein the notification is received from a client machine accessing the distributed computing system, the client machine having sent the communication request to the destination computing node.
  - 11. A computer program product as in claim 7, wherein the topology manager stores information about computing nodes in the plurality of computing nodes and a current state of these computing nodes.

12. A system comprising:
- computer hardware configured to perform operations comprising:
  
  receiving, by a topology manager of a distributed computing system, notification that a destination computing node in the distributed computing system is not responding to a communication request, the topology manager being implemented on a data partition of the distributed computing system, the distributed computing system comprising a plurality of computing nodes, the plurality of nodes comprising the destination computing node;
  
  determining, by the topology manager, that the destination computing node is dead and/or has a loss of communication with one or more other computing nodes in the plurality of computing nodes by querying at least a subset of other computing nodes of the plurality of computing nodes regarding liveness of the destination computing node and receiving confirmation from a quorum of the queried computing nodes;
  
  retiring, by the topology manager in response to the determining, the destination computing node, the retiring causing the destination computing node to become a retired computing node; and
  
  causing, by the topology manager, a load balancing of replicas of data partitions in the distributed computing system to compensate for loss of the retired computing node, the load balancing comprising re-assigning one or more of the replicas of data partitions among one or more surviving computing nodes in the plurality of computing nodes.
  - 13. A system as in claim 12, wherein the determining further comprises the topology manager identifying that the destination computing node has not sent a status message to the topology manager in longer than a preset number of messaging periods.
  - 14. A system as in claim 12, wherein the notification is received from a source computing node of the plurality of computing nodes, the source computing node having sent the communication request to the destination computing node.
  - 15. A system as in claim 12, wherein the notification is received from a client machine accessing the distributed computing system, the client machine having sent the communication request to the destination computing node.
  - 16. A system as in claim 12, wherein the topology manager stores information about computing nodes in the plurality of computing nodes and a current state of these computing nodes.
  - 17. A system as in claim 12, wherein the computer hardware comprises a programmable processor and a machine readable medium storing instructions that, when executed by the programmable processor, cause the programmable processor to perform at least some of the operations.
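The load-balancing step recited in each independent claim, re-assigning replicas of data partitions among surviving nodes, might look like the sketch below. The claims do not prescribe a placement policy; the least-loaded-survivor rule, the function name `reassign_replicas`, and the partition-to-nodes mapping are assumptions chosen for illustration.

```python
from typing import Dict, List


def reassign_replicas(assignment: Dict[str, List[str]], retired: str,
                      survivors: List[str]) -> Dict[str, List[str]]:
    """Move every replica held by `retired` to the least-loaded survivor.

    `assignment` maps a partition id to the nodes holding one of its replicas.
    A survivor that already holds a replica of the partition is skipped.
    """
    if not survivors:
        raise ValueError("no surviving nodes to take over replicas")
    # Count replicas currently held by each survivor.
    load = {n: 0 for n in survivors}
    for holders in assignment.values():
        for n in holders:
            if n in load:
                load[n] += 1
    result: Dict[str, List[str]] = {}
    for partition, holders in sorted(assignment.items()):
        kept = [n for n in holders if n != retired]
        if retired in holders:
            candidates = [n for n in survivors if n not in kept]
            if candidates:
                # Ties are broken by node name to keep the result deterministic.
                target = min(candidates, key=lambda n: (load[n], n))
                kept.append(target)
                load[target] += 1
        result[partition] = kept
    return result
```

Partitions that never had a replica on the retired node are returned unchanged, so only the lost replicas are moved.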

Current Assignee: SAP SE
Original Assignee: SAP SE
Inventors: Schreter, Ivan
Primary Examiner(s): Scheibel, Robert C
Application Number: US15/490,819
Publication Number: US 20180302270A1
Time in Patent Office: 805 Days
CPC Class Codes
H04L 41/0668   by dynamic selection of rec...
H04L 41/0677   Localisation of faults
H04L 41/12   Discovery or management of ...
H04L 43/0805   by checking availability
H04L 43/0811   by checking connectivity
H04L 47/125   by balancing the load, e.g....
