Node cluster synchronization

US 10,212,226 B2
Filed: 01/16/2014
Issued: 02/19/2019
Est. Priority Date: 01/16/2014
Status: Active Grant

Abstract

Systems and methods associated with computing cluster synchronization are disclosed. One example method includes periodically requesting timing values from a set of nodes in a computing cluster. The method also includes receiving timing values from members of the set of nodes. The method also includes providing a synchronization value to members of the set of nodes. The synchronization value may be generated based on the timing values. Additionally, the synchronization value may be used to order events across the members.
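
The round described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Node` and `Coordinator` classes are hypothetical, direct method calls stand in for network messages, and the rule that the synchronization value is one more than the largest value seen so far is an assumption. The abstract says only that the value is generated based on the timing values.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster node with a local logical clock and an epoch."""
    name: str
    clock: int = 0
    epoch: int = 0
    log: list = field(default_factory=list)

    def perform_event(self, label: str) -> None:
        # Each local event advances the clock and is tagged with the current epoch.
        self.clock += 1
        self.log.append((self.epoch, self.clock, label))

    def report_timing_value(self) -> int:
        return self.clock

    def receive_sync(self, sync_value: int) -> bool:
        # The node moves to the new epoch on receipt and acknowledges.
        self.epoch = sync_value
        return True


class Coordinator:
    """Hypothetical synchronizer that runs request/provide rounds."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.last_sync = 0

    def run_round(self) -> int:
        # Step 1: request and receive timing values.
        timing_values = [n.report_timing_value() for n in self.nodes]
        # Step 2 (assumed rule): exceed every timing value and every earlier sync value.
        sync_value = max([self.last_sync, *timing_values]) + 1
        # Step 3: provide the value and collect acknowledgements.
        acks = [n.receive_sync(sync_value) for n in self.nodes]
        if not all(acks):
            raise RuntimeError("a node did not acknowledge the synchronization value")
        self.last_sync = sync_value
        return sync_value


nodes = [Node("a"), Node("b")]
coordinator = Coordinator(nodes)
nodes[0].perform_event("write x")
nodes[0].perform_event("write y")
nodes[1].perform_event("write z")
first = coordinator.run_round()
nodes[1].perform_event("write w")
second = coordinator.run_round()
```

Under the assumed rule the synchronization values grow from round to round, so events logged between two rounds share an epoch number that can later be compared across nodes.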

14 Claims

1. A computer-implemented method, comprising:
- periodically requesting timing values from a set of nodes in a computing cluster that are implementing a distributed service by locally performing events;
- receiving timing values from members of the set of nodes; and
- providing a synchronization value to members of the set of nodes identifying a global sequence number of an epoch to which each node then transitions at time of receipt of the synchronization value, each node transitioning to the epoch within a global uncertainty period between a time of providing the synchronization value and a time of last acknowledgement of receipt of the synchronization value from the nodes, in which exactly when each node has transitioned to the epoch is unknown, the synchronization value generated based on the timing values;
- performing a node failure remedy responsive to a node failure, comprising:
  - determining an order of the events that occurred prior to an epoch in which the node failure occurred, including determining with guaranteed certainty that a first event occurred at a first node before a second event occurred at a second node when the global sequence number of the epoch in which the first event occurred is less than one plus the global sequence number of the epoch in which the second event occurred; and
  - re-performing the ordered events to recover from the node failure.
- Dependent claims (2, 3, 4, 5, 6, 7)
  - 2. The computer-implemented method of claim 1, where the timing values are Lamport clock values.
  - 3. The computer-implemented method of claim 1, comprising:
    - receiving the acknowledgements of receipt of the synchronization value from members of the set of nodes.
  - 4. The computer-implemented method of claim 3, where
    - the timing values are requested using a multicast signal and received via unicast replies,
    - the synchronization value is provided using a multicast signal, and
    - the acknowledgements are received via unicast acknowledgments.
  - 5. The computer-implemented method of claim 3, where the node failure remedy is taken when a timing value is not received from a member of the set of nodes within a certain period of time after requesting the timing values, or when an acknowledgement is not received from a member of the set of nodes within a certain period of time after providing the synchronization value.
  - 6. The computer-implemented method of claim 5, where the node failure remedy further comprises one of, resending the request for the timing value from the member of the set of nodes and resending the synchronization value to the member of the set of nodes.
  - 7. The computer-implemented method of claim 5, where the node failure remedy comprises:
    - identifying the epoch during which the node failure occurred based on the synchronization value,
    - wherein re-performing the ordered events restores members of the set of nodes to a state from prior to the epoch.
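
The timeout and resend behavior recited in claims 5 and 6 might look like the sketch below. It is illustrative only: `collect_replies`, the `ask` transport hook, and the `max_resends` parameter are hypothetical names, and a return value of `None` stands in for "no reply arrived within the period of time". The same helper could collect either timing values or acknowledgements.

```python
from typing import Callable, Dict, Iterable, Optional, Set, Tuple


def collect_replies(
    members: Iterable[str],
    ask: Callable[[str], Optional[int]],
    max_resends: int = 1,
) -> Tuple[Dict[str, int], Set[str]]:
    """Ask each member for a value, resending to members that stay silent.

    `ask(member)` is a hypothetical transport hook: it returns the reply, or
    None when no reply arrived within the timeout period.
    Returns the replies received and the set of members treated as failed.
    """
    replies: Dict[str, int] = {}
    pending = list(members)
    for _attempt in range(1 + max_resends):
        still_pending = []
        for member in pending:
            value = ask(member)
            if value is None:
                still_pending.append(member)  # timed out, retry on the next pass
            else:
                replies[member] = value
        pending = still_pending
        if not pending:
            break
    return replies, set(pending)


# Simulated transport: "b" drops the first request, "c" never answers.
attempts: Dict[str, int] = {}


def flaky_ask(member: str) -> Optional[int]:
    attempts[member] = attempts.get(member, 0) + 1
    if member == "c":
        return None
    if member == "b" and attempts[member] == 1:
        return None
    return {"a": 7, "b": 5}[member]


replies, failed = collect_replies(["a", "b", "c"], flaky_ask, max_resends=1)
```

In this run the resend recovers the value from `b`, while `c` stays silent and ends up in `failed`, which is the point at which a further remedy would be needed.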

8. A cluster system, comprising:
- a set of computing nodes, each computing node having a Lamport clock, the nodes to each locally perform events to realize a service distributed over the nodes; and
- a synchronization logic to request Lamport clock values from the Lamport clock of each computing node and to provide a synchronization value to the nodes in the set of nodes based on the Lamport clock values, the synchronization value identifying a global sequence number of an epoch to which each node then transitions at time of receipt of the synchronization value, each node transitioning to the epoch within a global uncertainty period between a time of providing the synchronization value and a time of last acknowledgement of receipt of the synchronization value from the nodes, in which exactly when each node has transitioned to the epoch is unknown; and
- a recover logic to perform a remedial action responsive to a node failure, by determining an order of the events that occurred prior to the epoch in which the node failure occurred, including determining with guaranteed certainty that a first event occurred at a first node before a second event occurred at a second node when the global sequence number of the epoch in which the first event occurred is less than one plus the global sequence number of the epoch in which the second event occurred,
  wherein the nodes re-perform the ordered events to recover from the node failure.
- Dependent claims (9, 10, 11, 12)
  - 9. The system of claim 8, where the recovery logic is to perform the remedial action when the synchronization logic fails to obtain a Lamport clock value from a node after a period of time.
  - 10. The system of claim 8, where the synchronization logic periodically alternates between requesting the Lamport clock values and providing the synchronization value to create epoch divisions that the nodes use to create a partial ordering of events occurring on the nodes.
  - 11. The system of claim 10, where the synchronization logic introduces a delay between requesting actions and providing actions to increase sizes of epochs.
  - 12. The system of claim 8, where the synchronization logic
    - requests the Lamport clock values in response to a signal received from an upstream synchronization logic,
    - provides data to the upstream synchronization logic based on the Lamport clock values, and
    - provides the synchronization value to the nodes based on data received from the upstream synchronization logic,
    - where the data is generated based on the Lamport clock values.
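
The two-level arrangement in claim 12 can be pictured with the sketch below. All names (`LocalSync`, `UpstreamSync`, `on_upstream_signal`, `on_upstream_data`) are hypothetical, and two choices are assumptions made for the example: the data sent upstream is the largest local Lamport clock value, and the upstream logic answers with one more than the largest value it has seen.

```python
class LocalSync:
    """Hypothetical synchronization logic serving one group of nodes."""

    def __init__(self, clocks):
        self.clocks = dict(clocks)  # node name -> current Lamport clock value
        self.epochs = {name: 0 for name in self.clocks}

    def on_upstream_signal(self) -> int:
        # Request the local Lamport clock values and summarize them for upstream.
        # Using the maximum as the summary is an assumption.
        return max(self.clocks.values())

    def on_upstream_data(self, sync_value: int) -> None:
        # Provide the synchronization value to every local node.
        for name in self.epochs:
            self.epochs[name] = sync_value


class UpstreamSync:
    """Hypothetical upstream synchronization logic coordinating several groups."""

    def __init__(self, children):
        self.children = list(children)
        self.last_sync = 0

    def run_round(self) -> int:
        # Signal each downstream logic and gather its summary.
        summaries = [child.on_upstream_signal() for child in self.children]
        # Assumed rule: exceed every summary and every earlier sync value.
        self.last_sync = max([self.last_sync, *summaries]) + 1
        # Send the result back down so each group can pass it to its nodes.
        for child in self.children:
            child.on_upstream_data(self.last_sync)
        return self.last_sync


rack1 = LocalSync({"a": 4, "b": 9})
rack2 = LocalSync({"c": 2})
top = UpstreamSync([rack1, rack2])
sync_value = top.run_round()
```

Each group only exchanges one summary and one result with the upstream logic per round, so the upstream logic never contacts individual nodes directly in this sketch.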

13. A non-transitory computer-readable medium storing computer-executable instructions that when executed by a computer cause the computer to:
- periodically provide a global sequence number to a set of nodes that are implementing a distributed service by locally performing events, the global sequence number identifying a global sequence number of an epoch to which each node then transitions at time of receipt of the synchronization value, each node transitioning to the epoch within a global uncertainty period between a time of providing the synchronization value and a time of last acknowledgment of receipt of the synchronization value from the nodes, in which exactly when each node has transitioned to the epoch is unknown, the global sequence number generated as a function of Lamport clock values obtained from members of the set of nodes;
- perform a node failure remedy responsive to a node failure, by:
  - determining an order of the events that occurred prior to the epoch in which the node failure occurred, including determining with guaranteed certainty that a first event occurred at a first node before a second event occurred at a second node when the global sequence number of the epoch in which the first event occurred is less than one plus the global sequence number of the epoch in which the second event occurred; and
- cause the nodes to re-perform the ordered events to recover from the node failure.
- Dependent claims (14)
  - 14. The non-transitory computer-readable medium of claim 13, where global sequence numbers identify sequential epochs during which events occur on members of the set of nodes.
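
The recovery step shared by the independent claims, ordering the events from before the failure epoch and re-performing them, could be sketched as below. The `Event` record and `events_to_replay` function are hypothetical. The sketch orders events by epoch sequence number only; it does not reproduce the claims' specific condition for guaranteed ordering, and the tie-break inside an epoch (node name, then local clock value) is an arbitrary assumption chosen to make the replay deterministic.

```python
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    """Hypothetical record of one locally performed event."""
    epoch: int        # global sequence number of the epoch the event occurred in
    node: str         # node that performed the event
    local_clock: int  # that node's Lamport clock value at the event
    label: str


def events_to_replay(events: Iterable[Event], failure_epoch: int) -> List[Event]:
    """Return events from epochs before `failure_epoch`, ordered for replay."""
    prior = [e for e in events if e.epoch < failure_epoch]
    # Epoch number gives the cross-node order; the remaining keys are an
    # assumed tie-break for events that share an epoch.
    return sorted(prior, key=lambda e: (e.epoch, e.node, e.local_clock))


events = [
    Event(4, "b", 3, "debit"),
    Event(3, "a", 1, "open"),
    Event(5, "a", 6, "close"),
    Event(3, "b", 2, "credit"),
]
replay = events_to_replay(events, failure_epoch=5)
labels = [e.label for e in replay]
```

Here the event logged in epoch 5 is left out because the failure occurred in that epoch, and the remaining three events are replayed in epoch order.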

Current Assignee: Hewlett Packard Enterprise Development LP (Hewlett-Packard Enterprise Company)
Original Assignee: Hewlett Packard Enterprise Development LP (Hewlett-Packard Enterprise Company)
Inventors: Johnson, Charles Stuart
Primary Examiner(s): Yeung, Mang Hang

Application Number: US15/106,447
Publication Number: US 20170006097A1
Time in Patent Office: 1,860 Days
Field of Search: 370508
US Class Current:
CPC Class Codes:
G06F 11/1482   by means of middleware or O...
G06F 11/1691   using a quantum
H04L 65/611   for multicast or broadcast ...
H04L 67/1095   Replication or mirroring of...
H04L 7/0016   correction of synchronizati...
H04L 7/042   Detectors therefor, e.g. co...
