Node cluster synchronization
First Claim
Patent Images
1. A computer-implemented method, comprising:
- periodically requesting timing values from a set of nodes in a computing cluster that are implementing a distributed service by locally performing events;
receiving timing values from members of the set of nodes; and
providing a synchronization value to members of the set of nodes identifying a global sequence number of an epoch to which each node then transitions at time of receipt of the synchronization value, each node transitioning to the epoch within a global uncertainty period between a time of providing the synchronization value and a time of last acknowledgement of receipt of the synchronization value from the nodes, in which exactly when each node has transitioned to the epoch is unknown,the synchronization value generated based on the timing values;
performing a node failure remedy responsive to a node failure, comprising;
determining an order of the events that occurred prior to an epoch in which the node failure occurred, including determining with guaranteed certainty that a first event occurred at a first node before a second event occurred at a second node when the global sequence number of the epoch in which the first event occurred is less than one plus the global sequence number of the epoch in which the second event occurred; and
re-performing the ordered events to recover from the node failure.
2 Assignments
0 Petitions
Accused Products
Abstract
Systems and methods associated with computing cluster synchronization are disclosed. One example method includes periodically requesting timing values from a set of notes in a computing cluster. The method also includes receiving timing values from members of the set of nodes. The method also includes providing a synchronization value to members of the set of nodes. The synchronization value may be generated based on the timing values. Additionally, the synchronization value may be used to order events across the members.
17 Citations
14 Claims
-
1. A computer-implemented method, comprising:
-
periodically requesting timing values from a set of nodes in a computing cluster that are implementing a distributed service by locally performing events; receiving timing values from members of the set of nodes; and providing a synchronization value to members of the set of nodes identifying a global sequence number of an epoch to which each node then transitions at time of receipt of the synchronization value, each node transitioning to the epoch within a global uncertainty period between a time of providing the synchronization value and a time of last acknowledgement of receipt of the synchronization value from the nodes, in which exactly when each node has transitioned to the epoch is unknown, the synchronization value generated based on the timing values; performing a node failure remedy responsive to a node failure, comprising; determining an order of the events that occurred prior to an epoch in which the node failure occurred, including determining with guaranteed certainty that a first event occurred at a first node before a second event occurred at a second node when the global sequence number of the epoch in which the first event occurred is less than one plus the global sequence number of the epoch in which the second event occurred; and re-performing the ordered events to recover from the node failure. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A cluster system, comprising:
-
a set of computing nodes, each computing node having a Lamport clock, the nodes to each locally perform events to realize a service distributed over the nodes; and a synchronization logic to request Lamport clock values from the Lamport clock of each computing node and to provide a synchronization value to the nodes in the set of nodes based on the Lamport clock values, the synchronization value identifying a global sequence number of an epoch to which each node then transitions at time of receipt of the synchronization value, each node transitioning to the epoch within a global uncertainty period between a time of providing the synchronization value and a time of last acknowledgement of receipt of the synchronization value from the nodes, in which exactly when each node has transitioned to the epoch is unknown; and a recover logic to perform a remedial action responsive to a node failure, by determining an order of the events that occurred prior to the epoch in which the node failure occurred, including determining with guaranteed certainty that a first event occurred at a first node before a second event occurred at a second node when the global sequence number of the epoch in which the first event occurred is less than one plus the global sequence number of the epoch in which the second event occurred, wherein the nodes re-perform the ordered events to recover from the node failure. - View Dependent Claims (9, 10, 11, 12)
-
-
13. A non-transitory computer-readable medium storing computer-executable instructions that when executed by a computer cause the computer to:
-
periodically provide a global sequence number to a set of nodes that are implementing a distributed service by locally performing events, the global sequence number identifying a global sequence number of an epoch to which each node then transitions at time of receipt of the synchronization value, each node transitioning to the epoch within a global uncertainty period between a time of providing the synchronization value and a time of last acknowledgment of receipt of the synchronization value from the nodes, in which exactly when each node has transitioned to the epoch is unknown, the global sequence number generated as a function of Lamport clock values obtained from members of the set of nodes; perform a node failure remedy responsive to a node failure, by; determining an order of the events that occurred prior to the epoch in which the node failure occurred, including determining with guaranteed certainty that a first event occurred at a first node before a second event occurred at a second node when the global sequence number of the epoch in which the first event occurred is less than one plus the global sequence number of the epoch in which the second event occurred; and cause the nodes to re-perform the ordered events to recover from the node failure. - View Dependent Claims (14)
-
Specification