Method and system for coordinated multiple cluster failover

US 7,757,116 B2
Filed: 04/04/2007
Issued: 07/13/2010
Est. Priority Date: 04/04/2007
Status: Active Grant

Abstract

A hypercluster is a cluster of clusters. Each cluster has associated with it one or more resource groups, and independent node failures within the clusters are handled by platform-specific clustering software. The management of coordinated failovers across dependent or independent resources running on heterogeneous platforms is contemplated. A hypercluster manager running on all of the nodes in a cluster communicates with the platform-specific clustering software regarding any failure conditions and, utilizing a rule-based decision-making system, determines actions to take on the node. A plug-in extends exit points definable in non-hypercluster clustering technologies. The failure notification is passed to other affected resource groups in the hypercluster.

42 Claims

1. A method for coordinating availability of data processing resources between a first cluster of nodes each controlled by a respective first cluster manager and a second cluster of nodes each controlled by a respective second cluster manager, the method comprising:
- receiving a disruption signal from an exit program of one of the first cluster managers, the disruption signal being representative of a disruption event associated with a specific one of the nodes of the first cluster, the disruption signal being received by a first hypercluster manager of the specific one of the nodes of the first cluster;
  
  deriving a local action code from a hypercluster rules list, the local action code corresponding to the disruption event and containing a cluster activation sequence for regulating the operation of one of the nodes of the second cluster; and
  
  transmitting the local action code to the second cluster of nodes each including a second hypercluster manager for execution of the cluster activation sequence;
  
  wherein the first cluster of nodes and the second cluster of nodes each function autonomously and communicate with each other by the local action code.
- - 2. The method of claim 1, further comprising:
    - synchronizing views of the first and second clusters of nodes amongst each of the first and second clusters of nodes, view synchrony being determined by a token.
  - 3. The method of claim 1, wherein deriving the local action code includes:
    - translating the disruption event to a universal event code with a translation table, the translation table including a first sequence of disruption events and a second sequence of universal event codes correlated thereto.
  - 4. The method of claim 3, wherein the universal event code is referenced to derive the local action code from the hypercluster rules list.
  - 5. The method of claim 1, wherein the cluster activation sequence includes dependencies therebetween, the dependencies establishing the timing and order of the cluster activation sequence.
  - 6. The method of claim 5, wherein transmitting the local action code to the active cluster manager is in response to receiving a confirmation code representative of completion of one step in the cluster activation sequence as defined by the dependencies.
  - 7. The method of claim 1, wherein the disruption event is received in response to one or more of the nodes of the first cluster being deactivated.
  - 8. The method of claim 7, wherein an inactive cluster manager generates the disruption event, the inactive cluster manager running on the one or more of the nodes of the first cluster being deactivated.
  - 9. The method of claim 1, wherein the disruption event is generated on the first cluster becoming unavailable.
  - 10. The method of claim 1, wherein:
    - the nodes of the first cluster run a first operating system; and
      
      the nodes of the second cluster run a second operating system different from the first operating system.
  - 11. The method of claim 1, wherein the active cluster manager is running on one of the nodes of the first cluster, the cluster activation sequence being executed thereon.
  - 12. The method of claim 11, further comprising relaying the disruption event to the second cluster.
  - 13. The method of claim 1, wherein the active cluster manager is running on one of the nodes of the second cluster, the cluster activation sequence being executed thereon.
  - 14. The method of claim 1, further comprising:
    - receiving updates to the hypercluster rules list from a hypercluster configuration module; and
      
      applying the updates to the hypercluster rules list.
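
One illustrative reading of claims 1, 3 and 4 is sketched below; it is not the patented implementation. It assumes the translation table and the hypercluster rules list can each be modelled as a simple lookup. All names here (`TRANSLATION_TABLE`, `RULES`, the event strings, the step names, and the `transmit` callback) are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass

# Hypothetical translation table (claim 3): platform-specific disruption
# events mapped to universal event codes. All entries are invented.
TRANSLATION_TABLE = {
    ("platform_a", "NODE_END"): "U_NODE_DOWN",
    ("platform_b", "resource_offline"): "U_RESOURCE_DOWN",
}


@dataclass(frozen=True)
class LocalActionCode:
    """Action derived from the rules list, carrying an activation sequence."""
    target_cluster: str
    activation_sequence: tuple


# Hypothetical hypercluster rules list (claim 4), keyed by universal event code.
RULES = {
    "U_NODE_DOWN": LocalActionCode("cluster_2", ("start_db", "start_app")),
    "U_RESOURCE_DOWN": LocalActionCode("cluster_2", ("start_app",)),
}


def derive_local_action(platform, disruption_event):
    """Translate a disruption event, then look up the matching rule.

    Returns None when either the event or the rule is unknown.
    """
    universal_code = TRANSLATION_TABLE.get((platform, disruption_event))
    if universal_code is None:
        return None
    return RULES.get(universal_code)


def handle_disruption(platform, disruption_event, transmit):
    """Derive the action code and pass it to a caller-supplied transport."""
    action = derive_local_action(platform, disruption_event)
    if action is not None:
        transmit(action.target_cluster, action)
    return action
```

Passing the transport in as a callback keeps the sketch independent of any particular messaging layer, which the claims leave open.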

15. An apparatus implemented within a local node for coordinating availability of data processing resources between the local node in a first cluster of local nodes each including a cluster manager and a remote node, the apparatus comprising:
- a local event receiver for capturing local disruption event signals generated in response to local cluster heartbeat signals exchanged between the first cluster of nodes by the respective cluster managers, the local cluster heartbeat signals being representative of the status and condition of at least one of the nodes of the first cluster;
  
  a hypercluster event translator for translating the local disruption event signals to a first universal event code;
  
  a hypercluster event receiver for capturing a second universal event code from the remote node, the second universal event code being representative of a disruption event associated with the remote node;
  
  a hypercluster heartbeat receiver for capturing hypercluster heartbeat signals from the second cluster of nodes, the hypercluster heartbeat signals being representative of the status and condition of the second cluster; and
  
  a router for correlating a one of the first and second universal event codes to a cluster activation sequence operative to regulate the operation of at least one of the nodes in accordance with a set of hypercluster rules;
  
  wherein the local cluster heartbeat signals are communicated within the first cluster and within the second cluster, and the hypercluster heartbeat signals are communicated between the first cluster and the second cluster.
- - 16. The apparatus of claim 15, wherein the nodes each run a cluster manager specific to the respective platform of the node.
  - 17. The apparatus of claim 16, wherein the local disruption events are generated by the cluster manager running on the local node.
  - 18. The apparatus of claim 16, wherein the remote disruption event is generated by a router connected to the cluster manager on the remote node.
  - 19. The apparatus of claim 15, wherein the cluster activation sequence includes a local action code representative of modifying the operational state of the local node.
  - 20. The apparatus of claim 19, further comprising:
    - a local action handler in communication with the cluster manager of the local node for transmitting the local action code from the router thereto.
  - 21. The apparatus of claim 15, further comprising:
    - a hypercluster event dispatcher in communication with the router for relaying the universal event code to the remote node.
  - 22. The apparatus of claim 15, wherein the remote node is associated with the first cluster.
  - 23. The apparatus of claim 15, wherein the remote node is associated with the second cluster.
  - 24. The apparatus of claim 15, wherein the hypercluster rules list defines dependencies within the cluster activation sequence.
  - 25. The apparatus of claim 15, further comprising:
    - a rule propagation module for receiving changes to the hypercluster rules list from the remote node, the rule propagation module further verifying the changes and applying the changes to the hypercluster rules list.
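
The hypercluster heartbeat receiver of claim 15 might be approximated as follows. The claim says only that heartbeat signals represent the status and condition of the second cluster. Treating a missed heartbeat as a timeout is an assumption of this sketch, and the class name, method names, and timeout value are hypothetical.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per cluster and flags silent clusters.

    Assumption: a cluster counts as unavailable once no heartbeat has
    been recorded for more than `timeout_seconds`.
    """

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock  # injectable so the logic can be tested
        self._last_seen = {}

    def record(self, cluster_id):
        """Note that a heartbeat from `cluster_id` has just arrived."""
        self._last_seen[cluster_id] = self._clock()

    def unavailable(self):
        """Return the sorted ids of clusters whose heartbeat has lapsed."""
        now = self._clock()
        return sorted(
            cluster_id
            for cluster_id, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

In a fuller design, each id returned by `unavailable()` would be turned into a universal event code and handed to the router. That wiring is omitted here.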

26. An article of manufacture comprising a program storage medium readable by a computer, the medium tangibly embodying one or more programs of instructions executable by the computer to perform a method for coordinating availability of data processing resources between a first cluster of nodes each controlled by a respective first cluster manager and a second cluster of nodes each controlled by a respective second cluster manager, the method comprising:
- receiving a disruption signal from an exit program of one of the first cluster managers, the disruption signal being representative of a disruption event associated with a specific one of the nodes of the first cluster, the disruption signal being received by a first hypercluster manager of the specific one of the nodes of the first cluster;
  
  deriving a local action code from a hypercluster rules list, the local action code corresponding to the disruption event and containing a cluster activation sequence for regulating the operation of one of the nodes of the second cluster; and
  
  transmitting the local action code to the second cluster of nodes each including a second hypercluster manager for execution of the cluster activation sequence;
  
  wherein the first cluster of nodes and the second cluster of nodes each function autonomously and communicate with each other by the local action code.
- - 27. The article of manufacture of claim 26, wherein deriving the local action code includes:
    - translating the disruption event to a universal event code with a translation table, the translation table including a first sequence of disruption events and a second sequence of universal event codes correlated thereto.
  - 28. The article of manufacture of claim 26, wherein the local action code defines a dependency with one of the nodes of the second cluster, the dependency establishing the timing and order of the cluster activation sequence.
  - 29. The article of manufacture of claim 26, wherein the disruption event is received in response to one or more of the nodes of the first cluster being deactivated.
  - 30. The article of manufacture of claim 26, wherein the disruption event is generated on the first cluster becoming unavailable.
  - 31. The article of manufacture of claim 26 wherein the active cluster manager is running on one of the nodes of the first cluster, the cluster activation sequence being executed thereon.
  - 32. The article of manufacture of claim 26, wherein the active cluster manager is running on one of the nodes of the second cluster, the cluster activation sequence being executed thereon.
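
Claims 5, 6 and 28 describe dependencies that establish the timing and order of the cluster activation sequence, with a confirmation code gating the next step. A minimal sketch follows, assuming the dependencies form a directed acyclic graph. It uses Python's standard `graphlib.TopologicalSorter`; the step names and the `execute` callback are hypothetical.

```python
from graphlib import TopologicalSorter


def run_activation_sequence(dependencies, execute):
    """Run steps in dependency order, stopping at the first unconfirmed step.

    `dependencies` maps each step to the set of steps that must finish
    first. `execute(step)` returns True to confirm completion.
    Returns the list of steps that were confirmed, in execution order.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    completed = []
    while sorter.is_active():
        for step in sorter.get_ready():
            if not execute(step):
                return completed  # no confirmation: later steps never start
            sorter.done(step)  # confirmation unlocks dependent steps
            completed.append(step)
    return completed
```

For example, with `{"start_app": {"start_db"}, "start_db": {"mount_storage"}, "mount_storage": set()}`, the steps run as `mount_storage`, `start_db`, `start_app`. If `start_db` is not confirmed, `start_app` is never attempted.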

33. An apparatus for coordinating availability of data processing resources between a local node in a first cluster and a remote node in a second cluster, the apparatus comprising:
- a local event receiver for capturing local disruption events;
  
  an event translator for translating the local disruption event to a universal event code;
  
  a hypercluster event receiver for capturing remote disruption events from one of the nodes of the second cluster;
  
  a router for correlating the universal event code to a cluster activation sequence in accordance with a set of hypercluster rules; and
  
  a rule propagation module for receiving changes to the hypercluster rules list from the remote node, the rule propagation module further verifying the changes and applying the changes to the hypercluster rules list.
- - 34. The apparatus of claim 33, wherein the nodes each run a cluster manager specific to the respective platform of the node.
  - 35. The apparatus of claim 34, wherein the local disruption events are generated by the cluster manager running on the local node.
  - 36. The apparatus of claim 34, wherein the remote disruption event is generated by a router connected to the cluster manager on the remote node.
  - 37. The apparatus of claim 33, wherein the cluster activation sequence includes a local action code representative of modifying the operational state of the local node.
  - 38. The apparatus of claim 37, further comprising:
    - a local action handler in communication with the cluster manager of the local node for transmitting the local action code from the router thereto.
  - 39. The apparatus of claim 33, further comprising:
    - a hypercluster event dispatcher in communication with the router for relaying the universal event code to the remote node.
  - 40. The apparatus of claim 33, wherein the remote node is associated with the first cluster.
  - 41. The apparatus of claim 33, wherein the remote node is associated with the second cluster.
  - 42. The apparatus of claim 33, wherein the hypercluster rules list defines dependencies within the cluster activation sequence.
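
The rule propagation module of claims 25 and 33 receives changes to the rules list, verifies them, and applies them. The claims do not say what verification involves. The sketch below assumes two simple checks (the event code is known and the activation sequence is non-empty), and every name in it is hypothetical.

```python
def verify_changes(changes, known_event_codes):
    """Return a list of problems found in the proposed rule changes."""
    errors = []
    for code, sequence in changes.items():
        if code not in known_event_codes:
            errors.append(f"unknown event code: {code}")
        if not sequence:
            errors.append(f"empty activation sequence for: {code}")
    return errors


def apply_changes(rules, changes, known_event_codes):
    """Return an updated copy of `rules`; the input is never modified.

    Raises ValueError listing every problem if verification fails.
    """
    errors = verify_changes(changes, known_event_codes)
    if errors:
        raise ValueError("; ".join(errors))
    updated = dict(rules)
    updated.update(changes)
    return updated
```

Returning a new mapping rather than mutating in place means a rejected update leaves the active rules list untouched.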

Current Assignee: Precisely Software Incorporated
Original Assignee: Vision Solutions Incorporated (Precisely Software Incorporated)
Inventors: Simpson, Scott; Brown, David E.
Primary Examiner(s): Chu, Gabriel L
Application Number: US11/732,670
Publication Number: US 20080250267A1
Time in Patent Office: 1,196 Days
Field of Search: 714/23
US Class Current: 714/13
CPC Class Codes:
G06F 11/1482 by means of middleware or O...
G06F 11/2033 switching over of hardware ...
G06F 11/2035 without idle spare hardware
G06F 11/2038 with a single idle spare pr...
G06F 11/2041 with more than one idle spa...
G06F 11/2046 where the redundant compone...
