Controlled take over of services by remaining nodes of clustered computing system

US 6,789,213 B2
Filed: 01/10/2000
Issued: 09/07/2004
Est. Priority Date: 01/10/2000
Status: Expired due to Term

Abstract

Improved techniques for controlled take over of services for clustered computing systems are disclosed. The improved techniques can be implemented to allow one sub-cluster of the clustered computing system to safely take over services of one or more other sub-clusters in the clustered computing system. Accordingly, if the clustered computing system is fragmented into two or more disjoint sub-clusters, one sub-cluster can safely take over services of the one or more other sub-clusters after the one or more other sub-clusters have been shut down. As a result, the clustered computing system can continue to safely provide services even when the clustered computing system has been fragmented into two or more disjoint sub-clusters due to an operational failure.

18 Claims

1. A method for taking over services by one or more remaining sub-clusters of a clustered computing system from one or more other sub-clusters of the clustered computing system after the one or more other sub-clusters have been shutdown, said method comprising:
- (a) attempting to determine whether a sub-cluster of the clustered computing system is to remain active such that the processes of the sub-cluster may temporarily continue;
  
  (b) initiating shutdown of the sub-cluster when said attempting (a) does not determine within a first predetermined amount of time that the sub-cluster is to remain active;
  
  (c) to allow the processors of the sub-cluster to temporarily continue, delaying for a second predetermined amount of time after the first predetermined amount of time expires when said attempting (a) determines within the first predetermined amount of time that the sub-cluster is to remain active such that data corruption is avoided; and
  
  (d) taking over services of one or more other sub-clusters of the clustered computing system by one or more remaining sub-clusters after said delaying (c) for the second predetermined amount of time.
- Dependent claims (2, 3, 4, 5):
  - 2. A method as recited in claim 1, wherein said method is performed by each of the sub-clusters of the clustered computing system.
  - 3. A method as recited in claim 1, said method further comprising:
  - 4. A method as recited in claim 1, wherein the first predetermined amount of time represents an upper estimate of time required to determine whether a sub-cluster is to remain active.
  - 5. A method as recited in claim 1, wherein the second predetermined amount of time represents an upper estimate for a delay typically encountered in initiating said attempting (a) after an error condition has actually occurred.
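
The timing relationship in steps (a) through (d) of claim 1 can be illustrated with a short sketch. This is a hypothetical illustration only, not the patented implementation: `Decision`, `plan_takeover`, and the parameter names are invented here, and the function merely computes which action follows and when, instead of driving real timers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    action: str     # "shutdown" or "takeover" (labels invented for this sketch)
    at_time: float  # seconds after the sub-cluster began its attempt


def plan_takeover(decided_active_at: Optional[float],
                  first_timeout: float,
                  second_delay: float) -> Decision:
    """Return the action a sub-cluster would take under the claim 1 timing.

    decided_active_at: time at which the sub-cluster determined it is to
    remain active, or None if no such determination was ever made.
    """
    # Step (b): no positive determination within the first period.
    if decided_active_at is None or decided_active_at > first_timeout:
        return Decision("shutdown", first_timeout)
    # Steps (c) and (d): wait for the first period to expire, then delay
    # for the second period before taking over services.
    return Decision("takeover", first_timeout + second_delay)


print(plan_takeover(3.0, first_timeout=5.0, second_delay=2.0))
print(plan_takeover(None, first_timeout=5.0, second_delay=2.0))
```

Note that in this sketch a sub-cluster that decides early still waits out the full first period plus the second period, so take over never begins before a slower sub-cluster has had time to shut itself down.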

6. A method for taking over services by one or more remaining sub-clusters of a clustered computing system from one or more other sub-clusters of the clustered computing system after the one or more other sub-clusters have been shutdown, said method comprising:
- (a) determining whether one or more computing nodes in a cluster have become one or more non-responsive nodes;
  
  (b) starting a first timer when said determining (a) determines that one or more of the computing nodes in the cluster have become one or more non-responsive nodes, the first timer having a first duration;
  
  (c) attempting to determine whether a sub-cluster vote is at least a majority of a total votes available, the sub-cluster vote representing votes for a sub-cluster of one or more computing nodes, the sub-cluster representing a portion of the cluster that remains responsive;
  
  (d) initiating shutdown of the one or more non-responsive computing nodes of the sub-cluster when said attempting (c) does not determine within the first duration of the first timer that the sub-cluster vote is at least a majority of the total votes available;
  
  (e) to allow the processors of the sub-cluster to temporarily continue, starting a second timer after the first timer expires when the said attempting (c) has determined within the first duration of the first timer that the sub-cluster vote is at least a majority of the total votes available, the second timer having a second duration such that data corruption is avoided; and
  
  (f) taking over services from the one or more non-responsive nodes by at least one of the remaining computing nodes of the sub-cluster after the second timer expires.
- Dependent claims (7, 8, 9):
  - 7. A method as recited in claim 6, wherein said method is performed by each of the sub-clusters of the clustered computing system.
  - 8. A method as recited in claim 6, wherein the (a) determining further comprises:
  - 9. A method as recited in claim 6, wherein the method further comprises:
    determining whether there is at least one service of the one or more non-responsive nodes that needs to be taken over.
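
The vote test in step (c) of claim 6 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the per-node vote table, and the reading of "majority" as strictly more than half of the total votes are all choices made for this example and are not taken from the patent text.

```python
from typing import Dict, Set


def subcluster_has_majority(votes: Dict[str, int], responsive: Set[str]) -> bool:
    """Check whether the responsive nodes hold a majority of all votes.

    votes: hypothetical mapping of node name to the votes assigned to it.
    responsive: names of the nodes in the sub-cluster that still respond.
    """
    total_votes = sum(votes.values())
    subcluster_vote = sum(v for node, v in votes.items() if node in responsive)
    # Assumption: a majority means strictly more than half of the total.
    return 2 * subcluster_vote > total_votes


cluster_votes = {"node-a": 1, "node-b": 1, "node-c": 1}
print(subcluster_has_majority(cluster_votes, {"node-a", "node-b"}))
print(subcluster_has_majority(cluster_votes, {"node-c"}))
```

Under this reading, two disjoint sub-clusters can never both hold a majority, which is why at most one of them proceeds to the second timer.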

10. A computer readable medium including computer program code for taking over services by one or more remaining sub-clusters of a clustered computing system from one or more other sub-clusters of the clustered computing system after the one or more other sub-clusters have been shutdown, said computer readable medium comprising:
- computer program code for attempting to determine whether a sub-cluster of the clustered computing system is to remain active such that the processes of the sub-cluster may temporarily continue;
  
  computer program code for initiating shutdown of the sub-cluster when said computer program code for attempting does not determine within a first predetermined amount of time that the sub-cluster is to remain active;
  
  computer program code for delaying for a second predetermined amount of time after the first predetermined amount of time expires when said computer program code for attempting determines within the first predetermined amount of time that the sub-cluster is to remain active such that the processes of the sub-cluster may temporarily continue; and
  
  computer program code to allow the processors of the sub-cluster to temporarily continue for taking over services of one or more other sub-clusters of the clustered computing system by one or more remaining sub-clusters after said computer program code for delaying has delayed for the second predetermined amount of time such that data corruption is avoided.
- Dependent claims (11, 12):
  - 11. A computer readable medium as recited in claim 10, wherein the computer readable medium is provided for each of the sub-clusters of the clustered computing system.
  - 12. A method as recited in claim 10, said method further comprising:

13. A computer readable medium for taking over services by one or more remaining sub-clusters of a clustered computing system from one or more other sub-clusters of the clustered computing system after the one or more other sub-clusters have been shutdown, said computer readable medium comprising:
- computer program code for determining whether one or more computing nodes in a cluster have become one or more non-responsive nodes;
  
  computer program code for starting a first timer when said computer program code for determining determines that one or more of the computing nodes in the cluster have become one or more non-responsive nodes, the first timer having a first duration;
  
  computer program code for attempting to determine whether a sub-cluster vote is at least a majority of a total votes available, the sub-cluster vote representing votes for a sub-cluster of one or more computing nodes, the sub-cluster representing a portion of the cluster that remains responsive;
  
  computer program code for initiating shutdown of the one or more non-responsive computing nodes of the sub-cluster when said computer program code for attempting does not determine within the first duration of the first timer that the sub-cluster vote is at least a majority of the total votes available;
  
  computer program code for starting a second timer to allow the processors of the sub-cluster to temporarily continue after the first timer expires when the said computer program code for attempting has determined within the first duration of the first timer that the sub-cluster vote is at least a majority of the total votes available, the second timer having a second duration such that data corruption is avoided; and
  
  computer program code for taking over services from the one or more non-responsive nodes by at least one of the remaining computing nodes of the sub-cluster after the second timer expires.
- Dependent claims (14):
  - 14. A computer readable medium as recited in claim 13, wherein the computer readable medium is provided for each of the sub-clusters of the clustered computing system.

15. A clustered computing system, comprising:
- a cluster of computing nodes having at least two computing nodes; and
  
  an integrity protector, comprising;
  
  a cluster error detector operable to detect a formation of disjoint sub-clusters;
  
  a cluster shutdown controller operable to;
  
  attempt to determine whether a sub-cluster of the clustered computing system is to remain active such that the processes of the sub-cluster may temporarily continue;
  
  initiate shutdown of the sub-cluster when the attempt does not determine within a first predetermined amount of time that the sub-cluster is to remain active; and
  
  to allow the processors of the sub-cluster to temporarily continue, delay for a second predetermined amount of time after the first predetermined amount of time expires when said attempt determines within the first predetermined amount of time that the sub-cluster is to remain active such that data corruption is avoided; and
  
  a takeover controller operable to;
  
  take over services of one or more other sub-clusters of the clustered computing system by one or more remaining sub-clusters after the delay for the second predetermined amount of time.
- Dependent claims (16, 17, 18):
  - 16. The clustered computing system of claim 15, wherein the cluster shutdown controller is operable to:
  - 17. The clustered computing system of claim 15, wherein the first predetermined amount of time represents an upper estimate of time required to determine whether a sub-cluster is to remain active.
  - 18. The clustered computing system of claim 15, wherein the second predetermined amount of time represents an upper estimate for a delay typically encountered in initiating the attempt after an error condition has actually occurred.
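
Claim 15 recites a cluster error detector that detects the formation of disjoint sub-clusters, without saying how. One plausible way to picture that step, offered here purely as an assumption, is to group nodes by which peers they can still reach. The function name `find_subclusters` and the representation of connectivity as node pairs are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def find_subclusters(nodes: Iterable[str],
                     links: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group nodes into sets that can still reach each other.

    links: pairs of nodes that can currently communicate (hypothetical input;
    every node named in a link must also appear in nodes).
    """
    adjacency: Dict[str, Set[str]] = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen: Set[str] = set()
    groups: List[Set[str]] = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from start.
        group: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups


groups = find_subclusters(["a", "b", "c", "d"], [("a", "b"), ("c", "d")])
print(len(groups) > 1)  # more than one group means the cluster has fragmented
```

More than one group in the result corresponds to the fragmented state described in the abstract; a single group means every node can still reach the others, directly or through peers.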

Current Assignee: Oracle America, Inc. (Oracle Corporation)
Original Assignee: Sun Microsystems Incorporated (Oracle Corporation)
Inventors: Murphy, Declan J.; Kumar, Krishna; Hisgen, Andrew L.
Primary Examiner(s): Beausoliel, Robert
Assistant Examiner(s): Maskulinski, Michael C
Application Number: US09/479,485
Publication Number: US 20030159084A1
Time in Patent Office: 1,702 Days
Field of Search: 714/4, 714/11-13, 714/10, 709/102-105, 709/220, 709/221, 718/102-105
US Class Current: 714/13
CPC Class Codes:
G06F 11/203   using migration
G06F 11/2035   without idle spare hardware
G06F 11/2046   where the redundant compone...
