Fault-tolerance and fault-containment models for zoning clustered application silos into continuous availability and high availability zones in clustered systems during recovery and maintenance

US 20070006015A1
Filed: 06/29/2005
Published: 01/04/2007
Est. Priority Date: 06/29/2005
Status: Active Grant


Abstract

A cluster recovery and maintenance system, method and computer program product for use in a server cluster having plural nodes implementing a server tier in a client-server computing architecture. A first group of N active nodes each run a software stack comprising a cluster management tier and a cluster application tier that actively provides services on behalf of one or more client applications running in a client application tier on the clients. A second group of M spare nodes each run a software stack comprising a cluster management tier and a cluster application tier that does not actively provide services on behalf of client applications. First and second zones in the cluster are determined in response to an active node membership change involving one or more active nodes departing from or being added to the first group as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node. The first zone is a fault tolerant zone comprising all active nodes that remain operational. The second zone is a fault containment zone comprising all active nodes participating in the membership change and at least a corresponding number of spare nodes to the extent that the membership change involves a node departure. During recovery and maintenance, fast recovery/maintenance and high application availability are implemented in the fault containment zone, while continuous application availability is maintained in the fault tolerant zone.
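
The zone determination described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the names `Zones` and `determine_zones`, the string node IDs, and the policy of picking exactly one spare per departing node (in sorted order) are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zones:
    fault_tolerant: frozenset     # active nodes that remain operational
    fault_containment: frozenset  # nodes in the membership change, plus spares


def determine_zones(active, spares, departing=(), joining=()):
    """Partition a cluster into two zones after an active membership change.

    `active` and `spares` are collections of node IDs. `departing` lists
    active nodes that failed, became unreachable, or are under maintenance;
    `joining` lists nodes being added to the active group.
    """
    active, departing, joining = set(active), set(departing), set(joining)
    unknown = departing - active
    if unknown:
        raise ValueError(f"departing nodes are not active: {sorted(unknown)}")
    if len(departing) > len(spares):
        raise ValueError("not enough spare nodes to cover the departures")

    # Assumption: one spare per departing node; the abstract permits more.
    chosen_spares = set(sorted(spares)[: len(departing)])

    return Zones(
        fault_tolerant=frozenset(active - departing - joining),
        fault_containment=frozenset(departing | joining | chosen_spares),
    )


zones = determine_zones(
    active={"n1", "n2", "n3", "n4"},
    spares={"s1", "s2"},
    departing={"n3"},
)
print(sorted(zones.fault_tolerant))     # ['n1', 'n2', 'n4']
print(sorted(zones.fault_containment))  # ['n3', 's1']
```

In this sketch the unaffected nodes never appear in the fault containment zone, which is what lets recovery work proceed there without touching them.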

144 Citations

20 Claims

1. In a server cluster having plural nodes, a cluster recovery and maintenance method comprising:
- determining first and second zones in said cluster in response to an active node membership change involving one or more active nodes departing from or being added to said cluster as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node;
  
  said first zone being a fault tolerant zone comprising all active nodes that are operational;
  
  said second zone being a fault containment zone comprising all active nodes participating in said membership change and some number of spare nodes in the event that said membership change involves a node departure;
  
  implementing fast recovery/maintenance and high cluster application availability in said fault containment zone during cluster recovery or maintenance; and
  
  maintaining continuous application cluster availability in said fault tolerant zone during cluster recovery or maintenance.
  - 2. A method in accordance with claim 1 wherein transactional isolation is maintained between said fault tolerant group and said fault containment group by failing over client application services provided by any departing node(s) in said fault containment group to a corresponding number of said spare nodes in said fault containment group.
  - 3. A method in accordance with claim 1 wherein transactional continuity is maintained in said fault tolerant zone by guaranteeing cluster membership integrity in said fault tolerant zone as a result of exploiting a request/response-based cluster recovery communication protocol to monitor node membership integrity in said fault tolerant zone and removing from said fault tolerant zone a node that fails to provide a node response pursuant to said request/response-based communication protocol.
  - 4. A method in accordance with claim 1 wherein transactional continuity is maintained in said fault tolerant zone by guaranteeing communication continuity in said fault tolerant zone through absolute node identification independent of cluster size and retention of pre-recovery/maintenance communication sessions in the fault-tolerant zone.
  - 5. A method in accordance with claim 1 wherein transactional continuity is maintained in said fault tolerant zone by way of concurrent protocol scoping in which application failover and recovery protocols are limited to cluster application and cluster management tiers of said fault containment group and normal transactional application protocols are limited to cluster application and cluster management tiers of said fault tolerant group.
  - 6. A method in accordance with claim 1 further including rejoining one or more departing nodes after repair or maintenance into said cluster as spare nodes.
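
The request/response membership monitoring recited in claim 3 might look something like the sketch below. The claim does not specify message formats or timeouts, so `monitor_membership`, the `send_request` callback, the `timeout` parameter, and the simulated transport `fake_send` are all assumptions introduced for illustration.

```python
def monitor_membership(zone, send_request, timeout=1.0):
    """Probe every node in `zone` and split it into responders and evictees.

    `send_request(node, timeout)` is a caller-supplied function that returns
    True when the node answers and raises TimeoutError or ConnectionError
    (or returns False) when it does not.
    """
    responsive, evicted = set(), set()
    for node in sorted(zone):
        try:
            ok = bool(send_request(node, timeout))
        except (TimeoutError, ConnectionError):
            ok = False
        (responsive if ok else evicted).add(node)
    return responsive, evicted


# Simulated transport for the example: node "n2" never answers.
def fake_send(node, timeout):
    if node == "n2":
        raise TimeoutError(f"{node} did not respond within {timeout}s")
    return True


members, removed = monitor_membership({"n1", "n2", "n4"}, fake_send)
print(sorted(members))  # ['n1', 'n4']
print(sorted(removed))  # ['n2']
```

A node that fails to respond is dropped from the fault tolerant zone, so the remaining members always agree on who is in the zone.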

7. A computer program product for use in a server cluster having plural nodes, comprising:
- one or more data storage media;
  
  means recorded on said data storage media for programming a data processing platform to operate as by:
  
  determining first and second zones in said cluster in response to an active node membership change involving one or more active nodes departing from or being added to said cluster as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node;
  
  said first zone being a fault tolerant zone comprising all active nodes that are operational;
  
  said second zone being a fault containment zone comprising all active nodes participating in said membership change and some number of spare nodes in the event that said membership change involves a node departure;
  
  implementing fast recovery/maintenance and high cluster application availability in said fault containment zone during cluster recovery or maintenance; and
  
  maintaining continuous application cluster availability in said fault tolerant zone during cluster recovery or maintenance.
  - 8. A program product in accordance with claim 7 wherein transactional isolation is maintained between said fault tolerant group and said fault containment group by failing over client application services provided by any departing node(s) in said fault containment group to a corresponding number of said spare nodes in said fault containment group.
  - 9. A program product in accordance with claim 7 wherein transactional continuity is maintained in said fault tolerant zone by guaranteeing cluster membership integrity in said fault tolerant zone as a result of exploiting a request/response-based cluster recovery communication protocol to monitor node membership integrity in said fault tolerant zone and removing from said fault tolerant zone a node that fails to provide a node response pursuant to said request/response-based communication protocol.
  - 10. A program product in accordance with claim 7 wherein transactional continuity is maintained in said fault tolerant zone by guaranteeing communication continuity in said fault tolerant zone through absolute node identification independent of cluster size and retention of pre-recovery/maintenance communication sessions in the fault-tolerant zone.
  - 11. A program product in accordance with claim 7 wherein transactional continuity is maintained in said fault tolerant zone by way of concurrent protocol scoping in which application failover and recovery protocols are limited to cluster application and cluster management tiers of said fault containment group and normal transactional application protocols are limited to cluster application and cluster management tiers of said fault tolerant group.
  - 12. A program product in accordance with claim 7 further including rejoining one or more departing nodes after repair or maintenance into said cluster as spare nodes.
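
Claims 8 and 12 describe failing services over to spare nodes and later rejoining repaired nodes as spares. A minimal sketch of that bookkeeping is shown here; the function names `fail_over` and `rejoin_as_spare`, the `services` mapping, and the sorted pairing of departing nodes with spares are hypothetical and not taken from the patent.

```python
def fail_over(services, spares, departing):
    """Move services from departing nodes onto an equal number of spares.

    `services` maps node ID -> list of client application services.
    Returns (new services map, remaining spares); inputs are not mutated.
    """
    departing = sorted(departing)
    pool = sorted(spares)
    if len(pool) < len(departing):
        raise ValueError("not enough spare nodes")

    # Entries for nodes that are not departing are copied through unchanged.
    new_services = {n: list(s) for n, s in services.items() if n not in departing}
    for node, spare in zip(departing, pool):
        new_services[spare] = list(services.get(node, []))
    return new_services, set(pool[len(departing):])


def rejoin_as_spare(spares, repaired_node):
    """Return the spare pool with a repaired node added back to it."""
    return set(spares) | {repaired_node}


services = {"n1": ["fs-a"], "n2": ["fs-b"], "n3": ["fs-c"]}
services, spares = fail_over(services, {"s1", "s2"}, departing={"n3"})
print(services)        # {'n1': ['fs-a'], 'n2': ['fs-b'], 's1': ['fs-c']}
print(sorted(spares))  # ['s2']

spares = rejoin_as_spare(spares, "n3")
print(sorted(spares))  # ['n3', 's2']
```

Because the assignments of `n1` and `n2` are never rewritten, the failover stays confined to the departing node and its spare.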

13. A server cluster having plural nodes adapted to provide cluster application services to clients that access said cluster, comprising:
- program logic adapted to determine first and second zones in said cluster in response to an active node membership change involving one or more active nodes departing from or being added to said cluster as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node;
  
  said first zone being a fault tolerant zone comprising all active nodes that are operational;
  
  said second zone being a fault containment zone comprising all active nodes participating in said membership change and some number of spare nodes in the event that said membership change involves a node departure;
  
  program logic adapted to implement fast recovery/maintenance and high cluster application availability in said fault containment zone during cluster recovery or maintenance; and
  
  program logic adapted to maintain continuous application cluster availability in said fault tolerant zone during cluster recovery or maintenance.
  - 14. A system in accordance with claim 13 wherein transactional isolation is maintained between said fault tolerant group and said fault containment group by program logic adapted to transition client application services provided by any departing node(s) in said fault containment group to a corresponding number of said spare nodes in said fault containment group.
  - 15. A system in accordance with claim 13 wherein transactional continuity is maintained in said fault tolerant zone by program logic adapted to guarantee cluster membership integrity in said fault tolerant zone as a result of exploiting a request/response-based cluster recovery communication protocol to monitor node membership integrity in said fault tolerant zone and removing from said fault tolerant zone a node that fails to provide a node response pursuant to said request/response-based communication protocol.
  - 16. A system in accordance with claim 13 wherein transactional continuity is maintained in said fault tolerant zone by program logic adapted to guarantee communication continuity in said fault tolerant zone through absolute node identification independent of cluster size and retention of pre-recovery/maintenance communication sessions in the fault-tolerant zone.
  - 17. A system in accordance with claim 13 wherein transactional continuity is maintained in said fault tolerant zone by way of program logic adapted to implement concurrent protocol scoping in which application failover and recovery protocols are limited to cluster application and cluster management tiers of said fault containment group and normal transactional application protocols are limited to cluster application and cluster management tiers of said fault tolerant group.
  - 18. A system in accordance with claim 13 further including program logic adapted to rejoin one or more departing nodes after repair or maintenance into said cluster as spare nodes.
  - 19. A system in accordance with claim 13 wherein said program logic is embodied in a cluster leader node in said server cluster.
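
Concurrent protocol scoping (claim 17) could be pictured as a routing rule applied by whichever node holds the program logic, for example the cluster leader of claim 19. The sketch below is one hypothetical way to express that rule; the `Protocol` enum, its member names, and the functions `recipients` and `in_scope` are assumptions, since the claims do not name concrete protocols.

```python
from enum import Enum


class Protocol(Enum):
    TRANSACTION = "transaction"  # normal transactional application traffic
    FAILOVER = "failover"        # application failover traffic
    RECOVERY = "recovery"        # cluster recovery traffic


def recipients(protocol, fault_tolerant, fault_containment):
    """Return the set of nodes a protocol is allowed to reach."""
    if protocol is Protocol.TRANSACTION:
        return set(fault_tolerant)
    # Failover and recovery protocols stay inside the containment zone.
    return set(fault_containment)


def in_scope(protocol, target, fault_tolerant, fault_containment):
    """True if `target` may receive messages of the given protocol."""
    return target in recipients(protocol, fault_tolerant, fault_containment)


ft, fc = {"n1", "n2", "n4"}, {"n3", "s1"}
print(in_scope(Protocol.TRANSACTION, "n1", ft, fc))  # True
print(in_scope(Protocol.RECOVERY, "n1", ft, fc))     # False
print(in_scope(Protocol.FAILOVER, "s1", ft, fc))     # True
```

Both kinds of traffic can run at the same time because their recipient sets are disjoint.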

20. A computer program product for use in a server cluster having plural nodes implementing a server tier in a client-server computing architecture to provide cluster recovery, comprising:
- one or more data storage media;
  
  means recorded on said data storage media for programming a data processing platform to operate as by:
  
  determining a first group of N active nodes that each run a software stack comprising a cluster management tier and a cluster application tier that actively provides services on behalf of one or more client applications running in a client application tier on said clients;
  
  determining a second group of M spare nodes that each run a software stack comprising a cluster management tier and a cluster application tier that does not actively provide services on behalf of client applications;
  
  determining first and second zones in said cluster in response to an active node membership change involving one or more active nodes departing from or being added to said first group as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node;
  
  said first zone being a fault tolerant zone comprising all active nodes that are operational;
  
  said second zone being a fault containment zone comprising all active nodes participating in said membership change and at least a corresponding number of said spare nodes in the event that said membership change involves a node departure;
  
  implementing fast recovery/maintenance and high cluster application availability in said fault containment zone during recovery or maintenance by:
  
  failing over the client application services provided by any departing nodes in said fault containment group to at least a corresponding number of said spare nodes in said fault containment group in order to maintain transactional isolation between said fault tolerant group and said fault containment group; and
  
  maintaining continuous application cluster availability in said fault tolerant zone during recovery or maintenance by:
  
  using a request/response-based cluster recovery communication protocol to monitor node membership integrity in said fault tolerant zone and remove nodes that fail to provide a node response, thus guaranteeing cluster membership integrity;
  
  guaranteeing communication continuity in said fault tolerant zone through absolute node identification independent of cluster size and retention of pre-recovery communication sessions in the fault-tolerant zone; and
  
  implementing concurrent protocol scoping to limit application failover and recovery protocols to the cluster application and cluster management tiers of said fault containment group and normal transactional application protocols to the cluster application and cluster management tiers of said fault tolerant group.
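
Claim 20 does not spell out how absolute node identification and session retention work. One possible reading is that each node keeps a fixed ID that is never renumbered when the cluster grows or shrinks, so sessions keyed by those IDs survive a membership change. The sketch below illustrates that reading only; `retain_sessions`, the tuple-keyed `sessions` map, and the string session values are hypothetical.

```python
def retain_sessions(sessions, fault_tolerant):
    """Keep sessions whose endpoints all remain in the fault tolerant zone.

    `sessions` maps a (node_id, node_id) pair to a session object. Node IDs
    are assumed to be absolute: fixed per node, independent of cluster size.
    """
    zone = set(fault_tolerant)
    return {pair: s for pair, s in sessions.items() if set(pair) <= zone}


sessions = {
    ("n1", "n2"): "sess-12",
    ("n1", "n3"): "sess-13",
    ("n2", "n4"): "sess-24",
}
kept = retain_sessions(sessions, {"n1", "n2", "n4"})
print(kept)  # {('n1', 'n2'): 'sess-12', ('n2', 'n4'): 'sess-24'}

# Contrast with positional numbering, which shifts when a node departs.
before = sorted({"n1", "n2", "n3", "n4"})
after = sorted({"n1", "n2", "n4"})
print(before.index("n4"), after.index("n4"))  # 3 2
```

With positional indices the session to `n4` would need re-keying after `n3` leaves; with fixed IDs it is retained as-is.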

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Rao, Sudhir; Jackson, Bruce
Granted Patent: US 8,195,976 B2
US Class Current: 714/4
CPC Class Codes:
G06F 11/1425   by reconfiguration of node ...
G06F 11/2028   eliminating a faulty proces...
G06F 11/2041   with more than one idle spa...
H04L 41/0654   using network fault recover...
