Fault-tolerance and fault-containment models for zoning clustered application silos into continuous availability and high availability zones in clustered systems during recovery and maintenance
Abstract
A cluster recovery and maintenance system, method and computer program product for use in a server cluster having plural nodes implementing a server tier in a client-server computing architecture. A first group of N active nodes each run a software stack comprising a cluster management tier and a cluster application tier that actively provides services on behalf of one or more client applications running in a client application tier on the clients. A second group of M spare nodes each run a software stack comprising a cluster management tier and a cluster application tier that does not actively provide services on behalf of client applications. First and second zones in the cluster are determined in response to an active node membership change involving one or more active nodes departing from or being added to the first group as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node. The first zone is a fault tolerant zone comprising all active nodes that remain operational. The second zone is a fault containment zone comprising all active nodes participating in the membership change and at least a corresponding number of spare nodes to the extent that the membership change involves a node departure. During recovery and maintenance, fast recovery/maintenance and high application availability are implemented in the fault containment zone, while continuous application availability is maintained in the fault tolerant zone.
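The zone determination described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the names `Zones` and `determine_zones` are hypothetical, nodes are represented as plain strings, and the sketch assumes exactly one spare per departing node (the abstract requires only "at least a corresponding number").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zones:
    """Hypothetical container for the two zones named in the abstract."""
    fault_tolerant: frozenset
    fault_containment: frozenset


def determine_zones(active, spares, departing=(), joining=()):
    """Partition the cluster after an active node membership change.

    active    -- active nodes before the change
    spares    -- spare nodes available for failover
    departing -- active nodes that failed, became unreachable, or are
                 being taken out for maintenance
    joining   -- nodes being added to the active group
    """
    active, departing, joining = set(active), set(departing), set(joining)
    unknown = departing - active
    if unknown:
        raise ValueError(f"departing nodes are not active: {sorted(unknown)}")

    # Nodes participating in the membership change.
    changing = departing | joining

    # Fault tolerant zone: active nodes that remain operational.
    fault_tolerant = active - changing

    # Assumption: reserve one spare per departing node, chosen in sorted
    # order so the result is deterministic.
    available = sorted(spares)
    if len(available) < len(departing):
        raise ValueError("not enough spare nodes to cover the departures")
    chosen = available[:len(departing)]

    # Fault containment zone: changing nodes plus the reserved spares.
    return Zones(frozenset(fault_tolerant), frozenset(changing | set(chosen)))
```

For example, with active nodes n1 to n4, spares s1 and s2, and n4 departing, the fault tolerant zone is {n1, n2, n3} and the fault containment zone is {n4, s1}. A join with no departure reserves no spares.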
20 Claims
1. In a server cluster having plural nodes, a cluster recovery and maintenance method comprising:
determining first and second zones in said cluster in response to an active node membership change involving one or more active nodes departing from or being added to said cluster as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node;
said first zone being a fault tolerant zone comprising all active nodes that are operational;
said second zone being a fault containment zone comprising all active nodes participating in said membership change and some number of spare nodes in the event that said membership change involves a node departure;
implementing fast recovery/maintenance and high cluster application availability in said fault containment zone during cluster recovery or maintenance; and
maintaining continuous application cluster availability in said fault tolerant zone during cluster recovery or maintenance.
7. A computer program product for use in a server cluster having plural nodes, comprising:
one or more data storage media;
means recorded on said data storage media for programming a data processing platform to operate as by:
determining first and second zones in said cluster in response to an active node membership change involving one or more active nodes departing from or being added to said cluster as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node;
said first zone being a fault tolerant zone comprising all active nodes that are operational;
said second zone being a fault containment zone comprising all active nodes participating in said membership change and some number of spare nodes in the event that said membership change involves a node departure;
implementing fast recovery/maintenance and high cluster application availability in said fault containment zone during cluster recovery or maintenance; and
maintaining continuous application cluster availability in said fault tolerant zone during cluster recovery or maintenance.
13. A server cluster having plural nodes adapted to provide cluster application services to clients that access said cluster, comprising:
program logic adapted to determine first and second zones in said cluster in response to an active node membership change involving one or more active nodes departing from or being added to said cluster as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node;
said first zone being a fault tolerant zone comprising all active nodes that are operational;
said second zone being a fault containment zone comprising all active nodes participating in said membership change and some number of spare nodes in the event that said membership change involves a node departure;
program logic adapted to implement fast recovery/maintenance and high cluster application availability in said fault containment zone during cluster recovery or maintenance; and
program logic adapted to maintain continuous application cluster availability in said fault tolerant zone during cluster recovery or maintenance.
20. A computer program product for use in a server cluster having plural nodes implementing a server tier in a client-server computing architecture to provide cluster recovery, comprising:
one or more data storage media;
means recorded on said data storage media for programming a data processing platform to operate as by:
determining a first group of N active nodes that each run a software stack comprising a cluster management tier and a cluster application tier that actively provides services on behalf of one or more client applications running in a client application tier on said clients;
determining a second group of M spare nodes that each run a software stack comprising a cluster management tier and a cluster application tier that does not actively provide services on behalf of client applications;
determining first and second zones in said cluster in response to an active node membership change involving one or more active nodes departing from or being added to said first group as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node;
said first zone being a fault tolerant zone comprising all active nodes that are operational;
said second zone being a fault containment zone comprising all active nodes participating in said membership change and at least a corresponding number of said spare nodes in the event that said membership change involves a node departure;
implementing fast recovery/maintenance and high cluster application availability in said fault containment zone during recovery or maintenance by:
failing over the client application services provided by any departing nodes in said fault containment group to at least a corresponding number of said spare nodes in said fault containment group in order to maintain transactional isolation between said fault tolerant group and said fault containment group; and
maintaining continuous application cluster availability in said fault tolerant zone during recovery or maintenance by:
using a request/response-based cluster recovery communication protocol to monitor node membership integrity in said fault tolerant zone and remove nodes that fail to provide a node response, thus guaranteeing cluster membership integrity;
guaranteeing communication continuity in said fault tolerant zone through absolute node identification independent of cluster size and retention of pre-recovery communication sessions in the fault-tolerant zone; and
implementing concurrent protocol scoping to limit application failover and recovery protocols to the cluster application and cluster management tiers of said fault containment group and normal transactional application protocols to the cluster application and cluster management tiers of said fault tolerant group.
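The request/response membership monitoring recited in claim 20 can be pictured with a small sketch. The claim does not specify a wire protocol, so everything concrete here is an assumption: `refresh_membership` is a hypothetical name, and `probe` stands in for whatever request a node is expected to answer (returning True on a response, returning False or raising on a timeout).

```python
from typing import Callable, Iterable, Set


def refresh_membership(members: Iterable[str],
                       probe: Callable[[str], bool]) -> Set[str]:
    """Run one request/response round over the fault tolerant zone.

    Returns the set of nodes that answered. Nodes that fail to respond
    (probe returns False or raises) are dropped from the membership.
    """
    surviving = set()
    for node in members:
        try:
            answered = probe(node)
        except Exception:
            # Assumption: any error while probing counts as "no response".
            answered = False
        if answered:
            surviving.add(node)
    return surviving
```

Calling `refresh_membership(["n1", "n2", "n3"], probe)` with a probe that only n1 and n3 answer yields {"n1", "n3"}. Because nodes are tracked by name rather than by position in a list, removing n2 does not change how the surviving nodes are identified, which is one simple way to read the claim's "absolute node identification independent of cluster size".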
Specification