Fault-tolerance and fault-containment models for zoning clustered application silos into continuous availability and high availability zones in clustered systems during recovery and maintenance
Abstract
A cluster recovery and maintenance technique for a server cluster having plural nodes implementing a server tier in a client-server computing architecture. A first group of N active nodes each run a software stack comprising a cluster management tier and a cluster application tier that actively provides services on behalf of one or more client applications running in a client application tier on the clients. A second group of M spare nodes each run a software stack comprising a cluster management tier and a cluster application tier that does not actively provide client application services. First and second zones in the cluster are determined in response to an active node membership change involving one or more active nodes departing from or being added to the first group as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node.
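The zone determination described above can be illustrated with a minimal sketch. It is not the patented implementation: the function name `define_zones`, the node labels, and the use of plain sets are all hypothetical. It also assumes that a newly added node sits only in the fault containment zone until its join completes, and that one spare is drafted per departing node, both of which are choices the abstract does not dictate.

```python
def define_zones(active, spares, departing=(), joining=()):
    """Partition nodes into (fault_tolerant, fault_containment) zones.

    active    -- set of active node names currently serving clients
    spares    -- ordered list of spare node names (running, not serving)
    departing -- active nodes that failed, became unreachable, or are
                 being taken down for maintenance
    joining   -- nodes being added to the active group (assumption: they
                 stay in the containment zone until the join completes)
    """
    departing = set(departing) & set(active)
    joining = set(joining)

    # Fault tolerant zone: active nodes that remain operational and are
    # not part of the membership change. They keep running unchanged.
    fault_tolerant = set(active) - departing

    # Draft at most one spare per departing node (assumed policy).
    drafted = list(spares)[:len(departing)]

    # Fault containment zone: every node taking part in the change,
    # plus the spares that will absorb the departing workload.
    fault_containment = departing | joining | set(drafted)
    return fault_tolerant, fault_containment


ft, fc = define_zones(
    active={"n1", "n2", "n3", "n4"},
    spares=["s1", "s2"],
    departing={"n3"},
)
print(sorted(ft))  # nodes that continue without interruption
print(sorted(fc))  # nodes involved in failover and recovery
```

Keeping the two zones disjoint in this sketch reflects the idea that failover and recovery work is confined to the containment zone while the remaining active nodes continue normal transaction processing.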
50 Citations
11 Claims
1. A computer program product recorded on one or more data storage media for use in a server cluster having plural nodes, comprising:
said one or more data storage media;
program logic recorded on said data storage media for programming a data processing platform to operate as by;
maintaining a set of active nodes that each run a software stack that includes a cluster management tier and a cluster application tier, said cluster application tier of said active nodes actively providing services on behalf of client applications;
maintaining a set of spare nodes that each run a software stack that includes said cluster management tier and said cluster application tier, said cluster application tier of said spare nodes being continuously operational during steady-state cluster application transaction processing, but not actively providing transaction services on behalf of client applications prior to assuming an application workload from another node;
dynamically logically defining first and second zones in said cluster in response to an active node membership change involving one or more active nodes departing from or being added to said cluster as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node;
said first zone being a fault tolerant zone comprising all of said active nodes that are operational;
said second zone being a fault containment zone comprising all active nodes participating in said membership change and some number of said spare nodes in the event that said membership change involves a node departure;
implementing fast recovery/maintenance and high cluster application availability in said fault containment zone during cluster recovery or maintenance by initiating application failover and application recovery protocols that are implemented by said cluster application and cluster management tiers of nodes in said fault containment zone following said active node membership change;
maintaining continuous application cluster availability in said fault tolerant zone during cluster recovery or maintenance by continuing without interruption normal transactional application and related intra-cluster messaging protocols that were being implemented by said cluster application and cluster management tiers of nodes in said fault tolerant zone prior to said active node membership change; and
said cluster management tier of nodes in said fault tolerant zone and said fault containment zone initiating cluster recovery protocols following said active node membership change, said cluster recovery protocols being transparent to said cluster application tier of nodes in said fault tolerant zone so as not to interfere with said normal transactional application and related intra-cluster messaging protocols implemented by nodes in said fault tolerant zone;
whereby group integrity is maintained and transactional application communication messaging continues without interruption in nodes of said fault tolerant zone as cluster recovery is performed.
Dependent claims: 2, 3, 4, 5.
6. A server cluster system having plural nodes adapted to provide cluster application services to clients that access said cluster, comprising:
program logic executable on at least one of said nodes to perform operations, said operations comprising;
maintaining a set of active nodes that each run a software stack that includes a cluster management tier and a cluster application tier, said cluster application tier of said active nodes actively providing services on behalf of client applications;
maintaining a set of spare nodes that each run a software stack that includes said cluster management tier and said cluster application tier, said cluster application tier of said spare nodes being continuously operational during steady-state cluster application transaction processing, but not actively providing transaction services on behalf of client applications prior to assuming an application workload from another node;
dynamically logically defining first and second zones in said cluster in response to an active node membership change involving one or more active nodes departing from or being added to said cluster as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node;
said first zone being a fault tolerant zone comprising all of said active nodes that are operational;
said second zone being a fault containment zone comprising all active nodes participating in said membership change and some number of said spare nodes in the event that said membership change involves a node departure;
implementing fast recovery/maintenance and high cluster application availability in said fault containment zone during cluster recovery or maintenance by initiating application failover and application recovery protocols that are implemented by said cluster application and cluster management tiers of nodes in said fault containment zone following said active node membership change;
maintaining continuous application cluster availability in said fault tolerant zone during cluster recovery or maintenance by continuing without interruption normal transactional application and related intra-cluster messaging protocols that were being implemented by said cluster application and cluster management tiers of nodes in said fault tolerant zone prior to said active node membership change; and
said cluster management tier of nodes in said fault tolerant zone and said fault containment zone initiating cluster recovery protocols following said active node membership change, said cluster recovery protocols being transparent to said cluster application tier of nodes in said fault tolerant zone so as not to interfere with said normal transactional application and related intra-cluster messaging protocols implemented by nodes in said fault tolerant zone;
whereby group integrity is maintained and transactional application communication messaging continues without interruption in nodes of said fault tolerant zone as cluster recovery is performed.
Dependent claims: 7, 8, 9, 10.
11. A computer program recorded on one or more data storage media product for use in a server cluster having plural nodes implementing a server tier in a client-server computing architecture to provide cluster recovery, comprising:
said one or more data storage media;
program logic recorded on said data storage media for programming a data processing platform to operate as by;
determining a first group of N active nodes that each run a software stack comprising a cluster management tier and a cluster application tier that actively provides services on behalf of one or more client applications running in a client application tier on said clients;
determining a second group of M spare nodes that each run a software stack comprising said cluster management tier and said cluster application tier, said cluster application tier of said spare nodes being continuously operational during steady-state cluster application transaction processing, but not actively providing transaction services on behalf of client applications prior to assuming an application workload from another node;
dynamically logically defining first and second zones in said cluster in response to an active node membership change involving one or more active nodes departing from or being added to said first group as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node;
said first zone being a fault tolerant zone comprising all of said active nodes that are operational;
said second zone being a fault containment zone comprising all active nodes participating in said membership change and at most a corresponding number of said spare nodes in the event that said membership change involves a node departure;
implementing fast recovery/maintenance and high cluster application availability in said fault containment zone during recovery or maintenance by initiating application failover and application recovery protocols that are implemented by said cluster application and cluster management tiers of nodes in said fault containment zone following said active node membership change;
maintaining transactional isolation between said fault tolerant zone and said fault containment zone by failing over the client application services provided by any departing nodes in said fault containment zone to at most a corresponding number of said spare nodes in said fault containment zone;
maintaining continuous application cluster availability in said fault tolerant zone during recovery or maintenance by continuing without interruption normal transactional application and related intra-cluster messaging protocols that were being implemented by said cluster application and cluster management tiers of nodes in said fault tolerant zone prior to said active node membership change;
said cluster management tier of nodes in said fault tolerant zone and said fault containment zone initiating cluster recovery protocols following said active node membership change, said cluster recovery protocols being transparent to said cluster application tier of nodes in said fault tolerant zone so as not to interfere with said normal transactional application and related intra-cluster messaging protocols implemented by nodes in said fault tolerant zone;
maintaining transactional continuity in said fault tolerant zone by guaranteeing cluster membership integrity in said fault tolerant zone using a request/response-based cluster recovery communication protocol to monitor node membership integrity in said fault tolerant zone and remove nodes that fail to provide a node response, thus guaranteeing cluster membership integrity; and
maintaining transactional continuity in said fault tolerant zone by guaranteeing communication continuity in said fault tolerant zone through absolute node identification independent of cluster size and retention of pre-recovery communication sessions in the fault-tolerant zone;
whereby group integrity is maintained and transactional application communication messaging continues without interruption in nodes of said fault tolerant zone as cluster recovery is performed.
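The request/response membership check recited in claim 11 can be pictured with a short sketch. This is an illustration only, not the claimed protocol: `verify_membership` and the `probe` callback are hypothetical names, and a real cluster management tier would send the request over its own messaging layer with timeouts rather than call a local function.

```python
def verify_membership(zone, probe):
    """Poll every node in a zone and split it into responders and removals.

    zone  -- iterable of node names in the fault tolerant zone
    probe -- hypothetical callable: probe(node) returns True when the node
             answers the recovery request, False (or raises) otherwise
    """
    responsive, removed = set(), set()
    for node in sorted(zone):
        try:
            answered = bool(probe(node))
        except Exception:
            # A probe that errors out is treated the same as no response.
            answered = False
        if answered:
            responsive.add(node)
        else:
            removed.add(node)
    return responsive, removed


# Example: n2 fails to answer and is dropped from the zone.
alive, dropped = verify_membership(
    {"n1", "n2", "n4"},
    probe=lambda node: node != "n2",
)
print(sorted(alive), sorted(dropped))
```

In this sketch the check only reads membership and never touches application sessions, which matches the requirement that recovery stay transparent to the cluster application tier in the fault tolerant zone.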
Specification