Fault-tolerance and fault-containment models for zoning clustered application silos into continuous availability and high availability zones in clustered systems during recovery and maintenance

US 8,286,026 B2
Filed: 02/13/2012
Issued: 10/09/2012
Est. Priority Date: 06/29/2005
Status: Expired due to Fees

Abstract

A cluster recovery and maintenance technique for a server cluster having plural nodes implementing a server tier in a client-server computing architecture. A first group of N active nodes each run a software stack comprising a cluster management tier and a cluster application tier that actively provides services on behalf of one or more client applications running in a client application tier on the clients. A second group of M spare nodes each run a software stack comprising a cluster management tier and a cluster application tier that does not actively provide client application services. First and second zones in the cluster are determined in response to an active node membership change involving one or more active nodes departing from or being added to the first group as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node.
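The zone determination described in the abstract can be sketched in a few lines of Python. This is only an illustrative reading of the text, not the patented implementation: the function name `define_zones`, the `Zones` container, the node names, and the rule of picking one spare per departing node in sorted order are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zones:
    fault_tolerant: frozenset
    fault_containment: frozenset


def define_zones(active, spare, departing=(), joining=()):
    """Split a cluster into two logical zones after a membership change."""
    departing, joining = set(departing), set(joining)
    unknown = departing - set(active)
    if unknown:
        raise ValueError(f"departing nodes are not active: {sorted(unknown)}")
    # Operational active nodes untouched by the change keep running as before.
    fault_tolerant = set(active) - departing - joining
    # Assumption: take one spare per departing node, lowest name first.
    chosen_spares = sorted(spare)[: len(departing)]
    # The containment zone holds every node involved in the change.
    fault_containment = departing | joining | set(chosen_spares)
    return Zones(frozenset(fault_tolerant), frozenset(fault_containment))


zones = define_zones(
    active={"n1", "n2", "n3", "n4"},
    spare={"s1", "s2"},
    departing={"n3"},
)
print(sorted(zones.fault_tolerant))     # ['n1', 'n2', 'n4']
print(sorted(zones.fault_containment))  # ['n3', 's1']
```

In this sketch the two zones are disjoint by construction, which mirrors the idea that recovery work stays inside the fault containment zone while the remaining active nodes continue undisturbed.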

11 Claims

1. A computer program product recorded on one or more data storage media for use in a server cluster having plural nodes, comprising:
- said one or more data storage media;
  
  program logic recorded on said data storage media for programming a data processing platform to operate as by;
  
  maintaining a set of active nodes that each run a software stack that includes a cluster management tier and a cluster application tier, said cluster application tier of said active nodes actively providing services on behalf of client applications;
  
  maintaining a set of spare nodes that each run a software stack that includes said cluster management tier and said cluster application tier, said cluster application tier of said spare nodes being continuously operational during steady-state cluster application transaction processing, but not actively providing transaction services on behalf of client applications prior to assuming an application workload from another node;
  
  dynamically logically defining first and second zones in said cluster in response to an active node membership change involving one or more active nodes departing from or being added to said cluster as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node;
  
  said first zone being a fault tolerant zone comprising all of said active nodes that are operational;
  
  said second zone being a fault containment zone comprising all active nodes participating in said membership change and some number of said spare nodes in the event that said membership change involves a node departure;
  
  implementing fast recovery/maintenance and high cluster application availability in said fault containment zone during cluster recovery or maintenance by initiating application failover and application recovery protocols that are implemented by said cluster application and cluster management tiers of nodes in said fault containment zone following said active node membership change;
  
  maintaining continuous application cluster availability in said fault tolerant zone during cluster recovery or maintenance by continuing without interruption normal transactional application and related intra-cluster messaging protocols that were being implemented by said cluster application and cluster management tiers of nodes in said fault tolerant zone prior to said active node membership change; and
  
  said cluster management tier of nodes in said fault tolerant zone and said fault containment zone initiating cluster recovery protocols following said active node membership change, said cluster recovery protocols being transparent to said cluster application tier of nodes in said fault tolerant zone so as not to interfere with said normal transactional application and related intra-cluster messaging protocols implemented by nodes in said fault tolerant zone;
  
  whereby group integrity is maintained and transactional application communication messaging continues without interruption in nodes of said fault tolerant zone as cluster recovery is performed.
- Dependent claims (2, 3, 4, 5)
  - 2. A program product in accordance with claim 1 wherein transactional isolation is maintained between said fault tolerant zone and said fault containment zone by failing over client application services provided by any departing node(s) in said fault containment zone to a corresponding number of said spare nodes in said fault containment zone.
  - 3. A program product in accordance with claim 1 wherein transactional continuity is maintained in said fault tolerant zone by guaranteeing cluster membership integrity in said fault tolerant zone as a result of exploiting a request/response-based cluster recovery communication protocol to monitor node membership integrity in said fault tolerant zone and removing from said fault tolerant zone a node that fails to provide a node response pursuant to said request/response-based communication protocol.
  - 4. A program product in accordance with claim 1 wherein transactional continuity is maintained in said fault tolerant zone by guaranteeing communication continuity in said fault tolerant zone through absolute node identification independent of cluster size and retention of pre-recovery/maintenance communication sessions in the fault-tolerant zone.
  - 5. A program product in accordance with claim 1 further including rejoining one or more departing nodes after repair or maintenance into said cluster as spare nodes.
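Claims 2 and 3 describe two mechanisms that lend themselves to a short sketch: a request/response round that removes unresponsive nodes from the fault tolerant zone, and a failover plan that pairs each departing node with a spare. The Python below is a hedged illustration only. The names `monitor_membership`, `plan_failover`, and the `probe` callable are hypothetical, and the claims do not specify how a request is sent or how a spare is chosen.

```python
def monitor_membership(fault_tolerant, probe):
    """Run one request/response round; return (survivors, removed)."""
    survivors, removed = set(), set()
    for node in sorted(fault_tolerant):
        try:
            # probe is a hypothetical callable: it sends a request to the
            # node and returns True if a response arrives.
            ok = probe(node)
        except (TimeoutError, ConnectionError):
            ok = False
        (survivors if ok else removed).add(node)
    return survivors, removed


def plan_failover(departing, spares):
    """Pair each departing node with a distinct spare node."""
    departing, spares = sorted(departing), sorted(spares)
    if len(spares) < len(departing):
        raise RuntimeError("not enough spare nodes for failover")
    return dict(zip(departing, spares))


def fake_probe(node):
    # Stand-in for a real network request; node n2 never answers.
    if node == "n2":
        raise TimeoutError("no response")
    return True


survivors, removed = monitor_membership({"n1", "n2", "n4"}, fake_probe)
print(sorted(survivors), sorted(removed))    # ['n1', 'n4'] ['n2']
print(plan_failover(removed, {"s1", "s2"}))  # {'n2': 's1'}
```

A node that fails to respond leaves the fault tolerant zone and its workload is mapped to a spare, so the surviving nodes never take on recovery work themselves.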

6. A server cluster system having plural nodes adapted to provide cluster application services to clients that access said cluster, comprising:
- program logic executable on at least one of said nodes to perform operations, said operations comprising;
  
  maintaining a set of active nodes that each run a software stack that includes a cluster management tier and a cluster application tier, said cluster application tier of said active nodes actively providing services on behalf of client applications;
  
  maintaining a set of spare nodes that each run a software stack that includes said cluster management tier and said cluster application tier, said cluster application tier of said spare nodes being continuously operational during steady-state cluster application transaction processing, but not actively providing transaction services on behalf of client applications prior to assuming an application workload from another node;
  
  dynamically logically defining first and second zones in said cluster in response to an active node membership change involving one or more active nodes departing from or being added to said cluster as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node;
  
  said first zone being a fault tolerant zone comprising all of said active nodes that are operational;
  
  said second zone being a fault containment zone comprising all active nodes participating in said membership change and some number of said spare nodes in the event that said membership change involves a node departure;
  
  implementing fast recovery/maintenance and high cluster application availability in said fault containment zone during cluster recovery or maintenance by initiating application failover and application recovery protocols that are implemented by said cluster application and cluster management tiers of nodes in said fault containment zone following said active node membership change;
  
  maintaining continuous application cluster availability in said fault tolerant zone during cluster recovery or maintenance by continuing without interruption normal transactional application and related intra-cluster messaging protocols that were being implemented by said cluster application and cluster management tiers of nodes in said fault tolerant zone prior to said active node membership change; and
  
  said cluster management tier of nodes in said fault tolerant zone and said fault containment zone initiating cluster recovery protocols following said active node membership change, said cluster recovery protocols being transparent to said cluster application tier of nodes in said fault tolerant zone so as not to interfere with said normal transactional application and related intra-cluster messaging protocols implemented by nodes in said fault tolerant zone;
  
  whereby group integrity is maintained and transactional application communication messaging continues without interruption in nodes of said fault tolerant zone as cluster recovery is performed.
- Dependent claims (7, 8, 9, 10)
  - 7. A system in accordance with claim 6 wherein transactional isolation is maintained between said fault tolerant zone and said fault containment zone by failing over client application services provided by any departing node(s) in said fault containment zone to a corresponding number of said spare nodes in said fault containment zone.
  - 8. A system in accordance with claim 6 wherein transactional continuity is maintained in said fault tolerant zone by guaranteeing cluster membership integrity in said fault tolerant zone as a result of exploiting a request/response-based cluster recovery communication protocol to monitor node membership integrity in said fault tolerant zone and removing from said fault tolerant zone a node that fails to provide a node response pursuant to said request/response-based communication protocol.
  - 9. A system in accordance with claim 6 wherein transactional continuity is maintained in said fault tolerant zone by guaranteeing communication continuity in said fault tolerant zone through absolute node identification independent of cluster size and retention of pre-recovery/maintenance communication sessions in the fault-tolerant zone.
  - 10. A system in accordance with claim 6 further including rejoining one or more departing nodes after repair or maintenance into said cluster as spare nodes.
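Claims 9 and 10 mention absolute node identification, retention of pre-recovery communication sessions, and rejoining repaired nodes as spares. One possible interpretation is sketched below in Python. The `Membership` class and its methods are hypothetical names chosen for illustration, and modelling a session as a pair of node IDs is an assumption, not something the claims specify.

```python
class Membership:
    """Tracks nodes by fixed IDs rather than by position in a member list."""

    def __init__(self, active, spare):
        self.active, self.spare = set(active), set(spare)
        self.sessions = set()  # each session is a frozenset of two node IDs

    def open_session(self, a, b):
        self.sessions.add(frozenset((a, b)))

    def depart(self, node, replacement):
        """Swap a departing active node for a spare that takes its workload."""
        self.active.remove(node)
        self.spare.remove(replacement)
        self.active.add(replacement)
        # Only sessions that touch the departed node are dropped; because IDs
        # never shift when the cluster shrinks, all other sessions stay valid.
        self.sessions = {s for s in self.sessions if node not in s}

    def rejoin_as_spare(self, node):
        """After repair or maintenance the node returns as a spare."""
        self.spare.add(node)


m = Membership(active={"n1", "n2", "n3"}, spare={"s1"})
m.open_session("n1", "n2")
m.open_session("n2", "n3")
m.depart("n3", replacement="s1")
m.rejoin_as_spare("n3")
print(sorted(m.active))  # ['n1', 'n2', 's1']
print(sorted(m.spare))   # ['n3']
print(frozenset(("n1", "n2")) in m.sessions)  # True
```

Because sessions are keyed by stable IDs instead of list indices, the session between n1 and n2 survives the departure of n3 unchanged in this sketch.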

11. A computer program recorded on one or more data storage media product for use in a server cluster having plural nodes implementing a server tier in a client-server computing architecture to provide cluster recovery, comprising:
- said one or more data storage media;
  
  program logic recorded on said data storage media for programming a data processing platform to operate as by;
  
  determining a first group of N active nodes that each run a software stack comprising a cluster management tier and a cluster application tier that actively provides services on behalf of one or more client applications running in a client application tier on said clients;
  
  determining a second group of M spare nodes that each run a software stack comprising said cluster management tier and said cluster application tier, said cluster application tier of said spare nodes being continuously operational during steady-state cluster application transaction processing, but not actively providing transaction services on behalf of client applications prior to assuming an application workload from another node;
  
  dynamically logically defining first and second zones in said cluster in response to an active node membership change involving one or more active nodes departing from or being added to said first group as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node;
  
  said first zone being a fault tolerant zone comprising all of said active nodes that are operational;
  
  said second zone being a fault containment zone comprising all active nodes participating in said membership change and at most a corresponding number of said spare nodes in the event that said membership change involves a node departure;
  
  implementing fast recovery/maintenance and high cluster application availability in said fault containment zone during recovery or maintenance by initiating application failover and application recovery protocols that are implemented by said cluster application and cluster management tiers of nodes in said fault containment zone following said active node membership change;
  
  maintaining transactional isolation between said fault tolerant zone and said fault containment zone by failing over the client application services provided by any departing nodes in said fault containment zone to at most a corresponding number of said spare nodes in said fault containment zone;
  
  maintaining continuous application cluster availability in said fault tolerant zone during recovery or maintenance by continuing without interruption normal transactional application and related intra-cluster messaging protocols that were being implemented by said cluster application and cluster management tiers of nodes in said fault tolerant zone prior to said active node membership change;
  
  said cluster management tier of nodes in said fault tolerant zone and said fault containment zone initiating cluster recovery protocols following said active node membership change, said cluster recovery protocols being transparent to said cluster application tier of nodes in said fault tolerant zone so as not to interfere with said normal transactional application and related intra-cluster messaging protocols implemented by nodes in said fault tolerant zone;
  
  maintaining transactional continuity in said fault tolerant zone by guaranteeing cluster membership integrity in said fault tolerant zone using a request/response-based cluster recovery communication protocol to monitor node membership integrity in said fault tolerant zone and remove nodes that fail to provide a node response, thus guaranteeing cluster membership integrity; and
  
  maintaining transactional continuity in said fault tolerant zone by guaranteeing communication continuity in said fault tolerant zone through absolute node identification independent of cluster size and retention of pre-recovery communication sessions in the fault-tolerant zone;
  
  whereby group integrity is maintained and transactional application communication messaging continues without interruption in nodes of said fault tolerant zone as cluster recovery is performed.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Rao, Sudhir G.; Jackson, Bruce M.
Primary Examiner(s): Baderman, Scott
Assistant Examiner(s): Truong, Loan L.T.
Application Number: US13/372,209
Publication Number: US 20120166866A1
Time in Patent Office: 239 Days
Field of Search: 714/4.1, 714/4.11, 714/4.12
US Class Current: 714/4.1
CPC Class Codes:
G06F 11/1425   by reconfiguration of node ...
G06F 11/2028   eliminating a faulty proces...
G06F 11/2041   with more than one idle spa...
H04L 41/0654   using network fault recover...
