Proactive method for ensuring availability in a clustered system

US 6,986,076 B1
Filed: 05/28/2002
Issued: 01/10/2006
Est. Priority Date: 05/28/2002
Status: Expired due to Term

Abstract

The method of the present invention is useful in a computer system including at least two server nodes, each of which can execute clustered server software. The program executes a method for monitoring failure situations to reduce downtime. The method includes the step of detecting an event causing one of the failure situations, and then the method determines if the event affects one of the server nodes. If it is determined the event does affect one of the server nodes, the method then determines if the event exceeds a threshold value. If it is determined the event exceeds a threshold value, the method executes a proactive failover. If the event is not specific to a cluster node, but indicates an impending or actual failure of the cluster software, the method identifies and initiates an appropriate action to fix the condition or provide a workaround (if available) that will preempt an impending failure of the cluster system or would enable a restarting of a failed cluster software.
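The decision flow in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the `Event` fields, the numeric `severity`, and the returned action names are hypothetical, introduced only to make the branching concrete.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    # All fields are illustrative assumptions, not taken from the patent text.
    node: Optional[str]                 # affected server node, or None if not node-specific
    severity: int                       # assumed numeric measure compared to the threshold
    affects_cluster_service: bool = False


def handle_event(event: Event, threshold: int) -> str:
    """Return the action chosen for a detected event."""
    if event.node is not None:
        # The event affects a server node: fail over only above the threshold.
        if event.severity > threshold:
            return "proactive_failover"
        return "no_action"
    # The event is not node-specific: check the cluster service condition.
    if event.affects_cluster_service:
        return "fix_or_workaround"
    return "no_action"
```

For example, `handle_event(Event(node="node1", severity=9), threshold=5)` selects the proactive failover branch, while an event with `node=None` and `affects_cluster_service=True` selects the fix-or-workaround branch.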

9 Claims

1. In a computer system including at least two server nodes, each of which can execute clustered server software, a method for monitoring failure situations to reduce downtime, said method comprising the steps of:
- (a) detecting an event causing one of said failure situations;
- (b) determining if said event affects one of said server nodes, and if so;
- (c) determining if said event exceeds a threshold value, and if so;
- (d) executing a proactive failover;
- (e) determining if said event does not affect one of said server nodes, and if so;
- (f) determining if said event affects the condition of the cluster service, and if so;
- (g) identifying and initiating an appropriate action to fix said condition or provide a workaround that will preempt an impending failure of the cluster system, or restart a failed cluster system.

2. The method as in claim 1 wherein said threshold value is selected by a user to represent unacceptable server conditions.

3. The method as in claim 1 wherein said step (a) of detecting said event includes the steps of:
- (a1) listening for an SNMP event;
- (a2) listening for an event log event.

4. The method as in claim 3 wherein said step (a1) of listening for said SNMP event includes the steps of:
- (a1a) determining if SNMP service software is installed, and if so;
- (a1b) determining if SNMP agent software is installed, and if so;
- (a1c) initiating a thread to receive and process SNMP traps.

5. The method as in claim 4 wherein it is determined in step (a1a) that said SNMP service software is not installed, then further comprising the step of:
- (a1a1) creating a notification to install said SNMP service software.

6. The method as in claim 4 wherein it is determined in step (a1b) that said SNMP agent software is not installed, then further comprising the step of:
- (a2b1) creating a notification to install said SNMP agent software.
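
The SNMP listener setup in claims 4 through 6 can be sketched as follows. This is only an illustration under stated assumptions: the function name, the boolean installation flags, the notification strings, and the `process_traps` callback are all hypothetical, and the sketch does not probe a real system or speak SNMP.

```python
import threading
from typing import Callable, List, Optional, Tuple


def start_snmp_listener(
    service_installed: bool,
    agent_installed: bool,
    process_traps: Callable[[], None],
) -> Tuple[Optional[threading.Thread], List[str]]:
    """Start a trap-processing thread, or return install notifications instead."""
    notifications: List[str] = []
    if not service_installed:
        # Claim 5: the SNMP service software is missing, so notify and stop.
        notifications.append("Install SNMP service software")
        return None, notifications
    if not agent_installed:
        # Claim 6: the SNMP agent software is missing, so notify and stop.
        notifications.append("Install SNMP agent software")
        return None, notifications
    # Claim 4, step (a1c): a dedicated thread receives and processes traps.
    worker = threading.Thread(target=process_traps, daemon=True)
    worker.start()
    return worker, notifications
```

In a real deployment the two flags would come from querying the host, and `process_traps` would block on a trap receiver; both are left to the caller here.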

7. In a computer system including at least two server nodes, each of which can execute clustered server software, a method for monitoring failure situations to reduce downtime, said method comprising the steps of:
- (a) detecting an event causing one of said failure situations said detecting including the steps of;
  - (a1) listening for a Simple Network Management Protocol (SNMP) event;
  - (a2) listening for an event log event;
    wherein said step (a2) for listening for said event log event further includes the steps of;
    - (a2a) opening a connection to a Windows Management Instrumentation (WMI) service;
    - (a2b) subscribing to receive event log messages from said WMI service;
- (b) determining if said event affects one of said server nodes, and if so;
- (c) determining if said event exceeds a threshold value, and if so;
- (d) executing a proactive failover.

8. The method as in claim 7 wherein said step (c), of determining if said event exceeds a threshold value, further includes the steps of:
- (c1) ensuring said node affected by said event owns a cluster group;
- (c2) ensuring there is a remaining node to failover to;
- (c3) ensuring said remaining node is clear of critical events.

9. The method as in claim 7 wherein said step (d) of executing a proactive failover includes the steps of:
- (d1) initiating a failover process for each cluster group in an offline state;
- (d2) logging the result for each of said failover processes.
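
The precondition checks of claim 8 and the per-group failover of claim 9 can be sketched in a small in-memory model. The `Cluster` class, its fields, and both function names are hypothetical. As a simplifying assumption, the sketch moves every group owned by the affected node and does not model the "offline state" wording of step (d1); it also does not call any real cluster API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Cluster:
    # Illustrative in-memory model, not a real cluster service interface.
    group_owner: Dict[str, str]                  # cluster group name -> owning node
    nodes: List[str]
    nodes_with_critical_events: Set[str] = field(default_factory=set)


def pick_failover_target(cluster: Cluster, affected: str) -> Optional[str]:
    """Return a node to fail over to, or None if the preconditions fail."""
    # (c1) the affected node must own at least one cluster group.
    if affected not in cluster.group_owner.values():
        return None
    # (c2) and (c3) there must be another node that is clear of critical events.
    for node in cluster.nodes:
        if node != affected and node not in cluster.nodes_with_critical_events:
            return node
    return None


def proactive_failover(cluster: Cluster, affected: str, target: str, log: List[str]) -> None:
    """Move each group owned by the affected node and log each result."""
    for group, owner in cluster.group_owner.items():
        if owner == affected:
            cluster.group_owner[group] = target          # (d1) fail the group over
            log.append(f"{group}: {affected} -> {target}")  # (d2) log the result
```

A caller would first run `pick_failover_target` and invoke `proactive_failover` only when it returns a node, so a failover is never started without a healthy destination.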

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Unisys Corporation
Inventors: Vellore, Prabhakar Krishnamurthy; Smith, Norman Roy
Primary Examiner(s): BADERMAN, SCOTT T
Application Number: US10/156,486
Time in Patent Office: 1,323 Days
Field of Search: 714/4, 714/13, 714/47
US Class Current: 714/4.11
CPC Class Codes:
- G06F 11/004   Error avoidance G06F11/07 a...
- G06F 11/2028   eliminating a faulty proces...
- G06F 11/3495   for systems
- G06F 2201/86   Event-based monitoring
- H04L 41/0631   using root cause analysis; ...
- H04L 41/0659   by isolating or reconfiguri...
- H04L 43/0817   by checking functioning
- H04L 43/10   Active monitoring, e.g. hea...
- H04L 43/16   Threshold monitoring
