Method and system for providing high availability to computer applications

US 8,037,367 B1
Filed: 12/15/2008
Issued: 10/11/2011
Est. Priority Date: 08/26/2004
Status: Active Grant

Abstract

A system and method for distributed fault detection. In an exemplary method, unplanned application exits and crashes may be detected at a node local level. Further, application hangs may be detected using at least one of a script and a binary at the node local level. Also, node crashes and operating system crashes may be detected using node to node heart-beating.
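The node-to-node heart-beating mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the patent: the class name `HeartbeatMonitor`, the `timeout_s` parameter, and the injectable `clock` are assumptions made for illustration, and how heartbeats travel between nodes (network or shared storage) is left out.

```python
import time


class HeartbeatMonitor:
    """Hypothetical helper: remembers when each peer node was last heard from."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s   # assumed tolerance before a peer counts as crashed
        self.clock = clock           # injectable so the logic can be tested without sleeping
        self.last_seen = {}

    def record(self, node):
        """Call whenever a heartbeat arrives from `node`."""
        self.last_seen[node] = self.clock()

    def crashed_nodes(self):
        """Return the peers whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

Each node would run such a monitor for its peers and treat any entry returned by `crashed_nodes()` as a suspected node or operating system crash.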

20 Claims

1. A method for distributed system level and application level fault detection for one or more applications running on one or more nodes, the method comprising:
- pre-loading system high availability shared libraries;
  
  pre-loading fault detectors for each one or more applications into the address spaces of said one or more applications on said one or more nodes;
  
  registering the applications with high availability protection;
  
  detecting unplanned exits and crash faults for the one or more applications at each local node;
  
  detecting hung applications faults for the one or more applications using at least one of a script or binary at each local node;
  
  detecting node crash faults by each local node for the one or more nodes using node-to-node heart-beating, and said fault detection requires no modifications of said one or more applications to contain availability code.
- - 2. The method according to claim 1, wherein the fault detectors for unplanned exits faults, crash faults, and hung applications faults run as one of an operating system service and a daemon.
  - 3. The method according to claim 1, wherein fault detectors for unplanned exits faults, crash faults, and hung applications faults run as one of a kernel service and a kernel module.
  - 4. The method according to claim 1, wherein the script for fault detection is a shell script.
  - 5. The method according to claim 1, wherein the binary for fault detection is a custom health-check designed to invoke one or more components of said one or more applications.
  - 6. The method according to claim 1, wherein the node to node heart-beating is conducted over a network.
  - 7. The method according to claim 1, wherein the node to node heart-beating is conducted using shared storage.
  - 8. The method according to claim 1, wherein all fault-detection for said one or more applications is node local;
    - unplanned exit faults and crash faults are detected by a broken communication link; and
      
      hung application faults are detected by invoking one or more features of said one or more applications.
  - 9. The method according to claim 1, wherein all fault detection is coordinated through a central fault monitor.
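Claim 8 recites detecting unplanned exits and crashes through a broken communication link. One way to picture this on a POSIX system is the sketch below, which is an illustration under stated assumptions rather than the patented mechanism: an anonymous pipe stands in for the link, the function name `wait_for_exit` is hypothetical, and the monitored program is started as a child process instead of having a detector pre-loaded into its address space.

```python
import os
import subprocess
import sys


def wait_for_exit(argv):
    """Start `argv` and block until its end of the link is closed.

    The child inherits the write end of a pipe and never has to use it.
    When the child exits for any reason, the kernel closes that end and
    the parent's read reaches end-of-file, so no change to the child is needed.
    Returns the child's return code (negative if killed by a signal).
    """
    read_fd, write_fd = os.pipe()
    child = subprocess.Popen(argv, pass_fds=(write_fd,))  # POSIX only
    os.close(write_fd)               # the parent keeps only the read end
    with os.fdopen(read_fd, "rb") as link:
        link.read()                  # returns at EOF, i.e. when the link is broken
    return child.wait()


if __name__ == "__main__":
    code = wait_for_exit([sys.executable, "-c", "raise SystemExit(3)"])
    print("application exited with code", code)
```

The design point is that the monitor learns of the exit from the operating system closing the link, not from any message the application sends.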

10. A communication network including distributed system level and application level fault detection, the network comprising:
- a first server structured to run applications;
  
  a second server in communication with the first server, and structured to operate as a back-up server for the first server, wherein each of the first server and the second server includes an Availability Manager for system level availability and a Duration Interface for application level availability coupled to the Availability Manager;
  
  each of the first server and the second server includes an operating system;
  
  each of the first server and second server is structured to pre-load fault detectors into the address space of an application;
  
  each of the first server and second server is structured to pre-load system high availability shared libraries;
  
  registering the applications with high availability protection;
  
  said fault detectors are structured to detect faults in the communication network using at least one of heart-beats and communication between the first server Availability Manager and the second server Availability Manager;
  
  said fault detectors are structured to detect unplanned application exits, crash faults, and hung applications faults;
  
  wherein said fault detection requires no modifications of said applications to contain availability code.
- - 11. The communication network according to claim 10, further comprising Operating System Library layers coupled to the Availability Manager.
  - 12. The communication network according to claim 10, wherein the fault detector is structured to detect unplanned application exits by the Availability Manager when a communication link between the Availability Manager and the Duration Interface is broken.
  - 13. The communication network according to claim 10, wherein the fault detector is structured to detect application hangs in conjunction with health-checks configured to invoke one or more features of an application on the first server and the second server where an application is running.
  - 14. The communication network according to claim 13, wherein the health-checks invoke at least one of a feature and component of the application and, if an invalid result is produced, create a fault event.
  - 15. The communication network according to claim 13, wherein the health checks are at least one of an executable, script and macro that are capable of calculating and returning integer values.
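Claims 13 to 15 describe health-checks that invoke part of an application, return an integer value, and raise a fault event on an invalid result. The sketch below shows one possible reading, assuming the health-check is an external command whose exit code is the integer result; the function name `run_health_check`, the `ok_code` convention, and the use of a timeout to classify a hang are assumptions, not details from the claims.

```python
import subprocess
import sys


def run_health_check(argv, timeout_s, ok_code=0):
    """Run a health-check command and classify its integer result.

    Returns "healthy" when the exit code equals `ok_code`, "fault" for any
    other exit code, and "hung" when no result arrives within `timeout_s`.
    """
    try:
        result = subprocess.run(argv, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hung"    # subprocess.run kills the command before raising
    return "healthy" if result.returncode == ok_code else "fault"


if __name__ == "__main__":
    # A stand-in health-check that reports success.
    print(run_health_check([sys.executable, "-c", "raise SystemExit(0)"], timeout_s=10))
```

A caller would turn a "fault" or "hung" result into a fault event for the application being checked.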

16. A computer readable storage medium including a computer program having instructions for distributed system level and application level fault detection for one or more applications running on one or more nodes, wherein the computer program performs steps comprising:
- pre-loading system high availability shared libraries;
  
  pre-loading fault detectors for the each one or more applications into the address spaces of said one or more applications on said one or more nodes;
  
  registering the one or more applications with high availability protection;
  
  detecting unplanned application exits and crashes at a node local level;
  
  detecting application hangs using at least one of a script and a binary at the node local level;
  
  and detecting node crashes and operating system crashes using node to node heart-beating, wherein said fault detection requires no modifications to said one or more applications to contain availability code.
- - 17. The computer readable medium according to claim 16, wherein the computer program further performs a step of detecting unplanned application exits by an Availability Manager when a communication link between the Availability Manager and a Duration Interface is broken.
  - 18. The computer readable medium according to claim 16, wherein the computer program further performs the step of detecting application hangs in conjunction with health-checks configured to invoke one or more features of an application on nodes where an application is running.
  - 19. The computer readable medium according to claim 18, wherein the health-checks invoke at least one of a feature and component of the application and, if an invalid result is produced, create a fault event.
  - 20. The computer readable medium according to claim 16, wherein the script is a shell script, and the binary is a custom health-check designed to invoke one or more components of said one or more applications.

Current Assignee: Red Hat, Inc. (International Business Machines Corporation)
Original Assignee: Open Invention Network LLC
Inventors: Havemose, Allan
Primary Examiner(s): Baderman; Scott
Assistant Examiner(s): Leibovich; Yair
Application Number: US12/334,651
Time in Patent Office: 1,030 Days
Field of Search: 714/55, 714/47.1
US Class Current: 714/55
CPC Class Codes:
G06F 11/0709   in a distributed system con...
G06F 11/1402   Saving, restoring, recoveri...
G06F 11/141   for bus or memory accesses
G06F 11/1438   Restarting or rejuvenating
G06F 11/1482   by means of middleware or O...
G06F 11/20   using active fault-masking,...
G06F 11/2002   where interconnections or c...
G06F 11/2023   Failover techniques
G06F 11/2028   eliminating a faulty proces...
G06F 11/203   using migration
G06F 11/2046   where the redundant compone...
G06F 11/30   Monitoring
H04L 61/5007   Internet protocol [IP] addr...
