Method and system for providing high availability to computer applications
Abstract
A system and method for distributed fault detection. In an exemplary method, unplanned application exits and crashes may be detected at a node local level. Further, application hangs may be detected using at least one of a script and a binary at the node local level. Also, node crashes and operating system crashes may be detected using node to node heart-beating.
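The node-to-node heart-beating mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the patent: the `HeartbeatMonitor` class, the node names, and the 3-second timeout are assumptions chosen for illustration, and timestamps are passed in explicitly so the logic is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer node (hypothetical helper)."""
    timeout: float  # seconds of silence after which a peer is presumed crashed
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        # Called whenever a heartbeat message from `node` arrives.
        self.last_seen[node] = now

    def crashed(self, now: float) -> List[str]:
        # A peer is reported as crashed if it has been silent longer than the timeout.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("node-a", now=0.0)
monitor.record("node-b", now=0.0)
monitor.record("node-a", now=2.5)   # node-a keeps beating, node-b goes silent
suspects = monitor.crashed(now=4.0)
print(suspects)
```

In a real deployment each node would run such a monitor for its peers and feed it from network messages; a silent peer could indicate either a node crash or an operating system crash, which this sketch does not try to distinguish.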
20 Claims
1. A method for distributed system level and application level fault detection for one or more applications running on one or more nodes, the method comprising:
- pre-loading system high availability shared libraries;
- pre-loading fault detectors for each one or more applications into the address spaces of said one or more applications on said one or more nodes;
- registering the applications with high availability protection;
- detecting unplanned exits and crash faults for the one or more applications at each local node;
- detecting hung applications faults for the one or more applications using at least one of a script or binary at each local node;
- detecting node crash faults by each local node for the one or more nodes using node-to-node heart-beating, and
- said fault detection requires no modifications of said one or more applications to contain availability code.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A communication network including distributed system level and application level fault detection, the network comprising:
- a first server structured to run applications;
- a second server in communication with the first server, and structured to operate as a back-up server for the first server, wherein each of the first server and the second server includes an Availability Manager for system level availability and a Duration Interface for application level availability coupled to the Availability Manager;
- each of the first server and the second server includes an operating system;
- each of the first server and second server is structured to pre-load fault detectors into the address space of an application;
- each of the first server and second server is structured to pre-load system high availability shared libraries;
- registering the applications with high availability protection;
- said a fault detectors are structured to detect faults in the communication network using at least one of heart-beats and communication between the first server Availability Manager and the second server Availability Manager;
- said fault detectors are structured to detect unplanned application exits, crash faults, and hung applications faults;
- wherein said fault detection requires no modifications of said applications to contain availability code.
Dependent claims: 11, 12, 13, 14, 15
16. A computer readable storage medium including a computer program having instructions for distributed system level and application level fault detection for one or more applications running on one or more nodes, wherein the computer program performs steps comprising:
- pre-loading system high availability shared libraries;
- pre-loading fault detectors for the each one or more applications into the address spaces of said one or more applications on said one or more nodes;
- registering the one or more applications with high availability protection;
- detecting unplanned application exits and crashes at a node local level;
- detecting application hangs using at least one of a script and a binary at the node local level; and
- detecting node crashes and operating system crashes using node to node heart-beating,
- wherein said fault detection requires no modifications to said one or more applications to contain availability code.
Dependent claims: 17, 18, 19, 20
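The node-local detection steps recited in the claims (unplanned exits, crashes, and hangs checked by a script or binary) can be sketched roughly as follows. This is an illustration under stated assumptions, not the claimed mechanism: the claims pre-load fault detectors into the application's address space, whereas this sketch watches a child process from outside. The function names, fault labels, and probe commands are hypothetical.

```python
import subprocess
import sys


def classify_exit(returncode: int) -> str:
    """Map a child's return code to a fault category (labels are illustrative)."""
    if returncode == 0:
        return "clean-exit"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as returncode -N.
        return "crash"
    return "unplanned-exit"


def probe_says_hung(probe_cmd: list, timeout: float) -> bool:
    """Run an external probe; a non-zero status or a timeout counts as a hang."""
    try:
        return subprocess.run(probe_cmd, timeout=timeout).returncode != 0
    except subprocess.TimeoutExpired:
        return True


# Stand-in "application" that exits with a non-zero status.
app = subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"])
exit_kind = classify_exit(app.returncode)

# Stand-in probes; a real probe would be a script or binary that checks the application.
healthy_probe = [sys.executable, "-c", "raise SystemExit(0)"]
failing_probe = [sys.executable, "-c", "raise SystemExit(1)"]
```

Keeping the hang check in a separate script or binary, as the claims describe, means the liveness test can be changed per application without touching the application itself.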
Specification