Fault tolerant distributed computing applications
Abstract
A technique for enhancing fault-tolerance of a distributed computing application, including applications provided via an application service provider (ASP) model, utilizes a separate monitoring program to monitor continued operation of the distributed application software (e.g., an ASP agent) on a node of the distributed application. The application software signals its continued operation by periodically generating a “heart beat” event. On failure of the application software on the node, the monitoring program takes action to restore the application on the node, such as by restarting the application, reinstalling the application software, logging failure and/or transmitting an alert to the application's administrator.
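The heartbeat arrangement described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the names `Heartbeat`, `Monitor`, `interval_s`, and `on_failure` are hypothetical, and the heartbeat is kept in-process for brevity, whereas a separate monitoring program would need some inter-process signal (a file timestamp, an OS event, or a socket, for example).

```python
import time
from typing import Callable


class Heartbeat:
    """Records the most recent heartbeat signalled by the application (hypothetical helper)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        # Called periodically by the application to signal continued operation.
        self._last_beat = self._clock()

    def seconds_since_last_beat(self) -> float:
        return self._clock() - self._last_beat


class Monitor:
    """Checks that a heartbeat arrived within the allowed interval (hypothetical helper)."""

    def __init__(self, heartbeat: Heartbeat, interval_s: float,
                 on_failure: Callable[[], None]) -> None:
        self.heartbeat = heartbeat
        self.interval_s = interval_s
        self.on_failure = on_failure

    def check(self) -> bool:
        # Healthy if a heartbeat was seen within the interval.
        if self.heartbeat.seconds_since_last_beat() <= self.interval_s:
            return True
        # No heartbeat in the interval: treat as failure and start restoration.
        self.on_failure()
        return False
```

In this sketch the monitor would call `check()` on its own timer, and `on_failure` would be whatever restorative action is configured (restart, reinstall, log, or alert).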
23 Claims
1. A computer-implemented method of enhancing fault-tolerance of a distributed computing application, the method comprising:
running a monitoring program on a node in a network in connection with running software of the distributed computing application on the node;
in the monitoring program, recurrently checking continued operation of the distributed computing application's software on the node; and
in the event of failure, initiating by the monitoring program an action to restore the distributed computing application.
Dependent Claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A computer-implemented method of enhancing fault-tolerance of an application provided at nodes of a distributed network via an application service provider model, the method comprising:
periodically during execution of an application service provider agent program on a node, generating an event signaling continued operation of said agent program on the node;
at periodic intervals, checking that the event was generated during a current interval;
if the event was not generated in the interval, restoring the application service provider agent to operation by;
at least once restarting the application service provider agent;
if restarting does not restore the application service provider agent, reinstalling software of the application service provider agent on the node and restarting the application service provider agent;
if reinstalling the application service provider agent does not restore the application service provider agent, transmitting notification of the application service provider agent's failure on the node to a data center for the application service provider.
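The escalation order recited in claim 13 (restart, then reinstall and restart, then notify the data center) can be sketched as follows. This is only an illustration under stated assumptions: `restore_agent`, its callable parameters, and `max_restarts` are hypothetical names, and the actual restart, reinstall, and notification mechanisms are left to the caller.

```python
from typing import Callable


def restore_agent(
    restart: Callable[[], None],
    reinstall: Callable[[], None],
    is_running: Callable[[], bool],
    notify_data_center: Callable[[str], None],
    max_restarts: int = 2,
) -> str:
    """Try progressively stronger actions; return which step ended the attempt."""
    if max_restarts < 1:
        raise ValueError("the agent must be restarted at least once")

    # Step 1: restart the agent at least once.
    for _ in range(max_restarts):
        restart()
        if is_running():
            return "restarted"

    # Step 2: reinstall the agent software, then restart again.
    reinstall()
    restart()
    if is_running():
        return "reinstalled"

    # Step 3: local recovery failed, so report the failure to the data center.
    notify_data_center("agent could not be restored on this node")
    return "notified"
```

Passing the actions in as callables keeps the escalation logic independent of how a particular node restarts or reinstalls its agent.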
14. A fault-tolerant application service providing system of distributed computing nodes communicating via a data network, comprising:
an application service providing data center;
a computing node interconnected via the data network with the application service providing data center;
on the computing node, an application service providing agent for providing an application on the computing node administered via the application service providing data center;
a monitor program on the computing node for monitoring continued operation of the application service providing agent, and operating upon detecting failure of the application service providing agent to initiate a restorative action to restore the application service providing agent to operation on the node.
Dependent Claims: 15, 16, 17, 18, 19, 20, 21, 22
23. A computer-readable media for carrying a fault-tolerance enhancing program for a distributed computing application, the program comprising for execution at a computing node on a data network:
means for monitoring continued operation of the distributed computing application at the computing node to detect failure of the distributed computing application to continually operate on the computing node;
means responsive to the failure being detected, for initiating actions to restore the distributed computing application to operation on the computing node; and
means responsive to failure to restore operation of the distributed computing application on the computing node, for transmitting information of the failure to a distributed computing application administering server on the data network.
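The last element of claim 23, transmitting information about an unrecovered failure to the administering server, might look like the sketch below. The claim does not specify a message format or transport, so everything here is assumed for illustration: the `FailureReport` fields, the JSON encoding, and the `send` callable standing in for the network transport are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable, List


@dataclass
class FailureReport:
    """Information about a failure that local restoration could not fix (hypothetical format)."""
    node_id: str
    application: str
    attempted_actions: List[str] = field(default_factory=list)
    detail: str = ""


def report_unrecovered_failure(report: FailureReport,
                               send: Callable[[bytes], None]) -> bytes:
    """Serialize the report and hand it to the caller-supplied transport."""
    payload = json.dumps(asdict(report), sort_keys=True).encode("utf-8")
    # `send` is an assumption: it could wrap a socket, an HTTP client, or a message queue.
    send(payload)
    return payload
```

Listing the attempted actions in the report lets the administrator see which restoration steps were already tried on the node.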
Specification