Fault tolerant distributed computing applications

US 20040153703A1
Filed: 04/22/2003
Published: 08/05/2004
Est. Priority Date: 04/23/2002
Status: Abandoned Application


Abstract

A technique for enhancing fault-tolerance of a distributed computing application, including applications provided via an application service provider (ASP) model, utilizes a separate monitoring program to monitor continued operation of the distributed application software (e.g., an ASP agent) on a node of the distributed application. The application software signals its continued operation by periodically generating a “heart beat” event. On failure of the application software on the node, the monitoring program takes action to restore the application on the node, such as by restarting the application, reinstalling the application software, logging failure and/or transmitting an alert to the application's administrator.
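The heart beat check described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the implementation disclosed in the application: the class name `HeartbeatMonitor`, its methods, the `restore` callback, and the injectable `clock` are all hypothetical names chosen for the example.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Hypothetical monitor: tracks the latest heart beat and flags a missed one."""

    def __init__(self, interval_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.interval_s = interval_s      # monitoring interval (assumed, in seconds)
        self._clock = clock               # injectable so the logic can be tested
        self._last_beat = clock()         # treat start-up as the first beat

    def beat(self) -> None:
        """Called by the application software to signal continued operation."""
        self._last_beat = self._clock()

    def is_alive(self) -> bool:
        """True if a beat arrived within the monitoring interval."""
        return (self._clock() - self._last_beat) <= self.interval_s

    def check(self, restore: Callable[[], None]) -> bool:
        """Run one monitoring pass; call `restore` if the beat is overdue."""
        if self.is_alive():
            return True
        restore()
        # Give the restored application a fresh interval before the next check.
        self._last_beat = self._clock()
        return False
```

In this sketch the monitored software would call `beat()` periodically, and a separate loop or timer would call `check()` once per interval with whatever restorative action applies (restart, reinstall, log, or alert).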


23 Claims

1. A computer-implemented method of enhancing fault-tolerance of a distributed computing application, the method comprising:
- running a monitoring program on a node in a network in connection with running software of the distributed computing application on the node;
  
  in the monitoring program, recurrently checking continued operation of the distributed computing application's software on the node; and
  
  in the event of failure, initiating by the monitoring program an action to restore the distributed computing application.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12):
  - 2. The method of claim 1 wherein the distributed computing application includes an administrative agent for an application service provider.
  - 3. The method of claim 1 further comprising:
    - in the distributed computing application running on the node, recurrently signaling its continued operation; and
      
      in the monitoring program, monitoring for receipt of the distributed computing application's signaling within a monitoring interval to check the distributed computing application's continued operation on the node.
  - 4. The method of claim 1 wherein the action to restore the distributed computing application comprises restarting the distributed computing application on the node.
  - 5. The method of claim 1 wherein the action to restore the distributed computing application comprises iteratively attempting to restart the distributed computing application on the node at increasingly longer intervals.
  - 6. The method of claim 1 wherein the action to restore the distributed computing application comprises, while the distributed computing application remains inoperative, attempting to restart the distributed computing application one or more times in a plurality of restart modes, at least one of the restart modes having a longer interval between restart attempts than in another of the restart modes.
  - 7. The method of claim 1 wherein the action to restore the distributed computing application comprises reinstalling the software for the distributed computing application on the node.
  - 8. The method of claim 1 wherein the action to restore the distributed computing application comprises reinstalling a latest update version of the software for the distributed computing application on the node.
  - 9. The method of claim 1 wherein the action to restore the distributed computing application comprises reinstalling a version of the software for the distributed computing application on the node that was previously known to run without failure on the node.
  - 10. The method of claim 1 wherein the action to restore the distributed computing application comprises logging information of the failure.
  - 11. The method of claim 1 wherein the action to restore the distributed computing application comprises transmitting information of the failure to an administrative server or data center for the distributed computing application.
  - 12. The method of claim 1 wherein the action to restore the distributed computing application comprises sending an alert to a human administrator of the distributed computing application.
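Claims 5 and 6 describe restart attempts spaced at increasingly longer intervals, or grouped into modes with different intervals. One way such schedules might look is sketched below; the function names, the multiplicative growth factor, and the two-mode split are assumptions made for illustration and are not specified in the claims.

```python
import time
from typing import Callable, Iterable, Iterator


def growing_delays(first_delay_s: float, factor: float,
                   attempts: int) -> Iterator[float]:
    """Claim 5 style: each wait is longer than the previous one (assumed geometric)."""
    delay = first_delay_s
    for _ in range(attempts):
        yield delay
        delay *= factor


def two_mode_delays(fast_attempts: int, fast_delay_s: float,
                    slow_attempts: int, slow_delay_s: float) -> Iterator[float]:
    """Claim 6 style: a mode with short waits followed by a mode with longer waits."""
    for _ in range(fast_attempts):
        yield fast_delay_s
    for _ in range(slow_attempts):
        yield slow_delay_s


def restart_until_running(try_restart: Callable[[], bool],
                          delays: Iterable[float],
                          sleep: Callable[[float], None] = time.sleep) -> int:
    """Wait, then attempt a restart, for each delay in turn.

    Returns the 1-based number of the attempt that succeeded, or 0 if none did.
    `try_restart` is a hypothetical callback returning True once the
    application is running again.
    """
    for attempt, delay in enumerate(delays, start=1):
        sleep(delay)
        if try_restart():
            return attempt
    return 0
```

For example, `restart_until_running(try_restart, growing_delays(1.0, 2.0, 5))` would wait 1, 2, 4, 8, and 16 seconds before successive attempts, stopping at the first success.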

13. A computer-implemented method of enhancing fault-tolerance of an application provided at nodes of a distributed network via an application service provider model, the method comprising:
- periodically during execution of an application service provider agent program on a node, generating an event signaling continued operation of said agent program on the node;
  
  at periodic intervals, checking that the event was generated during a current interval;
  
  if the event was not generated in the interval, restoring the application service provider agent to operation by:
  
  at least once restarting the application service provider agent;
  
  if restarting does not restore the application service provider agent, reinstalling software of the application service provider agent on the node and restarting the application service provider agent;
  
  if reinstalling the application service provider agent does not restore the application service provider agent, transmitting notification of the application service provider agent's failure on the node to a data center for the application service provider.
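The escalation order recited in claim 13 (restart, then reinstall and restart, then notify the data center) can be sketched as a short function. The callbacks `restart`, `reinstall`, and `notify_data_center`, the return strings, and the notification text are hypothetical stand-ins; the claim does not prescribe any particular interface.

```python
from typing import Callable


def restore_agent(restart: Callable[[], bool],
                  reinstall: Callable[[], bool],
                  notify_data_center: Callable[[str], None],
                  restart_attempts: int = 1) -> str:
    """Escalate through the claim 13 steps and report which one ended the attempt.

    `restart` and `reinstall` are assumed to return True on success.
    """
    # Step 1: restart the agent at least once.
    for _ in range(max(1, restart_attempts)):
        if restart():
            return "restarted"

    # Step 2: reinstall the agent software, then restart it.
    if reinstall() and restart():
        return "reinstalled"

    # Step 3: neither step worked, so report the failure to the data center.
    notify_data_center("agent failed after restart and reinstall")
    return "reported"
```

A monitor would call `restore_agent` only after a missed heart beat, so the happy path never touches the reinstall or notification steps.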

14. A fault-tolerant application service providing system of distributed computing nodes communicating via a data network, comprising:
- an application service providing data center;
  
  a computing node interconnected via the data network with the application service providing data center;
  
  on the computing node, an application service providing agent for providing an application on the computing node administered via the application service providing data center;
  
  a monitor program on the computing node for monitoring continued operation of the application service providing agent, and operating upon detecting failure of the application service providing agent to initiate a restorative action to restore the application service providing agent to operation on the node.
- Dependent claims (15, 16, 17, 18, 19, 20, 21, 22):
  - 15. The fault-tolerant application service providing system of claim 14 wherein the monitor program further operates to report failure of the application service providing agent on the node to the application service providing data center.
  - 16. The fault-tolerant application service providing system of claim 14 wherein the monitor program further operates to report failure of the application service providing agent on the node to the application service providing data center when the restorative action fails to restore the application service providing agent to operation on the node.
  - 17. The fault-tolerant application service providing system of claim 14 wherein the restorative action comprises restarting the application service providing agent on the node.
  - 18. The fault-tolerant application service providing system of claim 14 wherein the restorative action comprises initiating restarts of the application service providing agent on the node, initially at shorter restart intervals and later at longer intervals, thereby permitting a temporary low resource availability condition to be alleviated.
  - 19. The fault-tolerant application service providing system of claim 14 wherein the restorative action comprises obtaining from the application service providing data center and reinstalling a current version of the application service providing agent on the node.
  - 20. The fault-tolerant application service providing system of claim 14 wherein the restorative action comprises reinstalling a version of the application service providing agent on the node that is recorded to have most recently successfully operated on the node.
  - 21. The fault-tolerant application service providing system of claim 14 wherein the restorative action comprises logging failure of the application service providing agent on the node.
  - 22. The fault-tolerant application service providing system of claim 14 wherein the restorative action comprises uploading information of the failure to the application service providing data center.
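Claims 19 and 20 distinguish between reinstalling the current version and reinstalling the version recorded as having most recently operated successfully. Assuming the node keeps a simple run history (a hypothetical record format, not one described in the claims), the second choice might be made like this:

```python
from typing import List, Optional, Tuple


def last_known_good(history: List[Tuple[str, bool]]) -> Optional[str]:
    """Pick the most recent version recorded as having run successfully.

    `history` is an assumed record of (version, ran_successfully) pairs,
    ordered oldest first. Returns None if no version ever ran successfully.
    """
    for version, ran_ok in reversed(history):
        if ran_ok:
            return version
    return None
```

With a history of `[("1.0", True), ("1.1", True), ("1.2", False)]`, the function returns `"1.1"`, skipping the latest version because it is the one that failed.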

23. A computer-readable media for carrying a fault-tolerance enhancing program for a distributed computing application, the program comprising for execution at a computing node on a data network:
- means for monitoring continued operation of the distributed computing application at the computing node to detect failure of the distributed computing application to continually operate on the computing node;
  
  means responsive to the failure being detected, for initiating actions to restore the distributed computing application to operation on the computing node; and
  
  means responsive to failure to restore operation of the distributed computing application on the computing node, for transmitting information of the failure to a distributed computing application administering server on the data network.


Current Assignee: Secure Resolutions Incorporated
Original Assignee: Secure Resolutions Incorporated
Inventors: Vigue, Charles Leslie; Huang, Ricky Y.; Melchione, Daniel Joseph
Application Number: US10/421,493
Publication Number: US 20040153703A1
US Class Current: 714/4
CPC Class Codes:
G06F 11/0748   in a remote unit communicat...
G06F 11/0793   Remedial or corrective acti...
G06F 11/3006   where the computing system ...
G06F 11/302   where the computing system ...
G06F 11/3055   Monitoring arrangements for...
G06F 11/3089   Monitoring arrangements det...
