Method and apparatus for managing software errors in a computer system
Abstract
A method for managing a system includes monitoring a plurality of applications running in the system for errors. A prediction is made as to whether errors detected would result in a failure. Fault recovery is initiated in response to a failure prediction. According to one aspect of the present invention, monitoring the plurality of applications includes reading error recorders associated with error occurrence. Other embodiments are described and claimed.
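The monitoring and prediction steps summarized above can be sketched as follows. This is a minimal illustration, not the patented method: the `ErrorRecorder` class, the sliding-window rule, and the `window` and `threshold` parameters are all assumptions introduced here, since the abstract does not say how the prediction is computed.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ErrorRecorder:
    """Hypothetical per-application recorder of error timestamps (seconds)."""
    app: str
    timestamps: deque = field(default_factory=deque)

    def record(self, t: float) -> None:
        # Note one error occurrence at time t.
        self.timestamps.append(t)

    def count_since(self, t0: float) -> int:
        # Number of recorded errors at or after time t0.
        return sum(1 for t in self.timestamps if t >= t0)


def predict_failure(rec: ErrorRecorder, now: float,
                    window: float = 60.0, threshold: int = 3) -> bool:
    """Assumed rule: predict a failure once the number of errors in the
    last `window` seconds reaches `threshold`."""
    return rec.count_since(now - window) >= threshold


def monitor(recorders, now: float):
    """Read every recorder and return the applications predicted to fail."""
    return [r.app for r in recorders if predict_failure(r, now)]
```

A caller would record errors as they occur and run `monitor` periodically; any application it returns becomes a candidate for fault recovery before it actually fails.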
52 Citations
25 Claims
1. A method for managing a system, comprising:
- monitoring a plurality of software applications running in the system for software errors;
- predicting whether software errors detected would result in a failure of one of the plurality of software applications; and
- initiating fault recovery in response to a failure prediction by performing audits to check communication links with other applications with which the one of the plurality of software applications is interfacing if a predicted failure is due to errors in interprocess communication mechanisms, and performing one of restarting the one of the plurality of software applications, and initiating failover of the one of the plurality of software applications prior to its failure to change a condition of the system.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 23, 24, 25
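The fault recovery step of claim 1 can be illustrated with a short sketch. Everything named here (`Cause`, `Prediction`, `audit_links`, `recover`, the `link_ok` callback) is hypothetical. The claim does not say how to choose between restart and failover, so the sketch assumes failover is used when a standby instance exists.

```python
from dataclasses import dataclass
from enum import Enum


class Cause(Enum):
    IPC = "interprocess communication"
    OTHER = "other"


@dataclass
class Prediction:
    app: str            # application predicted to fail
    cause: Cause        # suspected source of the errors
    has_standby: bool   # assumption: whether a standby instance is available


def audit_links(app, peers, link_ok):
    """Check the link from app to each peer; return the peers that fail."""
    return [p for p in peers if not link_ok(app, p)]


def recover(pred, peers, link_ok):
    """Return the recovery actions to take before the application fails."""
    actions = []
    if pred.cause is Cause.IPC:
        # Audits run only when the predicted failure stems from IPC errors.
        actions.append(("audit", tuple(audit_links(pred.app, peers, link_ok))))
    # Assumed policy: fail over if a standby exists, otherwise restart.
    actions.append(("failover" if pred.has_standby else "restart", pred.app))
    return actions
```

Returning a list of actions rather than executing them keeps the decision logic separate from the mechanics of restarting or failing over, which depend on the platform.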
12. An article of manufacture comprising a machine-accessible medium including sequences of instructions, the sequences of instructions including instructions which when executed cause the machine to perform:
- monitoring a plurality of software applications running in the system for software errors;
- predicting whether software errors detected would result in a failure of one of the plurality of software applications; and
- initiating fault recovery in response to a failure prediction prior to a failure of the one of the software applications by performing audits to check communication links with other applications with which the one of the plurality of software applications is interfacing if a predicted failure is due to errors in interprocess communication mechanisms, and performing one of restarting the one of the plurality of software applications, and initiating failover of the one of the plurality of software applications.
Dependent claims: 13, 14, 15, 16
17. A computer system, comprising:
- a bus;
- a memory;
- a processor; and
- a fault prediction module that includes a fault detection unit to monitor a plurality of software applications running in the system for software errors, a failure prediction unit to predict whether software errors detected will result in a failure in one of the plurality of software applications, and a fault recovery unit to initiate fault recovery in response to a failure prediction to change a condition of one of the computer system and the one of the plurality of software applications prior to a failure of the one of the plurality of software applications by performing audits to check communication links with other applications with which the one of the plurality of software applications is interfacing if a predicted failure is due to errors in interprocess communication mechanisms, and performing one of restarting the one of the plurality of software applications, and initiating failover of the one of the plurality of software applications.
Dependent claims: 18, 19, 20, 21, 22
Specification