Method and system for transparent symptom-based selective software rejuvenation
Abstract
A method (and system) for increased software dependability includes learning how to predict an outage of a software system running on a computer, predicting an imminent outage based on the learning, and avoiding the outage.
42 Claims
1. A method for increased software dependability, comprising:
learning how to predict an outage of a software system running on a computer;
based on said learning, predicting an imminent outage;
avoiding the outage; and
rejuvenating said software without modifying an application running in said software system.
-
selectively rejuvenating said system such that only that part of the system that is causing aging is rejuvenated.
-
8. The method according to claim 1, wherein an aggregation of indicators is performed during said learning to provide a reliable predictor of impending outage.
-
9. A method for software rejuvenation, comprising:
waiting for symptoms associated with an imminent outage of software of a software system;
determining whether a fail-to node has adequate resources to accept a failover workload;
if said determining determines that the fail-to node cannot accept the failover workload, sending an alert that adequate resources do not exist to support fault tolerance requirements;
suspending rejuvenation until an operator acknowledges and corrects a deficiency; and
rejuvenating said software without modifying an application running in said software system.
-
if the determining determines that the fail-to node can accept the failover workload, then a rejuvenation agent on a first node instructing a cluster manager to shut down an open application in a pre-planned manner on the first node; and
restarting the application on a second node.
-
11. The method according to claim 9, further comprising:
if the determining determines that the fail-to node can accept the failover workload, then a rejuvenation agent on a node instructing a cluster manager to shut down an open application in a pre-planned manner on the node; and
restarting the application on the node.
-
12. The method according to claim 10, wherein said first node comprises a primary node and said second node comprises a secondary node, said method further comprising:
designating, by the cluster manager, the secondary node as a new primary node, and the primary node as a new secondary node.
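
The failover path recited in claims 9, 10 and 12 can be illustrated with a minimal sketch. It is not the patented implementation: the `Node` class, the `rejuvenate_by_failover` function, and the use of free memory as the only resource checked are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster node; free memory stands in for 'adequate resources'."""
    name: str
    free_memory_mb: int
    role: str  # "primary" or "secondary"
    apps: list = field(default_factory=list)


def rejuvenate_by_failover(first, second, app, required_mb, alerts):
    """Return True if the application was moved, False if rejuvenation was suspended."""
    # Check the fail-to node before touching the running application.
    if second.free_memory_mb < required_mb:
        alerts.append(f"{second.name} cannot accept failover of {app}")
        return False  # suspended until an operator corrects the deficiency
    # Planned shutdown on the first node, then restart on the second node.
    first.apps.remove(app)
    second.apps.append(app)
    second.free_memory_mb -= required_mb
    # Swap primary and secondary designations, as in claim 12.
    first.role, second.role = second.role, first.role
    return True
```

The application itself is never modified in this sketch; only its placement and the node roles change.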
-
13. The method according to claim 9, wherein said rejuvenation is performed in one of a clustered environment and a single node environment.
-
14. The method according to claim 9, wherein said rejuvenation is devoid of changing any of a source code and an executable code of an application running on said system.
-
15. The method according to claim 9, further comprising:
predicting an impending outage due to resource exhaustion.
-
16. The method according to claim 15, wherein said predicting comprises incorporating one of effects of variance on an extrapolated trend, incorporating time integral tests for secondary indicators, and including increased degrees of variance as symptomatic of outages.
-
17. The method according to claim 15, wherein said predicting comprises using a plurality of indicators in combination to form a predictor of said outage.
-
18. The method according to claim 17, wherein no one of said indicators is necessarily at a global extreme.
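
One way to picture claims 17 and 18 is a combined score that can cross an alarm level even though no single indicator has reached its own limit. The sketch below is only an assumption about how such a combination might look: the weighted-mean formula, the function names, and the 0.75 threshold are hypothetical and do not come from the patent.

```python
def combined_risk(indicators, limits, weights):
    """Weighted mean of each indicator's utilisation (value / limit), capped at 1.0."""
    total_weight = sum(weights[name] for name in indicators)
    score = sum(
        weights[name] * min(indicators[name] / limits[name], 1.0)
        for name in indicators
    )
    return score / total_weight


def outage_predicted(indicators, limits, weights, risk_threshold=0.75):
    """True when the combined score reaches the (assumed) risk threshold."""
    return combined_risk(indicators, limits, weights) >= risk_threshold


# Example: every indicator is at 80% of its limit, so none is at an extreme,
# yet the combination exceeds the assumed 0.75 threshold.
sample = {"handles": 8000, "memory_mb": 1600, "threads": 400}
sample_limits = {"handles": 10000, "memory_mb": 2000, "threads": 500}
sample_weights = {"handles": 1.0, "memory_mb": 1.0, "threads": 1.0}
```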
-
19. The method according to claim 15, wherein said predicting comprises using a single indicator which is approaching a predetermined threshold.
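
For the single-indicator case of claim 19, a simple illustration is to fit a straight line to recent samples and extrapolate to the threshold. The least-squares fit and the function name `time_to_threshold` are assumptions chosen for this sketch; the claims do not prescribe a particular extrapolation technique.

```python
def time_to_threshold(samples, threshold):
    """Estimate time from the last sample until a fitted line reaches threshold.

    samples is a list of (time, value) pairs. Returns None when there are too
    few samples or the fitted trend is flat or decreasing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in samples)
    if sxx == 0:
        return None  # all samples share one timestamp
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / sxx
    if slope <= 0:
        return None  # the indicator is not approaching the threshold
    intercept = mean_v - slope * mean_t
    t_hit = (threshold - intercept) / slope
    # A threshold that is already exceeded is reported as zero time remaining.
    return max(t_hit - samples[-1][0], 0.0)
```

With memory use sampled hourly at 100, 110, 120 and 130 MB and a 200 MB threshold, the fitted slope is 10 MB per hour and the estimate is 7 hours after the last sample.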
-
20. The method according to claim 17, wherein said indicators are used to identify which of any of at least one of a subsystem, a process, and a thread are causing the resource exhaustion.
-
21. The method according to claim 9, wherein said avoiding comprises automatically performing selective software rejuvenation, without operator intervention.
-
22. The method according to claim 9, wherein said rejuvenation is performed for one of a portion of said system and an entirety of said system.
-
23. The method according to claim 9, wherein said rejuvenation is invoked within a cluster environment, and cluster management failover services are used to stop an offending subsystem controllably and to restart said offending subsystem on one of a same node and another node in the cluster.
-
24. The method according to claim 23, further comprising:
prior to invoking rejuvenation in the cluster, checking a fail-to node of the cluster to confirm that said fail-to node has adequate resources to accept the failed-over workload.
-
25. The method according to claim 24, further comprising:
if the resource check fails, then informing a system operator that the fail-to node cannot accept the failed-over workload, and alerting the operator of the system's inability to perform rejuvenation.
-
26. The method according to claim 25, wherein said operator takes corrective action to restore the system's fault resilience by at least one of adding processors, adding memory, adding I/O devices, adding storage, and rejuvenating the fail-to node to free resources consumed by aging on said fail-to node.
-
27. The method according to claim 9, wherein said avoiding includes rejuvenating at least part of said system, said rejuvenation being performed by rejuvenating only prior to an unplanned outage.
-
28. The method according to claim 27, further comprising:
identifying exactly which of at least one of a subsystem, process, and thread is responsible for the resource exhaustion, such that only an offending one of said at least one of said subsystem, process, and thread is rejuvenated.
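
Selective rejuvenation as in claim 28 depends on attributing the resource growth to one component. A minimal sketch, assuming per-component usage histories are available and that the fastest-growing component is the offender (both assumptions, as are the function names):

```python
def growth_rate(series):
    """Average change per sample between the first and last observation."""
    return (series[-1] - series[0]) / (len(series) - 1)


def offending_component(usage_by_component, min_growth=0.0):
    """Return the name of the component whose usage grows fastest, or None."""
    rates = {
        name: growth_rate(series)
        for name, series in usage_by_component.items()
        if len(series) >= 2
    }
    if not rates:
        return None
    name = max(rates, key=rates.get)
    # Only report a component whose growth exceeds the assumed minimum.
    return name if rates[name] > min_growth else None
```

Only the component returned here would be restarted; the others keep running.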
-
29. The method according to claim 28, wherein said identifying comprises non-intrusively monitoring and analyzing a state of said software system so as to predict an impending resource exhaustion-induced outage.
-
30. The method according to claim 28, wherein an aggregation of indicators is performed during said identifying to provide a reliable predictor of impending outage.
-
31. The method according to claim 30, further comprising:
when said aggregation of said indicators approaches a region associated with an increased likelihood of unplanned outage, notifying said system operator to initiate a planned outage.
-
32. The method according to claim 31, wherein said rejuvenation based on identification of said indicators is performed during a next acceptable interval.
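
The "next acceptable interval" of claim 32 can be pictured as the earliest maintenance window that still finishes before the predicted outage. The window representation and the function name below are hypothetical; times are plain numbers in any consistent unit.

```python
def next_acceptable_interval(now, predicted_outage, windows):
    """Return the earliest (start, end) window usable before the predicted outage.

    A window qualifies when it starts at or after now and ends no later than
    the predicted outage time. Returns None if no window qualifies.
    """
    candidates = [
        w for w in sorted(windows)
        if w[0] >= now and w[1] <= predicted_outage
    ]
    return candidates[0] if candidates else None
```

When no window qualifies, a caller would have to fall back to alerting the operator, since a planned rejuvenation cannot be scheduled in time.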
-
33. The method according to claim 9, wherein said rejuvenation is performed, transparently to an application software of said system, based on measuring an earlier one of at least one of elapsed time and indicative symptoms, and by signaling an impending unplanned outage to one of an operator and a cluster management software to perform a planned rejuvenation.
-
34. A method for software rejuvenation, comprising:
waiting for symptoms associated with an imminent outage of software of a software system;
determining whether a fail-to node has adequate resources to accept a failover workload;
if the determining determines the fail-to node can accept the failover workload, then a rejuvenation agent on a first node instructing a cluster manager to shut down an open application in a pre-planned manner on the first node; and
restarting the application on a second node without modifying the application running on said first node.
-
if said determining determines that the fail-to node cannot accept the failover workload, then sending an alert that adequate resources do not exist to support fault tolerance requirements; and
suspending rejuvenation until an operator acknowledges and corrects a deficiency.
-
36. The method according to claim 35, wherein said first node comprises a primary node and said second node comprises a secondary node, further comprising:
designating, by the cluster manager, the second node as a new primary node, and the first node as a new secondary node.
-
37. The method according to claim 34, further comprising:
after said waiting, selecting an appropriate rejuvenation time.
-
38. A system for increased software dependability, comprising:
a learning unit for learning how to predict an outage of a software system running on a computer;
a predictor for predicting, based on an output from said learning unit, an imminent outage of said software system; and
a rejuvenation agent for avoiding the outage, wherein the rejuvenation agent rejuvenates said software without modifying an application running in said software system.
-
39. A system for increasing software dependability, comprising:
a sensing unit for sensing symptoms associated with an imminent outage of said software;
a determiner for determining whether a fail-to node has adequate resources to accept a failover workload upon said sensing unit sensing said symptoms; and
a rejuvenation agent, based on an output from said determining unit that the fail-to node cannot accept the failover workload, and for sending an alert that adequate resources do not exist to support fault tolerance requirements, said rejuvenating agent suspending rejuvenation until an operator acknowledges and corrects a deficiency, wherein said rejuvenation agent rejuvenates said software without modifying an application running in said software system.
-
40. A system for increased software dependability, comprising:
means for learning how to predict an outage of a software system running on a computer;
means for predicting, based on an output from said learning means, an imminent outage of said software system;
means for avoiding the outage; and
means for performing software rejuvenation without modifying an application running in said software system.
-
41. A signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method for increasing software dependability, said method comprising:
learning how to predict an outage of a software system running on a computer;
based on said learning, predicting an imminent outage;
avoiding the outage; and
rejuvenating said software without modifying an application running in said software system.
-
42. A signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method for computer-implemented dependability of software, said method comprising:
waiting for symptoms associated with an imminent outage of a software system;
determining whether a fail-to node has adequate resources to accept a failover workload;
if said determining determines that the fail-to node cannot accept the failover workload, sending an alert that adequate resources do not exist to support fault tolerance requirements;
suspending rejuvenation until an operator acknowledges and corrects a deficiency; and
rejuvenating said software without modifying an application running in said software system.