Method and system for transparent time-based selective software rejuvenation
Abstract
A method of enhancing software dependability includes measuring an elapsed time in a software system running on a computer, determining whether the elapsed time matches a threshold, and, when the elapsed time matches the threshold, rejuvenating at least a portion of the software system to reduce the likelihood of an outage, without modifying an application running in the software system.
28 Claims
1. A method of enhancing software dependability, comprising:
measuring a time elapsed in a software system running on a computer;
determining whether said elapsed time matches a threshold; and
when said elapsed time matches said threshold, rejuvenating at least a portion of said software system to reduce a likelihood of an outage and without modifying an application running in said software system.
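
For illustration only, the three steps of claim 1 (measure, compare, rejuvenate) might be sketched as follows. The names `RejuvenationTimer`, `threshold_s`, and the `rejuvenate` callback are hypothetical and do not appear in the patent, and "matches a threshold" is assumed here to mean "has reached or passed it". The callback stands in for whatever restarts the affected portion of the system; the application itself is not changed.

```python
import time
from typing import Callable


class RejuvenationTimer:
    """Hypothetical sketch: trigger rejuvenation once elapsed time reaches a threshold."""

    def __init__(self, threshold_s: float, rejuvenate: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold_s = threshold_s
        self.rejuvenate = rejuvenate   # external action; the application is not modified
        self.clock = clock             # injectable so the sketch can be tested
        self.started_at = clock()

    def elapsed(self) -> float:
        """Time elapsed since the start or since the last rejuvenation."""
        return self.clock() - self.started_at

    def poll(self) -> bool:
        """Rejuvenate and reset the timer if the threshold has been reached."""
        if self.elapsed() >= self.threshold_s:
            self.rejuvenate()
            self.started_at = self.clock()
            return True
        return False
```

A monitoring loop would call `poll()` periodically; injecting the clock keeps the example deterministic.
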
2. A method for software rejuvenation, comprising:
waiting for a selected inter-rejuvenation interval to expire in a software system;
determining whether a fail-to node has adequate resources to accept a failover workload;
if said determining determines that the fail-to node cannot accept the failover workload, then sending an alert that adequate resources do not exist to support fault tolerance requirements;
suspending rejuvenation until an operator acknowledges and corrects the deficiency; and
rejuvenating said software without modifying an application running in said software system.
3. The method according to claim 2, further comprising:
if the determining determines the fail-to node can accept the failover workload, then a rejuvenation agent on a first node instructing a cluster manager to shut down an open application in a pre-planned manner on the first node; and
restarting the application on a second node.
4. The method according to claim 2, further comprising:
if the determining determines the fail-to node can accept the failover workload, then a rejuvenation agent on a node instructing a cluster manager to shut down an open application in a pre-planned manner on the node; and
subsequently restarting the application on said node.
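
The flow recited in claims 2 to 4 (check the fail-to node, alert and suspend if it lacks resources, otherwise shut down and restart the application) could look roughly like the sketch below. Everything named here is hypothetical: `Node`, `ClusterManager`, `run_rejuvenation`, and the use of free memory as the only resource checked are assumptions made for the example, not details from the patent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    free_memory_mb: int                      # assumed stand-in for "adequate resources"
    running: List[str] = field(default_factory=list)


@dataclass
class ClusterManager:
    alerts: List[str] = field(default_factory=list)

    def shut_down(self, node: Node, app: str) -> None:
        node.running.remove(app)             # planned, orderly shutdown

    def restart(self, node: Node, app: str) -> None:
        node.running.append(app)


def run_rejuvenation(cm: ClusterManager, source: Node, fail_to: Node,
                     app: str, required_mb: int) -> str:
    """Hypothetical rejuvenation-agent step, run when the interval expires."""
    if fail_to.free_memory_mb < required_mb:
        # Alert the operator and suspend; the caller retries after the fix.
        cm.alerts.append(f"{fail_to.name}: inadequate resources to fail over {app}")
        return "suspended"
    cm.shut_down(source, app)
    cm.restart(fail_to, app)                 # fail_to may be the same node (claim 4)
    return "rejuvenated"
```

Passing a different node as `fail_to` corresponds to restarting on a second node (claim 3); passing the same node corresponds to restarting in place (claim 4).
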
5. The method according to claim 2, wherein said software rejuvenation is performed at an application software level.
6. The method according to claim 3, wherein said first node comprises a primary node and said second node comprises a secondary node, said method further comprising:
designating, by the cluster manager, the secondary node as a new primary node, and the primary node as a new secondary node.
7. The method according to claim 2, wherein said rejuvenation is performed in a clustered environment.
8. The method according to claim 2, wherein said rejuvenation is devoid of changing an application running on said system.
9. The method according to claim 2, further comprising:
automatically performing selective software rejuvenation, on a periodic basis, without operator intervention, and at a time which is deemed least disruptive to system operation.
10. The method according to claim 9, wherein said rejuvenation is performed based on one of a time elapsed since a last rejuvenation, and said system having completed a particular workload.
11. The method according to claim 10, wherein said rejuvenation is performed for one of a portion of said system and an entirety of said system.
12. The method according to claim 2, wherein said rejuvenation is performed transparently to an application program running on said system, such that no changes to an application software of said software system are required.
13. The method according to claim 2, wherein said rejuvenation is invoked within a cluster environment, and
wherein cluster management failover services are used to controllably terminate one of an offending subsystem and an application software, and to restart said one of said subsystem and application software on a same or another node in the cluster.
14. The method according to claim 2, further comprising:
prior to invoking rejuvenation in the cluster, checking a fail-to node of the cluster to confirm whether said fail-to node has adequate resources to accept the failed-over workload.
15. The method according to claim 14, further comprising:
if the resource check fails, then informing a system operator that the failover cannot occur, and alerting the operator of the system's inability to perform rejuvenation.
16. The method according to claim 15, wherein said operator takes corrective action to restore the system's fault resilience by at least one of adding processors, adding memory, adding input/output (I/O) devices, adding storage, and rejuvenating the fail-to node to free up resources consumed by aging on the fail-to node.
17. The method according to claim 2, wherein said rejuvenation is performed, transparently to an application software of said system, based on measuring elapsed time, and by signaling to one of an operator and cluster management software to perform a planned rejuvenation.
18. The method according to claim 2, further comprising:
scheduling said rejuvenation to occur at a time of least system workload.
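
One simple way to picture "a time of least system workload" in claim 18 is to pick the quietest hour from a recorded load profile. The sketch below is an assumption made for illustration: the claim does not say how workload is measured, and `least_loaded_hour` and the hour-to-load mapping are hypothetical.

```python
from typing import Mapping


def least_loaded_hour(load_by_hour: Mapping[int, float]) -> int:
    """Return the hour with the lowest recorded load; ties go to the earliest hour."""
    if not load_by_hour:
        raise ValueError("no workload samples available")
    # Iterating hours in ascending order makes min() prefer the earliest on a tie.
    return min(sorted(load_by_hour), key=lambda hour: load_by_hour[hour])
```

A scheduler could then defer a due rejuvenation until the returned hour.
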
19. The method according to claim 2, further comprising:
selectively rejuvenating said system such that only that part of the system that is causing aging is rejuvenated.
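
Claim 19 does not state how the aging part of the system is identified. As a sketch under that caveat, one could flag components whose resource usage has grown by more than a chosen amount across a sampling window and rejuvenate only those. The function `aging_components`, the per-component sample lists, and the `min_growth` cutoff are all hypothetical.

```python
from typing import Dict, List, Sequence


def aging_components(samples: Dict[str, Sequence[float]],
                     min_growth: float) -> List[str]:
    """Return names of components whose usage grew by at least min_growth.

    Growth is measured as last sample minus first sample (an assumed heuristic).
    Components with fewer than two samples are skipped.
    """
    return sorted(
        name
        for name, usage in samples.items()
        if len(usage) >= 2 and usage[-1] - usage[0] >= min_growth
    )
```

Only the returned components would be restarted; the rest of the system keeps running.
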
20. The method according to claim 2, further comprising:
performing said rejuvenation without modifying an application software of said software system.
21. A method for software rejuvenation, comprising:
waiting for a selected inter-rejuvenation interval to expire in a software system;
determining whether a fail-to node has adequate resources to accept a failover workload;
if said determining determines that the fail-to node can accept the failover workload, then a rejuvenation agent on a primary node instructing a cluster manager to shut down an open application in a pre-planned manner on the primary node without modifying an application running in said software system; and
restarting the application on one of the primary node and a secondary node.
22. The method according to claim 21, further comprising:
if said determining determines that the fail-to node cannot accept the failover workload, then sending an alert that adequate resources do not exist to support fault tolerance requirements; and
suspending rejuvenation until an operator acknowledges and corrects the deficiency.
23. A system for increasing software dependability, comprising:
a timer for measuring an elapsed time in a software system running on a computer; and
a management interface, coupled to said timer, for determining whether said elapsed time matches a threshold, wherein when said elapsed time matches said threshold, said management interface rejuvenates at least a portion of said software system to reduce the likelihood of an outage and without modifying an application running in said software system.
24. A system for software rejuvenation, comprising:
a determiner for determining whether a fail-to node has adequate resources to accept a failover workload, upon expiration of an inter-rejuvenation interval; and
a rejuvenation agent on a primary node instructing a cluster manager to shut down an open application in a pre-planned manner on the primary node, when said determiner determines that said fail-to node can accept the failover workload, said rejuvenation agent restarting the application on one of the primary node and a secondary node without modifying the application running on said primary node.
25. A system for enhancing software dependability, comprising:
means for measuring a time elapsed in a software system running on a computer;
means for determining whether said elapsed time matches a threshold; and
means for rejuvenating at least a portion of said software system, when said elapsed time matches said threshold, to reduce a likelihood of an outage and without modifying an application running in said software system.
26. A signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method for computer-implemented dependability of software, said method comprising:
measuring an elapsed time in a software system running on a computer;
determining whether said elapsed time matches a threshold; and
when said elapsed time matches said threshold, rejuvenating at least a portion of said software system to reduce the likelihood of an outage and without modifying an application running in said software system.
27. A signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method for computer-implemented dependability of software, said method comprising:
waiting for a selected inter-rejuvenation interval to expire in a software system;
determining whether a fail-to node has adequate resources to accept a failover workload;
if said determining determines that the fail-to node cannot accept the failover workload, then sending an alert that adequate resources do not exist to support fault tolerance requirements;
suspending rejuvenation until an operator acknowledges and corrects the deficiency; and
rejuvenating said software without modifying an application running in said software system.
28. A signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method for computer-implemented dependability of software, said method comprising:
waiting for a selected inter-rejuvenation interval to expire in a software system;
determining whether a fail-to node has adequate resources to accept a failover workload;
if said determining determines that the fail-to node can accept the failover workload, then a rejuvenation agent on a primary node instructing a cluster manager to shut down an open application in a pre-planned manner on the primary node without modifying the application running on said primary node; and
restarting the application on one of the primary node and a secondary node.
Specification