Proactive and adaptive cloud monitoring
Abstract
Processes, computer-readable media, and machines are disclosed for reducing a likelihood that active functional components fail in a computing system. An active monitoring component receives metrics associated with different active functional components of a computing system. The different active functional components contribute to different functionalities of the system. Based at least in part on the metrics associated with a particular active functional component, the active monitoring component determines that the particular active functional component has reached a likelihood of failure but has not failed. In response to determining that the particular active functional component has reached the likelihood of failure but has not failed, the active monitoring component causes a set of actions that are predicted to reduce the likelihood of failure.
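The selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `ComponentStatus` record, the single shared `threshold`, and the `actions_by_component` lookup are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ComponentStatus:
    """Hypothetical snapshot of one active functional component."""
    name: str
    metric_value: float  # assumed normalized reading, e.g. utilization
    failed: bool         # True if the component has already failed


def at_risk_components(statuses: List[ComponentStatus], threshold: float) -> List[str]:
    """Names of components whose metric exceeds the threshold but that have not failed."""
    return [s.name for s in statuses if not s.failed and s.metric_value > threshold]


def plan_actions(
    statuses: List[ComponentStatus],
    threshold: float,
    actions_by_component: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Map each at-risk component to the actions predicted to reduce its risk."""
    return {
        name: actions_by_component.get(name, [])
        for name in at_risk_components(statuses, threshold)
    }
```

Components that have already failed are deliberately excluded, since the abstract targets components that have reached a likelihood of failure but have not failed.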
52 Citations
20 Claims
1. A method comprising:
determining that an active functional component of a computer system has reached a likelihood of failure based at least in part on a first value of a metric exceeding a metric threshold;
wherein the first value of the metric reflects performance of the active functional component at a time before a first action is caused to be performed in an attempt to reduce the likelihood of failure of the active functional component;
causing the first action to be performed in an attempt to reduce the likelihood of failure of the active functional component;
after causing the first action to be performed, determining that the active functional component has again reached the likelihood of failure based at least in part on a second value of the metric exceeding the metric threshold;
wherein the second value of the metric reflects performance of the active functional component at a time after the first action is caused to be performed;
causing a second action, that is different from the first action, to be performed in an attempt to reduce the likelihood of failure of the active functional component;
after causing the second action to be performed, obtaining a third value for the metric reflecting performance of the active functional component at a time after the second action is caused to be performed; and
using a machine learning component to determine a third action to be performed to reduce the likelihood of failure of the active functional component based at least in part on the second and the third values of the metric and the first and second actions caused to be performed; and
wherein the method is performed by one or more computing devices.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 19, 20
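The sequence recited in claim 1 can be sketched as below. This is an illustration under stated assumptions, not the claimed implementation: `read_metric` and `perform` are hypothetical callbacks, and `choose_third_action` is a simple hand-written rule standing in for the machine learning component, whose actual model the claim does not specify.

```python
from typing import Callable, List, Optional, Tuple

History = List[Tuple[str, float]]  # (action name, metric value observed after it)


def choose_third_action(history: History, candidates: List[str], threshold: float) -> str:
    """Toy stand-in for the machine learning component (an assumption).

    If every action tried so far left the metric above the threshold, prefer an
    untried candidate; otherwise repeat the action followed by the lowest value.
    """
    tried = {action for action, _ in history}
    untried = [a for a in candidates if a not in tried]
    if untried and all(value > threshold for _, value in history):
        return untried[0]
    return min(history, key=lambda h: h[1])[0]


def remediate(
    read_metric: Callable[[], float],
    perform: Callable[[str], None],
    threshold: float,
    first_action: str,
    second_action: str,
    candidates: List[str],
) -> Optional[str]:
    """Return the proposed third action, or None if the risk cleared earlier."""
    if read_metric() <= threshold:       # first value: no likelihood of failure
        return None
    perform(first_action)
    second_value = read_metric()         # measured after the first action
    if second_value <= threshold:
        return None
    perform(second_action)               # a different action from the first
    third_value = read_metric()          # measured after the second action
    history = [(first_action, second_value), (second_action, third_value)]
    return choose_third_action(history, candidates, threshold)
```

The third action is chosen from the second and third metric values together with the two actions already performed, which mirrors the inputs the claim lists for the machine learning component.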
9. A method comprising:
determining that an active functional component of a computer system has reached a likelihood of failure based at least in part on a first value of a first metric exceeding a first metric threshold;
wherein the first value of the first metric reflects performance of the active functional component at a time before a first action is caused to be performed a first time in an attempt to reduce the likelihood of failure of the active functional component;
causing the first action to be performed the first time in an attempt to reduce the likelihood of failure of the active functional component;
after causing the first action to be performed the first time, determining that the active functional component has again reached the likelihood of failure based at least in part on a second value of a second metric exceeding a second metric threshold that is different than the first metric threshold;
wherein the second value of the second metric reflects performance of the active functional component at a time after the first action is caused to be performed the first time;
causing the first action to be performed a second time in an attempt to reduce the likelihood of failure of the active functional component;
after causing the first action to be performed the second time, obtaining a third value for a third metric reflecting performance of the active functional component at a time after the first action is caused to be performed the second time; and
using a machine learning component to determine a third metric threshold for triggering the first action to be performed a third time to reduce the likelihood of failure of the active functional component based at least in part on the second value of the second metric, the third value of the third metric, and the first and second metric thresholds; and
wherein the method is performed by one or more computing devices.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
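The threshold adaptation recited in claim 9 can be sketched as below. The update rule is a hand-written heuristic standing in for the machine learning component, not the claimed model. It also assumes, for simplicity, that the three metrics share one normalized scale; the function name and the `step` parameter are hypothetical.

```python
def learn_third_threshold(
    first_threshold: float,
    second_threshold: float,
    second_value: float,
    third_value: float,
    step: float = 0.5,
) -> float:
    """Heuristic stand-in for the machine learning component (an assumption).

    Each earlier threshold is paired with the metric value observed after the
    action it triggered. The threshold followed by the lower value is taken as
    the base, then lowered in proportion to how far the metric still sat above
    it, so that the action fires earlier the third time.
    """
    observations = [
        (first_threshold, second_value),   # value after the first run of the action
        (second_threshold, third_value),   # value after the second run of the action
    ]
    base, value_after = min(observations, key=lambda o: o[1])
    overshoot = max(value_after - base, 0.0)
    return base - step * overshoot
```

The function uses exactly the four inputs the claim names: the second and third metric values and the first and second thresholds. A trained model could replace the rule without changing that interface.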
17. One or more non-transitory computer readable media storing instructions which, when executed by one or more processors cause performance of:
determining that an active functional component of a computer system has reached a likelihood of failure based at least in part on a first value of a metric exceeding a metric threshold;
wherein the first value of the metric reflects performance of the active functional component at a time before a first action is caused to be performed in an attempt to reduce the likelihood of failure of the active functional component;
causing the first action to be performed in an attempt to reduce the likelihood of failure of the active functional component;
after causing the first action to be performed, determining that the active functional component has again reached the likelihood of failure based at least in part on a second value of the metric exceeding the metric threshold;
wherein the second value of the metric reflects performance of the active functional component at a time after the first action is caused to be performed;
causing a second action, that is different from the first action, to be performed in an attempt to reduce the likelihood of failure of the active functional component;
after causing the second action to be performed, obtaining a third value for the metric reflecting performance of the active functional component at a time after the second action is caused to be performed; and
using a machine learning component to determine a third action to be performed to reduce the likelihood of failure of the active functional component based at least in part on the second and the third values of the metric and the first and second actions caused to be performed.
18. One or more non-transitory computer readable media storing instructions which, when executed by one or more processors cause performance of:
determining that an active functional component of a computer system has reached a likelihood of failure based at least in part on a first value of a first metric exceeding a first metric threshold;
wherein the first value of the first metric reflects performance of the active functional component at a time before a first action is caused to be performed a first time in an attempt to reduce the likelihood of failure of the active functional component;
causing the first action to be performed the first time in an attempt to reduce the likelihood of failure of the active functional component;
after causing the first action to be performed the first time, determining that the active functional component has again reached the likelihood of failure based at least in part on a second value of a second metric exceeding a second metric threshold that is different than the first metric threshold;
wherein the second value of the second metric reflects performance of the active functional component at a time after the first action is caused to be performed the first time;
causing the first action to be performed a second time in an attempt to reduce the likelihood of failure of the active functional component;
after causing the first action to be performed the second time, obtaining a third value for a third metric reflecting performance of the active functional component at a time after the first action is caused to be performed the second time; and
using a machine learning component to determine a third metric threshold for triggering the first action to be performed a third time to reduce the likelihood of failure of the active functional component based at least in part on the second value of the second metric, the third value of the third metric, and the first and second metric thresholds.
Specification