Systems and methods for real-time monitoring of virtualized environments
Abstract
A method of root cause analysis in a virtualized environment includes receiving a plurality of events from a system monitoring the virtualized environment. The events may include alarms or alerts, such as those raised when a resource reaches or exceeds a threshold. A capacity manager consumes these events and performs event correlation to produce a set of correlated events. The capacity manager performs a root cause analysis on the set of correlated events to identify one or more root causes. The capacity manager further performs an impact analysis to determine how the root cause impacts the system, such as other virtual machines, hosts, or resources in the virtualized environment. Based on the root cause and impact analysis, the capacity manager makes one or more recommendations to address issues with, or to improve the operation and/or performance of, the virtualized environment.
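The event correlation step described in the abstract could be sketched as follows. The abstract does not say how events are correlated, so this sketch assumes one simple rule: events belong together when they share a host and each occurs within a fixed time window of the previous one. The `Event` fields, the `correlate` function, and the 300-second window are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    obj: str          # hypothetical: name of the VM or host that raised the alert
    host: str         # host the object runs on (the host itself for host alerts)
    metric: str       # e.g. "vm_cpu_ready"
    timestamp: float  # seconds


def correlate(events, window=300.0):
    """Group events that share a host and follow the previous event
    in the same group by at most `window` seconds (assumed rule)."""
    groups = []
    open_group = {}  # host -> index of that host's most recent group
    for ev in sorted(events, key=lambda e: e.timestamp):
        idx = open_group.get(ev.host)
        if idx is not None and ev.timestamp - groups[idx][-1].timestamp <= window:
            groups[idx].append(ev)
        else:
            groups.append([ev])
            open_group[ev.host] = len(groups) - 1
    return groups
```

Each returned group is one correlated set of events that a later root cause analysis could take as input.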
18 Claims
1. A computer system comprising:
computer hardware including a computer processor; and
a capacity manager comprising instructions executable by the computer processor to cause the computer hardware to perform operations comprising:
monitoring capacity measurements, including host CPU utilization, host memory utilization, virtual machine CPU utilization, virtual machine CPU ready, virtual machine memory utilization, and virtual machine disk latency, of multiple hosts and multiple virtual machines on a computer network;
receiving alerts that each indicate that an event has occurred in which at least one of the hosts or virtual machines has exceeded a set threshold for one of the capacity measurements for at least a set amount of time;
correlating multiple events and performing a root cause analysis and a first impact analysis upon the correlated set of events, wherein the root cause analysis and first impact analysis include executing, for each of the alerts, a series of determinations in a decision tree that is defined for the specific type of alert that is being analyzed, so as to identify a root cause of the event associated with the alert and to identify other objects within the computer network that are impacted by the events; and
generating a recommended response to mitigate a problem associated with the events, wherein generation of the recommended response includes performing a second impact analysis that determines what impact each of multiple potential responses would have upon the computer system and selecting as the recommended response a first potential response that would have a more positive impact than a second potential response.
Dependent claims: 2, 3, 4, 5, 6
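As an illustration of the alert condition in claim 1 (a capacity measurement exceeding a set threshold for at least a set amount of time), a minimal sketch might look like the following. The function name `sustained_breaches`, the sample layout, and the choice to measure duration between the first and last breaching samples are assumptions made for this example and are not taken from the claims.

```python
def sustained_breaches(samples, threshold, min_duration):
    """Return (start, end) intervals during which the measurement stayed
    above `threshold` for at least `min_duration` seconds.

    `samples` is a list of (timestamp_seconds, value) pairs sorted by time.
    Duration is measured from the first to the last breaching sample
    (an assumption; a real monitor might interpolate between samples).
    """
    intervals = []
    start = None  # timestamp of the first sample in the current breach
    last = None   # timestamp of the most recent breaching sample
    for ts, value in samples:
        if value > threshold:
            if start is None:
                start = ts
            last = ts
        else:
            if start is not None and last - start >= min_duration:
                intervals.append((start, last))
            start = None
    # Close a breach that is still open at the end of the data.
    if start is not None and last - start >= min_duration:
        intervals.append((start, last))
    return intervals
```

A short spike above the threshold produces no interval, so it would not raise an alert under this rule.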
7. A method comprising accessing computer-executable instructions from computer storage and executing the computer-executable instructions on at least one computer processor to cause computer hardware to perform operations comprising:
monitoring capacity measurements of multiple hosts and multiple virtual machines on a computer network;
receiving alerts that each indicate that an event has occurred in which at least one of the hosts or virtual machines has exceeded a set threshold for one of the capacity measurements for at least a set amount of time;
correlating multiple events and performing a root cause analysis and a first impact analysis upon the correlated set of events, wherein the root cause analysis and first impact analysis include executing, for each of the alerts, a series of determinations in a decision tree that is defined for the specific type of alert that is being analyzed, so as to identify a root cause of the event associated with the alert and to identify other objects within the computer network that are impacted by the events; and
generating a recommended response to mitigate a problem associated with the events, wherein generation of the recommended response includes performing a second impact analysis that determines what impact each of multiple potential responses would have upon the computer system.
Dependent claims: 8, 9, 10, 11, 12
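The per-alert-type decision tree recited in the claims can be pictured with a small sketch. The claims do not disclose the contents of any tree, so every question, threshold, and root-cause label below is hypothetical; the sketch only shows the mechanism of selecting a tree by alert type and walking a series of yes/no determinations to a leaf. It covers root cause identification only, not the identification of impacted objects.

```python
# A node is either a leaf (a root-cause string) or a tuple of
# (question, subtree_if_yes, subtree_if_no), where `question` takes a
# context dict of measurements and returns True or False.
# All questions and thresholds here are illustrative assumptions.
DECISION_TREES = {
    "vm_cpu_ready": (
        lambda ctx: ctx["host_cpu_utilization"] > 90,
        "host CPU is overcommitted",
        (
            lambda ctx: ctx["vm_cpu_limit_set"],
            "CPU limit configured on the virtual machine",
            "undetermined",
        ),
    ),
    "vm_disk_latency": (
        lambda ctx: ctx["datastore_vm_count"] > 20,
        "datastore shared by too many virtual machines",
        "undetermined",
    ),
}


def find_root_cause(alert_type, ctx):
    """Walk the tree defined for `alert_type` until a leaf is reached."""
    node = DECISION_TREES[alert_type]
    while not isinstance(node, str):
        question, yes_branch, no_branch = node
        node = yes_branch if question(ctx) else no_branch
    return node
```

Keeping one tree per alert type in a dictionary means a new alert type can be supported by adding an entry, without changing the traversal code.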
13. A tangible computer-readable medium that stores thereon a plurality of computer-executable instructions configured, when executed by a computer processor, to cause computer hardware to perform operations comprising:
monitoring capacity measurements of at least one host and at least one virtual machine on a computer network;
receiving alerts that each indicate that an event has occurred in which at least one of the hosts or virtual machines has exceeded a set threshold for one of the capacity measurements for at least a set amount of time;
correlating multiple events and performing a root cause analysis and a first impact analysis upon the correlated set of events, wherein the root cause analysis and first impact analysis include executing, for each of the alerts, a series of determinations in a decision tree that is defined for the specific type of alert that is being analyzed, so as to identify a root cause of the event associated with the alert and to identify other objects within the computer network that are impacted by the events; and
generating a recommended response to mitigate a problem associated with the events, wherein generation of the recommended response includes performing a second impact analysis that determines what impact each of multiple potential responses would have upon the computer system and selecting as the recommended response a first potential response that would have a more positive impact than a second potential response.
Dependent claims: 14, 15, 16, 17, 18
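The second impact analysis, which compares potential responses and selects the one with the more positive impact, might be sketched as below. The claims do not define how impact is scored; this example assumes that impact is judged by the highest projected host CPU utilization after a response is applied, and that responses are limited to migrating a virtual machine or taking no action. The dictionary keys and function names are hypothetical.

```python
def projected_host_cpu(host_cpu, response):
    """Return projected per-host CPU utilization (percent) after `response`."""
    projected = dict(host_cpu)  # copy so the current state is not modified
    if response["kind"] == "migrate_vm":
        projected[response["source"]] -= response["vm_cpu"]
        projected[response["target"]] += response["vm_cpu"]
    # "no_action" leaves the projection unchanged.
    return projected


def impact_score(host_cpu, response):
    """Higher is better: the negated utilization of the busiest host
    after the response (an assumed scoring rule)."""
    return -max(projected_host_cpu(host_cpu, response).values())


def recommend(host_cpu, candidates):
    """Pick the candidate response with the most positive projected impact."""
    return max(candidates, key=lambda r: impact_score(host_cpu, r))
```

Under this scoring, moving a small virtual machine off an overloaded host can beat moving a large one, because the large move may simply overload the target host instead.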
Specification