Systems and methods for real-time monitoring of virtualized environments

US 8,738,972 B1
Filed: 02/03/2012
Issued: 05/27/2014
Est. Priority Date: 02/04/2011
Status: Active Grant

23 Assignments

0 Petitions

Abstract

A method of root cause analysis in a virtualized environment includes receiving a plurality of events from a system monitoring the virtualized environment. The events may include alarms or alerts, such as those raised when a resource reaches or exceeds a threshold. A capacity manager consumes these events and performs event correlation to produce a set of correlated events. The capacity manager performs a root cause analysis on the set of correlated events to identify one or more root causes. The capacity manager further performs an impact analysis to determine how the root cause impacts the system, such as other virtual machines, hosts, or resources in the virtualized environment. Based on the root cause and impact analysis, the capacity manager makes one or more recommendations to address issues with, or to improve the operations and/or performance of, the virtualized environment.
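
The event correlation step in the abstract could be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how events are correlated, so the rule used here (same host, timestamps within a fixed window of the previous event) and the names `Event`, `correlate`, and `window` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    obj: str          # VM or host that raised the alert (hypothetical field)
    host: str         # host the object runs on; equals obj for a host
    metric: str       # e.g. "vm_cpu_ready"
    timestamp: float  # seconds


def correlate(events: List[Event], window: float = 300.0) -> List[List[Event]]:
    """Group events on the same host that occur within `window` seconds
    of the previous event in the group (an assumed correlation rule)."""
    groups: List[List[Event]] = []
    for ev in sorted(events, key=lambda e: (e.host, e.timestamp)):
        last = groups[-1][-1] if groups else None
        if last and last.host == ev.host and ev.timestamp - last.timestamp <= window:
            groups[-1].append(ev)
        else:
            groups.append([ev])
    return groups


events = [
    Event("vm1", "h1", "vm_cpu_ready", 0.0),
    Event("h1", "h1", "host_cpu_util", 100.0),
    Event("vm2", "h2", "vm_mem_util", 120.0),
    Event("vm3", "h1", "vm_disk_latency", 1000.0),
]
groups = correlate(events)
```

Each resulting group is a candidate "set of correlated events" that later analysis steps would consume.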

18 Claims

1. A computer system comprising:
- computer hardware including a computer processor; and
  
  a capacity manager comprising instructions executable by the computer processor to cause the computer hardware to perform operations comprising:
  
  monitoring capacity measurements, including host CPU utilization, host memory utilization, virtual machine CPU utilization, virtual machine CPU ready, virtual machine memory utilization, and virtual machine disk latency, of multiple hosts and multiple virtual machines on a computer network;
  
  receiving alerts that each indicate that an event has occurred in which at least one of the hosts or virtual machines has exceeded a set threshold for one of the capacity measurements for at least a set amount of time;
  
  correlating multiple events and performing a root cause analysis and a first impact analysis upon the correlated set of events, wherein the root cause analysis and first impact analysis include executing, for each of the alerts, a series of determinations in a decision tree that is defined for the specific type of alert that is being analyzed, so as to identify a root cause of the event associated with the alert and to identify other objects within the computer network that are impacted by the events; and
  
  generating a recommended response to mitigate a problem associated with the events, wherein generation of the recommended response includes performing a second impact analysis that determines what impact each of multiple potential responses would have upon the computer system and selecting as the recommended response a first potential response that would have a more positive impact than a second potential response.
  - 2. The computer system of claim 1, wherein correlation of multiple events includes identifying a significance, a usefulness, and a type for each event.
  - 3. The computer system of claim 1, wherein the capacity manager also filters the events to remove at least some data associated with the events.
  - 4. The computer system of claim 1, wherein the root cause analysis further comprises arranging the set of correlated events in sequence.
  - 5. The computer system of claim 1, wherein the root cause analysis further comprises taking into account a degree to which a capacity measurement for an event is extreme in comparison to capacity measurements for other events.
  - 6. The computer system of claim 1, wherein one or both of the first and second impact analysis includes performing predictive modeling using a plurality of variables, systems constraints, collected metrics, user behavior, and historical data.
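
The alert condition recited in claim 1 (a capacity measurement exceeding a set threshold for at least a set amount of time) might be checked along the following lines. This is a minimal sketch, not the patented implementation; the function name `sustained_breach`, the sample format, and the example values are assumptions.

```python
from typing import Iterable, Optional, Tuple


def sustained_breach(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    min_duration: float,
) -> Optional[float]:
    """Return the timestamp at which an alert would fire, or None.

    `samples` are (timestamp_seconds, value) pairs in ascending time order.
    An alert fires once the value has stayed above `threshold` continuously
    for at least `min_duration` seconds.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach
            if ts - breach_start >= min_duration:
                return ts
        else:
            breach_start = None  # breach interrupted; start over
    return None


# Hypothetical host CPU utilization samples, one per minute.
cpu = [(0, 50), (60, 92), (120, 95), (180, 80),
       (240, 91), (300, 93), (360, 97), (420, 99)]
fired_at = sustained_breach(cpu, threshold=90, min_duration=180)
```

Requiring a minimum duration keeps a brief spike (here, the two samples at 60 s and 120 s) from raising an alert.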

7. A method comprising accessing computer-executable instructions from computer storage and executing the computer-executable instructions on at least one computer processor to cause computer hardware to perform operations comprising:
- monitoring capacity measurements of multiple hosts and multiple virtual machines on a computer network;
  
  receiving alerts that each indicate that an event has occurred in which at least one of the hosts or virtual machines has exceeded a set threshold for one of the capacity measurements for at least a set amount of time;
  
  correlating multiple events and performing a root cause analysis and a first impact analysis upon the correlated set of events, wherein the root cause analysis and first impact analysis include executing, for each of the alerts, a series of determinations in a decision tree that is defined for the specific type of alert that is being analyzed, so as to identify a root cause of the event associated with the alert and to identify other objects within the computer network that are impacted by the events; and
  
  generating a recommended response to mitigate a problem associated with the events, wherein generation of the recommended response includes performing a second impact analysis that determines what impact each of multiple potential responses would have upon the computer system.
  - 8. The method of claim 7, wherein correlation of multiple events includes identifying a significance, a usefulness, and a type for each event.
  - 9. The method of claim 7, further comprising filtering the events to remove at least some data associated with the events.
  - 10. The method of claim 7, wherein the root cause analysis further comprises arranging the set of correlated events in sequence.
  - 11. The method of claim 7, wherein the root cause analysis further comprises taking into account a degree to which a capacity measurement for an event is extreme in comparison to capacity measurements for other events.
  - 12. The method of claim 7, wherein one or both of the first and second impact analysis includes performing predictive modeling using a plurality of variables, systems constraints, collected metrics, user behavior, and historical data.
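
The independent claims recite a decision tree defined per alert type, walked as a series of determinations to reach a root cause. One way to picture this is sketched below. The tree contents, metric names, thresholds, and root-cause labels are all invented for illustration; the claims do not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

Metrics = Dict[str, float]


@dataclass
class Node:
    question: Callable[[Metrics], bool]  # one determination in the tree
    if_yes: Union["Node", str]           # subtree, or a root-cause label
    if_no: Union["Node", str]


def walk(node: Union[Node, str], metrics: Metrics) -> str:
    """Follow determinations until a leaf (root-cause label) is reached."""
    while isinstance(node, Node):
        node = node.if_yes if node.question(metrics) else node.if_no
    return node


# One tree per alert type (hypothetical contents).
TREES: Dict[str, Node] = {
    "vm_cpu_ready": Node(
        lambda m: m["host_cpu_util"] > 90,
        "host CPU saturated",
        Node(
            lambda m: m["vm_vcpus"] > m["host_cores"] / 2,
            "VM has too many vCPUs for its host",
            "undetermined",
        ),
    ),
    "vm_disk_latency": Node(
        lambda m: m["datastore_vm_count"] > 20,
        "datastore contention",
        "undetermined",
    ),
}


def root_cause(alert_type: str, metrics: Metrics) -> str:
    return walk(TREES[alert_type], metrics)
```

For example, `root_cause("vm_cpu_ready", {"host_cpu_util": 95, "vm_vcpus": 2, "host_cores": 16})` follows the first "yes" branch and returns `"host CPU saturated"`.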

13. A tangible computer-readable medium that stores thereon a plurality of computer-executable instructions configured, when executed by a computer processor, to cause computer hardware to perform operations comprising:
- monitoring capacity measurements of at least one host and at least one virtual machine on a computer network;
  
  receiving alerts that each indicate that an event has occurred in which at least one of the hosts or virtual machines has exceeded a set threshold for one of the capacity measurements for at least a set amount of time;
  
  correlating multiple events and performing a root cause analysis and a first impact analysis upon the correlated set of events, wherein the root cause analysis and first impact analysis include executing, for each of the alerts, a series of determinations in a decision tree that is defined for the specific type of alert that is being analyzed, so as to identify a root cause of the event associated with the alert and to identify other objects within the computer network that are impacted by the events; and
  
  generating a recommended response to mitigate a problem associated with the events, wherein generation of the recommended response includes performing a second impact analysis that determines what impact each of multiple potential responses would have upon the computer system and selecting as the recommended response a first potential response that would have a more positive impact than a second potential response.
  - 14. The computer-readable medium of claim 13, wherein correlation of multiple events includes identifying a significance, a usefulness, and a type for each event.
  - 15. The computer-readable medium of claim 13, wherein the operations further comprise filtering the events to remove at least some data associated with the events.
  - 16. The computer-readable medium of claim 13, wherein the root cause analysis further comprises arranging the set of correlated events in sequence.
  - 17. The computer-readable medium of claim 13, wherein the root cause analysis further comprises taking into account a degree to which a capacity measurement for an event is extreme in comparison to capacity measurements for other events.
  - 18. The computer-readable medium of claim 13, wherein one or both of the first and second impact analysis includes performing predictive modeling using a plurality of variables, systems constraints, collected metrics, user behavior, and historical data.
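
The second impact analysis in claims 1 and 13 compares what each potential response would do to the system and selects the one with the more positive impact. A rough sketch follows, assuming a simple model in which system state is per-host CPU utilization and "impact" is scored as total overshoot above a limit. The scoring function, the 80% limit, and the example responses are assumptions, not taken from the claims.

```python
from typing import Callable, Dict

State = Dict[str, float]  # host name -> projected CPU utilization (%)


def score(state: State, limit: float = 80.0) -> float:
    """Higher is better: negative sum of utilization above `limit`."""
    return -sum(max(0.0, util - limit) for util in state.values())


def recommend(state: State, responses: Dict[str, Callable[[State], State]]) -> str:
    """Project each potential response onto a copy of the state and
    return the name of the response with the best projected score."""
    return max(responses, key=lambda name: score(responses[name](dict(state))))


current = {"h1": 95.0, "h2": 40.0}
responses = {
    # Hypothetical responses with assumed effects on utilization.
    "migrate vm1 to h2": lambda s: {**s, "h1": s["h1"] - 20, "h2": s["h2"] + 20},
    "raise vm1 CPU limit": lambda s: {**s, "h1": s["h1"] + 3},
}
best = recommend(current, responses)
```

Under these assumed numbers, migrating the VM leaves both hosts under the limit, so it is selected over raising the CPU limit, which would push `h1` further over.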

Current Assignee: Quest Software, Inc.
Original Assignee: Dell Software, Inc. (Dell Technologies Inc.)
Inventors: Bakman, Alexander; Latimer, Kenneth J. Jr.; Kachkaev, Alexey
Primary Examiner(s): LE, DIEU MINH T
Application Number: US13/366,166
Time in Patent Office: 844 Days
Field of Search: 714/47.2, 714/47.1, 714/48, 714/47.3, 714/50
US Class Current: 714/47.2
CPC Class Codes:
G06F 11/0712   in a virtual computing plat...
G06F 11/079   Root cause analysis, i.e. e...
G06F 11/3433   for load management allocat...
G06F 11/3442   for planning or managing th...
G06F 11/3447   Performance evaluation by m...
G06F 16/248   Presentation of query results
G06F 2201/81   Threshold
G06F 2201/815   Virtual
