Apparatus and method for event correlation and problem reporting

US 6,868,367 B2
Filed: 03/27/2003
Issued: 03/15/2005
Est. Priority Date: 05/25/1994
Status: Expired due to Term

Abstract

An apparatus and method are provided for efficiently determining the source of problems in a complex system based on observable events. By splitting the problem identification process into two separate activities of (1) generating efficient codes for problem identification and (2) decoding the problems at runtime, the efficiency of the problem identification process is significantly increased. Various embodiments of the invention contemplate creating a causality matrix which relates observable symptoms to likely problems in the system, reducing the causality matrix into a minimal codebook by eliminating redundant or unnecessary information, monitoring the observable symptoms, and decoding problems by comparing the observable symptoms against the minimal codebook using various best-fit approaches. The minimal codebook also identifies the observable symptoms whose monitoring would yield the greatest benefit compared with others.
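
As an illustration of the first activity, the sketch below builds a small causality matrix and removes redundant symptom columns to obtain a smaller codebook. The matrix orientation (one row per problem), the problem and symptom names, and the function `drop_redundant_symptoms` are assumptions made for this example. It only removes symptoms that no problem causes or that duplicate another symptom, and it does not claim to reproduce the reduction procedure of the specification.

```python
# Hypothetical causality matrix: one row per problem, one column per symptom.
# A 1 means the problem is expected to produce that symptom.
SYMPTOMS = ["s1", "s2", "s3", "s4", "s5"]
CAUSALITY = {
    "p1": (1, 0, 0, 1, 0),
    "p2": (0, 0, 1, 0, 1),
    "p3": (1, 0, 1, 1, 0),
}


def drop_redundant_symptoms(causality, symptoms):
    """Keep only symptom columns that carry information not already kept."""
    problems = list(causality)
    seen_columns = set()
    keep = []
    for col in range(len(symptoms)):
        column = tuple(causality[p][col] for p in problems)
        if not any(column):
            continue  # no problem causes this symptom
        if column in seen_columns:
            continue  # same pattern as a symptom we already kept
        seen_columns.add(column)
        keep.append(col)
    kept_symptoms = [symptoms[c] for c in keep]
    codebook = {p: tuple(causality[p][c] for c in keep) for p in problems}
    return kept_symptoms, codebook


kept, codebook = drop_redundant_symptoms(CAUSALITY, SYMPTOMS)
print(kept)
print(codebook)
```

In this toy matrix, `s2` is caused by nothing and `s4` repeats the pattern of `s1`, so only three symptoms need to be monitored while every problem keeps a distinct code.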

By defining a distance measure between symptoms and codes in the codebook, the invention can tolerate a loss of symptoms or spurious symptoms without failure. Changing the radius of the codebook allows the ambiguity of problem identification to be adjusted easily. The invention also allows probabilistic and temporal correlations to be monitored. Due to the degree of data reduction prior to runtime, extremely large and complex systems involving many observable events can be efficiently monitored with much smaller computing resources than would otherwise be possible.
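
The role of the distance measure and the radius can be sketched as follows, assuming Hamming distance as the measure and binary codes. The codebook contents and the `decode` helper are hypothetical. The codes are chosen so that any two differ in four positions, which lets a single lost or spurious symptom be tolerated.

```python
def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(codebook, observed, radius):
    """Return the problems whose code lies within `radius` of the observation,
    nearest first. A larger radius admits more candidates (more ambiguity)."""
    scored = sorted((hamming(code, observed), problem)
                    for problem, code in codebook.items())
    return [problem for distance, problem in scored if distance <= radius]


# Hypothetical codebook: any two codes differ in four positions.
CODEBOOK = {
    "p1": (1, 1, 1, 0, 0, 0),
    "p2": (0, 0, 1, 1, 1, 0),
    "p3": (1, 0, 0, 0, 1, 1),
}

# The code of p1 with its third symptom lost in transit.
observed = (1, 1, 0, 0, 0, 0)

print(decode(CODEBOOK, observed, radius=0))
print(decode(CODEBOOK, observed, radius=1))
print(decode(CODEBOOK, observed, radius=3))
```

With radius 0 the damaged observation matches nothing, with radius 1 it is attributed to `p1` alone, and with radius 3 both `p1` and `p3` are reported as candidates.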

68 Claims

1. A computer implemented method to analyze events in a system, the method comprising the steps of:
- creating one or more configuration non-specific representations of types of managed components, creating one or more configuration non-specific representations of events of said types of managed components, and creating configuration non-specific representations of relations along which the events propagate amongst the types of managed components, said configuration non-specific representations of types of managed components, said configuration non-specific representations of events of said types of managed components, and said configuration non-specific representations of relations along which the events propagate amongst the types of managed components each being explicit and manipulatable by a first executable computer code;
  
  partitioning a system domain representative of the system into a plurality of subdomains, each subdomain including a subset of instances of managed components;
  
  for each subdomain, producing a data structure for determining cause and/or effect relationships between sets of events in the subdomain by combining a plurality of said configuration non-specific representations based on information of specific instances of managed components in the subdomain;
  
  for each subdomain, executing a second computer code using the said data structure to determine corresponding events caused by the one or more other events; and
  
  combining said corresponding events of each of said subdomains to determine cause and/or effect relationships between sets of events in the overall system.
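
One way to picture the step of combining configuration non-specific representations with instance information is the sketch below. The component types, events, relation names, and the `instantiate` function are all hypothetical. The sketch covers only the production of a causal data structure for one subdomain, not the partitioning or the final combination across subdomains.

```python
# Type-level (configuration non-specific) propagation rules. Each rule reads:
# "source_event on a source_type causes target_event on every target_type
# component reached through `relation`". All names are made up.
PROPAGATION_RULES = [
    # (source_type, source_event, relation, target_type, target_event)
    ("Router", "Down", "serves", "Host", "Unreachable"),
    ("Host", "Unreachable", "runs", "App", "Timeout"),
]

# Instance-level configuration of one subdomain (also hypothetical).
INSTANCES = {"r1": "Router", "h1": "Host", "h2": "Host", "a1": "App"}
RELATIONS = [
    ("r1", "serves", "h1"),
    ("r1", "serves", "h2"),
    ("h1", "runs", "a1"),
]


def instantiate(rules, instances, relations):
    """Combine type-level rules with the instance topology into causal edges
    between (component, event) pairs."""
    edges = set()
    for src, relation, dst in relations:
        for s_type, s_event, rule_relation, t_type, t_event in rules:
            if (instances[src] == s_type
                    and relation == rule_relation
                    and instances[dst] == t_type):
                edges.add(((src, s_event), (dst, t_event)))
    return edges


for cause, effect in sorted(instantiate(PROPAGATION_RULES, INSTANCES, RELATIONS)):
    print(cause, "->", effect)
```

The rules never mention `r1` or `h1`. The same rule set can be reused for a different subdomain by supplying a different `INSTANCES` and `RELATIONS`.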

2. A method for analyzing events in a system, the method comprising the steps of:
- (1) partitioning a system domain representative of the system into a plurality of subdomains, each of said subdomains generating domain events, each domain event comprising one of the events in the system;
  
  (2) for each subdomain, providing a computer-accessible codebook comprising a matrix of values, wherein each value corresponds to a mapping between one of said domain events and one of a plurality of likely other events in said system;
  
  (3) for each subdomain, monitoring event data values representing said domain events generated by said subdomain;
  
  (4) for each subdomain, determining a mismatch measure between each of said matrix of values in said codebook for said subdomain and said event data values for said subdomain through the use of a computer, and selecting the event having the smallest mismatch measure as the most likely cause event; and
  
  (5) combining said selected most likely cause event in each of said subdomains to determine one or more likely cause events in said system.
- - 3. The method of claim 2, further comprising the step of generating a subdomain report identifying the most likely event determined for each subdomain.
  - 4. The method of claim 2, wherein the step of determining a mismatch measure comprises the step of:
    - determining a Hamming distance between said matrix of values in said codebook for said subdomain and said event data values.
  - 5. The method of claim 2, wherein the step of determining a mismatch measure comprises the step of:
    - adding individual mismatch measures across a plurality of pairs, wherein each pair comprises one of the plurality of event data values and one of the plurality of values.
  - 6. The method of claim 5, wherein step (4) further comprises the step of:
    - using a mismatch measure that gives a different weight to absence of an event than to presence of an event.
  - 7. The method of claim 2, wherein step (2) comprises the step of:
    - specifying each of said values in said matrix of values as a probability, said probability reflecting a likelihood that an event was caused by at least one other event.
  - 8. The method of claim 7, wherein step (2) comprises the step of:
    - specifying each of said values in said matrix of values as a pair of data, said pair of data comprising a first datum designating said probability and a second datum designating a temporal indicator corresponding to a time frame within which said probability holds true.
  - 9. The method of claim 7, wherein each of said probabilities is a discrete value.
  - 10. The method of claim 2 wherein said system comprises a network of computer nodes, and wherein step (3) comprises the step of receiving messages from said computer nodes, said messages comprising a subset of said event data values.
  - 11. The method of claim 2, wherein said system comprises a telecommunication network, and wherein step (3) comprises the step of receiving signals from equipment in said telecommunication network, said signals comprising a subset of said plurality of event data values.
  - 12. The method of claim 2, wherein said system comprises a computer having peripherals, and wherein step (3) comprises the step of receiving signals from said peripherals, said signals comprising a subset of said event data values.
  - 13. The method of claim 2, wherein said system comprises a plurality of satellites, and wherein step (3) comprises the step of receiving signals from said plurality of satellites, said signals comprising a subset of said event data values.
  - 14. The method of claim 2, wherein said system comprises a human patient, and wherein step (3) comprises the step of receiving signals from sensors coupled to said human patient, said signals comprising a subset of said event data values.
  - 15. The method of claim 2, wherein the step of determining a mismatch measure comprises the step of:
    - looking up a predetermined measure from a pre-computed table.
  - 16. The method of claim 2, further comprising the step of:
    - providing a causality matrix comprising a larger set of values than said codebook, said larger set of values also corresponding to mappings between said domain event and said plurality of likely other events corresponding thereto; and
      
      generating said codebook by reducing said larger set of values contained in said causality matrix into said codebook.
  - 17. The method of claim 16, wherein the step of generating said codebook comprises the step of:
    - eliminating redundant rows and columns from said causality matrix.
  - 18. The method of claim 16, wherein the step of generating said codebook comprises the step of:
    - reducing the number of rows in said causality matrix in accordance with a desired degree of distinction between groups of said plurality of likely other events.
  - 19. The method of claim 2, further comprising the step of:
    - providing a causality graph comprising a plurality of nodes each corresponding to an event, a plurality of directed edges each pointing from one of the plurality of nodes to another node of the plurality of nodes and corresponding to a causal relation between two or more of said events, wherein certain ones of said nodes are marked as first event nodes and others are marked as second event nodes; and
      
      generating said codebook by traversing said directed edges leading from said first event nodes to said second event nodes.
  - 20. The method of claim 19, wherein the step of generating said codebook comprises the steps of:
    - eliminating from said causality graph event nodes that may be reached via directed edges from said first event nodes; and
      
      eliminating event nodes that lead via directed edges to said second event nodes.
  - 21. The method of claim 19, wherein the step of generating said codebook comprises the step of:
    - eliminating from said causality graph event nodes in accordance with a desired degree of distinction between groups of said plurality of likely other events.
  - 22. The method according to claim 2, wherein the step of generating said codebook comprises the step of:
    - providing a computer-accessible codebook comprising a matrix of values which has been reduced from a causality matrix by eliminating redundant information from the causality matrix.
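
Steps (2) through (5), together with the asymmetric weighting of claim 6, can be sketched as below. The subdomain names, codebooks, weights, and function names are hypothetical, and the final combination is reduced to collecting one selected cause per subdomain, which is only the simplest possible reading of step (5).

```python
def mismatch(code, observed, w_missing=1.0, w_spurious=2.0):
    """Add per-symptom penalties. A symptom that was expected but absent is
    weighted differently from one that is present but unexpected."""
    total = 0.0
    for expected, seen in zip(code, observed):
        if expected and not seen:
            total += w_missing
        elif seen and not expected:
            total += w_spurious
    return total


def most_likely(codebook, observed):
    """Select the cause whose code has the smallest mismatch measure."""
    return min(codebook, key=lambda cause: mismatch(codebook[cause], observed))


def correlate(codebooks, observations):
    """Decode each subdomain separately, then collect the selected causes."""
    return {name: most_likely(codebooks[name], observations[name])
            for name in codebooks}


# Hypothetical per-subdomain codebooks and monitored event data values.
CODEBOOKS = {
    "lan": {"switch_down": (1, 1, 0), "cable_cut": (1, 0, 1)},
    "wan": {"link_flap": (1, 0, 0), "router_overload": (1, 1, 1)},
}
OBSERVATIONS = {"lan": (0, 1, 0), "wan": (1, 1, 0)}

print(correlate(CODEBOOKS, OBSERVATIONS))
```

In the `wan` subdomain both codes are one bit away from the observation, so a plain Hamming distance would tie. The assumed weights treat an unexplained extra symptom as worse than a lost one, which selects `router_overload`.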

23. A method for analyzing events in a system, the method comprising the steps of:
- (1) partitioning a system domain into a plurality of subdomains, each said subdomain generating domain events, each domain event comprising one of the events in the system;
  
  (2) for each subdomain, generating a subdomain causality matrix comprising a first matrix of values each of said values corresponding to a mapping between one of said domain events and one of a plurality of likely other events in said system domain;
  
  (3) for each subdomain, reducing said subdomain causality matrix into a codebook comprising a second matrix of values derived from said first matrix of values, the second matrix of values being fewer in number than said first matrix of values;
  
  (4) for each subdomain, monitoring a plurality of domain event data values representing said domain events generated by said system over time;
  
  (5) for each subdomain, determining a mismatch measure between each of a plurality of groups of said matrix of values in said codebook and said plurality of domain event data values through the use of a computer, and selecting one of said plurality of likely events corresponding to one of said plurality of groups having the smallest mismatch measure;
  
  (6) combining said selected ones of said plurality of likely events in each subdomain to determine one or more likely events in said system.
- - 24. The method of claim 23, further comprising the step of:
    - reporting said selected likely event determined in step (5) for each subdomain.
  - 25. The method of claim 23, wherein step (3) comprises the step of:
    - eliminating redundant rows and columns from said first matrix.
  - 26. The method of claim 23, further comprising the step of:
    - selecting a desired degree of distinction between each of said groups of said plurality of events, each group corresponding to a different likely other event, and wherein step (3) comprises the step of deleting values from said first matrix which do not satisfy said desired degree of distinction, said deletions made on the basis of comparisons between one or more of said values from said first matrix with said desired degree of distinction.
  - 27. The method of claim 26, wherein said comparisons are made with respect to a Hamming distance determined with respect to one or more of said values from said first matrix.
  - 28. The method of claim 26, wherein said comparisons are made by using a mismatch measure that gives a different weight to absence of an event than to presence of an event.
  - 29. The method of claim 23, wherein each of said values in said first matrix is a probability, said probability reflecting a likelihood that a corresponding event was caused by a corresponding other event.
  - 30. The method of claim 29, wherein each of said values is a discrete probability value.
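
The reduction of step (3) under a desired degree of distinction (claims 26 and 27) might look like the following sketch. It assumes the degree of distinction is a minimum pairwise Hamming distance `d` between the codes of different likely events, and it uses a simple greedy pass that is not guaranteed to find the smallest possible codebook. The matrix and function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(codes):
    """Smallest Hamming distance between any two codes."""
    return min(hamming(a, b) for a, b in combinations(codes, 2))


def reduce_with_distinction(matrix, d):
    """Greedily drop columns while all codes stay at least `d` apart.
    Returns the indices of the columns that are kept."""
    width = len(next(iter(matrix.values())))
    keep = list(range(width))
    for col in range(width):
        trial = [c for c in keep if c != col]
        codes = [tuple(row[c] for c in trial) for row in matrix.values()]
        if trial and min_distance(codes) >= d:
            keep = trial  # this column is not needed for distinction d
    return keep


# Hypothetical first matrix: three likely events, six monitored domain events.
MATRIX = {
    "p1": (1, 1, 1, 0, 0, 0),
    "p2": (0, 0, 1, 1, 1, 0),
    "p3": (1, 0, 0, 0, 1, 1),
}

print(reduce_with_distinction(MATRIX, d=1))
print(reduce_with_distinction(MATRIX, d=2))
print(reduce_with_distinction(MATRIX, d=3))
```

Raising `d` keeps more columns: the codebook grows, but more lost or spurious events can be tolerated before two likely events become confusable.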

31. An apparatus for analyzing events in a system, said system comprising a system domain partitioned into a plurality of subdomains, each said subdomain generating domain events, each domain event comprising one of the events in the system, the apparatus comprising:
- a storage device for storing for each subdomain, a domain codebook comprising a matrix of values each value corresponding to a mapping between one of said plurality of domain events and one likely other event of a plurality of likely other events in said system;
  
  monitoring means for monitoring for each subdomain, a plurality of domain event data values representing said plurality of domain events generated by said system;
  
  means for determining for each subdomain, a mismatch measure between each of a plurality of groups of said values in said domain codebook and said plurality of domain event data values, and selecting one of said plurality of likely events corresponding to one of said plurality of groups having the smallest mismatch measure;
  
  means for combining the selected one of said plurality of likely events in each subdomain to determine one or more likely events in said system.
- - 32. The apparatus of claim 31, wherein said means for determining a mismatch measure comprises means for determining a Hamming distance between each of said plurality of groups and said plurality of domain event data values.
  - 33. The apparatus of claim 31, wherein said means for determining a mismatch measure adds individual mismatch measures across a plurality of pairs, each pair comprising one of the plurality of event data values and one of the plurality of values.
  - 34. The apparatus of claim 33, wherein said mismatch measure is a best fit match determination which gives a different weight to absence of an event data value than to presence of an event data value.
  - 35. The apparatus of claim 34 further comprising:
    - means for outputting as likely cause events all likely events in said domain codebook which fall within a predetermined tolerance from said best fit match.
  - 36. The apparatus of claim 31, wherein each of said values contained in said matrix of values represents a probability reflecting a likelihood that an event was caused by a corresponding other event.
  - 37. The apparatus of claim 36, wherein each of said values contained in said matrix of values is represented as a pair of data, said pair of data comprising a first datum designating said probability and a second datum designating a temporal indicator corresponding to a time frame within which said probability holds.
  - 38. The apparatus of claim 36, wherein each of said values is specified as a discrete probability value.
  - 39. The apparatus of claim 31, wherein said system comprises a network of computer nodes, and wherein said monitoring means comprises means for receiving messages from each of said computer nodes in said network, said messages comprising a subset of said plurality of domain event data values.
  - 40. The apparatus of claim 31, wherein said system comprises a telecommunication network, and wherein said monitoring means comprises means for receiving signals from equipment in said telecommunication network, said signals comprising a subset of said plurality of domain event data values.
  - 41. The apparatus of claim 31, wherein said system comprises a computer having peripherals, and wherein said monitoring means comprises means for receiving signals from said peripherals, said signals comprising a subset of said plurality of domain event data values.
  - 42. The apparatus of claim 31, wherein said system comprises a plurality of satellites, and wherein said monitoring means comprises means for receiving signals from said plurality of satellites, said signals comprising a subset of said plurality of domain event data values.
  - 43. The apparatus of claim 31, wherein said system comprises a human patient, and wherein said monitoring means comprises means for receiving signals from sensors coupled to said human patient, said signals comprising a subset of said plurality of domain event data values.
  - 44. The apparatus of claim 31, wherein said mismatch measure is determined by looking up a predetermined measure from a pre-computed table.
  - 45. The apparatus of claim 31, further comprising:
    - means for storing a causality matrix containing values corresponding to mappings between said domain events and said plurality of likely other events corresponding thereto; and
      
      means for generating said domain codebook by reducing said larger set of values contained in said causality matrix.
  - 46. The apparatus of claim 45, wherein said domain codebook is generated by eliminating redundant rows and columns from said causality matrix.
  - 47. The apparatus of claim 45, wherein said domain codebook is generated by reducing the number of rows in said causality matrix in accordance with a desired degree of distinction between groups of said plurality of events.
  - 48. The apparatus of claim 31, further comprising:
    - means for storing a causality graph comprising a plurality of nodes each corresponding to an event, and a plurality of directed edges each pointing from one of the plurality of nodes to another one of the plurality of nodes and corresponding to a causal relation between two of said events, wherein certain nodes are marked as first events and certain nodes are marked as second events;
      
      and wherein said domain codebook is generated by traversing said plurality of directed edges in said causality graph leading from nodes marked as said first events to nodes marked as said second events.
  - 49. The apparatus of claim 48, wherein said domain codebook is generated by eliminating from said causality graph event nodes that may be reached via directed edges from the first event nodes, and by eliminating event nodes that lead via directed edges to the second event nodes.
  - 50. The apparatus of claim 48, wherein said domain codebook is generated by eliminating from said causality graph event nodes in accordance with a desired degree of distinction between groups of said plurality of events.
  - 51. The apparatus according to claim 31, wherein the values in the domain codebook have been obtained from a causality matrix by eliminating redundant information from the causality matrix.
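
Generating a codebook by traversing a causality graph, as in claim 48, can be sketched as follows. The sketch assumes that the first events are problems, the second events are observable symptoms, and each code simply records which symptom nodes are reachable from a problem node. The graph, node names, and `codebook_from_graph` are hypothetical.

```python
def codebook_from_graph(edges, problems, symptoms):
    """For each problem node, follow directed edges and record which
    symptom nodes can be reached, directly or through intermediate events."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)

    codebook = {}
    for problem in problems:
        reached, stack = set(), [problem]
        while stack:  # iterative depth-first traversal, safe on cycles
            node = stack.pop()
            for nxt in successors.get(node, []):
                if nxt not in reached:
                    reached.add(nxt)
                    stack.append(nxt)
        codebook[problem] = tuple(int(s in reached) for s in symptoms)
    return codebook


# Hypothetical causality graph: p* are problems, e* intermediate events,
# s* observable symptoms.
EDGES = [
    ("p1", "e1"), ("e1", "s1"), ("e1", "s2"),
    ("p2", "s2"), ("p2", "e2"), ("e2", "s3"),
]

print(codebook_from_graph(EDGES, ["p1", "p2"], ["s1", "s2", "s3"]))
```

Intermediate events such as `e1` and `e2` do not appear in the resulting codebook; only the problem-to-symptom reachability survives.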

52. Apparatus for analyzing events in a system, said system partitioned into a plurality of subdomains, each said subdomain generating domain events, each domain event comprising one of the events in the system, the apparatus comprising:
- generating means for generating for each subdomain, a domain causality matrix comprising a first matrix of values each value corresponding to a mapping between one of said plurality of domain events and one likely other event of a plurality of likely other events in said system;
  
  reducing means for reducing for each subdomain, said domain causality matrix into a codebook comprising a second matrix of values fewer in number than said first matrix of values;
  
  monitoring means for monitoring for each subdomain, through the use of a computer, a plurality of domain event data values representing said plurality of domain events generated by said system over time;
  
  means for determining for each subdomain, a mismatch measure between each of a plurality of groups of said values in said corresponding codebook and said plurality of domain event data values, and selecting one of said plurality of likely events corresponding to one of said plurality of groups having the smallest mismatch measure; and
  
  means for combining said selected ones of said plurality of likely events to determine one or more likely events in said system.
- - 53. The apparatus of claim 52, wherein said reducing means eliminates redundant rows and columns from said first matrix.
  - 54. The apparatus of claim 52, wherein said reducing means comprises:
    - means for inputting a desired degree of distinction between each of said groups of said plurality of events, each group corresponding to a different likely other event, and wherein said reducing means deletes values from said first matrix which do not satisfy said desired degree of distinction, said deletions made on the basis of comparisons between one or more of said values from said first matrix with said desired degree of distinction.
  - 55. The apparatus of claim 54, wherein said comparisons are made with respect to a Hamming distance determined with respect to one or more of said values from said first matrix.
  - 56. The apparatus of claim 54, wherein each of said values in said first matrix comprises a probability reflecting a likelihood that an event was caused by a corresponding other event.
  - 57. The apparatus of claim 56, wherein said probabilities comprise a discrete value.
  - 58. The apparatus of claim 54, wherein said mismatch measure gives a different weight to absence of an event than to presence of an event.

59. A method of analyzing a system, said system being partitioned into a plurality of subdomains, the method comprising the steps of:
- providing a causality mapping relating possible events in said system to symptoms likely generated by said events;
  
  detecting symptoms generated in at least two of said subdomains of said system; and
  
  for each of said at least two subdomains, determining a mismatch measure between said symptoms and events, and selecting a possible event having the smallest mismatch measure in said subdomains with respect to said causality mapping, and combining results of said analyses of each of said at least two subdomains to identify one or more likely events in said system.
- - 60. The method of claim 59 wherein said causality mapping comprises a computer-accessible codebook comprising a matrix of values relating symptoms to events.
  - 61. The method of claim 59 wherein performing said analysis comprises the step of:
    - determining the most likely event for a set of detected symptoms.
  - 62. The method of claim 59, wherein determining a mismatch measure comprises determining a Hamming distance between symptoms and events.

63. An apparatus for analyzing a system, said system being partitioned into a plurality of subdomains, the apparatus comprising:
- a storage device for storing a causality mapping relating events in said system to symptoms likely generated by said events;
  
  a plurality of monitors for detecting symptoms generated in said subdomains of said system; and
  
  an event correlator associated with each of said subdomains, for determining a mismatch measure between said symptoms and events, and selecting an event having the smallest mismatch measure in said associated subdomain with respect to said causality mapping to identify one or more likely events in said system.
- - 64. The apparatus of claim 63 wherein said causality mapping comprises a computer-accessible codebook comprising a matrix of values relating symptoms to events.
  - 65. The apparatus of claim 63 wherein said local event correlators determine a most likely event for a set of detected symptoms.
  - 66. The apparatus of claim 63, wherein said mismatch measure is determined by determining a Hamming distance between said symptoms and said events.

67. A computer implemented method to determine the effects of one or more events in a system of managed components, the method comprising the steps of:
- creating one or more configuration non-specific representations of types of managed components;
  
  creating one or more configuration non-specific representations of events of said types of managed components;
  
  creating configuration non-specific representations of relations along which the events and/or effects of said events propagate amongst the types of managed components;
  
  said configuration non-specific representations of types of managed components, said configuration non-specific representations of events of said types of managed components, and said configuration non-specific representations of relations along which the events and/or effects propagate amongst the types of managed components each being explicit and manipulatable by a first executable computer code;
  
  producing a data structure for determining the effects of an event by combining one or more of said configuration non-specific representations based on information of specific instances of managed components in the system; and
  
  executing a second computer code utilizing said data structure to determine the corresponding effects on one or more managed components caused by the one or more events.

68. A computer implemented method to determine the effects of one or more events in a system of managed components, the method comprising the steps of:
- creating one or more configuration non-specific representations of types of managed components;
  
  creating one or more configuration non-specific representations of events of said types of managed components; and
  
  creating configuration non-specific representations of relations along which the events and/or effects of said events propagate amongst the types of managed components;
  
  said configuration non-specific representations of types of managed components, said configuration non-specific representations of events of said types of managed components, and said configuration non-specific representations of relations along which the events and/or effects propagate amongst the types of managed components each being explicit and manipulatable by a first executable computer code;
  
  partitioning a system domain into a plurality of smaller domains, each said smaller domain containing a subset of instances of managed components of the system;
  
  for each smaller domain, producing a data structure for determining the effects of one or more events by combining a plurality of said configuration non-specific representation based on information of specific instances of managed components in the smaller domains;
  
  for each smaller domain, executing a second computer code utilizing said data structures to determine the corresponding effects on one or more managed components caused by the one or more events in the smaller domain; and
  
  combining the determined effects in each said smaller domain into one or more effects in said system.

Current Assignee: VMware, Inc. (Broadcom, Inc.)
Original Assignee: System Management Arts, Inc. (Dell Technologies Inc.)
Inventors: Kliger, Shmuel; Yemini, Yechiam; Yemini, Shaula
Primary Examiner(s): WACHSMAN, HAL D
Application Number: US10/400,718
Publication Number: US 20030204370A1
Time in Patent Office: 719 Days
Field of Search: 702/179-187, 702/196, 702/58, 702/59, 702/119-123, 702/126, 702/FOR.135, 702/FOR.139, 702/FOR.163, 702/FOR.171, 706/50-52, 706/19-21, 706/59, 706/61, 706/924, 706/920, 706/922, 714/25, 714/26, 714/37, 714/31, 714/44, 714/48, 714/49, 717/124, 717/106, 703/16, 703/17, 703/2-4, 703/6, 703/7
US Class Current: 702/183
CPC Class Codes:
G06F 11/2257   using expert systems
G06F 11/2273   Test methods
G06F 11/3466   Performance evaluation by t...
G06F 2201/86   Event-based monitoring
