Apparatus and method for event correlation and problem reporting
Abstract
An apparatus and method are provided for efficiently determining the source of problems in a complex system based on observable events. By splitting the problem identification process into two separate activities, (1) generating efficient codes for problem identification and (2) decoding the problems at runtime, the efficiency of the problem identification process is significantly increased. Various embodiments of the invention contemplate creating a causality matrix which relates observable symptoms to likely problems in the system, reducing the causality matrix into a minimal codebook by eliminating redundant or unnecessary information, monitoring the observable symptoms, and decoding problems by comparing the observable symptoms against the minimal codebook using various best-fit approaches. The minimal codebook also identifies those observable symptoms whose monitoring yields the greatest benefit compared with the others.
By defining a distance measure between symptoms and codes in the codebook, the invention can tolerate a loss of symptoms or spurious symptoms without failure. Changing the radius of the codebook allows the ambiguity of problem identification to be adjusted easily. The invention also allows probabilistic and temporal correlations to be monitored. Due to the degree of data reduction prior to runtime, extremely large and complex systems involving many observable events can be efficiently monitored with much smaller computing resources than would otherwise be possible.
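The decoding idea described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes Hamming distance as the distance measure and treats the radius as a simple tolerance threshold. The names `CODEBOOK`, `hamming`, and `decode`, and the problem and symptom values, are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical codebook: one code per problem, one position per monitored symptom.
# Any two codes differ in 4 positions, so a single lost or spurious symptom
# still leaves the observation closest to the correct code.
CODEBOOK: Dict[str, Sequence[int]] = {
    "link_down":   (1, 1, 1, 0, 0, 0),
    "router_fail": (0, 0, 1, 1, 1, 0),
    "disk_full":   (1, 0, 0, 0, 1, 1),
}


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions in which two equal-length symptom vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(observed: Sequence[int],
           codebook: Dict[str, Sequence[int]],
           radius: int) -> List[str]:
    """Return the problems whose codes are nearest to the observed symptoms.

    An empty list means no code lies within `radius` of the observation.
    More than one name means the observation is ambiguous.
    """
    distances = {problem: hamming(observed, code)
                 for problem, code in codebook.items()}
    best = min(distances.values())
    if best > radius:
        return []
    return sorted(p for p, d in distances.items() if d == best)
```

With these assumed values, `decode((1, 0, 1, 0, 0, 0), CODEBOOK, radius=1)` still identifies `link_down` although one of its symptoms was lost, while `radius=0` demands an exact match and returns an empty list.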
68 Claims
1. A computer implemented method to analyze events in a system, the method comprising the steps of:
creating one or more configuration non-specific representations of types of managed components, creating one or more configuration non-specific representations of events of said types of managed components, and creating configuration non-specific representations of relations along which the events propagate amongst the types of managed components, said configuration non-specific representations of types of managed components, said configuration non-specific representations of events of said types of managed components, and said configuration non-specific representations of relations along which the events propagate amongst the types of managed components each being explicit and manipulatable by a first executable computer code;
partitioning a system domain representative of the system into a plurality of subdomains, each subdomain including a subset of instances of managed components;
for each subdomain, producing a data structure for determining cause and/or effect relationships between sets of events in the subdomain by combining a plurality of said configuration non-specific representations based on information of specific instances of managed components in the subdomain;
for each subdomain, executing a second computer code using the said data structure to determine corresponding events caused by the one or more other events; and
combining said corresponding events of each of said subdomains to determine cause and/or effect relationships between sets of events in the overall system.
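As an illustration of how type-level (configuration non-specific) representations might be combined with instance information to produce a cause/effect data structure, consider the sketch below. It is one possible reading of the claim language, not the patented implementation. `PropagationRule`, `build_causality`, and all component, relation, and event names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple

Event = Tuple[str, str]  # (instance name, event name)


@dataclass(frozen=True)
class PropagationRule:
    """Type-level rule: an event on one component type propagates along
    a relation to an event on another component type."""
    source_type: str
    source_event: str
    relation: str
    target_type: str
    target_event: str


def build_causality(rules: Iterable[PropagationRule],
                    instances: Dict[str, str],
                    links: Iterable[Tuple[str, str, str]]
                    ) -> Dict[Event, Set[Event]]:
    """Instantiate the type-level rules over the instances of one subdomain.

    `instances` maps instance name -> type name.
    `links` holds (source instance, relation, target instance) triples.
    The result maps each causing event to the events it directly causes.
    """
    rules = list(rules)
    graph: Dict[Event, Set[Event]] = {}
    for src, relation, dst in links:
        for rule in rules:
            if (instances[src] == rule.source_type
                    and relation == rule.relation
                    and instances[dst] == rule.target_type):
                graph.setdefault((src, rule.source_event), set()).add(
                    (dst, rule.target_event))
    return graph


# Hypothetical model and topology for a single subdomain.
RULES = [PropagationRule("Router", "Down", "connects", "Host", "Unreachable")]
INSTANCES = {"r1": "Router", "h1": "Host", "h2": "Host"}
LINKS = [("r1", "connects", "h1"), ("r1", "connects", "h2")]
```

Because the rules mention only types, the same `RULES` list can be reused unchanged for every subdomain; only `INSTANCES` and `LINKS` vary with the configuration.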
2. A method for analyzing events in a system, the method comprising the steps of:
(1) partitioning a system domain representative of the system into a plurality of subdomains, each of said subdomains generating domain events, each domain event comprising one of the events in the system;
(2) for each subdomain, providing a computer-accessible codebook comprising a matrix of values, wherein each value corresponds to a mapping between one of said domain events and one of a plurality of likely other events in said system;
(3) for each subdomain, monitoring event data values representing said domain events generated by said subdomain;
(4) for each subdomain, determining a mismatch measure between each of said matrix of values in said codebook for said subdomain and said event data values for said subdomain through the use of a computer, and selecting the event having the smallest mismatch measure as the most likely cause event; and
(5) combining said selected most likely cause event in each of said subdomains to determine one or more likely cause events in said system.
Dependent claims: 3-22
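Steps (1) through (5) can be sketched as below, assuming the mismatch measure is a count of differing positions and that "combining" means collecting the selected cause from every subdomain. Both are assumptions made for illustration. The function names and the sample subdomains are hypothetical.

```python
from typing import Dict, List, Sequence

Codebook = Dict[str, Sequence[int]]  # cause event -> expected symptom vector


def mismatch(code: Sequence[int], observed: Sequence[int]) -> int:
    """Count the positions where the expected and observed values differ."""
    return sum(c != o for c, o in zip(code, observed))


def most_likely_cause(codebook: Codebook, observed: Sequence[int]) -> str:
    """Select the cause with the smallest mismatch (first one wins on ties)."""
    return min(codebook, key=lambda cause: mismatch(codebook[cause], observed))


def correlate(codebooks: Dict[str, Codebook],
              observations: Dict[str, Sequence[int]]) -> List[str]:
    """Decode each subdomain independently, then combine the selected causes."""
    selected = {most_likely_cause(codebook, observations[name])
                for name, codebook in codebooks.items()}
    return sorted(selected)


# Hypothetical partition of a system into two subdomains.
CODEBOOKS: Dict[str, Codebook] = {
    "lan_a": {"switch_fail": (1, 1, 0), "cable_cut": (0, 1, 1)},
    "lan_b": {"server_crash": (1, 0), "disk_full": (0, 1)},
}
OBSERVATIONS = {"lan_a": (1, 0, 0), "lan_b": (0, 1)}
```

Each subdomain is decoded against its own, smaller codebook, so the work per subdomain does not grow with the size of the overall system.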
23. A method for analyzing events in a system, the method comprising the steps of:
(1) partitioning a system domain into a plurality of subdomains, each said subdomain generating domain events, each domain event comprising one of the events in the system;
(2) for each subdomain, generating a subdomain causality matrix comprising a first matrix of values each of said values corresponding to a mapping between one of said domain events and one of a plurality of likely other events in said system domain;
(3) for each subdomain, reducing said subdomain causality matrix into a codebook comprising a second matrix of values derived from said first matrix of values, the second matrix of values being fewer in number than said first matrix of values;
(4) for each subdomain, monitoring a plurality of domain event data values representing said domain events generated by said system over time;
(5) for each subdomain, determining a mismatch measure between each of a plurality of groups of said matrix of values in said codebook and said plurality of domain event data values through the use of a computer, and selecting one of said plurality of likely events corresponding to one of said plurality of groups having the smallest mismatch measure;
(6) combining said selected ones of said plurality of likely events in each subdomain to determine one or more likely events in said system.
Dependent claims: 24-30
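Step (3), reducing a causality matrix into a smaller codebook, might be approximated as follows. This sketch uses a simple greedy heuristic that keeps adding symptom columns until every problem that was distinguishable in the full matrix stays distinguishable. It is a stand-in chosen for brevity, not the reduction procedure of the patent. The function name and sample matrix are hypothetical.

```python
from typing import Dict, List

Matrix = Dict[str, Dict[str, int]]  # problem -> symptom -> 0 or 1


def reduce_to_codebook(causality: Matrix, symptoms: List[str]) -> List[str]:
    """Greedily pick symptom columns that keep the problem codes distinct."""

    def distinct(columns: List[str]) -> int:
        # Number of different codes when only `columns` are monitored.
        return len({tuple(row[s] for s in columns)
                    for row in causality.values()})

    target = distinct(symptoms)  # best achievable with every symptom
    selected: List[str] = []
    while distinct(selected) < target:
        remaining = [s for s in symptoms if s not in selected]
        # Add the symptom that separates the most problems right now.
        selected.append(max(remaining,
                            key=lambda s: distinct(selected + [s])))
    return selected


# Hypothetical causality matrix: s4 fires for every problem and so
# carries no information for telling the problems apart.
CAUSALITY: Matrix = {
    "P1": {"s1": 1, "s2": 1, "s3": 0, "s4": 1},
    "P2": {"s1": 1, "s2": 0, "s3": 1, "s4": 1},
    "P3": {"s1": 0, "s2": 1, "s3": 1, "s4": 1},
}
SYMPTOMS = ["s1", "s2", "s3", "s4"]
```

For this sample matrix, two of the four symptoms are enough to give each problem a unique code, so the codebook holds six values instead of twelve. A codebook meant to tolerate lost or spurious symptoms would keep additional columns to increase the distance between codes.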
31. An apparatus for analyzing events in a system, said system comprising a system domain partitioned into a plurality of subdomains, each said subdomain generating domain events, each domain event comprising one of the events in the system, the apparatus comprising:
a storage device for storing for each subdomain, a domain codebook comprising a matrix of values each value corresponding to a mapping between one of said plurality of domain events and one likely other event of a plurality of likely other events in said system;
monitoring means for monitoring for each subdomain, a plurality of domain event data values representing said plurality of domain events generated by said system;
means for determining for each subdomain, a mismatch measure between each of a plurality of groups of said values in said domain codebook and said plurality of domain event data values, and selecting one of said plurality of likely events corresponding to one of said plurality of groups having the smallest mismatch measure;
means for combining the selected one of said plurality of likely events in each subdomain to determine one or more likely events in said system.
Dependent claims: 32-51
52. Apparatus for analyzing events in a system, said system partitioned into a plurality of subdomains, each said subdomain generating domain events, each domain event comprising one of the events in the system, the apparatus comprising:
generating means for generating for each subdomain, a domain causality matrix comprising a first matrix of values each value corresponding to a mapping between one of said plurality of domain events and one likely other event of a plurality of likely other events in said system;
reducing means for reducing, for each subdomain, said domain causality matrix into a codebook comprising a second matrix of values fewer in number than said first matrix of values;
monitoring means for monitoring for each subdomain, through the use of a computer, a plurality of domain event data values representing said plurality of domain events generated by said system over time;
means for determining for each subdomain, a mismatch measure between each of a plurality of groups of said values in said corresponding codebook and said plurality of domain event data values, and selecting one of said plurality of likely events corresponding to one of said plurality of groups having the smallest mismatch measure; and
means for combining said selected ones of said plurality of likely events to determine one or more likely events in said system.
Dependent claims: 53-58
59. A method of analyzing a system, said system being partitioned into a plurality of subdomains, the method comprising the steps of:
providing a causality mapping relating possible events in said system to symptoms likely generated by said events;
detecting symptoms generated in at least two of said subdomains of said system; and
for each of said at least two subdomains, determining a mismatch measure between said symptoms and events, and selecting a possible event having the smallest mismatch measure in said subdomains with respect to said causality mapping, and combining results of said analyses of each of said at least two subdomains to identify one or more likely events in said system.
Dependent claims: 60-62
63. An apparatus for analyzing a system, said system being partitioned into a plurality of subdomains, the apparatus comprising:
a storage device for storing a causality mapping relating events in said system to symptoms likely generated by said events;
a plurality of monitors for detecting symptoms generated in said subdomains of said system; and
an event correlator associated with each of said subdomains, for determining a mismatch measure between said symptoms and events, and selecting an event having the smallest mismatch measure in said associated subdomain with respect to said causality mapping to identify one or more likely events in said system.
Dependent claims: 64-66
67. A computer implemented method to determine the effects of one or more events in a system of managed components, the method comprising the steps of:
creating one or more configuration non-specific representations of types of managed components;
creating one or more configuration non-specific representations of events of said types of managed components;
creating configuration non-specific representations of relations along which the events and/or effects of said events propagate amongst the types of managed components;
said configuration non-specific representations of types of managed components, said configuration non-specific representations of events of said types of managed components, and said configuration non-specific representations of relations along which the events and/or effects propagate amongst the types of managed components each being explicit and manipulatable by a first executable computer code;
producing a data structure for determining the effects of an event by combining one or more of said configuration non-specific representations based on information of specific instances of managed components in the system; and
executing a second computer code utilizing said data structure to determine the corresponding effects on one or more managed components caused by the one or more events.
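The final step, determining the effects of an event from the data structure, can be pictured as a forward traversal of a cause/effect graph. The sketch below assumes the data structure is a mapping from each event to the events it directly causes and uses a breadth-first search. That representation, the function name `effects_of`, and the sample events are all assumptions for illustration.

```python
from collections import deque
from typing import Deque, Dict, Set, Tuple

Event = Tuple[str, str]  # (managed component instance, event name)


def effects_of(event: Event,
               causes: Dict[Event, Set[Event]]) -> Set[Event]:
    """Return every event reachable from `event` by following causal links."""
    reached: Set[Event] = set()
    queue: Deque[Event] = deque([event])
    while queue:
        current = queue.popleft()
        for effect in causes.get(current, ()):
            # Skip the starting event and anything already visited,
            # so cyclic propagation paths terminate.
            if effect != event and effect not in reached:
                reached.add(effect)
                queue.append(effect)
    return reached


# Hypothetical cause/effect data structure for one system.
CAUSES: Dict[Event, Set[Event]] = {
    ("r1", "Down"): {("h1", "Unreachable")},
    ("h1", "Unreachable"): {("app1", "Timeout")},
    ("app1", "Timeout"): {("h1", "Unreachable")},  # cycle, handled above
}
```

Here `effects_of(("r1", "Down"), CAUSES)` yields both the direct effect on `h1` and the indirect effect on `app1`.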
68. A computer implemented method to determine the effects of one or more events in a system of managed components, the method comprising the steps of:
creating one or more configuration non-specific representations of types of managed components;
creating one or more configuration non-specific representations of events of said types of managed components; and
creating configuration non-specific representations of relations along which the events and/or effects of said events propagate amongst the types of managed components;
said configuration non-specific representations of types of managed components, said configuration non-specific representations of events of said types of managed components, and said configuration non-specific representations of relations along which the events and/or effects propagate amongst the types of managed components each being explicit and manipulatable by a first executable computer code;
partitioning a system domain into a plurality of smaller domains, each said smaller domain containing a subset of instances of managed components of the system;
for each smaller domain, producing a data structure for determining the effects of one or more events by combining a plurality of said configuration non-specific representations based on information of specific instances of managed components in the smaller domains;
for each smaller domain, executing a second computer code utilizing said data structures to determine the corresponding effects on one or more managed components caused by the one or more events in the smaller domain; and
combining the determined effects in each said smaller domain into one or more effects in said system.
Specification