Grouping failures to infer common causes

US 7,529,974 B2
Filed: 11/30/2006
Issued: 05/05/2009
Est. Priority Date: 11/30/2006
Status: Active Grant

Abstract

Systems and methods establish groups among numerous indications of failure in order to infer a cause of failure common to each group. In one implementation, a system computes the groups such that each group has the maximum likelihood of resulting from a common failure. Indications of failure are grouped by probability, even when a group's inferred cause of failure is not directly observable in the system. In one implementation, related matrices provide a system for receiving numerous health indications from each of numerous autonomous systems connected with the Internet. A correlational matrix links input (failure symptoms) and output (known or unknown root causes) through probability-based hypothetical groupings of the failure indications. The matrices are iteratively refined according to self-consistency and parsimony metrics to provide most likely groupings of indicators and most likely causes of failure.
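
The three related matrices can be pictured with a small sketch. This is only an illustration under stated assumptions, not the patented implementation: the names `F`, `A`, `D`, `build_failure_matrix` and `is_consistent` are hypothetical, and the rule that a component is in failure exactly when it is linked to at least one active cause is one reading of the binary intermediate matrix described in the claims.

```python
# Hypothetical layout (all names are illustrative assumptions):
#   F[c][t]    = 1 if component c is in failure during interval t (input matrix)
#   A[k][t]    = 1 if inferred cause k is active during interval t (output matrix)
#   D[c][k][t] = 1 if the failure of c at t is attributed to cause k (intermediate)

def build_failure_matrix(events, components, n_intervals):
    """Arrange (component, interval) failure indications into F."""
    index = {name: i for i, name in enumerate(components)}
    F = [[0] * n_intervals for _ in components]
    for name, t in events:
        F[index[name]][t] = 1
    return F

def is_consistent(F, A, D):
    """Check that D links F and A without contradiction (assumed rule)."""
    n_causes = len(A)
    for c in range(len(F)):
        for t in range(len(F[0])):
            links = [D[c][k][t] for k in range(n_causes)]
            # A failure may only be attributed to a cause that is active.
            if any(link and not A[k][t] for k, link in enumerate(links)):
                return False
            # A component is in failure exactly when some cause explains it.
            if F[c][t] != int(any(links)):
                return False
    return True

components = ["AS1", "AS2", "AS3"]
events = [("AS1", 0), ("AS2", 0), ("AS3", 2)]
F = build_failure_matrix(events, components, n_intervals=3)

# Two hypothesized causes: cause 0 explains the joint failure at t=0,
# cause 1 explains the isolated failure at t=2.
A = [[1, 0, 0],
     [0, 0, 1]]
D = [[[1, 0, 0], [0, 0, 0]],   # AS1
     [[1, 0, 0], [0, 0, 0]],   # AS2
     [[0, 0, 0], [0, 0, 1]]]   # AS3

print(is_consistent(F, A, D))
```

Fixing a time index `t` and reading `D[c][k][t]` over all components and causes gives one time slice, which is where candidate groupings are compared.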

19 Claims

1. A computer-executable method, comprising:
- in a system of interrelated components, monitoring numerous components over time to detect a failure status of each of the numerous components with respect to intervals of the time;
  
  for each interval of the time, receiving a failure indication for each component that is in failure during that interval;
  
  forming one or more groups of the received failure indications, each group inferring a cause of failure common to the group;
  
  arranging the received failure indications into a first matrix representing components versus time, the first matrix indicating the time intervals during which each component is in failure;
  
  arranging the inferred causes of failure into a second matrix representing inferred causes of failure versus time;
  
  correlating the components in the first matrix to the inferred causes of failure in the second matrix via a 3-dimensional intermediate matrix representing time slices, each time slice containing probability-based hypothetical groupings of the failure indications received at the time of the time slice and corresponding inferred causes of failure; and
  
  ranking candidate values for each time slice to distinguish more probable hypothetical groupings of the failure indications from less probable hypothesized groupings of the failure indications.
  - 2. The computer-executable method as recited in claim 1, wherein the identity of the inferred cause of failure is unknown.
  - 3. The computer-executable method as recited in claim 1, further comprising:
    - compiling a record of the inferred causes of failure associated with the groups;
      
      prioritizing the inferred causes of failure based on the groups; and
      
      weighing whether to invest a resource in fixing a particular inferred cause of failure based on the prioritizing.
  - 4. The computer-executable method as recited in claim 1, wherein the candidate values for time slices are ranked according to a self-consistency metric and a parsimony metric, the highest ranked time slice providing the most likely hypothetical groupings of the failure indications and the most likely corresponding inferred causes of failure, for a given set of failure indications received as input at a particular time.
  - 5. The computer-executable method as recited in claim 4, wherein the self-consistency metric analyzes each time slice against an expectation that each inferred cause of failure expresses itself in the first, second, and intermediate matrices with approximately the same set of failure indications each time the inferred cause of failure occurs, wherein each time slice is ranked in relation to how closely the time slice matches the self-consistency of the matrices.
  - 6. The computer-executable method as recited in claim 4, wherein the parsimony metric analyzes each time slice against an expectation that a less complex hypothetical grouping of failure indications is more probable than a more complex hypothetical grouping of failure indications for inferring the same cause of failure.
  - 7. The computer-executable method as recited in claim 6, wherein the parsimony metric uses priors on a number of the inferred causes of failure or uses a cost function to obtain a complexity ranking.
  - 8. The computer-executable method as recited in claim 6, wherein the parsimony metric prefers a hypothetical grouping of failure indications with additional groups when probabilities drawn from a random process such as a Beta Process indicate the additional groups are more likely than hypothetical groupings that do not use the additional groups.
  - 9. The computer-executable method as recited in claim 4, wherein the intermediate matrix comprises a binary matrix, and for a given component at a given time, if the component is not in failure then a correspondence to an inferred cause of failure is a value of zero in the intermediate matrix and wherein if the component is in failure then a correspondence to an inferred cause of failure is a value of one in the intermediate matrix.
  - 10. The computer-executable method as recited in claim 4, further comprising iteratively refining the intermediate matrix and the second matrix until the time slices of the intermediate matrix reach acceptable levels or an acceptable rate of change in the self-consistency metric and the parsimony metric.
  - 11. The computer-executable method as recited in claim 10, wherein iteratively refining the intermediate matrix and the second matrix includes iteratively erasing horizontal time slices of a 3-dimensional intermediate matrix space and regenerating the slices consistent with remaining values in the intermediate matrix, including:
    - computing a vector of probabilities for each inferred cause of failure, the vector comprising the probability that a component fails when the inferred cause of failure is active;
      
      computing new values for the intermediate matrix for a given time based on the vector;
      
      removing inferred causes of failure that are not associated with an indication of failure; and
      
      updating the second matrix based on changes in the intermediate matrix.
  - 12. The computer-executable method as recited in claim 11, further comprising recomputing vertical slices of the intermediate matrix, including:
    - erasing a slice of the intermediate matrix corresponding to a component;
      
      if the component is not in failure at a given time, then leaving associations between the component and inferred causes of failure unchanged in the intermediate matrix;
      
      if the component is in failure at a given time, and there is an inferred cause of failure in the second matrix for that time, then setting at least one association in the intermediate matrix between the component and the inferred cause of failure equal to a value of one; and
      
      generating a new inferred cause of failure according to a Beta Process if a probability of a new inferred cause of failure is greater than a probability that a failure indication is caused by existing inferred causes of failure.
  - 13. The computer-executable method as recited in claim 1, wherein the system of interrelated components comprises the Internet, and the interrelated components comprise autonomous systems communicatively coupled with the Internet.
  - 14. The computer-executable method as recited in claim 1, wherein the computer-executable method outputs one of:
    - a set of inferred causes of failure representing distinct causes of the failure indications;
      
      for each inferred cause of failure and each component, the probability that the component fails when the inferred cause of failure occurs;
      
      for each point in time, a list of inferred causes of failure that are currently causing failure indications at that point in time; and
      
      for each failure indication, a list of inferred causes of failure that are most likely to be the cause.
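
The ranking of candidate groupings for a single time slice (claims 1 and 4 through 7) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed method: a noisy-OR likelihood stands in for the self-consistency metric, a fixed cost per active cause stands in for the parsimony cost function, and the names `log_likelihood`, `rank_hypotheses`, `leak` and `cost_per_cause` are hypothetical. The table `p[k][c]` plays the role of the per-cause probability vector of claim 11.

```python
import math
from itertools import combinations

def log_likelihood(failed, active, p, leak=0.01):
    """Log-probability of the observed failures given a set of active causes.

    Assumed noisy-OR model: a component stays healthy only if every active
    cause, and a small background 'leak', fails to break it.
    Entries of p must be strictly below 1 so that no logarithm sees zero.
    """
    total = 0.0
    for c, is_failed in enumerate(failed):
        prob_ok = 1.0 - leak
        for k in active:
            prob_ok *= 1.0 - p[k][c]
        total += math.log(1.0 - prob_ok if is_failed else prob_ok)
    return total

def rank_hypotheses(failed, p, cost_per_cause=2.0, leak=0.01):
    """Score every subset of causes; the best-scoring hypothesis comes first."""
    scored = []
    for size in range(len(p) + 1):
        for active in combinations(range(len(p)), size):
            fit = log_likelihood(failed, active, p, leak)   # self-consistency
            penalty = cost_per_cause * size                 # parsimony
            scored.append((fit - penalty, active))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored

# p[k][c]: assumed probability that component c fails when cause k is active.
p = [[0.90, 0.90, 0.05],
     [0.05, 0.05, 0.90]]
failed = [1, 1, 0]          # components 0 and 1 are in failure in this slice

ranked = rank_hypotheses(failed, p)
best_score, best_causes = ranked[0]
print(best_causes)          # the single cause 0 explains both failures
```

Enumerating every subset is exponential in the number of causes, so this only works for toy inputs; it is here to show how a fit term and a complexity term combine into one ranking.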

15. A system, comprising:
- an input matrix to arrange failure indications received from sensors monitoring a network with respect to time;
  
  an output matrix to arrange inferred causes of failure to be associated with the failure indications with respect to time; and
  
  a 3-dimensional intermediate matrix to associate the failure indications with the inferred causes of failure, the 3-dimensional intermediate matrix representing time slices, each time slice containing probability-based hypothetical groupings of the failure indications received at the time of the time slice and corresponding inferred causes of failure.
  - 16. The system as recited in claim 15, further comprising:
    - an iterator to refine the intermediate matrix and the output matrix based on refining slices of the intermediate matrix according to:
      
      self-consistency of the input, output, and intermediate matrices; and
      
      parsimony of the associations between the failure indications and the inferred causes of failure, wherein a less complex hypothetical grouping of failure indications has a higher probability than a more complex hypothetical grouping of the failure indications of being caused by a same cause of failure.
  - 17. The system as recited in claim 15, further comprising a database of inferred known causes and inferred unknown causes.
  - 18. The system as recited in claim 15, further comprising a result iterator, to persist correlations between groups of failure indications and the inferred causes of failure between runtimes of the system.
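
An iterator of the kind recited in claim 16 needs two ingredients: an estimate of how each inferred cause expresses itself across components, and a rule for when to stop refining. The sketch below shows one possible form of each. It is an assumption-laden illustration: the smoothed-count estimator (a Beta-prior posterior mean) is a simple stand-in and is not the Beta Process mentioned in the claims, and `estimate_failure_probs`, `refine_until_stable`, `alpha`, `beta` and `tol` are hypothetical names.

```python
def estimate_failure_probs(F, A, alpha=1.0, beta=1.0):
    """Estimate p[k][c]: chance that component c fails while cause k is active.

    F[c][t] is the input matrix and A[k][t] the output matrix (both binary).
    alpha and beta are assumed smoothing pseudo-counts, so that causes with
    few active intervals do not yield probabilities of exactly 0 or 1.
    """
    probs = []
    for active_row in A:
        n_active = sum(active_row)
        row = []
        for fail_row in F:
            n_both = sum(f * a for f, a in zip(fail_row, active_row))
            row.append((n_both + alpha) / (n_active + alpha + beta))
        probs.append(row)
    return probs

def refine_until_stable(state, step, score, tol=1e-3, max_iters=100):
    """Apply `step` repeatedly until `score` changes by less than `tol`.

    `step` and `score` are caller-supplied callables; in the setting above,
    `step` would regenerate slices of the intermediate matrix and `score`
    would combine the self-consistency and parsimony metrics.
    """
    previous = score(state)
    for iteration in range(1, max_iters + 1):
        state = step(state)
        current = score(state)
        if abs(current - previous) < tol:
            return state, iteration
        previous = current
    return state, max_iters

F = [[1, 0, 1, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 0]]
A = [[1, 0, 1, 0]]          # one inferred cause, active at t=0 and t=2

probs = estimate_failure_probs(F, A)
print(probs)                # components 0 and 1 are strongly tied to the cause

# Toy stand-in for a refinement step, used only to exercise the stopping rule.
final_state, n_iters = refine_until_stable(1.0, lambda x: x / 2, lambda x: x, tol=0.1)
```

Stopping on the rate of change rather than on an absolute score mirrors the wording of claim 10, which accepts either acceptable levels or an acceptable rate of change.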

19. A system, comprising:
- a first component for receiving numerous health indications from each of numerous components connected with the Internet; and
  
  a second component for grouping failure incidents among the health indications into groups, such that each group implies a cause of failure common to the group;
  
  a third component for arranging the received failure indications into a first matrix representing components versus time, the first matrix indicating the time intervals during which each component is in failure;
  
  a fourth component for arranging the inferred causes of failure into a second matrix representing inferred causes of failure versus time;
  
  a fifth component for correlating the components in the first matrix to the inferred causes of failure in the second matrix via a 3-dimensional intermediate matrix representing time slices, each time slice containing probability-based hypothetical groupings of the failure indications received at the time of the time slice and corresponding inferred causes of failure; and
  
  a sixth component for ranking candidate values for each time slice to distinguish more probable hypothetical groupings of the failure indications from less probable hypothesized groupings of the failure indications.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Kiciman, Emre; Thibaux, Romain; Maltz, David A.
Primary Examiner(s): Beausoliel, Robert
Assistant Examiner(s): Mehrmanesh, Elmira
Application Number: US11/565,538
Publication Number: US 20080133288A1
Time in Patent Office: 887 Days
Field of Search: 714/26, 714/43, 714/47
US Class Current: 714/26
CPC Class Codes: G06Q 10/04 Forecasting or optimisation...
