IDENTIFYING LIKELY FAULTY COMPONENTS IN A DISTRIBUTED SYSTEM

US 20130332399A1
Filed: 03/15/2013
Published: 12/12/2013
Est. Priority Date: 06/06/2012
Status: Active Grant

Abstract

In general, techniques are described for automatically identifying likely faulty components in massively distributed complex systems. In some examples, snapshots of component parameters are automatically repeatedly fed to a pre-trained classifier and the classifier indicates whether each received snapshot is likely to belong to a fault and failure class or to a non-fault/failure class. Components whose snapshots indicate a high likelihood of fault or failure are investigated, restarted or taken off line as a pre-emptive measure. The techniques may be applied in a massively distributed complex system such as a data center.
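The abstract's last step, acting on components whose snapshots look likely to fail, can be sketched roughly as follows. This is only an illustration under stated assumptions: the source does not say how a likelihood maps to "investigated, restarted or taken off line", so the three thresholds, the function name `triage`, and the stand-in scoring function are all hypothetical.

```python
from typing import Callable, Dict, Tuple

ParameterSnapshot = Tuple[float, ...]


def triage(
    snapshots: Dict[str, ParameterSnapshot],
    failure_likelihood: Callable[[ParameterSnapshot], float],
    investigate_at: float = 0.5,   # hypothetical threshold
    restart_at: float = 0.8,       # hypothetical threshold
    offline_at: float = 0.95,      # hypothetical threshold
) -> Dict[str, str]:
    """Map each component to a pre-emptive action based on its latest snapshot.

    `failure_likelihood` stands in for the pre-trained classifier; components
    scoring below `investigate_at` are left out of the result.
    """
    actions: Dict[str, str] = {}
    for component, snapshot in snapshots.items():
        likelihood = failure_likelihood(snapshot)
        if likelihood >= offline_at:
            actions[component] = "take_offline"
        elif likelihood >= restart_at:
            actions[component] = "restart"
        elif likelihood >= investigate_at:
            actions[component] = "investigate"
    return actions


def toy_likelihood(snapshot: ParameterSnapshot) -> float:
    """Placeholder scorer (assumption): the largest normalized parameter value."""
    return max(snapshot)


latest = {
    "server-1": (0.10, 0.20),
    "server-2": (0.60, 0.30),
    "tor-switch-1": (0.85, 0.10),
    "chassis-switch-1": (0.99, 0.50),
}
print(triage(latest, toy_likelihood))
```

In practice the scoring function would be the trained classifier, and the thresholds would need tuning against observed failure data.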

20 Claims

1. A method of predicting component failure, the method comprising:
- receiving, by a communication protocol and with a virtual network controller that includes an analytics plane to analyze operations of a plurality of components in one or more virtual networks, a first parameter set from each of the components, wherein a parameter set from a component includes one or more quantitative parameters that each describes a state of the component;
  
  receiving, by the communication protocol and with the virtual network controller, an indication of detected component failure for one or more of the components;
  
  training, with the virtual network controller and using the first parameter sets and the indication of detected component failure, a trainable automated classifier to develop a classifying structure that distinguishes between component parameter sets that logically associate with a detected component failure and component parameter sets that do not logically associate with a detected component failure;
  
  receiving, by the communication protocol and with the virtual network controller, a second parameter set from each of the components; and
  
  predicting, with the virtual network controller using the trainable automated classifier and the classifying structure, a failure of a first one of the components.
  - 2. The method of claim 1, wherein predicting a failure of a first one of the components comprises classifying the second parameter set for the first one of the components to a likely bad class according to the classifying structure.
  - 3. The method of claim 1, wherein the classifying structure comprises one or more classification separation surfaces, and wherein predicting a failure of a first one of the components comprises classifying the second parameter set for the first one of the components to a likely bad class according to one of the classification separation surfaces.
  - 4. The method of claim 3, wherein the one of the classification separation surfaces is associated with a tolerance amount, and wherein classifying the second parameter set for the first one of the components to a likely bad class comprises determining the second parameter set exceeds the tolerance amount.
  - 5. The method of claim 1, wherein the trainable automated classifier comprises one or more support vector machines, and wherein training the trainable automated classifier comprises inputting the first parameter sets and the indication of detected component failure to the support vector machines to produce the classifying structure.
  - 6. The method of claim 1, wherein the virtual network controller is a distributed virtual network controller comprising a plurality of virtual network controller nodes, and wherein each of the virtual network controller nodes comprises an analytics virtual machine that exchanges at least some analytics information to implement the analytics plane.
  - 7. The method of claim 1, wherein the plurality of components includes virtual network elements that include one or more of servers, top-of-rack (TOR) switches, or chassis switches.
  - 8. The method of claim 1, wherein the virtual network controller uses a software-defined network protocol to receive the first parameter set from each of the components.
  - 9. The method of claim 1, wherein the components execute one of a forwarding plane, control plane, or configuration plane for the virtual networks.
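
Claims 1 through 4 describe training a classifier on labeled parameter sets and then classifying later parameter sets against a separation surface with a tolerance amount. A minimal Python sketch of that flow is given here. It is not the patented implementation: a plain perceptron stands in for the support vector machines named in claim 5, and the two features (CPU and memory utilization), the sample values, and the names `train_separating_surface` and `is_likely_bad` are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical parameter set: (cpu_utilization, memory_utilization), both in [0, 1].
ParameterSet = Tuple[float, ...]


def train_separating_surface(
    parameter_sets: Sequence[ParameterSet],
    failed: Sequence[bool],
    max_epochs: int = 1000,
) -> Tuple[List[float], float]:
    """Learn a linear surface w.x + b = 0 with the perceptron rule.

    Stand-in for an SVM; it stops early only if the data is linearly separable.
    """
    weights = [0.0] * len(parameter_sets[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, has_failed in zip(parameter_sets, failed):
            y = 1.0 if has_failed else -1.0
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0.0:  # misclassified: move the surface toward x
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:  # every training set is on the correct side
            break
    return weights, bias


def is_likely_bad(
    x: ParameterSet, weights: Sequence[float], bias: float, tolerance: float = 0.0
) -> bool:
    """Classify a parameter set as 'likely bad' when its score exceeds the tolerance."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return score > tolerance


# First parameter sets plus the failure indications received for them.
first_sets = [(0.2, 0.3), (0.3, 0.2), (0.4, 0.4), (0.9, 0.8), (0.8, 0.95), (0.95, 0.9)]
failure_seen = [False, False, False, True, True, True]
weights, bias = train_separating_surface(first_sets, failure_seen)

# A second parameter set from one component, checked at two tolerance amounts.
second_set = (0.7, 0.5)
print(is_likely_bad(second_set, weights, bias))
print(is_likely_bad(second_set, weights, bias, tolerance=0.5))
```

Raising the tolerance trades fewer false alarms for a higher chance of missing a component that is drifting toward failure.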

10. A method for identifying likely faulty components in a massively distributed system, the method comprising:
- (a) subdividing the system into a plurality of tiers;
  
  (b) for each respective tier, identifying respective quantitative parameters of respective components of the respective tier whose quantitative values are likely to act as indicators of component failure;
  
  (c) for each respective tier, automatically repeatedly capturing sample snapshots of the identified respective quantitative parameters of the tier components;
  
  (d) for each respective tier, automatically repeatedly detecting component failures;
  
  (e) for each respective detected component failure, logically associating the detected component failure with one or more of the respective captured parameter snapshots that immediately preceded the respective component failure;
  
  (f) automatically repeatedly training a trainable automated classifier to develop a classifying structure that distinguishes between first component parameter sets that logically associate with a detected failure and second component parameter sets that do not logically associate with a detected failure;
  
  (g) after said training, placing the trained classifier in a prediction mode wherein the trained classifier is automatically repeatedly fed with the automatically repeatedly captured sample snapshots and wherein the trained classifier uses its developed classifying structure to classify the in-prediction-mode sample snapshots as correlating to likely failure or as correlating to likely non-failure;
  
  (h) investigating those of the in-prediction-mode sample snapshots that were correlated to failure as being likely to be fault-indicating parameter sets; and
  
  (i) taking preemptive measures for those of the respective tier components that were determined to be more highly likely to enter a failure mode based on the in-prediction-mode indication that the corresponding sample snapshots correlate to failure.
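
Step (e) of claim 10, associating each detected failure with the snapshots that immediately preceded it, is what produces the labels used for training in step (f). One way to sketch it in Python is shown here. The claim does not define "immediately preceded", so the fixed `lookback` window is an assumption, as are the `Snapshot` and `Failure` record types and the component names.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Snapshot:
    component: str
    time: float                 # capture time, in seconds
    params: Tuple[float, ...]   # the tier's identified quantitative parameters


@dataclass(frozen=True)
class Failure:
    component: str
    time: float                 # detection time, in seconds


def label_snapshots(
    snapshots: Sequence[Snapshot],
    failures: Sequence[Failure],
    lookback: float = 60.0,     # assumed meaning of "immediately preceded"
) -> List[Tuple[Snapshot, bool]]:
    """Pair each snapshot with True if the same component failed within
    `lookback` seconds after the snapshot was captured, else False."""
    failure_times: Dict[str, List[float]] = {}
    for failure in failures:
        failure_times.setdefault(failure.component, []).append(failure.time)

    labeled: List[Tuple[Snapshot, bool]] = []
    for snap in snapshots:
        times = failure_times.get(snap.component, [])
        preceded_failure = any(0.0 < t - snap.time <= lookback for t in times)
        labeled.append((snap, preceded_failure))
    return labeled


snapshots = [
    Snapshot("tor-1", 0.0, (0.30, 0.20)),
    Snapshot("tor-1", 50.0, (0.70, 0.60)),
    Snapshot("tor-1", 100.0, (0.95, 0.90)),
    Snapshot("server-2", 100.0, (0.40, 0.30)),
]
failures = [Failure("tor-1", 110.0)]

for snap, label in label_snapshots(snapshots, failures):
    print(snap.component, snap.time, label)
```

The labeled pairs can then be split by tier and passed to whatever trainable classifier is in use.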

11. A virtual network controller comprising:
- an analytics plane;
  
  a control plane;
  
  one or more processors configured to execute the analytics plane to analyze operations of a plurality of components in one or more virtual networks, wherein the control plane receives, by a communication protocol, a first parameter set from each of the components, wherein a parameter set from a component includes one or more quantitative parameters that each describe a state of the component, wherein the control plane receives, by the communication protocol, an indication of detected component failure for one or more of the components, and wherein the control plane provides the first parameter sets and the indication of detected component failure to the analytics plane;
  
  a trainable automated classifier, wherein the analytics plane trains, using the first parameter sets and the indication of detected component failure, the trainable automated classifier to develop a classifying structure that distinguishes between first component parameter sets that logically associate with a detected component failure and second component parameter sets that do not logically associate with a detected component failure, wherein the control plane receives, by the communication protocol, a second parameter set from each of the components and provides the second parameter sets to the analytics plane, and wherein the analytics plane predicts, using the trainable automated classifier and the classifying structure, a failure of a first one of the components.
  - 12. The virtual network controller of claim 11, wherein predicting a failure of a first one of the components comprises classifying the second parameter set for the first one of the components to a likely bad class according to the classifying structure.
  - 13. The virtual network controller of claim 11, wherein the classifying structure comprises one or more classification separation surfaces, and wherein the analytics plane predicts the failure of a first one of the components by classifying the second parameter set for the first one of the components to a likely bad class according to one of the classification separation surfaces.
  - 14. The virtual network controller of claim 13, wherein the one of the classification separation surfaces is associated with a tolerance amount, and wherein classifying the second parameter set for the first component to a likely bad class comprises determining the second parameter set exceeds the tolerance amount.
  - 15. The virtual network controller of claim 11, wherein the trainable automated classifier comprises one or more support vector machines, and wherein the analytics plane trains the trainable automated classifier by inputting the first parameter sets and the indication of detected component failure to the support vector machines to produce the classifying structure.
  - 16. The virtual network controller of claim 11, further comprising:
    - a plurality of virtual network controller nodes that implement a distributed virtual network controller, wherein each of the virtual network controller nodes comprises an analytics virtual machine that exchanges at least some analytics information to implement the analytics plane.
  - 17. The virtual network controller of claim 11, wherein the plurality of components includes virtual network elements that include one or more of servers, top-of-rack (TOR) switches, or chassis switches.
  - 18. The virtual network controller of claim 11, wherein the virtual network controller uses a software-defined network protocol to receive the first parameter set from each of the components.
  - 19. The virtual network controller of claim 11, wherein the components execute one of a forwarding plane, control plane, or configuration plane for the virtual networks.
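
The division of labor in claim 11, where the control plane receives data from components and hands it to the analytics plane for training and prediction, might be laid out roughly as follows. This is a structural sketch only: the class and method names are hypothetical, message transport is reduced to plain method calls, and a nearest-centroid rule is used as a placeholder for the trainable classifier rather than anything the claims specify.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

Params = Tuple[float, ...]


def _mean(rows: Sequence[Params]) -> Params:
    """Column-wise mean of a non-empty list of parameter sets."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))


def _dist2(a: Params, b: Params) -> float:
    """Squared Euclidean distance between two parameter sets."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class AnalyticsPlane:
    """Trains on first parameter sets and predicts from second parameter sets."""

    def __init__(self) -> None:
        self.bad_centroid: Optional[Params] = None
        self.good_centroid: Optional[Params] = None

    def train(self, first_sets: Dict[str, Params], failed: Set[str]) -> None:
        bad = [p for name, p in first_sets.items() if name in failed]
        good = [p for name, p in first_sets.items() if name not in failed]
        if not bad or not good:
            raise ValueError("need both failed and non-failed examples")
        # Placeholder classifying structure: one centroid per class.
        self.bad_centroid, self.good_centroid = _mean(bad), _mean(good)

    def predict(self, second_sets: Dict[str, Params]) -> List[str]:
        if self.bad_centroid is None or self.good_centroid is None:
            raise RuntimeError("train() must be called first")
        return sorted(
            name for name, p in second_sets.items()
            if _dist2(p, self.bad_centroid) < _dist2(p, self.good_centroid)
        )


class ControlPlane:
    """Receives data from components and provides it to the analytics plane."""

    def __init__(self, analytics: AnalyticsPlane) -> None:
        self.analytics = analytics
        self.first_sets: Dict[str, Params] = {}
        self.failed: Set[str] = set()

    def receive_parameter_set(self, component: str, params: Params) -> None:
        self.first_sets[component] = params

    def receive_failure_indication(self, component: str) -> None:
        self.failed.add(component)

    def provide_training_data(self) -> None:
        self.analytics.train(self.first_sets, self.failed)

    def forward_second_sets(self, second_sets: Dict[str, Params]) -> List[str]:
        return self.analytics.predict(second_sets)


control = ControlPlane(AnalyticsPlane())
for name, params in {"srv-1": (0.2, 0.1), "srv-2": (0.3, 0.2),
                     "tor-1": (0.9, 0.8), "tor-2": (0.8, 0.9)}.items():
    control.receive_parameter_set(name, params)
control.receive_failure_indication("tor-1")
control.receive_failure_indication("tor-2")
control.provide_training_data()

predicted = control.forward_second_sets(
    {"srv-1": (0.85, 0.9), "srv-2": (0.2, 0.2), "tor-1": (0.3, 0.1)}
)
print(predicted)
```

Keeping the classifier behind the analytics plane's `train` and `predict` methods means a different model could be swapped in without changing how the control plane collects data.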

20. A non-transitory computer-readable medium comprising instructions that, when executed, cause one or more programmable processors to:
- receive, by a communication protocol and with a virtual network controller that includes an analytics plane to analyze operations of a plurality of components in one or more virtual networks, a first parameter set from each of the components, wherein a parameter set from a component includes one or more quantitative parameters that each describes a state of the component;
  
  receive, by the communication protocol and with the virtual network controller, an indication of detected component failure for one or more of the components;
  
  train, with the virtual network controller and using the first parameter sets and the indication of detected component failure, a trainable automated classifier to develop a classifying structure that distinguishes between component parameter sets that logically associate with a detected component failure and component parameter sets that do not logically associate with a detected component failure;
  
  receive, by the communication protocol and with the virtual network controller, a second parameter set from each of the components; and
  
  predict, with the virtual network controller using the trainable automated classifier and the classifying structure, a failure of a first one of the components.

Current Assignee: Juniper Networks Incorporated
Original Assignee: Juniper Networks Incorporated
Inventors: Reddy, Rajashekar; Nakil, Harshad Bhaskar

Granted Patent: US 9,064,216 B2
US Class Current: 706/12
CPC Class Codes

G06F 11/008   Reliability or availability...

G06N 20/00   Machine learning

H04L 41/0631   using root cause analysis; ...

H04L 41/122   of virtualised topologies, ...

H04L 41/147   for predicting network beha...

H04L 41/40   using virtualisation of net...

H04L 43/04   Processing captured monitor...

H04L 43/0852   Delays

H04L 43/10   Active monitoring, e.g. hea...

H04L 43/20   the monitoring system or th...

H04L 45/16   Multipoint routing

H04L 45/38   Flow based routing

H04L 45/42   Centralised routing

H04L 45/48   Routing tree calculation

H04L 45/586   of virtual routers

H04L 61/103   across network layers, e.g....

H04L 69/40   for recovering from a failu...
