IDENTIFYING LIKELY FAULTY COMPONENTS IN A DISTRIBUTED SYSTEM
Abstract
In general, techniques are described for automatically identifying likely faulty components in massively distributed complex systems. In some examples, snapshots of component parameters are automatically and repeatedly fed to a pre-trained classifier, and the classifier indicates whether each received snapshot likely belongs to a fault/failure class or to a non-fault/failure class. Components whose snapshots indicate a high likelihood of fault or failure are investigated, restarted, or taken offline as a pre-emptive measure. The techniques may be applied in a massively distributed complex system such as a data center.
20 Claims
1. A method of predicting component failure, the method comprising:
receiving, by a communication protocol and with a virtual network controller that includes an analytics plane to analyze operations of a plurality of components in one or more virtual networks, a first parameter set from each of the components, wherein a parameter set from a component includes one or more quantitative parameters that each describes a state of the component;
receiving, by the communication protocol and with the virtual network controller, an indication of detected component failure for one or more of the components;
training, with the virtual network controller and using the first parameter sets and the indication of detected component failure, a trainable automated classifier to develop a classifying structure that distinguishes between component parameter sets that logically associate with a detected component failure and component parameter sets that do not logically associate with a detected component failure;
receiving, by the communication protocol and with the virtual network controller, a second parameter set from each of the components; and
predicting, with the virtual network controller using the trainable automated classifier and the classifying structure, a failure of a first one of the components.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
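The training and prediction steps of claim 1 can be illustrated with a minimal sketch. The claim does not name a classifier type, so a nearest-centroid classifier stands in here purely as an assumption; the component names, the two parameters (CPU utilization and queue depth), and the function names `train` and `predict_failures` are all hypothetical.

```python
from math import dist
from statistics import mean


def _centroid(rows):
    # Column-wise mean of a list of equal-length parameter tuples.
    return tuple(mean(col) for col in zip(*rows))


def train(first_sets, failed_ids):
    """Build a stand-in 'classifying structure': one centroid per class.

    first_sets maps component id -> tuple of quantitative parameters.
    failed_ids is the set of component ids with a detected failure.
    """
    failed = [p for cid, p in first_sets.items() if cid in failed_ids]
    healthy = [p for cid, p in first_sets.items() if cid not in failed_ids]
    if not failed or not healthy:
        raise ValueError("need at least one example of each class")
    return {"failure": _centroid(failed), "non_failure": _centroid(healthy)}


def predict_failures(structure, second_sets):
    """Return ids whose second parameter set lies closer to the failure centroid."""
    return sorted(
        cid
        for cid, p in second_sets.items()
        if dist(p, structure["failure"]) < dist(p, structure["non_failure"])
    )


# Hypothetical parameter sets: (cpu_utilization, queue_depth).
first_sets = {
    "vrouter-1": (0.95, 120),
    "vrouter-2": (0.30, 8),
    "vrouter-3": (0.35, 10),
    "vrouter-4": (0.90, 110),
}
failed_ids = {"vrouter-1", "vrouter-4"}

structure = train(first_sets, failed_ids)
second_sets = {"vrouter-2": (0.92, 115), "vrouter-3": (0.33, 9)}
predicted = predict_failures(structure, second_sets)
```

In this toy data the second parameter set of `vrouter-2` resembles the sets that accompanied earlier failures, so it is the component flagged. The parameters are left unscaled for brevity, which lets queue depth dominate the distance; a real deployment would likely normalize them.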
10. A method for identifying likely faulty components in a massively distributed system, the method comprising:
(a) subdividing the system into a plurality of tiers;
(b) for each respective tier, identifying respective quantitative parameters of respective components of the respective tier whose quantitative values are likely to act as indicators of component failure;
(c) for each respective tier, automatically repeatedly capturing sample snapshots of the identified respective quantitative parameters of the tier components;
(d) for each respective tier, automatically repeatedly detecting component failures;
(e) for each respective detected component failure, logically associating the detected component failure with one or more of the respective captured parameter snapshots that immediately preceded the respective component failure;
(f) automatically repeatedly training a trainable automated classifier to develop a classifying structure that distinguishes between first component parameter sets that logically associate with a detected failure and second component parameter sets that do not logically associate with a detected failure;
(g) after said training, placing the trained classifier in a prediction mode wherein the trained classifier is automatically repeatedly fed with the automatically repeatedly captured sample snapshots and wherein the trained classifier uses its developed classifying structure to classify the in-prediction-mode sample snapshots as correlating to likely failure or as correlating to likely non-failure;
(h) investigating those of the in-prediction-mode sample snapshots that were correlated to failure as being likely to be fault-indicating parameter sets; and
(i) taking preemptive measures for those of the respective tier components that were determined to be more highly likely to enter a failure mode based on the in-prediction-mode indication that the corresponding sample snapshots correlate to failure.
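Steps (c) through (e) of claim 10 amount to building labeled training data per tier. The sketch below shows one possible reading, assuming that "immediately preceded" means "captured within a fixed time window before the failure"; that window, the `Snapshot` type, the function `label_snapshots`, and the tier and component names are illustrative assumptions, not taken from the claim.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    tier: str         # tier the component belongs to (step a)
    component: str    # component id
    time: float       # capture time, arbitrary units
    params: tuple     # quantitative parameters chosen for this tier (step b)


def label_snapshots(snapshots, failures, window):
    """Associate failures with the snapshots that preceded them (step e).

    failures is a list of (component, failure_time) pairs (step d).
    A snapshot is labeled 1 if the same component failed no more than
    `window` time units after the snapshot was captured, else 0.
    Returns {tier: [(params, label), ...]} so each tier can be trained separately.
    """
    by_tier = {}
    for s in snapshots:
        preceded_failure = any(
            comp == s.component and 0 < t_fail - s.time <= window
            for comp, t_fail in failures
        )
        by_tier.setdefault(s.tier, []).append((s.params, 1 if preceded_failure else 0))
    return by_tier


# Hypothetical captured snapshots (step c) and detected failures (step d).
snapshots = [
    Snapshot("compute", "server-7", 10.0, (0.2,)),
    Snapshot("compute", "server-7", 50.0, (0.8,)),
    Snapshot("compute", "server-7", 58.0, (0.9,)),
    Snapshot("storage", "disk-2", 55.0, (0.1,)),
]
failures = [("server-7", 60.0)]

labeled = label_snapshots(snapshots, failures, window=15.0)
```

The per-tier lists of `(params, label)` pairs are the kind of input a classifier would consume in step (f); the choice of window length is a tuning decision that the claim leaves open.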
11. A virtual network controller comprising:
an analytics plane;
a control plane;
one or more processors configured to execute the analytics plane to analyze operations of a plurality of components in one or more virtual networks, wherein the control plane receives, by a communication protocol, a first parameter set from each of the components, wherein a parameter set from a component includes one or more quantitative parameters that each describe a state of the component, wherein the control plane receives, by the communication protocol, an indication of detected component failure for one or more of the components, and wherein the control plane provides the first parameter sets and the indication of detected component failure to the analytics plane;
a trainable automated classifier, wherein the analytics plane trains, using the first parameter sets and the indication of detected component failure, the trainable automated classifier to develop a classifying structure that distinguishes between first component parameter sets that logically associate with a detected component failure and second component parameter sets that do not logically associate with a detected component failure, wherein the control plane receives, by the communication protocol, a second parameter set from each of the components and provides the second parameter sets to the analytics plane, and wherein the analytics plane predicts, using the trainable automated classifier and the classifying structure, a failure of a first one of the components.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19
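The division of labor in claim 11, where the control plane receives data from components and the analytics plane trains and predicts, can be sketched as two cooperating objects. This is only an illustration: the claim does not name the communication protocol, so plain method calls stand in for it, and a single-parameter threshold stands in for the trainable classifier. The class and method names and the component ids are hypothetical.

```python
from statistics import mean


class AnalyticsPlane:
    """Owns the classifier; a one-parameter threshold is the stand-in here."""

    def __init__(self):
        self.threshold = None  # the learned "classifying structure"

    def train(self, first_sets, failed_ids):
        # Assumption: higher parameter values accompany failures, and both
        # classes are present in the training data.
        failed = [p[0] for cid, p in first_sets.items() if cid in failed_ids]
        healthy = [p[0] for cid, p in first_sets.items() if cid not in failed_ids]
        self.threshold = (mean(failed) + mean(healthy)) / 2

    def predict(self, second_sets):
        return sorted(cid for cid, p in second_sets.items() if p[0] > self.threshold)


class ControlPlane:
    """Receives messages from components and provides them to the analytics plane."""

    def __init__(self, analytics):
        self.analytics = analytics
        self.parameter_sets = {}
        self.failed_ids = set()

    def on_parameter_set(self, component_id, params):
        self.parameter_sets[component_id] = params

    def on_failure_indication(self, component_id):
        self.failed_ids.add(component_id)

    def provide_for_training(self):
        # Hand the first parameter sets and failure indications to analytics.
        self.analytics.train(self.parameter_sets, self.failed_ids)
        self.parameter_sets = {}

    def provide_for_prediction(self):
        # Hand the second parameter sets to analytics and return its prediction.
        result = self.analytics.predict(self.parameter_sets)
        self.parameter_sets = {}
        return result


control = ControlPlane(AnalyticsPlane())

# First parameter sets plus one failure indication, then training.
control.on_parameter_set("vm-a", (0.9,))
control.on_parameter_set("vm-b", (0.2,))
control.on_failure_indication("vm-a")
control.provide_for_training()

# Second parameter sets, then prediction.
control.on_parameter_set("vm-a", (0.3,))
control.on_parameter_set("vm-b", (0.8,))
likely_to_fail = control.provide_for_prediction()
```

Keeping the classifier entirely inside `AnalyticsPlane` mirrors the claim's separation: the control plane never inspects the classifying structure, it only forwards what it receives.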
20. A non-transitory computer-readable medium comprising instructions that, when executed, cause one or more programmable processors to:
receive, by a communication protocol and with a virtual network controller that includes an analytics plane to analyze operations of a plurality of components in one or more virtual networks, a first parameter set from each of the components, wherein a parameter set from a component includes one or more quantitative parameters that each describes a state of the component;
receive, by the communication protocol and with the virtual network controller, an indication of detected component failure for one or more of the components;
train, with the virtual network controller and using the first parameter sets and the indication of detected component failure, a trainable automated classifier to develop a classifying structure that distinguishes between component parameter sets that logically associate with a detected component failure and component parameter sets that do not logically associate with a detected component failure;
receive, by the communication protocol and with the virtual network controller, a second parameter set from each of the components; and
predict, with the virtual network controller using the trainable automated classifier and the classifying structure, a failure of a first one of the components.
Specification