Automated Datacenter Network Failure Mitigation
Abstract
The subject disclosure is directed towards a technology that automatically mitigates datacenter failures, instead of relying on human intervention to diagnose and repair the network. Via a mitigation pipeline, when a network failure is detected, a candidate set of components that are likely to be the cause of the failure is identified, with mitigation actions iteratively targeting each component to attempt to alleviate the problem. The impact to the network is estimated to ensure that the redundancy present in the network will be able to handle the mitigation action without adverse disruption to the network.
20 Claims
1. A method performed at least in part by at least one processor, comprising:
monitoring a network;
detecting a failure;
determining a component set corresponding to the failure, in which the component set comprises one or more suspected faulty components; and
taking automated action on the component set to mitigate the failure, including when the component set comprises a plurality of components, iterating through one or more of the components applying one or more mitigation actions until the failure is mitigated.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
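The iteration recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function `mitigate`, the component names, and the `restart` and `deactivate` actions are all hypothetical, chosen only to show the "iterate until the failure is mitigated" control flow.

```python
from typing import Callable, Iterable, Optional, Tuple


def mitigate(
    suspects: Iterable[str],
    actions: Iterable[Callable[[str], None]],
    failure_persists: Callable[[], bool],
) -> Optional[Tuple[str, str]]:
    """Try each action on each suspected component until the failure clears.

    Returns (component, action name) for the step that cleared the failure,
    or None if every combination was tried without success.
    """
    actions = list(actions)  # allow re-use across components
    for component in suspects:
        for action in actions:
            action(component)
            if not failure_persists():
                return component, action.__name__
    return None


# Toy scenario (assumed): only deactivating "agg-2" clears the failure.
faulty = {"agg-2"}
disabled = set()


def restart(component: str) -> None:
    pass  # restarting has no effect in this toy scenario


def deactivate(component: str) -> None:
    disabled.add(component)


def failure_persists() -> bool:
    return not faulty <= disabled


result = mitigate(["tor-7", "agg-2", "core-1"], [restart, deactivate], failure_persists)
print(result)  # ('agg-2', 'deactivate')
```

The sketch does not roll back actions that failed to help ("tor-7" stays deactivated here); whether to undo them is a design choice the claim text does not address.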
11. A system comprising:
a failure detector configured to process network state data to determine a state indicative of a network failure;
a planner configured to determine a mitigation plan for mitigating the network failure, including a plan to iterate through a plurality of suspected faulty components to apply one or more mitigation actions to one or more components until the failure is mitigated, the planner coupled to an impact estimator configured to determine an impact if an action is taken, the planner further configured to adjust the plan based upon the impact; and
a plan executor, the plan executor configured to access the mitigation plan and take one or more actions identified in the plan on a network component set comprising the plurality of suspected faulty components to mitigate the failure.
Dependent claims: 12, 13, 14, 15, 16
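One way to picture how the components of claim 11 fit together is the following Python sketch. Every name in it is an assumption made for illustration: the loss-rate threshold, the `spare_paths` redundancy count, the `restart` and `deactivate` actions, and the rule that only deactivation needs a spare path are not taken from the patent text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str]  # (component, action); hypothetical representation


@dataclass
class FailureDetector:
    loss_threshold: float = 0.05  # assumed threshold

    def suspects(self, state: Dict[str, float]) -> List[str]:
        """Return components whose observed loss rate exceeds the threshold."""
        return sorted(c for c, loss in state.items() if loss > self.loss_threshold)


@dataclass
class ImpactEstimator:
    spare_paths: Dict[str, int]  # redundant paths left if the component is removed

    def acceptable(self, step: Step) -> bool:
        component, action = step
        # Assumed rule: a restart is tolerable; deactivation needs a spare path.
        return action != "deactivate" or self.spare_paths.get(component, 0) > 0


@dataclass
class Planner:
    estimator: ImpactEstimator

    def plan(self, suspects: List[str]) -> List[Step]:
        """Build a candidate plan, then adjust it using the estimated impact."""
        candidate = [(c, a) for c in suspects for a in ("restart", "deactivate")]
        return [s for s in candidate if self.estimator.acceptable(s)]


class PlanExecutor:
    def run(self, plan: List[Step], apply: Callable[[Step], bool]) -> Optional[Step]:
        """Apply steps in order; stop at the first one that mitigates the failure."""
        for step in plan:
            if apply(step):
                return step
        return None


state = {"tor-7": 0.00, "agg-2": 0.12, "core-1": 0.08}
estimator = ImpactEstimator(spare_paths={"agg-2": 1, "core-1": 0})
plan = Planner(estimator).plan(FailureDetector().suspects(state))
print(plan)
# [('agg-2', 'restart'), ('agg-2', 'deactivate'), ('core-1', 'restart')]

# Stand-in for real device commands: only deactivating agg-2 helps.
fixed = PlanExecutor().run(plan, lambda step: step == ("agg-2", "deactivate"))
print(fixed)  # ('agg-2', 'deactivate')
```

Deactivating "core-1" is dropped from the plan because the assumed redundancy data shows no spare path for it, which is the sense in which the planner adjusts the plan based on impact.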
17. One or more computer-readable devices having computer-executable instructions, which when executed by at least one computer perform steps, comprising:
a) determining that a network failure corresponding to a component set has occurred;
b) providing a mitigation plan, the mitigation plan comprising one or more mitigation actions that if taken on one or more suspected faulty components of the component set are likely to mitigate the failure;
c) estimating whether a selected action of the mitigation plan, if taken on a component, will adversely impact the network, and if so, discarding that action, and if not, keeping the action for execution;
d) performing the selected action on a suspected faulty component and determining whether the action mitigated the failure, and if so, advancing to step e), and if not, returning to step c) to select another action until the failure is mitigated or no other action in the plan remains to be performed; and
e) recording or updating, or both recording and updating, information regarding the failure, the plan, and one or more mitigation-related actions.
Dependent claims: 18, 19, 20
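Steps c) through e) of claim 17 describe a loop with an explicit discard branch and a final record. The Python sketch below shows that control flow only. The `run_plan` function, the shape of the returned record, and the example plan are hypothetical, and the sketch also returns a record when the plan is exhausted without success, which is an assumption beyond the claim wording.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]  # (component, action); hypothetical representation


def run_plan(
    plan: List[Step],
    adverse: Callable[[Step], bool],
    perform: Callable[[Step], bool],
) -> Dict[str, object]:
    """Filter each action by estimated impact, perform it, re-check, then record."""
    discarded: List[Step] = []
    performed: List[Step] = []
    mitigated = False
    for step in plan:               # step c): select the next action
        if adverse(step):           # estimated to harm the network: discard it
            discarded.append(step)
            continue
        performed.append(step)      # step d): perform the kept action
        if perform(step):           # perform() reports whether the failure cleared
            mitigated = True
            break                   # advance to step e)
    # step e): record what was discarded, what was tried, and the outcome
    return {"discarded": discarded, "performed": performed, "mitigated": mitigated}


plan = [
    ("core-1", "deactivate"),
    ("agg-2", "restart"),
    ("agg-2", "deactivate"),
    ("tor-7", "restart"),
]
record = run_plan(
    plan,
    adverse=lambda s: s == ("core-1", "deactivate"),  # assumed impact estimate
    perform=lambda s: s == ("agg-2", "deactivate"),   # assumed outcome check
)
print(record["mitigated"], record["discarded"], record["performed"])
```

In this toy run the first action is discarded as too disruptive, the second is performed without effect, the third mitigates the failure, and the fourth is never attempted.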
Specification