Predicting infrastructure failures in a data center for hosted service mitigation actions
First Claim
1. A data center, comprising:
- one or more computing devices comprising one or more respective hardware processors and memory and configured to implement:
a plurality of infrastructure management systems, configured to manage a plurality of different types of infrastructure resources for the data center;
a plurality of services hosted using the infrastructure resources at the data center, wherein each of the plurality of services provides a different network-based service to clients remote to the data center; and
an infrastructure monitor, configured to:
collect operational metrics from the infrastructure management systems for the plurality of different types of infrastructure resources;
generate one or more failure models from the operational metrics;
evaluate the one or more failure models to predict whether an infrastructure failure event is to occur at a future time; and
responsive to a prediction of an occurrence of the infrastructure failure event, send respective notifications of the predicted infrastructure failure event via a programmatic interface to at least some of the plurality of services;
each respective service of the plurality of services, configured to:
receive the respective notification of the predicted infrastructure failure event via the programmatic interface;
determine a respective mitigation action for the respective service; and
automatically perform the respective mitigation action responsive to receipt of the respective notification of the predicted infrastructure failure event to mitigate one or more consequences associated with occurrence of the predicted infrastructure failure event.
Abstract
A data center may predict infrastructure failures in order to perform mitigation actions at services hosted at the data center. Operational metrics for different infrastructure systems of a data center may be collected and analyzed to generate failure models. The failure models may be evaluated to predict infrastructure failure events. The predicted infrastructure failure events may be programmatically provided to the services. The services may evaluate the prediction and select mitigation actions to perform. For data centers implemented as part of a provider network with services hosted across multiple data centers, mitigation actions may be performed at multiple data centers for a service in response to a predicted failure event at one data center.
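The monitoring flow described in the abstract (collect metrics, generate a failure model, evaluate it, notify services) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the class and method names, the "cooling" and "power" labels, and especially the failure model, which is a simple deviation-from-baseline score rather than any model the patent specifies. The programmatic interface is stood in for by plain callbacks.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Callable, Dict, List, Tuple


@dataclass
class FailurePrediction:
    resource_type: str     # illustrative label, e.g. "power" or "cooling"
    probability: float     # crude score in [0, 1], not a calibrated probability
    horizon_minutes: int   # how far ahead the event is predicted


class InfrastructureMonitor:
    """Hypothetical monitor: collects metrics, fits a model, notifies services."""

    def __init__(self, horizon_minutes: int = 30) -> None:
        self.horizon_minutes = horizon_minutes
        self.metrics: Dict[str, List[float]] = {}
        self.subscribers: List[Callable[[FailurePrediction], None]] = []

    def subscribe(self, callback: Callable[[FailurePrediction], None]) -> None:
        # Stand-in for the programmatic interface: services register a callback.
        self.subscribers.append(callback)

    def collect(self, resource_type: str, value: float) -> None:
        self.metrics.setdefault(resource_type, []).append(value)

    def generate_model(self, resource_type: str) -> Tuple[float, float]:
        # Assumed model: mean and standard deviation of all readings
        # except the latest one.
        history = self.metrics[resource_type][:-1]
        return mean(history), stdev(history)

    def evaluate(self, resource_type: str) -> FailurePrediction:
        baseline, spread = self.generate_model(resource_type)
        latest = self.metrics[resource_type][-1]
        # A perfectly flat history gives no spread to compare against,
        # so this sketch scores it as zero.
        z = (latest - baseline) / spread if spread else 0.0
        probability = max(0.0, min(1.0, z / 6.0))
        return FailurePrediction(resource_type, probability, self.horizon_minutes)

    def run_once(self, threshold: float = 0.5) -> List[FailurePrediction]:
        sent = []
        for resource_type, values in self.metrics.items():
            if len(values) < 3:
                continue  # stdev needs at least two historical readings
            prediction = self.evaluate(resource_type)
            if prediction.probability >= threshold:
                for notify in self.subscribers:
                    notify(prediction)
                sent.append(prediction)
        return sent
```

A service would call `subscribe` with its own handler, and the monitor would call `run_once` after each batch of `collect` calls. A real system would likely replace the score in `evaluate` with a trained model and the callbacks with a network API.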
92 Citations
20 Claims
1. A data center, as recited in full under First Claim above.
Dependent claims: 2, 3, 4
5. A method, comprising:
performing, by one or more computing devices:
receiving a stream of operational metrics from a plurality of different infrastructure systems of a data center, wherein the data center hosts a plurality of different services using the infrastructure resources, wherein each of the plurality of services provides a different network-based service to clients remote to the data center;
generating one or more failure models from the stream of operational metrics for the plurality of different types of infrastructure resources;
evaluating the one or more failure models to predict whether an infrastructure failure event is to occur at a future time; and
responsive to a prediction of an occurrence of the infrastructure failure event, reporting the prediction of the infrastructure failure event via a programmatic interface to at least some of the services hosted at the data center for determination of respective mitigation actions for the respective services to take to mitigate one or more consequences associated with occurrence of the predicted infrastructure failure event.
Dependent claims: 6, 7, 8, 9, 10, 11, 12, 13
14. A non-transitory, computer-readable storage medium, storing program instructions that when executed by one or more computing devices cause the one or more computing devices to implement:
receiving, at a service hosted in one or more data centers using infrastructure resources of the one or more data centers, a prediction of an infrastructure failure event to occur at a future time, via a programmatic interface, wherein the service provides a network-based service to clients remote to the one or more data centers;
evaluating, at the service, the prediction of the infrastructure failure event to determine whether to perform a mitigation action;
based, at least in part, on the evaluation, selecting by the service, the mitigation action for the predicted infrastructure failure event; and
performing, at the service, the selected mitigation action to mitigate one or more consequences associated with occurrence of the predicted infrastructure failure event.
Dependent claims: 15, 16, 17, 18, 19, 20
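The service-side steps of claim 14 (receive a prediction, evaluate it, select a mitigation action, perform it) can be sketched as follows. This is a hypothetical illustration only: the `Prediction` fields, the probability threshold, the ten-minute cutoff, and the action names `redirect_traffic` and `migrate_workloads` are assumptions, not taken from the claims. Performing the action is simulated by recording it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prediction:
    event_type: str            # illustrative label, e.g. "power_loss"
    probability: float         # assumed confidence score in [0, 1]
    minutes_until_event: int   # assumed lead time before the predicted event


class HostedService:
    """Hypothetical service that reacts to predicted infrastructure failures."""

    def __init__(self, name: str, min_probability: float = 0.6) -> None:
        self.name = name
        self.min_probability = min_probability
        self.performed: List[str] = []  # record of actions taken

    def should_mitigate(self, prediction: Prediction) -> bool:
        # Evaluation step: each service applies its own tolerance for risk.
        return prediction.probability >= self.min_probability

    def select_action(self, prediction: Prediction) -> str:
        # Selection step: with little lead time, prefer the faster action
        # (an assumed policy, chosen only to show a service-specific choice).
        if prediction.minutes_until_event < 10:
            return "redirect_traffic"
        return "migrate_workloads"

    def on_prediction(self, prediction: Prediction) -> Optional[str]:
        # Entry point that the programmatic interface would invoke.
        if not self.should_mitigate(prediction):
            return None
        action = self.select_action(prediction)
        self.performed.append(action)  # stand-in for performing the action
        return action
```

Because the threshold and the selection policy live in the service rather than in the monitor, two services receiving the same prediction can respond differently, which matches the claim language that the service itself evaluates the prediction and selects the action.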
Specification