AUTOMATED LEARNING OF FAILURE RECOVERY POLICIES

US 20110214006A1
Filed: 02/26/2010
Published: 09/01/2011
Est. Priority Date: 02/26/2010
Status: Active Grant

Abstract

Described is automated learning of failure recovery policies based upon existing information regarding previous policies and actions. A learning mechanism automatically constructs a new policy for controlling a recovery process, based upon collected observable interactions of an existing policy with the process. In one aspect, the learning mechanism builds a partially observable Markov decision process (POMDP) model, and computes the new policy based upon the learned model. The new policy may perform automatic fault recovery, e.g., on a machine in a datacenter corresponding to the controlled process.
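As an informal illustration of why a POMDP suits this setting, the sketch below shows the standard Bayesian belief update used with such models: the true fault is hidden, so the controller keeps a probability distribution over states and revises it after each repair action and observation. The state names (`healthy`, `soft_fault`, `hard_fault`), the `REBOOT` action, and all probabilities are hypothetical values chosen for the example and are not taken from the patent. The sketch does not cover how the model is learned or how a policy is computed from it.

```python
from typing import Dict

Belief = Dict[str, float]


def update_belief(belief, action, observation, transition, observation_model):
    """Return the posterior belief after taking `action` and seeing `observation`.

    transition[action][s][s2]        = P(s2 | s, action)
    observation_model[action][s2][o] = P(o | s2, action)
    """
    unnormalized = {}
    for s2 in belief:
        # Predict the chance of landing in s2, then weight by the observation.
        predicted = sum(belief[s] * transition[action][s].get(s2, 0.0) for s in belief)
        unnormalized[s2] = observation_model[action][s2].get(observation, 0.0) * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical model: a reboot usually clears a soft fault but never a hard one.
transition = {"REBOOT": {
    "healthy": {"healthy": 1.0},
    "soft_fault": {"healthy": 0.8, "soft_fault": 0.2},
    "hard_fault": {"hard_fault": 1.0},
}}
observation_model = {"REBOOT": {
    "healthy": {"ok": 1.0},
    "soft_fault": {"error": 1.0},
    "hard_fault": {"error": 1.0},
}}
prior = {"healthy": 0.0, "soft_fault": 0.7, "hard_fault": 0.3}
posterior = update_belief(prior, "REBOOT", "error", transition, observation_model)
```

In this made-up example, a reboot that still produces an error shifts most of the belief from the soft fault to the hard fault, which is the kind of evidence that would justify moving to a costlier repair action.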

30 Claims

1. A system comprising:
- a model learning component configured to access collected observable interactions of an existing repair policy with a process to build a model of the process, the model mapping states of the process to repair actions of the existing repair policy;
  
  a policy computation component configured to compute a new policy based upon the model, the new policy identifying a number of times to retry a first one of the repair actions when the process is in a first one of the states of the process; and
  
  a controller configured to apply the new policy to the process and, in an instance when the first state is identified, retry the first repair action the number of times identified by the new policy before escalating the first repair action to a second one of the repair actions; and
  
  one or more processing units configured to execute at least one of the model learning component, the policy computation component, or the policy application component.

2. The system of claim 1 further comprising a collection component configured to collect the observable interactions into a data structure for access by the model learning component.

5. The system of claim 1, wherein the model comprises a partially observable Markov decision process.

6. The system of claim 5, wherein the model learning component uses an expectation maximization algorithm to learn the partially observable Markov decision process.

7. The system of claim 5, wherein the policy computation component computes the new policy using a point-based value iteration algorithm.

8. The system of claim 7, wherein the point-based value iteration algorithm is executed by a cost-based indefinite-horizon formalization.

9. The system of claim 1, wherein the process corresponds to a computing machine in a datacenter.
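
The retry-then-escalate behavior recited for the controller in claim 1 can be pictured with a minimal Python sketch. Everything here is hypothetical and not taken from the patent: the `RetryRule` record, the `apply_policy` function, the `try_repair` callback, and the action names `REBOOT` and `REIMAGE`. The sketch also assumes that "the number of times" means the total number of attempts of the first action, which is one possible reading.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class RetryRule:
    """Hypothetical policy entry: try `action` up to `retries` times, then escalate."""
    action: str
    retries: int
    escalate_to: str


def apply_policy(
    policy: Dict[str, RetryRule],
    state: str,
    try_repair: Callable[[str], bool],
) -> Tuple[bool, List[str]]:
    """Apply the rule for `state`; return (recovered, actions attempted in order)."""
    rule = policy[state]
    attempted: List[str] = []
    for _ in range(rule.retries):
        attempted.append(rule.action)
        if try_repair(rule.action):
            return True, attempted
    # The first action failed as many times as the policy allows: escalate.
    attempted.append(rule.escalate_to)
    return try_repair(rule.escalate_to), attempted


# Usage with a stand-in repair function that only succeeds on REIMAGE.
policy = {"disk_error": RetryRule("REBOOT", retries=2, escalate_to="REIMAGE")}
recovered, attempted = apply_policy(policy, "disk_error", lambda a: a == "REIMAGE")
```

Keeping the retry count as data in the policy table, rather than hard-coding it in the controller, mirrors the split in the claim between the component that computes the policy and the component that applies it.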

3. (canceled)

4. The system of claim 3 wherein the new policy performs automatic fault recovery.

10-20. (canceled)

21. A method comprising:
- accessing collected observable interactions of an existing repair policy with a process to build a model of the process, the model mapping states of the process to repair actions of the existing repair policy;
  
  computing a new policy based upon the model, the new policy identifying a number of times to retry a first one of the repair actions when the process is in a first one of the states of the process; and
  
  applying the new policy to the process and, in an instance when the first state is identified, retrying the first repair action the number of times identified by the new policy before escalating the first repair action to a second one of the repair actions.

22. The method according to claim 21, the first state reflecting an error message generated by the process.

23. The method according to claim 21, the observable interactions being accessed from a recovery log associated with the process.

24. The method according to claim 21, the model being computed based on a first action cost associated with the first repair action and a second action cost associated with the second repair action.

25. The method according to claim 24, the second action cost being higher than the first action cost.
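
To see how two action costs could influence a retry count, consider the simplified Python sketch below. It is not the POMDP-based computation the claims refer to: it swaps in a plain expected-cost comparison under loudly stated assumptions, namely that the chance of the first action succeeding on each attempt is known (for example, estimated from a recovery log), that these chances may fall with each failed attempt, and that the second action always succeeds. The function names and all numbers are hypothetical.

```python
from typing import Sequence, Tuple


def expected_cost(success_probs: Sequence[float], retries: int,
                  first_cost: float, second_cost: float) -> float:
    """Expected cost of trying the first action `retries` times, then escalating.

    success_probs[k] is the assumed chance that attempt k+1 of the first action
    succeeds given that all earlier attempts failed. The second action is
    assumed to always succeed.
    """
    cost = 0.0
    still_failing = 1.0  # probability that every attempt so far has failed
    for k in range(retries):
        cost += still_failing * first_cost
        still_failing *= 1.0 - success_probs[k]
    return cost + still_failing * second_cost


def best_retry_count(success_probs: Sequence[float],
                     first_cost: float, second_cost: float) -> Tuple[int, float]:
    """Return (retry count, expected cost) with the lowest expected cost."""
    candidates = [(expected_cost(success_probs, n, first_cost, second_cost), n)
                  for n in range(len(success_probs) + 1)]
    cost, n = min(candidates)
    return n, cost


# Hypothetical numbers: a cheap action (cost 1) and an expensive one (cost 10).
n, cost = best_retry_count([0.6, 0.3, 0.05], first_cost=1.0, second_cost=10.0)
```

With these made-up numbers the comparison selects two attempts of the cheap action before escalating: a third attempt succeeds too rarely to be worth its cost, while escalating sooner pays the higher cost too often.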

26. One or more computer-readable storage devices comprising instructions which, when executed by one or more processing units, cause the one or more processing units to perform:
- accessing collected observable interactions of an existing repair policy with a process to build a model of the process, the model mapping states of the process to repair actions of the existing repair policy;
  
  computing a new policy based upon the model, the new policy identifying a number of times to retry a first one of the repair actions when the process is in a first one of the states of the process; and
  
  applying the new policy to the process and, in an instance when the first state is identified, retrying the first repair action the number of times identified by the new policy before escalating the first repair action to a second one of the repair actions.

27. The one or more computer-readable storage devices according to claim 26, the first state reflecting an error message generated by the process.

28. The one or more computer-readable storage devices according to claim 26, the observable interactions being accessed from a recovery log associated with the process.

30. The one or more computer-readable storage devices according to claim 26, the second action cost being higher than the first action cost.

29. The one or more computer-readable storage devices according to claim 26, the model being computed based on a first action cost associated with the first repair action and a second action cost associated with the second repair action.

Current Assignee: ServiceNow Incorporated
Original Assignee: Microsoft Corporation
Inventors: Shani, Guy; Meek, Christopher A.
Granted Patent: US 8,024,611 B1
US Class Current: 714/2
CPC Class Codes:
G06F 11/0709 in a distributed system con...
G06F 11/079 Root cause analysis, i.e. e...
