Action selection for reinforcement learning using influence diagrams
Abstract
A system and method for online reinforcement learning is provided. In particular, a method for performing the explore-vs.-exploit tradeoff is provided. Although the method is heuristic, it can be applied in a principled manner while simultaneously learning the parameters and/or structure of the model (e.g., Bayesian network model). The system includes a model which receives an input (e.g., from a user) and provides a probability distribution associated with uncertainty regarding parameters of the model to a decision engine. The decision engine can determine whether to exploit the information known to it or to explore to obtain additional information based, at least in part, upon the explore-vs.-exploit tradeoff (e.g., Thompson strategy). A reinforcement learning component can obtain additional information (e.g., feedback from a user) and update parameter(s) and/or the structure of the model. The system can be employed in scenarios in which an influence diagram is used to make repeated decisions and maximization of long-term expected utility is desired.
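The explore-vs.-exploit heuristic described in the abstract can be illustrated with a minimal sketch. This is not the patented implementation; it assumes a simple two-action model whose uncertain parameters are tracked with Beta posteriors, and all names (`posterior`, `select_action`, `update`) are illustrative:

```python
import random

# Beta(alpha, beta) pseudo-counts per action: the "probability distribution
# associated with uncertainty regarding parameters of the model".
posterior = {a: [1, 1] for a in ("action_a", "action_b")}

def select_action():
    """Thompson strategy: sample one parameter value from each action's
    posterior, then act greedily on the sampled values. Uncertain actions
    sometimes draw high samples (explore); well-known good actions usually
    win (exploit)."""
    samples = {a: random.betavariate(al, be) for a, (al, be) in posterior.items()}
    return max(samples, key=samples.get)

def update(action, reward):
    """Reinforcement-learning step: fold observed feedback back into the
    model's parameter distribution."""
    if reward:
        posterior[action][0] += 1  # success -> increment alpha
    else:
        posterior[action][1] += 1  # failure -> increment beta
```

Repeated rounds of `select_action()` followed by `update()` concentrate each posterior, so exploration tapers off automatically as uncertainty shrinks.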
20 Claims
1. An online reinforcement learning system comprising:
a model comprising an influence diagram with at least one chance node, the model receives an input and provides a probability distribution associated with uncertainty regarding parameters of the model;
a decision engine that selects an action based, at least in part, upon the probability distribution, the decision engine employs the Thompson strategy heuristic technique to maximize long-term expected utility; and
a reinforcement learning component that modifies at least one of the parameters of the model based upon feedback associated with the selected action.
Dependent claims: 2-16.
17. A method facilitating online reinforcement learning comprising:
determining a probability distribution associated with uncertainty regarding parameters of a model, the model comprising an influence diagram with at least one chance node;
employing the Thompson strategy heuristic technique to select an action based, at least in part, upon the probability distribution; and
updating at least one parameter of the model based, at least in part, upon feedback associated with the selected action.
Dependent claims: 18-19.
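The three steps of the method claim map onto a small influence diagram with a single chance node. The sketch below is an assumption-laden illustration, not the claimed implementation: the chance node ("outcome") has an unknown conditional distribution given the action, tracked with Beta pseudo-counts, and all names (`UTILITY`, `counts`, `thompson_step`, `learn`) are hypothetical:

```python
import random

# Utility node of the influence diagram.
UTILITY = {"good": 1.0, "bad": -1.0}

# Beta pseudo-counts over P(outcome | action): the uncertain model parameters.
counts = {a: {"good": 1, "bad": 1} for a in ("act1", "act2")}

def thompson_step():
    # Step 1: draw one parameterization of P(outcome | action) from the posterior.
    sampled = {a: random.betavariate(c["good"], c["bad"]) for a, c in counts.items()}

    # Step 2: select the action maximizing expected utility under the sample.
    def expected_utility(a):
        p_good = sampled[a]
        return p_good * UTILITY["good"] + (1 - p_good) * UTILITY["bad"]

    return max(counts, key=expected_utility)

def learn(action, outcome):
    # Step 3: update the model parameters from the observed feedback.
    counts[action][outcome] += 1
```

Each call to `thompson_step()` maximizes expected utility with respect to a fresh posterior sample rather than the posterior mean, which is what makes the strategy explore as well as exploit.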
20. A data packet transmitted between two or more computer components that facilitates online reinforcement learning, the data packet comprising:
an updated parameter of a model, the model comprising an influence diagram with at least one chance node, the parameter updated based, at least in part, upon feedback associated with a selected action, the selection based upon the Thompson strategy heuristic technique and a probability distribution associated with uncertainty regarding parameters of the model.
Specification