Thompson strategy based online reinforcement learning system for action selection
Abstract
A system and method for online reinforcement learning are provided, in particular a method for managing the explore-vs.-exploit tradeoff. Although the method is heuristic, it can be applied in a principled manner while simultaneously learning the parameters and/or structure of the model (e.g., Bayesian network model).
The system includes a model which receives an input (e.g., from a user) and provides a probability distribution associated with uncertainty regarding parameters of the model to a decision engine. The decision engine can determine whether to exploit the information known to it or to explore to obtain additional information based, at least in part, upon the explore-vs.-exploit tradeoff (e.g., Thompson strategy). A reinforcement learning component can obtain additional information (e.g., feedback from a user) and update parameter(s) and/or the structure of the model. The system can be employed in scenarios in which an influence diagram is used to make repeated decisions and maximization of long-term expected utility is desired.
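The Thompson strategy named in the abstract can be illustrated with a minimal sketch: draw one plausible parameter vector from its posterior, then act greedily with respect to that draw. The Bernoulli-reward bandit setting and all names below (`ThompsonBandit`, the Beta priors) are simplifying assumptions for illustration, not the patent's influence-diagram formulation.

```python
import random

class ThompsonBandit:
    """Thompson sampling for Bernoulli-reward actions with Beta priors.

    A deliberately simplified stand-in for the patent's influence-diagram
    setting: each action keeps a Beta(successes + 1, failures + 1)
    posterior over its unknown payoff probability.
    """

    def __init__(self, n_actions):
        self.wins = [0] * n_actions
        self.losses = [0] * n_actions

    def select_action(self):
        # Explore-vs.-exploit in one step: sample one payoff probability
        # per action from its posterior, then exploit the samples greedily.
        samples = [random.betavariate(w + 1, l + 1)
                   for w, l in zip(self.wins, self.losses)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, action, reward):
        # Conjugate posterior update from observed feedback.
        if reward:
            self.wins[action] += 1
        else:
            self.losses[action] += 1
```

Because uncertain actions occasionally produce high posterior draws, the strategy explores exactly as long as the posterior variance warrants it, then converges to exploitation as feedback accumulates.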
137 Citations
17 Claims
1. An online reinforcement learning system comprising components embodied on a computer readable storage medium, the components, when executed by one or more processors, updating a model based upon reinforcement learning, the components comprising:
- a model comprising an influence diagram with at least one chance node, the model receiving an input and providing a probability distribution associated with uncertainty regarding parameters of the model;
- a decision engine that selects an action based, at least in part, upon the probability distribution, the decision engine employing a Thompson strategy heuristic technique to maximize long-term expected utility when selecting the action, wherein the decision engine decreases a variance of a distribution of the parameters as a last decision instance is approached; and
- a computer-implemented reinforcement learning component that modifies at least one of the parameters of the model based upon feedback associated with the selected action, the parameters defining distributions over discrete variables and continuous variables, uncertainty of the parameters expressed using Dirichlet priors for conditional distributions of discrete variables of the model and Normal-Wishart priors for distributions of continuous variables of the model, wherein the modified model is stored.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
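Claim 1's reinforcement learning component updates Dirichlet priors over the discrete variables of the influence diagram. The conjugate update is a simple count increment, sketched below under stated assumptions: the class name `DiscreteChanceNode`, the per-parent-configuration count table, and the Gamma-based Dirichlet sampler are all illustrative, not taken from the patent.

```python
import random

def dirichlet_sample(alphas):
    # Sample a categorical distribution from a Dirichlet by normalizing
    # independent Gamma(alpha_i, 1) draws.
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

class DiscreteChanceNode:
    """Dirichlet-parameterized conditional distribution for one chance node.

    Hypothetical sketch: one pseudo-count vector per parent configuration,
    updated by count increments as feedback arrives.
    """

    def __init__(self, n_states, prior=1.0):
        self.n_states = n_states
        self.prior = prior
        self.counts = {}  # parent configuration -> pseudo-count vector

    def _alphas(self, parents):
        return self.counts.setdefault(parents, [self.prior] * self.n_states)

    def sample_distribution(self, parents):
        # A Thompson-style draw of this node's conditional distribution.
        return dirichlet_sample(self._alphas(parents))

    def observe(self, parents, state):
        # Conjugate Dirichlet update: observed feedback adds one pseudo-count.
        self._alphas(parents)[state] += 1.0
```

The continuous variables of the claim would carry the analogous Normal-Wishart conjugate update in place of the count increment.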
15. An online reinforcement learning method comprising:
- determining a probability distribution associated with uncertainty regarding parameters of a model, the model comprising an influence diagram with at least one chance node;
- employing a computer-implemented Thompson strategy heuristic technique to select an action based, at least in part, upon the probability distribution, wherein a variance of a distribution of the parameters is artificially increased to be large enough that the model continues to adapt;
- updating at least one parameter of the model based, at least in part, upon feedback associated with the selected action, the parameters defining distributions over discrete variables and continuous variables, uncertainty of the parameters expressed using Dirichlet priors for conditional distributions of discrete variables of the model and Normal-Wishart priors for distributions of continuous variables of the model; and
- storing the updated model on a computer readable storage medium.
- View Dependent Claims (16, 17)
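Claims 1 and 15 steer posterior variance in opposite directions: decreasing it as the last decision instance approaches, or artificially increasing it so the model keeps adapting. For a Dirichlet, variance falls as the total pseudo-count grows while the mean depends only on the proportions, so one hedged way to realize both behaviors is to rescale the counts to a target effective sample size. The function name and the rescaling scheme below are illustrative assumptions, not the patent's mechanism.

```python
def rescale_pseudo_counts(alphas, effective_sample_size):
    """Rescale Dirichlet pseudo-counts to a target total (illustrative).

    The Dirichlet mean alpha_i / sum(alphas) is preserved, while posterior
    variance shrinks as the total grows and inflates as it shrinks: a small
    target keeps the model adapting (claim 15), and a large target near the
    last decision instance suppresses further exploration (claim 1).
    """
    total = sum(alphas)
    scale = effective_sample_size / total
    return [a * scale for a in alphas]
```

For example, rescaling counts of [2.0, 6.0] to a total of 4.0 yields [1.0, 3.0]: the same mean [0.25, 0.75], but a wider posterior, so Thompson draws vary more and exploration continues.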
Specification