Action selection for reinforcement learning using influence diagrams
Abstract
A system and method for online reinforcement learning is provided. In particular, a method for performing the explore-vs.-exploit tradeoff is provided. Although the method is heuristic, it can be applied in a principled manner while simultaneously learning the parameters and/or structure of the model (e.g., Bayesian network model). The system includes a model which receives an input (e.g., from a user) and provides a probability distribution associated with uncertainty regarding parameters of the model to a decision engine. The decision engine can determine whether to exploit the information known to it or to explore to obtain additional information based, at least in part, upon the explore-vs.-exploit tradeoff (e.g., Thompson strategy). A reinforcement learning component can obtain additional information (e.g., feedback from a user) and update parameter(s) and/or the structure of the model. The system can be employed in scenarios in which an influence diagram is used to make repeated decisions and maximization of long-term expected utility is desired.
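The explore-vs.-exploit heuristic described in the abstract can be illustrated with a minimal sketch. This is not the patented implementation; it assumes a simple two-action model whose uncertain parameters are tracked with Beta posteriors, and all names (`posterior`, `select_action`, `update`) are illustrative:

```python
import random

# Beta(alpha, beta) pseudo-counts per action: the "probability distribution
# associated with uncertainty regarding parameters of the model".
posterior = {a: [1, 1] for a in ("action_a", "action_b")}

def select_action():
    """Thompson strategy: sample one parameter value from each action's
    posterior, then act greedily on the sampled values. Uncertain actions
    sometimes draw high samples (explore); well-known good actions usually
    win (exploit)."""
    samples = {a: random.betavariate(al, be) for a, (al, be) in posterior.items()}
    return max(samples, key=samples.get)

def update(action, reward):
    """Reinforcement-learning step: fold observed feedback back into the
    model's parameter distribution."""
    if reward:
        posterior[action][0] += 1  # success -> increment alpha
    else:
        posterior[action][1] += 1  # failure -> increment beta
```

Repeated rounds of `select_action()` followed by `update()` concentrate each posterior, so exploration tapers off automatically as uncertainty shrinks.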
20 Claims
1. An online reinforcement learning system comprising:
a model comprising an influence diagram with at least one chance node, the model receives an input and provides a probability distribution associated with uncertainty regarding parameters of the model;
a decision engine that selects an action based, at least in part, upon the probability distribution, the decision engine employs the Thompson strategy heuristic technique to maximize long-term expected utility; and
a reinforcement learning component that modifies at least one of the parameters of the model based upon feedback associated with the selected action.
Dependent claims: 2-16.
17. A method facilitating online reinforcement learning comprising:
determining a probability distribution associated with uncertainty regarding parameters of a model, the model comprising an influence diagram with at least one chance node;
employing the Thompson strategy heuristic technique to select an action based, at least in part, upon the probability distribution; and
updating at least one parameter of the model based, at least in part, upon feedback associated with the selected action.
Dependent claims: 18-19.
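The three steps of the method claim map onto a small influence diagram with a single chance node. The sketch below is an assumption-laden illustration, not the claimed implementation: the chance node ("outcome") has an unknown conditional distribution given the action, tracked with Beta pseudo-counts, and all names (`UTILITY`, `counts`, `thompson_step`, `learn`) are hypothetical:

```python
import random

# Utility node of the influence diagram.
UTILITY = {"good": 1.0, "bad": -1.0}

# Beta pseudo-counts over P(outcome | action): the uncertain model parameters.
counts = {a: {"good": 1, "bad": 1} for a in ("act1", "act2")}

def thompson_step():
    # Step 1: draw one parameterization of P(outcome | action) from the posterior.
    sampled = {a: random.betavariate(c["good"], c["bad"]) for a, c in counts.items()}

    # Step 2: select the action maximizing expected utility under the sample.
    def expected_utility(a):
        p_good = sampled[a]
        return p_good * UTILITY["good"] + (1 - p_good) * UTILITY["bad"]

    return max(counts, key=expected_utility)

def learn(action, outcome):
    # Step 3: update the model parameters from the observed feedback.
    counts[action][outcome] += 1
```

Each call to `thompson_step()` maximizes expected utility with respect to a fresh posterior sample rather than the posterior mean, which is what makes the strategy explore as well as exploit.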
20. A data packet transmitted between two or more computer components that facilitates online reinforcement learning, the data packet comprising:
an updated parameter of a model, the model comprising an influence diagram with at least one chance node, the parameter updated based, at least in part, upon feedback associated with a selected action, the selection based upon the Thompson strategy heuristic technique and a probability distribution associated with uncertainty regarding parameters of the model.
Specification