Learning controller with advantage updating algorithm
First Claim
1. A learning controller comprising:
means for storing a value function V and an advantage function A in a function approximation memory system;
means for updating said value function V and said advantage function A according to reinforcements received from an environment;
said means for updating including learning means for performing an action u_t in a state x_t, leading to a state x_{t+Δt} and a reinforcement R_{Δt}(x_t, u_t);
said means for updating also including means for updating said advantage function A and changing a maximum value, A_max, thereof;
said means for updating also including means for updating said value function V in response to said A_max value change;
means for normalizing the update of said advantage function A by choosing an action u randomly, with uniform probability; and
means for performing said action u and said normalizing update of said advantage function A in a state x;
said learning means and said normalizing update functioning according to an algorithm of:
##EQU32## where said ##EQU33## symbology represents a function-approximating supervised learning system, generating an output of X, being trained to generate a desired output of Y at a learning rate a.
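For concreteness, the ##EQU33## training operation named above can be read as one step of supervised learning: the stored output X is moved toward the desired output Y at rate a. The sketch below is a minimal illustration assuming the simplest function-approximating memory, a lookup table; the function name and table layout are illustrative, not taken from the patent.

```python
def train(table, key, desired, a):
    """Sketch of the claimed training operation on a lookup-table
    approximator: move the stored output table[key] toward the
    desired output at learning rate a."""
    table[key] = (1.0 - a) * table[key] + a * desired
    return table[key]
```

For a neural-network memory, the same operation would instead be one gradient step on the squared error between the desired output Y and the generated output X.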
Abstract
A new algorithm for reinforcement learning, advantage updating, is proposed. Advantage updating is a direct learning technique; it does not require a model to be given or learned. It is incremental, requiring only a constant amount of calculation per time step, independent of the number of possible actions, possible outcomes from a given action, or number of states. Analysis and simulation indicate that advantage updating is applicable to reinforcement learning systems working in continuous time (or discrete time with small time steps) for which Q-learning is not applicable. Simulation results are presented indicating that for a simple linear quadratic regulator (LQR) problem with no noise and large time steps, advantage updating learns slightly faster than Q-learning. When there is noise or small time steps, advantage updating learns more quickly than Q-learning by a factor of more than 100,000. Convergence properties and implementation issues are discussed. New convergence results are presented for R-learning and algorithms based upon change in value. It is proved that the learning rule for advantage updating converges to the optimal policy with probability one.
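The patent's actual update equations sit behind the ##EQU## placeholders in the claims. As a minimal tabular sketch of the technique the abstract describes, the step below follows the update forms of advantage updating as published by Baird: a learning update that trains A(x,u) and thereby moves Amax(x), a value update that adjusts V(x) in response to the change in Amax(x), and a normalizing update on a uniformly random action that drives the maximum of A(x,·) toward zero. The function name, the discrete-time tabular setting, and the unit scale factor on the value update are assumptions of this sketch, not claim text.

```python
import numpy as np

def advantage_updating_step(V, A, x, u, r, x_next, dt, gamma,
                            alpha=0.1, beta=0.1, omega=0.1, rng=None):
    """One learning update plus one normalizing update on tabular
    V (n_states,) and A (n_states, n_actions)."""
    rng = rng or np.random.default_rng()

    # Learning update: having performed u in state x, received
    # reinforcement r, and reached x_next, train A(x, u) toward a
    # Bellman-style target; this also changes Amax(x).
    a_max_old = A[x].max()
    target = a_max_old + (r + gamma**dt * V[x_next] - V[x]) / dt
    A[x, u] += alpha * (target - A[x, u])

    # Value update: adjust V(x) in response to the change in Amax(x).
    # (A proportionality constant of 1 is assumed here.)
    V[x] += beta * (A[x].max() - a_max_old)

    # Normalizing update: choose an action uniformly at random and train
    # A(x, u_rand) toward A(x, u_rand) - Amax(x), which drives the
    # maximum of A(x, .) toward zero (applied at the current state
    # for simplicity).
    u_rand = rng.integers(A.shape[1])
    A[x, u_rand] -= omega * A[x].max()
```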
Claims
1. Set forth in full above under First Claim. (Dependent claims 2-6 not shown.)
7. A method of learning, for a controller having means for storing a value function V and an advantage function A in a function-approximation memory, said method comprising the steps of:
updating said value function V and said advantage function A in response to reinforcement information received from an environment input;
said environment input including a learning state wherein performing an action u_t in a state x_t leads to a state x_{t+Δt} and a reinforcement R_{Δt}(x_t, u_t);
changing a stored maximum value A_max of said advantage function A by updating said advantage function A;
updating said value function V in response to said A_max value change;
normalizing said advantage function A by choosing an action u randomly, with uniform probability;
performing said action u and said normalizing update of said advantage function A in a state x; and
performing said learning updates and said normalizing update in accordance with an algorithm of:
##EQU35## where said ##EQU36## symbology represents a function-approximating supervised learning system, generating an output of X, being trained to generate a desired output of Y at a learning rate a.
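As a hypothetical usage of the sketch given after the abstract, the loop below exercises the claimed method steps in order (learning update, value update in response to the A_max change, normalizing update on a uniformly random action) on a toy two-state problem; the environment and all parameter values are illustrative only.

```python
# Toy two-state chain: action 1 moves to the other state for reinforcement
# 1.0; action 0 stays put for 0.0. Uses advantage_updating_step from above.
n_states, n_actions, dt, gamma = 2, 2, 1.0, 0.9
V = np.zeros(n_states)
A = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

x = 0
for _ in range(10_000):
    u = int(rng.integers(n_actions))          # exploratory action choice
    x_next = (x + 1) % n_states if u == 1 else x
    r = 1.0 if u == 1 else 0.0
    advantage_updating_step(V, A, x, u, r, x_next, dt, gamma, rng=rng)
    x = x_next

print(V, A.argmax(axis=1))   # learned values and the greedy policy
```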
8. A learning controller comprising:
means for storing a value function V and an advantage function A in a function approximation memory;
means for updating said value function V and said advantage function A in said function approximation memory according to reinforcement information received from an environment input;
said means for updating including learning means for performing an action u_t in a state x_t, leading to a state x_{t+Δt} and a reinforcement R_{Δt}(x_t, u_t);
said means for updating also including means for updating said advantage function A and changing a maximum value, A_max, thereof;
said means for updating also including means for updating said value function V in response to said A_max value change;
means for choosing a learning means action u randomly, with uniform probability, and for normalizing said update of said advantage function A;
means for performing said action u and said normalizing update of said advantage function A in a state x;
one of said learning means update and said normalizing update being in accordance with a predetermined learning algorithm and a predetermined normalizing update algorithm, respectively. (Dependent claims 9 and 10 not shown.)
Specification