Learning controller with advantage updating algorithm
First Claim
1. A learning controller comprising:
means for storing a value function V and an advantage function A in a function approximation memory system;
means for updating said value function V and said advantage function A according to reinforcements received from an environment;
said means for updating including learning means for performing an action u_t in a state x_t, leading to a state x_{t+Δt} and a reinforcement R_{Δt}(x_t, u_t);
said means for updating also including means for updating said advantage function A and changing a maximum value, A_max, thereof;
said means for updating also including means for updating said value function V in response to said A_max value change;
means for normalizing the update of said advantage function A by choosing an action u randomly, with uniform probability; and
means for performing said action u and said normalizing update of said advantage function A in a state x;
said learning means and said normalizing update functioning according to an algorithm of:
##EQU32## where said ##EQU33## symbology represents a function-approximating supervised learning system, generating an output of X, being trained to generate a desired output of Y at a learning rate a.
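For concreteness, the ##EQU33## training operation named above can be read as one step of supervised learning: the stored output X is moved toward the desired output Y at rate a. The sketch below is a minimal illustration assuming the simplest function-approximating memory, a lookup table; the function name and table layout are illustrative, not taken from the patent.

```python
def train(table, key, desired, a):
    """Sketch of the claimed training operation on a lookup-table
    approximator: move the stored output table[key] toward the
    desired output at learning rate a."""
    table[key] = (1.0 - a) * table[key] + a * desired
    return table[key]
```

For a neural-network memory, the same operation would instead be one gradient step on the squared error between the desired output Y and the generated output X.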
Abstract
A new algorithm for reinforcement learning, advantage updating, is proposed. Advantage updating is a direct learning technique; it does not require a model to be given or learned. It is incremental, requiring only a constant amount of calculation per time step, independent of the number of possible actions, possible outcomes from a given action, or number of states. Analysis and simulation indicate that advantage updating is applicable to reinforcement learning systems working in continuous time (or discrete time with small time steps) for which Q-learning is not applicable. Simulation results are presented indicating that for a simple linear quadratic regulator (LQR) problem with no noise and large time steps, advantage updating learns slightly faster than Q-learning. When there is noise or small time steps, advantage updating learns more quickly than Q-learning by a factor of more than 100,000. Convergence properties and implementation issues are discussed. New convergence results are presented for R-learning and algorithms based upon change in value. It is proved that the learning rule for advantage updating converges to the optimal policy with probability one.
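The patent's actual update equations sit behind the ##EQU## placeholders in the claims. As a minimal tabular sketch of the technique the abstract describes, the step below follows the update forms of advantage updating as published by Baird: a learning update that trains A(x,u) and thereby moves Amax(x), a value update that adjusts V(x) in response to the change in Amax(x), and a normalizing update on a uniformly random action that drives the maximum of A(x,·) toward zero. The function name, the discrete-time tabular setting, and the unit scale factor on the value update are assumptions of this sketch, not claim text.

```python
import numpy as np

def advantage_updating_step(V, A, x, u, r, x_next, dt, gamma,
                            alpha=0.1, beta=0.1, omega=0.1, rng=None):
    """One learning update plus one normalizing update on tabular
    V (n_states,) and A (n_states, n_actions)."""
    rng = rng or np.random.default_rng()

    # Learning update: having performed u in state x, received
    # reinforcement r, and reached x_next, train A(x, u) toward a
    # Bellman-style target; this also changes Amax(x).
    a_max_old = A[x].max()
    target = a_max_old + (r + gamma**dt * V[x_next] - V[x]) / dt
    A[x, u] += alpha * (target - A[x, u])

    # Value update: adjust V(x) in response to the change in Amax(x).
    # (A proportionality constant of 1 is assumed here.)
    V[x] += beta * (A[x].max() - a_max_old)

    # Normalizing update: choose an action uniformly at random and train
    # A(x, u_rand) toward A(x, u_rand) - Amax(x), which drives the
    # maximum of A(x, .) toward zero (applied at the current state
    # for simplicity).
    u_rand = rng.integers(A.shape[1])
    A[x, u_rand] -= omega * A[x].max()
```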
Claims
1. Set forth in full above under First Claim. (Dependent claims 2-6 not shown.)
7. A method of learning, for a controller having means for storing a value function V and an advantage function A in a function-approximation memory, said method comprising the steps of:
updating said value function V and said advantage function A in response to reinforcement information received from an environment input;
said environment input including a learning state wherein performing an action u_t in a state x_t leads to a state x_{t+Δt} and a reinforcement R_{Δt}(x_t, u_t);
changing a stored maximum value A_max of said advantage function A by updating said advantage function A;
updating said value function V in response to said A_max value change;
normalizing said advantage function A by choosing an action u randomly, with uniform probability;
performing said action u and said normalizing update of said advantage function A in a state x; and
performing said learning updates and said normalizing update in accordance with an algorithm of:
##EQU35## where said ##EQU36## symbology represents a function-approximating supervised learning system, generating an output of X, being trained to generate a desired output of Y at a learning rate a.
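As a hypothetical usage of the sketch given after the abstract, the loop below exercises the claimed method steps in order (learning update, value update in response to the A_max change, normalizing update on a uniformly random action) on a toy two-state problem; the environment and all parameter values are illustrative only.

```python
# Toy two-state chain: action 1 moves to the other state for reinforcement
# 1.0; action 0 stays put for 0.0. Uses advantage_updating_step from above.
n_states, n_actions, dt, gamma = 2, 2, 1.0, 0.9
V = np.zeros(n_states)
A = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

x = 0
for _ in range(10_000):
    u = int(rng.integers(n_actions))          # exploratory action choice
    x_next = (x + 1) % n_states if u == 1 else x
    r = 1.0 if u == 1 else 0.0
    advantage_updating_step(V, A, x, u, r, x_next, dt, gamma, rng=rng)
    x = x_next

print(V, A.argmax(axis=1))   # learned values and the greedy policy
```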
8. A learning controller comprising:
means for storing a value function V and an advantage function A in a function approximation memory;
means for updating said value function V and said advantage function A in said function approximation memory according to reinforcement information received from an environment input;
said means for updating including learning means for performing an action u_t in a state x_t, leading to a state x_{t+Δt} and a reinforcement R_{Δt}(x_t, u_t);
said means for updating also including means for updating said advantage function A and changing a maximum value, A_max, thereof;
said means for updating also including means for updating said value function V in response to said A_max value change;
means for choosing a learning means action u randomly, with uniform probability, and for normalizing said update of said advantage function A;
means for performing said action u and said normalizing update of said advantage function A in a state x;
one of said learning means update and said normalizing update being in accordance with a predetermined learning algorithm and a predetermined normalizing update algorithm, respectively. (Dependent claims 9 and 10 not shown.)
Specification