Smoothed Sarsa: Reinforcement Learning for Robot Delivery Tasks

US 20100094786A1
Filed: 10/13/2009
Published: 04/15/2010
Est. Priority Date: 10/14/2008
Status: Active Grant


Abstract

The present invention provides a method for learning a policy used by a computing system to perform a task, such as delivery of one or more objects by the computing system. During a first time interval, the computing system determines a first state, a first action and a first reward value. As the computing system determines different states, actions and reward values during subsequent time intervals, a state description identifying the current state, the current action, the current reward and a predicted action is stored. Responsive to a variance of a stored state description falling below a threshold value, the stored state description is used to modify one or more weights in the policy associated with the first state.
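
The deferred update described in the abstract can be illustrated with a small Python sketch. It is not taken from the patent: the record type `StateDescription`, the helper `split_ready`, and the caller-supplied `variance_of` function are hypothetical names, and the record fields follow the wording of claim 1. The claims also recite a delay threshold as an alternative trigger, so the sketch accepts both; the variance comparison uses "not exceeding" as in the claims.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StateDescription:
    """One stored transition; field names are illustrative assumptions."""
    state: Tuple[float, ...]
    action: int
    next_state: Tuple[float, ...]
    next_action: int
    reward: float
    stored_at: int  # time interval in which the description was stored


def split_ready(
    pending: List[StateDescription],
    now: int,
    variance_of: Callable[[StateDescription], float],
    var_threshold: float,
    delay_threshold: int,
) -> Tuple[List[StateDescription], List[StateDescription]]:
    """Return (ready, waiting): descriptions to back up now vs. keep stored."""
    ready: List[StateDescription] = []
    waiting: List[StateDescription] = []
    for d in pending:
        settled = variance_of(d) <= var_threshold       # variance trigger
        too_old = now - d.stored_at > delay_threshold   # delay trigger
        (ready if settled or too_old else waiting).append(d)
    return ready, waiting
```

A caller would compute a backup for each description in `ready`, update the policy weights, and discard those descriptions, keeping only `waiting` for later time intervals.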

38 Claims

1. A computer-implemented method for learning a policy for performing a task by a computing system, the method comprising the steps of:
- determining a first state associated with a first time interval;
  
  determining a subsequent state associated with a subsequent time interval;
  
  determining a first action from the first state using the policy, which comprises a plurality of weights, properties of one or more actions and properties of one or more states;
  
  determining a subsequent action from the subsequent state using the policy;
  
  determining a reward value associated with a combination of the first state and the first action; and
  
  storing a state description including the first state, the first action, the subsequent state, the subsequent action and the reward value.
  - 2. The computer-implemented method of claim 1, further comprising the steps of:
    - responsive to a variance associated with the subsequent state stored in the state description not exceeding a variance threshold value at a current state, calculating a backup target from the state description;
      
      modifying one or more weights included in the policy responsive to the backup target; and
      
      deleting the state description.
  - 3. The computer-implemented method of claim 1, further comprising the steps of:
    - responsive to a time delay between the first time interval and a current time interval associated with a current state exceeding a delay threshold value, calculating a backup target from the state description;
      
      modifying one or more weights included in the policy responsive to the backup target; and
      
      deleting the state description.
  - 4. The computer-implemented method of claim 1, further comprising the steps of:
    - responsive to a time delay between the first time interval and a current time interval associated with a current state exceeding a delay threshold value or a variance associated with the subsequent state stored in the state description not exceeding a variance threshold value at the current state, calculating a backup target from the state description;
      
      modifying one or more weights included in the policy responsive to the backup target; and
      
      deleting the state description.
  - 5. The computer-implemented method of claim 4, further comprising the step of:
    - responsive to determining the one or more weights included in the policy modified responsive to the backup target converge to a value, storing the policy in a computer-readable storage medium.
  - 6. The computer-implemented method of claim 4, further comprising the step of:
    - responsive to determining the one or more weights included in the policy modified responsive to the backup target do not converge to a value, determining a second state associated with a second time interval;
      
      determining a current action from the current state using the policy;
      
      determining a second action from the second state using the policy;
      
      determining a current reward value associated with a combination of the current state and the current action;
      
      storing a second state description including the current state, the current action, the second state, the second action and the current reward value;
      
      responsive to a second time delay between the current time interval and an additional time interval associated with an additional state exceeding the delay threshold value or a variance associated with the second state stored in the second state description not exceeding the variance threshold value at the additional state, calculating a second backup target from the second state description;
      
      modifying one or more weights included in the policy responsive to the second backup target; and
      
      deleting the second state description.
  - 7. The computer-implemented method of claim 4, wherein calculating the backup target from the state description comprises the steps of:
    - generating a predicted Q-function value associated with the subsequent state stored in the state description and the subsequent action stored in the state description;
      
      attenuating the predicted Q-function value by a discount factor;
      
      generating a sum of the reward value and the predicted Q-function value; and
      
      subtracting a Q-function value associated with the first state stored in the state description and the first action stored in the state description from the sum.
  - 8. The computer-implemented method of claim 7, wherein the predicted Q-function value is dependent upon a probability distribution over the current state.
  - 9. The computer-implemented method of claim 8, wherein the probability distribution over the current state is conditioned on additional stored state descriptions associated with time intervals occurring between the first time interval and the current time interval.
  - 10. The computer-implemented method of claim 4, wherein modifying one or more weights included in the policy responsive to the backup target comprises the steps of:
    - generating a gradient at a current weight of a Q-function value associated with the combination of the first state and the first action;
      
      determining a product of the gradient, the backup target and a learning rate;
      
      generating a modified weight by computing a sum of the product and the current weight; and
      
      storing the modified weight.
  - 11. The computer-implemented method of claim 10, wherein the learning rate comprises a value that is decreased as time intervals elapse.
  - 12. The computer-implemented method of claim 10, wherein the Q-function value comprises a weighted sum of feature functions identifying properties of a state and an action multiplied by one or more weights.
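
Claims 7 and 10 through 12 together describe a temporal-difference style update for a linear Q-function. The Python sketch below is one possible reading, not the patented implementation: the function names, the default initial rate `lr0`, and the particular `lr0 / (1 + t)` decay schedule are assumptions chosen for illustration. With a Q-function that is a weighted sum of features (claim 12), the gradient with respect to the weights is simply the feature vector of the first state and first action.

```python
from typing import List, Sequence


def q_value(weights: Sequence[float], features: Sequence[float]) -> float:
    """Claim 12: Q(s, a) as a weighted sum of feature values."""
    return sum(w * f for w, f in zip(weights, features))


def backup_target(
    weights: Sequence[float],
    phi_sa: Sequence[float],     # features of (first state, first action)
    reward: float,
    phi_next: Sequence[float],   # features of (subsequent state, subsequent action)
    gamma: float,                # discount factor
) -> float:
    """Claim 7: reward + discounted predicted Q, minus Q(first state, first action)."""
    predicted = gamma * q_value(weights, phi_next)
    return reward + predicted - q_value(weights, phi_sa)


def update_weights(
    weights: Sequence[float],
    phi_sa: Sequence[float],
    target: float,
    lr: float,
) -> List[float]:
    """Claim 10: add (gradient * backup target * learning rate) to each weight.

    For a linear Q-function the gradient equals the feature vector phi_sa.
    """
    return [w + lr * target * g for w, g in zip(weights, phi_sa)]


def learning_rate(t: int, lr0: float = 0.5) -> float:
    """Claim 11: a rate that decreases as time intervals elapse (assumed schedule)."""
    return lr0 / (1.0 + t)
```

For example, with weights `[1.0, 2.0]`, first-step features `[1.0, 0.0]`, next-step features `[0.0, 1.0]`, a reward of `1.0` and a discount factor of `0.5`, the backup target is `1.0 + 0.5 * 2.0 - 1.0`, and only the first weight moves because only its feature is non-zero.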

13. A computer program product comprising a computer readable storage medium storing computer executable code for learning a policy for performing a task by a computing system, the computer executable code performing the steps of:
- determining a first state associated with a first time interval;
  
  determining a subsequent state associated with a subsequent time interval;
  
  determining a first action from the first state using the policy, which comprises a plurality of weights, properties of one or more actions and properties of one or more states;
  
  determining a subsequent action from the subsequent state using the policy;
  
  determining a reward value associated with a combination of the first state and the first action; and
  
  storing a state description including the first state, the first action, the subsequent state, the subsequent action and the reward value.
  - 14. The computer program product of claim 13, further comprising the steps of:
    - responsive to a variance associated with the subsequent state stored in the state description not exceeding a variance threshold value at a current state, calculating a backup target from the state description;
      
      modifying one or more weights included in the policy responsive to the backup target; and
      
      deleting the state description.
  - 15. The computer program product of claim 13, further comprising the steps of:
    - responsive to a time delay between the first time interval and a current time interval associated with a current state exceeding a delay threshold value, calculating a backup target from the state description;
      
      modifying one or more weights included in the policy responsive to the backup target; and
      
      deleting the state description.
  - 16. The computer program product of claim 13, further comprising the steps of:
    - responsive to a time delay between the first time interval and a current time interval associated with a current state exceeding a delay threshold value or a variance associated with the subsequent state stored in the state description not exceeding a variance threshold value at the current state, calculating a backup target from the state description;
      
      modifying one or more weights included in the policy responsive to the backup target; and
      
      deleting the state description.
  - 17. The computer program product of claim 13, further comprising the step of:
    - responsive to determining the one or more weights included in the policy modified responsive to the backup target converge to a value, storing the policy in a computer-readable storage medium.
  - 18. The computer program product of claim 16, further comprising the steps of:
    - responsive to determining the one or more weights included in the policy modified responsive to the backup target do not converge to a value, determining a second state associated with a second time interval;
      
      determining a current action from the current state using the policy;
      
      determining a second action from the second state using the policy;
      
      determining a current reward value associated with a combination of the current state and the current action;
      
      storing a second state description including the current state, the current action, the second state, the second action and the current reward value;
      
      responsive to a second time delay between the current time interval and an additional time interval associated with an additional state exceeding the delay threshold value or a variance associated with the second state stored in the second state description not exceeding the variance threshold value at the additional state, calculating a second backup target from the second state description;
      
      modifying one or more weights included in the policy responsive to the second backup target; and
      
      deleting the second state description.
  - 19. The computer program product of claim 16, further comprising the steps of:
    - generating a predicted Q-function value associated with the subsequent state stored in the state description and the subsequent action stored in the state description;
      
      attenuating the predicted Q-function value by a discount factor;
      
      generating a sum of the reward value and the predicted Q-function value; and
      
      subtracting a Q-function value associated with the first state stored in the state description and the first action stored in the state description from the sum.
  - 20. The computer program product of claim 19, wherein the predicted Q-function value is dependent upon a probability distribution over the current state.
  - 21. The computer program product of claim 20, wherein the probability distribution over the current state is conditioned on additional stored state descriptions associated with time intervals occurring between the first time interval and the current time interval.
  - 22. The computer program product of claim 16, wherein modifying one or more weights included in the policy responsive to the backup target comprises the steps of:
    - generating a gradient at a current weight of a Q-function value associated with the combination of the first state and the first action;
      
      determining a product of the gradient, the backup target and a learning rate;
      
      generating a modified weight by computing a sum of the product and the current weight; and
      
      storing the modified weight.
  - 23. The computer program product of claim 22, wherein the learning rate comprises a value that is decreased as time intervals elapse.
  - 24. The computer program product of claim 22, wherein the Q-function value comprises a weighted sum of feature functions identifying properties of a state and an action multiplied by one or more weights.

25. A computing system for learning a policy for performing a task comprising:
- a state generation module determining a first state associated with a first time interval and determining a subsequent state associated with a subsequent time interval;
  
  a decision module, coupled to the state generation module, determining a first action from the first state using the policy, which comprises a plurality of weights, properties of one or more actions and properties of one or more states, determining a subsequent action from the subsequent state using the policy, determining a reward value associated with a combination of the first state and the first action and storing a state description including the first state, the first action, the subsequent state, the subsequent action and the reward value.
  - 26. The computing system of claim 25, further comprising:
    - an observation module, coupled to the decision module, determining a position of the computing system using a localization process, determining a position of one or more entities external to the computing system, the position of the computing system and the position of the one or more entities used by the decision module in determining the subsequent action.
  - 27. The computing system of claim 25, wherein the decision module further:
    - calculates a backup target from the state description, modifies one or more weights included in the policy responsive to the backup target and deletes the state description responsive to a variance associated with the subsequent state stored in the state description not exceeding a variance threshold value at a current state.
  - 28. The computing system of claim 25, wherein the decision module further:
    - calculates a backup target from the state description, modifies one or more weights included in the policy responsive to the backup target and deletes the state description responsive to a time delay between the first time interval and a current time interval associated with a current state exceeding a delay threshold value.
  - 29. The computing system of claim 25, wherein the decision module further:
    - calculates a backup target from the state description, modifies one or more weights included in the policy responsive to the backup target and deletes the state description responsive to a time delay between the first time interval and a current time interval associated with a current state exceeding a delay threshold value or responsive to a variance associated with the subsequent state stored in the state description not exceeding a variance threshold value at the current state.
  - 30. The computing system of claim 29, wherein the decision module further:
    - stores the policy responsive to determining that the one or more weights included in the policy modified responsive to the backup target converge to a value.
  - 31. The computing system of claim 29, wherein:
    - the state generation module determines a second state associated with a second time interval responsive to the decision module determining that the one or more weights included in the policy modified responsive to the backup target do not converge to a value; and
      
      the decision module, responsive to determining the one or more weights included in the policy modified responsive to the backup target do not converge to a value, determines a current action from the current state using the policy, determines a second action from the second state using the policy, determines a current reward value associated with a combination of the current state and the current action and stores a second state description including the current state, the current action, the second state, the second action and the current reward value.
  - 32. The computing system of claim 31, wherein the decision module further:
    - calculates a second backup target from the second state description, modifies one or more weights included in the policy responsive to the second backup target and deletes the second state description responsive to a second time delay between the current time interval and an additional time interval associated with an additional state exceeding the delay threshold value or a variance associated with the second state stored in the second state description not exceeding the variance threshold value at the additional state.
  - 33. The computing system of claim 29, wherein the decision module calculates the backup target from the stored state description by:
    - generating a predicted Q-function value associated with the subsequent state stored in the state description and the subsequent action stored in the state description;
      
      attenuating the predicted Q-function value by a discount factor;
      
      generating a sum of the reward value and the predicted Q-function value; and
      
      subtracting a Q-function value associated with the first state stored in the state description and the first action stored in the state description from the sum.
  - 34. The computing system of claim 33, wherein the predicted Q-function value is dependent upon a probability distribution over the current state.
  - 35. The computing system of claim 34, wherein the probability distribution over the current state is conditioned on additional stored state descriptions associated with time intervals occurring between the first time interval and the current time interval.
  - 36. The computing system of claim 29, wherein the decision module modifies one or more weights included in the policy associated with the first state responsive to the backup target calculated from the stored state description by:
    - generating a gradient at a current weight of a Q-function value associated with the combination of the first state and the first action;
      
      determining a product of the gradient, the backup target and a learning rate;
      
      generating a modified weight by computing a sum of the product and the current weight; and
      
      storing the modified weight.
  - 37. The computing system of claim 36, wherein the learning rate comprises a value that is decreased as time intervals elapse.
  - 38. The computing system of claim 36, wherein the Q-function value comprises a weighted sum of feature functions identifying properties of a state and an action multiplied by one or more weights.
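
Claims 34 and 35 (like claims 8, 9, 20 and 21) state that the predicted Q-function value depends on a probability distribution over the state. One way to read this, offered only as an assumption, is that the predicted value is an expectation of Q taken over a discrete belief about which state the system was in. In the Python sketch below, `expected_q`, the dictionary-based `belief`, and the callable `q_of` are hypothetical; how the belief is conditioned on later stored state descriptions is not modelled here.

```python
from typing import Callable, Dict, Hashable


def expected_q(
    belief: Dict[Hashable, float],               # state -> probability, sums to 1
    q_of: Callable[[Hashable, int], float],      # Q(state, action)
    action: int,
) -> float:
    """Belief-weighted prediction: sum over states of P(state) * Q(state, action)."""
    return sum(p * q_of(s, action) for s, p in belief.items())
```

As later observations sharpen the belief, the expectation approaches the Q-value of the single most likely state, which is consistent with waiting for the variance to drop before the backup is computed.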

Current Assignee: Honda Motor Co., Ltd. (Honda Motor Company)
Original Assignee: Honda Motor Co., Ltd. (Honda Motor Company)
Inventors: Ramachandran, Deepak; Gupta, Rakesh
Granted Patent: US 8,326,780 B2
US Class Current: 706/12
CPC Class Codes:
G06N 20/00   Machine learning
G06N 3/006   based on simulated virtual ...
G06N 7/01   Probabilistic graphical mod...
