Smoothed SARSA: Reinforcement learning for robot delivery tasks
First Claim
1. A computer-implemented method for learning a policy for performing a task by a computing system, the method comprising the steps of:
determining, by the computing system, a first state associated with a first time interval;
determining, by the computing system, a subsequent state associated with a subsequent time interval;
determining, by the computing system, a first action from the first state using the policy, which comprises a plurality of weights, properties of one or more actions and properties of one or more states;
determining, by the computing system, a subsequent action from the subsequent state using the policy;
determining, by the computing system, a reward value associated with a combination of the first state and the first action;
storing, by the computing system, a state description including the first state, the first action, the subsequent state, the subsequent action and the reward value in a non-transitory computer-readable storage medium;
responsive to a time delay between the first time interval and a current time interval associated with a current state exceeding a delay threshold value or a variance associated with the subsequent state stored in the state description not exceeding a variance threshold value at the current state, calculating, by the computing system, a backup target from the state description;
modifying, by the computing system, one or more weights of the plurality of weights responsive to the backup target; and
deleting, by the computing system, the state description from the non-transitory computer-readable storage medium.
Abstract
The present invention provides a method for learning a policy used by a computing system to perform a task, such as delivery of one or more objects by the computing system. During a first time interval, the computing system determines a first state, a first action and a first reward value. As the computing system determines different states, actions and reward values during subsequent time intervals, a state description identifying the current state, the current action, the current reward and a predicted action is stored. Responsive to the variance of a stored state description falling below a threshold value, the stored state description is used to modify one or more weights in the policy associated with the first state.
34 Claims
1. A computer-implemented method for learning a policy for performing a task by a computing system, the method comprising the steps of:
determining, by the computing system, a first state associated with a first time interval;
determining, by the computing system, a subsequent state associated with a subsequent time interval;
determining, by the computing system, a first action from the first state using the policy, which comprises a plurality of weights, properties of one or more actions and properties of one or more states;
determining, by the computing system, a subsequent action from the subsequent state using the policy;
determining, by the computing system, a reward value associated with a combination of the first state and the first action;
storing, by the computing system, a state description including the first state, the first action, the subsequent state, the subsequent action and the reward value in a non-transitory computer-readable storage medium;
responsive to a time delay between the first time interval and a current time interval associated with a current state exceeding a delay threshold value, or a variance associated with the subsequent state stored in the state description not exceeding a variance threshold value at the current state, calculating, by the computing system, a backup target from the state description;
modifying, by the computing system, one or more weights of the plurality of weights responsive to the backup target; and
deleting, by the computing system, the state description from the non-transitory computer-readable storage medium.
Dependent claims: 2–11.
12. A computer program product comprising a non-transitory computer-readable storage medium storing computer-executable code for learning a policy for performing a task by a computing system, the computer-executable code performing the steps of:
determining a first state associated with a first time interval;
determining a subsequent state associated with a subsequent time interval;
determining a first action from the first state using the policy, which comprises a plurality of weights, properties of one or more actions and properties of one or more states;
determining a subsequent action from the subsequent state using the policy;
determining a reward value associated with a combination of the first state and the first action;
storing a state description including the first state, the first action, the subsequent state, the subsequent action and the reward value;
responsive to a time delay between the first time interval and a current time interval associated with a current state exceeding a delay threshold value, or a variance associated with the subsequent state stored in the state description not exceeding a variance threshold value at the current state, calculating a backup target from the state description;
modifying one or more weights included in the policy responsive to the backup target; and
deleting the state description.
Dependent claims: 13–22.
23. A computing system for learning a policy for performing a task comprising:
a non-transitory computer-readable storage medium containing executable computer instructions comprising:
a state generation module configured to determine a first state associated with a first time interval and to determine a subsequent state associated with a subsequent time interval;
a decision module, coupled to the state generation module, configured to determine a first action from the first state using the policy, which comprises a plurality of weights, properties of one or more actions and properties of one or more states, to determine a subsequent action from the subsequent state using the policy, to determine a reward value associated with a combination of the first state and the first action, and to store a state description including the first state, the first action, the subsequent state, the subsequent action and the reward value; and
an observation module, coupled to the decision module, configured to determine a position of the computing system using a localization process and to determine a position of one or more entities external to the computing system, the position of the computing system and the position of the one or more entities used by the decision module in determining the subsequent action;
wherein, in response to a time delay between the first time interval and a current time interval associated with a current state exceeding a delay threshold value, or to a variance associated with the subsequent state stored in the state description not exceeding a variance threshold value at the current state, the decision module is further configured to calculate a backup target from the state description, to modify one or more weights included in the policy responsive to the backup target, and to delete the state description; and
a processor configured to execute the computer instructions.
Dependent claims: 24–34.
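Claim 23 recasts the method as a three-module system: a state generation module, a decision module coupled to it, and an observation module feeding positions into action selection. A hypothetical skeleton of that wiring might look like the following; every class name, method name, and toy value here is invented for illustration and does not come from the patent.

```python
class StateGenerationModule:
    """Determines the first state and each subsequent state per interval."""
    def state_at(self, t):
        return t % 4                      # toy state signal

class ObservationModule:
    """Determines the system's own position (via localization) and the
    positions of external entities, used when choosing actions."""
    def positions(self):
        own = (0.0, 0.0)
        entities = [(1.0, 2.0)]
        return own, entities

class DecisionModule:
    """Selects actions from the policy, stores state descriptions, and
    (per the wherein clause) performs the delayed backups before
    deleting each description."""
    def __init__(self, state_gen, observer):
        self.state_gen = state_gen        # coupled to the state generator
        self.observer = observer          # coupled to the observer
        self.weights = {}                 # the policy's plurality of weights
        self.stored = []                  # stored state descriptions

    def act(self, state):
        # Greedy choice over two toy actions; unseen pairs score 0.
        own, entities = self.observer.positions()
        return max((0, 1), key=lambda a: self.weights.get((state, a), 0.0))

# Wiring mirrors the claim: the decision module is coupled to both
# other modules, and a processor would execute these instructions.
system = DecisionModule(StateGenerationModule(), ObservationModule())
```

Splitting state estimation, observation, and decision-making into separate modules is a common robotics decomposition; the backup logic itself would live inside `DecisionModule`, operating on `self.stored` exactly as in the method claim.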
Specification