Smoothed SARSA: Reinforcement learning for robot delivery tasks
First Claim
1. A computer-implemented method for learning a policy for performing a task by a computing system, the method comprising the steps of:
determining, by the computing system, a first state associated with a first time interval;
determining, by the computing system, a subsequent state associated with a subsequent time interval;
determining, by the computing system, a first action from the first state using the policy, which comprises a plurality of weights, properties of one or more actions and properties of one or more states;
determining, by the computing system, a subsequent action from the subsequent state using the policy;
determining, by the computing system, a reward value associated with a combination of the first state and the first action;
storing, by the computing system, a state description including the first state, the first action, the subsequent state, the subsequent action and the reward value in a non-transitory computer-readable storage medium;
responsive to a time delay between the first time interval and a current time interval associated with a current state exceeding a delay threshold value or a variance associated with the subsequent state stored in the state description not exceeding a variance threshold value at the current state, calculating, by the computing system, a backup target from the state description;
modifying, by the computing system, one or more weights of the plurality of weights responsive to the backup target; and
deleting, by the computing system, the state description from the non-transitory computer-readable storage medium.
Abstract
The present invention provides a method for learning a policy used by a computing system to perform a task, such as delivery of one or more objects by the computing system. During a first time interval, the computing system determines a first state, a first action and a first reward value. As the computing system determines different states, actions and reward values during subsequent time intervals, a state description identifying the current state, the current action, the current reward and a predicted action is stored. Responsive to the variance of a stored state description falling below a threshold value, the stored state description is used to modify one or more weights in the policy associated with the first state.
34 Claims
1. A computer-implemented method for learning a policy for performing a task by a computing system, the method comprising the steps of:
determining, by the computing system, a first state associated with a first time interval;
determining, by the computing system, a subsequent state associated with a subsequent time interval;
determining, by the computing system, a first action from the first state using the policy, which comprises a plurality of weights, properties of one or more actions and properties of one or more states;
determining, by the computing system, a subsequent action from the subsequent state using the policy;
determining, by the computing system, a reward value associated with a combination of the first state and the first action;
storing, by the computing system, a state description including the first state, the first action, the subsequent state, the subsequent action and the reward value in a non-transitory computer-readable storage medium;
responsive to a time delay between the first time interval and a current time interval associated with a current state exceeding a delay threshold value, or a variance associated with the subsequent state stored in the state description not exceeding a variance threshold value at the current state, calculating, by the computing system, a backup target from the state description;
modifying, by the computing system, one or more weights of the plurality of weights responsive to the backup target; and
deleting, by the computing system, the state description from the non-transitory computer-readable storage medium.
Dependent claims: 2–11.
12. A computer program product comprising a non-transitory computer-readable storage medium storing computer-executable code for learning a policy for performing a task by a computing system, the computer-executable code performing the steps of:
determining a first state associated with a first time interval;
determining a subsequent state associated with a subsequent time interval;
determining a first action from the first state using the policy, which comprises a plurality of weights, properties of one or more actions and properties of one or more states;
determining a subsequent action from the subsequent state using the policy;
determining a reward value associated with a combination of the first state and the first action;
storing a state description including the first state, the first action, the subsequent state, the subsequent action and the reward value;
responsive to a time delay between the first time interval and a current time interval associated with a current state exceeding a delay threshold value, or a variance associated with the subsequent state stored in the state description not exceeding a variance threshold value at the current state, calculating a backup target from the state description;
modifying one or more weights included in the policy responsive to the backup target; and
deleting the state description.
Dependent claims: 13–22.
23. A computing system for learning a policy for performing a task comprising:
a non-transitory computer-readable storage medium containing executable computer instructions comprising:
a state generation module configured to determine a first state associated with a first time interval and to determine a subsequent state associated with a subsequent time interval;
a decision module, coupled to the state generation module, configured to determine a first action from the first state using the policy, which comprises a plurality of weights, properties of one or more actions and properties of one or more states, to determine a subsequent action from the subsequent state using the policy, to determine a reward value associated with a combination of the first state and the first action, and to store a state description including the first state, the first action, the subsequent state, the subsequent action and the reward value; and
an observation module, coupled to the decision module, configured to determine a position of the computing system using a localization process and to determine a position of one or more entities external to the computing system, the position of the computing system and the position of the one or more entities used by the decision module in determining the subsequent action;
wherein, in response to a time delay between the first time interval and a current time interval associated with a current state exceeding a delay threshold value, or to a variance associated with the subsequent state stored in the state description not exceeding a variance threshold value at the current state, the decision module is further configured to calculate a backup target from the state description, to modify one or more weights included in the policy responsive to the backup target, and to delete the state description; and
a processor configured to execute the computer instructions.
Dependent claims: 24–34.
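Claim 23 recasts the method as a three-module system: a state generation module, a decision module coupled to it, and an observation module feeding positions into action selection. A hypothetical skeleton of that wiring might look like the following; every class name, method name, and toy value here is invented for illustration and does not come from the patent.

```python
class StateGenerationModule:
    """Determines the first state and each subsequent state per interval."""
    def state_at(self, t):
        return t % 4                      # toy state signal

class ObservationModule:
    """Determines the system's own position (via localization) and the
    positions of external entities, used when choosing actions."""
    def positions(self):
        own = (0.0, 0.0)
        entities = [(1.0, 2.0)]
        return own, entities

class DecisionModule:
    """Selects actions from the policy, stores state descriptions, and
    (per the wherein clause) performs the delayed backups before
    deleting each description."""
    def __init__(self, state_gen, observer):
        self.state_gen = state_gen        # coupled to the state generator
        self.observer = observer          # coupled to the observer
        self.weights = {}                 # the policy's plurality of weights
        self.stored = []                  # stored state descriptions

    def act(self, state):
        # Greedy choice over two toy actions; unseen pairs score 0.
        own, entities = self.observer.positions()
        return max((0, 1), key=lambda a: self.weights.get((state, a), 0.0))

# Wiring mirrors the claim: the decision module is coupled to both
# other modules, and a processor would execute these instructions.
system = DecisionModule(StateGenerationModule(), ObservationModule())
```

Splitting state estimation, observation, and decision-making into separate modules is a common robotics decomposition; the backup logic itself would live inside `DecisionModule`, operating on `self.stored` exactly as in the method claim.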
Specification