Evaluating reinforcement learning policies
Abstract
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for evaluating reinforcement learning policies. One of the methods includes receiving a plurality of training histories for a reinforcement learning agent; determining a total reward for each training observation in the training histories; partitioning the training observations into a plurality of partitions, each associated with an action and a total reward; receiving a current observation; determining, for each partition and from the partitioned training observations, a probability that the reinforcement learning agent will receive the total reward for the partition if the reinforcement learning agent performs the action for the partition in response to receiving the current observation; determining, from the probabilities and for each total reward, a respective estimated value of performing each action in response to receiving the current observation; and selecting, using the estimated values and in accordance with an action selection policy, an action from a pre-determined set of actions.
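The pipeline the abstract describes is easy to sketch. The Python fragment below illustrates the first two steps: computing a total reward for each training observation and grouping observations into (action, total reward) partitions. It is a minimal sketch under stated assumptions: the function names are invented for illustration, the "combination" of subsequent rewards is taken to be a discounted sum, and totals are rounded so they can serve as discrete partition keys; the patent does not commit to any of these choices.

```python
from collections import defaultdict

def compute_total_rewards(history, discount=1.0):
    # history: list of (observation, action, reward) tuples, one per time step.
    # The claims only require "a combination" of the rewards received from
    # the action onward; a discounted sum is one common choice, assumed here.
    totals = []
    running = 0.0
    for obs, action, reward in reversed(history):
        running = reward + discount * running  # this step's reward plus later ones
        totals.append((obs, action, running))
    totals.reverse()  # restore time order
    return totals

def partition_observations(histories, discount=1.0, precision=6):
    # Group training observations so each partition holds the observations
    # that share both the same action and the same total reward. Rounding the
    # total is an assumption made here to get hashable, discrete bucket keys.
    partitions = defaultdict(list)
    for history in histories:
        for obs, action, total in compute_total_rewards(history, discount):
            partitions[(action, round(total, precision))].append(obs)
    return partitions
```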
Claims
1. A method performed by one or more computers for controlling a robot interacting with an environment, the method comprising:
receiving a plurality of training histories for the robot, wherein the robot interacts with the environment by receiving observations characterizing states of the environment and, in response to each observation, performing a respective one of a pre-determined set of actions, wherein each training history comprises, for each time step in a sequence of time steps, a respective training observation that characterizes a state of the environment at the time step and associates the training observation with an action performed by the robot at the time step and a reward received by the robot in response to performing the action;
determining a total reward for each training observation in the training histories, wherein the total reward is a combination of rewards received by the robot subsequent to performing the action at the time step corresponding to the training observation;
partitioning the training observations into a plurality of partitions, each partition including training observations having the same total reward and being associated with the same action;
receiving a current observation characterizing a current state of the environment;
determining, for each partition and from the partitioned training observations, a probability that the robot will receive the total reward for the partition if the robot performs the action for the partition in response to receiving the current observation;
determining, from the probabilities and for each total reward, a respective estimated value of performing each action in response to receiving the current observation; and
controlling the robot by selecting, as an action to be performed by the robot in response to the current observation, an action from the pre-determined set of actions in accordance with an action selection policy, the action selection policy including one or more rules for selecting between the actions in the pre-determined set of actions using the estimated values.
Dependent claims: 2, 3, 4, 5, 6, 7, 8.
9. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations for controlling a robot interacting with an environment, the operations comprising:
receiving a plurality of training histories for the robot, wherein the robot interacts with the environment by receiving observations characterizing states of the environment and, in response to each observation, performing a respective one of a pre-determined set of actions, wherein each training history comprises, for each time step in a sequence of time steps, a respective training observation that characterizes a state of the environment at the time step and associates the training observation with an action performed by the robot at the time step and a reward received by the robot in response to performing the action;
determining a total reward for each training observation in the training histories, wherein the total reward is a combination of rewards received by the robot subsequent to performing the action at the time step corresponding to the training observation;
partitioning the training observations into a plurality of partitions, each partition including training observations having the same total reward and being associated with the same action;
receiving a current observation characterizing a current state of the environment;
determining, for each partition and from the partitioned training observations, a probability that the robot will receive the total reward for the partition if the robot performs the action for the partition in response to receiving the current observation;
determining, from the probabilities and for each total reward, a respective estimated value of performing each action in response to receiving the current observation; and
controlling the robot by selecting, as an action to be performed by the robot in response to the current observation, an action from the pre-determined set of actions in accordance with an action selection policy, the action selection policy including one or more rules for selecting between the actions in the pre-determined set of actions using the estimated values.
Dependent claims: 10, 11, 12, 13, 14, 15, 16.
17. A computer program product encoded on one or more non-transitory computer storage media, the computer program product comprising instructions that when executed by one or more computers cause the one or more computers to perform operations for controlling a robot interacting with an environment, the operations comprising:
receiving a plurality of training histories for the robot, wherein the robot interacts with the environment by receiving observations characterizing states of the environment and, in response to each observation, performing a respective one of a pre-determined set of actions, wherein each training history comprises, for each time step in a sequence of time steps, a respective training observation that characterizes a state of the environment at the time step and associates the training observation with an action performed by the robot at the time step and a reward received by the robot in response to performing the action;
determining a total reward for each training observation in the training histories, wherein the total reward is a combination of rewards received by the robot subsequent to performing the action at the time step corresponding to the training observation;
partitioning the training observations into a plurality of partitions, each partition including training observations having the same total reward and being associated with the same action;
receiving a current observation characterizing a current state of the environment;
determining, for each partition and from the partitioned training observations, a probability that the robot will receive the total reward for the partition if the robot performs the action for the partition in response to receiving the current observation;
determining, from the probabilities and for each total reward, a respective estimated value of performing each action in response to receiving the current observation; and
controlling the robot by selecting, as an action to be performed by the robot in response to the current observation, an action from the pre-determined set of actions in accordance with an action selection policy, the action selection policy including one or more rules for selecting between the actions in the pre-determined set of actions using the estimated values.
Dependent claims: 18, 19, 20.
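Continuing the sketch given after the abstract, the fragment below illustrates the remaining steps recited in the independent claims: estimating, for each partition, the probability of receiving that partition's total reward when its action is performed in response to the current observation; deriving an estimated value per action; and applying an action selection policy. The claims leave the probability estimator unspecified, so the Gaussian-kernel similarity here is a stand-in assumption, as are the epsilon-greedy rule, the assumption that observations are fixed-length numeric vectors, and every function and parameter name.

```python
import numpy as np
from collections import defaultdict

def estimate_probabilities(partitions, current_obs, bandwidth=1.0):
    # For each (action, total_reward) partition, score the current observation
    # against the partition's training observations. A Gaussian kernel over
    # Euclidean distance is an illustrative stand-in; the claims do not say
    # how this probability is determined.
    current = np.asarray(current_obs, dtype=float)
    scores = {}
    for (action, total), obs_list in partitions.items():
        diffs = np.asarray(obs_list, dtype=float) - current
        scores[(action, total)] = float(np.mean(
            np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * bandwidth ** 2))))
    # Normalize per action so the scores over total rewards form a
    # probability distribution for each action.
    norm = defaultdict(float)
    for (action, _), score in scores.items():
        norm[action] += score
    return {key: score / norm[key[0]] for key, score in scores.items()}

def estimated_values(probabilities):
    # Expected total reward of each action under the estimated distribution.
    values = defaultdict(float)
    for (action, total), prob in probabilities.items():
        values[action] += prob * total
    return dict(values)

def select_action(values, epsilon=0.0, rng=None):
    # Action selection policy: greedy over the estimated values, with optional
    # epsilon-greedy exploration, one example of the "one or more rules"
    # the claims allow.
    rng = rng or np.random.default_rng()
    actions = list(values)
    if rng.random() < epsilon:
        return actions[int(rng.integers(len(actions)))]
    return max(actions, key=values.get)
```

With these pieces, responding to a new observation chains straightforwardly: select_action(estimated_values(estimate_probabilities(partitions, current_obs))).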
Specification