Evaluating reinforcement learning policies

  • US 10,445,653 B1
  • Filed: 08/07/2015
  • Issued: 10/15/2019
  • Est. Priority Date: 08/07/2014
  • Status: Active Grant
First Claim

1. A method performed by one or more computers for controlling a robot interacting with an environment, the method comprising:

    receiving a plurality of training histories for the robot, wherein the robot interacts with the environment by receiving observations characterizing states of environment and, in response to each observation, performing a respective one of a pre-determined set of actions, wherein each training history comprises, for each time step in a sequence of time steps, a respective training observation that characterizes a state of the environment at the time step and associates the training observation with an action performed by the robot at the time step and a reward received by the robot in response to performing the action;

    determining a total reward for each training observation in the training histories, wherein the total reward is a combination of rewards received by the robot subsequent to performing the action at the time step corresponding to the training observation;

    partitioning the training observations into a plurality of partitions, each partition including training observations having the same total reward and being associated with the same action;

    receiving a current observation characterizing a current state of the environment;

    determining, for each partition and from the partitioned training observations, a probability that the robot will receive the total reward for the partition if the robot performs the action for the partition in response to receiving the current observation;

    determining, from the probabilities and for each total reward, a respective estimated value of performing each action in response to receiving the current observation; and

    controlling the robot by selecting, as an action to be performed by the robot in response to the current observation, an action from the pre-determined set of actions in accordance with an action selection policy, the action selection policy including one or more rules for selecting between the actions in the pre-determined set of actions using the estimated values.
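The steps of claim 1 can be illustrated with the minimal Python sketch below. It is a hypothetical illustration under stated assumptions, not the patented implementation, and every function and parameter name in it is made up for this example.

The sketch makes four assumptions:
- The total reward is the (optionally discounted) sum of the reward at the step and all later rewards.
- Total rewards are rounded so that "the same total reward" can be matched exactly.
- The claim does not say how the per-partition probability is computed, so the Gaussian-kernel similarity weighting used here is a swapped-in choice.
- The estimated value of an action is taken as its expected total reward, and the action selection policy is a single greedy rule.

```python
import math
from collections import defaultdict

# Assumed data layout (hypothetical): an observation is a tuple of floats, and a
# training history is a list of (observation, action, reward) tuples, one per step.


def total_rewards(history, gamma=1.0):
    """Total reward per step, assumed to be r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    returns = [0.0] * len(history)
    running = 0.0
    for t in range(len(history) - 1, -1, -1):  # walk backwards through the episode
        _, _, reward = history[t]
        running = reward + gamma * running
        returns[t] = running
    return returns


def partition(histories, gamma=1.0, ndigits=6):
    """Group training observations by (rounded total reward, action)."""
    parts = defaultdict(list)
    for history in histories:
        for (obs, action, _), g in zip(history, total_rewards(history, gamma)):
            parts[(round(g, ndigits), action)].append(obs)
    return dict(parts)


def kernel(a, b, bandwidth):
    """Gaussian similarity between two observations (an assumed choice)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def partition_probabilities(parts, current_obs, bandwidth=1.0):
    """Estimate P(total reward g | current_obs, action a) for each partition.

    Each partition is weighted by how similar its observations are to the current
    observation; weights are normalized over the partitions sharing an action.
    Actions whose total weight is zero are left out.
    """
    weights = {key: sum(kernel(o, current_obs, bandwidth) for o in obs_list)
               for key, obs_list in parts.items()}
    per_action = defaultdict(float)
    for (_, action), w in weights.items():
        per_action[action] += w
    return {(g, a): w / per_action[a]
            for (g, a), w in weights.items() if per_action[a] > 0}


def estimated_values(probs):
    """Estimated value of each action: sum over g of g * P(g | obs, a)."""
    values = defaultdict(float)
    for (g, action), p in probs.items():
        values[action] += g * p
    return dict(values)


def select_action(values):
    """One possible selection rule: greedy argmax over the estimated values."""
    return max(values, key=values.get)


if __name__ == "__main__":
    # Two toy one-dimensional training histories with integer rewards.
    histories = [
        [((0.0,), "left", 0), ((1.0,), "right", 1)],
        [((0.0,), "right", 0), ((1.0,), "right", 0)],
    ]
    probs = partition_probabilities(partition(histories), current_obs=(0.0,))
    values = estimated_values(probs)
    print(select_action(values))  # prints "left"
```

In the toy data, "left" from observation (0.0,) always led to a total reward of 1. Most of the kernel weight for "right" near (0.0,) falls in the partition with total reward 0. The greedy rule therefore picks "left".

An epsilon-greedy or softmax rule could replace `select_action`. If every action's kernel weight underflows to zero, `values` is empty and `max` raises `ValueError`, so a real controller would need a fallback action.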

  • 4 Assignments