Evaluating reinforcement learning policies
Abstract
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for evaluating reinforcement learning policies. One of the methods includes receiving a plurality of training histories for a reinforcement learning agent; determining a total reward for each training observation in the training histories; partitioning the training observations into a plurality of partitions, each associated with an action and a total reward; receiving a current observation; determining, for each partition and from the partitioned training observations, a probability that the reinforcement learning agent will receive the total reward for the partition if the reinforcement learning agent performs the action for the partition in response to receiving the current observation; determining, from the probabilities and for each total reward, a respective estimated value of performing each action in response to receiving the current observation; and selecting, using the estimated values and in accordance with an action selection policy, an action from a pre-determined set of actions.
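The pipeline the abstract describes is easy to sketch. The Python fragment below illustrates the first two steps: computing a total reward for each training observation and grouping observations into (action, total reward) partitions. It is a minimal sketch under stated assumptions: the function names are invented for illustration, the "combination" of subsequent rewards is taken to be a discounted sum, and totals are rounded so they can serve as discrete partition keys; the patent does not commit to any of these choices.

```python
from collections import defaultdict

def compute_total_rewards(history, discount=1.0):
    # history: list of (observation, action, reward) tuples, one per time step.
    # The claims only require "a combination" of the rewards received from
    # the action onward; a discounted sum is one common choice, assumed here.
    totals = []
    running = 0.0
    for obs, action, reward in reversed(history):
        running = reward + discount * running  # this step's reward plus later ones
        totals.append((obs, action, running))
    totals.reverse()  # restore time order
    return totals

def partition_observations(histories, discount=1.0, precision=6):
    # Group training observations so each partition holds the observations
    # that share both the same action and the same total reward. Rounding the
    # total is an assumption made here to get hashable, discrete bucket keys.
    partitions = defaultdict(list)
    for history in histories:
        for obs, action, total in compute_total_rewards(history, discount):
            partitions[(action, round(total, precision))].append(obs)
    return partitions
```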
Claims
1. A method performed by one or more computers for controlling a robot interacting with an environment, the method comprising:
receiving a plurality of training histories for the robot, wherein the robot interacts with the environment by receiving observations characterizing states of the environment and, in response to each observation, performing a respective one of a pre-determined set of actions, wherein each training history comprises, for each time step in a sequence of time steps, a respective training observation that characterizes a state of the environment at the time step and associates the training observation with an action performed by the robot at the time step and a reward received by the robot in response to performing the action;
determining a total reward for each training observation in the training histories, wherein the total reward is a combination of rewards received by the robot subsequent to performing the action at the time step corresponding to the training observation;
partitioning the training observations into a plurality of partitions, each partition including training observations having the same total reward and being associated with the same action;
receiving a current observation characterizing a current state of the environment;
determining, for each partition and from the partitioned training observations, a probability that the robot will receive the total reward for the partition if the robot performs the action for the partition in response to receiving the current observation;
determining, from the probabilities and for each total reward, a respective estimated value of performing each action in response to receiving the current observation; and
controlling the robot by selecting, as an action to be performed by the robot in response to the current observation, an action from the pre-determined set of actions in accordance with an action selection policy, the action selection policy including one or more rules for selecting between the actions in the pre-determined set of actions using the estimated values.
Dependent claims: 2, 3, 4, 5, 6, 7, 8.
9. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations for controlling a robot interacting with an environment, the operations comprising:
receiving a plurality of training histories for the robot, wherein the robot interacts with the environment by receiving observations characterizing states of the environment and, in response to each observation, performing a respective one of a pre-determined set of actions, wherein each training history comprises, for each time step in a sequence of time steps, a respective training observation that characterizes a state of the environment at the time step and associates the training observation with an action performed by the robot at the time step and a reward received by the robot in response to performing the action;
determining a total reward for each training observation in the training histories, wherein the total reward is a combination of rewards received by the robot subsequent to performing the action at the time step corresponding to the training observation;
partitioning the training observations into a plurality of partitions, each partition including training observations having the same total reward and being associated with the same action;
receiving a current observation characterizing a current state of the environment;
determining, for each partition and from the partitioned training observations, a probability that the robot will receive the total reward for the partition if the robot performs the action for the partition in response to receiving the current observation;
determining, from the probabilities and for each total reward, a respective estimated value of performing each action in response to receiving the current observation; and
controlling the robot by selecting, as an action to be performed by the robot in response to the current observation, an action from the pre-determined set of actions in accordance with an action selection policy, the action selection policy including one or more rules for selecting between the actions in the pre-determined set of actions using the estimated values.
Dependent claims: 10, 11, 12, 13, 14, 15, 16.
17. A computer program product encoded on one or more non-transitory computer storage media, the computer program product comprising instructions that when executed by one or more computers cause the one or more computers to perform operations for controlling a robot interacting with an environment, the operations comprising:
receiving a plurality of training histories for the robot, wherein the robot interacts with the environment by receiving observations characterizing states of the environment and, in response to each observation, performing a respective one of a pre-determined set of actions, wherein each training history comprises, for each time step in a sequence of time steps, a respective training observation that characterizes a state of the environment at the time step and associates the training observation with an action performed by the robot at the time step and a reward received by the robot in response to performing the action;
determining a total reward for each training observation in the training histories, wherein the total reward is a combination of rewards received by the robot subsequent to performing the action at the time step corresponding to the training observation;
partitioning the training observations into a plurality of partitions, each partition including training observations having the same total reward and being associated with the same action;
receiving a current observation characterizing a current state of the environment;
determining, for each partition and from the partitioned training observations, a probability that the robot will receive the total reward for the partition if the robot performs the action for the partition in response to receiving the current observation;
determining, from the probabilities and for each total reward, a respective estimated value of performing each action in response to receiving the current observation; and
controlling the robot by selecting, as an action to be performed by the robot in response to the current observation, an action from the pre-determined set of actions in accordance with an action selection policy, the action selection policy including one or more rules for selecting between the actions in the pre-determined set of actions using the estimated values.
Dependent claims: 18, 19, 20.
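Continuing the sketch given after the abstract, the fragment below illustrates the remaining steps recited in the independent claims: estimating, for each partition, the probability of receiving that partition's total reward when its action is performed in response to the current observation; deriving an estimated value per action; and applying an action selection policy. The claims leave the probability estimator unspecified, so the Gaussian-kernel similarity here is a stand-in assumption, as are the epsilon-greedy rule, the assumption that observations are fixed-length numeric vectors, and every function and parameter name.

```python
import numpy as np
from collections import defaultdict

def estimate_probabilities(partitions, current_obs, bandwidth=1.0):
    # For each (action, total_reward) partition, score the current observation
    # against the partition's training observations. A Gaussian kernel over
    # Euclidean distance is an illustrative stand-in; the claims do not say
    # how this probability is determined.
    current = np.asarray(current_obs, dtype=float)
    scores = {}
    for (action, total), obs_list in partitions.items():
        diffs = np.asarray(obs_list, dtype=float) - current
        scores[(action, total)] = float(np.mean(
            np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * bandwidth ** 2))))
    # Normalize per action so the scores over total rewards form a
    # probability distribution for each action.
    norm = defaultdict(float)
    for (action, _), score in scores.items():
        norm[action] += score
    return {key: score / norm[key[0]] for key, score in scores.items()}

def estimated_values(probabilities):
    # Expected total reward of each action under the estimated distribution.
    values = defaultdict(float)
    for (action, total), prob in probabilities.items():
        values[action] += prob * total
    return dict(values)

def select_action(values, epsilon=0.0, rng=None):
    # Action selection policy: greedy over the estimated values, with optional
    # epsilon-greedy exploration, one example of the "one or more rules"
    # the claims allow.
    rng = rng or np.random.default_rng()
    actions = list(values)
    if rng.random() < epsilon:
        return actions[int(rng.integers(len(actions)))]
    return max(actions, key=values.get)
```

With these pieces, responding to a new observation chains straightforwardly: select_action(estimated_values(estimate_probabilities(partitions, current_obs))).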
Specification