FRAMEWORK AND METHODS OF DIVERSE EXPLORATION FOR FAST AND SAFE POLICY IMPROVEMENT
Abstract
The present technology addresses the problem of quickly and safely improving policies in online reinforcement learning domains. As its solution, an exploration strategy comprising diverse exploration (DE) is employed, which learns and deploys a diverse set of safe policies to explore the environment. DE theory explains why diversity in behavior policies enables effective exploration without sacrificing exploitation. An empirical study shows that an online policy improvement algorithm framework implementing the DE strategy can achieve both fast policy improvement and safe online performance.
24 Citations
24 Claims
- 1. A method of learning and deploying a set of behavior policies for an artificial agent, selected from a set of behavior policies, each having a statistically expected return no worse than a lower bound of policy performance which excludes a portion of the set of behavior policies, comprising iteratively improving a behavior policy for each iteration of policy improvement, employing a diverse exploration strategy which strives for behavior diversity in a space of stochastic policies by deploying a diverse set comprising a plurality of behavior policies which are ensured as being safe during each iteration of policy improvement and assessing performance of the artificial agent.
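Claim 1 turns on confirming that each deployed policy's expected return is, with high confidence, no worse than a performance lower bound. As one illustrative instantiation (not the patent's prescribed test), such a confirmation step can be sketched with a Hoeffding-style one-sided confidence bound on empirical returns; the function names and the choice of concentration inequality here are assumptions for the sketch:

```python
import math
from statistics import mean

def hoeffding_lower_bound(returns, delta=0.05, g_max=1.0):
    """One-sided Hoeffding lower confidence bound on the expected return.
    Assumes each trajectory return lies in [0, g_max]; with probability at
    least 1 - delta, the true expected return is at least this value."""
    n = len(returns)
    return mean(returns) - g_max * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def is_confirmed_safe(candidate_returns, baseline_performance,
                      delta=0.05, g_max=1.0):
    """Confirm a candidate policy only if its high-confidence lower bound
    on expected return is no worse than the baseline performance level."""
    lb = hoeffding_lower_bound(candidate_returns, delta, g_max)
    return lb >= baseline_performance
```

A policy whose bound falls below the baseline is simply excluded from the deployed set, which is how "excludes a portion of the set of behavior policies" is realized in this sketch.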
18. A method of iterative policy improvement in reinforcement learning, comprising:

in each policy improvement iteration i, deploying a most recently confirmed set of policies 𝒫 to collect n trajectories uniformly distributed over the respective policies π_i within the set of policies, π_i ∈ 𝒫; for each set of trajectories 𝒟_i collected from a respective policy π_i, partitioning 𝒟_i and appending to a training set of trajectories 𝒟_train and a testing set of trajectories 𝒟_test; from 𝒟_train, generating a set of candidate policies and evaluating them using 𝒟_test; confirming a subset of policies as meeting predetermined criteria; and if no new policies π_i are confirmed, redeploying the current set of policies 𝒫. (Dependent claims: 19, 20.)
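The iteration recited in claim 18 can be sketched as a single function; the callbacks `collect`, `generate_candidates`, and `passes_safety_test` are hypothetical placeholders standing in for the environment interaction, candidate generation, and confirmation criteria, not the patent's implementation:

```python
import random

def diverse_exploration_iteration(confirmed_policies, n, collect,
                                  generate_candidates, passes_safety_test):
    """One policy improvement iteration: deploy the confirmed set, collect
    trajectories uniformly over its policies, split them into train/test,
    generate and evaluate candidates, and confirm a safe subset."""
    d_train, d_test = [], []
    # n trajectories uniformly distributed over the policies in the set
    per_policy = max(1, n // len(confirmed_policies))
    for pi in confirmed_policies:
        d_i = [collect(pi) for _ in range(per_policy)]  # trajectories from pi
        random.shuffle(d_i)
        split = len(d_i) // 2
        d_train.extend(d_i[:split])   # partition D_i and append to D_train ...
        d_test.extend(d_i[split:])    # ... and to D_test
    candidates = generate_candidates(d_train)
    newly_confirmed = [pi for pi in candidates
                       if passes_safety_test(pi, d_test)]
    # if no new policies are confirmed, redeploy the current set
    return newly_confirmed if newly_confirmed else confirmed_policies
```

Returning the prior set when no candidate is confirmed is what keeps every deployed iteration safe: the agent never falls back to an unvetted policy.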
21. An apparatus for learning and deploying a set of behavior policies for an artificial agent, selected from a set of behavior policies, to control a system within an environment, comprising:
an input configured to receive data from operation of the system according to a respective behavior policy;

at least one automated processor configured to iteratively generate sets of behavior policies based on prior received data, which iteratively improve a behavior policy for each iteration of policy improvement, each behavior policy having a statistically expected return no worse than a lower bound of policy performance which excludes a portion of the set of behavior policies and which is ensured as being safe, employing a diverse exploration strategy which strives for behavior diversity in a space of stochastic policies by deploying a diverse set comprising a plurality of behavior policies during each iteration of policy improvement; and

at least one output configured to control the system in accordance with a respective behavior policy of the set of behavior policies. (Dependent claims: 22, 23, 24.)
Specification