FRAMEWORK AND METHODS OF DIVERSE EXPLORATION FOR FAST AND SAFE POLICY IMPROVEMENT
Abstract
The present technology addresses the problem of quickly and safely improving policies in online reinforcement learning domains. As its solution, an exploration strategy comprising diverse exploration (DE) is employed, which learns and deploys a diverse set of safe policies to explore the environment. DE theory explains why diversity in behavior policies enables effective exploration without sacrificing exploitation. An empirical study shows that an online policy improvement algorithm framework implementing the DE strategy can achieve both fast policy improvement and safe online performance.
24 Citations
24 Claims
- 1. A method of learning and deploying a set of behavior policies for an artificial agent, selected from a set of behavior policies, each having a statistically expected return no worse than a lower bound of policy performance which excludes a portion of the set of behavior policies, comprising iteratively improving a behavior policy for each iteration of policy improvement, employing a diverse exploration strategy which strives for behavior diversity in a space of stochastic policies by deploying a diverse set comprising a plurality of behavior policies which are ensured as being safe during each iteration of policy improvement and assessing performance of the artificial agent.
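Claim 1 turns on confirming that each deployed policy's expected return is, with high confidence, no worse than a performance lower bound. As one illustrative instantiation (not the patent's prescribed test), such a confirmation step can be sketched with a Hoeffding-style one-sided confidence bound on empirical returns; the function names and the choice of concentration inequality here are assumptions for the sketch:

```python
import math
from statistics import mean

def hoeffding_lower_bound(returns, delta=0.05, g_max=1.0):
    """One-sided Hoeffding lower confidence bound on the expected return.
    Assumes each trajectory return lies in [0, g_max]; with probability at
    least 1 - delta, the true expected return is at least this value."""
    n = len(returns)
    return mean(returns) - g_max * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def is_confirmed_safe(candidate_returns, baseline_performance,
                      delta=0.05, g_max=1.0):
    """Confirm a candidate policy only if its high-confidence lower bound
    on expected return is no worse than the baseline performance level."""
    lb = hoeffding_lower_bound(candidate_returns, delta, g_max)
    return lb >= baseline_performance
```

A policy whose bound falls below the baseline is simply excluded from the deployed set, which is how "excludes a portion of the set of behavior policies" is realized in this sketch.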
18. A method of iterative policy improvement in reinforcement learning, comprising:

in each policy improvement iteration i, deploying a most recently confirmed set of policies 𝒫 to collect n trajectories uniformly distributed over the respective policies π_i within the set of policies, π_i ∈ 𝒫; for each set of trajectories 𝒟_i collected from a respective policy π_i, partitioning 𝒟_i and appending to a training set of trajectories 𝒟_train and a testing set of trajectories 𝒟_test; from 𝒟_train, generating a set of candidate policies and evaluating them using 𝒟_test; confirming a subset of policies as meeting predetermined criteria; and if no new policies π_i are confirmed, redeploying the current set of policies 𝒫. (Dependent claims: 19, 20.)
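The iteration recited in claim 18 can be sketched as a single function; the callbacks `collect`, `generate_candidates`, and `passes_safety_test` are hypothetical placeholders standing in for the environment interaction, candidate generation, and confirmation criteria, not the patent's implementation:

```python
import random

def diverse_exploration_iteration(confirmed_policies, n, collect,
                                  generate_candidates, passes_safety_test):
    """One policy improvement iteration: deploy the confirmed set, collect
    trajectories uniformly over its policies, split them into train/test,
    generate and evaluate candidates, and confirm a safe subset."""
    d_train, d_test = [], []
    # n trajectories uniformly distributed over the policies in the set
    per_policy = max(1, n // len(confirmed_policies))
    for pi in confirmed_policies:
        d_i = [collect(pi) for _ in range(per_policy)]  # trajectories from pi
        random.shuffle(d_i)
        split = len(d_i) // 2
        d_train.extend(d_i[:split])   # partition D_i and append to D_train ...
        d_test.extend(d_i[split:])    # ... and to D_test
    candidates = generate_candidates(d_train)
    newly_confirmed = [pi for pi in candidates
                       if passes_safety_test(pi, d_test)]
    # if no new policies are confirmed, redeploy the current set
    return newly_confirmed if newly_confirmed else confirmed_policies
```

Returning the prior set when no candidate is confirmed is what keeps every deployed iteration safe: the agent never falls back to an unvetted policy.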
21. An apparatus for learning and deploying a set of behavior policies for an artificial agent, selected from a set of behavior policies, to control a system within an environment, comprising:
an input configured to receive data from operation of the system according to a respective behavior policy;

at least one automated processor configured to iteratively generate sets of behavior policies based on prior received data, which iteratively improve a behavior policy for each iteration of policy improvement, each behavior policy having a statistically expected return no worse than a lower bound of policy performance which excludes a portion of the set of behavior policies and which is ensured as being safe, employing a diverse exploration strategy which strives for behavior diversity in a space of stochastic policies by deploying a diverse set comprising a plurality of behavior policies during each iteration of policy improvement; and

at least one output configured to control the system in accordance with a respective behavior policy of the set of behavior policies. (Dependent claims: 22, 23, 24.)
Specification