APPROXIMATE VALUE ITERATION WITH COMPLEX RETURNS BY BOUNDING

US 20180012137A1
Filed: 11/22/2016
Published: 01/11/2018
Est. Priority Date: 11/24/2015
Status: Active Grant

Abstract

A control system and method for controlling a system, which employs a data set representing a plurality of states and associated trajectories of an environment of the system; and which iteratively determines an estimate of an optimal control policy for the system. The iterative process performs the substeps, until convergence, of estimating a long term value for operation at a respective state of the environment over a series of predicted future environmental states; using a complex return of the data set to determine a bound to improve the estimated long term value; and producing an updated estimate of an optimal control policy dependent on the improved estimate of the long term value. The control system may produce an output signal to control the system directly, or output the optimized control policy. The system preferably is a reinforcement learning system which continually improves.
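
The iterative process in the abstract can be illustrated with a minimal tabular sketch. This is not the patent's implementation: the function names, the tuple layout of a transition, and above all the choice of a plain average of n-step returns as the "complex return" are assumptions made for illustration. The sketch only shows the shape of the loop: estimate a long term value, raise it to a bound derived from multi-step returns along the stored trajectories, and repeat until the estimates stop changing.

```python
from typing import Dict, List, Tuple

# One stored transition: (state, action, reward, next_state). Layout is an assumption.
Transition = Tuple[int, int, float, int]
Trajectory = List[Transition]
QTable = Dict[Tuple[int, int], float]


def n_step_return(q: QTable, traj: Trajectory, t: int, n: int,
                  gamma: float, n_actions: int) -> float:
    """Discounted sum of n observed rewards plus the discounted greedy value after step n."""
    total = sum((gamma ** i) * traj[t + i][2] for i in range(n))
    state_after = traj[t + n - 1][3]
    # States never seen as a source state (e.g. terminal states) default to value 0.
    bootstrap = max(q.get((state_after, a), 0.0) for a in range(n_actions))
    return total + (gamma ** n) * bootstrap


def bounded_target(q: QTable, traj: Trajectory, t: int, gamma: float,
                   n_actions: int, max_n: int) -> float:
    """Maximum of the 1-step return and an averaged multi-step (complex) return."""
    horizon = min(max_n, len(traj) - t)
    returns = [n_step_return(q, traj, t, n, gamma, n_actions)
               for n in range(1, horizon + 1)]
    complex_return = sum(returns) / len(returns)  # assumed form of the complex return
    return max(returns[0], complex_return)


def bounded_value_iteration(trajectories: List[Trajectory], n_actions: int,
                            gamma: float = 0.9, max_n: int = 3,
                            tol: float = 1e-8, max_iters: int = 1000) -> QTable:
    """Repeat the estimate/bound/update substeps until the value table stops changing."""
    q: QTable = {}
    for _ in range(max_iters):
        targets: Dict[Tuple[int, int], List[float]] = {}
        for traj in trajectories:
            for t, (s, a, _, _) in enumerate(traj):
                y = bounded_target(q, traj, t, gamma, n_actions, max_n)
                targets.setdefault((s, a), []).append(y)
        # Tabular "fit": average the targets observed for each state-action pair.
        new_q = {k: sum(v) / len(v) for k, v in targets.items()}
        delta = max((abs(new_q[k] - q.get(k, 0.0)) for k in new_q), default=0.0)
        q = new_q
        if delta < tol:
            break
    return q


def greedy_policy(q: QTable, n_actions: int) -> Dict[int, int]:
    """Policy that picks, in each seen state, the action with the highest estimate."""
    states = {s for (s, _) in q}
    return {s: max(range(n_actions), key=lambda a: q.get((s, a), float("-inf")))
            for s in states}


if __name__ == "__main__":
    # Two short trajectories on a three-state chain; state 2 is terminal.
    data = [[(0, 1, 0.0, 1), (1, 1, 1.0, 2)], [(1, 0, 0.0, 2)]]
    q_table = bounded_value_iteration(data, n_actions=2)
    print(q_table, greedy_policy(q_table, n_actions=2))
```

In this toy data set the multi-step return lets the reward at the end of the first trajectory reach state 0 in the first sweep, instead of waiting one sweep per step as a purely 1-step update would.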

20 Claims

1. A method for controlling a system, comprising:
- providing a set of data representing a plurality of states and associated trajectories of an environment of the system;
  
  iteratively determining an estimate of an optimal control policy for the system, comprising performing the substeps until convergence:
  
  estimating a long term value for operation at a respective state of the environment over a series of predicted future environmental states;
  
  using a complex return of the data set to determine a bound to improve the estimated long term value; and
  
  producing an updated estimate of an optimal control policy dependent on the improved estimate of the long term value; and
  
  at least one of:
  
  updating an automated controller for controlling the system with the updated estimate of the optimal control policy, wherein the automated controller operates according to the updated estimate of the optimal control policy to automatically alter at least one of a state of the system and the environment of the system; and
  
  controlling the system with the updated estimate of the optimal control policy, according to the updated estimate of the optimal control policy to automatically alter at least one of a state of the system and the environment of the system.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
- - 2. The method according to claim 1, wherein said using a complex return of the data set as a bound to improve the estimated long term value comprises using a truncated portion of a trajectory which is consistent with the estimate of the optimal control policy, to estimate the complex return, without introducing off-policy bias.
  - 3. The method according to claim 2, wherein the truncated portion of the trajectory comprises a predetermined number of sequential data.
  - 4. The method according to claim 2, wherein the truncated portion of the trajectory is truncated dependent on whether a sequential datum is on-policy or off-policy.
  - 5. The method according to claim 1, wherein an inherent negative bias of the complex return is employed as a lower bound for the estimate of the long term value.
  - 6. The method according to claim 1, wherein a trajectory comprises an ordered collection of observations, and the long term value is the sum of the discounted values of a reward received for each observation plus the maximum discounted estimated value for operation at the estimated optimal policy.
  - 7. The method according to claim 1, wherein said iteratively determining comprises:
  - 8. The method according to claim 1, wherein the bound to improve the estimated long term value is a bounded return representing the maximum of an unbiased estimator and a complex return function.
  - 9. The method according to claim 1, wherein said iteratively determining comprises:
  - 10. The method according to claim 1, further comprising predicting an upper bound for the estimated optimal control policy.
  - 11. The method according to claim 10, wherein the upper bound for a value associated with a respective state is determined based on at least looking backward along a respective trajectory, to provide an estimate of a respective environment of the system at the respective state, as an inflated value of the past environment of the system to achieve the respective environment.
  - 12. The method according to claim 1, further comprising using the updated estimate of an optimal control policy to control a controlled system.
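
The quantities named in claims 6 and 8 can be written out as formulas. The notation below is an assumption introduced for illustration and does not appear in the claim text: $r_{t+i}$ is the reward observed $i$ steps after time $t$ along a trajectory, $\gamma$ is the discount factor, $\hat{Q}$ is the current value estimate, $R_t^{(n)}$ is the n-step return, and $R_t^{C}$ stands for whichever complex return (a weighted combination of n-step returns) is used.

```latex
% Claim 6: discounted rewards along the trajectory plus the discounted
% maximum estimated value at the state reached after n steps.
R_t^{(n)} = \sum_{i=1}^{n} \gamma^{\,i-1}\, r_{t+i}
            \;+\; \gamma^{\,n} \max_{a} \hat{Q}(s_{t+n}, a)

% Claim 8: the bounded return is the larger of an unbiased estimator
% (taken here to be the 1-step return) and the complex return.
R_t^{B} = \max\!\left( R_t^{(1)},\; R_t^{C} \right)
```

Read this way, claim 5 explains why the maximum is a sensible choice: if the complex return tends to underestimate the true value, it can only serve as a lower bound, so it replaces the 1-step estimate only when it is larger.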

13. A control system, comprising:
- a memory configured to store a set of data representing a plurality of states and associated trajectories of an environment of the system; and
  
  at least one automated processor, configured to process the data in the memory, according to an algorithm comprising:
  
  iteratively determining an estimate of an optimal control policy for the system, comprising performing the substeps until convergence:
  
  estimating a long term value for operation at a current state of the environment over a series of predicted future environmental states;
  
  using a complex return of the data set to determine a bound to improve the estimated long term value; and
  
  producing an updated estimate of an optimal control policy dependent on the improved estimate of the long term value.
- Dependent Claims (14, 15, 16, 17, 18, 19)
- - 14. The control system according to claim 13, further comprising an automated communication interface, configured to automatically communicate at least one of:
    - the updated estimate of the optimal control policy to a controller configured to automatically control the system, to at least one of change an operating state of the system and change an environment of the system; and
      
      a control signal for automatically controlling the system, to at least one of change an operating state of the system and change an environment of the system, dependent on the updated estimate of the optimal control policy.
  - 15. The control system according to claim 13, wherein the algorithm uses the complex return of the data set as a bound to improve the estimated long term value by truncating the trajectory to a truncated portion which is consistent with the estimate of the optimal control policy, to estimate the complex return, without introducing off-policy bias.
  - 16. The control system according to claim 13, wherein the truncated portion of the trajectory at least one of:
    - comprises a predetermined number of sequential data; and
      
      is truncated dependent on whether a sequential datum is on-policy or off-policy.
  - 17. The control system according to claim 13, wherein the algorithm:
    - employs an inherent negative bias of the complex return as a lower bound for the estimate of the long term value; and
      
      further comprises predicting an upper bound for the estimated long term value.
  - 18. The control system according to claim 13, wherein the trajectory comprises an ordered collection of observations, and the long term value is the sum of the discounted values of a reward received for each observation plus the maximum discounted estimated value for operation at the estimated optimal policy.
  - 19. The control system according to claim 13, wherein the bound to improve the estimated long term value is a bounded return representing the maximum of an unbiased estimator and a complex return function.
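
Claims 15 and 16 describe truncating a trajectory so that only the portion consistent with the current policy estimate feeds the complex return. A minimal sketch of one way to pick that truncation point follows. The function names, the dictionary-based value table, and the rule that the first action is always usable (because it is the action being evaluated) are assumptions for illustration, not details taken from the claims.

```python
from typing import Dict, List, Tuple

Transition = Tuple[int, int, float, int]  # (state, action, reward, next_state); assumed layout
QTable = Dict[Tuple[int, int], float]


def greedy_action(q: QTable, state: int, n_actions: int) -> int:
    """Action with the highest current estimate; ties go to the lowest index."""
    return max(range(n_actions), key=lambda a: q.get((state, a), 0.0))


def usable_horizon(q: QTable, traj: List[Transition], t: int,
                   n_actions: int, max_n: int) -> int:
    """Number of steps from index t that can be used in a multi-step return.

    The count stops at a predetermined cap (max_n), at the end of the
    trajectory, or at the first later action that the greedy policy
    would not have chosen, whichever comes first.
    """
    n = 1  # the action taken at t is the one being evaluated, so it always counts
    while n < max_n and t + n < len(traj):
        state, action, _, _ = traj[t + n]
        if action != greedy_action(q, state, n_actions):
            break  # off-policy step: truncate here
        n += 1
    return n


if __name__ == "__main__":
    q = {(1, 0): 0.0, (1, 1): 1.0, (2, 0): 0.5, (2, 1): 0.2}
    traj = [(0, 0, 0.0, 1), (1, 1, 0.0, 2), (2, 1, 1.0, 3), (3, 0, 0.0, 4)]
    print(usable_horizon(q, traj, t=0, n_actions=2, max_n=4))
```

Setting `max_n` alone corresponds to the fixed-length option in claim 16, while the greedy-action check corresponds to the on-policy or off-policy option; the sketch applies both at once.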

20. A computer readable medium storing nontransitory instructions for controlling at least one automated processor, comprising:
- nontransitory instructions for controlling the at least one automated processor to perform an algorithm comprising:
  
  iteratively determining an estimate of an optimal control policy for a system based on a set of data representing a plurality of states and associated trajectories of an environment of the system, comprising performing the substeps until convergence:
  
  estimating a long term value for operation at a current state of the environment over a series of predicted future environmental states;
  
  using a complex return of the data set to determine a bound to improve the estimated long term value; and
  
  producing an updated estimate of an optimal control policy dependent on the improved estimate of the long term value; and
  
  nontransitory instructions for controlling the at least one automated processor to at least one of:
  
  update an automated controller for controlling the system with the updated estimate of the optimal control policy, wherein the automated controller operates according to the updated estimate of the optimal control policy to automatically alter at least one of a state of the system and the environment of the system; and
  
  control the system with the updated estimate of the optimal control policy, according to the updated estimate of the optimal control policy to automatically alter at least one of a state of the system and the environment of the system.

Current Assignee: The Research Foundation for The State University of New York (State University of New York)
Original Assignee: The Research Foundation for The State University of New York (State University of New York)
Inventors: Wright, Robert; Yu, Lei; Loscalzo, Steven

Granted Patent: US 10,839,302 B2
CPC Class Codes

G05B 13/0265   the criterion being a learn...

G05B 15/02   electric

G06N 20/00   Machine learning

G06N 7/01   Probabilistic graphical mod...

Y02B 10/30   Wind power
