APPROXIMATE VALUE ITERATION WITH COMPLEX RETURNS BY BOUNDING
Abstract
A control system and method for controlling a system, which employs a data set representing a plurality of states and associated trajectories of an environment of the system; and which iteratively determines an estimate of an optimal control policy for the system. The iterative process performs the substeps, until convergence, of estimating a long term value for operation at a respective state of the environment over a series of predicted future environmental states; using a complex return of the data set to determine a bound to improve the estimated long term value; and producing an updated estimate of an optimal control policy dependent on the improved estimate of the long term value. The control system may produce an output signal to control the system directly, or output the optimized control policy. The system preferably is a reinforcement learning system which continually improves.
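The iterative process summarized in the abstract can be illustrated with a minimal tabular sketch. It is not taken from the specification and is only one possible reading of "using a complex return of the data set to determine a bound". It assumes that the complex return is the largest n-step return observed along a stored trajectory, that this value serves as a lower bound on the ordinary one-step target, and that the environment is deterministic enough for that bound to be reasonable. The function name `bounded_value_iteration`, the `(state, action, reward, next_state)` tuple layout, and the averaging "fit" step are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def bounded_value_iteration(trajectories, actions, gamma=0.9, tol=1e-8, max_iters=1000):
    """Illustrative sketch: value iteration over stored trajectories with bounded targets.

    trajectories: list of trajectories, each a list of
                  (state, action, reward, next_state) tuples (assumed layout).
    actions:      list of all actions the controller may take.
    Returns (q, policy): the value table and a greedy policy for states seen in the data.
    """
    q = defaultdict(float)  # unseen (state, action) pairs are valued at 0.0

    def state_value(s):
        # Long term value of a state under the current greedy estimate.
        return max(q[(s, a)] for a in actions)

    for _ in range(max_iters):
        targets = defaultdict(list)
        for traj in trajectories:
            for t, (s, a, _, _) in enumerate(traj):
                best = float("-inf")
                discounted_sum = 0.0
                # n = 1 gives the usual one-step target; longer n-step returns
                # along the same trajectory can only raise it (the assumed bound).
                for n, (_, _, r, s_next) in enumerate(traj[t:], start=1):
                    discounted_sum += gamma ** (n - 1) * r
                    n_step = discounted_sum + gamma ** n * state_value(s_next)
                    best = max(best, n_step)
                targets[(s, a)].append(best)

        # "Fit" step: in this tabular sketch, average the bounded targets per pair.
        new_q = {k: sum(v) / len(v) for k, v in targets.items()}
        delta = max((abs(new_q[k] - q[k]) for k in new_q), default=0.0)
        q.update(new_q)
        if delta < tol:  # converged
            break

    # Updated estimate of the control policy: act greedily on the improved values.
    states = {s for traj in trajectories for (s, _, _, _) in traj}
    policy = {s: max(actions, key=lambda a: q[(s, a)]) for s in states}
    return q, policy


# Tiny example: state 0 can "stay" (reward 0) or go "right" to state 1;
# from state 1, "right" reaches terminal state 2 with reward 1.
example_trajectories = [
    [(0, "right", 0.0, 1), (1, "right", 1.0, 2)],
    [(0, "stay", 0.0, 0)],
]
q_table, greedy_policy = bounded_value_iteration(
    example_trajectories, ["stay", "right"], gamma=0.9
)
print(greedy_policy)
```

In this sketch the returned `greedy_policy` stands in for the "updated estimate of an optimal control policy". A controller would either be updated with it or would use it directly to choose actions. A practical system would likely replace the table with a function approximator, which this example does not attempt.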
20 Claims
1. A method for controlling a system, comprising:
providing a set of data representing a plurality of states and associated trajectories of an environment of the system;
iteratively determining an estimate of an optimal control policy for the system, comprising performing the substeps until convergence:
estimating a long term value for operation at a respective state of the environment over a series of predicted future environmental states;
using a complex return of the data set to determine a bound to improve the estimated long term value; and
producing an updated estimate of an optimal control policy dependent on the improved estimate of the long term value; and
at least one of:
updating an automated controller for controlling the system with the updated estimate of the optimal control policy, wherein the automated controller operates according to the updated estimate of the optimal control policy to automatically alter at least one of a state of the system and the environment of the system; and
controlling the system with the updated estimate of the optimal control policy, according to the updated estimate of the optimal control policy to automatically alter at least one of a state of the system and the environment of the system.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.
13. A control system, comprising:
a memory configured to store a set of data representing a plurality of states and associated trajectories of an environment of the system; and
at least one automated processor, configured to process the data in the memory, according to an algorithm comprising:
iteratively determining an estimate of an optimal control policy for the system, comprising performing the substeps until convergence:
estimating a long term value for operation at a current state of the environment over a series of predicted future environmental states;
using a complex return of the data set to determine a bound to improve the estimated long term value; and
producing an updated estimate of an optimal control policy dependent on the improved estimate of the long term value.
Dependent claims: 14, 15, 16, 17, 18, 19.
20. A computer readable medium storing nontransitory instructions for controlling at least one automated processor, comprising:
nontransitory instructions for controlling the at least one automated processor to perform an algorithm comprising:
iteratively determining an estimate of an optimal control policy for a system based on a set of data representing a plurality of states and associated trajectories of an environment of the system, comprising performing the substeps until convergence:
estimating a long term value for operation at a current state of the environment over a series of predicted future environmental states;
using a complex return of the data set to determine a bound to improve the estimated long term value; and
producing an updated estimate of an optimal control policy dependent on the improved estimate of the long term value; and
nontransitory instructions for controlling the at least one automated processor to at least one of:
update an automated controller for controlling the system with the updated estimate of the optimal control policy, wherein the automated controller operates according to the updated estimate of the optimal control policy to automatically alter at least one of a state of the system and the environment of the system; and
control the system with the updated estimate of the optimal control policy, according to the updated estimate of the optimal control policy to automatically alter at least one of a state of the system and the environment of the system.
Specification