Control policy learning and vehicle control method based on reinforcement learning without active exploration
Abstract
A computer-implemented method is provided for autonomously controlling a vehicle to perform a vehicle operation. The method includes steps of applying a passive actor-critic reinforcement learning method to passively-collected data relating to the vehicle operation, to learn a control policy configured for controlling the vehicle so as to perform the vehicle operation with a minimum expected cumulative cost; and controlling the vehicle in accordance with the control policy to perform the vehicle operation.
13 Claims
1. A computer-implemented method for autonomously controlling a vehicle to perform a vehicle operation, the method comprising steps of:
- applying a passive actor-critic reinforcement learning method to passively-collected data relating to the vehicle operation, to adapt an existing control policy so as to enable control of the vehicle by the control policy so as to perform the vehicle operation with a minimum expected cumulative cost, the step of applying a passive actor-critic reinforcement learning method to passively-collected data including steps of:
a) in a critic network, estimating a Z-value and an average cost under an optimal control policy using samples of the passively collected data;
b) in an actor network operatively coupled to the critic network, revising the control policy using samples of the passively collected data, the estimated Z-value, and the estimated average cost under an optimal control policy from the critic network; and
c) iteratively repeating steps (a)-(b) until the estimated average cost converges; and
- controlling the vehicle in accordance with the adapted control policy to perform the vehicle operation.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A computer-implemented method for optimizing a control policy usable for controlling a system to perform an operation, the method comprising steps of:
- providing a control policy usable for controlling the system; and
- applying a passive actor-critic reinforcement learning method to passively-collected data relating to the operation to be performed, to revise the control policy such that the control policy is operable to control the system to perform the operation with a minimum expected cumulative cost, wherein the step of applying a passive actor-critic reinforcement learning method to passively-collected data includes steps of:
a) in a critic network, estimating a Z-value using samples of the passively-collected data, and estimating an average cost under an optimal policy using samples of the passively-collected data;
b) in an actor network, revising the control policy using samples of the passively-collected data, a control dynamics for the system, a cost-to-go, and a control gain;
c) updating parameters used in revising the control policy and in estimating the Z-value and the average cost under an optimal policy; and
d) iteratively repeating steps (a)-(c) until the estimated average cost converges.
13. A computing system configured for optimizing a control policy usable for autonomously controlling a vehicle to perform a vehicle operation, the computing system including one or more processors for controlling operation of the computing system, and a memory for storing data and program instructions usable by the one or more processors, wherein the memory is configured to store computer code that, when executed by the one or more processors, causes the one or more processors to:
a) receive passively-collected data relating to the vehicle operation;
b) determine a Z-value function usable for estimating a cost-to-go for the vehicle;
c) in a critic network in the computing system:
c1) determine a Z-value using the Z-value function and samples of the passively-collected data;
c2) estimate an average cost under an optimal policy using samples of the passively-collected data;
d) in an actor network in the computing system, revise the control policy using samples of the passively-collected data, a control dynamics for the vehicle, a cost-to-go, and a control gain; and
e) iteratively repeat steps (c) and (d) until the estimated average cost converges.
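For illustration only, the critic/actor iteration recited in claims 1, 12, and 13 can be sketched for a small tabular problem. The sketch is not the claimed method. It assumes a linearly solvable Markov decision process in which the Z-value is the exponentiated negative cost-to-go, Z(x) = exp(-V(x)); it replaces the critic and actor networks with lookup tables; and it builds an empirical transition matrix from the passive samples. The names passive_actor_critic, P_hat, and q, and the three-state example, are hypothetical and do not come from the patent.

```python
import numpy as np

def passive_actor_critic(data, q, n_states, tol=1e-10, max_iter=10_000):
    """Tabular sketch: learn from passive transitions only (no exploration).

    data: list of (state, next_state) pairs recorded without control input.
    q:    per-state cost vector (assumed known for this illustration).
    Assumes every state appears at least once as a source state in data.
    """
    # Empirical passive dynamics estimated from the recorded transitions.
    counts = np.zeros((n_states, n_states))
    for x, x_next in data:
        counts[x, x_next] += 1.0
    P_hat = counts / counts.sum(axis=1, keepdims=True)

    # Assumed linearized Bellman operator: exp(-c) * z = diag(exp(-q)) @ P @ z.
    G = np.exp(-q)[:, None] * P_hat

    z = np.full(n_states, 1.0 / n_states)  # initial Z-values, normalized to sum 1
    policy = P_hat.copy()                  # start from the passive behavior
    c_prev = np.inf
    for _ in range(max_iter):
        # (a) Critic: one power-iteration sweep updates Z and the average cost.
        z_new = G @ z
        lam = z_new.sum()        # eigenvalue estimate, valid because sum(z) == 1
        z = z_new / lam
        c = -np.log(lam)         # estimated average cost

        # (b) Actor: reweight passive transitions by the Z-value of the next state.
        weighted = P_hat * z[None, :]
        policy = weighted / weighted.sum(axis=1, keepdims=True)

        # (c) Repeat until the estimated average cost stops changing.
        if abs(c - c_prev) < tol:
            break
        c_prev = c
    return z, c, policy, P_hat

# Hypothetical three-state example: cost falls from state 0 to state 2.
rng = np.random.default_rng(0)
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
q = np.array([1.0, 0.5, 0.0])

# Passively collected data: roll out the uncontrolled dynamics.
x, data = 0, []
for _ in range(5000):
    x_next = int(rng.choice(3, p=P[x]))
    data.append((x, x_next))
    x = x_next

z, c, policy, P_hat = passive_actor_critic(data, q, n_states=3)
```

Under these assumptions the revised policy shifts probability mass toward low-cost states relative to the passive behavior, and the loop stops once the average-cost estimate has settled, mirroring the convergence test in the claims.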
Specification