Online learning and vehicle control method based on reinforcement learning without active exploration

US 10,065,654 B2
Filed: 07/08/2016
Issued: 09/04/2018
Est. Priority Date: 07/08/2016
Status: Expired due to Fees

Abstract

A computer-implemented method of adaptively controlling an autonomous operation of a vehicle is provided. The method includes steps of (a) in a critic network in a computing system configured to autonomously control the vehicle, determining, using samples of passively collected data and a state cost, an estimated average cost, and an approximated cost-to-go function that produces a minimum value for a cost-to-go of the vehicle when applied by an actor network; and (b) in an actor network in the computing system and operatively coupled to the critic network, determining a control input to apply to the vehicle that produces the minimum value for the cost-to-go, wherein the actor network is configured to determine the control input by estimating a noise level using the average cost, a cost-to-go determined from the approximated cost-to-go function, a control dynamics for a current state of the vehicle, and the passively collected data.

19 Claims

1. A computer-implemented method of adaptively controlling an autonomous operation of a vehicle, the method comprising:
- a) in a critic network in a computing system configured to autonomously control the vehicle, determining, using samples of passively collected data and a state cost, an estimated average cost, and an approximated cost-to-go function that produces a minimum value for a cost-to-go of the vehicle when applied by an actor network; and
  
  b) in an actor network in the computing system and operatively coupled to the critic network, determining a control input to apply to the vehicle which produces the minimum value for the cost-to-go, wherein the actor network is configured to determine the control input by estimating a noise level using the estimated average cost, an estimated cost-to-go determined from the approximated cost-to-go function, a control dynamics for a current state of the vehicle, and the samples of passively collected data, and wherein the approximated cost-to-go function is determined using a linear combination of weighted radial basis functions in accordance with the following relationship;
- Dependent Claims (2, 3, 4, 6, 8, 13, 14, 15)
- - 2. The method of claim 1 wherein weights ω used in the approximated cost-to-go function are updated in accordance with the following relationship;
  - 3. The method of claim 1 further comprising the step of updating parameters of the critic network using an approximated temporal difference error determined using a linearized version of a Bellman equation.
  - 4. The method of claim 3 wherein updating of the critic network parameters is performed when the vehicle is in motion.
  - 6. The method of claim 3, wherein passively-collected data is the only data used during updating of the critic network parameters.
  - 8. The method of claim 1, wherein the approximated cost-to-go function is learned by the critic network in real time.
  - 13. The method of claim 1 further comprising the step of, using the control input, revising a control policy usable for controlling the autonomous operation.
  - 14. The method of claim 1 further comprising the step of optimizing a control policy usable for controlling the autonomous operation by iteratively performing steps (a) and (b) to redetermine the control input until convergence of the estimated average cost.
  - 15. The method of claim 14 wherein the control policy is optimized without active exploration.
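
Claims 1 and 2 describe the approximated cost-to-go function as a linear combination of weighted radial basis functions, but the relationship itself is not reproduced in this listing. The following is only a minimal sketch of that idea, assuming Gaussian basis functions; the function names, the centers, the width, and the weights are all hypothetical and are not taken from the patent.

```python
import math


def gaussian_rbf(x, center, width):
    """Gaussian radial basis function (assumed form; the listing does not give one)."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def rbf_features(x, centers, width):
    """Evaluate every basis function at state x."""
    return [gaussian_rbf(x, c, width) for c in centers]


def approx_cost_to_go(x, weights, centers, width):
    """Linear combination of weighted basis functions: sum_j w_j * f_j(x)."""
    feats = rbf_features(x, centers, width)
    return sum(w * f for w, f in zip(weights, feats))


# Hypothetical 2-D state space with three basis centers.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
weights = [0.5, 0.3, 0.2]
z_hat = approx_cost_to_go((0.0, 0.0), weights, centers, width=1.0)
print(z_hat)
```

In a sketch like this, learning the cost-to-go function amounts to adjusting `weights` while the basis functions stay fixed.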

5. A computer-implemented method of adaptively controlling an autonomous operation of a vehicle, the method comprising:
- a) in a critic network in a computing system configured to autonomously control the vehicle, determining, using samples of passively collected data and a state cost, an estimated average cost, and an approximated cost-to-go function that produces a minimum value for a cost-to-go of the vehicle when applied by an actor network; and
  
  b) in an actor network in the computing system and operatively coupled to the critic network, determining a control input to apply to the vehicle which produces the minimum value for the cost-to-go, wherein the actor network is configured to determine the control input by estimating a noise level using the estimated average cost, an estimated cost-to-go determined from the approximated cost-to-go function, a control dynamics for a current state of the vehicle, and the samples of passively collected data, the method further comprising the step of updating parameters of the critic network using an approximated temporal difference error determined using a linearized version of a Bellman equation, and wherein the estimated average cost determined by the critic network is updated in accordance with the following relationship;
  
  $$\hat{Z}_{\mathrm{avg}}^{i+1} = \hat{Z}_{\mathrm{avg}}^{i} - \alpha_2^{i}\, e_k\, \hat{Z}_k$$
  
  where $\beta$ is a learning rate, $e_k$ is the approximated temporal difference error, $\hat{Z}_k$ is an estimated cost determined from the approximated cost-to-go function, $\hat{Z}_{\mathrm{avg}}^{i}$ is an estimated average cost in state i, and $\hat{Z}_{\mathrm{avg}}^{i+1}$ is an estimated average cost in state i+1.
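
The average-cost update in claim 5 is a single gradient-style step. The following sketch shows the arithmetic only; the function and argument names are illustrative, the numeric values are made up, and the temporal difference error is simply passed in as a number.

```python
def update_average_cost(z_avg, learning_rate, td_error, z_k):
    """One step of the claim 5 update: Z_avg <- Z_avg - rate * e_k * Z_k."""
    return z_avg - learning_rate * td_error * z_k


# Hypothetical values: current estimate 1.0, rate 0.1, error 0.2, Z_k = 0.5.
new_z_avg = update_average_cost(z_avg=1.0, learning_rate=0.1, td_error=0.2, z_k=0.5)
print(new_z_avg)
```

A positive error with a positive cost-to-go estimate lowers the average-cost estimate, and a zero error leaves it unchanged.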

7. A computer-implemented method of adaptively controlling an autonomous operation of a vehicle, the method comprising:
- a) in a critic network in a computing system configured to autonomously control the vehicle, determining, using samples of passively collected data and a state cost, an estimated average cost, and an approximated cost-to-go function that produces a minimum value for a cost-to-go of the vehicle when applied by an actor network; and
  
  b) in an actor network in the computing system and operatively coupled to the critic network, determining a control input to apply to the vehicle which produces the minimum value for the cost-to-go, wherein the actor network is configured to determine the control input by estimating a noise level using the estimated average cost, an estimated cost-to-go determined from the approximated cost-to-go function, a control dynamics for a current state of the vehicle, and the samples of passively collected data, the method further comprising the step of updating parameters of the critic network using an approximated temporal difference error determined in accordance with the following relationship;
  
  $$e_k = \hat{Z}_{\mathrm{avg}}\, \hat{Z}_k - \exp(-q_k)\, \hat{Z}_{k+1}$$
  
  where $e_k$ is the approximated temporal difference error, $\hat{Z}_{\mathrm{avg}}$ is an estimated average cost, $\hat{Z}_k$ is an estimated cost-to-go in a state k, $\hat{Z}_{k+1}$ is an estimated cost-to-go in a state k+1, and $q_k$ is a state cost in the state k.
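
The temporal difference error of claim 7 can be evaluated directly from one passively collected transition. The sketch below is illustrative only: the function name and the sample values are assumptions, and the cost-to-go estimates are supplied as plain numbers rather than computed by a critic network.

```python
import math


def td_error(z_avg, z_k, z_next, q_k):
    """Claim 7 relationship: e_k = Z_avg * Z_k - exp(-q_k) * Z_{k+1}."""
    return z_avg * z_k - math.exp(-q_k) * z_next


# Hypothetical transition with zero state cost and matching estimates.
e_consistent = td_error(z_avg=1.0, z_k=0.8, z_next=0.8, q_k=0.0)

# Hypothetical transition where the estimates disagree.
e_mismatch = td_error(z_avg=0.9, z_k=1.0, z_next=0.5, q_k=0.0)
print(e_consistent, e_mismatch)
```

The error vanishes exactly when the product of the average-cost and current estimates equals the discounted next estimate, which is the condition the critic update drives toward.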

9. A computer-implemented method of adaptively controlling an autonomous operation of a vehicle, the method comprising:
- a) in a critic network in a computing system configured to autonomously control the vehicle, determining, using samples of passively collected data and a state cost, an estimated average cost, and an approximated cost-to-go function that produces a minimum value for a cost-to-go of the vehicle when applied by an actor network; and
  
  b) in an actor network in the computing system and operatively coupled to the critic network, determining a control input to apply to the vehicle which produces the minimum value for the cost-to-go, wherein the actor network is configured to determine the control input by estimating a noise level using the estimated average cost, an estimated cost-to-go determined from the approximated cost-to-go function, a control dynamics for a current state of the vehicle, and the samples of passively collected data, and wherein the noise level is learned using a linear combination of weighted basis functions in accordance with the relationship;
- Dependent Claims (10, 11, 12)
- - 10. The method of claim 9 further comprising the step of updating a weight parameter of the actor network using an approximated error determined in accordance with the following relationship:
    - $$d_k \approx q_k \Delta t - \hat{V}_{k+1} + \hat{V}_{\mathrm{avg}} + L_{k,k+1}\, \rho_k,$$
      
      where $d_k$ is the approximated error, $q_k$ is a state cost in state k, $\hat{V}_k$ is an approximated cost-to-go in state k, $\hat{V}_{k+1}$ is an approximated cost-to-go in state k+1, $\hat{V}_{\mathrm{avg}}$ is an approximated average cost, and
      
      $$L_{k,k+1} = (0.5\,\hat{V}_k - \hat{V}_{k+1})^{\tau} B_k B_k^{\tau} \hat{V}_k\, \Delta t$$
      
      where $B_k$ is a control dynamics in state k.
  - 11. The method of claim 10 wherein updating of the actor network weight parameter is performed when the vehicle is in motion.
  - 12. The method of claim 10 wherein a weight parameter of the actor network is updated in accordance with the following relationship:
    - $$\mu^{i+1} = \mu^{i} - \beta^{i}\, d_k\, L_{k,k+1}\, g_k,$$
      
      where $\mu^{i+1}$ is a value of the weight parameter in a state i+1, $\mu^{i}$ is the value of the weight parameter in a state i, $\beta^{i}$ is a learning rate, $d_k$ is a temporal difference error, and g is a radial basis function.
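
The relationships in claims 10 and 12 can be traced numerically. The following is a rough sketch under several assumptions that are not stated in the claims: the state is one-dimensional, so every quantity is a scalar and the superscript τ (read here as a transpose) has no effect; the noise level ρ_k is passed in as a number; and all names and values are hypothetical.

```python
def coupling_term(v_k, v_next, b_k, dt):
    """Scalar form of L_{k,k+1} = (0.5*V_k - V_{k+1}) * B_k * B_k * V_k * dt."""
    return (0.5 * v_k - v_next) * b_k * b_k * v_k * dt


def actor_error(q_k, dt, v_next, v_avg, l_k, rho_k):
    """Claim 10 relationship: d_k ~ q_k*dt - V_{k+1} + V_avg + L_{k,k+1}*rho_k."""
    return q_k * dt - v_next + v_avg + l_k * rho_k


def update_actor_weights(mu, beta, d_k, l_k, g_k):
    """Claim 12 relationship applied per weight: mu <- mu - beta*d_k*L*g_k."""
    return [m - beta * d_k * l_k * g for m, g in zip(mu, g_k)]


# Hypothetical single transition.
mu = [0.5, 0.5]        # actor weights
g_k = [1.0, 0.0]       # basis function values at state k
rho_k = 0.5            # noise level estimate, supplied directly (assumption)
l_k = coupling_term(v_k=2.0, v_next=0.5, b_k=1.0, dt=0.1)
d_k = actor_error(q_k=1.0, dt=0.1, v_next=0.5, v_avg=0.4, l_k=l_k, rho_k=rho_k)
new_mu = update_actor_weights(mu, beta=1.0, d_k=d_k, l_k=l_k, g_k=g_k)
print(l_k, d_k, new_mu)
```

Only the weight whose basis function is active at state k moves in this example, which is the usual behavior of updates on localized basis functions.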

16. A computing system configured for adaptively controlling an autonomous operation of a vehicle, the computing system comprising one or more processors for controlling operation of the computing system, and a memory for storing data and program instructions usable by the one or more processors, wherein the one or more processors are configured to execute instructions stored in the memory to:
- a) determine, using samples of passively collected data and a state cost, an estimated average cost, and an approximated cost-to-go function that produces a minimum value for a cost-to-go of the vehicle; and
  
  b) determine a control input to apply to the vehicle that produces the minimum value for the cost-to-go, wherein the one or more processors are configured to determine the control input by estimating a noise level using the estimated average cost, a cost-to-go determined from the approximated cost-to-go function, a control dynamics for a current state of the vehicle, and the samples of passively collected data, wherein the approximated cost-to-go function is determined using a linear combination of weighted radial basis functions in accordance with the following relationship;
- Dependent Claims (17)
- - 17. The computing system of claim 16 wherein the one or more processors are configured to execute instructions stored in the memory to optimize a control policy usable for controlling the autonomous operation by iteratively repeating steps (a) and (b) to redetermine the control input until convergence of the estimated average cost.

18. A non-transitory computer readable medium having stored therein instructions executable by a computer system to cause the computer system to perform functions, the functions comprising:
- a) determining, using samples of passively collected data and a state cost, an estimated average cost, and an approximated cost-to-go function that produces a minimum value for a cost-to-go of a vehicle; and
  
  b) determining a control input to apply to the vehicle to control an autonomous operation of the vehicle, wherein the control input produces the minimum value for the cost-to-go, and wherein the control input is determined by estimating a noise level using the average cost, a cost-to-go determined from the approximated cost-to-go function, a control dynamics for a current state of the vehicle, and the samples of passively collected data, wherein the approximated cost-to-go function is determined using a linear combination of weighted radial basis functions in accordance with the following relationship;
- Dependent Claims (19)
- - 19. The non-transitory computer readable medium of claim 18 wherein the instructions are executable to optimize a control policy usable for controlling the autonomous operation by iteratively repeating steps (a) and (b) to redetermine the control input until convergence of the estimated average cost.


Current Assignee: Kabushiki Kaisha Toyota Chuo Kenkyusho (Toyota Motor Corporation)
Original Assignee: Toyota Motor Engineering & Manufacturing North America Incorporated (Toyota Motor Corporation)
Inventors: Nishi, Tomoki
Primary Examiner(s): Smith, Jelani A
Assistant Examiner(s): Martinez Borrero, Luis A

Application Number: US15/205,558
Publication Number: US 20180009445A1
Time in Patent Office: 788 Days
Field of Search: 701/301, 701/45, 701/1, 701/36, 701/23, 701/117, 701/46, 701/49, 701/116, 701/27, 701/28

CPC Class Codes

B60W 2050/0013   Optimal controllers

B60W 2050/0014   Adaptive controllers

B60W 2050/0075   Automatic parameter input, ...

B60W 2050/0088   Adaptive recalibration

B60W 2420/403   Image sensing, e.g. optical...

B60W 2420/408   Radar; Laser, e.g. lidar

B60W 2520/105   Longitudinal acceleration

B60W 2520/125   Lateral acceleration

B60W 2520/14   Yaw

B60W 2520/16   Pitch

B60W 2520/18   Roll

B60W 2520/28   Wheel speed

B60W 2556/00   Input parameters relating t...

B60W 2556/10   Historical data

B60W 50/0098   Details of control systems ...

B60W 50/06   Improving the dynamic respo...

B60W 60/001   Planning or execution of dr...

G05B 13/0265   the criterion being a learn...

G05B 13/041   in which a variable is auto...

G05D 1/0221   involving a learning process

G06N 3/045 : Combinations of networks

G06N 3/08 : Learning methods

G06N 7/01 : Probabilistic graphical mod...

Y02T 10/40 : Engine management systems

Y02T 10/84 : Data processing systems or ...
