METHOD AND APPARATUS FOR IMPROVED REWARD-BASED LEARNING USING ADAPTIVE DISTANCE METRICS

US 20090099985A1
Filed: 10/11/2007
Published: 04/16/2009
Est. Priority Date: 10/11/2007
Status: Active Grant

Abstract

The present invention is a method and an apparatus for reward-based learning of policies for managing or controlling a system or plant. In one embodiment, a method for reward-based learning includes receiving a set of one or more exemplars, where at least two of the exemplars comprise a (state, action) pair for a system, and at least one of the exemplars includes an immediate reward responsive to a (state, action) pair. A distance metric and a distance-based function approximator estimating long-range expected value are then initialized, where the distance metric computes a distance between two (state, action) pairs, and the distance metric and function approximator are adjusted such that a Bellman error measure of the function approximator on the set of exemplars is minimized. A management policy is then derived based on the trained distance metric and function approximator.
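The "Bellman error measure" named in the abstract can be illustrated with a small sketch. The exemplar layout below, which adds a next state and next action to the (state, action) pair and immediate reward, the discount factor `gamma`, and the squared one-step residual are assumptions made for illustration; the abstract does not fix any of them.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical exemplar layout (an assumption, not taken from the patent text):
# (state, action, immediate reward, next state, next action).
Exemplar = Tuple[float, int, float, float, int]


def bellman_error(q: Callable[[float, int], float],
                  exemplars: Sequence[Exemplar],
                  gamma: float = 0.9) -> float:
    """Sum of squared one-step (Sarsa-style) Bellman residuals over the exemplars."""
    total = 0.0
    for s, a, r, s_next, a_next in exemplars:
        # One-step target: immediate reward plus discounted estimate at the next pair.
        target = r + gamma * q(s_next, a_next)
        total += 0.5 * (target - q(s, a)) ** 2
    return total


# Toy data: two exemplars and a value estimate that is constant everywhere.
exemplars = [(0.0, 0, 1.0, 1.0, 1), (1.0, 1, 0.0, 0.0, 0)]


def constant_q(state: float, action: int) -> float:
    return 1.0


error = bellman_error(constant_q, exemplars, gamma=0.5)
```

An error of zero would mean the value estimates are consistent with the observed rewards on every exemplar; training aims to push this quantity down.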

39 Citations

25 Claims

1. A method for learning a management policy, comprising:
- receiving a set of one or more exemplars, where each of the exemplars comprises at least a (state, action) pair for a system;
  
  initializing a distance metric, where the distance metric computes a distance between pairs of exemplars;
  
  initializing a function approximator;
  
  adjusting the distance metric such that a Bellman error measure of the function approximator on the set of exemplars is minimized; and
  
  deriving the management policy from the function approximator.
- Dependent claims (2 to 13):
  - 2. The method of claim 1, wherein the distance metric takes the form of a Mahalanobis distance.
  - 3. The method of claim 2, wherein initializing the distance metric comprises:
    - setting initial values for one or more elements in a positive semi-definite matrix.
  - 4. The method of claim 3, wherein the setting comprises:
    - setting the one or more elements to random values.
  - 5. The method of claim 3, wherein the setting comprises:
    - setting the one or more elements to values that correspond to an identity matrix.
  - 6. The method of claim 1, wherein the function approximator is governed by a set of distances between a (state, action) pair and a set of one or more reference points.
  - 7. The method of claim 6, wherein initializing the function approximator comprises:
    - setting a number of the one or more reference points; and
      
      setting locations for the one or more reference points.
  - 8. The method of claim 6, wherein initializing the function approximator comprises:
    - setting one or more adjustable parameters to initial values.
  - 9. The method of claim 1, wherein the adjusting comprises:
    - performing one or more training sweeps through the set of one or more exemplars to produce a trained distance metric and a trained function approximator.
  - 10. The method of claim 9, wherein the one or more training sweeps are performed in accordance with a Reinforcement Learning algorithm.
  - 11. The method of claim 10, wherein the Reinforcement Learning algorithm is one of:
    - Q-Learning and Sarsa.
  - 12. The method of claim 1, further comprising:
    - applying nonlinear dimensionality reduction to the one or more exemplars and to the learned distance metric in order to embed the one or more exemplars in a lower-dimensional space;
      
      applying reward-based learning to the embedded exemplars in order to obtain a value function over an embedding space;
      
      constructing an out-of-sample embedding function based on the embedded exemplars; and
      
      deriving a management policy from the out-of-sample embedding function and the value function over the embedding space.
  - 13. The method of claim 1, further comprising:
    - applying the learned management policy to manage a computing system or to control a plant.
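Claims 2 to 8 can be read together as a Mahalanobis metric feeding a distance-based function approximator. The sketch below is one possible reading, not the patented implementation: it assumes the positive semi-definite matrix is stored as a factor `A` with `M = A^T A`, and it assumes a Gaussian-kernel weighted average over the reference points. The names `init_factor`, `mahalanobis_sq`, and `q_value` are hypothetical.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]
Matrix = List[List[float]]


def init_factor(dim: int, mode: str = "identity", seed: int = 0) -> Matrix:
    """Return a factor A; M = A^T A is positive semi-definite by construction."""
    if mode == "identity":
        # Claim 5: values that correspond to an identity matrix.
        return [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    # Claim 4: random initial values (the uniform range is an assumption).
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]


def mahalanobis_sq(a_factor: Matrix, x: Vector, y: Vector) -> float:
    """Squared distance (x - y)^T A^T A (x - y), computed as ||A (x - y)||^2."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    proj = [sum(row[k] * diff[k] for k in range(len(diff))) for row in a_factor]
    return sum(p * p for p in proj)


def q_value(a_factor: Matrix, x: Vector,
            ref_points: Sequence[Vector], ref_values: Sequence[float]) -> float:
    """Kernel-weighted average of the values stored at the reference points."""
    weights = [math.exp(-mahalanobis_sq(a_factor, x, p)) for p in ref_points]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # every reference point is too far away to contribute
    return sum(w * v for w, v in zip(weights, ref_values)) / total


identity = init_factor(2, "identity")
d2 = mahalanobis_sq(identity, [1.0, 2.0], [4.0, 6.0])
# A query equidistant from two reference points averages their values.
q_mid = q_value(identity, [0.0, 0.0], [[1.0, 0.0], [-1.0, 0.0]], [0.0, 2.0])
```

Storing the factor rather than the matrix itself is a design choice: any update to `A` keeps the metric positive semi-definite without a separate projection step.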

14. A computer readable medium containing an executable program for learning a management policy, where the program performs the steps of:
- receiving a set of one or more exemplars, where each of the exemplars comprises at least a (state, action) pair for a system;
  
  initializing a distance metric, where the distance metric computes a distance between pairs of exemplars;
  
  initializing a function approximator;
  
  adjusting the distance metric such that a Bellman error measure of the function approximator on the set of exemplars is minimized; and
  
  deriving the management policy from the function approximator.
- Dependent claims (15 to 24):
  - 15. The computer readable medium of claim 14, wherein the distance metric takes the form of a Mahalanobis distance.
  - 16. The computer readable medium of claim 15, wherein initializing the distance metric comprises:
    - setting initial values for one or more elements in a positive semi-definite matrix.
  - 17. The computer readable medium of claim 16, wherein the setting comprises:
    - setting the one or more elements to random values.
  - 18. The computer readable medium of claim 16, wherein the setting comprises:
    - setting the one or more elements to values that correspond to an identity matrix.
  - 19. The computer readable medium of claim 14, wherein the function approximator is governed by a set of distances between a (state, action) pair and a set of one or more reference points.
  - 20. The computer readable medium of claim 19, wherein initializing the function approximator comprises:
    - setting a number of the one or more reference points; and
      
      setting locations for the one or more reference points.
  - 21. The computer readable medium of claim 19, wherein initializing the function approximator comprises:
    - setting one or more adjustable parameters to initial values.
  - 22. The computer readable medium of claim 14, wherein the adjusting comprises:
    - performing one or more training sweeps through the set of one or more exemplars to produce a trained distance metric and a trained function approximator.
  - 23. The computer readable medium of claim 22, wherein the one or more training sweeps are performed in accordance with a Reinforcement Learning algorithm.
  - 24. The computer readable medium of claim 23, wherein the Reinforcement Learning algorithm is one of:
    - Q-Learning and Sarsa.
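The training sweeps in the claims above can be sketched as repeated passes that adjust the metric to lower a Bellman error. Everything specific here is an assumption for illustration: a diagonal metric (a special case of the Mahalanobis form), a kernel-weighted approximator with fixed reference values, Sarsa-style targets, and a finite-difference gradient step that is accepted only when the error does not increase. The claims do not prescribe this update rule, and a fuller version would also adjust the approximator's own parameters.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical exemplar layout: ((state, action) vector, reward, next (state, action) vector).
Exemplar = Tuple[Sequence[float], float, Sequence[float]]


def dist_sq(scales, x, y):
    """Diagonal Mahalanobis form: sum_k (scale_k * (x_k - y_k))^2."""
    return sum((s * (xi - yi)) ** 2 for s, xi, yi in zip(scales, x, y))


def q_value(scales, x, refs, values):
    """Kernel-weighted average of reference values (an assumed approximator)."""
    weights = [math.exp(-dist_sq(scales, x, p)) for p in refs]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * v for w, v in zip(weights, values)) / total


def bellman_error(scales, exemplars, refs, values, gamma):
    """Sum of squared one-step residuals over all exemplars."""
    err = 0.0
    for x, r, x_next in exemplars:
        target = r + gamma * q_value(scales, x_next, refs, values)
        err += 0.5 * (target - q_value(scales, x, refs, values)) ** 2
    return err


def train_metric(scales, exemplars, refs, values,
                 gamma=0.9, sweeps=20, lr=0.1, eps=1e-5):
    """Run batch sweeps; each sweep takes one guarded gradient step on the metric."""
    scales = list(scales)
    best = bellman_error(scales, exemplars, refs, values, gamma)
    for _ in range(sweeps):
        grad: List[float] = []
        for k in range(len(scales)):
            bumped = scales[:k] + [scales[k] + eps] + scales[k + 1:]
            grad.append((bellman_error(bumped, exemplars, refs, values, gamma) - best) / eps)
        candidate = [s - lr * g for s, g in zip(scales, grad)]
        cand_err = bellman_error(candidate, exemplars, refs, values, gamma)
        if cand_err <= best:
            scales, best = candidate, cand_err  # accept the improving step
        else:
            lr *= 0.5  # reject the step and shrink the step size
    return scales, best


refs = [[0.0, 0.0], [1.0, 1.0]]
values = [0.0, 1.0]
exemplars = [([0.0, 0.0], 0.0, [0.2, 0.0]),
             ([1.0, 1.0], 1.0, [0.8, 1.0]),
             ([0.5, 0.0], 0.5, [0.5, 1.0])]
initial_error = bellman_error([1.0, 1.0], exemplars, refs, values, 0.9)
trained_scales, final_error = train_metric([1.0, 1.0], exemplars, refs, values)
```

Because a step is kept only when it does not raise the error, the error after training is never worse than at initialization, at the cost of possibly stalling in a poor local minimum.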

25. A system for learning a management policy, comprising:
- means for receiving a set of one or more exemplars, where each of the exemplars comprises at least a (state, action) pair for a system;
  
  means for initializing a distance metric, where the distance metric computes a distance between pairs of exemplars;
  
  means for initializing a function approximator;
  
  means for adjusting the distance metric such that a Bellman error measure of the function approximator on the set of exemplars is minimized; and
  
  means for deriving the management policy from the function approximator.
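The final step, deriving the management policy from the function approximator, is commonly realized as a greedy choice over candidate actions. The sketch below assumes a finite action set and a callable value estimate `q(state, action)`; both are illustrative assumptions, since the claims do not say how the policy is derived.

```python
from typing import Callable, Sequence


def greedy_policy(q: Callable[[float, int], float],
                  actions: Sequence[int]) -> Callable[[float], int]:
    """Return a policy that picks the action with the highest estimated value."""
    def policy(state: float) -> int:
        return max(actions, key=lambda a: q(state, a))
    return policy


# Toy value estimate: prefers the action numerically closest to the state.
def toy_q(state: float, action: int) -> float:
    return -(state - action) ** 2


policy = greedy_policy(toy_q, actions=[0, 1, 2])
chosen = [policy(0.1), policy(1.0), policy(1.6)]
```

With a continuous action space the `max` over a list would have to be replaced by a search or optimization over actions.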

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Tesauro, Gerald J.; Weinberger, Kilian Q.
Granted Patent: US 9,298,172 B2
US Class Current: 706/12
CPC Class Codes:
G05B 13/0265   the criterion being a learn...
G06F 18/24   Classification techniques
G06F 18/24147   Distances to closest patter...
G06N 20/00   Machine learning
G06N 5/02   Knowledge representation; S...
