METHOD AND APPARATUS FOR IMPROVED REWARD-BASED LEARNING USING ADAPTIVE DISTANCE METRICS
First Claim
1. A method for learning a management policy, comprising:
receiving a set of one or more exemplars, where each of the exemplars comprises at least a (state, action) pair for a system;
initializing a distance metric, where the distance metric computes a distance between pairs of exemplars;
initializing a function approximator;
adjusting the distance metric such that a Bellman error measure of the function approximator on the set of exemplars is minimized; and
deriving the management policy from the function approximator.
Abstract
The present invention is a method and an apparatus for reward-based learning of policies for managing or controlling a system or plant. In one embodiment, a method for reward-based learning includes receiving a set of one or more exemplars, where at least two of the exemplars comprise a (state, action) pair for a system, and at least one of the exemplars includes an immediate reward responsive to a (state, action) pair. A distance metric and a distance-based function approximator estimating long-range expected value are then initialized, where the distance metric computes a distance between two (state, action) pairs, and the distance metric and function approximator are adjusted such that a Bellman error measure of the function approximator on the set of exemplars is minimized. A management policy is then derived based on the trained distance metric and function approximator.
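The steps in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the Gaussian kernel approximator, the diagonal weighted distance, the coordinate search, the discount factor, the toy exemplars, and every function name are assumptions made for illustration, and the stored exemplar values are held fixed at the immediate rewards for brevity.

```python
import math

GAMMA = 0.5  # discount factor (assumed value)

def distance(w, x, y):
    """Weighted squared Euclidean distance between two (state, action) points."""
    return sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y))

def q_value(w, values, points, x):
    """Distance-based approximator: kernel-weighted average of stored values."""
    kernels = [math.exp(-distance(w, x, p)) for p in points]
    total = sum(kernels)
    if total == 0.0:  # every kernel underflowed; fall back to a neutral estimate
        return 0.0
    return sum(k * v for k, v in zip(kernels, values)) / total

def bellman_error(w, values, exemplars):
    """Mean squared difference between Q(s, a) and r + GAMMA * Q(s', a')."""
    points = [(s, a) for s, a, _, _, _ in exemplars]
    err = 0.0
    for s, a, r, s2, a2 in exemplars:
        target = r + GAMMA * q_value(w, values, points, (s2, a2))
        err += (q_value(w, values, points, (s, a)) - target) ** 2
    return err / len(exemplars)

def adjust_metric(w, values, exemplars, sweeps=20):
    """Coordinate search: rescale one weight at a time, keep only improvements."""
    w = list(w)
    best = bellman_error(w, values, exemplars)
    for _ in range(sweeps):
        for i in range(len(w)):
            for factor in (2.0, 0.5):
                trial = w[:i] + [w[i] * factor] + w[i + 1:]
                err = bellman_error(trial, values, exemplars)
                if err < best:
                    w, best = trial, err
    return w, best

def policy(w, values, exemplars, state, actions):
    """Greedy management policy: pick the action with the highest estimate."""
    points = [(s, a) for s, a, _, _, _ in exemplars]
    return max(actions, key=lambda a: q_value(w, values, points, (state, a)))

# Hypothetical exemplars: (state, action, reward, next_state, next_action).
exemplars = [
    (0.1, 0.0, 1.0, 0.2, 0.0),
    (0.2, 1.0, 0.2, 0.1, 0.0),
    (0.8, 0.0, -1.0, 0.9, 1.0),
    (0.9, 1.0, 0.8, 0.8, 1.0),
    (0.5, 1.0, 0.5, 0.6, 1.0),
    (0.4, 0.0, 0.3, 0.5, 0.0),
]
values = [r for _, _, r, _, _ in exemplars]  # stored values fixed at rewards

w0 = [1.0, 1.0]                              # initial distance metric
before = bellman_error(w0, values, exemplars)
w, after = adjust_metric(w0, values, exemplars)
print("Bellman error before:", before, "after:", after)
print("learned metric weights:", w)
print("action chosen at state 0.85:", policy(w, values, exemplars, 0.85, [0.0, 1.0]))
```

Because the search accepts a rescaled weight only when the Bellman error drops, the error after adjustment cannot exceed the starting error. A fuller treatment would also update the stored values and could use a gradient-based optimizer in place of the coordinate search.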
25 Claims
1. A method for learning a management policy, comprising:
receiving a set of one or more exemplars, where each of the exemplars comprises at least a (state, action) pair for a system;
initializing a distance metric, where the distance metric computes a distance between pairs of exemplars;
initializing a function approximator;
adjusting the distance metric such that a Bellman error measure of the function approximator on the set of exemplars is minimized; and
deriving the management policy from the function approximator.
Dependent claims: 2-13.
14. A computer readable medium containing an executable program for learning a management policy, where the program performs the steps of:
receiving a set of one or more exemplars, where each of the exemplars comprises at least a (state, action) pair for a system;
initializing a distance metric, where the distance metric computes a distance between pairs of exemplars;
initializing a function approximator;
adjusting the distance metric such that a Bellman error measure of the function approximator on the set of exemplars is minimized; and
deriving the management policy from the function approximator.
Dependent claims: 15-24.
25. A system for learning a management policy, comprising:
means for receiving a set of one or more exemplars, where each of the exemplars comprises at least a (state, action) pair for a system;
means for initializing a distance metric, where the distance metric computes a distance between pairs of exemplars;
means for initializing a function approximator;
means for adjusting the distance metric such that a Bellman error measure of the function approximator on the set of exemplars is minimized; and
means for deriving the management policy from the function approximator.
Specification