Apparatus and method for selecting a working data set for model development
First Claim
1. A method for generating a working set of data from a full set of data in a computer for building an analyzer, said analyzer having target outputs, said method comprising the steps of:
augmenting said full set of data with the target outputs;
normalizing said augmented data;
clustering said augmented and normalized data; and
selecting one or more members of said clusterized data as said working set of data.
Abstract
The present invention provides a data selection apparatus which augments a set of training examples with the desired output data. The resulting augmented data set is normalized such that the augmented data values range between -1 and +1 and such that the mean of the augmented data set is zero. The data selection apparatus then groups the augmented and normalized data set into related clusters using a clusterizer. Preferably, the clusterizer is a neural network such as a Kohonen self-organizing map (SOM). The data selection apparatus further applies an extractor to cull a working set of data from the clusterized data set. The present invention thus picks, or filters, a set of data which is more nearly uniformly distributed across the portion of the input space of interest to minimize the maximum absolute error over the entire input space. The output of the data selection apparatus is provided to train the analyzer with important sub-sets of the training data rather than with all available training data. A smaller training data set significantly reduces the complexity of the model building or analyzer construction process.
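The four stages named in the abstract (augment, normalize, cluster, extract) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the function names, the per-column centering followed by scaling by the largest absolute value, the one-dimensional map with a Gaussian neighborhood, the learning-rate schedule, and the "keep the member closest to each node" extraction rule are all choices made here for illustration and do not come from the source text.

```python
import math
import random


def augment(inputs, targets):
    """Append each example's target outputs to its input vector."""
    return [list(x) + list(t) for x, t in zip(inputs, targets)]


def normalize(rows):
    """Center each column on zero, then scale it into [-1, +1].

    Assumption: normalization is applied per column; the abstract only
    states the resulting range and zero mean.
    """
    out_cols = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        centered = [v - mean for v in col]
        scale = max(abs(v) for v in centered) or 1.0  # guard constant columns
        out_cols.append([v / scale for v in centered])
    return [list(r) for r in zip(*out_cols)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def train_som(rows, n_nodes=4, epochs=50, seed=0):
    """Train a tiny 1-D Kohonen map and return its node weight vectors."""
    rng = random.Random(seed)
    nodes = [list(rng.choice(rows)) for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs            # decays from 1 toward 0
        lr = 0.5 * frac                        # hypothetical learning-rate schedule
        radius = max(1.0, (n_nodes / 2) * frac)
        for row in rng.sample(rows, len(rows)):
            winner = min(range(n_nodes), key=lambda i: sq_dist(nodes[i], row))
            for i in range(n_nodes):
                # Nodes near the winner on the map are pulled toward the row too.
                influence = math.exp(-((i - winner) ** 2) / (2 * radius ** 2))
                nodes[i] = [w + lr * influence * (v - w)
                            for w, v in zip(nodes[i], row)]
    return nodes


def select_working_set(rows, nodes, per_cluster=1):
    """Group rows by winning node and keep the closest rows of each group."""
    clusters = {}
    for idx, row in enumerate(rows):
        winner = min(range(len(nodes)), key=lambda i: sq_dist(nodes[i], row))
        clusters.setdefault(winner, []).append(idx)
    chosen = []
    for winner, members in clusters.items():
        members.sort(key=lambda idx: sq_dist(nodes[winner], rows[idx]))
        chosen.extend(members[:per_cluster])
    return sorted(chosen)


# Toy data: inputs crowd near 0, with sparser examples near 5 and 10.
inputs = [[0.0], [0.1], [0.2], [5.0], [5.1], [9.9], [10.0]]
targets = [[0.0], [0.0], [0.1], [1.0], [1.0], [2.0], [2.0]]

data = normalize(augment(inputs, targets))
nodes = train_som(data, n_nodes=3)
working = select_working_set(data, nodes)  # indices into the full data set
```

Because at most `per_cluster` members survive from each map node, densely sampled regions contribute no more examples than sparse ones, which is the sense in which the working set is "more nearly uniformly distributed" than the full set.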
33 Claims
1. A method for generating a working set of data from a full set of data in a computer for building an analyzer, said analyzer having target outputs, said method comprising the steps of:
augmenting said full set of data with the target outputs;
normalizing said augmented data;
clustering said augmented and normalized data; and
selecting one or more members of said clusterized data as said working set of data.
Dependent claims: 2-17.
18. An apparatus for generating a working set of data from a full set of data in a computer for building an analyzer, said analyzer having target outputs, said apparatus comprising:
a cumulator for augmenting said full set of data with the target outputs;
an adjuster coupled to said cumulator for normalizing said augmented data;
a clusterizer coupled to said adjuster for clustering said augmented and normalized data; and
a selector coupled to said clusterizer for picking one or more members of said clusterized data as said working set of data.
Dependent claims: 19-32.
33. A program storage device having a computer readable program code embodied therein for generating a working set of data from a full set of data to build an analyzer, said analyzer having target outputs, said program storage device comprising:
a cumulator code for augmenting said full set of data with the target outputs;
an adjuster code coupled to said cumulator code for normalizing said augmented data;
a clusterizer code coupled to said adjuster code for clustering said augmented and normalized data; and
a selector code coupled to said clusterizer code for picking one or more members of said clusterized data as said working set of data.
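Claims 18 and 33 recite four components coupled in a chain: cumulator, adjuster, clusterizer, selector. One way to picture that coupling is as four interchangeable callables, each consuming the previous one's output. The sketch below is an assumption-laden illustration only: the `DataSelector` class, the callable signatures, and all four stand-in functions are hypothetical, and the sign-pattern clusterizer is a deliberately crude placeholder rather than the Kohonen map the abstract prefers.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Row = List[float]


@dataclass
class DataSelector:
    """Hypothetical wiring of the four coupled components."""
    cumulator: Callable[[Sequence[Row], Sequence[Row]], List[Row]]
    adjuster: Callable[[List[Row]], List[Row]]
    clusterizer: Callable[[List[Row]], List[int]]  # one cluster label per row
    selector: Callable[[List[int]], List[int]]     # indices of the chosen rows

    def run(self, inputs, targets):
        augmented = self.cumulator(inputs, targets)
        adjusted = self.adjuster(augmented)
        labels = self.clusterizer(adjusted)
        return self.selector(labels)


def concat(inputs, targets):
    """Stand-in cumulator: append the target outputs to each input row."""
    return [list(x) + list(t) for x, t in zip(inputs, targets)]


def center_and_scale(rows):
    """Stand-in adjuster: zero-mean columns scaled into [-1, +1]."""
    cols = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        centered = [v - mean for v in col]
        scale = max(abs(v) for v in centered) or 1.0  # guard constant columns
        cols.append([v / scale for v in centered])
    return [list(r) for r in zip(*cols)]


def sign_buckets(rows):
    """Crude stand-in clusterizer: rows with the same sign pattern share a label."""
    patterns = {}
    labels = []
    for row in rows:
        key = tuple(v >= 0 for v in row)
        labels.append(patterns.setdefault(key, len(patterns)))
    return labels


def first_of_each(labels):
    """Stand-in selector: keep the first row seen in each cluster."""
    seen = {}
    for idx, label in enumerate(labels):
        seen.setdefault(label, idx)
    return sorted(seen.values())


pipeline = DataSelector(concat, center_and_scale, sign_buckets, first_of_each)
working = pipeline.run(
    inputs=[[1.0], [2.0], [8.0], [9.0]],
    targets=[[0.0], [0.0], [1.0], [1.0]],
)
print(working)  # [0, 2]
```

Keeping each stage behind a plain callable means a different clusterizer or extraction rule can be swapped in without touching the other three stages.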
Specification