System and method for quantifying an extent to which a data mining algorithm captures useful information in input data
Abstract
A system and method for estimating the point of diminishing returns for additional information in data mining processing applications. The present invention provides a convenient method of estimating the extent to which a data mining algorithm captures useful information in raw feature data. First, the input data is processed using a forward transform. A region of overlap Yo in the forward transformed data is identified and quantified. The region of overlap Yo is processed with a reverse transform to create an overlap region Z in an original feature space. The degree of overlap in region Z is quantified and compared to a level of overlap in the Yo region, such that the comparison quantifies the extent to which a data mining algorithm captures useful information in the input data.
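The sequence described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the forward transform is a plain linear projection with a hypothetical `weights` vector, the overlap region Yo is the interval of decision values reached by both classes, and the reverse transform is approximated by looking up the original samples whose decision values fall inside Yo.

```python
from typing import List, Sequence, Tuple

# A sample is (feature vector, class label); labels 0 and 1 are assumed.
Sample = Tuple[Sequence[float], int]


def forward_transform(x: Sequence[float], weights: Sequence[float]) -> float:
    """Project a feature vector onto a one-dimensional decision space."""
    return sum(w * v for w, v in zip(weights, x))


def overlap_interval(scores_a: List[float], scores_b: List[float]) -> Tuple[float, float]:
    """Interval of decision values reached by both classes (stand-in for Yo)."""
    low = max(min(scores_a), min(scores_b))
    high = min(max(scores_a), max(scores_b))
    return low, high  # the overlap is empty when low > high


def region_z(samples: List[Sample], weights: Sequence[float]) -> List[Sample]:
    """Original-space samples whose decision values fall inside Yo (stand-in for Z)."""
    scores = [(forward_transform(x, weights), label) for x, label in samples]
    class_a = [s for s, label in scores if label == 0]
    class_b = [s for s, label in scores if label == 1]
    low, high = overlap_interval(class_a, class_b)
    return [smp for smp, (s, _) in zip(samples, scores) if low <= s <= high]


# Hypothetical toy data: two features per sample, two classes.
demo_samples: List[Sample] = [
    ((0.0, 1.0), 0), ((1.0, 1.0), 0), ((2.0, 0.0), 0),
    ((1.5, 0.5), 1), ((3.0, 1.0), 1), ((4.0, 0.0), 1),
]
demo_z = region_z(demo_samples, weights=(1.0, 0.0))
```

A real reverse transform would depend on the data mining algorithm in use; the sample lookup above only stands in for it so that a degree of overlap can be measured in both spaces on the same data.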
12 Claims
-
1. A method for quantifying an extent to which a data mining algorithm captures useful information in input data, the method comprising:
-
performing a forward transform on input data;
identifying and quantifying a region of overlap Yo in the forward transformed data;
performing a reverse transform on the overlap region Yo to create an overlap region Z in an original feature space;
quantifying a degree of overlap in region Z;
comparing a level of overlap in the Yo region with a level of overlap in the Z region; and
quantifying the extent to which a data mining algorithm captures useful information in the input data, based upon a result of the comparison in the levels of overlap between the Yo region and the Z region.
2. The method of claim 1, further comprising generating a rank-order curve to estimate a feature dimension at which data mining performance reaches an inflection point prior to performing the forward transform.
-
-
3. The method of claim 2, wherein the performing the forward transform comprises transforming a feature set that contains all the features up to the inflection point into a one-dimensional decision space.
-
4. The method of claim 3, wherein quantifying a region of overlap Yo in the forward transformed data comprises calculating a degree of overlap between different output classes in the decision space.
-
5. The method of claim 4, wherein calculating a degree of overlap comprises using one of Kullback-Leibler divergence, Bhattacharyya distance, and multi-modal overlap measures.
-
6. The method of claim 4, wherein if a dimension N of the feature set is above a threshold value, the input features are orthogonalized with a level of overlap in region Z to convert to N one-dimensional vectors.
-
7. The method of claim 6, wherein the level of overlap in region Z is computed as a sum of each probability density function (PDF) for each of the N one-dimensional vectors.
-
8. The method of claim 4, wherein if a dimension N of the feature set is less than a threshold value, a level of overlap in region Z is computed using a Parzen window to estimate a multi-dimensional class-conditional feature probability density function (PDF).
-
9. The method of claim 4, wherein comparing a level of overlap in the Yo region with a level of overlap in the Z region comprises performing a linear regression between the two regions, such that a magnitude of a slope is proportional to the extent to which a data mining algorithm captures useful information in the input data.
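Two of the overlap measures named in claim 5 have standard textbook definitions, sketched below for discrete distributions. The sketch assumes that the class-conditional outputs in the decision space have already been binned into normalized histograms `p` and `q` over the same bins; the binning step and the function names are illustrative assumptions, not part of the claims.

```python
import math
from typing import Sequence


def bhattacharyya_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """-ln(sum_i sqrt(p_i * q_i)); 0 for identical histograms, inf for disjoint ones."""
    coefficient = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return -math.log(coefficient) if coefficient > 0 else math.inf


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """sum_i p_i * ln(p_i / q_i), with the usual convention 0 * ln(0) = 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # bins with no mass in p contribute nothing
        if qi == 0:
            return math.inf  # p has mass where q has none
        total += pi * math.log(pi / qi)
    return total
```

Both quantities grow as the two classes separate, so a small value indicates heavy overlap. The Kullback-Leibler divergence is not symmetric in its arguments, which matters when deciding which class plays the role of `p`.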
-
10. A method to quantify how close a data mining algorithm is to optimal performance for a given set of input data, the method comprising:
-
performing a forward transform on input data;
calculating a degree of confusion in a region of overlap Yo in the forward transformed data;
performing a reverse transform on the overlap region Yo to create an overlap region Z in an original feature space;
calculating a degree of confusion in overlap region Z; and
performing a linear regression between the level of confusion in the Yo region and the level of confusion in the Z region, such that a magnitude of a slope is proportional to the extent to which a data mining algorithm captures useful information in the input data.
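The regression step in claims 9 and 10 can be sketched as an ordinary least-squares fit. The claims do not say which quantity goes on which axis or where the paired observations come from, so the sketch assumes paired measurements (for example, one pair per feature subset or trial) with the Z-region level as the independent variable and the Yo-region level as the dependent variable. The function name and argument order are hypothetical.

```python
from typing import Sequence


def regression_slope(z_levels: Sequence[float], yo_levels: Sequence[float]) -> float:
    """Least-squares slope of yo_levels regressed on z_levels."""
    n = len(z_levels)
    if n != len(yo_levels) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_z = sum(z_levels) / n
    mean_yo = sum(yo_levels) / n
    # Sum of squared deviations of the independent variable.
    s_zz = sum((z - mean_z) ** 2 for z in z_levels)
    if s_zz == 0:
        raise ValueError("z_levels must not all be equal")
    # Sum of cross-deviations between the two variables.
    s_zy = sum((z - mean_z) * (y - mean_yo) for z, y in zip(z_levels, yo_levels))
    return s_zy / s_zz
```

Under these assumptions, `abs(regression_slope(z, yo))` would be the magnitude that the claims relate to how much useful information the algorithm captures.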
-
-
11. A computer readable medium including computer code for quantifying an extent to which a data mining algorithm captures useful information in input data, the computer readable medium comprising:
-
computer code for performing a forward transform on input data;
computer code for identifying and quantifying a region of overlap Yo in the forward transformed data;
computer code for performing a reverse transform on the overlap region Yo to create an overlap region Z in an original feature space;
computer code for quantifying a degree of overlap in region Z;
computer code for comparing a level of overlap in the Yo region with a level of overlap in the Z region; and
computer code for quantifying the extent to which a data mining algorithm captures useful information in the input data based upon a result of the comparison in the levels of overlap between the Yo region and the Z region.
-
-
12. A computer system for quantifying how close a data mining algorithm is to optimal performance for a given set of input data, the computer system comprising:
-
a processor; and
computer program code that executes on the processor, the computer program code comprising:
computer code for performing a forward transform on input data;
computer code for identifying and quantifying a region of overlap Yo in the forward transformed data;
computer code for performing a reverse transform on the overlap region Yo to create an overlap region Z in an original feature space;
computer code for quantifying a degree of overlap in region Z;
computer code for comparing a level of overlap in the Yo region with a level of overlap in the Z region; and
computer code for quantifying how close a data mining algorithm is to optimal performance for a given set of input data based upon a result of the comparison in the levels of overlap between the Yo region and the Z region.
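Claim 8 refers to a Parzen window estimate of a class-conditional probability density function. A minimal sketch of that idea follows, simplified to one dimension for brevity (the claim speaks of a multi-dimensional PDF). The Gaussian kernel, the bandwidth `h`, and the use of the integral of the pointwise minimum of the two densities as the level of overlap are all assumptions chosen for illustration.

```python
import math
from typing import Sequence


def parzen_pdf(x: float, samples: Sequence[float], h: float) -> float:
    """Parzen window density estimate at x using a Gaussian kernel of width h."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)


def overlap_level(class_a: Sequence[float], class_b: Sequence[float],
                  h: float, steps: int = 2000) -> float:
    """Integrate min(p_a(x), p_b(x)) with the midpoint rule.

    The result lies between 0 (no overlap) and about 1 (identical densities).
    """
    low = min(min(class_a), min(class_b)) - 5.0 * h
    high = max(max(class_a), max(class_b)) + 5.0 * h
    dx = (high - low) / steps
    total = 0.0
    for i in range(steps):
        x = low + (i + 0.5) * dx
        total += min(parzen_pdf(x, class_a, h), parzen_pdf(x, class_b, h)) * dx
    return total
```

The bandwidth `h` controls the smoothness of the estimate, and the integration grid should be fine relative to `h`. In higher dimensions the kernel would become a product of per-dimension kernels, and the integral would typically be estimated by sampling instead of a grid.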
-
Specification