METHOD FOR FEATURE SELECTION AND FOR EVALUATING FEATURES IDENTIFIED AS SIGNIFICANT FOR CLASSIFYING DATA
First Claim
1. A method for estimating a subset of features falsely labeled “significant” within a group of features which appear able to separate a dataset comprising multiple examples into two or more classes, the method comprising:
inputting the dataset into a computer adapted for implementing a support vector machine;
separately for each feature of the group of features, assigning a value to the feature by:
processing the dataset using the support vector machine to separate the examples into classes according to known outcomes, wherein the classes comprise one class having one set of feature values and at least one other class having another set of feature values;
calculating an extremal margin value between a lowest feature value in the one class and the highest feature value in the at least one other class;
generating a list of the group of features and their calculated extremal margin values;
before or after assigning a value to the feature, determining a probability of obtaining an extremal margin value that exceeds a normal distribution of extremal margin values by:
drawing a set of examples from each class at random according to a normal distribution;
processing the randomly drawn example set using the support vector machine for each feature of the group of features to separate the randomly drawn example set into classes;
computing the extremal margin value within the randomly drawn example set;
repeating the steps of drawing, processing and computing for a large number of randomly drawn sets;
generating a table comprising estimated p-values, wherein the estimated p-value is a fraction of the large number of randomly drawn sets in which the computed extremal margin value exceeds a specified extremal margin value;
selecting a desired p-value;
determining from the table the specified extremal margin value corresponding to the desired p-value;
identifying as falsely significant features the features on the list of the group of features that have an extremal margin value of less than the specified extremal value corresponding to the desired p-value;
generating an output comprising a listing of the falsely significant features; and
transferring the output to a media.
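The steps recited in the claim can be sketched in code. The following is a minimal illustration, not the patented implementation. It makes several assumptions that the claim does not state: feature values are standardized so they can be compared with draws from a standard normal distribution; the gap between the classes is computed directly for a single feature rather than by training a support vector machine; and all function and parameter names (`extremal_margin`, `null_margins`, `estimated_p_value`, `margin_for_p_value`, `falsely_significant`) are hypothetical.

```python
import bisect
import random


def extremal_margin(class_a, class_b):
    """Gap between the lowest value in one class and the highest in the other.

    Both orientations are tried (an assumption of this sketch), so a positive
    result means the feature separates the two classes without overlap.
    """
    return max(min(class_a) - max(class_b), min(class_b) - max(class_a))


def null_margins(n_a, n_b, n_draws, rng):
    """Sorted extremal margins of pure-noise features drawn from N(0, 1)."""
    margins = []
    for _ in range(n_draws):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_a)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_b)]
        margins.append(extremal_margin(a, b))
    return sorted(margins)


def estimated_p_value(margin, sorted_null):
    """Fraction of random draws whose margin exceeds the given margin."""
    exceeding = len(sorted_null) - bisect.bisect_right(sorted_null, margin)
    return exceeding / len(sorted_null)


def margin_for_p_value(sorted_null, desired_p):
    """Smallest tabulated margin whose estimated p-value is at most desired_p."""
    for margin in sorted_null:
        if estimated_p_value(margin, sorted_null) <= desired_p:
            return margin
    return sorted_null[-1]  # reached only if desired_p is negative


def falsely_significant(feature_margins, threshold):
    """Names of features whose margin falls below the threshold margin."""
    return [name for name, margin in feature_margins if margin < threshold]
```

In use, one would compute `extremal_margin` for each candidate feature, build the null table once with `null_margins` (using the same class sizes as the real dataset and a large `n_draws`), look up the threshold with `margin_for_p_value` for the chosen p-value, and pass the list of `(name, margin)` pairs to `falsely_significant`.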
Abstract
A group of features that has been identified as “significant” in being able to separate data into classes is evaluated using a support vector machine which separates the dataset into classes one feature at a time. After separation, an extremal margin value is assigned to each feature based on the distance between the lowest feature value in the first class and the highest feature value in the second class. Separately, extremal margin values are calculated for a normal distribution within a large number of randomly drawn example sets for the two classes to determine the number of examples within the normal distribution that would have a specified extremal margin value. Using p-values calculated for the normal distribution, a desired p-value is selected. The specified extremal margin value corresponding to the selected p-value is compared to the calculated extremal margin values for the group of features. The features in the group that have a calculated extremal margin value less than the specified margin value are labeled as falsely significant.
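The quantities described in the abstract can be written compactly. The notation below is an assumption introduced for illustration and does not appear in the source: $x_{ij}$ is the value of feature $j$ for example $i$, $C_1$ and $C_2$ are the two classes, $M^{(k)}$ is the extremal margin computed on the $k$-th of $N$ randomly drawn example sets, and $\alpha$ is the desired p-value.

```latex
% Extremal margin of feature j: lowest value in the first class
% minus highest value in the second class
M_j = \min_{i \in C_1} x_{ij} - \max_{i \in C_2} x_{ij}

% Estimated p-value of a margin m: fraction of the N random draws
% whose extremal margin exceeds m
\hat{p}(m) = \frac{1}{N} \, \bigl| \{\, k : M^{(k)} > m \,\} \bigr|

% Threshold margin for the desired p-value, and the labeling rule
\hat{p}(m_\alpha) = \alpha, \qquad
\text{feature } j \text{ is falsely significant if } M_j < m_\alpha
```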
8 Claims
Specification