Feature selection method using support vector machine classifier
Abstract
Identification of a determinative subset of features from within a large set of features is performed by training a support vector machine to rank the features according to classifier weights, where features are removed to determine how their removal affects the value of the classifier weights. The features having the smallest weight values are removed and a new support vector machine is trained with the remaining features. The process is repeated until a relatively small subset of features remains that is capable of accurately separating the data into different patterns or classes. The method is applied for selecting the smallest number of genes that are capable of accurately distinguishing between medical conditions such as cancer and non-cancer.
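The train, rank, remove, retrain loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the linear SVM is trained with plain full-batch subgradient descent on the regularized hinge loss (a stand-in solver chosen to keep the example dependency-free), features are ranked by the absolute value of their weights, and the names `train_linear_svm`, `select_features`, `n_final`, and `n_drop`, as well as the toy data, are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Full-batch subgradient descent on the L2-regularized hinge loss.

    X is a list of feature vectors, y holds labels in {-1, +1}.
    Returns the weight vector w (one weight per feature) and the bias b.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # sample violates the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def select_features(X, y, n_final, n_drop=1):
    """Train, rank by |weight|, drop the smallest, retrain; stop at n_final."""
    remaining = list(range(len(X[0])))  # original column indices still in play
    while len(remaining) > n_final:
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(X_sub, y)
        # Positions within `remaining`, ordered from smallest to largest |weight|.
        order = sorted(range(len(remaining)), key=lambda k: abs(w[k]))
        k = min(n_drop, len(remaining) - n_final)  # do not overshoot the target
        dropped = set(order[:k])
        remaining = [f for pos, f in enumerate(remaining) if pos not in dropped]
    return remaining


# Toy data (made up): 4 samples, 6 features; only feature 2 tracks the label.
X = [
    [0.1, -0.2, 2.0, 0.0, 0.3, -0.1],
    [-0.1, 0.2, 1.8, 0.1, -0.3, 0.1],
    [0.1, 0.2, -2.0, 0.0, 0.3, 0.1],
    [-0.1, -0.2, -1.9, -0.1, -0.3, -0.1],
]
y = [1, 1, -1, -1]

selected = select_features(X, y, n_final=2)
print("Final feature subset (column indices):", selected)
```

On this toy input the informative column (index 2) carries by far the largest weight at every iteration, so it survives to the final subset. In practice a library SVM solver would replace the hand-written trainer.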
19 Claims
1. A computer-implemented method for predicting patterns in biological data, wherein the data comprises a large set of features that describe the data and a sample set from which the biological data is obtained is much smaller than the large set of features, the method comprising:
- identifying a determinative subset of features that are most correlated to the patterns comprising:
(a) inputting the data into a computer processor programmed for executing support vector machine classifiers;
(b) training a support vector machine classifier with a training data set comprising at least a portion of the sample set and having known outcomes with respect to the patterns, wherein the classifier comprises weights having weight values that correspond to the features in the data set and removal of a subset of features affects the weight values;
(c) ranking the features according to their corresponding weight values;
(d) removing one or more features corresponding to the smallest weight values;
(e) training a new classifier with the remaining features;
(f) repeating steps (c) through (e) for a plurality of iterations until a final subset having a pre-determined number of features remains; and
- generating at a printer or display device a report comprising a listing of the features in the final subset, wherein the final subset comprises the determinative subset of features for determining biological characteristics of the sample set.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A computer program product embodied on a computer readable medium for predicting patterns in data without overfitting by identifying a determinative subset of features that are most correlated to the patterns, wherein the data comprises a large set of features that describe the data, the computer program product comprising instructions for executing support vector machine classifiers and further for causing a computer processor to:
(a) receive the data;
(b) train a support vector machine classifier with a training data set having known outcomes with respect to the patterns, wherein the training data set has a number of training patterns that is much smaller than the number of features in the large set of features, and wherein the classifier comprises weights having weight values that correspond to the features in the data set and removal of a subset of features affects the weight values;
(c) rank the features according to their corresponding weight values;
(d) remove one or more features corresponding to the smallest weight values;
(e) train a new classifier with the remaining features;
(f) repeat steps (c) through (e) for a plurality of iterations until a final subset having a pre-determined number of features remains; and
(g) generate at a printer or display device a report comprising a listing of the features in the final subset, wherein the final subset comprises the determinative subset of features.
Dependent claims: 13, 14, 15
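Steps (d) and (f) leave open how many features are removed in each iteration; the claim only requires "one or more" per round and a pre-determined final count. One possible choice, which is an assumption for illustration and is not stated in the claim, is to drop a fixed fraction of the surviving features each round and clamp the last round to the target size. The function name `elimination_schedule`, its parameters, and the example sizes (2000 features reduced to 16) are all hypothetical.

```python
def elimination_schedule(n_features, n_final, fraction=0.5):
    """Return the number of features left after each elimination round.

    Each round removes `fraction` of the surviving features (at least one),
    and the last round is clamped so the count never falls below `n_final`.
    """
    if not 0 < n_final <= n_features:
        raise ValueError("n_final must be between 1 and n_features")
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    counts = []
    current = n_features
    while current > n_final:
        n_drop = max(1, int(current * fraction))  # always remove at least one
        current = max(n_final, current - n_drop)  # never overshoot the target
        counts.append(current)
    return counts


# Hypothetical sizes: many more features than the final subset.
schedule = elimination_schedule(2000, 16)
print(schedule)  # [1000, 500, 250, 125, 63, 32, 16]
```

Dropping a fraction per round needs far fewer retraining passes than removing one feature at a time (7 rounds instead of 1984 in this example), at the cost of a coarser ranking in the early rounds.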
16. An apparatus comprising:
- a computer processor;
- a memory;
- a computer readable medium storing a computer program product for predicting patterns in data without overfitting by identifying a determinative subset of features that are most correlated to the patterns, wherein the data comprises a large set of features that describe the data, the computer program product comprising instructions for executing support vector machine classifiers and further for causing a computer processor to:
(a) receive the data;
(b) train a support vector machine classifier with a training data set having known outcomes with respect to the patterns, wherein the training data set has a number of training patterns that is much smaller than the number of features in the large set of features, and wherein the classifier comprises weights having weight values that correspond to the features in the data set and removal of a subset of features affects the weight values;
(c) rank the features according to their corresponding weight values;
(d) remove one or more features corresponding to the smallest weight values;
(e) train a new classifier with the remaining features;
(f) repeat steps (c) through (e) for a plurality of iterations until a final subset having a pre-determined number of features remains; and
(g) generate at a printer or display device a report comprising a listing of the features in the final subset, wherein the final subset comprises the determinative subset of features.
Dependent claims: 17, 18, 19
Specification