Pre-processed feature ranking for a support vector machine
First Claim
1. A computer-implemented method for analyzing a dataset comprising a plurality of features to separate the dataset into two or more known classes, the method comprising:
downloading the dataset into a computer system having a memory, an output device, and a processor programmed for executing a support vector machine;
separately, for each feature of the plurality of features;
(i) training the support vector machine to separate the dataset into the two or more known classes to define two or more sets of data points, wherein each set of data points has an extremal point corresponding to a maximum separation between the two or more known classes;
(ii) determining the separation distance between the extremal points of the two or more sets of data points;
repeating steps (i) and (ii) for all features of the plurality so that the separation distance between the extremal points is determined for each feature whereby each feature is associated with a corresponding separation distance value;
ranking the features according to their corresponding separation distance values, wherein the highest ranked features have the greatest separation distance values;
selecting a subset of features having the highest rank; and
generating an output comprising a report listing the selected subset of features for display or storage on a computer-readable medium at the output device.
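The ranking loop recited in the claim can be sketched in a few lines of Python. This is an illustrative sketch, not the patented implementation: it assumes exactly two classes labeled +1 and -1, and it replaces actual support vector machine training with a closed form that holds only for a single feature whose classes do not overlap, where a hard-margin linear SVM has a margin width equal to the gap between the closest opposite-class values. The function names `separation_distance` and `rank_features`, the toy data, and the convention of scoring overlapping features as 0.0 are all hypothetical.

```python
from typing import List, Sequence, Tuple


def separation_distance(values: Sequence[float], labels: Sequence[int]) -> float:
    """Gap between the two classes along one feature.

    Assumption: labels are +1 / -1. When the classes do not overlap on this
    feature, the gap equals the margin width of a hard-margin linear SVM
    trained on the feature alone, so no solver is called here.
    Overlapping classes score 0.0 (a convention chosen for this sketch).
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == -1]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Extremal points: the closest pair of opposite-class values.
    gap = max(min(pos) - max(neg), min(neg) - max(pos))
    return max(gap, 0.0)


def rank_features(
    rows: Sequence[Sequence[float]], labels: Sequence[int]
) -> List[Tuple[int, float]]:
    """Return (feature_index, separation_distance) pairs, largest distance first."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]  # one feature at a time
        scores.append((j, separation_distance(column, labels)))
    # sorted() is stable, so tied features keep their original order.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy data (made up): 4 samples, 3 features.
rows = [
    [1.0, 5.0, 0.2],
    [2.0, 1.0, 0.1],
    [8.0, 4.0, 0.4],
    [9.0, 2.0, 0.5],
]
labels = [-1, -1, 1, 1]

ranking = rank_features(rows, labels)
top = [j for j, _ in ranking[:2]]  # subset of the two highest ranked features
```

With real data the classes usually overlap on many features, in which case a soft-margin SVM would have to be trained per feature instead of using the closed form.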
Abstract
A computer-implemented method is provided for ranking features within a large dataset containing a large number of features according to each feature's ability to separate data into classes. For each feature, a support vector machine separates the dataset into two classes and determines the margins between extremal points in the two classes. The margins for all of the features are compared and the features are ranked based upon the size of the margin, with the highest ranked features corresponding to the largest margins. A subset of features for classifying the dataset is selected from a group of the highest ranked features. In one embodiment, the method is used to identify the best genes for disease prediction and diagnosis using gene expression data from micro-arrays.
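For a single feature, the link between the margin and the extremal points can be worked out directly. The derivation below is a sketch under stated assumptions, not text from the patent: it assumes a hard-margin linear support vector machine, labels in {-1, +1}, and two classes that do not overlap on the feature. The symbols w, b, a, and c are introduced here for illustration.

```latex
% Assumptions: one feature x, labels y_i in {-1,+1}, classes separable with
% a = max_{y_i=-1} x_i  <  c = min_{y_i=+1} x_i  (the two extremal points).
\begin{aligned}
&\min_{w,\,b}\ \tfrac{1}{2} w^{2}
  \quad \text{subject to} \quad y_i\,(w x_i + b) \ge 1 \ \ \text{for all } i, \\
&\text{constraints active at the extremal points:}\quad
  w c + b = 1, \qquad w a + b = -1, \\
&\Rightarrow\quad w = \frac{2}{c - a}, \qquad b = -\frac{c + a}{c - a}, \\
&\text{margin} = \frac{2}{|w|} = c - a .
\end{aligned}
```

Under these assumptions the margin for a feature is simply the distance between the two extremal points, which is why a larger margin indicates a feature that separates the classes more cleanly.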
24 Claims
1. A computer-implemented method for analyzing a dataset comprising a plurality of features to separate the dataset into two or more known classes, the method comprising:
downloading the dataset into a computer system having a memory, an output device, and a processor programmed for executing a support vector machine;
separately, for each feature of the plurality of features;
(i) training the support vector machine to separate the dataset into the two or more known classes to define two or more sets of data points, wherein each set of data points has an extremal point corresponding to a maximum separation between the two or more known classes;
(ii) determining the separation distance between the extremal points of the two or more sets of data points;
repeating steps (i) and (ii) for all features of the plurality so that the separation distance between the extremal points is determined for each feature whereby each feature is associated with a corresponding separation distance value;
ranking the features according to their corresponding separation distance values, wherein the highest ranked features have the greatest separation distance values;
selecting a subset of features having the highest rank; and
generating an output comprising a report listing the selected subset of features for display or storage on a computer-readable medium at the output device.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A computer-implemented method for selecting a subset of features for classifying data in a dataset comprising a plurality of features into known classes, the method comprising:
downloading the dataset into a computer system having a memory, an output device, and a processor programmed for executing a support vector machine;
separately for each feature of the plurality of features, repeating the steps of training the support vector machine to separate the dataset into two or more known classes and determine a margin between extremal points of the two or more known classes, so that the margin between the extremal points is determined for each feature whereby each feature is associated with a corresponding margin value;
ranking the features according to the size of the corresponding margin value, wherein the largest margin value corresponds to the highest rank;
selecting a subset of features comprising a pre-determined number of highest ranked features; and
generating an output comprising a report listing the selected subset of features for display or storage on a computer-readable medium at the output device.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
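The last two steps of claim 13, selecting a pre-determined number of highest ranked features and generating a report, might look like the following sketch. It assumes the per-feature margin values have already been computed by some earlier step. The function name `write_feature_report`, the tab-separated layout, and the gene names and margin values are hypothetical and chosen only for illustration.

```python
import io
from typing import Dict, List, TextIO


def write_feature_report(margins: Dict[str, float], k: int, out: TextIO) -> List[str]:
    """Write the k features with the largest margin values to `out`.

    `margins` maps a feature name to its precomputed margin value.
    Returns the names of the selected features, highest rank first.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Largest margin value corresponds to the highest rank.
    ranked = sorted(margins.items(), key=lambda item: item[1], reverse=True)
    selected = ranked[:k]  # pre-determined number of top features
    out.write("rank\tfeature\tmargin\n")
    for rank, (name, margin) in enumerate(selected, start=1):
        out.write(f"{rank}\t{name}\t{margin:.3f}\n")
    return [name for name, _ in selected]


# Made-up margin values for four hypothetical genes.
margins = {"gene_A": 0.8, "gene_B": 2.5, "gene_C": 1.1, "gene_D": 0.0}
buffer = io.StringIO()  # stands in for a file or display device
selected = write_feature_report(margins, k=2, out=buffer)
report = buffer.getvalue()
```

Passing an open file instead of `io.StringIO` would store the same report on disk; any text stream with a `write` method works.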
Specification