Method for feature selection in a support vector machine using feature ranking

US 7,805,388 B2
Filed: 10/30/2007
Issued: 09/28/2010
Est. Priority Date: 05/01/1998
Status: Expired due to Fees

Abstract

In a pre-processing step prior to training a learning machine, the quantity of features to be processed is reduced using feature selection methods selected from the group consisting of recursive feature elimination (RFE), minimizing the number of non-zero parameters of the system (l₀-norm minimization), evaluation of a cost function to identify a subset of features that are compatible with constraints imposed by the learning set, unbalanced correlation score, transductive feature selection, and single feature using margin-based ranking. The features remaining after feature selection are then used to train a learning machine for purposes of pattern classification, regression, clustering and/or novelty detection.

77 Citations

27 Claims

1. A computer-implemented method for predicting patterns in a dataset, wherein the data comprises a large set of features that describe the data, wherein each feature has a feature value corresponding to each datapoint within the dataset, the method comprising:
- identifying a subset of significant features that are most correlated to the patterns, comprising:
  
  downloading a dataset having known outcomes into a memory of a computer having a processor for executing a classification algorithm;
  
  for each feature, separating the data into classes according to their known outcomes, wherein the classes comprise a first class having a first set of feature values and a second class having a second set of feature values;
  
  for each feature, calculating an extremal difference in feature value between a lowest feature value in the first class and a highest feature value in the second class;
  
  ranking the features according to the extremal differences in feature value between the classes, wherein the highest extremal differences in feature value have the highest ranks;
  
  generating an output in the memory comprising the subset of features having the highest ranks, wherein the subset of features is correlated to the patterns; and
  
  transferring the output from the memory to a media.
- Dependent claims (2-12):
  - 2. The method of claim 1, wherein the classification algorithm is a support vector machine.
  - 3. The method of claim 1, further comprising pre-processing the dataset by normalizing the data.
  - 4. The method of claim 1, wherein the dataset comprises gene expression data obtained from DNA micro-arrays, and each feature comprises a gene within the micro-arrays.
  - 5. The method of claim 1, further comprising:
    - downloading an unknown dataset having unknown outcomes into the memory, wherein the unknown dataset is of the same data type as the known dataset;
      
      separating the unknown dataset into one or more classes according to the feature values of the subset of significant features within the unknown dataset; and
      
      generating an output decision comprising an identification of the one or more classes.
  - 6. The method of claim 5, further comprising displaying the output decision on a display device.
  - 7. The method of claim 1, wherein the media comprises a disk drive or removable media.
  - 8. The method of claim 1, further comprising displaying the output on a display device.
  - 9. The method of claim 1, wherein the step of generating an output includes computing p-values for each feature and applying a threshold criterion based on the p-value.
  - 10. The method of claim 9, wherein the threshold criterion is 0.001.
  - 11. The method of claim 9, wherein the threshold criterion is 0.0001.
  - 12. The method of claim 1, wherein the step of generating an output comprises applying a threshold comprising an estimated upper bound determined by randomly selecting a sample set of data points, assuming the sample set of data points has a normal distribution and determining a number of data points falsely called significant as being determinative for separating the dataset into the two or more known classes as a function of a number of data points called significant.
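
The ranking steps of claim 1 can be illustrated with a minimal Python sketch. It is not taken from the patent. The function names, the +1/-1 label encoding, the toy data, and the choice to score each feature in both orientations (so that it does not matter which class holds the larger values) are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def extremal_difference(first: Sequence[float], second: Sequence[float]) -> float:
    """Lowest value in `first` minus highest value in `second`.

    A positive result means the two classes do not overlap on this feature.
    """
    return min(first) - max(second)


def rank_features(
    rows: Sequence[Sequence[float]],
    labels: Sequence[int],
    top_k: int,
) -> List[Tuple[int, float]]:
    """Return the `top_k` (feature index, score) pairs, largest score first.

    Assumptions (not from the claim text): labels are +1 / -1, and each
    feature is scored in both orientations, keeping the larger value.
    """
    n_features = len(rows[0])
    scored = []
    for j in range(n_features):
        # Separate this feature's values into classes by known outcome.
        pos = [row[j] for row, y in zip(rows, labels) if y == 1]
        neg = [row[j] for row, y in zip(rows, labels) if y == -1]
        score = max(extremal_difference(pos, neg), extremal_difference(neg, pos))
        scored.append((j, score))
    # Highest extremal difference gets the highest rank.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy dataset: 4 datapoints x 3 features, with known outcomes.
rows = [
    [5.0, 1.0, 3.0],
    [6.0, 2.0, 0.0],
    [1.0, 1.5, 2.0],
    [2.0, 0.5, 1.0],
]
labels = [1, 1, -1, -1]

print(rank_features(rows, labels, top_k=2))  # [(0, 3.0), (1, -0.5)]
```

Only feature 0 receives a positive score here, because it is the only feature on which the two classes do not overlap.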

13. A computer-implemented method for predicting patterns in a dataset, wherein the data comprises a large set of features that describe the data, wherein each feature has a feature value corresponding to each datapoint within the dataset, the method comprising:
- identifying a subset of significant features that are most correlated to the patterns, comprising:
  
  downloading a dataset having known outcomes into a memory of a computer having a processor for executing a classification algorithm;
  
  using the classification algorithm, separating the dataset into two or more classes according to the known outcomes;
  
  for each feature, determining a separation between extremal feature value points within the two or more classes; and
  
  ranking the subset of features according to the size of the separation for each feature, wherein the feature with the largest separation is assigned the highest rank;
  
  generating an output in the memory comprising the subset of features having the highest ranks, wherein the subset of features is correlated to the patterns; and
  
  transferring the output from the memory to a media.
- Dependent claims (14-24):
  - 14. The method of claim 13, wherein the classification algorithm is a support vector machine.
  - 15. The method of claim 13, further comprising pre-processing the dataset by normalizing the data.
  - 16. The method of claim 13, wherein the dataset comprises gene expression data obtained from DNA micro-arrays, and each feature comprises a gene within the micro-arrays.
  - 17. The method of claim 13, further comprising:
    - downloading an unknown dataset having unknown outcomes into the memory, wherein the unknown dataset is of the same data type as the known dataset;
      
      separating the unknown dataset into one or more classes according to the feature values of the subset of significant features within the unknown dataset; and
      
      generating an output decision comprising an identification of the one or more classes.
  - 18. The method of claim 17, further comprising displaying the output decision on a display device.
  - 19. The method of claim 13, wherein the media comprises a disk drive or removable media.
  - 20. The method of claim 13, further comprising displaying the output on a display device.
  - 21. The method of claim 13, wherein the step of selecting a subset of features comprises computing p-values for each feature and applying the threshold criterion based on the p-value.
  - 22. The method of claim 21, wherein the threshold criterion is 0.001.
  - 23. The method of claim 21, wherein the threshold criterion is 0.0001.
  - 24. The method of claim 13, wherein the step of selecting a subset of features comprises applying a threshold comprising an estimated upper bound determined by randomly selecting a sample set of data points, assuming the sample set of data points has a normal distribution and determining a number of data points falsely called significant as being determinative for separating the dataset into the two or more known classes as a function of a number of data points called significant.
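
Claims 21 to 23 recite computing a p-value per feature and applying a threshold, but the claim text does not say how the p-value is obtained. The sketch below assumes one possible approach, an exact permutation test on the separation score. The function names, the boolean class mask, and the toy data are hypothetical. The example uses a threshold of 0.05 only because eight datapoints cannot produce p-values as small as the claimed 0.001 or 0.0001.

```python
from itertools import combinations
from typing import List, Sequence


def separation(a: Sequence[float], b: Sequence[float]) -> float:
    """Gap between the extremal points of two classes (either orientation)."""
    return max(min(a) - max(b), min(b) - max(a))


def exact_p_value(values: Sequence[float], in_first: Sequence[bool]) -> float:
    """Fraction of all class relabelings whose separation is at least as
    large as the observed one. Assumes both classes are non-empty."""
    n = len(values)
    n_first = sum(in_first)
    first = [v for v, f in zip(values, in_first) if f]
    second = [v for v, f in zip(values, in_first) if not f]
    observed = separation(first, second)

    hits = 0
    total = 0
    # Enumerate every way of assigning n_first datapoints to the first class.
    for chosen in combinations(range(n), n_first):
        chosen_set = set(chosen)
        a = [values[i] for i in chosen_set]
        b = [values[i] for i in range(n) if i not in chosen_set]
        total += 1
        if separation(a, b) >= observed:
            hits += 1
    return hits / total


def select_by_p_value(
    columns: Sequence[Sequence[float]],
    in_first: Sequence[bool],
    threshold: float,
) -> List[int]:
    """Indices of features whose p-value does not exceed `threshold`."""
    return [
        j for j, values in enumerate(columns)
        if exact_p_value(values, in_first) <= threshold
    ]


in_first = [True, True, True, True, False, False, False, False]
separated = [5.0, 6.0, 7.0, 8.0, 1.0, 2.0, 3.0, 4.0]  # classes do not overlap
noise = [5.0, 1.0, 7.0, 3.0, 6.0, 2.0, 8.0, 4.0]      # classes interleaved

print(exact_p_value(separated, in_first))                      # 2/70
print(select_by_p_value([separated, noise], in_first, 0.05))   # [0]
```

Exhaustive enumeration grows combinatorially with the number of datapoints, so a real dataset would need sampled permutations or a parametric estimate instead.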

25. A computer program product embodied on a computer readable medium for predicting patterns in data by identifying a subset of significant features that are most correlated to the patterns, wherein the data comprises a large set of features that describe the data, the computer program product comprising instructions for executing a classification algorithm and further for causing a computer processor to:
- (a) receive the data;
  
  (b) using the classification algorithm, separating the dataset into two or more classes according to the known outcomes;
  
  (c) for each feature, determining a separation between extremal feature value points within the two or more classes of interest; and
  
  (d) ranking the subset of features according to the size of the separation for each feature, wherein the feature with the largest separation is assigned the highest rank; and
  
  (e) generating an output in the memory comprising the subset of features having the highest ranks, wherein the subset of features is correlated to the patterns.
- Dependent claims (26, 27):
  - 26. The computer program product of claim 25, further comprising:
    - (f) receiving an unknown dataset having unknown outcomes, wherein the unknown dataset is of the same data type as the known dataset;
      
      (g) separating the unknown dataset into one or more of the two or more classes according to the feature values of the subset of significant features within the unknown dataset; and
      
      (h) generating an output decision comprising an identification of the one or more classes.
  - 27. The computer program product of claim 25, wherein the step of generating an output comprises applying a threshold comprising an estimated upper bound determined by randomly selecting a sample set of data points, assuming the sample set of data points has a normal distribution and determining a number of data points falsely called significant as being determinative for separating the dataset into the two or more classes as a function of a number of data points called significant.
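
Claims 5, 17 and 26 describe separating an unknown dataset into classes using only the selected features. As a rough illustration, the sketch below restricts each datapoint to the selected feature indices and assigns it to the nearest class centroid. The nearest-centroid rule is a simple stand-in chosen for brevity. It is not the support vector machine named in claims 2 and 14, and all names and data are hypothetical.

```python
from typing import Dict, List, Sequence


def fit_centroids(
    rows: Sequence[Sequence[float]],
    labels: Sequence[int],
    selected: Sequence[int],
) -> Dict[int, List[float]]:
    """Mean of each class, computed over the selected features only."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for row, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(selected))
        for k, j in enumerate(selected):
            acc[k] += row[j]
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def classify(
    rows: Sequence[Sequence[float]],
    centroids: Dict[int, List[float]],
    selected: Sequence[int],
) -> List[int]:
    """Assign each datapoint to the class with the nearest centroid."""
    decisions = []
    for row in rows:
        point = [row[j] for j in selected]
        best = min(
            centroids,
            key=lambda y: sum((p - c) ** 2 for p, c in zip(point, centroids[y])),
        )
        decisions.append(best)
    return decisions


# Known dataset (with outcomes) and the feature subset chosen earlier.
known_rows = [
    [5.0, 1.0, 3.0],
    [6.0, 2.0, 0.0],
    [1.0, 1.5, 2.0],
    [2.0, 0.5, 1.0],
]
known_labels = [1, 1, -1, -1]
selected = [0]  # assumed output of a prior feature-ranking step

# Unknown dataset of the same data type, without outcomes.
unknown_rows = [
    [4.9, 9.0, 9.0],
    [0.5, 0.0, 0.0],
]

centroids = fit_centroids(known_rows, known_labels, selected)
print(classify(unknown_rows, centroids, selected))  # [1, -1]
```

The large values in the unselected features of the first unknown datapoint have no effect on the decision, since only the selected feature is consulted.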

Current Assignee: Curtis Anderson, Health Discovery Corporation, James Roberts, Joe Mckenzie, Jules B. Paderewski, Julian N. Stern, Memorial Health Systems Incorporated, Timothy P. O'Hayer
Original Assignee: Health Discovery Corporation
Inventors: Scholkopf, Bernhard; Guyon, Isabelle; Perez-Cruz, Fernando; Elisseeff, Andre; Weston, Jason
Primary Examiner(s): Sparks, Donald
Assistant Examiner(s): Fernandez Rivas, Omar F
Application Number: US11/928,784
Publication Number: US 20080233576A1
Time in Patent Office: 1,064 Days
Field of Search: 706/12, 706/15, 706/16, 706/20, 706/45, 706/62, 702/19-22
US Class Current: 706/20

CPC Class Codes

C12Q 1/6883   for diseases caused by alte...

C12Q 2600/112   Disease subtyping, staging ...

C12Q 2600/158   Expression markers

G06F 18/2115   by evaluating different sub...

G06F 18/2411   based on the proximity to a...

G06N 20/10   using kernel methods, e.g. ...

G16B 25/00   ICT specially adapted for h...

G16B 25/10   Gene or protein expression ...

G16B 40/00   ICT specially adapted for b...

G16B 40/20   Supervised data analysis

G16B 40/30   Unsupervised data analysis
