Pre-processed feature ranking for a support vector machine

US 7,475,048 B2
Filed: 11/07/2002
Issued: 01/06/2009
Est. Priority Date: 05/01/1998
Status: Expired due to Fees

3 Assignments

0 Petitions

Abstract

A computer-implemented method is provided for ranking features within a large dataset containing a large number of features according to each feature's ability to separate data into classes. For each feature, a support vector machine separates the dataset into two classes and determines the margins between extremal points in the two classes. The margins for all of the features are compared and the features are ranked based upon the size of the margin, with the highest ranked features corresponding to the largest margins. A subset of features for classifying the dataset is selected from a group of the highest ranked features. In one embodiment, the method is used to identify the best genes for disease prediction and diagnosis using gene expression data from micro-arrays.
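
The per-feature ranking described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It relies on a simplifying assumption: for a single feature whose two classes do not overlap, a hard-margin linear support vector machine places its boundary midway between the two closest opposite-class points, so the margin equals the gap between those extremal points and no solver is needed. The function names, the 0/1 class labels, and the choice to score overlapping classes as 0.0 are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def feature_margin(values: Sequence[float], labels: Sequence[int]) -> float:
    """Separation distance for ONE feature between classes labelled 0 and 1.

    Assumption: for separable 1-D data the hard-margin SVM margin equals the
    gap between the closest opposite-class points (the extremal points).
    Assumption: overlapping classes get a score of 0.0.
    """
    class0 = [v for v, c in zip(values, labels) if c == 0]
    class1 = [v for v, c in zip(values, labels) if c == 1]
    if not class0 or not class1:
        raise ValueError("both classes need at least one sample")
    # Gap if class 0 lies entirely below class 1, or the reverse ordering.
    gap = max(min(class1) - max(class0), min(class0) - max(class1))
    return max(gap, 0.0)


def rank_features_by_margin(
    X: Sequence[Sequence[float]], y: Sequence[int], top_k: int
) -> List[Tuple[int, float]]:
    """Score every feature separately, rank by margin, keep the top_k."""
    n_features = len(X[0])
    scored = []
    for j in range(n_features):
        column = [row[j] for row in X]  # values of feature j for all samples
        scored.append((j, feature_margin(column, y)))
    # Largest margin first; ties keep their original feature order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Four samples (rows), three features (columns); made-up numbers.
    X = [
        [0.0, 5.0, 1.0],
        [0.5, 1.0, 1.5],
        [2.0, 4.0, 3.5],
        [2.5, 2.0, 4.0],
    ]
    y = [0, 0, 1, 1]
    print(rank_features_by_margin(X, y, top_k=2))  # [(2, 2.0), (0, 1.5)]
```

Because the margin is measured in the units of each feature, features on different scales are not directly comparable without some form of normalization. Real data with overlapping classes would need a soft-margin SVM from a library rather than this closed-form shortcut.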

65 Citations

24 Claims

1. A computer-implemented method for analyzing a dataset comprising a plurality of features to separate the dataset into two or more known classes, the method comprising:
- downloading the dataset into a computer system having a memory, an output device, and a processor programmed for executing a support vector machine;
  
  separately, for each feature of the plurality of features;
  
  (i) training the support vector machine to separate the dataset into the two or more known classes to define two or more sets of data points, wherein each set of data points has an extremal point corresponding to a maximum separation between the two or more known classes;
  
  (ii) determining the separation distance between the extremal points of the two or more sets of data points;
  
  repeating steps (i) and (ii) for all features of the plurality so that the separation distance between the extremal points is determined for each feature whereby each feature is associated with a corresponding separation distance value;
  
  ranking the features according to their corresponding separation distance values, wherein the highest ranked features have the greatest separation distance values;
  
  selecting a subset of features having the highest rank; and
  
  generating an output comprising a report listing the selected subset of features for display or storage on a computer-readable medium at the output device.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
- - 2. The method of claim 1, further comprising pre-processing the dataset by normalizing the data.
  - 3. The method of claim 1, wherein the dataset comprises gene expression data obtained from DNA micro-arrays, and each feature comprises a gene within the micro-arrays.
  - 4. The method of claim 3, wherein the gene expression data is obtained from tissue from patients with renal cancer and the two or more classes comprise different types or stages of cancer.
  - 5. The method of claim 3, wherein the gene expression data is obtained from tissue from patients with renal cancer and the two or more classes comprise cancerous tissue and normal tissue.
  - 6. The method of claim 4, wherein the selected subset of features is genes comprising small inducible cytokine A2 (monocyte chemotactic protein 1) and ATP synthase, H+ transporting, mitochondrial F1 complex, alpha subunit, isoform 1, cardiac muscle.
  - 7. The method of claim 4, wherein the selected subset of features is genes comprising acetyl-Coenzyme A acetyltransferase 1 (acetoacetyl Coenzyme A thiolase), glutamate decarboxylase 1 (brain, 67kD), and JTV1 gene.
  - 8. The method of claim 4, wherein the selected subset of features is genes comprising tissue inhibitor of metalloproteinase 3 (Sorsby fundus dystrophy, pseudoinflammatory), major histocompatibility complex, class II, DO beta, and guanylate binding protein 1, interferon-inducible, 67kD.
  - 9. The method of claim 1, wherein the step of selecting a subset of features comprises computing p-values for each feature and applying a threshold criterion based on the p-value.
  - 10. The method of claim 9, wherein the threshold criterion is 0.001.
  - 11. The method of claim 9, wherein the threshold criterion is 0.0001.
  - 12. The method of claim 1, wherein the step of selecting a subset of features comprises applying a threshold comprising an estimated upper bound determined by randomly selecting a sample set of data points, assuming the sample set of data points has a normal distribution and determining a number of data points falsely called significant as being determinative for separating the dataset into the two or more known classes as a function of a number of data points called significant.
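
Claim 2 above calls for normalizing the data but the claim text shown here does not say how. One common option, shown purely as an assumption, is to standardize each feature to zero mean and unit standard deviation so that margins measured on different features are on a comparable scale. The function name `standardize_columns` and the handling of constant features are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def standardize_columns(X: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return a copy of X with each feature (column) z-score standardized.

    Assumption: a constant feature carries no class information, so it is
    mapped to all zeros instead of dividing by a zero standard deviation.
    """
    n_features = len(X[0])
    columns = [[row[j] for row in X] for j in range(n_features)]
    scaled_columns = []
    for column in columns:
        mu = mean(column)
        sigma = pstdev(column)  # population standard deviation
        if sigma == 0:
            scaled_columns.append([0.0 for _ in column])
        else:
            scaled_columns.append([(v - mu) / sigma for v in column])
    # Transpose back to the original samples-by-features layout.
    return [list(row) for row in zip(*scaled_columns)]
```

Other schemes, such as min-max scaling or log-transforming expression values, would fit the claim language equally well; the choice here is only an example.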

13. A computer-implemented method for selecting a subset of features for classifying data in a dataset comprising a plurality of features into known classes, the method comprising:
- downloading the dataset into a computer system having a memory, an output device, and a processor programmed for executing a support vector machine;
  
  separately for each feature of the plurality of features, repeating the steps of training the support vector machine to separate the dataset into two or more known classes and determine a margin between extremal points of the two or more known classes, so that the margin between the extremal points is determined for each feature whereby each feature is associated with a corresponding margin value;
  
  ranking the features according to the size of the corresponding margin value, wherein the largest margin value corresponds to the highest rank;
  
  selecting a subset of features comprising a pre-determined number of highest ranked features; and
  
  generating an output comprising a report listing the selected subset of features for display or storage on a computer-readable medium at the output device.
- Dependent Claims (14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)
- - 14. The method of claim 13, further comprising, prior to separating the dataset, pre-processing the dataset by normalizing the data.
  - 15. The method of claim 13, wherein the dataset comprises gene expression data obtained from DNA micro-arrays, and each feature comprises a gene within the micro-arrays.
  - 16. The method of claim 15, wherein the gene expression data is obtained from tissue from patients with renal cancer and the two or more classes comprise different types or stages of cancer.
  - 17. The method of claim 16, wherein the gene expression data is obtained from tissue from patients with renal cancer and the two or more classes comprise cancerous tissue and normal tissue.
  - 18. The method of claim 16, wherein the selected subset of features is genes comprising small inducible cytokine A2 (monocyte chemotactic protein 1) and ATP synthase, H+ transporting, mitochondrial F1 complex, alpha subunit, isoform 1, cardiac muscle.
  - 19. The method of claim 16, wherein the selected subset of features is genes comprising acetyl-Coenzyme A acetyltransferase 1 (acetoacetyl Coenzyme A thiolase), glutamate decarboxylase 1 (brain, 67kD), and JTV1 gene.
  - 20. The method of claim 16, wherein the selected subset of features is genes comprising tissue inhibitor of metalloproteinase 3 (Sorsby fundus dystrophy, pseudoinflammatory), major histocompatibility complex, class II, DO beta, and guanylate binding protein 1, interferon-inducible, 67kD.
  - 21. The method of claim 13, wherein the step of selecting a subset of features comprises computing p-values for each feature and applying the threshold criterion based on the p-value.
  - 22. The method of claim 21, wherein the threshold criterion is 0.001.
  - 23. The method of claim 21, wherein the threshold criterion is 0.0001.
  - 24. The method of claim 13, wherein the step of selecting a subset of features comprises applying a threshold comprising an estimated upper bound determined by randomly selecting a sample set of data points, assuming the sample set of data points has a normal distribution and determining a number of data points falsely called significant as being determinative for separating the dataset into the two or more known classes as a function of a number of data points called significant.
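
Claims 9 to 11 and 21 to 23 select features by applying a threshold of 0.001 or 0.0001 to per-feature p-values. The sketch below shows only the thresholding step and assumes the p-values have already been computed by some other means, which the claim text shown here does not specify. The false-call figure it reports is the standard expectation that, if no feature were truly informative, about `n_features * threshold` features would still pass by chance. It is offered as an illustration of relating false calls to total calls, not as the estimate recited in claims 12 and 24. All names and the strict `<` comparison are assumptions.

```python
from typing import Dict, List, Tuple


def select_by_p_value(
    p_values: Dict[str, float], threshold: float = 0.001
) -> Tuple[List[str], float]:
    """Keep features whose p-value is below the threshold.

    Returns the selected feature names (sorted) and the number of features
    expected to pass by chance alone if every feature were uninformative.
    """
    selected = sorted(name for name, p in p_values.items() if p < threshold)
    expected_false_calls = len(p_values) * threshold
    return selected, expected_false_calls


if __name__ == "__main__":
    # Hypothetical gene names and p-values, for illustration only.
    demo = {"gene_a": 0.00005, "gene_b": 0.0005, "gene_c": 0.2, "gene_d": 0.003}
    for cutoff in (0.001, 0.0001):
        names, expected = select_by_p_value(demo, cutoff)
        print(cutoff, names, expected)
```

Dividing the expected false calls by the number of features actually selected gives a rough upper estimate of the fraction of the selection that may be spurious. With thousands of genes on a micro-array, that fraction can be substantial even at a 0.001 threshold.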

Current Assignee
Curtis Anderson, Health Discovery Corporation, James Roberts, Joe Mckenzie, Jules B. Paderewski, Julian N. Stern, Memorial Health Systems Incorporated, Timothy P. O'Hayer
Original Assignee
Health Discovery Corporation
Inventors
Schölkopf, Bernhard; Guyon, Isabelle; Elisseeff, André; Perez-Cruz, Fernando; Weston, Jason
Primary Examiner(s)
Vincent; David
Assistant Examiner(s)
Fernandez Rivas; Omar F

Application Number

US10/494,876
Publication Number

US 20050131847A1
Time in Patent Office

2,252 Days
Field of Search

706/1, 706/7, 706/10, 706/12-22, 706/45, 706/48, 702/1, 702/19, 702/22, 702/32, 382/103-109, 382/113, 382/115, 382/128-134, 382/155, 382/165, 382/168, 382/181, 382/191-132, 382/209, 382/218-221, 382/278, 382/291
US Class Current

706/20
CPC Class Codes

G06F 18/2113   by ranking or filtering the...

G06F 18/2411   based on the proximity to a...

G06N 20/00   Machine learning

G06N 20/10   using kernel methods, e.g. ...
