Feature selection method using support vector machine classifier

US 7,542,959 B2
Filed: 08/21/2007
Issued: 06/02/2009
Est. Priority Date: 05/01/1998
Status: Expired due to Fees

Abstract

Identification of a determinative subset of features from within a large set of features is performed by training a support vector machine to rank the features according to classifier weights, where features are removed to determine how their removal affects the value of the classifier weights. The features having the smallest weight values are removed and a new support vector machine is trained with the remaining features. The process is repeated until a relatively small subset of features remains that is capable of accurately separating the data into different patterns or classes. The method is applied to select the smallest number of genes capable of accurately distinguishing between medical conditions such as cancer and non-cancer.
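
The train, rank, remove, retrain loop described in the abstract can be sketched as follows. This is an illustrative sketch, not the patented implementation: the function names (`train_linear_svm`, `select_features`), the batch subgradient-descent trainer, its hyperparameters, the use of the squared weight as the ranking value, and the toy data are all assumptions made for the example.

```python
from typing import List, Tuple

Vector = List[float]


def train_linear_svm(X: List[Vector], y: List[int], lam: float = 0.01,
                     lr: float = 0.1, epochs: int = 200) -> Tuple[Vector, float]:
    """Fit a linear SVM by batch subgradient descent on the L2-regularised
    hinge loss. Labels must be +1 or -1. (Stand-in trainer, an assumption.)"""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # only margin violators contribute to the hinge term
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def select_features(X: List[Vector], y: List[int], n_keep: int,
                    n_remove_per_step: int = 1) -> List[int]:
    """Repeatedly train, rank by squared weight, and drop the lowest-ranked
    features until n_keep original column indices remain."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_keep:
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(X_sub, y)
        # Positions sorted from smallest to largest squared weight.
        order = sorted(range(len(remaining)), key=lambda k: w[k] ** 2)
        n_drop = min(n_remove_per_step, len(remaining) - n_keep)
        dropped = set(order[:n_drop])
        remaining = [f for k, f in enumerate(remaining) if k not in dropped]
    return remaining


# Toy data (hypothetical): columns 0 and 2 track the label, the rest are noise.
Y_TOY = [1, 1, 1, -1, -1, -1]
X_TOY = [
    [1.0, 0.1, 2.0, 0.0, -0.1],
    [1.0, -0.1, 2.0, 0.1, 0.0],
    [1.0, 0.0, 2.0, -0.1, 0.1],
    [-1.0, 0.1, -2.0, 0.0, -0.1],
    [-1.0, -0.1, -2.0, 0.1, 0.0],
    [-1.0, 0.0, -2.0, -0.1, 0.1],
]

if __name__ == "__main__":
    print(select_features(X_TOY, Y_TOY, n_keep=2))
```

Retraining after every removal matters because the weights of the surviving features can change once other features are gone, so a ranking computed once at the start may not stay valid.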

19 Claims

1. A computer-implemented method for predicting patterns in biological data, wherein the data comprises a large set of features that describe the data and a sample set from which the biological data is obtained is much smaller than the large set of features, the method comprising:
- identifying a determinative subset of features that are most correlated to the patterns comprising:
  
  (a) inputting the data into a computer processor programmed for executing support vector machine classifiers;
  
  (b) training a support vector machine classifier with a training data set comprising at least a portion of the sample set and having known outcomes with respect to the patterns, wherein the classifier comprises weights having weight values that correspond to the features in the data set and removal of a subset of features affects the weight values;
  
  (c) ranking the features according to their corresponding weight values;
  
  (d) removing one or more features corresponding to the smallest weight values;
  
  (e) training a new classifier with the remaining features;
  
  (f) repeating steps (c) through (e) for a plurality of iterations until a final subset having a pre-determined number of features remains; and
  
  generating at a printer or display device a report comprising a listing of the features in the final subset, wherein the final subset comprises the determinative subset of features for determining biological characteristics of the sample set.
  - 2. The method of claim 1, wherein step (d) comprises eliminating multiple features corresponding to the smallest ranking criteria so that the number of features is reduced by the closest power of two to the number of remaining features.
  - 3. The method of claim 1, wherein the one or more features removed in step (d) comprises up to half of the remaining features.
  - 4. The method of claim 1, wherein step (d) comprises eliminating a plurality of features corresponding to the smallest ranking criteria so that the number of features in the first iteration is reduced by up to half of the remaining features until a specified number of features remain and thereafter removing one feature per iteration.
  - 5. The method of claim 1, wherein the patterns comprise disease and normal.
  - 6. The method of claim 1, wherein the patterns comprise different diseases or conditions.
  - 7. The method of claim 1, wherein the sample set is divided into a first portion and a second, smaller portion, the method further comprising using the second, smaller portion of the sample set as a test data set for determining classifier quality.
  - 8. The method of claim 5, wherein the biological data is gene expression data and the features comprise genes.
  - 9. The method of claim 5, wherein the features comprise proteins.
  - 10. The method of claim 6, wherein the biological data is gene expression data and the features comprise genes.
  - 11. The method of claim 6, wherein the features comprise proteins.
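
The removal schedules in claims 2 through 4 can be illustrated by listing the feature counts each schedule visits. The sketch below rests on assumptions: the helper names are hypothetical, and claim 2 is read here as "trim down to the next lower power of two at each step", which is one plausible interpretation of the claim language rather than a definitive one.

```python
from typing import List


def largest_power_of_two_below(n: int) -> int:
    """Largest power of two strictly less than n (requires n >= 2)."""
    p = 1
    while p * 2 < n:
        p *= 2
    return p


def schedule_power_of_two(n_start: int, n_target: int) -> List[int]:
    """Feature counts when each step trims to the next lower power of two
    (assumed reading of claim 2), never going below n_target."""
    if n_target < 1:
        raise ValueError("n_target must be at least 1")
    counts = [n_start]
    n = n_start
    while n > n_target:
        n = max(largest_power_of_two_below(n), n_target)
        counts.append(n)
    return counts


def schedule_halve_then_single(n_start: int, n_target: int,
                               switch_at: int) -> List[int]:
    """Feature counts when up to half of the features are removed per step
    until switch_at remain, then one feature per step (claims 3 and 4)."""
    if n_target < 1 or switch_at < 1:
        raise ValueError("n_target and switch_at must be at least 1")
    counts = [n_start]
    n = n_start
    while n > n_target:
        if n > switch_at:
            # Remove floor(n / 2) features, but do not overshoot either bound.
            n = max(n - n // 2, switch_at, n_target)
        else:
            n -= 1
        counts.append(n)
    return counts


if __name__ == "__main__":
    print(schedule_power_of_two(100, 4))
    print(schedule_halve_then_single(20, 2, 5))
```

Removing features in large chunks early on saves many retraining rounds when the feature set is large, while the one-at-a-time tail gives a finer ranking among the last few candidates. Setting `switch_at` equal to `n_target` yields a pure halving schedule in the style of claim 3.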

12. A computer program product embodied on a computer readable medium for predicting patterns in data without overfitting by identifying a determinative subset of features that are most correlated to the patterns, wherein the data comprises a large set of features that describe the data, the computer program product comprising instructions for executing support vector machine classifiers and further for causing a computer processor to:
- (a) receive the data;
  
  (b) train a support vector machine classifier with a training data set having known outcomes with respect to the patterns, wherein the training data set has a number of training patterns that is much smaller than the number of features in the large set of features, and wherein the classifier comprises weights having weight values that correspond to the features in the data set and removal of a subset of features affects the weight values;
  
  (c) rank the features according to their corresponding weight values;
  
  (d) remove one or more features corresponding to the smallest weight values;
  
  (e) train a new classifier with the remaining features;
  
  (f) repeat steps (c) through (e) for a plurality of iterations until a final subset having a pre-determined number of features remains; and
  
  (g) generate at a printer or display device a report comprising a listing of the features in the final subset, wherein the final subset comprises the determinative subset of features.
  - 13. The computer program product of claim 12, wherein step (d) comprises eliminating multiple features corresponding to the smallest ranking criteria so that the number of features is reduced by the closest power of two to the number of remaining features.
  - 14. The computer program product of claim 12, wherein the one or more features removed in step (d) comprises up to half of the remaining features.
  - 15. The computer program product of claim 12, wherein step (d) comprises eliminating a plurality of features corresponding to the smallest ranking criteria so that the number of features in the first iteration is reduced by up to half of the remaining features until a specified number of features remain and thereafter removing one feature per iteration.

16. An apparatus comprising:
- a computer processor;
  
  a memory;
  
  a computer readable medium storing a computer program product for predicting patterns in data without overfitting by identifying a determinative subset of features that are most correlated to the patterns, wherein the data comprises a large set of features that describe the data, the computer program product comprising instructions for executing support vector machine classifiers and further for causing a computer processor to:
  
  (a) receive the data;
  
  (b) train a support vector machine classifier with a training data set having known outcomes with respect to the patterns, wherein the training data set has a number of training patterns that is much smaller than the number of features in the large set of features, and wherein the classifier comprises weights having weight values that correspond to the features in the data set and removal of a subset of features affects the weight values;
  
  (c) rank the features according to their corresponding weight values;
  
  (d) remove one or more features corresponding to the smallest weight values;
  
  (e) train a new classifier with the remaining features;
  
  (f) repeat steps (c) through (e) for a plurality of iterations until a final subset having a pre-determined number of features remains; and
  
  (g) generate at a printer or display device a report comprising a listing of the features in the final subset, wherein the final subset comprises the determinative subset of features.
  - 17. The apparatus of claim 16, wherein step (d) comprises eliminating multiple features corresponding to the smallest ranking criteria so that the number of features is reduced by the closest power of two to the number of remaining features.
  - 18. The apparatus of claim 16, wherein the one or more features removed in step (d) comprises up to half of the remaining features.
  - 19. The apparatus of claim 16, wherein step (d) comprises eliminating a plurality of features corresponding to the smallest ranking criteria so that the number of features in the first iteration is reduced by up to half of the remaining features until a specified number of features remain and thereafter removing one feature per iteration.

Current Assignee: Timothy P. O'Hayer, Memorial Health Systems Incorporated, James Roberts, Curtis Anderson, Health Discovery Corporation, Julian N. Stern, Jules B. Paderewski, Joe Mckenzie
Original Assignee: Health Discovery Corporation
Inventors: Guyon, Isabelle; Weston, Jason; Barnhill, Stephen
Primary Examiner(s): Vincent; David R
Assistant Examiner(s): Brown, Jr.; Nathan H
Application Number: US11/842,934
Publication Number: US 20080033899A1
Time in Patent Office: 651 Days
Field of Search: 706/48, 706/20, 706/25, 382/159, 382/224, 382/225
US Class Current: 706/48
CPC Class Codes

G06F 18/211   Selection of the most signi...

G06F 18/2113   by ranking or filtering the...

G06F 18/2115   by evaluating different sub...

G06F 18/214   Generating training pattern...

G06F 18/2411   based on the proximity to a...

G06N 20/00   Machine learning

G06N 20/10   using kernel methods, e.g. ...

G06Q 10/0637   Strategic management or ana...

G06Q 10/10   Office automation; Time man...

G06Q 20/10   specially adapted for elect...

G06Q 40/06   Asset management; Financial...

G06T 7/0012   Biomedical image inspection

G06V 10/764   using classification, e.g. ...

G06V 10/771   Feature selection, e.g. sel...

G06V 10/774   Generating sets of training...

G16B 25/00   ICT specially adapted for h...

G16B 40/00   ICT specially adapted for b...

G16B 40/20   Supervised data analysis

G16B 40/30   Unsupervised data analysis

G16H 10/40   for data related to laborat...

Y02A 90/10   Information and communicati...
