METHOD FOR FEATURE SELECTION AND FOR EVALUATING FEATURES IDENTIFIED AS SIGNIFICANT FOR CLASSIFYING DATA

US 20110078099A1
Filed: 09/26/2010
Published: 03/31/2011
Est. Priority Date: 05/18/2001
Status: Active Grant

Abstract

A group of features that has been identified as “significant” in being able to separate data into classes is evaluated using a support vector machine which separates the dataset into classes one feature at a time. After separation, an extremal margin value is assigned to each feature based on the distance between the lowest feature value in the first class and the highest feature value in the second class. Separately, extremal margin values are calculated for a normal distribution within a large number of randomly drawn example sets for the two classes to determine the number of examples within the normal distribution that would have a specified extremal margin value. Using p-values calculated for the normal distribution, a desired p-value is selected. The specified extremal margin value corresponding to the selected p-value is compared to the calculated extremal margin values for the group of features. The features in the group that have a calculated extremal margin value less than the specified margin value are labeled as falsely significant.
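
As a rough illustration of the extremal margin described in the abstract, the following Python sketch computes it for each feature of a tiny two-class dataset. The helper name `extremal_margin`, the 0/1 label encoding, and the sample values are assumptions made for illustration. The sketch measures the gap between the classes directly instead of training a support vector machine, and it checks both orientations because either class may hold the larger values.

```python
import numpy as np

def extremal_margin(values, labels):
    """Gap between two classes along one feature (hypothetical helper).

    A positive result means the classes are fully separated by this feature;
    a negative result means their value ranges overlap.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    pos = values[labels == 1]
    neg = values[labels == 0]
    # Lowest value in one class minus highest value in the other class,
    # taken in whichever orientation gives the larger gap.
    return float(max(pos.min() - neg.max(), neg.min() - pos.max()))

# Illustrative data: 4 examples (rows), 2 features (columns).
X = np.array([[5.0, 1.0],
              [6.0, 3.0],
              [1.0, 2.0],
              [2.0, 4.0]])
y = np.array([1, 1, 0, 0])

# One margin per feature, evaluated one feature at a time.
margins = [extremal_margin(X[:, j], y) for j in range(X.shape[1])]
print(margins)
```

In this toy data the first feature separates the classes with a gap of 3, while the second feature overlaps and gets a negative margin.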

8 Claims

1. A method for estimating a subset of features falsely labeled “significant” within a group of features which appear able to separate a dataset comprising multiple examples into two or more classes, the method comprising:
  
  inputting the dataset into a computer adapted for implementing a support vector machine;
  
  separately for each feature of the group of features, assigning a value to the feature by:
  
  processing the dataset using the support vector machine to separate the examples into classes according to known outcomes, wherein the classes comprise one class having one set of feature values and at least one other class having another set of feature values;
  
  calculating an extremal margin value between a lowest feature value in the one class and the highest feature value in the at least one other class;
  
  generating a list of the group of features and their calculated extremal margin values;
  
  before or after assigning a value to the feature, determining a probability of obtaining an extremal margin value that exceeds a normal distribution of extremal margin values by:
  
  drawing a set of examples from each class at random according to a normal distribution;
  
  processing the randomly drawn example set using the support vector machine for each feature of the group of features to separate the randomly drawn example set into classes;
  
  computing the extremal margin value within the randomly drawn example set;
  
  repeating the steps of drawing, processing and computing for a large number of randomly drawn sets;
  
  generating a table comprising estimated p-values, wherein the estimated p-value is a fraction of the large number of randomly drawn sets in which the computed extremal margin value exceeds a specified extremal margin value;
  
  selecting a desired p-value;
  
  determining from the table the specified extremal margin value corresponding to the desired p-value;
  
  identifying as falsely significant features the features on the list of the group of features that have an extremal margin value of less than the specified extremal value corresponding to the desired p-value;
  
  generating an output comprising a listing of the falsely significant features; and
  
  transferring the output to a media.
2. The method of claim 1, further comprising pre-processing the dataset by normalizing the data.

3. The method of claim 1, wherein the dataset comprises gene expression data and each feature comprises a gene.

4. The method of claim 3, wherein the classes correspond to diseases or conditions.

5. The method of claim 4, further comprising generating a listing of significant features remaining after eliminating the falsely significant features.

6. The method of claim 5, wherein the disease comprises renal cancer and the significant features comprise small inducible cytokine A2 (monocyte chemotactic protein 1) and ATP synthase, H+ transporting, mitochondrial F1 complex, alpha subunit, isoform 1, cardiac muscle.

7. The method of claim 1, wherein the media comprises a disk drive or removable media.

8. The method of claim 1, further comprising displaying the output on a display device.
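
The random-draw steps of claim 1 can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the patented implementation: it draws both classes from a standard normal distribution (which presumes the real features were normalized, as in claim 2), it measures the class gap directly instead of training a support vector machine, and all function names, the class sizes, the number of draws, and the feature names `gene_A` and `gene_B` are hypothetical.

```python
import numpy as np

def extremal_margin(a, b):
    """Gap between two classes along one feature, in the better orientation."""
    return max(a.min() - b.max(), b.min() - a.max())

def null_margins(n_a, n_b, n_draws, rng):
    """Extremal margins of random features that carry no class information."""
    out = np.empty(n_draws)
    for i in range(n_draws):
        a = rng.standard_normal(n_a)  # random examples for one class
        b = rng.standard_normal(n_b)  # random examples for the other class
        out[i] = extremal_margin(a, b)
    return out

def margin_threshold(null, p_value):
    """Margin exceeded by roughly a fraction p_value of the random draws."""
    return float(np.quantile(null, 1.0 - p_value))

def falsely_significant(margins, threshold):
    """Names of features whose margin falls below the threshold."""
    return [name for name, m in margins.items() if m < threshold]

rng = np.random.default_rng(0)
null = null_margins(n_a=10, n_b=10, n_draws=2000, rng=rng)

# Specified extremal margin value for a desired p-value of 0.05.
threshold = margin_threshold(null, p_value=0.05)

# Hypothetical per-feature margins computed on the real dataset.
margins = {"gene_A": 5.0, "gene_B": -10.0}
flagged = falsely_significant(margins, threshold)
print(threshold, flagged)
```

Taking a quantile of the simulated margins is a shortcut for the table lookup in the claim: it returns the margin value that only the desired fraction of random draws exceeds.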

Current Assignee
Curtis Anderson, Health Discovery Corporation, James Roberts, Joe Mckenzie, Jules B. Paderewski, Julian N. Stern, Memorial Health Systems Incorporated, Timothy P. O'Hayer
Original Assignee
Health Discovery Corporation
Inventors
Guyon, Isabelle; Elisseeff, André; Weston, Jason Aaron Edward; Schöelkopf, Bernhard; Perez-Cruz, Fernando

Granted Patent

US 7,970,718 B2
US Class Current

706/12
CPC Class Codes

G06F 18/2115   by evaluating different sub...

G06N 20/00   Machine learning

G06N 20/10   using kernel methods, e.g. ...

G16B 25/00   ICT specially adapted for h...

G16B 25/10   Gene or protein expression ...

G16B 40/00   ICT specially adapted for b...

G16B 40/20   Supervised data analysis

G16B 40/30   Unsupervised data analysis
