Method for selecting node variables in a binary decision tree structure

US 20030078936A1
Filed: 12/06/2002
Published: 04/24/2003
Est. Priority Date: 04/10/2000
Status: Active Grant

Abstract

A method for selecting node variables in a binary decision tree structure is provided. The binary decision tree is formed by mapping node variables to known outcome variables. The method calculates a statistical measure of the significance of each input variable in an input data set and then selects an appropriate node variable on which to base the structure of the binary decision tree using an averaged statistical measure of the input variable and any co-linear input variables of the data set.
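
The "statistical measure of the significance of each input variable" can be illustrated with a short sketch. The abstract does not say which statistic is used; claim 4 names the chi-squared statistic, so the example below assumes a Pearson chi-squared test on a 2x2 table, with every input variable and the decision state coded as 0 or 1. The function names `chi_squared_2x2` and `significance_per_variable` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def chi_squared_2x2(variable: Sequence[int], outcome: Sequence[int]) -> float:
    """Pearson chi-squared statistic for a 0/1 variable against a 0/1 outcome."""
    if len(variable) != len(outcome):
        raise ValueError("variable and outcome must have the same length")
    n = len(variable)
    if n == 0:
        return 0.0
    # Observed counts: counts[v][o] for v, o in {0, 1}.
    counts = [[0, 0], [0, 0]]
    for v, o in zip(variable, outcome):
        counts[v][o] += 1
    row_totals = [sum(counts[0]), sum(counts[1])]
    col_totals = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]
    stat = 0.0
    for v in (0, 1):
        for o in (0, 1):
            expected = row_totals[v] * col_totals[o] / n
            if expected > 0:  # skip empty margins to avoid dividing by zero
                stat += (counts[v][o] - expected) ** 2 / expected
    return stat


def significance_per_variable(rows: Sequence[Sequence[int]],
                              outcome: Sequence[int]) -> List[float]:
    """One chi-squared value per input variable (column of `rows`)."""
    n_vars = len(rows[0])
    return [chi_squared_2x2([r[j] for r in rows], outcome) for j in range(n_vars)]
```

A variable that matches the decision state exactly receives a large statistic, while a variable that is independent of it receives zero. These per-variable values are the input to the averaging step described in the claims.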

15 Claims

1. A method of selecting node variables for use in building a binary decision tree, comprising the steps of:
- (a) providing an input data set including a plurality of input variables and an associated decision state;
  
  (b) calculating a statistical measure of the significance of each of the input variables to the associated decision state;
  
  (c) averaging the statistical measures for each of the input variables to form an averaged statistical measure for each input variable;
  
  (d) selecting the input variable with the largest average statistical measure; and
  
  (e) using the selected input variable as a node variable for splitting the input data set into two subsets that are used in building the binary decision tree.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 15):
- - 2. The method of claim 1, further comprising the steps of:
    - (f) for each of the two subsets in step (e), repeating steps (b) through (e).
  - 3. The method of claim 2, further comprising the steps of:
    - (g) repeating step (f) a predetermined number of times that corresponds to the depth of the binary decision tree.
  - 4. The method of claim 1, wherein the statistical measure of step (b) is the chi-squared statistic.
  - 5. The method of claim 1, wherein the input data set includes genomic data.
  - 6. The method of claim 5, wherein the input variables include clinical variables and marker variables.
  - 7. The method of claim 6, wherein the decision state indicates whether a particular individual set of input variables is associated with a disease.
  - 8. The method of claim 1, wherein the averaging step further comprises the steps of:
    - providing a neighbor parameter indicating how many nearby input variables to use in calculating the average statistical measure;
      
      providing a weight parameter indicating weight to apply to each of the nearby input variables used in calculating the average statistical measure; and
      
      calculating the average statistical measure for each input variable according to the following equation:
      
      $AVGCHI(j) = \frac{\sum_{k=-NEIGHBORNUM}^{NEIGHBORNUM} WEIGHTS[k + NEIGHBORNUM] \cdot MAX[j + k]}{\sum_{k=0}^{2 \cdot NEIGHBORNUM} WEIGHTS[k]},$
      
      wherein NEIGHBORNUM is the neighbor parameter, WEIGHTS is an array of weight parameters having length NEIGHBORNUM, and MAX is the statistical measure.
  - 9. The method of claim 1, further comprising the step of:
    - storing the statistical measures of the significance of each of the input variables for use in the averaging step.
  - 15. The method of claim 1, wherein the calculating step further comprises the steps of:
    - providing a neighbor parameter indicating how many nearby genomic markers to use in calculating the second statistical measure;
      
      providing a weight parameter indicating a weight to apply to each of the nearby genomic markers used in calculating the second statistical measure; and
      
      calculating the second statistical measure for each input variable according to the following equation:
      
      $AVGCHI(j) = \frac{\sum_{k=-NEIGHBORNUM}^{NEIGHBORNUM} WEIGHTS[k + NEIGHBORNUM] \cdot MAX[j + k]}{\sum_{k=0}^{2 \cdot NEIGHBORNUM} WEIGHTS[k]},$
      
      wherein NEIGHBORNUM is the neighbor parameter, WEIGHTS is an array of weight parameters having length NEIGHBORNUM, and MAX is the statistical measure.
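
The AVGCHI equation of claims 8 and 15, together with the selection of step (d), can be sketched as follows. Two points are assumptions rather than claim text. First, the equation indexes WEIGHTS from 0 through 2 * NEIGHBORNUM, so the sketch expects 2 * NEIGHBORNUM + 1 weights, although the claims describe the array as having length NEIGHBORNUM. Second, the claims do not say how to treat neighbors that fall outside the variable list; here they simply contribute zero to the numerator. The names `avg_chi` and `select_node_variable` are hypothetical.

```python
from typing import List, Sequence


def avg_chi(stats: Sequence[float], weights: Sequence[float],
            neighbor_num: int) -> List[float]:
    """Weighted neighbor average of per-variable statistics (AVGCHI)."""
    if len(weights) != 2 * neighbor_num + 1:
        raise ValueError("expected 2 * neighbor_num + 1 weights")
    total_weight = sum(weights)  # denominator: sum of WEIGHTS[0 .. 2 * NEIGHBORNUM]
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    averaged = []
    for j in range(len(stats)):
        acc = 0.0
        for k in range(-neighbor_num, neighbor_num + 1):
            idx = j + k
            # Assumption: neighbors beyond either end contribute nothing.
            if 0 <= idx < len(stats):
                acc += weights[k + neighbor_num] * stats[idx]
        averaged.append(acc / total_weight)
    return averaged


def select_node_variable(stats: Sequence[float], weights: Sequence[float],
                         neighbor_num: int) -> int:
    """Index of the variable with the largest averaged statistic (step (d))."""
    averaged = avg_chi(stats, weights, neighbor_num)
    return max(range(len(averaged)), key=averaged.__getitem__)
```

With `stats = [1.0, 9.0, 2.0, 6.0, 7.0, 6.0]`, `weights = [1.0, 2.0, 1.0]`, and `neighbor_num = 1`, the largest raw statistic is at index 1, but the averaged values favor index 4, whose neighbors are also high. This is the intended effect of averaging: an isolated spike loses to a run of nearby variables that are jointly significant.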

10. A method for mapping genomic markers to a phenotypical trait, comprising the steps of:
- (a) receiving a structured data set having a plurality of genomic markers;
  
  (b) determining a first correlating statistic for each genomic marker where the magnitude of the correlating statistic is proportional to the capability of the genomic marker to map the phenotype;
  
  (c) calculating a second correlating statistic for each genomic marker using values of the genomic marker and adjacent genomic markers; and
  
  (d) selecting the largest second correlating statistic from the genomic markers;
  
  the genomic marker having the largest second correlating statistic being used as a decision node of a binary decision tree thereby splitting the data set into two subsets.
- Dependent claims (11, 12, 13, 14):
- - 11. The method of claim 10, further comprising the steps of:
    - (e) for each of the two subsets in step (d), repeating steps (b) and (c).
  - 12. The method of claim 11, further comprising the steps of:
    - (f) repeating step (e) a predetermined number of times that corresponds to the depth of the binary decision tree.
  - 13. The method of claim 10, wherein the first statistical measure is the chi-squared statistic.
  - 14. The method of claim 10, wherein the phenotypical trait is whether a particular individual set of input genomic markers is associated with a disease.
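
The recursion described in claims 2 and 3 (and 11 and 12) can be sketched as a depth-limited tree builder. This is a minimal illustration under stated assumptions, not the patented implementation: markers and the trait are coded 0 or 1, the first statistic is a Pearson chi-squared value, neighbors beyond the ends of the marker list contribute zero, and a node becomes a leaf when the depth budget runs out or a split would leave one side empty. All function names and the dictionary layout of the tree are hypothetical.

```python
from typing import Dict, List, Sequence


def chi2(xs: List[int], ys: List[int]) -> float:
    """Pearson chi-squared statistic for two 0/1 lists of equal length."""
    n = len(xs)
    if n == 0:
        return 0.0
    stat = 0.0
    for v in (0, 1):
        for o in (0, 1):
            observed = sum(1 for x, y in zip(xs, ys) if x == v and y == o)
            expected = xs.count(v) * ys.count(o) / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def smooth(stats: Sequence[float], weights: Sequence[float], nn: int) -> List[float]:
    """Weighted neighbor average; out-of-range neighbors contribute zero."""
    total = sum(weights)
    return [
        sum(weights[k + nn] * stats[j + k]
            for k in range(-nn, nn + 1) if 0 <= j + k < len(stats)) / total
        for j in range(len(stats))
    ]


def build_tree(rows: List[List[int]], outcome: List[int], depth: int,
               weights: Sequence[float], nn: int) -> Dict:
    """Split on the marker with the largest averaged statistic, then recurse."""
    leaf = {"leaf": True, "n": len(rows), "positives": sum(outcome)}
    if depth == 0 or len(set(outcome)) <= 1:
        return leaf
    n_vars = len(rows[0])
    stats = [chi2([r[j] for r in rows], outcome) for j in range(n_vars)]
    averaged = smooth(stats, weights, nn)
    best = max(range(n_vars), key=averaged.__getitem__)
    zero = [i for i, r in enumerate(rows) if r[best] == 0]
    one = [i for i, r in enumerate(rows) if r[best] == 1]
    if not zero or not one:  # the chosen marker does not separate the data
        return leaf
    return {
        "leaf": False,
        "variable": best,
        "zero": build_tree([rows[i] for i in zero], [outcome[i] for i in zero],
                           depth - 1, weights, nn),
        "one": build_tree([rows[i] for i in one], [outcome[i] for i in one],
                          depth - 1, weights, nn),
    }
```

The `depth` argument plays the role of the "predetermined number of times" in claims 3 and 12: each recursive call repeats the calculate, average, select, and split steps on one of the two subsets until the budget is spent.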

Current Assignee: Bruce S. Weir, John C. Brocklebank, Wendy Czika
Original Assignee: Bruce S. Weir, John C. Brocklebank, Wendy Czika
Inventors: John C. Brocklebank, Bruce S. Weir, Wendy Czika

Granted Patent: US 7,809,539 B2
US Class Current: 707/101
CPC Class Codes

G06F 18/24323   Tree-organised classifiers

G06F 18/40   Software arrangements speci...

G16B 40/00   ICT specially adapted for b...

G16H 50/70   for mining of medical data,...

Y10S 707/99932   Access augmentation or opti...

Y10S 707/99935   Query augmenting and refini...

Y10S 707/99942   Manipulating data structure...

Y10S 707/99943   Generating database or data...
