Method for selecting node variables in a binary decision tree structure

US 7,809,539 B2
Filed: 12/06/2002
Issued: 10/05/2010
Est. Priority Date: 04/10/2000
Status: Expired due to Fees

Abstract

A method for selecting node variables in a binary decision tree structure is provided. The binary decision tree is formed by mapping node variables to known outcome variables. The method calculates a statistical measure of the significance of each input variable in an input data set and then selects an appropriate node variable on which to base the structure of the binary decision tree using an averaged statistical measure of the input variable and any co-linear input variables of the data set.
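
As an illustration of the selection step the abstract describes, the following minimal sketch averages each variable's significance score with its neighbors and picks the variable with the largest averaged score. The function names, the `neighbors` parameter, the truncated window at the ends of the list, and the example scores are all assumptions made for illustration; none of them is taken from the patent.

```python
from typing import List


def averaged_statistic(stats: List[float], neighbors: int) -> List[float]:
    """Average each variable's statistic with up to `neighbors` variables on each side."""
    averaged = []
    for i in range(len(stats)):
        lo = max(0, i - neighbors)
        hi = min(len(stats), i + neighbors + 1)
        window = stats[lo:hi]  # window is truncated at the ends (an assumption)
        averaged.append(sum(window) / len(window))
    return averaged


def select_node_variable(stats: List[float], neighbors: int) -> int:
    """Return the index of the variable with the largest averaged statistic."""
    averaged = averaged_statistic(stats, neighbors)
    return max(range(len(averaged)), key=averaged.__getitem__)


# Hypothetical per-variable significance scores; index 1 is an isolated spike.
raw_scores = [1.0, 9.0, 1.0, 5.0, 7.0, 5.0, 1.0]
print(select_node_variable(raw_scores, neighbors=0))  # no averaging
print(select_node_variable(raw_scores, neighbors=1))  # one neighbor on each side
```

With these made-up scores, the unaveraged choice is the isolated spike at index 1, while averaging with one neighbor per side moves the choice to index 4, whose neighbors also score highly.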

59 Citations

20 Claims

1. A computer-implemented method for mapping one or more genomic markers to a phenotypical trait, the method comprising:
- receiving, using one or more processors, a structured data set having a plurality of genomic markers;
  
  determining, using one or more processors, a first correlating statistic for each genomic marker where the magnitude of the correlating statistic is proportional to the capability of the genomic marker to map a phenotype; and
  
  calculating, using one or more processors, a second correlating statistic for each genomic marker from a smoothing mathematical function of the determined first correlating statistic of the genomic marker and the first correlating statistic of adjacent genomic markers, wherein calculating includes:
  
  providing a neighbor parameter indicating how many adjacent genomic markers to use in calculating the second correlating statistic;
  
  providing a weight parameter indicating a weight to apply to each of the adjacent genomic markers used in calculating the second correlating statistic; and
  
  calculating the second correlating statistic for each genomic marker according to the following equation;
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 17, 18)
  - 2. The method of claim 1, further comprising:
    - repeatedly splitting the structured data set a predetermined number of times.
  - 3. The method of claim 1, wherein the first correlating statistic is a chi-squared statistic.
  - 4. The method of claim 1, wherein the phenotypical trait is whether a particular individual set of input genomic markers are associated with a disease.
  - 5. The method of claim 1, wherein a genomic marker in the structured data set has one or more neighbor genomic markers;
    - and wherein calculating the second correlating statistic measure includes averaging the first correlating statistics for a genomic marker and one or more neighbor genomic markers.
  - 6. The method of claim 5, wherein the plurality of genomic markers in the structured data set have an order;
    - wherein the order is used to determine which genomic markers are neighbors of a particular genomic marker; and
      
      wherein a change in the order of the genomic markers affects the averaging of the first correlating statistics.
  - 7. The method of claim 1, wherein the plurality of genomic markers in the structured data set are co-linear.
  - 8. The method of claim 5, wherein the averaging of the first correlating statistics results in a smoothing of the structured data set.
  - 9. The method of claim 8, wherein the smoothing of the structured data set includes eliminating one or more false identifiers from the structured data set.
  - 10. The method of claim 1, wherein receiving the structured data set, determining the first correlating statistic, and calculating the second correlating statistic are performed using a data mining software application.
  - 11. The method of claim 1, wherein the plurality of genomic markers are from DNA of a patient.
  - 12. The method of claim 11, further comprising:
    - categorizing whether the patient is more likely than not to have a disease.
  - 17. The method of claim 1, further comprising:
    - selecting a genomic marker that satisfies a predetermined criterion; and
      
      using the selected genomic marker to create a decision node by splitting the structured data set into two subsets, wherein the decision node is on a binary decision tree that maps genomic markers to phenotypical trait.
  - 18. The method of claim 17, wherein the predetermined criterion includes selecting the genomic marker with the largest second correlating statistic.
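
The two statistics in claims 1, 3 and 5 can be sketched as follows. The equation recited in claim 1 is not reproduced in this text, so the `smooth` function below is only an assumed form: a normalized weighted average driven by a neighbor parameter and a weight parameter. The function names, the truncated window at the ends of the marker list, and the example count tables are likewise hypothetical.

```python
from typing import List, Sequence


def chi_squared(table: Sequence[Sequence[float]]) -> float:
    """Pearson chi-squared statistic for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if n == 0:
        return 0.0
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip empty rows/columns to avoid dividing by zero
                stat += (observed - expected) ** 2 / expected
    return stat


def smooth(first: List[float], neighbors: int, weight: float) -> List[float]:
    """Weighted average of each marker's statistic with its adjacent markers.

    ASSUMPTION: this normalized form is illustrative only; it is not the
    equation recited in the claim.
    """
    second = []
    for i, own in enumerate(first):
        lo = max(0, i - neighbors)
        hi = min(len(first), i + neighbors + 1)
        adjacent = first[lo:i] + first[i + 1:hi]  # neighbors only, in marker order
        total = own + weight * sum(adjacent)
        second.append(total / (1.0 + weight * len(adjacent)))
    return second


# Hypothetical counts per marker: rows = allele present/absent,
# columns = affected/unaffected individuals.
tables = [
    [[10, 10], [10, 10]],
    [[20, 0], [0, 20]],
    [[10, 10], [10, 10]],
]
first = [chi_squared(t) for t in tables]
second = smooth(first, neighbors=1, weight=0.5)
print(first)
print(second)
```

Because `smooth` reads neighbors by position, reordering the markers changes the result, which is the dependence on marker order that claim 6 describes.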

13. A computer-implemented system for mapping genomic markers to a phenotypical trait, comprising:
- one or more processors;
  
  one or more computer-readable storage mediums containing software instructions executable on the one or more processors to cause the one or more processors to perform operations including:
  
  receiving a structured data set having a plurality of genomic markers;
  
  determining a first correlating statistic for each genomic marker where the magnitude of the correlating statistic is proportional to the capability of the genomic marker to map a phenotype; and
  
  calculating a second correlating statistic for each genomic marker from a smoothing mathematical function of the determined first correlating statistic of the genomic marker and the first correlating statistic of adjacent genomic markers, wherein calculating includes:
  
  providing a neighbor parameter indicating how many adjacent genomic markers to use in calculating the second correlating statistic;
  
  providing a weight parameter indicating a weight to apply to each of the adjacent genomic markers used in calculating the second correlating statistic; and
  
  calculating the second correlating statistic for each genomic marker according to the following equation;
- Dependent Claims (19, 20)
  - 19. The system of claim 13, further comprising software instructions executable on the one or more processors to cause the one or more processors to perform operations including:
    - selecting a genomic marker that satisfies a predetermined criterion; and
      
      using the selected genomic marker to create a decision node by splitting the structured data set into two subsets, wherein the decision node is on a binary decision tree that maps genomic markers to phenotypical trait.
  - 20. The system of claim 19, wherein the predetermined criterion includes selecting the genomic marker with the largest second correlating statistic.
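
The decision-node step in claims 19 and 20 might look like the sketch below: choose the marker with the largest second correlating statistic, then split the data set into two subsets on that marker. The record layout, the 0/1 marker coding, the `DecisionNode` class, and the example values are assumptions made for illustration and do not come from the patent.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Each record is (marker values, phenotype); the 0/1 coding is an assumption.
Record = Tuple[Tuple[int, ...], int]


@dataclass
class DecisionNode:
    marker_index: int
    left: List[Record]   # records where the selected marker is 0
    right: List[Record]  # records where the selected marker is 1


def create_decision_node(records: List[Record], second_stats: List[float]) -> DecisionNode:
    """Pick the marker with the largest second statistic and split on it."""
    best = max(range(len(second_stats)), key=second_stats.__getitem__)
    left = [r for r in records if r[0][best] == 0]
    right = [r for r in records if r[0][best] == 1]
    return DecisionNode(best, left, right)


# Hypothetical data: four individuals, three markers, binary phenotype.
records = [((0, 1, 0), 1), ((1, 1, 0), 1), ((0, 0, 1), 0), ((1, 0, 1), 0)]
node = create_decision_node(records, second_stats=[0.4, 3.1, 2.7])
print(node.marker_index, len(node.left), len(node.right))
```

Applying the same function again to `node.left` and `node.right`, with statistics recomputed on each subset, would be one way to grow the tree further.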

14. A computer-readable storage medium encoded with instructions that when executed on one or more processors within a computer system, perform a method for mapping one or more genomic markers to a phenotypical trait, the method comprising:
- receiving a structured data set having a plurality of genomic markers;
  
  determining a first correlating statistic for each genomic marker where the magnitude of the correlating statistic is proportional to the capability of the genomic marker to map a phenotype;
  
  calculating a second correlating statistic for each genomic marker from a smoothing mathematical function of the determined first correlating statistic of the genomic marker and the first correlating statistic of adjacent genomic markers;
  
  wherein calculating includes:
  
  providing a neighbor parameter indicating how many adjacent genomic markers to use in calculating the second correlating statistic;
  
  providing a weight parameter indicating a weight to apply to each of the adjacent genomic markers used in calculating the second correlating statistic; and
  
  calculating the second correlating statistic for each genomic marker according to the following equation;
- Dependent Claims (15, 16)
  - 15. The method of claim 14, further comprising:
    - selecting a genomic marker that satisfies a predetermined criterion; and
      
      using the selected genomic marker to create a decision node by splitting the structured data set into two subsets, wherein the decision node is on a binary decision tree that maps genomic markers to phenotypical trait.
  - 16. The method of claim 15, wherein the predetermined criterion includes selecting the genomic marker with the largest second correlating statistic.

Current Assignee: SAS Institute Incorporated
Original Assignee: SAS Institute Incorporated
Inventors: Weir, Bruce S.; Czika, Wendy; Brocklebank, John C.
Primary Examiner(s): Dejong; Eric S
Application Number: US10/313,569
Publication Number: US 20030078936A1
Time in Patent Office: 2,860 Days
Field of Search: 702/19, 702/20, 702/27, 703/11, 435/4, 435/6
US Class Current: 703/11
CPC Class Codes:
G06F 18/24323   Tree-organised classifiers
G06F 18/40   Software arrangements speci...
G16B 40/00   ICT specially adapted for b...
G16H 50/70   for mining of medical data,...
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99942   Manipulating data structure...
Y10S 707/99943   Generating database or data...
