Categorization and filtering of scientific data

US 9,141,913 B2
Filed: 03/04/2009
Issued: 09/22/2015
Est. Priority Date: 12/16/2005
Status: Active Grant

Abstract

The present invention relates to methods, systems and apparatus for capturing, integrating, organizing, navigating and querying large-scale data from high-throughput biological and chemical assay platforms. It provides a highly efficient meta-analysis infrastructure for performing research queries across a large number of studies and experiments from different biological and chemical assays, data types and organisms, as well as systems to build and add to such an infrastructure. According to various embodiments, methods, systems and interfaces for associating experimental data, features and groups of data related by structure and/or function with chemical, medical and/or biological terms in an ontology or taxonomy are provided. According to various embodiments, methods, systems and interfaces for filtering data by data source information are provided, allowing dynamic navigation through large amounts of data to find the most relevant results for a particular query.

21 Claims

1. A computer implemented method for evaluating a correlation between (i) a gene, a SNP, a SNP pattern, a portion of gene, a region of a genome, or a compound, and (ii) a disease or a genotype, the method comprising:
- providing a taxonomy of categories of diseases and/or phenotypes arranged in a hierarchical structure comprising at least one top-level category;
  
  providing a plurality of feature sets, each feature set comprising (a) two or more features, (b) associated experimentally-derived statistical information indicating one or more of: differential expression of said features, abundance of said features, responses of said features to a treatment or stimulus, and effects of said features on biological systems, and (c) a feature rank indicating the importance of the feature in an experiment from which the statistical information was derived, wherein
  
  - the features are genes, SNPs, SNP patterns, portions of genes, regions of a genome, or compounds,
  - at least some of the features have different names but correspond to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound,
  - the plurality of feature sets is obtained from across different experiments, platforms, and/or organisms, and
  - at least some of said feature sets are associated with one or more categories in the taxonomy;
  
  providing a plurality of globally unique mapping identifiers;
  
  identifying, for each globally unique mapping identifier, one or more features associated with the globally unique mapping identifier;
  
  mapping, for each globally unique mapping identifier, the identified one or more features to the globally unique mapping identifier, thereby providing mapping data indicating mapping between a plurality of features and the plurality of globally unique mapping identifiers, wherein at least some features having different names but corresponding to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound are mapped to a same globally unique mapping identifier;
  
  storing the mapping data in an index set;
  
  identifying, for each of a plurality of the categories in the taxonomy, contributing feature sets that contribute to scoring a category under consideration by identifying all feature sets among the provided feature sets that are associated with the category under consideration and its child categories in the taxonomy;
  
  combining the feature ranks of all features in the contributing feature sets that can be mapped to a globally unique mapping identifier under consideration based on the mapping data in the index set to obtain an overall score; and
  
  evaluating a correlation between (i) a gene, a SNP, a SNP pattern, a portion of gene, a region of a genome, or a compound corresponding to the globally unique mapping identifier under consideration, and (ii) a disease or a genotype corresponding to the category under consideration based on the obtained overall score.
- Dependent claims (2-14):
  - 2. The computer-implemented method of claim 1, wherein identifying contributing feature sets further comprises filtering the identified contributing feature sets to remove at least some less relevant feature sets.
  - 3. The computer-implemented method of claim 1, further comprising:
    - for each of a plurality of categories in the taxonomy, receiving feature set-feature set correlation scores between the contributing feature sets and a feature set obtained from a person; and
      
      combining the feature set-feature set correlation scores to obtain a category-feature set score indicating the relevance of the category under consideration to the feature set obtained from the person; and
      
      determining whether the person is likely to have the disease or the phenotype by comparing the category-feature set score to a criterion.
  - 4. The computer-implemented method of claim 3, further comprising:
    - administering, in response to determining that the person is likely to have the disease, the compound identified as the drug for the treatment of the disease.
  - 5. The computer-implemented method of claim 1, further comprising:
    - providing a plurality of feature groups each including a list of features related by a biological structure or function;
      
      for each of a plurality of categories in the taxonomy, receiving from one or more storage devices feature set-feature group correlation scores between the contributing feature sets and a plurality of the feature groups in the knowledge base;
      
      combining the feature set-feature group correlation scores to obtain category-feature group scores indicating the relevance of the category under consideration to each of a plurality of feature groups wherein each score provides an indication of the relevance of the category under consideration to the feature group under consideration; and
      
      determining whether the biological structure or function is likely linked to the disease or the phenotype by comparing the category-feature set score to a criterion.
  - 6. The computer-implemented method of claim 5, wherein the biological structure or function comprises a biological pathway.
  - 7. The computer-implemented method of claim 1, wherein retrieving feature ranks comprises receiving normalized ranks of the features in the contributing feature sets.
  - 8. The computer-implemented method of claim 1, further comprising:
    - receiving feature set-feature set correlation scores from the one or more storage devices between the contributing feature sets and feature sets that contribute to the scoring of a plurality of the other categories in the knowledge base; and
      
      obtaining category-category scores based on the feature set-feature set correlation scores, the category-category scores indicating the relevance of the category under consideration to other categories in the taxonomy.
  - 9. The computer-implemented method of claim 3, further comprising generating the feature set obtained from the person from raw data from a biological sample of the person, wherein the raw data includes information on one or more features with indications of one or more of:
    - differential expression, abundance of said features, responses of said features to a treatment or stimulus, and effects of said features on biological systems.
  - 10. The computer-implemented method of claim 9 further comprising importing the one or more generated feature sets into the knowledge base.
  - 11. The computer-implemented method of claim 1 further comprising displaying to a user a list of categories relevant to identified information in the knowledge base.
  - 12. The computer-implemented method of claim 1, wherein obtaining an overall score further comprises standardizing the feature ranks of the feature under consideration.
  - 13. The computer-implemented method of claim 12, wherein standardizing the feature ranks of the feature under consideration includes using one or more of the following:
    - a normalized rank of the feature in each of the contributing feature sets for the category under consideration;
      
      the total number of feature sets containing this feature that pass an inclusion criteria, and
      
      the total number of contributing feature sets identified for the category under consideration.
  - 14. The computer-implemented method of claim 1, wherein the identified one or more features and the globally unique mapping identifier are associated based on synonymy, a structural relation, a functional relation, a genomic coordinate, a chromosomal coordinate, and/or a sequence similarity.
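
The scoring flow recited in claim 1 (with the rank normalization mentioned in claims 7, 12 and 13) can be illustrated with a minimal Python sketch. The claims do not give a formula, so everything concrete here is an assumption made for illustration: the toy taxonomy, the `INDEX_SET` synonym table, the feature-set layout, the `(size - position + 1) / size` normalization, and the choice to average over all contributing feature sets.

```python
# Hypothetical taxonomy: parent category -> child categories.
TAXONOMY = {
    "disease": ["cancer", "metabolic disease"],
    "cancer": ["breast cancer", "lung cancer"],
}

# Hypothetical index set: feature name (any synonym) -> globally unique mapping ID.
INDEX_SET = {
    "TP53": "GID:1", "p53": "GID:1", "LFS1": "GID:1",
    "BRCA1": "GID:2", "RNF53": "GID:2",
}

# Hypothetical feature sets: tagged categories plus features listed in rank
# order (first entry = most important feature in that experiment).
FEATURE_SETS = [
    {"name": "fs_breast", "categories": {"breast cancer"}, "ranked": ["p53", "BRCA1", "GENE_X"]},
    {"name": "fs_lung", "categories": {"lung cancer"}, "ranked": ["GENE_Y", "TP53"]},
    {"name": "fs_diabetes", "categories": {"metabolic disease"}, "ranked": ["TP53"]},
]


def descendants(category, taxonomy=TAXONOMY):
    """Return the category together with all of its child categories, recursively."""
    found = {category}
    stack = [category]
    while stack:
        for child in taxonomy.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def contributing_feature_sets(category, feature_sets=FEATURE_SETS):
    """Feature sets tagged with the category under consideration or any child category."""
    scope = descendants(category)
    return [fs for fs in feature_sets if fs["categories"] & scope]


def normalized_rank(position, size):
    """Map a 1-based rank to (0, 1]; 1.0 is the top-ranked feature (assumed convention)."""
    return (size - position + 1) / size


def overall_score(mapping_id, category):
    """Average normalized rank of the mapped feature across contributing feature sets.

    Feature sets that lack the feature contribute 0 (an assumption, not claim language).
    """
    contributing = contributing_feature_sets(category)
    if not contributing:
        return 0.0
    total = 0.0
    for fs in contributing:
        size = len(fs["ranked"])
        for position, name in enumerate(fs["ranked"], start=1):
            if INDEX_SET.get(name) == mapping_id:
                total += normalized_rank(position, size)
                break  # count each feature set at most once
    return total / len(contributing)


print(overall_score("GID:1", "cancer"))
```

In this toy data, "p53" and "TP53" resolve to the same mapping identifier, so both the breast-cancer and lung-cancer feature sets contribute to the score for the parent category "cancer", while the feature set tagged "metabolic disease" is excluded.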

15. A computer program product comprising one or more computer-readable non-transitory storage media having stored thereon computer executable instructions that, when executed by one or more processors of a computer system, cause the computer system to implement a method for evaluating a correlation between (i) a gene, a SNP, a SNP pattern, a portion of gene, a region of a genome, or a compound, and (ii) a disease or a genotype, the method comprising:
- providing a taxonomy of categories of diseases and/or phenotypes arranged in a hierarchical structure comprising at least one top-level category;
  
  providing a plurality of feature sets, each feature set comprising (a) two or more features, (b) associated statistical information indicating one or more of: differential expression of said features, abundance of said features, responses of said features to a treatment or stimulus, and effects of said features on biological systems, and (c) a feature rank indicating the importance of the feature in an experiment from which the statistical information was derived, wherein
  
  - the features are genes, SNPs, SNP patterns, portions of genes, regions of a genome, or compounds,
  - at least some of the features have different names but correspond to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound,
  - the plurality of feature sets is obtained from across different experiments, platforms, and/or organisms, and
  - at least some of said feature sets are associated with one or more categories in the taxonomy;
  
  providing a plurality of globally unique mapping identifiers;
  
  identifying, for each globally unique mapping identifier, one or more features associated with the globally unique mapping identifier;
  
  mapping, for each globally unique mapping identifier, the identified one or more features to the globally unique mapping identifier, thereby providing mapping data indicating mapping between a plurality of features and the plurality of globally unique mapping identifiers, wherein at least some features having different names but corresponding to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound are mapped to a same globally unique mapping identifier;
  
  storing the mapping data in an index set;
  
  identifying, for each of a plurality of the categories in the taxonomy, contributing feature sets that contribute to scoring a category under consideration by identifying all feature sets among the provided feature sets that are associated with the category under consideration and/or its child categories in the taxonomy;
  
  combining the feature ranks of all features in the contributing feature sets that can be mapped to a globally unique mapping identifier under consideration based on the mapping data in the index set to obtain an overall score; and
  
  evaluating a correlation between (i) a gene, a SNP, a SNP pattern, a portion of gene, a region of a genome, or a compound corresponding to the globally unique mapping identifier under consideration, and (ii) a disease or a genotype corresponding to the category under consideration based on the obtained overall score.

16. A computer system comprising:
- one or more processors; and
  
  one or more storage devices in communication with the processors for storing a knowledge base and computer-executable instructions that, when executed by the one or more processors, cause the computer system to implement a method for evaluating a correlation between (i) a gene, a SNP, a SNP pattern, a portion of gene, a region of a genome, or a compound, and (ii) a disease or a genotype, the method comprising:
  
  providing a taxonomy of categories of diseases and/or genotypes arranged in a hierarchical structure comprising at least one top-level category;
  
  providing a plurality of feature sets, each feature set comprising (a) two or more features, (b) associated experimentally-derived statistical information indicating one or more of: differential expression of said features, abundance of said features, responses of said features to a treatment or stimulus, and effects of said features on biological systems, and (c) a feature rank indicating the importance of the feature in an experiment from which the statistical information was derived, wherein
  
  - the features are genes, SNPs, SNP patterns, portions of genes, regions of a genome, or compounds,
  - at least some of the features have different names but correspond to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound,
  - the plurality of feature sets is obtained from across different experiments, platforms, and/or organisms, and
  - at least some of said feature sets are associated with one or more categories in the taxonomy;
  
  providing a plurality of globally unique mapping identifiers;
  
  identifying, for each globally unique mapping identifier, one or more features associated with the globally unique mapping identifier;
  
  mapping, for each globally unique mapping identifier, the identified one or more features to the globally unique mapping identifier, thereby providing mapping data indicating mapping between a plurality of features and the plurality of globally unique mapping identifiers, wherein at least some features having different names but corresponding to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound are mapped to a same globally unique mapping identifier;
  
  storing the mapping data in an index set;
  
  identifying, for each of a plurality of the categories in the taxonomy, contributing feature sets that contribute to scoring a category under consideration by identifying all feature sets among the provided feature sets that are associated with the category under consideration and its child categories in the taxonomy;
  
  combining the feature ranks of all features in the contributing feature sets that can be mapped to a globally unique mapping identifier under consideration based on the mapping data in the index set to obtain an overall score; and
  
  evaluating a correlation between (i) a gene, a SNP, a SNP pattern, a portion of gene, a region of a genome, or a compound corresponding to the globally unique mapping identifier under consideration, and (ii) a disease or a genotype corresponding to the category under consideration based on the obtained overall score.

17. A computer implemented method for evaluating correlations between at least two items, each item is selected from a tissue, an organ, a disease, or a treatment, the method comprising:
- providing a taxonomy of medical, biological and/or chemical categories arranged in a hierarchical structure comprising at least one top-level category comprising at least one of the group consisting of tissues or organs, diseases, and treatments;
  
  providing a plurality of feature sets each comprising (a) two or more features, (b) associated experimentally-derived statistical information indicating one or more of: differential expression of said features, abundance of said features, responses of said features to a treatment or stimulus, and effects of said features on biological systems, and (c) a feature rank indicating the importance of the feature in an experiment from which the statistical information was derived, wherein
  
  - the features are genes, SNPs, SNP patterns, portions of genes, regions of a genome, or compounds,
  - at least some of the features have different names but correspond to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound,
  - the plurality of feature sets is obtained from across different experiments, platforms, and/or organisms, and
  - at least some of said feature sets are associated with one or more categories in the taxonomy;
  
  providing a plurality of globally unique mapping identifiers;
  
  identifying, for each globally unique mapping identifier, one or more features associated with the globally unique mapping identifier;
  
  mapping, for each globally unique mapping identifier, the identified one or more features to the globally unique mapping identifier, thereby providing mapping data indicating mapping between a plurality of features and the plurality of globally unique mapping identifiers, wherein at least some features having different names but corresponding to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound are mapped to a same globally unique mapping identifier;
  
  storing the mapping data in an index set;
  
  identifying contributing feature sets that contribute to scoring a pair of categories under consideration by identifying all feature sets associated with the pair of categories under consideration and their child categories in the taxonomy;
  
  obtaining feature set-feature set correlation scores indicating pair-wise correlations between the contributing feature sets of one of the pair of categories under consideration and the contributing feature sets of the other of the pair of categories under consideration, wherein the pair-wise correlations are based on features that each can be mapped to a same globally unique mapping identifier across the pair of categories based on the mapping data in the index set;
  
  calculating a category-category score based on the feature set-feature set correlation scores, the category-category score indicating the correlation between the categories in the pair based on the pair-wise correlation scores; and
  
  determining whether a tissue, an organ, a disease, or a treatment in a category in the pair is likely associated with a tissue, an organ, a disease, or a treatment in another category in the pair by comparing the category-category score to a criterion.
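
The category-category scoring of claim 17 can be sketched in Python as follows. The claim does not specify how pair-wise correlations are computed or combined, so this sketch swaps in a Jaccard overlap of mapped identifiers as the feature set-feature set correlation and a plain mean as the category-category score. The function names, the sample index set, and the `0.3` criterion are all hypothetical.

```python
from itertools import product


def mapped_ids(feature_names, index_set):
    """Translate feature names (including synonyms) to globally unique mapping IDs."""
    return {index_set[name] for name in feature_names if name in index_set}


def feature_set_correlation(set_a, set_b, index_set):
    """Jaccard overlap of mapped IDs; an assumed stand-in for the unspecified correlation."""
    ids_a = mapped_ids(set_a, index_set)
    ids_b = mapped_ids(set_b, index_set)
    union = ids_a | ids_b
    return len(ids_a & ids_b) / len(union) if union else 0.0


def category_category_score(sets_a, sets_b, index_set):
    """Mean pair-wise correlation between the contributing feature sets of two categories."""
    pairs = list(product(sets_a, sets_b))
    if not pairs:
        return 0.0
    return sum(feature_set_correlation(a, b, index_set) for a, b in pairs) / len(pairs)


def likely_associated(score, criterion=0.3):
    """Compare the category-category score to a criterion (threshold value is hypothetical)."""
    return score >= criterion


# Hypothetical data: synonyms share a mapping ID.
index_set = {"TP53": "GID:1", "p53": "GID:1", "BRCA1": "GID:2", "RNF53": "GID:2", "EGFR": "GID:3"}
disease_sets = [["p53", "BRCA1"], ["TP53"]]        # contributing sets of a disease category
treatment_sets = [["TP53", "RNF53"], ["EGFR"]]     # contributing sets of a treatment category

score = category_category_score(disease_sets, treatment_sets, index_set)
print(score, likely_associated(score))
```

Because the comparison runs on mapping identifiers rather than raw names, the feature sets `["p53", "BRCA1"]` and `["TP53", "RNF53"]` are treated as fully overlapping even though they share no literal feature name.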

18. A computer implemented method for determining if a person is likely to have a disease or a phenotype, the method comprising:
- providing a taxonomy of categories of diseases and/or phenotypes arranged in a hierarchical structure comprising at least one top-level category;
  
  providing a plurality of feature sets and/or feature groups, each feature set comprising (a) two or more features, (b) experimentally-derived associated statistical information indicating one or more of: differential expression of said features, abundance of said features, responses of said features to a treatment or stimulus, and effects of said features on biological systems, and (c) a feature rank indicating the importance of the feature in an experiment from which the statistical information was derived, and each feature group comprising a list of features related by biological or chemical structure or function, wherein
  
  - the features are genes, SNPs, SNP patterns, portions of genes, regions of a genome, or compounds,
  - at least some of the features have different names but correspond to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound,
  - the plurality of feature sets is obtained from across different experiments, platforms, and/or organisms, and
  - at least some of said feature sets are associated with one or more categories in the taxonomy;
  
  providing a plurality of globally unique mapping identifiers;
  
  identifying, for each globally unique mapping identifier, one or more features associated with the globally unique mapping identifier;
  
  mapping, for each globally unique mapping identifier, the identified one or more features to the globally unique mapping identifier, thereby providing mapping data indicating mapping between a plurality of features and the plurality of globally unique mapping identifiers, wherein at least some features having different names but corresponding to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound are mapped to a same globally unique mapping identifier;
  
  storing the mapping data in an index set;
  
  identifying, for each of a plurality of the categories in the taxonomy, contributing feature sets that contribute to scoring the category under consideration by identifying all feature sets among the provided feature sets that are associated with the category under consideration and its child categories in the taxonomy;
  
  obtaining, based on the feature ranks of features in the contributing feature sets that can be mapped to a same globally unique mapping identifier according to the mapping data in the index set, feature set-feature set correlation scores between the contributing feature sets and a plurality of other feature sets in the knowledge base, and feature set-feature group correlation scores between the contributing feature sets and a plurality of feature groups in the knowledge base;
  
  calculating overall scores indicating correlation between the category under consideration and each of a plurality of features, feature sets and feature groups in the knowledge base based on the feature set-feature set correlation scores, feature set-feature group correlation scores, and feature ranks;
  
  storing the overall scores on the one or more storage devices;
  
  for each of a plurality of categories in the taxonomy, receiving from the one or more storage devices feature-set correlation scores between the contributing feature sets and a feature set obtained from a person; and
  
  combining the feature-set correlation scores to obtain a category-feature set score indicating the relevance of the category under consideration to the feature set obtained from the person; and
  
  determining whether the person is likely to have a disease or a phenotype by comparing the category-feature set score to a criterion.
- Dependent claims (19-21):
  - 19. The computer-implemented method of claim 18, further comprising generating a feature set from raw data associated with an experiment.
  - 20. The computer-implemented method of claim 18, wherein the features of a plurality of feature sets are units of genetic information and the associated statistics indicate expression profiles.
  - 21. The computer-implemented method of claim 18, further comprising identifying categories relevant to a feature, feature set or feature group based on the stored scores and displaying to a user the identified relevant categories.
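
The final steps of claim 18 (scoring a person's feature set against a category and comparing the result to a criterion) can be sketched in Python as below. This is a simplified illustration under stated assumptions: correlations are computed on the fly rather than read from stored scores, the rank-weighted overlap formula is invented for the example, and the names `normalized_ranks`, `rank_weighted_correlation`, `category_feature_set_score` and `likely_has_condition` are hypothetical.

```python
def normalized_ranks(ranked_names, index_set):
    """Map each mapping ID to a normalized rank in (0, 1]; earlier entries score higher."""
    size = len(ranked_names)
    ranks = {}
    for position, name in enumerate(ranked_names, start=1):
        gid = index_set.get(name)
        if gid is not None and gid not in ranks:
            ranks[gid] = (size - position + 1) / size
    return ranks


def rank_weighted_correlation(set_a, set_b, index_set):
    """Sum of rank products over shared mapping IDs, divided by the number of
    distinct IDs in either set (an assumed formula, not taken from the claims)."""
    ranks_a = normalized_ranks(set_a, index_set)
    ranks_b = normalized_ranks(set_b, index_set)
    all_ids = ranks_a.keys() | ranks_b.keys()
    if not all_ids:
        return 0.0
    shared = ranks_a.keys() & ranks_b.keys()
    return sum(ranks_a[g] * ranks_b[g] for g in shared) / len(all_ids)


def category_feature_set_score(person_set, contributing_sets, index_set):
    """Combine per-feature-set correlations into one category-feature set score (mean)."""
    if not contributing_sets:
        return 0.0
    total = sum(rank_weighted_correlation(person_set, fs, index_set) for fs in contributing_sets)
    return total / len(contributing_sets)


def likely_has_condition(score, criterion):
    """Compare the category-feature set score to a caller-supplied criterion."""
    return score >= criterion


# Hypothetical data: features are listed in rank order, synonyms share a mapping ID.
index_set = {"TP53": "GID:1", "p53": "GID:1", "BRCA1": "GID:2", "RNF53": "GID:2"}
person_set = ["p53", "RNF53"]                      # feature set derived from a person's sample
contributing = [["TP53", "BRCA1"], ["BRCA1"]]      # contributing sets for one disease category

score = category_feature_set_score(person_set, contributing, index_set)
print(score, likely_has_condition(score, criterion=0.4))
```

The criterion is left as a parameter because the claims only require that some criterion be applied; how it would be calibrated is outside what the claim text states.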

Current Assignee: Illumina Incorporated
Original Assignee: NextBio
Inventors: Kupershmidt, Ilya; Su, Qiaojuan Jane; Liu, Qingdi; Alag, Satnam; Sundaresh, Suman
Primary Examiner(s): Dejong, Eric S

Application Number: US12/398,107
Publication Number: US 20090222400A1
Time in Patent Office: 2,393 Days
US Class Current: 1/1
CPC Class Codes:
- G06F 16/2246: Trees, e.g. B+trees
- G06F 16/24578: using ranking
- G06F 16/285: Clustering or classification
- G06N 20/00: Machine learning
- G16B 20/00: ICT specially adapted for f...
- G16B 20/10: Ploidy or copy number detec...
- G16B 20/20: Allele or variant detection...
- G16B 30/00: ICT specially adapted for s...
- G16B 50/00: ICT programming tools or da...
- G16B 50/10: Ontologies; Annotations
