Categorization and filtering of scientific data
First Claim
1. A computer implemented method for evaluating a correlation between (i) a gene, a SNP, a SNP pattern, a portion of gene, a region of a genome, or a compound, and (ii) a disease or a genotype, the method comprising:
- providing a taxonomy of categories of diseases and/or phenotypes arranged in a hierarchical structure comprising at least one top-level category;
providing a plurality of feature sets, each feature set comprising (a) two or more features, (b) associated experimentally-derived statistical information indicating one or more of: differential expression of said features, abundance of said features, responses of said features to a treatment or stimulus, and effects of said features on biological systems, and (c) a feature rank indicating the importance of the feature in an experiment from which the statistical information was derived, wherein
the features are genes, SNPs, SNP patterns, portions of genes, regions of a genome, or compounds,
at least some of the features have different names but correspond to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound,
the plurality of feature sets is obtained from across different experiments, platforms, and/or organisms, and
at least some of said feature sets are associated with one or more categories in the taxonomy;
providing a plurality of globally unique mapping identifiers;
identifying, for each globally unique mapping identifier, one or more features associated with the globally unique mapping identifier;
mapping, for each globally unique mapping identifier, the identified one or more features to the globally unique mapping identifier, thereby providing mapping data indicating mapping between a plurality of features and the plurality of globally unique mapping identifiers, wherein at least some features having different names but corresponding to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound are mapped to a same globally unique mapping identifier;
storing the mapping data in an index set;
identifying, for each of a plurality of the categories in the taxonomy, contributing feature sets that contribute to scoring a category under consideration by identifying all feature sets among the provided feature sets that are associated with the category under consideration and its child categories in the taxonomy;
combining the feature ranks of all features in the contributing feature sets that can be mapped to a globally unique mapping identifier under consideration based on the mapping data in the index set to obtain an overall score; and
evaluating a correlation between (i) a gene, a SNP, a SNP pattern, a portion of gene, a region of a genome, or a compound corresponding to the globally unique mapping identifier under consideration, and (ii) a disease or a genotype corresponding to the category under consideration based on the obtained overall score.
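The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the taxonomy, the index set, the feature sets, and every name below are hypothetical, "child categories" is read here as all descendants, and "combining the feature ranks" is assumed to be a plain sum of normalized ranks, which the claim does not specify.

```python
# Hypothetical taxonomy: parent category -> child categories.
TAXONOMY = {
    "disease": ["cancer", "metabolic disease"],
    "cancer": ["breast cancer", "lung cancer"],
}

# Hypothetical index set: feature name -> globally unique mapping identifier.
# Different names for the same gene share one identifier.
INDEX_SET = {"TP53": "G1", "p53": "G1", "BRCA1": "G2", "RNF53": "G2"}

# Hypothetical feature sets: the categories each is tagged with, plus
# normalized feature ranks (assumed scale: 1.0 = most important).
FEATURE_SETS = [
    {"categories": {"breast cancer"}, "ranks": {"BRCA1": 1.0, "TP53": 0.5}},
    {"categories": {"lung cancer"}, "ranks": {"p53": 0.8}},
    {"categories": {"metabolic disease"}, "ranks": {"RNF53": 0.9}},
]


def descendants(category, taxonomy):
    """Return the category together with all categories below it."""
    found = {category}
    stack = [category]
    while stack:
        for child in taxonomy.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def contributing_sets(category, feature_sets, taxonomy):
    """Feature sets tagged with the category or any category below it."""
    scope = descendants(category, taxonomy)
    return [fs for fs in feature_sets if fs["categories"] & scope]


def overall_score(category, mapping_id, feature_sets, taxonomy, index_set):
    """Sum the ranks of all features that resolve to mapping_id (assumed rule)."""
    total = 0.0
    for fs in contributing_sets(category, feature_sets, taxonomy):
        for name, rank in fs["ranks"].items():
            if index_set.get(name) == mapping_id:
                total += rank
    return total


if __name__ == "__main__":
    # "TP53" and "p53" both resolve to G1, so both sets under "cancer" contribute.
    print(overall_score("cancer", "G1", FEATURE_SETS, TAXONOMY, INDEX_SET))
```

Looking features up through the index set rather than by name is what lets feature sets from different platforms or organisms contribute to the same score.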
Abstract
The present invention relates to methods, systems and apparatus for capturing, integrating, organizing, navigating and querying large-scale data from high-throughput biological and chemical assay platforms. It provides a highly efficient meta-analysis infrastructure for performing research queries across a large number of studies and experiments from different biological and chemical assays, data types and organisms, as well as systems to build and add to such an infrastructure. According to various embodiments, methods, systems and interfaces for associating experimental data, features and groups of data related by structure and/or function with chemical, medical and/or biological terms in an ontology or taxonomy are provided. According to various embodiments, methods, systems and interfaces for filtering data by data source information are provided, allowing dynamic navigation through large amounts of data to find the most relevant results for a particular query.
57 Citations
21 Claims
15. A computer program product comprising one or more computer-readable non-transitory storage media having stored thereon computer executable instructions that, when executed by one or more processors of a computer system, cause the computer system to implement a method for evaluating a correlation between (i) a gene, a SNP, a SNP pattern, a portion of gene, a region of a genome, or a compound, and (ii) a disease or a genotype, the method comprising:
providing a taxonomy of categories of diseases and/or phenotypes arranged in a hierarchical structure comprising at least one top-level category;
providing a plurality of feature sets, each feature set comprising (a) two or more features, (b) associated statistical information indicating one or more of: differential expression of said features, abundance of said features, responses of said features to a treatment or stimulus, and effects of said features on biological systems, and (c) a feature rank indicating the importance of the feature in an experiment from which the statistical information was derived, wherein
the features are genes, SNPs, SNP patterns, portions of genes, regions of a genome, or compounds,
at least some of the features have different names but correspond to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound,
the plurality of feature sets is obtained from across different experiments, platforms, and/or organisms, and
at least some of said feature sets are associated with one or more categories in the taxonomy;
providing a plurality of globally unique mapping identifiers;
identifying, for each globally unique mapping identifier, one or more features associated with the globally unique mapping identifier;
mapping, for each globally unique mapping identifier, the identified one or more features to the globally unique mapping identifier, thereby providing mapping data indicating mapping between a plurality of features and the plurality of globally unique mapping identifiers, wherein at least some features having different names but corresponding to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound are mapped to a same globally unique mapping identifier;
storing the mapping data in an index set;
identifying, for each of a plurality of the categories in the taxonomy, contributing feature sets that contribute to scoring a category under consideration by identifying all feature sets among the provided feature sets that are associated with the category under consideration and/or its child categories in the taxonomy;
combining the feature ranks of all features in the contributing feature sets that can be mapped to a globally unique mapping identifier under consideration based on the mapping data in the index set to obtain an overall score; and
evaluating a correlation between (i) a gene, a SNP, a SNP pattern, a portion of gene, a region of a genome, or a compound corresponding to the globally unique mapping identifier under consideration, and (ii) a disease or a genotype corresponding to the category under consideration based on the obtained overall score.
-
-
16. A computer system comprising:
one or more processors; and
one or more storage devices in communication with the processors for storing a knowledge base and computer-executable instructions that, when executed by the one or more processors, cause the computer system to implement a method for evaluating a correlation between (i) a gene, a SNP, a SNP pattern, a portion of gene, a region of a genome, or a compound, and (ii) a disease or a genotype, the method comprising:
providing a taxonomy of categories of diseases and/or genotypes arranged in a hierarchical structure comprising at least one top-level category;
providing a plurality of feature sets, each feature set comprising (a) two or more features, (b) associated experimentally-derived statistical information indicating one or more of: differential expression of said features, abundance of said features, responses of said features to a treatment or stimulus, and effects of said features on biological systems, and (c) a feature rank indicating the importance of the feature in an experiment from which the statistical information was derived, wherein
the features are genes, SNPs, SNP patterns, portions of genes, regions of a genome, or compounds,
at least some of the features have different names but correspond to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound,
the plurality of feature sets is obtained from across different experiments, platforms, and/or organisms, and
at least some of said feature sets are associated with one or more categories in the taxonomy;
providing a plurality of globally unique mapping identifiers;
identifying, for each globally unique mapping identifier, one or more features associated with the globally unique mapping identifier;
mapping, for each globally unique mapping identifier, the identified one or more features to the globally unique mapping identifier, thereby providing mapping data indicating mapping between a plurality of features and the plurality of globally unique mapping identifiers, wherein at least some features having different names but corresponding to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound are mapped to a same globally unique mapping identifier;
storing the mapping data in an index set;
identifying, for each of a plurality of the categories in the taxonomy, contributing feature sets that contribute to scoring a category under consideration by identifying all feature sets among the provided feature sets that are associated with the category under consideration and its child categories in the taxonomy;
combining the feature ranks of all features in the contributing feature sets that can be mapped to a globally unique mapping identifier under consideration based on the mapping data in the index set to obtain an overall score; and
evaluating a correlation between (i) a gene, a SNP, a SNP pattern, a portion of gene, a region of a genome, or a compound corresponding to the globally unique mapping identifier under consideration, and (ii) a disease or a genotype corresponding to the category under consideration based on the obtained overall score.
-
-
17. A computer implemented method for evaluating correlations between at least two items, each item is selected from a tissue, an organ, a disease, or a treatment, the method comprising:
providing a taxonomy of medical, biological and/or chemical categories arranged in a hierarchical structure comprising at least one top-level category comprising at least one of the group consisting of tissues or organs, diseases, and treatments;
providing a plurality of feature sets each comprising (a) two or more features, (b) associated experimentally-derived statistical information indicating one or more of: differential expression of said features, abundance of said features, responses of said features to a treatment or stimulus, and effects of said features on biological systems, and (c) a feature rank indicating the importance of the feature in an experiment from which the statistical information was derived, wherein
the features are genes, SNPs, SNP patterns, portions of genes, regions of a genome, or compounds,
at least some of the features have different names but correspond to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound,
the plurality of feature sets is obtained from across different experiments, platforms, and/or organisms, and
at least some of said feature sets are associated with one or more categories in the taxonomy;
providing a plurality of globally unique mapping identifiers;
identifying, for each globally unique mapping identifier, one or more features associated with the globally unique mapping identifier;
mapping, for each globally unique mapping identifier, the identified one or more features to the globally unique mapping identifier, thereby providing mapping data indicating mapping between a plurality of features and the plurality of globally unique mapping identifiers, wherein at least some features having different names but corresponding to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound are mapped to a same globally unique mapping identifier;
storing the mapping data in an index set;
identifying contributing feature sets that contribute to scoring a pair of categories under consideration by identifying all feature sets associated with the pair of categories under consideration and their child categories in the taxonomy;
obtaining feature set-feature set correlation scores indicating pair-wise correlations between the contributing feature sets of one of the pair of categories under consideration and the contributing feature sets of the other of the pair of categories under consideration, wherein the pair-wise correlations are based on features that each can be mapped to a same globally unique mapping identifier across the pair of categories based on the mapping data in the index set;
calculating a category-category score based on the feature set-feature set correlation scores, the category-category score indicating the correlation between the categories in the pair based on the pair-wise correlation scores; and
determining whether a tissue, an organ, a disease, or a treatment in a category in the pair is likely associated with a tissue, an organ, a disease, or a treatment in another category in the pair by comparing the category-category score to a criterion.
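The category-category scoring of claim 17 can be sketched in Python as follows. This is only an illustration under stated assumptions: the claim does not define how the pair-wise correlation is computed or how the scores are aggregated, so the sum of rank products over shared mapping identifiers, the mean over all pairs, and the threshold criterion are all hypothetical choices, as are the names and data. The contributing feature sets for each category are assumed to have been collected already.

```python
from itertools import product

# Hypothetical index set: feature name -> globally unique mapping identifier.
INDEX_SET = {"TP53": "G1", "p53": "G1", "BRCA1": "G2", "RNF53": "G2", "EGFR": "G3"}


def to_ids(ranks, index_set):
    """Re-key a feature set's ranks by mapping identifier, keeping the best rank per identifier."""
    by_id = {}
    for name, rank in ranks.items():
        mapping_id = index_set.get(name)
        if mapping_id is not None:
            by_id[mapping_id] = max(rank, by_id.get(mapping_id, 0.0))
    return by_id


def set_set_score(ranks_a, ranks_b, index_set):
    """Assumed pair-wise correlation: sum of rank products over shared identifiers."""
    a = to_ids(ranks_a, index_set)
    b = to_ids(ranks_b, index_set)
    return sum(a[g] * b[g] for g in a.keys() & b.keys())


def category_category_score(sets_a, sets_b, index_set):
    """Assumed aggregation: mean of all pair-wise feature set scores."""
    pairs = list(product(sets_a, sets_b))
    if not pairs:
        return 0.0
    return sum(set_set_score(x, y, index_set) for x, y in pairs) / len(pairs)


def likely_associated(score, criterion):
    """Assumed criterion: the score meets or exceeds a threshold."""
    return score >= criterion


# Hypothetical contributing feature sets (feature name -> normalized rank).
TISSUE_SETS = [{"TP53": 1.0, "EGFR": 0.5}]
DISEASE_SETS = [{"p53": 0.5, "BRCA1": 1.0}, {"RNF53": 1.0}]

if __name__ == "__main__":
    score = category_category_score(TISSUE_SETS, DISEASE_SETS, INDEX_SET)
    print(score, likely_associated(score, criterion=0.25))
```

Only features that resolve to the same identifier on both sides contribute, which mirrors the claim's requirement that the pair-wise correlations rest on the mapping data in the index set.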
-
-
18. A computer implemented method for determining if a person is likely to have a disease or a phenotype, the method comprising:
providing a taxonomy of categories of diseases and/or phenotypes arranged in a hierarchical structure comprising at least one top-level category;
providing a plurality of feature sets and/or feature groups, each feature set comprising (a) two or more features, (b) experimentally-derived associated statistical information indicating one or more of: differential expression of said features, abundance of said features, responses of said features to a treatment or stimulus, and effects of said features on biological systems, and (c) a feature rank indicating the importance of the feature in an experiment from which the statistical information was derived, and each feature group comprising a list of features related by biological or chemical structure or function, wherein
the features are genes, SNPs, SNP patterns, portions of genes, regions of a genome, or compounds,
at least some of the features have different names but correspond to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound,
the plurality of feature sets is obtained from across different experiments, platforms, and/or organisms, and
at least some of said feature sets are associated with one or more categories in the taxonomy;
providing a plurality of globally unique mapping identifiers;
identifying, for each globally unique mapping identifier, one or more features associated with the globally unique mapping identifier;
mapping, for each globally unique mapping identifier, the identified one or more features to the globally unique mapping identifier, thereby providing mapping data indicating mapping between a plurality of features and the plurality of globally unique mapping identifiers, wherein at least some features having different names but corresponding to a same gene, SNP, SNP pattern, portion of gene, region of a genome, or compound are mapped to a same globally unique mapping identifier;
storing the mapping data in an index set;
identifying, for each of a plurality of the categories in the taxonomy, contributing feature sets that contribute to scoring the category under consideration by identifying all feature sets among the provided feature sets that are associated with the category under consideration and its child categories in the taxonomy;
obtaining, based on the feature ranks of features in the contributing feature sets that can be mapped to a same globally unique mapping identifier according to the mapping data in the index set, feature set-feature set correlation scores between the contributing feature sets and a plurality of other feature sets in the knowledge base, and feature set-feature group correlation scores between the contributing feature sets and a plurality of feature groups in the knowledge base;
calculating overall scores indicating correlation between the category under consideration and each of a plurality of features, feature sets and feature groups in the knowledge base based on the feature set-feature set correlation scores, feature set-feature group correlation scores, and feature ranks;
storing the overall scores on the one or more storage devices;
for each of a plurality of categories in the taxonomy, receiving from the one or more storage devices feature-set correlation scores between the contributing feature sets and a feature set obtained from a person; and
combining the feature-set correlation scores to obtain a category-feature set score indicating the relevance of the category under consideration to the feature set obtained from the person; and
determining whether the person is likely to have a disease or a phenotype by comparing the category-feature set score to a criterion.
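The final steps of claim 18 (combining stored correlation scores per category and comparing the result to a criterion) can be sketched in Python. This is a hypothetical illustration only: the claim specifies neither the combination rule nor the criterion, so the mean and the threshold below are assumptions, and the stored scores are made-up values standing in for correlations between each category's contributing feature sets and the feature set obtained from a person.

```python
# Hypothetical stored scores: category -> correlation scores between that
# category's contributing feature sets and the person's feature set.
STORED_SCORES = {
    "breast cancer": [0.75, 0.25],
    "lung cancer": [0.125, 0.125],
}


def category_feature_set_score(correlation_scores):
    """Assumed combination rule: mean of the per-feature-set correlation scores."""
    if not correlation_scores:
        return 0.0
    return sum(correlation_scores) / len(correlation_scores)


def evaluate_categories(stored_scores, criterion):
    """Return, per category, the combined score and whether it meets the assumed threshold."""
    results = {}
    for category, scores in stored_scores.items():
        score = category_feature_set_score(scores)
        results[category] = (score, score >= criterion)
    return results


if __name__ == "__main__":
    for category, (score, likely) in evaluate_categories(STORED_SCORES, 0.4).items():
        print(category, score, likely)
```

Because the per-pair correlation scores are precomputed and stored, evaluating a new criterion or a new category only requires this inexpensive aggregation step.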
-
Specification