Sequence-centric scientific information management
First Claim
1. A computer-implemented method of integrating a sequence-centric feature set into a knowledge base on a storage device comprising sequence-centric feature sets and gene-centric feature sets, the method comprising:
receiving by one or more processors of a computer system a sequence-centric feature set provided by a user, wherein the sequence-centric feature set comprises a plurality of sequence regions and associated statistics, wherein the plurality of sequence regions comprise one or more SNPs, methylated regions, or genomic variations;
mapping by the one or more processors of the computer system the plurality of sequence regions to genes within the knowledge base to provide a set of mapped genes for the received sequence-centric feature set, wherein the plurality of sequence regions and the genes within the knowledge base are related by genomic coordinate, physical proximity, haplotype, function, or phenotype;
mapping by the one or more processors of the computer system the plurality of sequence regions to other sequence regions within the knowledge base to provide a set of mapped sequence regions for the received sequence-centric feature set, wherein the plurality of sequence regions and the genes within the knowledge base are related by genomic coordinate, physical proximity, haplotype, function, or phenotype;
providing ranks of the set of mapped sequence regions in the received sequence-centric feature set and in other sequence-centric feature sets in the knowledge base, wherein the other sequence-centric feature sets comprise a plurality of sequence regions and associated statistics;
performing by the one or more processors of the computer system iterative rank based processes to calculate sequence-sequence scores indicating correlations between the received sequence-centric feature set and other sequence-centric feature sets in the knowledge base using the ranks of the set of mapped sequence regions;
providing ranks of the set of mapped genes in the received sequence-centric feature set and in the gene-centric feature sets in the knowledge base, wherein the gene-centric feature sets comprise one or more of genes ranked by activity and microarray-based gene expression data;
performing by one or more processors of the computer system iterative rank based processes to calculate sequence-gene scores indicating the correlations between the received sequence-centric feature set and the gene-centric feature sets using the ranks of the set of mapped genes;
storing the received sequence-centric feature set, the sequence-sequence scores, and the sequence-gene scores on the storage device;
receiving a query sequence region or a query gene as a query input; and
displaying information based on one or more sequence-sequence scores or one or more sequence-gene scores that correspond to the query sequence region or the query gene.
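The mapping step recited above (sequence regions related to genes by genomic coordinate or physical proximity) can be illustrated with a minimal sketch. The claim does not specify data structures or a distance threshold, so the `Region` type, the `max_distance` parameter, and the gene and SNP names below are all hypothetical choices made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Region:
    """A named genomic interval (hypothetical structure, not from the patent)."""
    name: str
    chrom: str
    start: int  # inclusive coordinate
    end: int    # inclusive coordinate


def distance(a: Region, b: Region) -> Optional[int]:
    """Gap in bases between two intervals: 0 if they overlap,
    None if they lie on different chromosomes."""
    if a.chrom != b.chrom:
        return None
    if a.end < b.start:
        return b.start - a.end
    if b.end < a.start:
        return a.start - b.end
    return 0


def map_regions_to_genes(
    regions: List[Region],
    genes: List[Region],
    max_distance: int = 10_000,  # assumed proximity cutoff, not from the patent
) -> Dict[str, List[str]]:
    """Map each sequence region to the genes it overlaps or lies near."""
    mapping: Dict[str, List[str]] = {}
    for region in regions:
        hits = []
        for gene in genes:
            d = distance(region, gene)
            if d is not None and d <= max_distance:
                hits.append(gene.name)
        mapping[region.name] = hits
    return mapping


# Hypothetical knowledge-base genes and user-supplied SNPs.
genes = [
    Region("GENE_A", "chr1", 1_000, 5_000),
    Region("GENE_B", "chr1", 50_000, 60_000),
    Region("GENE_C", "chr2", 1_000, 2_000),
]
snps = [
    Region("snp1", "chr1", 3_000, 3_000),    # inside GENE_A
    Region("snp2", "chr1", 45_000, 45_000),  # 5 kb upstream of GENE_B
    Region("snp3", "chr3", 100, 100),        # no gene on this chromosome
]
mapping = map_regions_to_genes(snps, genes)
```

A linear scan over all genes keeps the sketch short; a production system handling many regions would more plausibly use an interval index. Mapping by haplotype, function, or phenotype, which the claim also recites, would need additional annotation data and is not shown.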
Abstract
According to various embodiments, aspects of the invention provide a highly efficient meta-analysis infrastructure for performing research queries across a large number of studies and experiments from diverse sequencing technologies, biological and chemical assays, data types, and organisms, as well as systems to build and add to such an infrastructure. The methods, systems and apparatuses described enable combining orthogonal types of data and available public knowledge to elucidate mechanisms governing normal development and disease progression, as well as the susceptibility of individuals to disease or their response to drug treatments.
20 Claims
11. A computer program product comprising a machine readable non-transitory medium on which is provided program instructions for integrating a sequence-centric feature set into a knowledge base on a storage device comprising sequence-centric feature sets and gene-centric feature sets, the program instructions comprising:
code for receiving a sequence-centric feature set provided by a user, wherein the sequence-centric feature set comprises a plurality of sequence regions and associated statistics, wherein the plurality of sequence regions comprise one or more SNPs, methylated regions, or genomic variations;
code for mapping the plurality of sequence regions to genes within the knowledge base to provide a set of mapped genes for the received sequence-centric feature set, wherein the plurality of sequence regions and the genes within the knowledge base are related by genomic coordinate, physical proximity, haplotype, function, or phenotype;
code for mapping the plurality of sequence regions to other sequence regions within the knowledge base to provide a set of mapped sequence regions for the received sequence-centric feature set, wherein the plurality of sequence regions and the genes within the knowledge base are related by genomic coordinate, physical proximity, haplotype, function, or phenotype;
code for providing ranks of the set of mapped sequence regions in the received sequence-centric feature set and in other sequence-centric feature sets in the knowledge base, wherein the other sequence-centric feature sets comprise a plurality of sequence regions and associated statistics;
code for performing iterative rank based processes to calculate sequence-sequence scores indicating correlations between the received sequence-centric feature set and other sequence-centric feature sets in the knowledge base using the ranks of the set of mapped sequence regions;
code for providing ranks of the set of mapped genes in the received sequence-centric feature set and in the gene-centric feature sets in the knowledge base, wherein the gene-centric feature sets comprise one or more of genes ranked by activity and microarray-based gene expression data;
code for performing iterative rank based processes to calculate sequence-gene scores indicating the correlations between the received sequence-centric feature set and the gene-centric feature sets using the ranks of the set of mapped genes;
code for storing the received sequence-centric feature set, the sequence-sequence scores, and the sequence-gene scores on the storage device;
code for receiving a query sequence region or a query gene as a query input; and
code for displaying information based on one or more sequence-sequence scores or one or more sequence-gene scores that correspond to the query sequence region or the query gene.
20. An apparatus for integrating a sequence-centric feature set into a knowledge base comprising sequence-centric feature sets and gene-centric feature sets, comprising:
a memory for storing a knowledge base of scientific information; and
one or more processors in communication with the memory and configured to:
receive a sequence-centric feature set provided by a user, wherein the sequence-centric feature set comprises a plurality of sequence regions and associated statistics, wherein the plurality of sequence regions comprise one or more SNPs, methylated regions, or genomic variations;
map the plurality of sequence regions to genes within the knowledge base to provide a set of mapped genes for the received sequence-centric feature set, wherein the plurality of sequence regions and the genes within the knowledge base are related by genomic coordinate, physical proximity, haplotype, function, or phenotype;
map the plurality of sequence regions to other sequence regions within the knowledge base to provide a set of mapped sequence regions for the received sequence-centric feature set, wherein the plurality of sequence regions and the other sequence regions within the knowledge base are related by genomic coordinate, physical proximity, haplotype, function, or phenotype;
provide ranks of the set of mapped sequence regions in the received sequence-centric feature set and in other sequence-centric feature sets in the knowledge base, wherein the other sequence-centric feature sets comprise a plurality of sequence regions and associated statistics;
perform iterative rank based processes to calculate sequence-sequence scores indicating correlations between the received sequence-centric feature set and other sequence-centric feature sets in the knowledge base using the ranks of the set of mapped sequence regions;
provide ranks of the set of mapped genes in the received sequence-centric feature set and in the gene-centric feature sets in the knowledge base, wherein the gene-centric feature sets comprise one or more of genes ranked by activity and microarray-based gene expression data;
perform iterative rank based processes to calculate sequence-gene scores indicating the correlations between the received sequence-centric feature set and the gene-centric feature sets using the ranks of the set of mapped genes;
store the received sequence-centric feature set, the sequence-sequence scores, and the sequence-gene scores on the memory;
receive a query sequence region or a query gene as a query input; and
display information based on one or more sequence-sequence scores or one or more sequence-gene scores that correspond to the query sequence region or the query gene.
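The claims recite "iterative rank based processes" for computing correlation scores between feature sets but do not say which process is used. The sketch below swaps in one plausible scheme as an assumption: features are ranked by their associated statistic, the ranked list of one set is walked prefix by prefix, and at each overlap a hypergeometric tail probability is computed, with the best negative log10 p-value kept as the score. The function names, the `universe_size` parameter, and the example data are hypothetical.

```python
import math
from typing import Dict, List


def rank_by_statistic(stats: Dict[str, float]) -> List[str]:
    """Order features from most to least significant (smallest p-value first)."""
    return [name for name, _ in sorted(stats.items(), key=lambda kv: kv[1])]


def hypergeom_sf(k: int, M: int, n: int, N: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(population M, n successes, N draws)."""
    upper = min(n, N)
    tail = sum(math.comb(n, i) * math.comb(M - n, N - i) for i in range(k, upper + 1))
    return tail / math.comb(M, N)


def running_rank_score(ranked_a: List[str], ranked_b: List[str], universe_size: int) -> float:
    """Walk down ranked_a; each time a feature of ranked_b is met, test the
    overlap of the current prefix with ranked_b and keep the best -log10(p)."""
    set_b = set(ranked_b)
    if universe_size < len(set(ranked_a) | set_b):
        raise ValueError("universe_size must cover every feature in both lists")
    best = 0.0
    overlap = 0
    for prefix_len, feature in enumerate(ranked_a, start=1):
        if feature in set_b:
            overlap += 1
            p = hypergeom_sf(overlap, universe_size, len(set_b), prefix_len)
            best = max(best, -math.log10(p))
    return best


# Hypothetical mapped genes with p-values from a received feature set,
# compared against a hypothetical gene-centric feature set in the knowledge base.
received = rank_by_statistic({"g1": 1e-8, "g2": 1e-5, "g3": 0.01, "g4": 0.02, "g5": 0.04})
stored = ["g1", "g2", "g9"]
score = running_rank_score(received, stored, universe_size=20)
```

The same function could serve for both sequence-sequence and sequence-gene scores, since each operates on ranked lists of mapped features. The score is not symmetric in its two arguments; a symmetric variant would run the walk in both directions and combine the results.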
Specification