Techniques for facilitating identification of candidate genes
Abstract
Techniques for facilitating the identification of candidate genes from a plurality of DNA sequences. According to an embodiment of the present invention, techniques are provided for extracting and integrating information from various information sources and results of various analyses, and storing the integrated information in a form which is conducive to identification of candidate genes. The stored information may include results of a homology search for the plurality of DNA sequences, annotative information for the plurality of DNA sequences indicating the biochemical functions and physiological roles of the plurality of DNA sequences, gene expression profile data for the plurality of DNA sequences describing behavioral patterns of the plurality of DNA sequences, results from clustering the plurality of DNA sequences based on time course data as described by the gene expression profile data, and other information.
11 Claims
1. In a computer system, a method of identifying candidate genes from a plurality of DNA sequences, the method comprising:
obtaining results of a homology search for a first plurality of DNA sequences, the homology search results comprising information about homologs of the first plurality of DNA sequences;
obtaining annotative information for the first plurality of DNA sequences, the annotative information comprising information about biochemical functions and physiological roles of the first plurality of DNA sequences, wherein obtaining the annotative information comprises:
identifying one or more known genes from the first plurality of DNA sequences based on the homology search results, wherein a DNA sequence from the first plurality of DNA sequences is identified as a known gene if a sequence identity of the DNA sequence to a sequence stored in a first database of sequences used for the homology search is at least equal to a first threshold value;
accessing one or more information sources storing annotative information for DNA sequences;
extracting annotative information from the one or more information sources for the known genes, the extracted annotative information comprising information about one or more biochemical functions and physiological roles of each known gene; and
assigning a reference score to the extracted annotative information for each known gene based on the level of acceptance of the roles or functions of the known gene as described by the annotative information such that annotative information with a high level of acceptance is assigned a higher reference score than annotative information with a low level of acceptance;
obtaining gene expression profile data for the first plurality of DNA sequences, the gene expression profile data describing behavioral patterns of the first plurality of DNA sequences;
clustering the first plurality of DNA sequences based on the behavioral patterns of the first plurality of DNA sequences as described by the gene expression profile data;
storing the results of the homology search, the annotative information, the reference score assigned to the extracted annotative information for each known gene, the gene expression profile data, and results from clustering the first plurality of DNA sequences in a database;
receiving a query identifying criteria for the candidate genes; and
searching the database, in response to the query, to identify a set of DNA sequences from the first plurality of DNA sequences which satisfy the query criteria.
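The flow recited in claim 1 (threshold-based identification of known genes, storage of the integrated information, and a query over it) could be sketched roughly as follows. This is only an illustrative sketch, not the patented implementation: the table layout, the 90.0 threshold, the sample identities, annotations, reference scores, and cluster labels, and the names `is_known_gene` and `find_candidates` are all assumptions introduced for the example.

```python
import sqlite3

# Hypothetical inputs: best-hit percent identity per sequence (homology search
# results) and a cluster label per sequence (clustering results).
best_identity = {"seq1": 98.5, "seq2": 42.0, "seq3": 91.0}
cluster_of = {"seq1": 0, "seq2": 1, "seq3": 1}
# Hypothetical extracted annotations for known genes: (function, reference score).
annotations = {"seq1": ("kinase", 0.9), "seq3": ("transporter", 0.4)}

KNOWN_GENE_THRESHOLD = 90.0  # assumed value for the "first threshold value"


def is_known_gene(identity, threshold=KNOWN_GENE_THRESHOLD):
    """A sequence counts as a known gene if identity is at least the threshold."""
    return identity >= threshold


# Store the integrated information in one table (an assumed, simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences (seq_id TEXT PRIMARY KEY, identity REAL, "
    "known INTEGER, function TEXT, ref_score REAL, cluster INTEGER)"
)
for seq_id, identity in best_identity.items():
    known = is_known_gene(identity)
    # Only known genes receive annotations and a reference score.
    function, ref_score = annotations.get(seq_id, (None, None)) if known else (None, None)
    conn.execute(
        "INSERT INTO sequences VALUES (?, ?, ?, ?, ?, ?)",
        (seq_id, identity, int(known), function, ref_score, cluster_of[seq_id]),
    )


def find_candidates(conn, cluster, min_ref_score):
    """Answer a query: sequences in a cluster with a sufficiently high reference score."""
    rows = conn.execute(
        "SELECT seq_id FROM sequences WHERE cluster = ? AND ref_score >= ? "
        "ORDER BY seq_id",
        (cluster, min_ref_score),
    )
    return [row[0] for row in rows]


candidates = find_candidates(conn, cluster=1, min_ref_score=0.3)
```

In this sketch, sequences that are not known genes have no reference score (stored as NULL), so a query that filters on the score excludes them.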
performing the BLAST analysis on the first plurality of DNA sequences using the first database of sequences;
identifying a second plurality of DNA sequences from the first plurality of sequences based on the BLAST analysis, wherein a DNA sequence from the first plurality of DNA sequences is included in the second plurality of DNA sequences if a sequence identity of the DNA sequence to a sequence stored in the first database of sequences is less than a second threshold value;
performing Smith-Waterman analysis on the second plurality of DNA sequences using a protein database and a translated patent database;
identifying a third plurality of DNA sequences from the second plurality of sequences based on the Smith-Waterman analysis, wherein a DNA sequence from the second plurality of DNA sequences is included in the third plurality of DNA sequences if a sequence identity of the DNA sequence to a sequence stored in the first database of sequences is less than a third threshold value;
performing Hidden Markov Model (HMM) analysis and EMotif analysis on the third plurality of DNA sequences using the protein database and GenBank database; and
performing BLAST analysis on the third plurality of DNA sequences using GenBank EST database.
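The cascade in the steps above, where each stage passes on only the sequences that remain poorly matched, might be sketched as below. The search tools are represented by plain callables returning a best-hit percent identity; they are stand-ins, not real BLAST or Smith-Waterman invocations. The function names, the toy identities, and both threshold values are assumptions, and the sketch assumes the third-stage filter uses the identity reported by the Smith-Waterman stage.

```python
def below_threshold(identities, threshold):
    """Return the IDs whose best-hit identity is strictly below the threshold."""
    return [seq_id for seq_id, identity in identities.items() if identity < threshold]


def cascade(seq_ids, blast, smith_waterman, second_threshold, third_threshold):
    """Run the two filtering stages and return (second, third) pluralities.

    `blast` and `smith_waterman` are hypothetical callables mapping a sequence
    ID to its best-hit percent identity for that analysis.
    """
    blast_identity = {seq_id: blast(seq_id) for seq_id in seq_ids}
    second = below_threshold(blast_identity, second_threshold)
    # Only the weakly matched sequences go on to the more sensitive search.
    sw_identity = {seq_id: smith_waterman(seq_id) for seq_id in second}
    third = below_threshold(sw_identity, third_threshold)
    return second, third


# Toy stand-ins for the search results (assumed values).
toy_blast = {"a": 95.0, "b": 60.0, "c": 30.0}
toy_sw = {"b": 75.0, "c": 35.0}

second, third = cascade(
    ["a", "b", "c"], toy_blast.get, toy_sw.get,
    second_threshold=70.0, third_threshold=50.0,
)
```

The third plurality returned here is the set that would then be passed to the HMM, EMotif, and EST-database analyses.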
4. The method of claim 1 wherein the one or more information sources include GenBank database, SWISS-PROT database, Medline database, and biomedical publications.
5. The method of claim 1 wherein:
accessing the one or more information sources comprises accessing biomedical publications;
assigning the reference score to the extracted annotative information for each known gene comprises:
for annotative information extracted from each biomedical publication, assigning a reference score to the extracted annotative information based on characteristics of the biomedical publication, the reference score indicating the level of acceptance of the roles or functions of the known genes as described by the annotative information extracted from the biomedical publication.
6. The method of claim 5 wherein assigning the reference score comprises:
using a score derived from a citation index database to calculate the reference score, the score derived from the citation index database indicating the number of times that the annotative information from the biomedical publication was referenced by other information sources.
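One way a citation count could be turned into a reference score is sketched below. The claim does not specify a formula; the logarithmic scaling, the normalization to the range 0 to 1, and the name `reference_score` are assumptions chosen only so that more frequently cited annotations receive higher scores.

```python
import math


def reference_score(citation_count, max_count):
    """Map a citation count to a score in [0, 1] (assumed log scaling).

    `max_count` is a hypothetical cap, e.g. the largest count seen in the
    citation index; counts above it are clipped.
    """
    if citation_count < 0 or max_count < 0:
        raise ValueError("counts must be non-negative")
    if max_count == 0:
        return 0.0
    clipped = min(citation_count, max_count)
    return math.log1p(clipped) / math.log1p(max_count)


low = reference_score(10, 100)
high = reference_score(50, 100)
```

The log scaling damps the effect of a few very heavily cited publications; a linear mapping would satisfy the same ordering requirement.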
7. The method of claim 5 wherein assigning the reference score further comprises:
ranking the biomedical publications; and
assigning the reference score to the annotative information extracted from the biomedical publication based on the ranking of the biomedical publication.
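A ranking-based assignment could be as simple as the sketch below, which assumes the publications are already ordered from most to least accepted and spreads scores evenly over the interval from 1 down to 1/n. The function name and the linear spacing are assumptions, not something the claim prescribes.

```python
def scores_from_ranking(ranked_pub_ids):
    """Assign each publication a score from its rank position.

    `ranked_pub_ids` is a hypothetical list of publication IDs, best first.
    The top-ranked publication gets 1.0 and the last gets 1/n.
    """
    n = len(ranked_pub_ids)
    return {pub_id: (n - position) / n for position, pub_id in enumerate(ranked_pub_ids)}


scores = scores_from_ranking(["pub_a", "pub_b", "pub_c", "pub_d"])
```

Annotative information extracted from a publication would then inherit that publication's score.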
8. The method of claim 1 wherein clustering the first plurality of DNA sequences comprises determining relationships between clusters of DNA sequences from the first plurality of DNA sequences.
9. The method of claim 1 wherein clustering the first plurality of DNA sequences comprises clustering the first plurality of DNA sequences based on time-course data described by the gene expression profile data.
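Clustering on time-course data could, for instance, group sequences whose expression profiles rise and fall together. The sketch below uses Pearson correlation and a simple greedy assignment to the first sufficiently correlated cluster; this particular algorithm, the 0.9 cutoff, and the toy profiles are assumptions for illustration, since the claim does not name a clustering method.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def cluster_profiles(profiles, min_corr=0.9):
    """Greedy clustering: join the first cluster whose first member correlates
    at least `min_corr` with the profile, otherwise start a new cluster."""
    clusters = []
    for seq_id, profile in profiles.items():
        for members in clusters:
            if pearson(profiles[members[0]], profile) >= min_corr:
                members.append(seq_id)
                break
        else:
            clusters.append([seq_id])
    return clusters


# Toy time-course profiles (expression level at four time points; assumed values).
profiles = {
    "seq1": [1.0, 2.0, 4.0, 8.0],
    "seq2": [2.0, 4.1, 7.9, 16.0],
    "seq3": [8.0, 4.0, 2.0, 1.0],
}
clusters = cluster_profiles(profiles)
```

Here the two rising profiles fall into one cluster and the falling profile into another. The result of a greedy pass depends on input order, so a hierarchical or k-means method may be preferable for real data.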
10. The method of claim 1 wherein storing the information in the database comprises correlating the annotative information for the first plurality of DNA sequences with the gene expression profile data for the first plurality of DNA sequences.
11. In a computer system, a method of identifying candidate genes comprising:
configuring a query identifying criteria for the candidate genes;
communicating the query to a server storing information related to a plurality of DNA sequences, the information comprising:
results of a homology search for the plurality of DNA sequences, the homology search results comprising information about homologs of the plurality of DNA sequences;
annotative information about the biochemical functions and physiological roles of the plurality of DNA sequences, wherein the annotative information is obtained by:
identifying known genes from the plurality of DNA sequences based on the homology search results, wherein a DNA sequence from the plurality of DNA sequences is identified as a known gene if a sequence identity of the DNA sequence to a sequence stored in a database of sequences used for the homology search is at least equal to a first threshold value; and
accessing one or more information sources storing annotative information for DNA sequences;
extracting annotative information from the one or more information sources for the known genes, the extracted annotative information comprising information about one or more biochemical functions and physiological roles of each known gene; and
assigning a reference score to the extracted annotative information for each known gene based on the level of acceptance of the roles or functions of the known gene as described by the annotative information such that annotative information with a high level of acceptance is assigned a higher reference score than annotative information with a low level of acceptance, wherein the annotative information stored by the server includes the reference score assigned to the extracted annotative information for each known gene;
information describing behavioral patterns of the plurality of DNA sequences; and
results from clustering the plurality of DNA sequences based on the behavioral patterns of the plurality of DNA sequences as described by the gene expression profile data; and
receiving from the server, in response to the query, a first set of DNA sequences from the plurality of DNA sequences, wherein the first set of DNA sequences satisfy the criteria for the candidate genes identified in the query.
Specification