System and method for searching and processing databases comprising named annotated text strings
Abstract
A system and method for processing, searching, and performing in-context searches on named annotated text string databases. The system and method provide users with a means for interactively refining database searches in order to account for differences in the keywords used to describe similar phenomena. They also provide a means for searching for particular predefined target strings in the context of particular predefined context strings. Data is represented using two data types, Hits and E-Hits: Hits data contains the locations of search results, and E-Hits data contains the text of search results. Hits lists are sorted and duplicate entries are discarded. Context search results are segregated from non-context search results by sorting the Hits lists. The Search module operates on a Hits list and selects those elements that match one or more search keys; its output is a Results Hits list. The Context Search module accepts two inputs in addition to the search key(s): a Context Hits list and a Target Hits list. Its output is a Hits list that contains the matches found within the specified context. The Select module accepts a stream of Hits as input and can be used to add annotations to or subtract them from the results of a search, remove base text sub-strings from the results of a search, or perform additional processing on Hits that may be useful for context searching. The Extract module is used to extract actual data from a Hits list, typically for display to a user and/or for conversion of results into keywords for a subsequent search.
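The sorting and duplicate removal described for Hits lists can be illustrated with a minimal Python sketch. The `Hit` field layout and the `normalize` helper are assumptions made for illustration only; the abstract does not specify how a Hits entry is laid out.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """Location of one search result (hypothetical field layout)."""
    locus_id: int        # unique ID of the locus
    annotation_id: int   # identifier of the annotation type
    offset: int          # where the match starts
    length: int          # how long the match is


def normalize(hits: List[Hit]) -> List[Hit]:
    """Sort a Hits list and discard duplicate entries."""
    # set() drops exact duplicates; sorted() orders the tuples field by field.
    return sorted(set(hits))
```

Because the entries hold only locations, sorting and de-duplicating them involves no text comparison; the text itself is fetched later, when E-Hits are produced.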
26 Claims
-
1. A method for searching a genetic sequence database comprising loci, each locus having a unique name, one or more annotations, and an ordered text string, the genetic sequence database being stored in one or more database files, the method comprising the steps of:
-
assigning a unique ID for each locus;
assigning an annotation identifier for each predefined annotation type;
constructing a parsed skeleton file associated with each of the database files, wherein each entry in the parsed skeleton file is associated with a particular locus and comprises one or more searchable object names, a length and an offset for each searchable object within the particular locus; and
building an index file associated with each of the database files, wherein each entry in the index file comprises an offset and length into a database file for each locus and an offset and length of the corresponding entry in the parsed skeleton file.
-
-
2. The method of claim 1, further comprising the steps of:
-
selecting a portion of the genetic sequence database in which to conduct a search;
constructing an input hits list comprising the unique ID of each locus identified in said portion;
specifying a search key comprising one or more keywords and one or more annotation types;
performing a first database search using the input hits list and the search key; and
outputting matches into a results hits list.
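The parsed skeleton file and index file recited in claim 1 might be built along the following lines. This is a sketch under stated assumptions, not the patented implementation: the toy record format (one `NAME value` line per searchable object, each locus terminated by a `//` line), the binary layouts `INDEX_ENTRY` and `SKEL_OBJECT`, and the function name are all hypothetical.

```python
import struct
from typing import Tuple

# Assumed layouts (not taken from the claims):
# index entry = locus offset, locus length, skeleton offset, skeleton length
INDEX_ENTRY = struct.Struct("<IIII")
# skeleton object = object name, offset within the locus, length
SKEL_OBJECT = struct.Struct("<12sII")


def build_skeleton_and_index(db: bytes) -> Tuple[bytes, bytes]:
    """Build skeleton and index data for one toy database file.

    Assumes every locus ends with a line containing only "//".
    """
    skeleton = bytearray()
    index = bytearray()
    pos = 0
    while pos < len(db):
        end = db.index(b"//\n", pos) + 3          # locus spans [pos, end)
        locus = db[pos:end]
        entry = bytearray()
        line_start = 0
        for line in locus.splitlines(keepends=True):
            if line.strip() != b"//":
                name, _, value = line.rstrip(b"\n").partition(b" ")
                value_offset = line_start + len(name) + 1
                entry += SKEL_OBJECT.pack(name, value_offset, len(value))
            line_start += len(line)
        # One fixed-size index entry per locus, pointing at both files.
        index += INDEX_ENTRY.pack(pos, len(locus), len(skeleton), len(entry))
        skeleton += entry
        pos = end
    return bytes(skeleton), bytes(index)
```

Keeping the index entries fixed-size means the entry for a locus can be found by arithmetic on its unique ID, while the variable-size skeleton entries are reached through the offsets stored in the index.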
-
-
3. The method of claim 2, wherein said step of performing a first database search comprises the steps of:
-
reading the unique ID from the input hits list;
determining which of the one or more database files contains the unique ID;
calculating an offset into the associated index file, where the associated index entry is stored;
consulting the associated index entry to determine an offset and length of the locus and an offset and length of the associated parsed skeleton file entry;
reading the associated parsed skeleton file entry; and
searching for a match of the search key using the parsed skeleton file to parse the locus.
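The lookup sequence of claim 3 can be sketched as follows, assuming the database file holding the locus has already been determined. The fixed-size binary layouts, the `first_id` parameter (the unique ID of the first locus in that file), and the function name `search_locus` are assumptions for illustration; the claim does not prescribe them.

```python
import struct
from typing import List, Tuple

# Assumed layouts (hypothetical):
INDEX_ENTRY = struct.Struct("<IIII")    # locus off, locus len, skel off, skel len
SKEL_OBJECT = struct.Struct("<12sII")   # object name, offset in locus, length


def search_locus(unique_id: int, first_id: int, db: bytes, index: bytes,
                 skeleton: bytes, keyword: bytes,
                 annotation: bytes) -> List[Tuple[int, int, int]]:
    """Search one locus for `keyword` inside objects named `annotation`."""
    # Fixed-size entries: the index offset follows from the ID alone.
    index_offset = (unique_id - first_id) * INDEX_ENTRY.size
    loc_off, loc_len, sk_off, sk_len = INDEX_ENTRY.unpack_from(index, index_offset)
    locus = db[loc_off:loc_off + loc_len]
    entry = skeleton[sk_off:sk_off + sk_len]
    hits = []
    # The skeleton entry tells us where each object sits, so the locus
    # text never has to be re-parsed.
    for name, off, length in SKEL_OBJECT.iter_unpack(entry):
        if name.rstrip(b"\0") == annotation and keyword in locus[off:off + length]:
            hits.append((unique_id, off, length))
    return hits
```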
-
-
4. The method of claim 3, further comprising the steps of:
-
presenting text associated with the results hits list to a user;
accepting input from the user for selecting one or more of the results;
converting the one or more of the results into one or more additional search keys; and
performing a second database search using the search key from the first database search and the additional search keys.
-
-
5. The method of claim 3, wherein the results hits list comprises a unique multiple digit number representing each of the matches from said outputting step, wherein an entry in the results hits list comprises:
-
a first number in a first digit comprising the unique ID of the matched result; and
a second number in a second digit comprising the annotation identifier of the matched result.
-
-
6. The method of claim 5, wherein the entry in the results hits list further comprises:
-
a third number comprising an offset of the ordered text string associated with the matched result; and
a fourth number comprising a length of the ordered string associated with the matched result.
-
-
7. The method of claim 6, wherein the offset is appended to the second digit and the length is placed in a third digit of the multiple digit number.
-
8. The method of claim 5, wherein the entry in the results hits list further comprises a third number comprising an annotation order.
-
9. The method of claim 7, wherein the annotation order is stored in the most significant bits of a third digit and a zero is stored in the least significant bits of the third digit.
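One possible reading of the multiple digit number in claims 5 through 9 treats each "digit" as a fixed-width machine word. The sketch below follows that reading; the 32-bit word width, the 24-bit offset field, the 16-bit order field, and all function names are assumptions, since the claims give no bit widths.

```python
from typing import Tuple

WORD_BITS = 32                        # assumed width of one "digit"
OFFSET_BITS = 24                      # assumed low bits of the second digit
ORDER_BITS = 16                       # assumed high bits of the third digit
OFFSET_MASK = (1 << OFFSET_BITS) - 1
ORDER_SHIFT = WORD_BITS - ORDER_BITS


def pack_text_hit(locus_id: int, annotation_id: int,
                  offset: int, length: int) -> Tuple[int, int, int]:
    """Base-text hit: offset appended to the second digit, length in the third."""
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset does not fit in the assumed field")
    return (locus_id, (annotation_id << OFFSET_BITS) | offset, length)


def pack_annotation_hit(locus_id: int, annotation_id: int,
                        order: int) -> Tuple[int, int, int]:
    """Annotation hit: order in the most significant bits, zero below."""
    return (locus_id, annotation_id << OFFSET_BITS, order << ORDER_SHIFT)


def unpack_second_digit(word: int) -> Tuple[int, int]:
    """Split the second digit back into (annotation identifier, offset)."""
    return word >> OFFSET_BITS, word & OFFSET_MASK
```

With this layout, comparing entries digit by digit orders a hits list by locus, then by annotation type, then by offset, which is what makes a plain sort sufficient to group related results.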
-
10. The method of claim 5, wherein said presenting step comprises the steps of:
-
constructing a results E-Hits list from the results hits list, wherein each element of the results E-Hits list corresponds to a particular element in the results hits list and comprises:
string representation of the unique name corresponding to the unique ID;
string representation of the annotation type corresponding to the annotation identifier; and
string representation of the value of the annotation or base text represented by the associated results hits list element.
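The conversion from a results hits list to a results E-Hits list in claim 10 amounts to replacing each numeric field with its string form. A minimal sketch, assuming in-memory lookup tables (`names`, `annotation_types`, `locus_text`) that stand in for the global index, the annotation definitions, and the database files; these names and the `EHit` type are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class EHit(NamedTuple):
    """Text form of one hit (hypothetical type)."""
    locus_name: str
    annotation_type: str
    value: str


def to_ehits(hits: List[Tuple[int, int, int, int]],
             names: Dict[int, str],
             annotation_types: Dict[int, str],
             locus_text: Dict[int, str]) -> List[EHit]:
    """Map each (locus ID, annotation ID, offset, length) hit to strings."""
    return [
        EHit(names[locus_id],
             annotation_types[annotation_id],
             locus_text[locus_id][offset:offset + length])
        for locus_id, annotation_id, offset, length in hits
    ]
```

The two lists stay parallel, so a user's selection in the displayed E-Hits can be traced back to the corresponding hits entry by position.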
-
-
11. A method for searching a genetic sequence database comprising a plurality of loci, each locus having a unique name, one or more annotations, and genetic sequence data represented as an ordered text string, the genetic database being stored in one or more database files, the method comprising the steps of:
-
assigning a unique ID for each locus;
assigning an annotation identifier for each predefined annotation type;
constructing a parsed skeleton file associated with each of the database files, wherein each entry in the parsed skeleton file is associated with a particular locus and comprises a length and offset for each of the annotations and the genetic sequence data within the particular locus;
building an index file associated with each of the database files, wherein each entry in the index file comprises an offset and length into a database file for each locus and an offset and length of the corresponding entry in the parsed skeleton file;
selecting a portion of the genetic database in which to conduct a search;
constructing an input hits list comprising the unique ID of each locus identified in said portion;
specifying a search key comprising one or more keywords and one or more annotation types;
performing a first database search using the input hits list and the search key; and
outputting matches into a results hits list.
-
-
12. The method of claim 11, wherein said step of performing a first database search comprises the steps of:
-
reading the unique ID from the input hits list;
determining which of the one or more database files contains the unique ID;
calculating an offset into the associated index file, where the associated index entry is stored;
consulting the associated index entry to determine an offset and length of the locus and an offset and length of the associated parsed skeleton file entry;
reading the associated parsed skeleton file entry; and
searching for a match of the search key using the parsed skeleton file to parse the locus.
-
-
13. The method of claim 12, further comprising the steps of:
-
presenting text associated with the results hits list to a user;
accepting input from the user for selecting one or more of the results;
converting the one or more of the results into one or more additional search keys; and
performing a second database search using the search key from the first database search and the additional search keys.
-
-
14. A system for searching a genetic sequence database comprising loci, each locus having a unique name, one or more annotations, and an ordered text string, the genetic sequence database being stored in one or more database files, the system comprising:
-
a global index file generator coupled to the genetic sequence database for assigning a unique ID for each locus;
an annotation definition module coupled to the genetic sequence database for assigning an annotation identifier for each predefined annotation type;
a parsed skeleton file generator coupled to the genetic sequence database for constructing a parsed skeleton file associated with each of the database files, wherein each entry in the parsed skeleton file is associated with a particular locus and comprises one or more searchable object names, a length and an offset for each searchable object within the particular locus; and
an index file generator coupled to the genetic sequence database for building an index file associated with each of the database files, wherein each entry in the index file comprises an offset and length into a database file for each locus and an offset and length of the corresponding entry in the parsed skeleton file.
-
-
15. The system of claim 14, further comprising:
-
a read database module coupled to the genetic sequence database for selecting a portion of the genetic sequence database in which to conduct a search and for constructing an input hits list comprising the unique ID of each locus identified in said portion; and
a search module coupled to said read database module for specifying a search key comprising one or more keywords and one or more annotation types and for performing a first database search using said input hits list and said search key; and
outputting matches into a results hits list.
-
-
16. The system of claim 15 further comprising:
-
a context hits list coupled to said read database module and defining a context in which to conduct a context search;
a target hits list coupled to said read database module for defining a target to search in a context search; and
a context search module coupled to said context and target hits lists for searching said portion of the genetic sequence database for instances of targets defined by said target hits list in a context as defined by said context hits list and for constructing a results hits list therefrom.
-
-
17. The system of claim 16, wherein said results hits list comprises entries representing context and target matches, wherein said entries representing target matches include a pointer to an entry representing the relevant context.
-
18. A method for performing a context search on a genetic sequence database comprising loci, each locus having a unique name, one or more annotations, and an ordered text string, the genetic sequence database being stored in one or more database files, the method comprising the steps of:
-
reading an ordered string;
partitioning the ordered string into a plurality of sub-strings each marked either target or context;
specifying one or more context relationships;
searching for sub-strings marked target within regions that satisfy the specified context relationships;
storing matches found in said searching step; and
marking each sub-string found in said searching step with its associated context;
wherein said storing and marking steps comprise the steps of:
creating a results hits list comprising an array wherein each entry of the array comprises an iref number, a type field, and a mark field; and
storing a pointer within each mark field that points to the associated context reference entry.
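The results array of claim 18, in which each target entry points back to its context entry, can be sketched as below. The containment test is only one example of a context relationship, chosen for illustration; the `Span` and `Result` types, the use of an array index as the "pointer", and the function name are likewise assumptions.

```python
from typing import List, NamedTuple, Optional


class Span(NamedTuple):
    """One sub-string of the ordered string (hypothetical type)."""
    iref: int      # reference number of the sub-string
    start: int
    end: int
    kind: str      # "context" or "target"


class Result(NamedTuple):
    iref: int
    type: str
    mark: Optional[int]   # index of the associated context entry, if any


def context_search(spans: List[Span]) -> List[Result]:
    """Find targets lying inside a context span and mark them with it."""
    results: List[Result] = []
    for ctx in (s for s in spans if s.kind == "context"):
        inside = [t for t in spans
                  if t.kind == "target"
                  and ctx.start <= t.start and t.end <= ctx.end]
        if inside:
            ctx_index = len(results)              # where the context entry lands
            results.append(Result(ctx.iref, "context", None))
            results.extend(Result(t.iref, "target", ctx_index) for t in inside)
    return results
```

Storing the context's array index in each target's mark field lets a consumer recover the enclosing context of any match without repeating the search.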
-
-
19. A computer program product comprising a computer useable medium having computer program logic stored therein, said computer program logic for enabling a computer to perform a context search on a genetic sequence database comprising loci, each locus having a unique name, one or more annotations, and an ordered text string, the genetic sequence database being stored in one or more database files, wherein said computer program logic comprises:
-
read means for enabling the computer to read an ordered string;
partition means for enabling the computer to partition the ordered string into a plurality of sub-strings each marked either target or context;
means for enabling the computer to specify one or more context relationships;
searching means for enabling the computer to search for sub-strings marked target within regions that satisfy the specified context relationships;
storage means for enabling the computer to store matches found in said searching step;
means for enabling the computer to mark each sub-string found in said searching step with its associated context;
means for enabling the computer to create a results hits list comprising an array, wherein each entry of the array comprises an iref number, a type field, and a mark field; and
means for enabling the computer to store a pointer within each mark field that points to the associated context reference entry.
-
-
20. A method for searching a genetic sequence database, the database comprising loci, each locus having a unique name, one or more annotations, and an ordered text string, the database being stored in one or more database files, the method comprising the steps of:
-
constructing a file map for the database, said file map comprising the file name of each database file in the database and the number of loci within each file;
constructing a global index comprising the names of all the loci and a unique ID for each locus;
building a parsed skeleton file associated with each database file, said parsed skeleton file comprising a plurality of entries, each entry associated with an individual locus, wherein each entry comprises one or more searchable object names, and an offset and length for each searchable object within a locus;
building an index file associated with each database file, said index file comprising a plurality of entries, each entry associated with an individual locus, wherein each entry comprises an offset into a database file, a length of the locus, an offset into the corresponding parsed skeleton file, and a length of the parsed skeleton file;
retrieving a unique ID associated with a particular locus of interest;
consulting the file map to determine the database file that contains the particular locus of interest;
calculating the offset into said index file associated with said database file;
reading the index file entry and the parsed skeleton file entry into memory; and
reading a first search query and conducting a first database search.
-
-
presenting text associated with the results hits list to a user;
accepting input from the user for selecting one or more of the results;
converting the one or more results into one or more additional search queries; and
performing a second database search using the search query from the first database search and the additional one or more search queries.
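The file map consultation and index offset calculation recited in claim 20 could work as in the following sketch. It assumes that unique IDs are assigned consecutively from zero in file order and that index entries have a fixed size of 16 bytes; neither assumption comes from the claim, and the function name `locate` is hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple

INDEX_ENTRY_SIZE = 16   # assumed: four 32-bit fields per index entry


def locate(unique_id: int, file_map: List[Tuple[str, int]]) -> Tuple[str, int]:
    """Return (database file name, byte offset into its index file).

    file_map lists (file name, number of loci) for each database file.
    """
    ends = list(accumulate(count for _, count in file_map))  # cumulative loci
    if not ends or not 0 <= unique_id < ends[-1]:
        raise KeyError(unique_id)
    i = bisect_right(ends, unique_id)        # first file whose range covers the ID
    first_id = ends[i] - file_map[i][1]      # ID of the first locus in that file
    return file_map[i][0], (unique_id - first_id) * INDEX_ENTRY_SIZE
```

Because the file map stores only a name and a count per file, locating a locus needs no search through the global index once its unique ID is known.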
-
-
25. A method according to claim 20, further comprising the step of assigning an annotation identifier for each predefined annotation type, said assigning step occurring prior to the first database search.
-
26. A method according to claim 25, wherein the first search query of the first database search comprises one or more keywords and one or more annotation types.
Specification