System and method for indexing weighted-sequences in large databases
First Claim
1. A method of generating an index for a sequence that supports a non-contiguous subsequence match, comprising:
- receiving a sequence;
receiving a window size;
encoding the sequence into a weighted-sequence;
encoding the weighted sequence into one or more one-dimensional sequences, wherein the length of each of the one or more one-dimensional sequences is less than the window size;
inserting each of the one or more one-dimensional sequences into a trie structure; and
generating the index, comprising;
generating a current sequential ID and a maximum sequential ID pair for generating each of the one or more trie nodes, wherein the current sequential ID of any descendant of a given trie node is between the current sequential ID of the given trie node and the maximum sequential ID;
generating an iso-depth link for each unique symbol in each of the one or more one-dimensional sequences, wherein the iso-depth link comprises trie nodes under the symbol; and
generating an offset list comprising an original position of each of the one or more subsequences in the weighted-sequence.
3 Assignments
0 Petitions
Accused Products
Abstract
The present invention provides an index structure for managing weighted-sequences in large databases. A weighted-sequence is defined as a two-dimensional structure in which each element in the sequence is associated with a weight. A series of network events, for instance, is a weighted-sequence because each event is associated with a timestamp. Querying a large sequence database by events'"'"' occurrence patterns is a first step towards understanding the temporal causal relationships among the events. The index structure proposed herein enables the efficient retrieval from the database of all subsequences (contiguous and non-contiguous) that match a given query sequence both by events and by weights. The index structure also takes into consideration the nonuniform frequency distribution of events in the sequence data.
90 Citations
16 Claims
-
1. A method of generating an index for a sequence that supports a non-contiguous subsequence match, comprising:
-
receiving a sequence;
receiving a window size;
encoding the sequence into a weighted-sequence;
encoding the weighted sequence into one or more one-dimensional sequences, wherein the length of each of the one or more one-dimensional sequences is less than the window size;
inserting each of the one or more one-dimensional sequences into a trie structure; and
generating the index, comprising;
generating a current sequential ID and a maximum sequential ID pair for generating each of the one or more trie nodes, wherein the current sequential ID of any descendant of a given trie node is between the current sequential ID of the given trie node and the maximum sequential ID;
generating an iso-depth link for each unique symbol in each of the one or more one-dimensional sequences, wherein the iso-depth link comprises trie nodes under the symbol; and
generating an offset list comprising an original position of each of the one or more subsequences in the weighted-sequence. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14)
-
-
13. A method of matching a query sequence in a weighted-sequences index, comprising:
-
receiving the query sequence;
transforming the query sequence into a weighted sequence;
encoding the weighted sequence into one or more one-dimensional sequences; and
searching one or more iso-depth links of the weighted-sequences index structure using the one or more one-dimensional sequences and returning an offset.
-
-
15. A machine-readable medium having instructions stored thereon for execution by a processor to perform a method of generating an index for a sequence that supports a non-contiguous subsequence match, comprising the steps of:
-
receiving a sequence;
receiving a window size;
encoding the sequence into a weighted-sequence;
encoding the weighted sequence into one or more one-dimensional sequences, wherein the length of each of the one or more one-dimensional sequences is less than the window size;
inserting each of the one or more one-dimensional sequences into a trie structure; and
creating the index comprising;
generating a current sequential ID and a maximum sequential ID pair for each of the one or more trie nodes, wherein the current sequential ID of any descendant of a given trie node is between the current sequential ID of the given trie node and the maximum sequential ID;
generating an iso-depth link for each unique symbol in each of the one or more one-dimensional sequences, wherein the iso-depth link comprises trie nodes under the symbol; and
generating an offset list comprising an original position of each of the one or more subsequences in the weighted-sequence.
-
-
16. A machine-readable medium having instructions stored thereon for execution by a processor to perform a method of matching a query sequence in a weighted-sequences index, comprising the steps of:
receiving the query sequence;
transforming the query sequence into a weighted sequence;
encoding the weighted sequence into one or more one-dimensional sequences; and
searching one or more iso-depth links of the weighted-sequences index structure using the one or more one-dimensional sequences and returning an offset.
Specification