Methods and apparatus for performing pattern discovery and generation with respect to data sequences
First Claim
1. A method of detecting repeating patterns in an input data sequence, wherein the data sequence includes elements from an element alphabet, the method comprising the steps of:
- obtaining the input data sequence;
constructing a set of patterns from the input data sequence, each pattern being unique and including one or more elements from the input data sequence, and each pattern having a list associated therewith representing the location of the pattern in the input data sequence;
removing a pattern from the set when the location list of the pattern is a union of the location lists of at least two other patterns in the set;
for each pair of compatible patterns in the set, constructing a new pattern which is a concatenation of the pair of compatible patterns, each new pattern having a location list associated therewith; and
storing the patterns, and associated location lists, remaining after the removing step and the new pattern constructing step as the detected repeating patterns.
1 Assignment
0 Petitions
Accused Products
Abstract
Given an input sequence of data, a motif is a repeating pattern. The data could be a sequence of characters or sets of characters or even real values. In the first two cases, the number of motifs could potentially be exponential in the size of the input sequence and in the third case there could be uncountably infinite number of motifs. By suitably defining the notion of maximality and redundancy for any sequence with n characters, there exists only a linear (or no more than 3n) number of special motifs and every other motif can be generated from these irredundant motifs.
36 Citations
21 Claims
-
1. A method of detecting repeating patterns in an input data sequence, wherein the data sequence includes elements from an element alphabet, the method comprising the steps of:
-
obtaining the input data sequence;
constructing a set of patterns from the input data sequence, each pattern being unique and including one or more elements from the input data sequence, and each pattern having a list associated therewith representing the location of the pattern in the input data sequence;
removing a pattern from the set when the location list of the pattern is a union of the location lists of at least two other patterns in the set;
for each pair of compatible patterns in the set, constructing a new pattern which is a concatenation of the pair of compatible patterns, each new pattern having a location list associated therewith; and
storing the patterns, and associated location lists, remaining after the removing step and the new pattern constructing step as the detected repeating patterns. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
removing the vertex ν
mc1 and replacing the associated pattern mp of vertex ν
mp with a union of mp and mcl, when l is equivalent to one; and
removing the vertex ν
mp and its incident edges, when l is greater than one.
-
-
10. The method of claim 9, wherein the second constructing step further comprises:
-
defining a pair of patterns ma and mb to be compatible when the last character in ma is equivalent to the first character in mb;
for each pair of compatible patterns mi, mj, corresponding to vertices in the graph, letting ′
mi be equivalent to the sum of mi and the length of pattern mi;
constructing mnew to be equivalent to the intersection of ′
mi and mj; and
constructing pattern mnew when there exist compatible patterns mi and mj such that the union of;
(i) the sum of mi and the length of pattern mi minus one; and
(ii) mj, is equivalent to mnew, and the cardinality of mnew is greater than or equal to a value k.
-
-
11. The method of claim 10, further comprising the step of updating the graph by introducing a vertex corresponding to pattern mnew and introducing a directed edge from ν
-
mi to ν
mnew.
-
mi to ν
-
12. The method of claim 11, repeating the removing step and the new pattern constructing step until mnew is empty, for every pair of compatible patterns mi, mj with mnew equivalent to the union of mi and mj.
-
13. The method of claim 1, wherein the stored patterns are maximal and non-redundant.
-
14. The method of claim 13, further comprising the step of generating patterns, from the stored patterns, which are at least one of non-maximal and redundant.
-
15. The method of claim 14, further comprising the step of storing the at least one of non-maximal and redundant patterns.
-
16. The method of claim 14, wherein the at least one of non-maximal and redundant patterns are generated in accordance with one or more annotated tries.
-
17. The method of claim 1, wherein the input data sequence is a protein sequence.
-
18. The method of claim 17, wherein the stored patterns are used in accordance with protein sequence homology detection.
-
19. The method of claim 1, wherein the input data sequence is obtained from a client device via a network and the first constructing step, the removing step, the second constructing step and the storing step are performed in accordance with a server coupled to the network.
-
20. Apparatus for detecting repeating patterns in an input data sequence, wherein the data sequence includes elements from an element alphabet, the apparatus comprising:
at least one processor operative to;
(i) obtain the input data sequence;
(ii) construct a set of patterns from the input data sequence, each pattern being unique and including one or more elements from the input data sequence, and each pattern having a list associated therewith representing the location of the pattern in the input data sequence;
(iii) remove a pattern from the set when the location list of the pattern is a union of the location lists of at least two other patterns in the set;
(iv) for each pair of compatible patterns in the set, construct a new pattern which is a concatenation of the pair of compatible patterns, each new pattern having a location list associated therewith; and
(v) store the patterns, and associated location lists, remaining after the removing operation and the new pattern constructing operation as the detected repeating patterns.
-
21. An article of manufacture for detecting repeating patterns in an input data sequence, wherein the data sequence includes elements from an element alphabet, comprising a machine readable medium containing one or more programs which when executed implement the steps of:
-
obtaining the input data sequence;
constructing a set of patterns from the input data sequence, each pattern being unique and including one or more elements from the input data sequence, and each pattern having a list associated therewith representing the location of the pattern in the input data sequence;
removing a pattern from the set when the location list of the pattern is a union of the location lists of at least two other patterns in the set;
for each pair of compatible patterns in the set, constructing a new pattern which is a concatenation of the pair of compatible patterns, each new pattern having a location list associated therewith; and
storing the patterns, and associated location lists, remaining after the removing step and the new pattern constructing step as the detected repeating patterns.
-
Specification