Method for identifying sub-sequences of interest in a sequence
First Claim
1. A tangible, machine-readable media, comprising:
code adapted to analyze a data sequence, consisting essentially of a plurality of symbols, based on a grammar comprising at least an initial grammar, wherein the grammar defines at least the plurality of symbols contained within the data sequence;
code adapted to partition the data sequence into a plurality of sub-sequences and to calculate a statistical heuristic for each sub-sequence of the analyzed data sequence;
code adapted to compare a selected statistical heuristic of a selected sub-sequence with one or more reference conditions, and to yield a termination result if the selected statistical heuristic is beyond a threshold defined by the one or more reference conditions or a non-termination result otherwise;
code adapted to update the grammar and the data series with a symbol representing the selected sub-sequence based upon the non-termination result of the comparison; and
code adapted to identify the selected sub-sequence as a sequence of interest based upon the termination result of the comparison.
Abstract
The present technique provides for the analysis of a data series to identify sequences of interest within the series. The analysis may be used to iteratively update a grammar used to analyze the data series or updated versions of the data series. Furthermore, the technique provides for the calculation of a minimum description length heuristic, such as a symbol compression ratio, for each sub-sequence of the analyzed data sequence. The technique may then compare a selected heuristic value against one or more reference conditions to determine if additional iteration is to be performed. The grammar and the data sequence may be updated between iterations to include a symbol representing a string corresponding to the selected heuristic value based upon a non-termination result of the comparison. Alternatively, the string corresponding to the selected heuristic value may be identified as a sequence of interest based upon a termination result of the comparison.
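The iteration described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `compression_ratio` is a simplified stand-in for a symbol compression ratio (rewritten length plus the cost of one grammar rule, divided by the current length), "beyond a threshold" is assumed to mean the ratio falls at or below a cutoff, and all function names, the `<R0>`-style rule symbols, and the `max_len` parameter are hypothetical.

```python
def count_non_overlapping(data, sub):
    """Count left-to-right, non-overlapping occurrences of sub in data."""
    count, i, n = 0, 0, len(sub)
    while i <= len(data) - n:
        if tuple(data[i:i + n]) == sub:
            count += 1
            i += n
        else:
            i += 1
    return count


def compression_ratio(data_len, sub_len, repeats):
    """Assumed heuristic: (rewritten length + rule cost) / current length."""
    rewritten = data_len - repeats * (sub_len - 1)
    rule_cost = sub_len + 1  # right-hand side symbols plus the new symbol
    return (rewritten + rule_cost) / data_len


def replace_all(data, sub, symbol):
    """Rewrite data, replacing each occurrence of sub with symbol."""
    out, i, n = [], 0, len(sub)
    while i < len(data):
        if tuple(data[i:i + n]) == sub:
            out.append(symbol)
            i += n
        else:
            out.append(data[i])
            i += 1
    return out


def expand(symbol, grammar):
    """Expand a grammar symbol back into the original terminal symbols."""
    rule = grammar[symbol]
    if rule == (symbol,):  # terminals map to themselves
        return [symbol]
    return [t for s in rule for t in expand(s, grammar)]


def find_sequence_of_interest(data, threshold, max_len=2):
    data = list(data)
    grammar = {s: (s,) for s in data}  # initial grammar: the observed symbols
    next_id = 0
    while True:
        # Partition: every sub-sequence of length 2..max_len in the data.
        candidates = {tuple(data[i:i + n])
                      for n in range(2, max_len + 1)
                      for i in range(len(data) - n + 1)}
        scored = []
        for sub in candidates:
            repeats = count_non_overlapping(data, sub)
            if repeats >= 2:
                ratio = compression_ratio(len(data), len(sub), repeats)
                scored.append((ratio, sub))
        if not scored:
            return None, grammar  # safeguard: nothing repeats any more
        ratio, sub = min(scored)  # select the best-compressing sub-sequence
        if ratio <= threshold:
            # Termination result: report the selected sub-sequence.
            return [t for s in sub for t in expand(s, grammar)], grammar
        # Non-termination result: update the grammar and the data.
        symbol = f"<R{next_id}>"
        next_id += 1
        grammar[symbol] = sub
        data = replace_all(data, sub, symbol)


motif, grammar = find_sequence_of_interest("abcabcabcabcabcabc", threshold=0.8)
print(motif)
```

With `max_len=2` the first pass selects the pair `a b`, whose ratio (15/18) does not meet the 0.8 cutoff, so the pair is folded into the grammar as a new symbol. On the second pass the pair made of that new symbol and `c` scores 9/12 and is reported, expanded back to `a b c`. The early return when no sub-sequence repeats is an added safeguard so that the loop always ends.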
9 Claims
1. A tangible, machine-readable media, comprising:
code adapted to analyze a data sequence, consisting essentially of a plurality of symbols, based on a grammar comprising at least an initial grammar, wherein the grammar defines at least the plurality of symbols contained within the data sequence;
code adapted to partition the data sequence into a plurality of sub-sequences and to calculate a statistical heuristic for each sub-sequence of the analyzed data sequence;
code adapted to compare a selected statistical heuristic of a selected sub-sequence with one or more reference conditions, and to yield a termination result if the selected statistical heuristic is beyond a threshold defined by the one or more reference conditions or a non-termination result otherwise;
code adapted to update the grammar and the data series with a symbol representing the selected sub-sequence based upon the non-termination result of the comparison; and
code adapted to identify the selected sub-sequence as a sequence of interest based upon the termination result of the comparison.
Dependent claims: 2, 3, 4, 5, 6
7. A tangible, machine-readable media, comprising:
code adapted to analyze a biological polymer sequence, consisting essentially of a plurality of symbols, based on a grammar comprising at least an initial grammar, wherein the grammar defines at least the plurality of symbols contained within the biological polymer sequence;
code adapted to partition the biological polymer sequence into a plurality of sub-sequences and to calculate a minimum description length heuristic for each sub-sequence of the analyzed biological polymer sequence;
code adapted to compare a selected minimum description length heuristic corresponding to a selected sub-sequence with one or more reference conditions, and to yield a termination result if the selected minimum description length heuristic is beyond a threshold defined by the one or more reference conditions or a non-termination result otherwise;
code adapted to update the grammar and the biological polymer sequence with a symbol representing the selected sub-sequence based upon the non-termination result of the comparison; and
code adapted to identify the selected sub-sequence as a biological sequence of interest based upon the termination result of the comparison.
Dependent claims: 8, 9
Specification