Using dotplots for comparing and finding patterns in sequences of data points
First Claim
1. A method for identifying one or more patterns in a plurality of sequences, the method comprising:
reading a set of sequential data with an analysis system, wherein the sequential data comprises a plurality of sequences, each of the plurality of sequences representing an ordered sequence of tokens;
generating, with the analysis system, a dotplot based on the tokens and representing matches between each sequence of the plurality of sequences; and
identifying, with the analysis system, one or more patterns within the sequential data based on the dotplot by identifying linear relationships between the tokens, wherein identifying linear relationships between the tokens comprises:
determining a dotplot sub-matrix plotting tokens from two sequences;
identifying a set of points in the sub-matrix that corresponds to matching tokens in corresponding sub-sequences;
filtering the identified points against a pre-determined high-pass threshold;
fitting a linear regression line to the filtered points;
computing variance criterion based on Euclidean distances between the regression line and the filtered points;
filtering the filtered points to those within the variance criterion; and
re-computing the linear regression line using the points within the variance criterion.
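The refinement steps recited above (threshold filter, first fit, variance criterion, second fit) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the claim does not say which quantity the high-pass threshold applies to, so each point is assumed to carry a hypothetical match weight, and the variance criterion is assumed to be `k` times the root-mean-square perpendicular distance to the first fitted line. The names `fit_line`, `distance_to_line`, `refine_line`, `weight_threshold`, and `k` are illustrative only.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]                  # (x, y) position in the dotplot sub-matrix
WeightedPoint = Tuple[float, float, float]   # (x, y, weight); the weight is an assumption


def fit_line(points: List[Point]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all points share the same x; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def distance_to_line(x: float, y: float, slope: float, intercept: float) -> float:
    """Perpendicular (Euclidean) distance from (x, y) to the fitted line."""
    return abs(slope * x - y + intercept) / math.hypot(slope, 1.0)


def refine_line(points: List[WeightedPoint], weight_threshold: float,
                k: float = 2.0) -> Tuple[float, float, List[Point]]:
    """Threshold-filter, fit, drop far-away points, then fit again."""
    # High-pass filter: keep points whose (assumed) weight meets the threshold.
    kept = [(x, y) for x, y, w in points if w >= weight_threshold]
    slope, intercept = fit_line(kept)

    # Variance criterion (assumed form): k times the RMS distance to the line.
    dists = [distance_to_line(x, y, slope, intercept) for x, y in kept]
    limit = k * math.sqrt(sum(d * d for d in dists) / len(dists))

    # Keep only the points within the criterion and re-compute the line.
    inliers = [p for p, d in zip(kept, dists) if d <= limit]
    if len(inliers) < 2:
        return slope, intercept, kept
    slope, intercept = fit_line(inliers)
    return slope, intercept, inliers


# Ten matches on the main diagonal, one strong off-diagonal match,
# and one weak match that the threshold filter removes.
sample = [(float(i), float(i), 1.0) for i in range(10)]
sample += [(5.0, 0.0, 1.0), (2.0, 9.0, 0.1)]
slope, intercept, inliers = refine_line(sample, weight_threshold=0.5)
```

In this sample the weak match is dropped by the threshold, the off-diagonal match is dropped by the distance criterion, and the second fit runs on the ten diagonal points only.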
Abstract
Embodiments of the invention provide systems and methods for analyzing sequential data. The sequential data can comprise a sequence of data points arranged in a particular order and thus representing a sequence. A number of such sequences can be analyzed, for example, to identify patterns or commonalities within the sequences or portions of sequences represented by the data. According to one embodiment, a method of identifying patterns in sequences of data points can comprise reading a set of sequential data. The sequential data can comprise a plurality of sequences, and each of the plurality of sequences can represent an ordered sequence of tokens. A dotplot representing matches between each sequence of the plurality of sequences can be generated. One or more patterns within the sequential data can then be identified based on the dotplot.
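As a rough illustration of the dotplot described in the abstract, the sketch below records a point (i, j) wherever token i of one sequence equals token j of another, and builds one such point set for every pair of sequences. It assumes that a "match" means exact token equality. The function names `dotplot` and `pairwise_dotplots` and the sample tokens are hypothetical and do not come from the patent.

```python
from itertools import combinations
from typing import Dict, Hashable, List, Sequence, Tuple


def dotplot(seq_a: Sequence[Hashable],
            seq_b: Sequence[Hashable]) -> List[Tuple[int, int]]:
    """Return every (i, j) with seq_a[i] == seq_b[j], ordered by i then j."""
    # Index the positions of each token in seq_b once, so each token of
    # seq_a is looked up directly instead of rescanning seq_b.
    positions: Dict[Hashable, List[int]] = {}
    for j, token in enumerate(seq_b):
        positions.setdefault(token, []).append(j)
    return [(i, j) for i, token in enumerate(seq_a)
            for j in positions.get(token, [])]


def pairwise_dotplots(
        sequences: Sequence[Sequence[Hashable]]
) -> Dict[Tuple[int, int], List[Tuple[int, int]]]:
    """One sub-matrix of match points per unordered pair of sequences."""
    return {(p, q): dotplot(sequences[p], sequences[q])
            for p, q in combinations(range(len(sequences)), 2)}


# Hypothetical token sequences, e.g. ordered user actions.
a = ["login", "view", "buy", "logout"]
b = ["view", "buy", "help"]
points = dotplot(a, b)
plots = pairwise_dotplots([a, b, a])
```

Here the shared run "view", "buy" appears as the points (1, 0) and (2, 1), which lie on a diagonal. Diagonal runs of this kind are the linear relationships the claims go on to detect.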
20 Citations
14 Claims
1. A method for identifying one or more patterns in a plurality of sequences, the method comprising:
reading a set of sequential data with an analysis system, wherein the sequential data comprises a plurality of sequences, each of the plurality of sequences representing an ordered sequence of tokens;
generating, with the analysis system, a dotplot based on the tokens and representing matches between each sequence of the plurality of sequences; and
identifying, with the analysis system, one or more patterns within the sequential data based on the dotplot by identifying linear relationships between the tokens, wherein identifying linear relationships between the tokens comprises:
determining a dotplot sub-matrix plotting tokens from two sequences;
identifying a set of points in the sub-matrix that corresponds to matching tokens in corresponding sub-sequences;
filtering the identified points against a pre-determined high-pass threshold;
fitting a linear regression line to the filtered points;
computing variance criterion based on Euclidean distances between the regression line and the filtered points;
filtering the filtered points to those within the variance criterion; and
re-computing the linear regression line using the points within the variance criterion.
View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
9. A system for identifying one or more patterns in a plurality of sequences, the system comprising:
a processor; and
a memory communicatively coupled with and readable by the processor and having stored therein a series of instructions which, when executed by the processor, cause the processor to read a set of sequential data, wherein the sequential data comprises a plurality of sequences, each of the plurality of sequences representing an ordered sequence of tokens, generate a dotplot representing matches between each sequence of the plurality of sequences, and identify one or more patterns within the sequential data based on the dotplot by identifying linear relationships between the tokens, wherein identifying linear relationships between the tokens comprises:
determining a dotplot sub-matrix plotting tokens from two sequences;
identifying a set of points in the sub-matrix that corresponds to matching tokens in corresponding sub-sequences;
filtering the identified points against a pre-determined high-pass threshold;
fitting a linear regression line to the filtered points;
computing variance criterion based on Euclidean distances between the regression line and the filtered points;
filtering the filtered points to those within the variance criterion; and
re-computing the linear regression line using the points within the variance criterion.
View Dependent Claims (10, 11)
12. A machine-readable memory device having stored thereon a series of instructions which, when executed by a processor, cause the processor to identify one or more patterns in a plurality of sequences by:
reading a set of sequential data, wherein the sequential data comprises a plurality of sequences, each of the plurality of sequences representing an ordered sequence of tokens;
generating a dotplot representing matches between each sequence of the plurality of sequences; and
identifying one or more patterns within the sequential data based on the dotplot by identifying linear relationships between the tokens, wherein identifying linear relationships between the tokens comprises:
determining a dotplot sub-matrix plotting tokens from two sequences;
identifying a set of points in the sub-matrix that corresponds to matching tokens in corresponding sub-sequences;
filtering the identified points against a pre-determined high-pass threshold;
fitting a linear regression line to the filtered points;
computing variance criterion based on Euclidean distances between the regression line and the filtered points;
filtering the filtered points to those within the variance criterion; and
re-computing the linear regression line using the points within the variance criterion.
View Dependent Claims (13, 14)
Specification