Computer implemented methods for constructing a compressed data structure from a data string and for using the data structure to find data patterns in the data string
First Claim
1. A method comprising the following steps:
- producing a matrix of sorted rotations of a data string of characters, the matrix defining an A array comprising a sorted listing of the characters in the data string and a B array which is a permutation of the data string and a C array which contains entries of correspondence information linking the characters in the A array to the same characters in the B array;
segmenting the B array into blocks;
defining indexing information used to find particular characters within the blocks to reconstruct patterns of characters within the data string;
storing the blocks of the B array and the indexing information in a data structure; and
finding patterns of characters within the data string using the data structure.
2 Assignments
0 Petitions
Accused Products
Abstract
A method for constructing a data structure for a data string of characters includes producing a matrix of sorted rotations of the data string. This matrix defines an A array which is a sorted list of the characters in the data string, a B array which is a permutation of the data string, and a correspondence array C which contains correspondence entries linking the characters in the A array to the same characters in the B array. A reduced A'"'"' array is computed to identify each unique character in the A array and a reduced C'"'"' array is computed to contain every sth entry of the C array. The B array is segmented into blocks of size s. During a search, the A'"'"' and C'"'"' arrays are used to index the B array to reconstruct any desired row from the matrix of rotations. Through this representation, the matrix of rotations can thus be used as a conventional sorted list for pattern matching or information retrieval applications. A data structure containing only the A'"'"', B, and C'"'"' has very little memory overhead. The B array contains the same number of characters as the original data string, and can be compressed in a block wise manner to reduce its size. The A'"'"' array is a fixed size equal to the size of the alphabet used to construct the data string, and the C'"'"' array is variable size according to the relationship n/s, where n is the number of characters in the data string and s is the size of the blocks of the B array. Accordingly, the data structure enables a tradeoff between access speed and memory overhead, the product of which is constant with respect to block size s.
125 Citations
35 Claims
-
1. A method comprising the following steps:
-
producing a matrix of sorted rotations of a data string of characters, the matrix defining an A array comprising a sorted listing of the characters in the data string and a B array which is a permutation of the data string and a C array which contains entries of correspondence information linking the characters in the A array to the same characters in the B array; segmenting the B array into blocks; defining indexing information used to find particular characters within the blocks to reconstruct patterns of characters within the data string; storing the blocks of the B array and the indexing information in a data structure; and finding patterns of characters within the data string using the data structure. - View Dependent Claims (2, 3, 4, 6, 7, 9, 10, 11)
-
-
5. A method comprising the following steps:
-
producing a matrix of sorted rotations of the data string, the matrix defining a B array which is a permutation of the data string; segmenting the B array into blocks; defining indexing information used to find particular characters within the blocks to reconstruct patterns of characters within the data string; storing the blocks of the B array and the indexing information in a data structure; examining each block of the B array; for each associated block, constructing character type information that identifies which types of characters and how many characters of particular types are contained within the associated block; organizing the data structure to include the blocks of the B array and a separate array containing the character type information for correlated blocks; and finding patterns of characters within the data string using the data structure.
-
-
8. A method comprising the following steps:
-
producing a matrix of sorted rotations of the data string, the matrix defining a B array which is a permutation of the data string; segmenting the B array into blocks; defining indexing information used to find particular characters within the blocks to reconstruct patterns of characters within the data string; storing the blocks of the B array and the indexing information in a data structure; examining each block of the B array; for each block, constructing character jump information to reference where a next character of a particular type occurs in the blocks of the B array; organizing the data structure to include the blocks of the B array and a separate array containing the character jump information for correlated blocks; and finding patterns of characters within the data string using the data structure.
-
-
12. A method comprising the following steps:
-
producing a matrix of sorted rotations of the data string, the matrix defining an A array comprising a sorted listing of the characters in the data string and a B array which is a permutation of the data string; computing a correspondence array C which contains correspondence entries linking the characters in the A array to the same characters in the B array; computing a reduced A'"'"' array which identifies each unique character in the A array; computing a reduced C'"'"' array which contains every sth entry of the C array; segmenting the B array into blocks of size s; storing the blocks of the B array, the A'"'"' array, and the C'"'"' array in a data structure; and finding patterns of characters within the data string using the data structure. - View Dependent Claims (13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
-
-
24. A computer-readable memory having a data structure, the data structure being derived from a matrix of sorted rotations formed from a data string of characters, the matrix defining an A array comprising a sorted listing of the characters in the data string and a B array which is a permutation of the data string and a C array which contains entries of correspondence information linking the characters in the A array to the same characters in the B array, the data structure comprising:
-
multiple blocks having substrings of characters segmented from the B array; and at least one index array used to index the blocks. - View Dependent Claims (25, 26, 27)
-
-
28. A computer-readable memory having a data structure, the data structure being derived from a matrix of sorted rotations formed from a data string of characters, the matrix defining an A array comprising a sorted listing of the characters in the data string and a B array which is a permutation of the data string and a C array which contains entries of correspondence information linking the characters in the A array to the same characters in the B array, the data structure comprising:
-
multiple blocks having substrings of s characters segmented from the B array; an A'"'"' array which identifies each unique character in the A array; a C'"'"' array which contains every sth entry of the C array; and the A'"'"' and C'"'"' arrays being used to reconstruct the data string of characters from the multiple blocks when searching for a particular pattern of characters within the data string. - View Dependent Claims (29, 30, 31)
-
-
32. A computer system comprising:
-
a processor; an application executable on the processor to produce a matrix of sorted rotations of a data string of characters, the matrix defining an A array comprising a sorted listing of the characters in the data string and a B array which is a permutation of the data string and a C array which contains entries of correspondence information linking the characters in the A array to the same characters in the B array, the application further directing the processor to segment the B array into blocks and to define indexing information used to find particular characters within the blocks to reconstruct patterns of characters within the data string; a memory to store a data structure containing the blocks of the B array and the indexing information; and the processor finding patterns of characters within the data string using the data structure. - View Dependent Claims (33, 34, 35)
-
Specification