Computerized methods of data compression and analysis
First Claim
1. A computerized method of compressing symbolic information organized into a plurality of documents, each document having a plurality of symbols, the method comprising:
- (a) with a first document of the plurality of documents as an input document, automatically with a computer;
(i) identifying a plurality of symbol pairs, each symbol pair consisting of two sequential symbols in the input document; and
(ii) for each unique symbol pair of the plurality of symbol pairs, updating a count identifying the number of appearances of the unique symbol pair;
(b) performing part (a) on each of the other documents of the plurality of documents, wherein the respective counts for the symbol pairs identifies the number of previous appearances of that symbol pair in any of the plurality of documents; and
(c) after part (b), for at least one of the plurality of documents, producing a compressed document by causing the compressed document to include, at each position associated with one of the plurality of symbol pairs from the input document, a replacement symbol associated by a compression dictionary with the unique symbol pair matching the one of the plurality of symbol pairs, if the count for the unique symbol pair exceeds a threshold.
0 Assignments
0 Petitions
Accused Products
Abstract
A computerized method and apparatus compresses symbolic information, such as text. Symbolic information is compressed by recursively identifying pairs of symbols (e.g., pairs of words or characters) and replacing each pair with a respective replacement symbol. The number of times each symbol pair appears in the uncompressed text is counted, and pairs are only replaced if they appear more than a threshold number of times. In recursive passes, each replaced pair can include a previously substituted replacement symbol. The method and apparatus can achieve high compression especially for large datasets. Metadata, such as the number of times each pair appears, generated during compression of the documents can be used to analyze the documents and find similarities between two documents.
32 Citations
33 Claims
-
1. A computerized method of compressing symbolic information organized into a plurality of documents, each document having a plurality of symbols, the method comprising:
-
(a) with a first document of the plurality of documents as an input document, automatically with a computer; (i) identifying a plurality of symbol pairs, each symbol pair consisting of two sequential symbols in the input document; and (ii) for each unique symbol pair of the plurality of symbol pairs, updating a count identifying the number of appearances of the unique symbol pair; (b) performing part (a) on each of the other documents of the plurality of documents, wherein the respective counts for the symbol pairs identifies the number of previous appearances of that symbol pair in any of the plurality of documents; and (c) after part (b), for at least one of the plurality of documents, producing a compressed document by causing the compressed document to include, at each position associated with one of the plurality of symbol pairs from the input document, a replacement symbol associated by a compression dictionary with the unique symbol pair matching the one of the plurality of symbol pairs, if the count for the unique symbol pair exceeds a threshold. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
-
-
16. A computer system having at least one processor capable of compressing symbolic information organized into a plurality of documents, each document having a plurality of symbols, the computer system programmed to:
-
(a) with a first document of the plurality of documents as an input document, automatically with the processor; (i) identify a plurality of symbol pairs, each symbol pair consisting of two sequential symbols in the input document; (ii) for each unique symbol pair, update a count identifying the number of appearances of the symbol pair; (b) perform part (a) on each of the other documents of the plurality of documents, wherein the respective counts for the symbol pairs identifies the number of previous appearances of that symbol pair in any of the plurality of documents; and (c) after part (b), for at least one of the plurality of documents, produce a compressed document by causing the compressed document to include, at each position associated with one of the plurality of symbol pairs from the input document, a replacement symbol associated by a compression dictionary with the unique symbol pair matching the one of the plurality of symbol pairs, if the count for the unique symbol pair exceeds a threshold. - View Dependent Claims (17, 18, 19, 20, 21, 22, 23, 24)
-
-
25. A computer-readable storage medium that is not a transitory propagating signal storing a set of computer instructions for compressing symbolic information organized into a plurality of documents, each document having a plurality of symbols, wherein the set of computer instructions, when executed on a computer, causes the computer, automatically:
-
(a) with a first document of the plurality of documents as an input document; (i) to identify a plurality of symbol pairs, each symbol pair consisting of two sequential symbols in the input document; (ii) for each unique symbol pair, to update a count identifying the number of appearances of the symbol pair; (b) to perform part (a) on each of the other documents of the plurality of documents, wherein the respective counts for the symbol pairs identifies the number of previous appearances of that symbol pair in any of the plurality of documents; and (c) after part (b), for at least one of the plurality of documents, to produce a compressed document by causing the compressed document to include, at each position associated with one of the plurality of symbol pairs from the input document, a replacement symbol associated by a compression dictionary with the unique symbol pair matching the one of the plurality of symbol pairs, if the count for the unique symbol pair exceeds a threshold. - View Dependent Claims (26, 27, 28, 29, 30, 31, 32, 33)
-
Specification