Method and system for compression indexing and efficient proximity search of text data
First Claim
1. A computer implemented method of compression indexing, comprising the steps of:
- selecting at least one data file, said data file having target text;
identifying each and every unique token, each of the unique tokens having a frequency;
counting the frequency of each unique token;
calculating parameters;
ranking the tokens from highest frequency to lowest frequency;
compressing the frequencies;
assigning a position to each instance of each and every unique token;
compressing the positions;
aggregating tokens, frequencies, parameters, and positions to form a compression index, said compression index being exhaustive of all said tokens, such that said target text is compressed 100%;
reconstituting a portion of the data file;
displaying the portion of the data file on a screen, wherein compressed positions point to a compressed text random access memory file, wherein the step of reconstituting the data file further comprising the steps of;
a) creating the compressed text RAM file;
b) selecting a domain to display, the domain being a portion of the data file, the domain having a starting point and an ending point;
c) decompressing successive integers;
d) determining positions of the tokens in the token list;
e) extracting the tokens from the token list; and
f) writing the tokens to the screen;
repeating steps c-f until the ending point of the domain is reached;
wherein the selected domain is part of a domains list, the domains list having a plurality of domains, the domains list having the starting point and the ending point of each domain.
1 Assignment
0 Petitions
Accused Products
Abstract
A system and method of compression indexing and efficient proximity search of text data permits high speed search featuring ranking the relevance of search results according to closeness of desired terms within each portion of text found. The system includes (a) preparing target text, (b) creating a “compression index ebook”, (c) browsing in a compression index ebook, and (d) searching in a compression index ebook. To create the compression index, the method includes the steps of selecting target text, identifying tokens, such as words and punctuation strings, wherein each of the tokens has a frequency. The frequencies of each token are counted. Tokens are ranked from highest frequency to lowest frequency. The frequencies are compressed. The next step is assigning positions to each token frequency and compressing the positions to form a compression index ebook, which is stored in random access memory to eliminate disk seeks during browsing and searching.
119 Citations
22 Claims
-
1. A computer implemented method of compression indexing, comprising the steps of:
-
selecting at least one data file, said data file having target text;
identifying each and every unique token, each of the unique tokens having a frequency;
counting the frequency of each unique token;
calculating parameters;
ranking the tokens from highest frequency to lowest frequency;
compressing the frequencies;
assigning a position to each instance of each and every unique token;
compressing the positions;
aggregating tokens, frequencies, parameters, and positions to form a compression index, said compression index being exhaustive of all said tokens, such that said target text is compressed 100%;reconstituting a portion of the data file;
displaying the portion of the data file on a screen, wherein compressed positions point to a compressed text random access memory file, wherein the step of reconstituting the data file further comprising the steps of;
a) creating the compressed text RAM file;
b) selecting a domain to display, the domain being a portion of the data file, the domain having a starting point and an ending point;
c) decompressing successive integers;
d) determining positions of the tokens in the token list;
e) extracting the tokens from the token list; and
f) writing the tokens to the screen;
repeating steps c-f until the ending point of the domain is reached;
wherein the selected domain is part of a domains list, the domains list having a plurality of domains, the domains list having the starting point and the ending point of each domain. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
-
-
12. A computer implemented method for using a compression index, comprising the steps of:
- a) creating the compression index having the steps of;
(i) providing target text, the target text being at least one data file, the target text having tokens, the tokens having frequencies;
(ii) accumulating parameters;
(iii) building a list of all unique tokens represented in the target text, together with their respective frequencies;
(iv) sorting the list in order of declining token frequencies;
(v) accumulating positions data of each instance of each token; and
(vi) combining steps i-v into the compression index, the compression index being exhaustive of all tokens, such that the target text is 100% compressed;
b) browsing and searching the compression index, wherein the step of accumulating parameters comprises;
accumulating parameters and token frequencies on a single pass through the at least one data file, the data file having lightly marked up text, wherein the single pass through of the at least one data file of lightly marked up text comprises input for the compression index, wherein the step of building a list of all tokens represented in the target text, together with their respective frequencies further comprises the steps of;
sorting all tokens in order of declining frequency;
creating a temporary parameters file to facilitate passing parameters between successive states of the method to create the compression index;
parsing the target text as a means for creating the token list, each of the tokens having a flag byte preceding the token and a null byte following the token; and
compressing and outputting the frequencies of tokens;
wherein the step of accumulating positions data further comprises the steps of;
reserving a block of random access memory to accumulate positions data;
reparsing the entire data file;
recording the position of all tokens in the random access memory block; and
compressing and outputting the positions data, wherein the position of the first instance of each token is absolute, and the position of each subsequent instance of the same token is relative to the preceding position;
wherein the step of combining steps (i)-(v) into one compression index, further comprises the steps of;
outputting public parameters as plain text at a beginning of the compression index;
outputting compressed private parameters; and
appending the tokens, frequencies and positions to complete the compression index;
wherein numeric values assigned to successive tokens from the data file are compressed and successively appended to create a compressed text random access memory file, wherein the random access memory file is computationally equivalent to the target text such that the positions within this compressed text random access memory file are used as the position values of successive tokens. - View Dependent Claims (13, 14, 15, 16, 17, 18, 19, 20)
- a) creating the compression index having the steps of;
-
21. A computer readable medium containing instructions for controlling a computer system to perform a method, the method comprising the steps of:
-
selecting at least one file having target text;
identifying each and every unique token, each of the unique tokens having a frequency, the tokens including every word, every number, every string of punctuation characters, and every markup tag without exception;
counting the frequency of each unique token;
calculating parameters;
ranking the tokens from highest frequency to lowest frequency;
compressing the frequencies;
assigning a position to each instance of each token;
compressing the positions;
aggregating tokens, frequencies, parameters, and positions to form a compression index, such that the target text is compressed 100%; and
browsing and searching the compression index;reconstituting a portion of the at least one file;
displaying the portion of the at least one file on a screen;
wherein compressed positions point to a compressed text random access memory file, the step of reconstituting the at least one file further comprising the steps of;
a) creating the compressed text RAM file;
b) selecting a domain to display, the domain being a portion of the at least one file, the domain having a starting point and an ending point;
c) decompressing successive integers;
d) determining positions of the tokens in the token list;
e) extracting the tokens from the token list; and
f) writing the tokens to the screen;
repeating steps c-f until the ending point of the domain is reached;
wherein the selected domain is part of a domains list, the domains list having a plurality of domains, the domains list having the starting point and the ending point of each domain.
-
-
22. An apparatus, comprising:
- means for selecting at least one file, the at least one file having target text;
means for identifying each and every unique token, each of the unique tokens having a frequency, the tokens including every word, every number, every string of punctuation characters, and every markup tag without exception;means for counting the frequency of each unique token;
means for calculating parameters;
means for ranking the tokens from highest frequency to lowest frequency;
means for compressing the frequencies;
means for assigning a position to each instance of each token;means for compressing the positions; and
means for aggregating tokens, frequencies, parameters, and positions to form a compression index, such that the target text is compressed 100%; and
means for browsing and searching text derived from the compression index;
wherein the compression index is formed by accumulating parameters and token frequencies on a single pass through the at least one data file, the at least one data file having lightly marked up text, wherein the single pass through of the at least one data file of lightly marked up text comprises input for the compression index;wherein the tokens are all sorted in order of declining frequency;
wherein the compression index further comprises a temporary parameters file to facilitate passing parameters between successive states of the method to create the compression index;wherein the target text is parsed as a means for creating the token list, each of the tokens having a flag byte preceding the token and a null byte following the token; and
wherein the frequencies of tokens are compressed and outputted;
wherein a block of random access memory is reserved to accumulate positions data;
wherein the entire data file is reparsed;
wherein the position of all tokens are recorded in the random access memory block; and
wherein the compressed and outputted positions data includes the position of the first instance of each token being absolute, and the position of each subsequent instance of the same token being relative to the preceding position;
wherein public parameters are outputted as plain text at a beginning of the compression index;
compressed private parameters are outputted; and
the tokens, frequencies and positions are appended to complete the compression index;
wherein numeric values assigned to successive tokens from the data file are compressed and successively appended to create a compressed text random access memory file, wherein the random access memory file is computationally equivalent to the target text such that the positions within this compressed text random access memory file are used as the position values of successive tokens.
- means for selecting at least one file, the at least one file having target text;
Specification