Method and system for compression of a set of mostly similar strings allowing fast retrieval
First Claim
1. A computer implemented method for selecting a string for serving as a reference string for a comparison scheme for compressing a set of strings, the method comprising the following steps:
- (a) calculating preliminary compression results for every string relative to an initial reference string, and (b) using the preliminary compression results to find a better reference string without additional compression tests.
1 Assignment
0 Petitions
Accused Products
Abstract
A computer implemented method and system for selecting a string for serving as a reference string for a comparison scheme for compressing a set of strings calculates preliminary compression results for every string relative to an initial reference string, and uses the preliminary compression results to find a better reference string without additional compression tests. According to one embodiment, a histogram is calculated showing the number of occurrences of each compressed length for each string in the set plotted against the initial reference string and the better reference string has a length corresponding to an average compression length or center of gravity of the histogram.
-
Citations
21 Claims
-
1. A computer implemented method for selecting a string for serving as a reference string for a comparison scheme for compressing a set of strings, the method comprising the following steps:
-
(a) calculating preliminary compression results for every string relative to an initial reference string, and (b) using the preliminary compression results to find a better reference string without additional compression tests. - View Dependent Claims (2)
i) arbitrarily selecting said initial reference string.
-
-
3. A computer implemented method for selecting a string for serving as a reference string for a comparison scheme for compressing a set of strings, the method comprising the following steps:
-
(a) compressing the set of strings against an initial reference string so as to produce a set of compressed strings, (b) determining a histogram of the costs of all strings in the set of compressed strings showing for each different length of string in the set of compressed strings a frequency of occurrence in the set, and an identity of at least one string whose compression length equals said different length, and (c) using said histogram to determine a better reference string. - View Dependent Claims (4, 5, 6, 7, 8)
i) arbitrarily selecting said initial reference string.
-
-
5. The method according to claim 3, wherein step (c) comprises:
-
i) determining an average column in the histogram indicating the respective length of the string in the compressed set of strings having the greatest frequency of occurrence, and ii) selecting one string in said column if said column is not empty or a string in a non-empty column proximate thereto as the reference string.
-
-
6. The method according to claim 3, wherein step (c) comprises:
-
i) selecting a portion of the histogram that is most heavily populated, ii) selecting a string from the center of the most heavily populated portion of the histogram if not empty or a non-empty column proximate thereto, iii) using the selected string as a primary reference deriving an estimated histogram as it would appear had the primary reference been used as the compression reference string, iv) using the estimated histogram in order to determine at least one secondary reference string, v) compressing each string in the set of strings using both the primary reference string and each secondary reference string so as to generate a respective first and at least one second compressed string, and vi) for each of the first and second compressed strings, selecting whichever is shorter or either one if they are of identical length.
-
-
7. The method according to claim 3, wherein step (c) comprises:
-
i) selecting a portion of the histogram that is most heavily populated, ii) selecting a string from the center of the most heavily populated portion of the histogram if not empty or a non-empty column proximate thereto, iii) using the selected string as a primary reference deriving an estimated histogram as it would appear had the primary reference been used as the compression reference string, iv) using the estimated histogram in order to determine at least one secondary reference string, and v) selecting an estimated optimal primary or secondary reference string against which to compress the strings.
-
-
8. The method according to claim 6, wherein step (i) includes:
(1) calculating the center of gravity only for strings that are estimated to still have a compression length above a given threshold.
-
9. A computer implemented method for selecting a string for serving as a reference string for a comparison scheme for compressing a set of strings, the method comprising the following steps:
-
(a) compressing the set of strings against an initial reference string so as to produce a set of compressed strings, (b) determining a histogram of the costs of all strings in the set of compressed strings showing for each different length of string in the set of compressed strings a frequency of occurrence in the set, and an identity of at least one string whose compression length equals said different length, (c) determining an average column in the histogram indicating the respective length of the string in the compressed set of strings having the greatest frequency of occurrence, (d) selecting one string in said column as a first selected string if said column is not empty or a string in a non-empty column proximate thereto as the reference string, (e) estimating a cost for compressing relative to the first selected reference string, (f) selecting a portion of the histogram that is most heavily populated, (g) selecting a string from the center of the most heavily populated portion of the histogram if not empty or a non-empty column proximate thereto, (h) using the selected string as a primary reference deriving an estimated histogram as it would appear had its primary reference been used as the compression reference string, (i) using the estimated histogram in order to determine a secondary reference string, (j) estimating a cost for compressing each string in the set of strings using both the primary reference string and the secondary reference string so as to generate a respective first and second compressed string, (k) for each of the primary and secondary reference strings, selecting as a respective per string preferred reference string whichever is shorter or either one if they are of identical length, (l) estimating a cost for compressing each string using the respective per string preferred reference string, and (m) selecting the first selected string against which to compress all the strings or selecting for each string the respective per string preferred reference string against which to compress the respective string depending on which has the lower estimated cost. - View Dependent Claims (10)
i) arbitrarily selecting said initial reference string.
-
-
11. A system for selecting a string for serving as a reference string for a comparison scheme for compressing a set of strings, the system comprising:
-
a memory for storing said set of strings, a first compression engine coupled to the memory for compressing all of the strings in said set relative to an initial reference string, an analysis unit coupled to the first compression engine for calculating preliminary compression results for every string relative to the initial reference string, and a second compression engine coupled to the analysis unit for using the preliminary compression results to find a better reference string without additional compression tests. - View Dependent Claims (12, 13, 14, 15, 16, 17)
i) determine an average column in the histogram indicating the respective length of the string in the compressed set of strings having the greatest frequency of occurrence, and ii) select one string in said column if said column is not empty or a string in a non-empty column proximate thereto as the reference string.
-
-
14. The system according to claim 12, wherein the analysis unit is adapted to:
-
i) select a portion of the histogram that is most heavily populated, ii) select a string from the center of the most heavily populated portion of the histogram if not empty or a non-empty column proximate thereto, iii) use the selected string as a primary reference deriving an estimated histogram as it would appear had the primary reference been used as the compression reference string, iv) use the estimated histogram in order to determine at least one secondary reference string, v) compress each string in the set of strings using both the primary reference string and each secondary reference string so as to generate a respective first and at least one second compressed string, and vi) for each of the first and second compressed strings, select whichever is shorter or either one if they are of identical length.
-
-
15. The system according to claim 12, wherein the analysis unit is adapted to:
-
i) select a portion of the histogram that is most heavily populated, ii) select a string from the center of the most heavily populated portion of the histogram if not empty or a non-empty column proximate thereto, iii) use the selected string as a primary reference deriving an estimated histogram as it would appear had the primary reference been used as the compression reference string, iv) use the estimated histogram in order to determine at least one secondary reference string, and v) select an estimated optimal primary or secondary reference string against which to compress the strings.
-
-
16. The system according to claim 14, wherein the analysis unit is adapted to calculate the center of gravity only for strings that are estimated to still have a compression length above a given threshold.
-
17. The system according to claim 12, wherein the analysis unit is adapted to:
-
(a) compress the set of strings against an initial reference string so as to produce a set of compressed strings, (b) determine a histogram of the costs of all strings in the set of compressed strings showing for each different length of string in the set of compressed strings a frequency of occurrence in the set, and an identity of at least one string whose compression length equals said different length, (c) determine an average column in the histogram indicating the respective length of the string in the compressed set of strings having the greatest frequency of occurrence, (d) select one string in said column as a first selected string if said column is not empty or a string in a non-empty column proximate thereto as the reference string, (e) estimate a cost for compressing relative to the first selected reference string, (f) select a portion of the histogram that is most heavily populated, (g) select a string from the center of the most heavily populated portion of the histogram if not empty or a non-empty column proximate thereto, (h) use the selected string as a primary reference deriving an estimated histogram as it would appear had its primary reference been used as the compression reference string, (i) use the estimated histogram in order to determine at least one secondary reference string, (j) estimate a cost for compressing each string in the set of strings using both the primary reference string and the at least one secondary reference string so as to generate a respective first and at least one second compressed string, (k) for each of the primary and secondary reference strings, selecting as a respective per string preferred reference string whichever is shorter or either one if they are of identical length, (l) estimating a cost for compressing each string using the respective per string preferred reference string, and (m) selecting the first selected string against which to compress all the strings or selecting for each string the respective per string preferred reference string against which to compress the respective string depending on which has the lower estimated cost.
-
-
18. A computer implemented program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for selecting a string for serving as a reference string for a comparison scheme for compressing a set of strings, the method comprising the following steps:
-
(a) calculating preliminary compression results for every string relative to an initial reference string, and (b) using the preliminary compression results to find a better reference string without additional compression tests.
-
-
19. A computer implemented computer program product comprising a computer useable medium having computer readable program code embodied therein for selecting a string for serving as a reference string for a comparison scheme for compressing a set of strings, the computer program product comprising:
-
computer readable program code for receiving an initial reference string and causing the computer to calculate preliminary compression results for every string relative to the initial reference string, and computer readable program code for causing the computer to use the preliminary compression results to find a better reference string without additional compression tests.
-
-
20. A computer implemented program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for selecting a string for serving as a reference string for a comparison scheme for compressing a set of strings, the method comprising the following steps:
-
(a) compressing the set of strings against an initial reference string so as to produce a set of compressed strings, (b) determining a histogram of the costs of all strings in the set of compressed strings showing for each different length of string in the set of compressed strings a frequency of occurrence in the set, and an identity of at least one string whose compression length equals said different length, (c) determining an average column in the histogram indicating the respective length of the string in the compressed set of strings having the greatest frequency of occurrence, (d) selecting one string in said column as a first selected string if said column is not empty or a string in a non-empty column proximate thereto as the reference string, (e) estimating a cost for compressing the first selected reference string, (f) selecting a portion of the histogram that is most heavily populated, (g) selecting a string from the center of the most heavily populated portion of the histogram if not empty or a non-empty column proximate thereto, (h) using the selected string as a primary reference deriving an estimated histogram as it would appear had its primary reference been used as the compression reference string, (i) using the estimated histogram in order to determine a secondary reference string, (j) estimating a cost for compressing each string in the set of strings using both the primary reference string and the secondary reference string so as to generate a respective first and second compressed string, (k) for each of the primary and secondary reference strings, selecting as a respective per string preferred reference string whichever is shorter or either one if they are of identical length, (l) estimating a cost for compressing each string using the respective per string preferred reference string, and (m) selecting the first selected string against which to compress all the strings or selecting for each string the respective per string preferred reference string against which to compress the respective string depending on which has the lower estimated cost.
-
-
21. A computer implemented computer program product comprising a computer useable medium having computer readable program code embodied therein for selecting a string for serving as a reference string for a comparison scheme for compressing a set of strings, the computer program product comprising:
-
computer readable program code for causing the computer to compress the set of strings against an initial reference string so as to produce a set of compressed strings, computer readable program code for causing the computer to determine a histogram of the costs of all strings in the set of compressed strings showing for each different length of string in the set of compressed strings a frequency of occurrence in the set, and an identity of at least one string whose compression length equals said different length, computer readable program code for causing the computer to determine an average column in the histogram indicating the respective length of the string in the compressed set of strings having the greatest frequency of occurrence, computer readable program code for causing the computer to select one string in said column as a first selected string if said column is not empty or a string in a non-empty column proximate thereto as the reference string, computer readable program code for causing the computer to estimate a cost for compressing the first selected reference string, computer readable program code for causing the computer to select a portion of the histogram that is most heavily populated, computer readable program code for causing the computer to select a string from the center of the most heavily populated portion of the histogram if not empty or a non-empty column proximate thereto, computer readable program code for causing the computer to us the selected string as a primary reference deriving an estimated histogram as it would appear had its primary reference been used as the compression reference string, computer readable program code for causing the computer to use the estimated histogram in order to determine a secondary reference string, computer readable program code for causing the computer to estimate a cost for compressing each string in the set of strings using both the primary reference string and the secondary reference string so as to generate a respective first and second compressed string, computer readable program code for causing the computer to select as a respective per string preferred reference string for each of the primary and secondary reference strings, whichever is shorter or either one if they are of identical length, computer readable program code for causing the computer to estimate a cost for compressing each string using the respective per string preferred reference string, and computer readable program code for causing the computer to select the first selected string against which to compress all the strings or to select for each string the respective per string preferred reference string against which to compress the respective string depending on which has the lower estimated cost.
-
Specification