Content-based segmentation scheme for data compression in storage and transmission including hierarchical segment representation
First Claim
1. A method of encoding data within a system, the method comprising:
- determining a set of cut points in input data, the input data including a sequence of symbols, wherein a cut point is determined using a fingerprint representation of a number of sequential symbols in the sequence of symbols;
segmenting the input data as indicated by the set of cut points;
for each segment, determining whether the segment is to be a referenced segment;
for each referenced segment, replacing the segment data of the referenced segment with a reference label;
for each referenced segment not already present in a persistent segment store, storing a reference binding in the persistent segment store, wherein a reference binding associates a referenced segment'"'"'s data and its reference label;
determining whether any sequence of segments is to be grouped as a reference group;
for each reference group, replacing the references in the group with a group label; and
for each reference group not already present in the persistent segment store, storing a group reference binding in the persistent segment store, wherein a group reference binding associates a reference group'"'"'s references with its group label.
21 Assignments
0 Petitions
Accused Products
Abstract
In a coding system, input data within a system is encoded. The input data might include sequences of symbols that repeat in the input data or occur in other input data encoded in the system. The encoding includes determining a target segment size, determining a window size, identifying a fingerprint within a window of symbols at an offset in the input data, determining whether the offset is to be designated as a cut point and segmenting the input data as indicated by the set of cut points. For each segment so identified, the encoder determines whether the segment is to be a referenced segment or an unreferenced segment, replacing the segment data of each referenced segment with a reference label and storing a reference binding in a persistent segment store for each referenced segment, if needed. Hierarchically, the process can be repeated by grouping references into groups, replacing the grouped references with a group label, storing a binding between the grouped references and group label, if one is not already present, and repeating the process. The number of levels of hierarchy can be fixed in advanced or it can be determined from the content encoded.
-
Citations
36 Claims
-
1. A method of encoding data within a system, the method comprising:
-
determining a set of cut points in input data, the input data including a sequence of symbols, wherein a cut point is determined using a fingerprint representation of a number of sequential symbols in the sequence of symbols;
segmenting the input data as indicated by the set of cut points;
for each segment, determining whether the segment is to be a referenced segment;
for each referenced segment, replacing the segment data of the referenced segment with a reference label;
for each referenced segment not already present in a persistent segment store, storing a reference binding in the persistent segment store, wherein a reference binding associates a referenced segment'"'"'s data and its reference label;
determining whether any sequence of segments is to be grouped as a reference group;
for each reference group, replacing the references in the group with a group label; and
for each reference group not already present in the persistent segment store, storing a group reference binding in the persistent segment store, wherein a group reference binding associates a reference group'"'"'s references with its group label. - View Dependent Claims (2, 3, 4, 5, 6)
recursively identifying groups of labels into higher level groups, wherein groups of labels are one or more of groups of reference labels and groups of group labels;
for each higher level group, replacing the higher level group with a group label; and
for each higher level group not already present in the persistent segment store, storing a group reference binding in the persistent segment store for the higher level group.
-
-
3. The method of claim 1, wherein the input data comprises payloads of messages between clients and servers in a client-server network.
-
4. The method of claim 1, wherein the input data comprises portions of files in an on-line backup system, further comprising representing files in the on-line backup system as sequences of at least one of reference labels and group labels, and storing contents of the persistent segment store as part of the on-line backup system.
-
5. The method of claim 1, wherein the input data comprises portions of files in a file system, further comprising representing files in the file system as sequences of at least one of reference labels and group labels.
-
6. The method of claim 1, wherein the input data comprises portions of files to be used in a file system, the method further comprising:
-
when storing a file to the file system, encoding it with at least one segment of the file being represented as a segment referenced in the persistent segment store; and
when retrieving a file from the file system, caching the file in a local file store as a decoded file, wherein each reference label and each group label is replaced with corresponding segment data from the persistent segment store.
-
-
7. A method for encoding data in a system, the method comprising:
-
determining a set of cut points for input data based on a fingerprint function, the fingerprint function indicating a cut point based on a number of symbols input into the fingerprint function;
segmenting the input data based on the set of cut points;
for each segment, determining whether the segment is to be a referenced segment;
for each referenced segment, replacing segments in the segmented input data with reference labels;
for each referenced segment not already present in a persistent segment store, storing a reference binding in the persistent segment store, wherein a reference binding associates a referenced segment'"'"'s data and its reference label;
determining whether a group of reference labels should be grouped as a reference group;
for each reference group determined, replacing the references in the group with a group label; and
for each reference group not already present in the persistent segment store, storing a group reference binding in the persistent segment store, wherein a group reference binding associates a reference group'"'"'s references with its group label. - View Dependent Claims (8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
determining a fingerprint window comprising a sequence of input symbols, wherein the fingerprint window is associated with an offset;
inputting the sequence of input symbols into the fingerprint function, the fingerprint function outputting a fingerprint value; and
determining from the fingerprint value if a cut point should be determined at the offset.
-
-
10. The method of claim 9, wherein determining the set of cut points comprises:
-
if it is not determined from the fingerprint value that a cut point should be determined at a new offset, advancing the fingerprint window to comprise a new sequence of input symbols, wherein the fingerprint window is associated with the offset;
inputting the new sequence of input symbols into the fingerprint function, the fingerprint function outputting a new fingerprint value; and
determining from the new fingerprint value if a cut point should be determined at the new offset.
-
-
11. The method of claim 10, further comprising repeating the advancing, inputting, and determining steps until a cut point is determined.
-
12. The method of claim 7, wherein determining whether the group of references should be grouped as the reference group comprises:
-
inputting the group of references into the fingerprint function, the fingerprint function outputting a fingerprint value; and
determining from the fingerprint value if the group of references should be grouped as a reference group.
-
-
13. The method of claim 12, further comprising:
-
if it is not determined from the fingerprint value that group of references should be grouped as the reference group, advancing the fingerprint window to comprise a new group of reference labels;
inputting the new group of reference labels into the fingerprint function, the fingerprint function outputting a new fingerprint value; and
determining from the new fingerprint value if the new group of reference labels should be a grouped as a reference group.
-
-
14. The method of claim 13, further comprising repeating the advancing, inputting, and determining steps until the reference group is determined.
-
15. The method of claim 7, wherein the reference group comprises at least one of a reference label and input data.
-
16. The method of claim 7, further comprising sending the segmented input data, the segmented input data including at least one of a reference label and a group label.
-
17. The method of claim 16, further comprising:
for each reference label in the segmented input data, retrieving from the persistent segment store the segment'"'"'s data that is associated with the reference label.
-
18. The method of claim 16, further comprising:
-
for each group label in the segmented input data, retrieving from the persistent segment store the reference labels that are associated with the group label; and
for each reference label retrieved, retrieving from the persistent segment store the segment'"'"'s data that is associated with the retrieved reference label.
-
-
19. An encoder for encoding data, the encoder comprising:
-
an input for receiving input data;
fingerprint logic configured to determine a fingerprint representation of a number of sequential symbols in a sequence of symbols;
a cutpoint determiner configured to determine a set of cut points in input data, 6 wherein a cut point is determined using the fingerprint representation of the number of sequential symbols in the sequence of symbols;
a segmenter configured to segment the input data as indicated by the set of cut points;
a replacer comprising;
for each segment, logic configured to determine whether the segment is to be a referenced segment;
for each referenced segment, logic configured to replace the segment data of the referenced segment with a reference label;
for each referenced segment not already present in a persistent segment store, logic configured to store a reference binding in the persistent segment store, wherein a reference binding associates a referenced segment'"'"'s data and its reference label;
logic configured to determine whether any sequence of segments is to be grouped as a reference group;
for each reference group, logic configured to replace the references in the group with a group label; and
for each reference group not already present in the persistent segment store, logic configured to store a group reference binding in the persistent segment store, wherein a group reference binding associates a reference group'"'"'s references with its group label. - View Dependent Claims (20, 21, 22, 23, 24)
logic configured to recursively identify groups of labels into higher level groups, wherein groups of labels are one or more of groups of reference labels and groups of group labels;
for each higher level group, logic configured to replace the higher level group with a group label; and
for each higher level group not already present in the persistent segment store, logic configured to store a group reference binding in the persistent segment store for the higher level group.
-
-
21. The encoder of claim 19, wherein the input data comprises payloads of messages between clients and servers in a client-server network.
-
22. The encoder of claim 19, wherein the input data comprises portions of files in an on-line backup system, further comprising logic configured to represent files in the on-line backup system as sequences of at least one of reference labels and group labels, and logic configured to store contents of the persistent segment store as part of the on-line backup system.
-
23. The encoder of claim 19, wherein the input data comprises portions of files in a file system, further comprising logic configured to represent files in the file system as sequences of at least one of reference labels and group labels and a segment store.
-
24. The encoder of claim 19, wherein the input data comprises portions of files to be used in a file system, the encoder further comprising:
-
logic configured to encode a file with at least one segment of the file being represented as a segment referenced in the persistent segment store when storing the file to the file system; and
logic configured to cache a file in a local file store as a decoded file, wherein each reference label and each group label is replaced with corresponding segment data from the persistent segment store when retrieving the file from the file system.
-
-
25. A coder for processing data, the coder comprising:
-
a cut point determiner configured to determine a set of cut points for input data based on a fingerprint function, the fingerprint function indicating a cut point based on a number of symbols input into the fingerprint function;
a segmenter configured to segment the input data based on the set of cut points;
a segment replacer comprising;
for each segment, logic configured to determine whether the segment is to be a referenced segment;
for each referenced segment, logic configured to replace segments in the segmented input data with reference labels; and
for each referenced segment not already present in a persistent segment store, logic configured to store a reference binding in the persistent segment store, wherein a reference binding associates a referenced segment'"'"'s data and its reference label;
a reference replacer comprising;
logic configured to determine whether a group of reference labels should be grouped as a reference group;
for each reference group determined, logic configured to replace the references in the group with a group label; and
for each reference group not already present in the persistent segment store, logic configured to store a group reference binding in the persistent segment store, wherein a group reference binding associates a reference group'"'"'s references with its group label. - View Dependent Claims (26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36)
determine a fingerprint window comprising a sequence of input symbols, wherein the fingerprint window is associated with an offset;
input the sequence of input symbols into the fingerprint function, the fingerprint function outputting a fingerprint value; and
determine from the fingerprint value if a cut point should be determined at the offset.
-
-
28. The coder of claim 27, wherein the cut point determiner is configured to:
-
if it is not determined from the fingerprint value that a cut point should be determined at a new offset, advance the fingerprint window to comprise a new sequence of input symbols, wherein the fingerprint window is associated with the offset;
input the new sequence of input symbols into the fingerprint function, the fingerprint function outputting a new fingerprint value; and
determine from the new fingerprint value if a cut point should be determined at the new offset.
-
-
29. The coder of claim 28, wherein the cutpoint determiner is configured to repeatedly advance, input, and determine until a cut point is determined.
-
30. The coder of claim 25, wherein the logic configured to determine whether the group of references should be grouped as the reference group comprises:
-
logic to input the group of references into the fingerprint function, the fingerprint function outputting a fingerprint value; and
logic to determine from the fingerprint value if the group of references should be a grouped as a reference group.
-
-
31. The coder of claim 30, wherein the logic configured to determine whether the group of references should be grouped as the reference group comprises:
-
if it is not determined from the fingerprint value that the group of references should be grouped as the reference group, logic configured to advance the fingerprint window to comprise a new group of reference labels;
logic configured to input the new group of reference labels into the fingerprint function, the fingerprint function outputting a new fingerprint value; and
logic configured to determine from the new fingerprint value if the new group of reference labels should be a grouped as a reference group.
-
-
32. The coder of claim 31, wherein the reference replacer is further configured to repeatedly advance, input, and determine until the reference group is determined.
-
33. The coder of claim 25, wherein the reference group comprises at least one of a reference label and input data.
-
34. The coder of claim 25, further comprising a communicator configured to send the segmented input data, the segmented input data including at least one of a reference label and a group label.
-
35. The coder of claim 34, further comprising:
for each reference label in the segmented input data, a decoder configured to retrieve from the persistent segment store the segment'"'"'s data that is associated with the reference label.
-
36. The coder of claim 35, wherein the decoder is configured to:
-
for each group label in the segmented input data, retrieve from the persistent segment store the reference labels that are associated with the group label; and
for each reference label retrieved, retrieve from the persistent segment store the segment'"'"'s data that is associated with the retrieved reference label.
-
Specification