Content-based segmentation scheme for data compression in storage and transmission including hierarchical segment representation
First Claim
1. An apparatus configured to process input data, wherein the input data is segmented into a plurality of segments, the apparatus comprising:
- logic configured to determine one or more segments in the plurality of segments that are to be referenced segments;
for each of the one or more referenced segments, logic configured to replace data for the referenced segments with a reference label;
for each referenced segment not already present in a persistent segment store, logic configured to store a reference binding in the persistent segment store, wherein a reference binding associates a referenced segment'"'"'s data and its reference label;
logic configured to determine whether any sequence of segments is to be grouped as a reference group;
for each reference group, logic configured to replace a reference label or data for segments in the sequence with a group label; and
for each reference group not already present in the persistent segment store, logic configured to store a group reference binding in the persistent segment store, wherein a group reference binding associates a reference group'"'"'s reference labels or data with its group label.
22 Assignments
0 Petitions
Accused Products
Abstract
In a coding system, input data within a system is encoded. The input data might include sequences of symbols that repeat in the input data or occur in other input data encoded in the system. The encoding includes determining a target segment size, determining a window size, identifying a fingerprint within a window of symbols at an offset in the input data, determining whether the offset is to be designated as a cut point and segmenting the input data as indicated by the set of cut points. For each segment so identified, the encoder determines whether the segment is to be a referenced segment or an unreferenced segment, replacing the segment data of each referenced segment with a reference label and storing a reference binding in a persistent segment store for each referenced segment, if needed. Hierarchically, the process can be repeated by grouping references into groups, replacing the grouped references with a group label, storing a binding between the grouped references and group label, if one is not already present, and repeating the process. The number of levels of hierarchy can be fixed in advanced or it can be determined from the content encoded.
-
Citations
17 Claims
-
1. An apparatus configured to process input data, wherein the input data is segmented into a plurality of segments, the apparatus comprising:
-
logic configured to determine one or more segments in the plurality of segments that are to be referenced segments; for each of the one or more referenced segments, logic configured to replace data for the referenced segments with a reference label; for each referenced segment not already present in a persistent segment store, logic configured to store a reference binding in the persistent segment store, wherein a reference binding associates a referenced segment'"'"'s data and its reference label; logic configured to determine whether any sequence of segments is to be grouped as a reference group; for each reference group, logic configured to replace a reference label or data for segments in the sequence with a group label; and for each reference group not already present in the persistent segment store, logic configured to store a group reference binding in the persistent segment store, wherein a group reference binding associates a reference group'"'"'s reference labels or data with its group label. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
-
-
10. An apparatus configured to process input data, the apparatus comprising:
-
logic configured to determine a first set of segments for the input data based on a first fingerprint function; logic configured to replace segments in the first set of segments for the input data with reference labels if it is determined they are referenced segments; for each referenced segment not already present in a persistent segment store, logic configured to store a reference binding in the persistent segment store, wherein a reference binding associates a referenced segment'"'"'s data and its reference label; logic configured to determine a second set of segments for the segmented input data based on a second fingerprint function; logic configured to replace segments in the second set of segments for the input data with group labels if it is determined they are referenced groups; and for each reference group not already present in the persistent segment store, logic configured to store a group reference binding in the persistent segment store, wherein a group reference binding associates a reference group'"'"'s references with its group label. - View Dependent Claims (11, 12, 13)
-
-
14. A method for processing input data, the method comprising:
-
(a) identifying sequences of segments in the input data, wherein sequences of segments are determined as groups to be referenced; (b) for each group, replacing the group with a label in the input data; and (c) for each group not already present in a persistent segment store, storing a reference binding in the persistent segment store for the group; and (d) recursively performing steps (a–
c) with the replaced groups in the input data a number of times equal to a number of desired levels of reference bindings, wherein at least two levels of reference bindings are stored in the persistent segment store. - View Dependent Claims (15, 16, 17)
-
Specification