System and method for data driven de-duplication
First Claim
1. A computer implemented method of locating redundancy within data, the method comprising:
recording target locations within target data where a summary that identifies a particular pattern within the target data equals a predetermined value;
recording reference locations within reference data where a summary that identifies the particular pattern within the reference data equals the predetermined value;
determining a reference set of summaries of the reference data, each member of the reference set of summaries including one or more summaries indicative of one or more patterns of reference data located at one or more recorded reference locations within the reference data;
determining a target set of summaries of the target data, each member of the target set of summaries including one or more summaries indicative of one or more patterns of target data located at one or more recorded target locations within the target data;
prioritizing at least one member from the reference set of summaries relative to at least one other member of the reference set of summaries by comparing at least one summary included in the at least one member from the reference set of summaries to at least one summary included in the at least one other member of the reference set of summaries;
prioritizing at least one member from the target set of summaries relative to at least one other member of the target set of summaries by comparing at least one summary included in the at least one member from the target set of summaries to at least one summary included in the at least one other member of the target set of summaries;
identifying a subset of the reference data that is likely to match a subset of the target data by comparing the at least one summary included in the at least one member from the reference set of summaries to the at least one summary included in the at least one member from the target set of summaries; and
de-duplicating the subset of the target data with reference to the subset of the reference data, wherein recording the target locations includes recording target locations within the target data where a subset of a rolling hash value calculated for a region of the target data equals the predetermined value.
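The final limitation of the claim (recording locations where a subset of a rolling hash value equals a predetermined value) can be illustrated with a minimal sketch. The claim does not name a hash function, window size, or bit mask, so the polynomial hash, the `window`, `mask`, and `target` parameters, and the function name `marker_locations` are all assumptions made for illustration only.

```python
from typing import List

# Hypothetical constants: the claim does not specify a hash function.
BASE = 257
MOD = (1 << 61) - 1


def marker_locations(data: bytes, window: int = 16,
                     mask: int = 0x3F, target: int = 0) -> List[int]:
    """Return end offsets of windows whose masked rolling hash equals target.

    The "subset of a rolling hash value" is modeled here as the low bits
    selected by `mask`; `target` plays the role of the predetermined value.
    """
    if len(data) < window:
        return []
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window

    # Hash of the first full window.
    h = 0
    for b in data[:window]:
        h = (h * BASE + b) % MOD

    locations = []
    if h & mask == target:
        locations.append(window)

    # Slide the window one byte at a time, updating the hash in O(1).
    for i in range(window, len(data)):
        h = (h - data[i - window] * top) % MOD  # drop the oldest byte
        h = (h * BASE + data[i]) % MOD          # take in the newest byte
        if h & mask == target:
            locations.append(i + 1)
    return locations
```

Because each hash depends only on the bytes inside its window, the same pattern yields a recorded location wherever it occurs, which is what allows target and reference locations to line up even when the surrounding data has shifted.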
Abstract
Described are computer-based methods and apparatuses, including computer program products, for removing redundant data from a storage system. In one example, a data delineation process delineates data targeted for de-duplication into regions using a plurality of markers. The de-duplication system determines which of these regions should be subject to further de-duplication processing by comparing metadata representing the regions to metadata representing regions of a reference data set. The de-duplication system identifies an area of data that incorporates the regions that should be subject to further de-duplication processing and de-duplicates this area with reference to a corresponding area within the reference data set.
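The flow described in the abstract (delineate regions with markers, compare per-region metadata against a reference data set, then identify a larger area that incorporates the matching regions) can be sketched as follows. This is a simplified illustration under stated assumptions: the marker offsets are supplied by the caller, the per-region metadata is modeled as a SHA-256 digest, and the names `regions`, `digest`, and `candidate_areas` are hypothetical rather than taken from the patent.

```python
import hashlib
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) byte offsets, end exclusive


def regions(data: bytes, cuts: List[int]) -> List[Span]:
    """Split data into regions at the given marker offsets."""
    inner = [c for c in sorted(set(cuts)) if 0 < c < len(data)]
    bounds = [0] + inner + [len(data)]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if a < b]


def digest(data: bytes, span: Span) -> bytes:
    """Metadata for one region (assumed here to be a SHA-256 digest)."""
    return hashlib.sha256(data[span[0]:span[1]]).digest()


def candidate_areas(target: bytes, target_cuts: List[int],
                    reference: bytes, reference_cuts: List[int]) -> List[Span]:
    """Return target areas built from regions whose metadata matches the reference."""
    ref_index = {digest(reference, s) for s in regions(reference, reference_cuts)}
    areas: List[Span] = []
    for start, end in regions(target, target_cuts):
        if digest(target, (start, end)) not in ref_index:
            continue
        if areas and areas[-1][1] == start:
            # Adjacent matching regions are merged into one larger area.
            areas[-1] = (areas[-1][0], end)
        else:
            areas.append((start, end))
    return areas
```

For example, with a reference of `b"AAAABBBBCCCC"` and a target of `b"XXXXBBBBCCCC"`, both cut at offsets 4 and 8, the two matching regions are merged and the function returns the single area `(4, 12)`.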
15 Claims
1. A computer implemented method of locating redundancy within data (recited in full under First Claim).
Dependent claims: 2, 3, 4, 5
6. A system configured to locate redundancy within data, the system comprising:
data storage storing reference data and target data; and
a processor coupled to the data storage and configured to:
record target locations within the target data where a summary that identifies a particular pattern within the target data equals a predetermined value;
record reference locations within the reference data where a summary that identifies the particular pattern within the reference data equals the predetermined value;
determine a reference set of summaries of the reference data, each member of the reference set of summaries including one or more summaries indicative of one or more patterns of reference data located at one or more recorded reference locations within the reference data;
determine a target set of summaries of the target data, each member of the target set of summaries including one or more summaries indicative of one or more patterns of target data located at one or more recorded target locations within the target data;
prioritize at least one member from the reference set of summaries relative to at least one other member of the reference set of summaries by comparing at least one summary included in the at least one member from the reference set of summaries to at least one summary included in the at least one other member of the reference set of summaries;
prioritize at least one member from the target set of summaries relative to at least one other member of the target set of summaries by comparing at least one summary included in the at least one member from the target set of summaries to at least one summary included in the at least one other member of the target set of summaries;
identify a subset of the reference data that is likely to match a subset of the target data by comparing the at least one summary included in the at least one member from the reference set of summaries to the at least one summary included in the at least one member from the target set of summaries; and
de-duplicate the subset of the target data with reference to the subset of the reference data, wherein the processor is configured to record the target locations by recording target locations within the target data where a subset of a rolling hash value calculated for a region of the target data equals the predetermined value.
Dependent claims: 7, 8, 9, 10
11. A non-transitory computer readable medium storing computer readable instructions that, when executed by at least one processor, instruct the at least one processor to perform a method of locating redundancy within data, the method comprising:
recording target locations within target data where a summary that identifies a particular pattern within the target data equals a predetermined value;
recording reference locations within reference data where a summary that identifies the particular pattern within the reference data equals the predetermined value;
determining a reference set of summaries of the reference data, each member of the reference set of summaries including one or more summaries indicative of one or more patterns of reference data located at one or more recorded reference locations within the reference data;
determining a target set of summaries of the target data, each member of the target set of summaries including one or more summaries indicative of one or more patterns of target data located at one or more recorded target locations within the target data;
prioritizing at least one member from the reference set of summaries relative to at least one other member of the reference set of summaries by comparing at least one summary included in the at least one member from the reference set of summaries to at least one summary included in the at least one other member of the reference set of summaries;
prioritizing at least one member from the target set of summaries relative to at least one other member of the target set of summaries by comparing at least one summary included in the at least one member from the target set of summaries to at least one summary included in the at least one other member of the target set of summaries;
identifying a subset of the reference data that is likely to match a subset of the target data by comparing the at least one summary included in the at least one member from the reference set of summaries to the at least one summary included in the at least one member from the target set of summaries; and
de-duplicating the subset of the target data with reference to the subset of the reference data, wherein the instructions for recording the target locations instruct the processor to perform acts including recording target locations within the target data where a subset of a rolling hash value calculated for a region of the target data equals the predetermined value.
Dependent claims: 12, 13, 14, 15
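The prioritizing and identifying steps shared by the independent claims can be pictured with a short sketch. The claims only require that members be prioritized by comparing their summaries; ranking numerically smaller summaries first (a min-wise style selection) is one possible reading chosen here purely for illustration. The `Member` tuple layout and the names `prioritize` and `likely_matches` are likewise assumptions.

```python
from typing import Dict, List, Tuple

Member = Tuple[int, int]  # (recorded location, summary value)


def prioritize(members: List[Member], keep: int) -> List[Member]:
    """Rank members by comparing their summaries and keep the top `keep`.

    Assumption: a smaller summary value ranks higher; ties are broken
    by location so the ordering is deterministic.
    """
    return sorted(members, key=lambda m: (m[1], m[0]))[:keep]


def likely_matches(reference: List[Member], target: List[Member],
                   keep: int) -> List[Tuple[int, int]]:
    """Pair reference and target locations whose prioritized summaries agree.

    Returns (reference location, target location) pairs that mark subsets
    of the data likely to match and therefore worth de-duplicating.
    """
    ref_by_summary: Dict[int, int] = {}
    for loc, summary in prioritize(reference, keep):
        ref_by_summary.setdefault(summary, loc)
    return [(ref_by_summary[summary], loc)
            for loc, summary in prioritize(target, keep)
            if summary in ref_by_summary]
```

Because both sides apply the same ranking rule independently, data that is shared between reference and target tends to surface the same high-priority summaries on each side, so only a small number of summaries needs to be compared to find candidate matches.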
Specification