System and Method for Data Driven De-Duplication
Abstract
Described are computer-based methods and apparatuses, including computer program products, for removing redundant data from a storage system. In one example, a data delineation process delineates data targeted for de-duplication into regions using a plurality of markers. The de-duplication system determines which of these regions should be subject to further de-duplication processing by comparing metadata representing the regions to metadata representing regions of a reference data set. The de-duplication system identifies an area of data that incorporates the regions that should be subject to further de-duplication processing and de-duplicates this area with reference to a corresponding area within the reference data set.
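The flow described in the abstract (delineate data into regions with markers, compare region metadata against a reference data set, then pick the area that covers the matching regions) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the specification: the names `find_markers`, `delineate`, `region_metadata`, and `candidate_area`, the constants `WINDOW` and `MASK`, the use of a CRC-32 over a sliding window as the marker test, and SHA-256 digests as region metadata.

```python
import hashlib
import zlib

WINDOW = 8    # hypothetical: bytes examined when testing for a marker
MASK = 0x1F   # hypothetical: a marker is declared where the low 5 checksum bits are zero


def find_markers(data: bytes) -> list:
    """Return offsets whose preceding WINDOW bytes satisfy the marker condition."""
    return [end for end in range(WINDOW, len(data) + 1)
            if zlib.crc32(data[end - WINDOW:end]) & MASK == 0]


def delineate(data: bytes) -> list:
    """Split data into consecutive (start, end) regions bounded by markers."""
    cuts = sorted({0, len(data), *find_markers(data)})
    return list(zip(cuts, cuts[1:]))


def region_metadata(data: bytes) -> dict:
    """Map each region's digest (its metadata) to the region's (start, end) span."""
    return {hashlib.sha256(data[s:e]).digest(): (s, e) for s, e in delineate(data)}


def candidate_area(target: bytes, reference: bytes):
    """Return the smallest target span covering every region whose metadata
    also appears in the reference metadata, or None if no region matches."""
    ref_meta = region_metadata(reference)
    hits = [(s, e) for s, e in delineate(target)
            if hashlib.sha256(target[s:e]).digest() in ref_meta]
    if not hits:
        return None
    return hits[0][0], hits[-1][1]
```

Because the marker test depends only on the bytes inside the window, inserting data near the start of the target shifts later markers by the insertion length rather than changing them, so regions after the insertion keep the same metadata. A real system would likely replace the per-offset CRC with a rolling checksum; the sketch recomputes it for clarity.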
20 Claims
1. A computer implemented method of locating redundancy within data, the method comprising:
recording target locations within target data where a summary that identifies a particular pattern within the target data equals a predetermined value;
recording reference locations within reference data where a summary that identifies the particular pattern within the reference data equals the predetermined value;
determining a reference set of summaries of the reference data, each member of the reference set of summaries including a plurality of summaries indicative of patterns of reference data located at recorded reference locations;
determining a target set of summaries of the target data, each member of the target set of summaries including a plurality of summaries indicative of patterns of target data located at recorded target locations; and
identifying a subset of the reference data that is likely to match a subset of the target data by comparing members of the reference set of summaries to members of the target set of summaries.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system configured to locate redundancy within data, the system comprising:
data storage storing reference data and target data; and
a processor coupled to the data storage and configured to:
record target locations within the target data where a summary that identifies a particular pattern within the target data equals a predetermined value;
record reference locations within the reference data where a summary that identifies the particular pattern within the reference data equals the predetermined value;
determine a reference set of summaries of the reference data, each member of the reference set of summaries including a plurality of summaries indicative of patterns of reference data located at recorded reference locations;
determine a target set of summaries of the target data, each member of the target set of summaries including a plurality of summaries indicative of patterns of target data located at recorded target locations; and
identify a subset of the reference data that is likely to match a subset of the target data by comparing members of the reference set of summaries to members of the target set of summaries.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A non-transitory computer readable medium storing computer readable instructions that, when executed by at least one processor, instruct the at least one processor to perform a method of locating redundancy within data, the method comprising:
recording target locations within target data where a summary that identifies a particular pattern within the target data equals a predetermined value;
recording reference locations within reference data where a summary that identifies the particular pattern within the reference data equals the predetermined value;
determining a reference set of summaries of the reference data, each member of the reference set of summaries including a plurality of summaries indicative of patterns of reference data located at recorded reference locations;
determining a target set of summaries of the target data, each member of the target set of summaries including a plurality of summaries indicative of patterns of target data located at recorded target locations; and
identifying a subset of the reference data that is likely to match a subset of the target data by comparing members of the reference set of summaries to members of the target set of summaries.
Dependent claims: 16, 17, 18, 19, 20
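The steps shared by the three independent claims can be sketched as follows. This is one possible reading for illustration only, not the patented implementation. Assumed here and not stated in the claims: the "summary that identifies a particular pattern" is a rolling byte sum taken modulo `DIVISOR`; each set member is a tuple of `GROUP` consecutive per-location summaries; and per-location summaries are truncated SHA-1 digests of the `PATTERN_LEN` bytes following each recorded location. All function names and constants are hypothetical.

```python
import hashlib

WINDOW = 8          # hypothetical: bytes covered by the location-selecting summary
DIVISOR = 32        # hypothetical: controls how often a location is recorded
PREDETERMINED = 0   # the value the summary must equal for a location to be recorded
PATTERN_LEN = 16    # hypothetical: bytes summarized at each recorded location
GROUP = 3           # hypothetical: number of summaries in each set member


def record_locations(data: bytes) -> list:
    """Record offsets where the rolling sum of the last WINDOW bytes,
    taken modulo DIVISOR, equals PREDETERMINED."""
    locations = []
    total = sum(data[:WINDOW])
    for end in range(WINDOW, len(data) + 1):
        if total % DIVISOR == PREDETERMINED:
            locations.append(end)
        if end < len(data):
            # Slide the window one byte to the right.
            total += data[end] - data[end - WINDOW]
    return locations


def pattern_summary(data: bytes, location: int) -> str:
    """Summarize the data found at a recorded location."""
    return hashlib.sha1(data[location:location + PATTERN_LEN]).hexdigest()[:12]


def summary_set(data: bytes) -> dict:
    """Build the set of summaries: each member is a tuple of GROUP consecutive
    per-location summaries, mapped to the offset where the member starts."""
    locations = record_locations(data)
    summaries = [pattern_summary(data, loc) for loc in locations]
    members = {}
    for i in range(len(locations) - GROUP + 1):
        members.setdefault(tuple(summaries[i:i + GROUP]), locations[i])
    return members


def likely_matches(reference: bytes, target: bytes) -> list:
    """Compare set members and return (target_offset, reference_offset) pairs
    marking subsets of the two data sets that are likely to match."""
    ref = summary_set(reference)
    tgt = summary_set(target)
    return sorted((t_off, ref[m]) for m, t_off in tgt.items() if m in ref)
```

Grouping several summaries into one member makes an accidental match less likely than comparing single summaries, at the cost of missing matches shorter than `GROUP` recorded locations. The returned pairs are only candidates; a byte-level comparison around each pair would still be needed to confirm and extend a match.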
Specification