System and method for sampling based elimination of duplicate data
Abstract
A technique for eliminating duplicate data is provided. Upon receipt of a new data set, one or more anchor points are identified within the data set. A bit-by-bit data comparison is then performed of the region surrounding the anchor point in the received data set with the region surrounding an anchor point stored within a pattern database to identify forward/backward delta values. The duplicate data identified by the anchor point, forward and backward delta values is then replaced in the received data set with a storage indicator.
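As a rough illustration of the flow the abstract describes, the following Python sketch finds anchors, looks each one up in an anchor database, extends the match backward and forward from the anchor, and swaps the matching region for a storage indicator. It is not the patented implementation. The anchor rule (`MARKER`, `ANCHOR_LEN`), the `StorageIndicator` layout, and all function names are hypothetical, and the comparison runs byte by byte rather than bit by bit to keep the example short.

```python
from dataclasses import dataclass

MARKER = b"##"    # hypothetical rule: an anchor is any window starting with this
ANCHOR_LEN = 6    # hypothetical anchor width, in bytes


@dataclass(frozen=True)
class StorageIndicator:
    """Hypothetical placeholder that points at a region of the stored data set."""
    stored_offset: int  # where the duplicate region starts in the stored data
    length: int         # how many bytes of the new data set it replaces


def find_anchors(data):
    """Yield (offset, anchor_bytes) for every anchor window found in data."""
    pos = data.find(MARKER)
    while pos != -1 and pos + ANCHOR_LEN <= len(data):
        yield pos, data[pos:pos + ANCHOR_LEN]
        pos = data.find(MARKER, pos + 1)


def match_extent(data, pos, stored, spos, floor):
    """Return (backward, forward) delta values, in bytes, around an anchor.

    `floor` is the lowest offset in `data` the backward scan may reach.
    """
    back = 0
    while (pos - back > floor and spos - back > 0
           and data[pos - back - 1] == stored[spos - back - 1]):
        back += 1
    fwd = 0
    i, j = pos + ANCHOR_LEN, spos + ANCHOR_LEN
    while (i + fwd < len(data) and j + fwd < len(stored)
           and data[i + fwd] == stored[j + fwd]):
        fwd += 1
    return back, fwd


def deduplicate(data, stored, anchor_db):
    """Return a list of literal byte runs and StorageIndicator objects.

    `anchor_db` maps anchor bytes to their offset in `stored`.
    """
    out = []
    cursor = 0  # first byte of data not yet emitted
    for pos, anchor in find_anchors(data):
        if pos < cursor or anchor not in anchor_db:
            continue  # already covered by an earlier match, or an unknown anchor
        spos = anchor_db[anchor]
        back, fwd = match_extent(data, pos, stored, spos, cursor)
        start, end = pos - back, pos + ANCHOR_LEN + fwd
        if start > cursor:
            out.append(data[cursor:start])  # unique data before the match
        out.append(StorageIndicator(spos - back, end - start))
        cursor = end
    if cursor < len(data):
        out.append(data[cursor:])  # unique data after the last match
    return out
```

In this sketch the anchor database would be populated from previously stored data, for example with `{anchor: offset for offset, anchor in find_anchors(stored)}`. An anchor that is not in the database is simply left in place as literal data.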
24 Claims
1. A method for removing duplicate data from a data set, the method comprising:
- identifying, by a processor, an anchor within the data set, wherein the anchor is a specific section within the data set that defines a region of interest for potential data de-duplication;
- determining, by the processor, whether the identified anchor already exists within an anchor database;
- in response to determining that the identified anchor already exists within the anchor database, performing, by the processor, a data comparison between the data set and a stored data set to identify a forward delta value and a backward delta value which collectively identify a number of consecutive bits of data that match between the data set and the stored data set forward and backward from the identified anchor, respectively; and
- replacing, by the processor, a specific region of the data set identified by the anchor, the forward delta value and the backward delta value as duplicate data with a storage indicator to form a modified data set.
12. A system having a processor and configured to remove duplicate data from a data set, the system comprising:
- means for identifying an anchor within the data set, wherein the anchor is a specific section within the data set that defines a region of interest for potential data de-duplication;
- means for determining whether the identified anchor already exists within an anchor database;
- in response to determining that the identified anchor already exists within the anchor database, means for performing a data comparison between the data set and a stored data set to identify a forward delta value and a backward delta value relative to the identified anchor which collectively identify a number of consecutive bits of data that match between the data set and the stored data set forward and backward from the identified anchor, respectively; and
- means for replacing a specific region of the data set identified by the anchor, the forward delta value and the backward delta value as duplicate data with a storage indicator to form a modified data set.
18. A system configured to remove duplicate data from a data set, the system comprising:
- a storage system having a processor and configured to serve the data set; and
- a virtual tape library system configured to receive the data set from the storage system, the virtual tape library system configured to
  - identify an anchor within the data set, wherein the anchor is a specific section within the data set that defines a region of interest for potential data de-duplication,
  - determine whether the identified anchor already exists within an anchor database,
  - perform a data comparison between the data set and a stored data set to identify a forward delta value and a backward delta value which collectively identify a number of consecutive bits of data that match between the data set and the stored data set forward and backward from the identified anchor, respectively, in response to a determination that the identified anchor already exists within the anchor database, and
  - replace a specific region of the data set identified by the anchor, the forward delta value and the backward delta value as duplicate data with a storage indicator to form a modified data set.
24. A non-transitory computer readable medium containing program instructions executed by a processor, comprising:
- program instructions that identify an anchor within a data set, wherein the anchor is a specific section within the data set that defines a region of interest for potential data de-duplication;
- program instructions that determine whether the identified anchor already exists within an anchor database;
- program instructions that perform a data comparison between the data set and a stored data set to identify a forward delta value and a backward delta value relative to the identified anchor which collectively identify a number of consecutive bits of data that match between the data set and the stored data set forward and backward from the identified anchor, respectively, in response to a determination that the identified anchor already exists within the anchor database; and
- program instructions that replace a specific region of the data set identified by the anchor, the forward delta value and the backward delta value as duplicate data with a storage indicator to form a modified data set.
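The claims measure the forward and backward delta values in consecutive matching bits. One way to count at that granularity is sketched below in Python. The function names are hypothetical, and the sketch assumes that the comparison starts on a byte boundary and that bits within a byte are ordered most significant first. Neither assumption comes from the claims.

```python
def forward_delta_bits(a, a_start, b, b_start):
    """Count consecutive matching bits moving forward from two byte offsets."""
    bits = 0
    i, j = a_start, b_start
    while i < len(a) and j < len(b):
        diff = a[i] ^ b[j]  # set bits mark positions where the bytes differ
        if diff:
            # Bits above the highest set bit of diff still match (MSB first).
            bits += 8 - diff.bit_length()
            break
        bits += 8
        i += 1
        j += 1
    return bits


def backward_delta_bits(a, a_end, b, b_end):
    """Count consecutive matching bits moving backward from just before two offsets."""
    bits = 0
    i, j = a_end - 1, b_end - 1
    while i >= 0 and j >= 0:
        diff = a[i] ^ b[j]
        if diff:
            # Moving backward, the first bits met are the low-order ones,
            # so count the zero bits below the lowest set bit of diff.
            bits += (diff & -diff).bit_length() - 1
            break
        bits += 8
        i -= 1
        j -= 1
    return bits
```

Here `a_start` and `b_start` would be the offsets just past the anchor in the new and stored data sets, and `a_end` and `b_end` the offsets where the anchor begins. XOR keeps the per-byte work constant, so the scan only inspects individual bits in the single byte where the two data sets first diverge.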
Specification