System and method for sampling based elimination of duplicate data

US 8,165,221 B2
Filed: 04/28/2006
Issued: 04/24/2012
Est. Priority Date: 04/28/2006
Status: Active Grant

Abstract

A technique for eliminating duplicate data is provided. Upon receipt of a new data set, one or more anchor points are identified within the data set. A bit-by-bit data comparison is then performed of the region surrounding the anchor point in the received data set with the region surrounding an anchor point stored within a pattern database to identify forward/backward delta values. The duplicate data identified by the anchor point, forward and backward delta values is then replaced in the received data set with a storage indicator.
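The comparison step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it compares bytes rather than bits, and the function name `match_deltas` and its parameters are hypothetical. It assumes the anchor has already been located at `new_pos` in the received data and at `stored_pos` in the stored data.

```python
from typing import Tuple


def match_deltas(new: bytes, new_pos: int,
                 stored: bytes, stored_pos: int,
                 anchor_len: int) -> Tuple[int, int]:
    """Return (forward, backward): counts of consecutive matching bytes
    after the end of the anchor and before its start, respectively."""
    # Forward: walk right from the first byte after the anchor.
    i, j = new_pos + anchor_len, stored_pos + anchor_len
    forward = 0
    while (i + forward < len(new) and j + forward < len(stored)
           and new[i + forward] == stored[j + forward]):
        forward += 1

    # Backward: walk left from the byte just before the anchor.
    backward = 0
    while (new_pos - backward - 1 >= 0 and stored_pos - backward - 1 >= 0
           and new[new_pos - backward - 1] == stored[stored_pos - backward - 1]):
        backward += 1

    return forward, backward


stored = b"xxHELLO-ANCHOR-WORLDyy"
new = b"abHELLO-ANCHOR-WORLDcd"
print(match_deltas(new, 8, stored, 8, len(b"ANCHOR")))  # (6, 6)
```

In this example the duplicate region spans the anchor plus six bytes on each side; the two leading and two trailing bytes differ and would remain in the received data set.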

24 Claims

1. A method for removing duplicate data from a data set, the method comprising:
- identifying, by a processor, an anchor within the data set, wherein the anchor is a specific section within the data set that defines a region of interest for potential data de-duplication;
  
  determining, by the processor, whether the identified anchor already exists within an anchor database;
  
  in response to determining that the identified anchor already exists within the anchor database, performing, by the processor, a data comparison between the data set and a stored data set to identify a forward delta value and a backward delta value which collectively identify a number of consecutive bits of data that match between the data set and the stored data set forward and backward from the identified anchor, respectively; and
  
  replacing, by the processor, a specific region of the data set identified by the anchor, the forward delta value and the backward delta value as duplicate data with a storage indicator to form a modified data set.
- Dependent claims (2-11):
  - 2. The method of claim 1 wherein identifying the anchor comprises performing a rolling hash on the data set.
  - 3. The method of claim 1 wherein identifying the anchor comprises placing the anchor at a predetermined location within the data set.
  - 4. The method of claim 1 wherein the stored data set is stored in a pattern database.
  - 5. The method of claim 1 further comprising storing the modified data set in a data object store.
  - 6. The method of claim 5 wherein the data object store comprises a file system.
  - 7. The method of claim 1 further comprising finding duplicate data between two similar data sets.
  - 8. The method of claim 7 wherein the similar data sets comprise data sets with intermixed common data and unique data.
  - 9. The method of claim 7 wherein the duplicate data is found by comparing anchor points of the data set with the anchor points of similar data.
  - 10. The method of claim 9 wherein the anchor points of similar data are identified based on a pattern of anchor points within the data set.
  - 11. The method of claim 1 further comprising forming an anchor hierarchy by computing a hash on a plurality of adjacent anchors within the data set.
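
Claim 2 names a rolling hash as one way to identify anchors. The sketch below shows one common way such a scheme could work, under stated assumptions: a polynomial rolling hash over a fixed-size window, with an anchor declared wherever the low bits of the hash are all zero. The window size, mask, base, modulus, and the function name `find_anchors` are illustrative choices and are not taken from the patent.

```python
from typing import List


def find_anchors(data: bytes, window: int = 16, mask: int = 0x3F,
                 base: int = 257, mod: int = (1 << 61) - 1) -> List[int]:
    """Return start offsets of windows whose rolling hash has zero low bits."""
    if len(data) < window:
        return []
    top = pow(base, window - 1, mod)  # weight of the byte leaving the window

    # Hash of the first window, computed directly.
    h = 0
    for b in data[:window]:
        h = (h * base + b) % mod

    anchors = []
    if h & mask == 0:
        anchors.append(0)

    # Slide the window one byte at a time, updating the hash in O(1).
    for start in range(1, len(data) - window + 1):
        outgoing = data[start - 1]
        incoming = data[start + window - 1]
        h = ((h - outgoing * top) * base + incoming) % mod
        if h & mask == 0:
            anchors.append(start)
    return anchors
```

Because each hash depends only on the bytes inside its window, the same content tends to produce the same anchors even when it is shifted by inserted or deleted data elsewhere in the stream. Claim 3's alternative, placing anchors at predetermined locations, needs no hashing but does not have this shift tolerance.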

12. A system having a processor and configured to remove duplicate data from a data set, the system comprising:
- means for identifying an anchor within the data set, wherein the anchor is a specific section within the data set that defines a region of interest for potential data de-duplication;
  
  means for determining whether the identified anchor already exists within an anchor database;
  
  in response to determining that the identified anchor already exists within the anchor database, means for performing a data comparison between the data set and a stored data set to identify a forward delta value and a backward delta value relative to the identified anchor which collectively identify a number of consecutive bits of data that match between the data set and the stored data set forward and backward from the identified anchor, respectively; and
  
  means for replacing a specific region of the data set identified by the anchor, the forward delta value and the backward delta value as duplicate data with a storage indicator to form a modified data set.
- Dependent claims (13-17):
  - 13. The system of claim 12 wherein the means for identifying the anchor comprises means for performing a rolling hash on the data set.
  - 14. The system of claim 12 wherein the means for identifying the anchor comprises means for placing the anchor at a predetermined location within the data set.
  - 15. The system of claim 12 wherein the stored data set is stored in a pattern database.
  - 16. The system of claim 12 wherein the storage indicator comprises an anchor identifier, the forward delta value and the backward delta value.
  - 17. The system of claim 12 further comprising means for forming an anchor hierarchy by computing a hash on a plurality of adjacent anchors within the data set.
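
Claim 16 describes the storage indicator as comprising an anchor identifier, the forward delta value, and the backward delta value. As a hedged illustration of how those three fields could be enough to remove and later rebuild a duplicate region, the sketch below uses hypothetical names (`StorageIndicator`, `replace_region`, `restore`) and assumes byte-granular deltas and a lookup table from anchor identifier to the stored data and anchor offset. None of these details come from the patent text.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class StorageIndicator:
    anchor_id: int   # key into the anchor database (assumed to be an int)
    forward: int     # matching bytes after the anchor
    backward: int    # matching bytes before the anchor


Part = Union[bytes, StorageIndicator]


def replace_region(data: bytes, anchor_pos: int, anchor_len: int,
                   ind: StorageIndicator) -> List[Part]:
    """Drop the duplicate region and put the indicator in its place."""
    start = anchor_pos - ind.backward
    end = anchor_pos + anchor_len + ind.forward
    return [data[:start], ind, data[end:]]


def restore(modified: List[Part],
            stored_sets: Dict[int, Tuple[bytes, int]],
            anchor_len: int) -> bytes:
    """Rebuild the original bytes from the modified data set.

    stored_sets maps anchor_id -> (stored bytes, anchor offset in them)."""
    out = bytearray()
    for part in modified:
        if isinstance(part, StorageIndicator):
            stored, pos = stored_sets[part.anchor_id]
            out += stored[pos - part.backward: pos + anchor_len + part.forward]
        else:
            out += part
    return bytes(out)
```

For example, with stored data `b"xxHELLO-ANCHOR-WORLDyy"` and received data `b"abHELLO-ANCHOR-WORLDcd"`, an indicator with forward and backward deltas of 6 leaves only `b"ab"` and `b"cd"` as literal bytes in the modified data set.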

18. A system configured to remove duplicate data from a data set, the system comprising:
- a storage system having a processor and configured to serve the data set; and
  
  a virtual tape library system configured to receive the data set from the storage system, the virtual tape library system configured to identify an anchor within the data set, wherein the anchor is a specific section within the data set that defines a region of interest for potential data de-duplication, determine whether the identified anchor already exists within an anchor database, perform a data comparison between the data set and a stored data set to identify a forward delta value and a backward delta value which collectively identify a number of consecutive bits of data that match between the data set and the stored data set forward and backward from the identified anchor, respectively, in response to a determination that the identified anchor already exists within the anchor database, and replace a specific region of the data set identified by the anchor, the forward delta value and the backward delta value as duplicate data with a storage indicator to form a modified data set.
- Dependent claims (19-23):
  - 19. The system of claim 18 further comprising a pattern database configured to store the stored data set.
  - 20. The system of claim 18 wherein the anchor is identified by performing a rolling hash on the data set.
  - 21. The system of claim 18 wherein the anchor is identified by placing the anchor at a predetermined location within the data set.
  - 22. The system of claim 18 wherein the data set comprises a backup data stream.
  - 23. The system of claim 18 wherein the virtual tape library system is further configured to form an anchor hierarchy by computing a hash on a plurality of adjacent anchors within the data set.
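
Claims 11, 17, and 23 refer to forming an anchor hierarchy by computing a hash on a plurality of adjacent anchors. The claims do not specify the hash or the grouping, so the following is only one possible reading: each higher level hashes fixed-size groups of adjacent entries from the level below. SHA-256, the group size of 4, and the function name `anchor_hierarchy` are assumptions made for illustration.

```python
import hashlib
from typing import List, Sequence


def anchor_hierarchy(anchor_hashes: Sequence[bytes],
                     group: int = 4) -> List[List[bytes]]:
    """Build levels of hashes; level 0 is the input anchor hashes and each
    higher level hashes `group` adjacent entries of the level below."""
    if group < 2:
        raise ValueError("group must be at least 2")
    levels = [list(anchor_hashes)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        nxt = [hashlib.sha256(b"".join(prev[i:i + group])).digest()
               for i in range(0, len(prev), group)]
        levels.append(nxt)
    return levels
```

Under this reading, a match on a higher-level hash would indicate that a whole run of adjacent anchors matches, which could let a lookup skip several lower-level comparisons at once.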

24. A non-transitory computer readable medium containing program instructions executed by a processor, comprising:
- program instructions that identify an anchor within a data set, wherein the anchor is a specific section within the data set that defines a region of interest for potential data de-duplication;
  
  program instructions that determine whether the identified anchor already exists within an anchor database;
  
  program instructions that perform a data comparison between the data set and a stored data set to identify a forward delta value and a backward delta value relative to the identified anchor which collectively identify a number of consecutive bits of data that match between the data set and the stored data set forward and backward from the identified anchor, respectively, in response to a determination that the identified anchor already exists within the anchor database; and
  
  program instructions that replace a specific region of the data set identified by the anchor, the forward delta value and the backward delta value as duplicate data with a storage indicator to form a modified data set.

Current Assignee: NetApp, Inc.
Original Assignee: NetApp, Inc.
Inventors: Zheng, Ling; Stager, Roger; Johnston, Craig; Trimmer, Don; Frandzel, Yuval
Primary Examiner(s): Rao, Andy
Application Number: US11/414,600
Publication Number: US 20070255758A1
Time in Patent Office: 2,188 Days
Field of Search: 375/240.01-240.29, 341/60-90, 711/111, 711/202
US Class Current: 375/240.26
CPC Class Codes:
H03M 7/00   Conversion of a code where ...
H04N 19/20   using video object coding
H04N 19/23   with coding of regions that...
H04N 19/25   with scene description codi...