System and method for sampling based elimination of duplicate data

US 20070255758A1
Filed: 04/28/2006
Published: 11/01/2007
Est. Priority Date: 04/28/2006
Status: Active Grant

Abstract

A technique for eliminating duplicate data is provided. Upon receipt of a new data set, one or more anchor points are identified within the data set. A bit-by-bit data comparison is then performed of the region surrounding the anchor point in the received data set with the region surrounding an anchor point stored within a pattern database to identify forward/backward delta values. The duplicate data identified by the anchor point, forward and backward delta values is then replaced in the received data set with a storage indicator.
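
The comparison step described in the abstract can be illustrated with a minimal sketch. It assumes that the anchor has already been located at a known offset in both the received data set and the stored data set, and it compares at byte granularity rather than bit granularity for brevity. The function name `extend_match` and its parameters are hypothetical and do not come from the patent.

```python
from typing import Tuple


def extend_match(new: bytes, new_pos: int,
                 stored: bytes, stored_pos: int,
                 anchor_len: int) -> Tuple[int, int]:
    """Return (backward_delta, forward_delta) around a shared anchor.

    backward_delta: number of identical bytes immediately before the anchor.
    forward_delta:  number of identical bytes immediately after the anchor.
    Byte granularity is an assumption made for this sketch.
    """
    # Walk backward from the anchor start while both data sets agree.
    back = 0
    while (new_pos - back > 0 and stored_pos - back > 0
           and new[new_pos - back - 1] == stored[stored_pos - back - 1]):
        back += 1

    # Walk forward from the anchor end while both data sets agree.
    fwd = 0
    new_end = new_pos + anchor_len
    stored_end = stored_pos + anchor_len
    while (new_end + fwd < len(new) and stored_end + fwd < len(stored)
           and new[new_end + fwd] == stored[stored_end + fwd]):
        fwd += 1

    return back, fwd


stored = b"xxHELLO-ANCHOR-WORLDyy"
received = b"abHELLO-ANCHOR-WORLDcd"
# The anchor b"ANCHOR" starts at offset 8 in both data sets.
back, fwd = extend_match(received, 8, stored, 8, len(b"ANCHOR"))
```

In this example the duplicate region spans the anchor plus `back` bytes before it and `fwd` bytes after it, which is the region the abstract says is replaced with a storage indicator.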

313 Citations

28 Claims

1. A method for removing duplicate data from a data set, the method comprising the steps of:
- identifying an anchor within the data set;
  
  determining whether the identified anchor exists within an anchor database;
  
  in response to determining that the anchor exists within the anchor database, performing a data comparison between the data set and a stored data set to identify a forward delta value and a backward delta value relative to the identified anchor; and
  
  replacing a region of the data set identified by the anchor, the forward delta value and the backward delta value with a storage indicator to form a modified data set.
  - 2. The method of claim 1 wherein the step of identifying the anchor comprises the step of performing a rolling hash on the data set.
  - 3. The method of claim 1 wherein the step of identifying the anchor comprises placing the anchor at a predetermined location within the data set.
  - 4. The method of claim 1 wherein the stored data set is stored in a pattern database.
  - 5. The method of claim 1 wherein the storage indicator comprises an anchor identifier, the forward delta value and the backward delta value.
  - 6. The method of claim 1 further comprising the step of storing the modified data set in a data object store.
  - 7. The method of claim 6 wherein the data object store comprises a file system.
  - 8. The method of claim 1 further comprising the step of finding duplicate data between two similar data sets.
  - 9. The method of claim 8 wherein the similar data sets comprise data sets with intermixed common data and unique data.
  - 10. The method of claim 8 wherein duplicate data is found by comparing anchor points of the new data set with the anchor points of similar data.
  - 11. The method of claim 10 wherein the anchor points of similar data are identified based on a pattern of anchor points within the new data set.
  - 12. The method of claim 1 further comprising the step of forming an anchor hierarchy by computing a hash on a plurality of adjacent anchors within the data set.
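
Claims 2 and 12 can be illustrated with a rough sketch. The claims do not specify a hash function, window size, or anchor condition, so everything below is an assumption: a polynomial rolling hash over an 8-byte window, an anchor wherever the low four bits of the hash are zero, and a hierarchy formed by hashing pairs of adjacent anchor hashes with SHA-1. The names `find_anchors` and `anchor_hierarchy` are hypothetical.

```python
import hashlib
from typing import List, Tuple

WINDOW = 8            # assumed window size in bytes
BASE = 257            # assumed polynomial base
MOD = (1 << 31) - 1   # assumed modulus
MASK = 0x0F           # assumed anchor condition: low 4 bits of the hash are zero


def find_anchors(data: bytes) -> List[Tuple[int, int]]:
    """Return (offset, hash) for every window whose rolling hash satisfies the mask."""
    anchors: List[Tuple[int, int]] = []
    if len(data) < WINDOW:
        return anchors
    high = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    for b in data[:WINDOW]:
        h = (h * BASE + b) % MOD
    for start in range(len(data) - WINDOW + 1):
        if h & MASK == 0:
            anchors.append((start, h))
        if start + WINDOW < len(data):
            # Slide the window: drop data[start], append data[start + WINDOW].
            h = ((h - data[start] * high) * BASE + data[start + WINDOW]) % MOD
    return anchors


def anchor_hierarchy(anchors: List[Tuple[int, int]],
                     group: int = 2) -> List[Tuple[int, str]]:
    """Hash each run of `group` adjacent anchor hashes into one higher-level anchor."""
    level: List[Tuple[int, str]] = []
    for i in range(0, len(anchors) - group + 1, group):
        joined = b"".join(h.to_bytes(4, "big") for _, h in anchors[i:i + group])
        level.append((anchors[i][0], hashlib.sha1(joined).hexdigest()))
    return level
```

Because the anchor condition depends only on the bytes inside the window, inserting data before a region shifts its anchors but does not change their hashes, which is what allows a later lookup in an anchor database to succeed.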

13. A system configured to remove duplicate data from a data set, the system comprising:
- means for identifying an anchor within the data set;
  
  means for determining whether the identified anchor exists within an anchor database;
  
  in response to determining that the anchor exists within the anchor database, means for performing a data comparison between the data set and a stored data set to identify a forward delta value and a backward delta value relative to the identified anchor; and
  
  means for replacing a region of the data set identified by the anchor, the forward delta value and the backward delta value with a storage indicator.
  - 14. The system of claim 13 wherein the means for identifying the anchor comprises means for performing a rolling hash on the data set.
  - 15. The system of claim 13 wherein the means for identifying the anchor comprises means for placing the anchor at a predetermined location within the data set.
  - 16. The system of claim 13 wherein the stored data set is stored in a pattern database.
  - 17. The system of claim 13 wherein the storage indicator comprises an anchor identifier, the forward delta value and the backward delta value.
  - 18. The system of claim 13 further comprising means for forming an anchor hierarchy by computing a hash on a plurality of adjacent anchors within the data set.

19. A system configured to remove duplicate data from a data set, the system comprising:
- a storage system configured to serve the data set; and
  
  a virtual tape library system adapted to receive the data set from the storage system, the virtual tape library system adapted to identify an anchor within the data set and further adapted to determine whether the identified anchor exists within an anchor database.
  - 20. The system of claim 19 wherein the virtual tape library system is further adapted to, in response to determining that the anchor exists within the anchor database, perform a data comparison between the data set and a stored data set to identify a forward delta value and a backward delta value.
  - 21. The system of claim 20 further comprising a pattern database adapted to store the stored data set.
  - 22. The system of claim 20 wherein the virtual tape library system is further adapted to replace a region of the data set identified by the anchor, the forward delta value and the backward delta value with a storage indicator to form a modified data set.
  - 23. The system of claim 22 wherein the storage indicator comprises an anchor identifier, the forward delta value and the backward delta value.
  - 24. The system of claim 19 wherein the anchor is identified by performing a rolling hash on the data set.
  - 25. The system of claim 19 wherein the anchor is identified by placing the anchor at a predetermined location within the data set.
  - 26. The system of claim 19 wherein the data set comprises a backup data stream.
  - 27. The system of claim 19 wherein the virtual tape library system is further adapted to form an anchor hierarchy by computing a hash on a plurality of adjacent anchors within the data set.

28. A computer readable medium for removing duplicate data from a data set, the computer readable medium including program instructions for performing the steps of:
- identifying an anchor within the data set;
  
  determining whether the identified anchor exists within an anchor database;
  
  in response to determining that the anchor exists within the anchor database, performing a data comparison between the data set and a stored data set to identify a forward delta value and a backward delta value relative to the identified anchor; and
  
  replacing a region of the data set identified by the anchor, the forward delta value and the backward delta value with a storage indicator to form a modified data set.
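
The replacing step recited in the independent claims, together with the storage indicator contents recited in claims 5, 17 and 23, might look roughly like the following. This is a sketch under stated assumptions: the modified data set is modeled as a list of literal byte runs and indicators, and the `restore` helper and the shape of `anchor_db` are hypothetical additions used only to show that the replaced region remains recoverable.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class StorageIndicator:
    anchor_id: int        # identifies the anchor in the anchor database
    backward_delta: int   # matching bytes before the anchor
    forward_delta: int    # matching bytes after the anchor


Segment = Union[bytes, StorageIndicator]


def replace_region(data: bytes, anchor_pos: int, anchor_len: int,
                   indicator: StorageIndicator) -> List[Segment]:
    """Replace the duplicate region around the anchor with the indicator."""
    start = anchor_pos - indicator.backward_delta
    end = anchor_pos + anchor_len + indicator.forward_delta
    modified: List[Segment] = []
    if start > 0:
        modified.append(data[:start])   # unique data before the region
    modified.append(indicator)
    if end < len(data):
        modified.append(data[end:])     # unique data after the region
    return modified


def restore(segments: List[Segment],
            anchor_db: Dict[int, Tuple[bytes, int, int]]) -> bytes:
    """Rebuild the original bytes.

    anchor_db maps anchor_id -> (stored data set, anchor offset, anchor length);
    this layout is assumed for the sketch.
    """
    parts: List[bytes] = []
    for seg in segments:
        if isinstance(seg, StorageIndicator):
            stored, pos, length = anchor_db[seg.anchor_id]
            parts.append(stored[pos - seg.backward_delta:
                                pos + length + seg.forward_delta])
        else:
            parts.append(seg)
    return b"".join(parts)


stored = b"xxHELLO-ANCHOR-WORLDyy"
received = b"abHELLO-ANCHOR-WORLDcd"
indicator = StorageIndicator(anchor_id=1, backward_delta=6, forward_delta=6)
modified = replace_region(received, 8, 6, indicator)
```

Here `modified` holds only the four unique bytes of `received` plus one indicator, and `restore(modified, {1: (stored, 8, 6)})` reproduces `received`.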

Current Assignee
NetApp, Inc.
Original Assignee
NetApp, Inc.
Inventors
Zheng, Ling; Trimmer, Don; Stager, Roger; Johnston, Craig; Frandzel, Yuval

Granted Patent

US 8,165,221 B2
US Class Current

1/1
CPC Class Codes

H03M 7/00   Conversion of a code where ...

H04N 19/20   using video object coding

H04N 19/23   with coding of regions that...

H04N 19/25   with scene description codi...
