System and method for eliminating duplicate data by generating data fingerprints using adaptive fixed-length windows

US 8,180,740 B1
Filed: 08/12/2009
Issued: 05/15/2012
Est. Priority Date: 08/12/2009
Status: Active Grant

1 Assignment

0 Petitions

Abstract

A method and system for generating data fingerprints is used to de-duplicate a data set having a high level of redundancy. A fingerprint generator generates a data fingerprint based on a data window. Each byte of the data set is added to the fingerprint generator and used to detect an anchor within the received data. If no anchor is detected, the system continues receiving bytes until a predefined window size is reached. When the window size is reached, the system records a data fingerprint based on the data window and resets the window size. If an anchor is detected, the system extends the window size such that the window ends a specified length after the location of the anchor. If the extended window is greater than a maximum size, the system ignores the anchor. The generated fingerprints are compared to a fingerprint database. The data set is then de-duplicated by replacing matching data segments with references to corresponding stored data segments.
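
The windowing loop described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `adaptive_windows`, the caller-supplied `is_anchor` predicate, the choice of SHA-256 as the fingerprint, and the rule that an anchor never shortens the current window are all assumptions made for the example.

```python
import hashlib
from typing import Callable, Iterator, Tuple


def adaptive_windows(
    data: bytes,
    window_size: int,
    tail: int,
    max_size: int,
    is_anchor: Callable[[bytes, int], bool],
) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, fingerprint) for each data window, end exclusive.

    window_size: default (fixed) window length, in bytes
    tail:        bytes between an anchor and the end of an extended window
    max_size:    longest window allowed; anchors that would exceed it are ignored
    is_anchor:   hypothetical predicate, True if data[i] completes an anchor
    """
    start = 0
    limit = window_size  # target length of the window currently being filled
    for pos in range(1, len(data) + 1):  # byte data[pos - 1] has just arrived
        length = pos - start
        if is_anchor(data, pos - 1):
            extended = length + tail
            # Assumption: extend only, never shrink; ignore if over max_size.
            if limit < extended <= max_size:
                limit = extended
        if length == limit:
            # Window full: record its fingerprint and reset the window size.
            yield start, pos, hashlib.sha256(data[start:pos]).hexdigest()
            start = pos
            limit = window_size
    if start < len(data):
        # Trailing partial window at the end of the data set.
        yield start, len(data), hashlib.sha256(data[start:]).hexdigest()
```

With a toy predicate that treats the byte `|` as an anchor, inserting one byte ahead of the anchor moves the first boundary by one position, and the windows that follow cover the same content as before, so their fingerprints still match the earlier ones.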

85 Citations

28 Claims

1. A method for removing duplicate data from a sequence of bytes at a storage server, the method comprising:
- generating a first data fingerprint based on a first data interval in the sequence of bytes, the first data interval having a first length;
  
  detecting an anchor in the sequence of bytes at a point after the first interval;
  
  defining a second data interval in the sequence of bytes extending from a first position in the sequence to a second position located a specified interval after the location of the anchor, the second data interval having a second length greater than the first length;
  
  generating a second data fingerprint based on the second window;
  
  finding a first stored data fingerprint in a data fingerprint database corresponding to the first data fingerprint;
  
  finding a second stored data fingerprint in the fingerprint database corresponding to the second data fingerprint; and
  
  generating a modified sequence of bytes by replacing the first data interval in the sequence of bytes with a first storage indicator corresponding to the first stored data fingerprint and replacing the second data interval in the sequence of bytes with a second storage indicator corresponding to the second stored data fingerprint.
  - 2. The method of claim 1, further comprising storing the modified sequence of bytes in a data store.
  - 3. The method of claim 1, further comprising transmitting the modified sequence of bytes to a mirror server.
  - 4. The method of claim 1, wherein the anchor is a first anchor and wherein generating the first data fingerprint comprises:
    - detecting a second anchor in the first data interval;
      
      defining a third data interval ending at a specified length after the location of the first anchor;
      
      determining whether the length of the third data interval is greater than a maximum size threshold; and
      
      in response to determining that the length of the third data interval is greater than the maximum size threshold, generating the first data fingerprint based on the first data interval.
  - 5. The method of claim 1, wherein detecting the anchor comprises performing a rolling hash on the sequence of bytes.
  - 6. The method of claim 1, further comprising storing the second data fingerprint and a reference to the contents of the second data interval in a fingerprint database.
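
Claim 5 detects anchors with a rolling hash. One common way to do this is sketched below, under assumptions that do not come from the patent: a polynomial hash over a sliding window of `width` bytes, with an anchor declared wherever the low bits selected by `mask` are all zero. The function name `find_anchors` and the default values for `width`, `mask`, `base` and `mod` are illustrative only.

```python
from typing import List


def find_anchors(
    data: bytes,
    width: int = 4,
    mask: int = 0x0F,
    base: int = 257,
    mod: int = (1 << 31) - 1,
) -> List[int]:
    """Return every index i where the hash of data[i-width+1 : i+1] is an anchor.

    The hash is updated in O(1) per byte by dropping the oldest byte and
    adding the newest, so the whole scan is linear in len(data).
    """
    anchors: List[int] = []
    top = pow(base, width - 1, mod)  # weight of the byte leaving the window
    h = 0
    for i, b in enumerate(data):
        if i >= width:
            h = (h - data[i - width] * top) % mod  # remove the oldest byte
        h = (h * base + b) % mod                   # append the newest byte
        if i >= width - 1 and (h & mask) == 0:
            anchors.append(i)
    return anchors
```

Because the hash depends only on the last `width` bytes, the same content produces an anchor at the same relative position wherever it appears in the sequence, which is what lets window boundaries follow the data after an insertion or deletion.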

7. A storage system for processing a backup data set, the storage system comprising:
- a processor;
  
  a memory;
  
  a data provider component configured to receive the backup data set from a storage server;
  
  an anchor detector component configured to detect an anchor at an anchor location in the backup data set;
  
  a data window control component configured to:
  
  define an initial data window in the backup data set extending from a beginning point to a first end point, wherein the initial data window has an initial size; and
  
  in response to determining that the anchor location is within the initial data window, define an extended data window in the backup data set extending from the beginning point of the initial data window to an end point a specified length after the anchor location, the extended data window having a second size different from the first size;
  
  a fingerprint generator component configured to generate a data fingerprint based on the portion of the data set in the extended data window;
  
  a data set de-duplication component configured to detect potentially duplicated data based on the data fingerprint and to generate a de-duplicated data set by replacing the data in the extended fingerprint window with a reference to a stored data segment; and
  
  a storage interface configured to communicate with a storage facility to store the de-duplicated data set.
  - 8. The storage system of claim 7, further comprising:
    - a fingerprint database configured to store a plurality of data fingerprints, wherein the data set de-duplication component is configured to detect potentially duplicated data by comparing the generated data fingerprint to the plurality of data fingerprints and wherein the reference is a database reference to an individual stored data fingerprint of the plurality of data fingerprints.
  - 9. The storage system of claim 7, wherein detecting the anchor comprises performing a hash function on a portion of the data set.
  - 10. The storage system of claim 7, further comprising a fingerprint storage component configured to store the generated data fingerprint and a reference to the contents of the extended fingerprint window in a fingerprint database.
  - 11. The storage system of claim 7, wherein the anchor is a first anchor, wherein the anchor location is a first anchor location, wherein the anchor detection component is further configured to detect a second anchor at a second anchor location within the extended data window, and wherein the data window control component is further configured to extend the extended data fingerprint window such that the extended data fingerprint window has an updated end point that is a specified interval after the location of the second anchor.
  - 12. The storage system of claim 7, wherein the data de-duplication component is configured to:
    - compare the generated data fingerprint to a stored data fingerprint in a fingerprint database;
      
      perform a bitwise comparison between the contents of the data fingerprint window and a data segment stored corresponding to the stored data fingerprint;
      
      based on the comparison, replace a region of the data set defined by the data fingerprint window with a storage indicator corresponding to the stored data fingerprint.
  - 13. The storage system of claim 7, wherein the storage system is a Virtual Tape Library (VTL) system.
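
The lookup-then-verify step of claim 12 can be illustrated with a small sketch. Everything named here is an assumption for the example: the in-memory dictionary standing in for the fingerprint database, the `("ref", fingerprint)` tuple standing in for a storage indicator, and SHA-256 as the fingerprint function. Python's `bytes` equality is an exact byte-for-byte comparison, which plays the role of the bitwise comparison in the claim.

```python
import hashlib
from typing import Dict, List, Tuple, Union

# A segment of the output is either raw bytes or a reference to stored data.
Segment = Union[bytes, Tuple[str, str]]


def deduplicate(windows: List[bytes], store: Dict[str, bytes]) -> List[Segment]:
    """Replace windows already present in `store` with ("ref", fingerprint).

    `store` maps fingerprint -> stored data segment and is updated in place
    with every new segment encountered.
    """
    out: List[Segment] = []
    for chunk in windows:
        fp = hashlib.sha256(chunk).hexdigest()
        stored = store.get(fp)
        if stored is not None and stored == chunk:
            # Fingerprint matched and the exact comparison confirmed it.
            out.append(("ref", fp))
        else:
            if stored is None:
                store[fp] = chunk  # first time this content is seen
            # On a fingerprint collision the raw bytes are kept unchanged.
            out.append(chunk)
    return out
```

The exact comparison costs one extra read of the stored segment per match, but it means a fingerprint collision can never cause different data to be replaced by a reference.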

14. A method for processing a data set at a storage server, wherein the data set is a sequence of individual data units, the method comprising:
- defining a data fingerprint window in the data set, wherein the data fingerprint window extends from a beginning point to a first end point in the data set, the data fingerprint window having a first size;
  
  detecting an anchor within the data fingerprint window;
  
  in response to detecting the anchor within the data fingerprint window, extending the data fingerprint window such that the extended data fingerprint window extends from the beginning point to a second end point that is a specified interval after the location of the anchor, the extended data fingerprint window having a second size different from the first size; and
  
  generating a data fingerprint based on the data fingerprint window.
  - 15. The method of claim 14, further comprising:
    - comparing the generated data fingerprint to a set of stored data fingerprints in a data fingerprint database; and
      
      replacing the data in the fingerprint window with a reference to a data segment corresponding to an individual stored data fingerprint in the fingerprint database.
  - 16. The method of claim 14, further comprising:
    - comparing the generated data fingerprint to a stored data fingerprint in a data fingerprint database; and
      
      based on the comparison, transmitting the generated data fingerprint to a mirror server.
  - 17. The method of claim 14, wherein extending the fingerprint window comprises:
    - determining whether the length of the extended fingerprint window is greater than a maximum size threshold; and
      
      in response to determining that the length is greater than the maximum size threshold, ignoring the anchor.
  - 18. The method of claim 14, further comprising generating a data fingerprint based on the beginning point and the first end point when an anchor is not detected in the data fingerprint window.
  - 19. The method of claim 14, wherein identifying the anchor comprises performing a hash function on a portion of the data set.
  - 20. The method of claim 14, further comprising storing the generated data fingerprint and a reference to a region of the data set defined by the extended data fingerprint window in a fingerprint database.
  - 21. The method of claim 14, wherein the anchor is a first anchor and further comprising detecting a second anchor in the data set and extending the extended data fingerprint window such that the extended data fingerprint window has a third end point that is the specified interval after the location of the second anchor.
  - 22. The method of claim 14, further comprising:
    - comparing the generated data fingerprint to a stored data fingerprint in a fingerprint database;
      
      performing a bitwise comparison between a region of the data set defined by the extended data fingerprint window and a data segment corresponding to the stored data fingerprint;
      
      based on the comparison, replacing the region of the data set defined by the data fingerprint window with a storage indicator corresponding to the stored data fingerprint.
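
The end-point arithmetic in claims 14, 17, 18 and 21 can be worked through with a short helper. This is a sketch under stated assumptions: offsets are byte positions with an exclusive end point, "a specified interval after the location of the anchor" is taken to mean `anchor + interval`, and the name `window_end` and its parameters are hypothetical.

```python
from typing import Iterable


def window_end(
    begin: int,
    first_end: int,
    anchors: Iterable[int],
    interval: int,
    max_size: int,
) -> int:
    """Return the final (exclusive) end point of a data fingerprint window.

    Each anchor found inside the current window moves the end point to
    anchor + interval, unless that would make the window longer than
    max_size, in which case the anchor is ignored.
    """
    end = first_end
    for a in sorted(anchors):
        if a < begin:
            continue  # anchor lies before this window
        if a >= end:
            break     # anchor lies beyond the window as it currently stands
        candidate = a + interval
        if candidate - begin > max_size:
            continue  # claim 17: extension too long, ignore the anchor
        if candidate > end:
            end = candidate  # claims 14 and 21: extend, possibly repeatedly
    return end
```

For a window starting at 0 with first end point 8, interval 4 and maximum size 16, anchors at offsets 6 and 9 extend the end point to 10 and then to 13. A further anchor at 12 extends it to 16, and an anchor at 15 is then ignored because it would require a window of 19 bytes. With no anchors the end point stays at 8, as in claim 18.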

23. A method of facilitating data de-duplication, the method comprising:
- receiving a set of data;
  
  attempting to detect an anchor in the set of data by applying a data window of a specified length to the set of data;
  
  if an anchor is detected within the set of data, extending the length of the data window such that the extended data window has a length greater than the specified length, and otherwise maintaining the length of the data window;
  
  computing a data fingerprint for the set of data based on the data window; and
  
  using the data fingerprint to detect potentially duplicated data.
  - 24. The method of claim 23, further comprising replacing the potentially duplicated data with a reference to a stored data segment.
  - 25. The method of claim 23, wherein using the data fingerprint comprises comparing the data fingerprint to a set of stored data fingerprints in a data fingerprint database and further comprising:
    - replacing the data in the data fingerprint window with a storage indicator corresponding to an individual stored data fingerprint in the fingerprint database.
  - 26. The method of claim 23, wherein attempting to detect the anchor comprises performing a hash function on a portion of the set of data.
  - 27. The method of claim 23, further comprising storing the data fingerprint and a reference to a portion of the set of data defined by the data window in a fingerprint database.
  - 28. The method of claim 23, further comprising:
    - comparing the data fingerprint to a stored data fingerprint in a fingerprint database;
      
      performing a bitwise comparison between a portion of the set of data defined by the data window and a data segment corresponding to the stored data fingerprint;
      
      based on the comparison, replacing the portion of the set of data defined by the data window with a storage indicator corresponding to the stored data fingerprint.

Current Assignee: NetApp, Inc.
Original Assignee: NetApp, Inc.
Inventors: Stager, Roger Keith; Johnston, Craig Anthony
Primary Examiner(s): Vital, Pierre
Assistant Examiner(s): Obisesan, Augustine
Application Number: US12/539,867
Time in Patent Office: 1,007 Days
Field of Search: None
US Class Current: 707/692
CPC Class Codes:
G06F 11/1453   using de-duplication of the...
G06F 16/1752   based on file chunks
G06F 2201/83   the solution involving sign...
