System and method for eliminating duplicate data by generating data fingerprints using adaptive fixed-length windows
First Claim
1. A method for removing duplicate data from a sequence of bytes at a storage server, the method comprising:
- generating a first data fingerprint based on a first data interval in the sequence of bytes, the first data interval having a first length;
detecting an anchor in the sequence of bytes at a point after the first interval;
defining a second data interval in the sequence of bytes extending from a first position in the sequence to a second position located a specified interval after the location of the anchor, the second data interval having a second length greater than the first length;
generating a second data fingerprint based on the second window;
finding a first stored data fingerprint in a data fingerprint database corresponding to the first data fingerprint;
finding a second stored data fingerprint in the fingerprint database corresponding to the second data fingerprint; and
generating a modified sequence of bytes by replacing the first data interval in the sequence of bytes with a first storage indicator corresponding to the first stored data fingerprint and replacing the second data interval in the sequence of bytes with a second storage indicator corresponding to the second stored data fingerprint.
1 Assignment
0 Petitions
Accused Products
Abstract
A method and system for generating data fingerprints is used to de-duplicate a data set having a high level of redundancy. A fingerprint generator generates a data fingerprint based on a data window. Each byte of the data set is added to the fingerprint generator and used to detect an anchor within the received data. If no anchor is detected, the system continues receiving bytes until a predefined window size is reached. When the window size is reached, the system records a data fingerprint based on the data window and resets the window size. If an anchor is detected, the system extends the window size such that the window ends a specified length after the location of the anchor. If the extended window is greater than a maximum size, the system ignores the anchor. The generated fingerprints are compared to a fingerprint database. The data set is then de-duplicated by replacing matching data segments with references to corresponding stored data segments.
85 Citations
28 Claims
-
1. A method for removing duplicate data from a sequence of bytes at a storage server, the method comprising:
-
generating a first data fingerprint based on a first data interval in the sequence of bytes, the first data interval having a first length; detecting an anchor in the sequence of bytes at a point after the first interval; defining a second data interval in the sequence of bytes extending from a first position in the sequence to a second position located a specified interval after the location of the anchor, the second data interval having a second length greater than the first length; generating a second data fingerprint based on the second window; finding a first stored data fingerprint in a data fingerprint database corresponding to the first data fingerprint; finding a second stored data fingerprint in the fingerprint database corresponding to the second data fingerprint; and generating a modified sequence of bytes by replacing the first data interval in the sequence of bytes with a first storage indicator corresponding to the first stored data fingerprint and replacing the second data interval in the sequence of bytes with a second storage indicator corresponding to the second stored data fingerprint. - View Dependent Claims (2, 3, 4, 5, 6)
-
-
7. A storage system for processing a backup data set, the storage system comprising:
-
a processor; a memory; a data provider component configured to receive the backup data set from a storage server; an anchor detector component configured to detect an anchor at an anchor location in the backup data set; a data window control component configured to; define an initial data window in the backup data set extending from a beginning point to a first end point, wherein the initial data window has an initial size; and in response to determining that the anchor location is within the initial data window, define an extended data window in the backup data set extending from the beginning point of the initial data window to an end point a specified length after the anchor location, the extended data window having a second size different from the first size; a fingerprint generator component configured to generate a data fingerprint based on the portion of the data set in the extended data window; a data set de-duplication component configured to detect potentially duplicated data based on the data fingerprint and to generate a de-duplicated data set by replacing the data in the extended fingerprint window with a reference to a stored data segment; and a storage interface configured to communicate with a storage facility to store the de-duplicated data set. - View Dependent Claims (8, 9, 10, 11, 12, 13)
-
-
14. A method for processing a data set at a storage server, wherein the data set is a sequence of individual data units, the method comprising:
-
defining a data fingerprint window in the data set, wherein the data fingerprint window extends from a beginning point to a first end point in the data set, the data fingerprint window having a first size; detecting an anchor within the data fingerprint window; in response to detecting the anchor within the data fingerprint window, extending the data fingerprint window such that the extended data fingerprint window extends from the beginning point to a second end point that is a specified interval after the location of the anchor, the extended data fingerprint window having a second size different from the first size; and generating a data fingerprint based on the data fingerprint window. - View Dependent Claims (15, 16, 17, 18, 19, 20, 21, 22)
-
-
23. A method of facilitating data de-duplication, the method comprising:
-
receiving a set of data; attempting to detect an anchor in the set of data by applying a data window of a specified length to the set of data; if an anchor is detected within the set of data, extending the length of the data window such that the extended data window has a length greater than the specified length, and otherwise maintaining the length of the data window; computing a data fingerprint for the set of data based on the data window; and using the data fingerprint to detect potentially duplicated data. - View Dependent Claims (24, 25, 26, 27, 28)
-
Specification