System and method for estimating duplicate data
Abstract
The present invention provides a system and method for estimating duplicate data in a storage system. A duplicate estimation application executing on a client of a storage system selects an element from an intended destination such as a data store of the storage system. If the element is a file (or other data container), the application reads data from the file and computes a fingerprint of the read data. The computed fingerprint is then logged in a fingerprint database, which is illustratively stored on a storage device connected to the client executing the application. This process repeats until the entire file (or other data container) has been read and fingerprinted. Once all elements have been scanned, fingerprinted and recorded, the application identifies any unique entries within the fingerprint database. Utilizing this information, the application computes an estimated space savings that may be realized by employing a data de-duplication technique.
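The scan described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the 4096-byte segment size, the use of SHA-256 as the fingerprint function, the in-memory list standing in for the fingerprint database, and the names `fingerprint_file` and `estimate_duplicate_bytes` are all assumptions introduced here.

```python
import hashlib
import os
import tempfile

SEGMENT_SIZE = 4096  # assumed segment size in bytes; the source does not fix one


def fingerprint_file(path, segment_size=SEGMENT_SIZE):
    """Yield one fingerprint per fixed-size segment read from the file."""
    with open(path, "rb") as f:
        while True:
            segment = f.read(segment_size)
            if not segment:
                break
            # SHA-256 is an assumed fingerprint function, chosen for illustration.
            yield hashlib.sha256(segment).hexdigest()


def estimate_duplicate_bytes(paths, segment_size=SEGMENT_SIZE):
    """Return (total_entries, unique_entries, estimated_duplicate_bytes)."""
    fingerprint_db = []  # hypothetical stand-in for the fingerprint database
    for path in paths:
        fingerprint_db.extend(fingerprint_file(path, segment_size))
    total = len(fingerprint_db)
    unique = len(set(fingerprint_db))  # one entry per distinct fingerprint
    # A short trailing segment is counted as a full segment in this estimate.
    return total, unique, segment_size * (total - unique)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    first = os.path.join(workdir, "a.bin")
    second = os.path.join(workdir, "b.bin")
    with open(first, "wb") as f:
        f.write(b"A" * 4096 + b"B" * 4096)
    with open(second, "wb") as f:
        f.write(b"A" * 4096 + b"C" * 4096)
    print(estimate_duplicate_bytes([first, second]))  # prints (4, 3, 4096)
```

In the usage example the two files share one identical 4096-byte segment, so four fingerprints are logged, three are distinct, and the estimate is one segment of reclaimable space.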
17 Claims
1. A method for estimating duplicate data, comprising:
executing a duplicate estimation application on a system having a processor and memory;
selecting a data element stored on a storage device of the system;
reading a plurality of segments of data from the data element;
computing a fingerprint for each of the plurality of segments to produce a plurality of fingerprints;
storing the plurality of fingerprints in a fingerprint database;
identifying a total number of fingerprint entries in the fingerprint database;
identifying a total number of unique fingerprint entries of the total number of fingerprint entries in the fingerprint database, wherein each unique fingerprint represents a single instance of a fingerprint in the total number of fingerprint entries in the fingerprint database;
calculating an estimated amount of duplicate data by multiplying a size of a segment of data to a value obtained by subtracting the total number of unique fingerprint entries in the fingerprint database from the total number of fingerprint entries in the fingerprint database; and
providing the calculated estimated amount of duplicate data to a display, wherein the calculated estimated amount of duplicate data indicates estimated storage space saving that is realized by employing a data de-duplication technique to eliminate the duplicate data.
Dependent claims: 2, 3, 4, 5, 6.
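The calculating step of claim 1 can be written as a single expression. The symbols below are introduced here for illustration and do not appear in the claims: $D$ is the estimated amount of duplicate data, $s$ the segment size, $N_{\text{total}}$ the total number of fingerprint entries, and $N_{\text{unique}}$ the number of unique fingerprint entries. The numbers in the worked line are hypothetical.

```latex
\[
D = s \times \left( N_{\text{total}} - N_{\text{unique}} \right)
\]
% Hypothetical worked example: s = 4096 bytes, N_total = 1000, N_unique = 750
\[
D = 4096 \times (1000 - 750) = 4096 \times 250 = 1{,}024{,}000 \text{ bytes}
\]
```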
7. A non-transitory computer-readable storage medium stored with executable program instructions for execution by a processor, the non-transitory computer-readable storage medium comprising:
program instructions that select a data element stored on a computer;
program instructions that read a plurality of segments of data from the data element;
program instructions that compute a fingerprint for each of the plurality of segments to produce a plurality of fingerprints;
program instructions that store the plurality of fingerprints in a fingerprint database;
program instructions that identify a total number of fingerprint entries in the fingerprint database;
program instructions that identify a total number of unique fingerprint entries of the total number of fingerprint entries in the fingerprint database, wherein each unique fingerprint represents a single instance of a fingerprint in the total number of fingerprint entries in the fingerprint database;
program instructions that calculate an estimated amount of duplicate data by multiplying a size of a segment of data to a value obtained by subtracting the total number of unique fingerprint entries in the fingerprint database from the total number of fingerprint entries in the fingerprint database; and
program instructions that provide the calculated estimated amount of duplicate data to a display, wherein the calculated estimated amount of duplicate data indicates estimated storage space saving that is realized by employing a data de-duplication technique to eliminate the duplicate data.
8. A computing system, comprising:
a fingerprint database; and
a processor configured to operatively connect with the fingerprint database and a storage system, the processor configured to execute a duplicate estimation application to:
(i) select a plurality of data elements from the storage system,
(ii) read data from the plurality of data elements,
(iii) compute a fingerprint for each of the plurality of data elements to produce a plurality of fingerprints,
(iv) populate the fingerprint database with the plurality of fingerprints,
(v) identify a total number of fingerprint entries in the fingerprint database populated with the plurality of fingerprints,
(vi) identify a total number of unique fingerprint entries of the total number of fingerprint entries in the fingerprint database, wherein each unique fingerprint represents a single instance of a fingerprint in the total number of fingerprint entries in the fingerprint database,
(vii) compute an estimated amount of duplicate data by multiplying a size of a segment of data to a value obtained by subtracting the total number of unique fingerprint entries in the fingerprint database from the total number of fingerprint entries in the fingerprint database, and
(viii) provide the estimated amount of duplicate data to a display, wherein the estimated amount of duplicate data indicates estimated storage space saving that is realized by employing a data de-duplication technique to eliminate the duplicate data.
Dependent claims: 9, 10, 11, 12, 13, 14, 15.
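Steps (iv) through (vii) of claim 8 might look like the sketch below when the fingerprint database is an actual database. SQLite is an assumption made here for illustration; the claims do not name a database engine. The table name `fingerprints`, the column `fp`, and the three helper functions are likewise hypothetical.

```python
import sqlite3


def open_fingerprint_db(path=":memory:"):
    """Open (or create) a fingerprint database with one row per logged fingerprint."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS fingerprints (fp TEXT NOT NULL)")
    return conn


def log_fingerprints(conn, fingerprints):
    """Populate the database; duplicate fingerprints are kept as separate rows."""
    conn.executemany(
        "INSERT INTO fingerprints (fp) VALUES (?)",
        ((fp,) for fp in fingerprints),
    )
    conn.commit()


def estimate_from_db(conn, segment_size):
    """Return (total_entries, unique_entries, estimated_duplicate_bytes)."""
    total, unique = conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT fp) FROM fingerprints"
    ).fetchone()
    return total, unique, segment_size * (total - unique)


if __name__ == "__main__":
    conn = open_fingerprint_db()
    log_fingerprints(conn, ["a", "b", "a", "c", "a"])
    print(estimate_from_db(conn, 4096))  # prints (5, 3, 8192)
```

Keeping every logged fingerprint as its own row lets a single query return both the total and the distinct counts, which is all the estimate needs.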
16. A system configured to estimate duplicate data, comprising:
means for executing a duplicate estimation application on a system having a processor and a memory;
means for selecting a data element stored on a storage device of the system;
means for reading a plurality of segments from the data element;
means for computing a fingerprint for each of the plurality of segments to produce a plurality of fingerprints;
means for storing the plurality of fingerprints in a fingerprint database;
means for identifying a total number of fingerprint entries in the fingerprint database;
means for identifying a total number of unique fingerprint entries of the total number of fingerprint entries in the fingerprint database, wherein each unique fingerprint represents a single instance of a fingerprint in the total number of fingerprint entries in the fingerprint database;
means for calculating an estimated amount of duplicate data by multiplying a size of a segment of data to a value obtained by subtracting the total number of unique fingerprint entries in the fingerprint database from the total number of fingerprint entries in the fingerprint database; and
means for providing the calculated estimated amount of duplicate data to a display, wherein the calculated estimated amount of duplicate data indicates estimated storage space saving that is realized by employing a data de-duplication technique to eliminate the duplicate data.
Dependent claims: 17.
Specification