Efficiently estimating compression ratio in a deduplicating file system

US 9,026,752 B1
Filed: 12/22/2011
Issued: 05/05/2015
Est. Priority Date: 12/22/2011
Status: Active Grant


9 Assignments

0 Petitions

Abstract

A system for estimating a quantity of unique identifiers comprises a processor and a memory. The processor is configured to, for each of k times, associate a bin of a set of bins with each received identifier. The processor is further configured to determine an estimate of the quantity of unique identifiers based at least in part on an average minimum associated bin value. The memory is coupled to the processor and configured to provide the processor with instructions.
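
The estimate described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the bin count `NUM_BINS`, the use of salted BLAKE2b as the family of k hash functions, and the function names are all assumptions made for illustration. The sketch reads "k times" as k independent hash functions, keeps one minimum bin number per hash function, and applies the formula from the claims (total number of bins divided by the average minimum bin number, minus one).

```python
import hashlib

# Assumed total number of bins, m. Bins are numbered 1..m.
NUM_BINS = 2 ** 64


def bin_number(identifier: bytes, trial: int) -> int:
    """Map an identifier to a bin; the salt selects one of k hash functions."""
    salt = trial.to_bytes(8, "little")
    digest = hashlib.blake2b(identifier, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little") + 1


def estimate_unique(identifiers, k: int = 256) -> float:
    """Estimate the number of distinct identifiers from k minimum bin numbers."""
    # One stored value per hash function, no matter how many identifiers arrive.
    minima = [NUM_BINS] * k
    for identifier in identifiers:
        for trial in range(k):
            b = bin_number(identifier, trial)
            if b < minima[trial]:
                minima[trial] = b
    average_min = sum(minima) / k
    # Total number of bins divided by the average minimum, minus one.
    return NUM_BINS / average_min - 1
```

Because a repeated identifier always hashes to the same bin, duplicates never lower a stored minimum, so the result depends only on the set of distinct identifiers. Memory use is k integers regardless of input size.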

19 Claims

1. A system for estimating a compression ratio of a deduplicating storage system, comprising:
- a processor configured to:
  - process an incoming stream of data into a set of segments;
  - for each of k times, associate a bin of an ordered set of bins with each received identifier using a hash function, wherein each received identifier comprises a fingerprint of a segment of the set of segments processed by a data fingerprinter coupled to the deduplicating storage system;
  - store only a minimum bin number resulting from the k times of hashing each received identifier, wherein the processor only stores one value for the k times of hashing each received identifier;
  - repeat the k times of associating a bin with a received identifier for n trials, where n is greater than two;
  - determine an average minimum associated bin number, wherein the average minimum associated bin number comprises an average of the minimum bin number over the n trials;
  - determining an estimate of a quantity of unique identifiers comprises dividing a total number of bins by the average minimum associated bin value and subtracting one;
  - determine an estimation of a compression ratio for a deduplicating file system based at least in part on the estimate of the quantity of unique identifiers; and
- a memory coupled to the processor and configured to provide the processor with instructions.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 17, 18, 19):
  - 2. The system as in claim 1, wherein hash function comprises one hash function of k hash functions.
  - 3. The system as in claim 2, wherein each of the k hash functions is selected to distribute a set of input values evenly across the ordered set of bins.
  - 4. The system as in claim 1, wherein the bin associated with each received identifier has a number identifying a place in the ordered set of bins.
  - 5. The system as in claim 4, wherein a minimum bin number is determined associated with received identifiers for each of the k times.
  - 6. The system as in claim 5, wherein the minimum bin number comprises a lowest bin number with an associated bin for a received identifier.
  - 7. The system as in claim 5, wherein the minimum bin number is averaged for each of the k times to determine the average bin number.
  - 8. The system as in claim 1, wherein the set of segments comprise segments determined from a data set desired to be stored by the deduplicating storage system.
  - 9. The system as in claim 1, wherein the deduplicating storage system stores a reference to a stored segment in the event that a segment is already stored.
  - 10. The system as in claim 1, wherein the deduplicating storage system stores meta information for reconstructing the data set from the set of segments.
  - 11. The system as in claim 1, wherein an estimate of a required storage space is determined based at least in part on the estimate of the compression ratio for the deduplicating file system.
  - 12. The system as in claim 1, wherein an error in the estimate of the quantity of unique identifiers is determined.
  - 13. The system as in claim 12, wherein the error is based at least in part on k.
  - 14. The system as in claim 12, wherein an estimate of a required storage space is determined based at least in part on the error in the estimate of the quantity of unique identifiers.
  - 17. The system as in claim 1, wherein determining the estimation of the compression ratio for the deduplicating file system comprises dividing the estimate of the quantity of unique identifiers by a total number of identifiers in the deduplicating storage system.
  - 18. The system as in claim 1, wherein the average minimum associated bin number is a running average between a current trial and a last trial.
  - 19. The system as in claim 1, wherein the average minimum associated bin number is a sum of minimum values for each trial and then divided by n at the end of n trials.
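
The dependent claims on averaging (18, 19), the compression ratio (17), required storage (11, 14), and error (12, 13) can be sketched as small helper functions. This is a hedged illustration, not text from the patent: every function name is hypothetical, the storage model assumes segments of similar size, and the `1 / sqrt(k)` error figure is a general rule of thumb for averages of independent trials rather than a bound stated in the claims.

```python
import math


def batch_average(minima):
    """Claim 19 style: sum the per-trial minima, then divide by n at the end."""
    return sum(minima) / len(minima)


def running_average(minima):
    """Incremental mean updated after each trial (one reading of claim 18)."""
    average = 0.0
    for trial, value in enumerate(minima, start=1):
        average += (value - average) / trial
    return average


def compression_ratio(unique_estimate, total_identifiers):
    """Claim 17: estimated unique identifiers divided by total identifiers."""
    return unique_estimate / total_identifiers


def required_storage(logical_bytes, ratio):
    """Assumed model: stored bytes scale with the fraction of unique segments."""
    return logical_bytes * ratio


def rule_of_thumb_error(k):
    """Assumption: relative error of an average of k independent trials."""
    return 1.0 / math.sqrt(k)
```

Both averaging forms give the same value; the running form only avoids holding a large sum. For example, `compression_ratio(250, 1000)` returns `0.25`. Under the definition in claim 17 the ratio is at most one, so a smaller value means more of the data was deduplicated.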

15. A method for estimating a compression ratio of a deduplicating storage system comprising:
- process, using a processor, an incoming stream of data into a set of segments;
- for each of k times, associating, using the processor, a bin of an ordered set of bins with each received identifier using a hash function, wherein each received identifier comprises a fingerprint of a segment of the set of segments processed by a data fingerprinter coupled to the deduplicating storage system;
- store, in a memory, only a minimum bin number resulting from the k times of hashing each received identifier, wherein the processor only stores in the memory one value for the k times of hashing each received identifier;
- repeat the k times of associating a bin with a received identifier for n trials, where n is greater than two;
- determining an average minimum associated bin number, wherein the average minimum associated bin number comprises an average of the minimum bin number over the n trials;
- determining an estimate of a quantity of unique identifiers comprises dividing a total number of bins by the average minimum associated bin value and subtracting one; and
- determining an estimation of a compression ratio for a deduplicating file system based at least in part on the estimate of the quantity of unique identifiers.
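
The method steps above can be sketched end to end as a streaming pipeline. Several choices here are assumptions rather than details from the claims: fixed-size segmentation (deduplicating systems often use content-defined chunking instead), SHA-1 as the fingerprint, salted BLAKE2b as the per-trial hash function, and the names `MinBinEstimator`, `segments`, and `estimate_ratio`. The point of the sketch is that the only per-trial state kept in memory is a single minimum bin number.

```python
import hashlib
import io

NUM_BINS = 2 ** 64  # assumed total number of bins, numbered 1..NUM_BINS


class MinBinEstimator:
    """Keeps one minimum bin number per trial plus a count of identifiers."""

    def __init__(self, trials: int = 200):
        self.minima = [NUM_BINS] * trials
        self.total = 0

    def add(self, fingerprint: bytes) -> None:
        self.total += 1
        for t in range(len(self.minima)):
            salt = t.to_bytes(8, "little")  # selects the hash for trial t
            digest = hashlib.blake2b(fingerprint, digest_size=8, salt=salt).digest()
            b = int.from_bytes(digest, "little") + 1
            if b < self.minima[t]:
                self.minima[t] = b

    def unique_estimate(self) -> float:
        average_min = sum(self.minima) / len(self.minima)
        return NUM_BINS / average_min - 1

    def compression_ratio(self) -> float:
        if self.total == 0:
            raise ValueError("no identifiers received")
        return self.unique_estimate() / self.total


def segments(stream, size: int = 4096):
    """Split a binary stream into fixed-size segments (an assumed policy)."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            break
        yield chunk


def estimate_ratio(stream, segment_size: int = 4096, trials: int = 200) -> float:
    estimator = MinBinEstimator(trials)
    for segment in segments(stream, segment_size):
        estimator.add(hashlib.sha1(segment).digest())  # fingerprint of the segment
    return estimator.compression_ratio()
```

A caller would pass any binary file-like object, for example `estimate_ratio(open(path, "rb"))`, where `path` is a placeholder. Raising `trials` tightens the estimate at the cost of more hashing per segment.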

16. A computer program product, the computer program product being embedded in a tangible non-transitory computer readable storage medium and comprising computer instructions for:
- process an incoming stream of data into a set of segments;
- for each of k times, associating a bin of an ordered set of bins with each received identifier using a hash function, wherein each received identifier comprises a fingerprint of a segment of the set of segments processed by a data fingerprinter coupled to the deduplicating storage system;
- store only a minimum bin number resulting from the k times of hashing each received identifier, wherein the processor only stores one value for the k times of hashing each received identifier;
- repeat the k times of associating a bin with a received identifier for n trials, where n is greater than two;
- determining an average minimum associated bin number, wherein the average minimum associated bin number comprises an average of the minimum bin number over the n trials;
- determining an estimate of a quantity of unique identifiers comprises dividing a total number of bins by the average minimum associated bin value and subtracting one; and
- determining an estimation of a compression ratio for a deduplicating file system based at least in part on the estimate of the quantity of unique identifiers.

Current Assignee: EMC IP Holding Company LLC (Dell Technologies Inc.)
Original Assignee: EMC Corporation (Dell Technologies Inc.)
Inventors: Botelho, Fabiano
Primary Examiner(s): Rones, Charles
Assistant Examiner(s): Doan, Han V
Application Number: US13/334,499
Time in Patent Office: 1,230 Days
Field of Search
US Class Current: 711/162
CPC Class Codes:
G06F 11/1456   Hardware arrangements for b...
G06F 11/2074   Asynchronous techniques
G06F 16/1748   De-duplication implemented ...
G06F 16/215   Improving data quality; Dat...
