Efficiently estimating compression ratio in a deduplicating file system

US 10,114,845 B2
Filed: 04/02/2015
Issued: 10/30/2018
Est. Priority Date: 12/22/2011
Status: Active Grant

Abstract

A system for estimating a quantity of unique identifiers comprises a processor and a memory. The processor is configured to, for each of k times, associate a bin of a set of bins with each received identifier. The processor is further configured to determine an estimate of the quantity of unique identifiers based at least in part on an average minimum associated bin value. The memory is coupled to the processor and configured to provide the processor with instructions.
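
The estimation described in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the function names `bin_number` and `estimate_unique`, the use of salted BLAKE2b as a stand-in for k independent hash functions, and the default values of `k` and `num_bins` are all hypothetical choices. The final step (total bins divided by the average minimum bin number, minus one) follows the formula recited in claim 8. The sketch does not model the "n trials" repetition recited in claim 1.

```python
import hashlib


def bin_number(identifier: bytes, trial: int, num_bins: int) -> int:
    """Map an identifier to a bin in 1..num_bins; `trial` picks one of the k hash functions."""
    # Assumption: salting BLAKE2b with the trial index stands in for k independent hashes.
    salt = trial.to_bytes(8, "little")
    digest = hashlib.blake2b(identifier, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little") % num_bins + 1


def estimate_unique(identifiers, k: int = 128, num_bins: int = 2**32) -> float:
    """Estimate the number of distinct identifiers from k per-hash minimum bin numbers."""
    if k < 1:
        raise ValueError("k must be at least 1")
    minima = [None] * k  # only k numbers are stored, never a list of identifiers
    for ident in identifiers:
        for trial in range(k):
            b = bin_number(ident, trial, num_bins)
            if minima[trial] is None or b < minima[trial]:
                minima[trial] = b
    if minima[0] is None:  # no identifiers were received
        return 0.0
    average_minimum = sum(minima) / k
    # Claim 8: total number of bins divided by the average minimum, minus one.
    return num_bins / average_minimum - 1


if __name__ == "__main__":
    fingerprints = [hashlib.sha1(str(i).encode()).digest() for i in range(1000)]
    stream = fingerprints * 3  # every segment fingerprint is seen three times
    print(round(estimate_unique(stream)))
```

Because a repeated identifier always lands in the same bin for a given hash function, duplicates never change the stored minima, which is why no list of previously seen identifiers has to be kept.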

23 Claims

1. A deduplicating storage system, comprising:
- a processor configured to:
  
  for each of k times:
  
  associate a bin of an ordered set of bins with each received identifier, wherein each bin in the ordered set of bins has a bin number and each received identifier comprises a fingerprint of a segment of a set of segments stored on a file system of the deduplicating storage system;
  
  determine a minimum bin number associated with each received identifier, the minimum bin number being the bin number that is minimum among the bins associated with the each received identifier;
  
  repeat the k times of associating a bin with a received identifier for n trials, where n is greater than two;
  
  determine an estimate of a quantity of unique identifiers based at least in part on an average of the minimum associated bin number;
  
  determine a data compression ratio of the segments stored in the file system of the deduplicating storage system based on the estimated quantity of the unique identifiers without having to record a list of the unique identifiers and check the list of the unique identifiers for the each received identifier;
  
  determine a capacity of the deduplicating storage system; and
  
  back up data to the system of the deduplicating storage system based on the determined capacity of the deduplicating storage system and the determined data compression ratio of the segments stored therein; and
  
- a memory coupled to the processor and configured to provide the processor with instructions.
- - 2. A system as in claim 1, wherein the bin of the ordered set of bins is associated with each received identifier for each of the k times using one hash function of k hash functions.
  - 3. A system as in claim 2, wherein each of the k hash functions is selected to distribute a set of input values evenly across the ordered set of bins.
  - 4. A system as in claim 1, wherein the bin associated with each received identifier has a number identifying the bin's place in the ordered set of bins.
  - 5. A system as in claim 4, wherein a minimum bin number is determined associated with received identifiers for each of the k times.
  - 6. A system as in claim 5, wherein the minimum bin number comprises a lowest bin number with an associated bin for a received identifier.
  - 7. A system as in claim 5, wherein the minimum bin number is averaged for each of the k times to determine an average bin number.
  - 8. A system as in claim 1, wherein determining the estimate of the quantity of unique identifiers comprises dividing a total number of bins by an average minimum associated bin value and subtracting one.
  - 9. A system as in claim 1, wherein each received identifier comprises one of a plurality of identifiers determined from a segment of the set of segments.
  - 10. A system as in claim 9, wherein the set of segments comprise segments determined from a data set desired to be stored by the deduplicating storage system.
  - 11. A system as in claim 10, wherein the deduplicating storage system stores a reference to a stored segment in the event that a segment is already stored.
  - 12. A system as in claim 10, wherein the deduplicating storage system stores meta information for reconstructing the data set from the set of segments.
  - 13. A system as in claim 9, wherein the received identifier comprises a fingerprint of the segment.
  - 14. A system as in claim 1, wherein an estimate of a required storage space is determined based at least in part on the estimate of the quantity of unique identifiers.
  - 15. A system as in claim 1, wherein an error in the estimate of the quantity of unique identifiers is determined.
  - 16. A system as in claim 15, wherein the error is based at least in part on k.
  - 17. A system as in claim 15, wherein an estimate of a required storage space is determined based at least in part on the error in the estimate of the quantity of unique identifiers.
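
Claims 1 and 14 through 17 tie the unique-identifier estimate to a compression ratio, a required storage space, and an error that depends on k, but the claims do not give formulas for these quantities. The following Python sketch shows one plausible way to connect them. Everything in it is an assumption made for illustration: the function names, the definition of the ratio as received segments divided by estimated unique segments, the rough 1/sqrt(k) relative error (which presumes the k minima are independent), and the rule of padding the space estimate by that error.

```python
import math


def relative_error(k: int) -> float:
    """Assumed rough relative standard error when k independent minima are averaged."""
    return 1.0 / math.sqrt(k)


def compression_ratio(total_segments: int, estimated_unique: float) -> float:
    """Assumed definition: segments received divided by estimated distinct segments."""
    if estimated_unique <= 0:
        raise ValueError("estimated_unique must be positive")
    return total_segments / estimated_unique


def required_space(logical_bytes: int, ratio: float, error: float = 0.0) -> float:
    """Physical bytes needed, padded upward in case the ratio is overstated."""
    return logical_bytes / ratio * (1.0 + error)


def can_back_up(free_bytes: int, logical_bytes: int, ratio: float, error: float = 0.0) -> bool:
    """Decide whether a backup fits in the remaining capacity (hypothetical policy)."""
    return required_space(logical_bytes, ratio, error) <= free_bytes


if __name__ == "__main__":
    ratio = compression_ratio(total_segments=6000, estimated_unique=2000.0)
    err = relative_error(k=100)
    print(ratio, err, can_back_up(1200, 3000, ratio, err))
```

Padding by the error term is a conservative choice: an underestimate of the unique count overstates the ratio, so the space estimate is biased upward rather than downward.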

18. A method for data backup based on estimates of a quantity of unique identifiers in a deduplicating storage system comprising:
- for each of k times:
  
  associating a bin of an ordered set of bins with each received identifier, wherein each bin in the ordered set of bins has a bin number and each received identifier comprises a fingerprint of a segment of a set of segments stored on a file system of the deduplicating storage system;
  
  determining, using a processor, a minimum bin number associated with each received identifier, the minimum bin number being the bin number that is minimum among the bins associated with the each received identifier;
  
  repeating the k times of associating a bin with a received identifier for n trials, where n is greater than two;
  
  determining an estimate of a quantity of unique identifiers based at least in part on an average minimum associated bin number;
  
  determining a data compression ratio of the segments stored in the file system of the deduplicating storage system based on the estimated quantity of the unique identifiers without having to record a list of the unique identifiers and check the list of the unique identifiers for the each received identifier;
  
  determining a capacity of the deduplicating storage system; and
  
  backing up data to the file system of the deduplicating storage system based on the determined capacity of the deduplicating storage system and the determined data compression ratio of the segments stored therein.
- - 19. The method of claim 18, wherein the bin associated with each received identifier has a number identifying the bin's place in the ordered set of bins.
  - 20. The method of claim 18, wherein determining the estimate of the quantity of unique identifiers comprises dividing a total number of bins by an average minimum associated bin value and subtracting one.
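
The formula in claims 8 and 20 (total number of bins divided by the average minimum bin value, minus one) can be motivated by a short calculation. The derivation below is a sketch under assumptions that the claims do not state: the symbols $m$ (number of bins), $u$ (number of unique identifiers), and $B_{\min}$ (minimum bin number for one hash function) are introduced here for illustration, each hash function is taken to spread the unique identifiers independently and uniformly over bins $1, \dots, m$, and $m$ is taken to be much larger than $u$.

```latex
\begin{align*}
  \Pr(B_{\min} > j) &= \left(1 - \frac{j}{m}\right)^{u}, \qquad j = 0, 1, \dots, m, \\
  \mathbb{E}[B_{\min}] &= \sum_{j=0}^{m-1} \left(1 - \frac{j}{m}\right)^{u}
      \approx m \int_{0}^{1} (1 - x)^{u} \, dx = \frac{m}{u + 1}, \\
  u &\approx \frac{m}{\mathbb{E}[B_{\min}]} - 1 .
\end{align*}
```

Replacing the expectation with the average of the k observed minima gives the estimate recited in the claims.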

21. A computer program product, the computer program product being embedded in a non-transitory computer readable storage medium and comprising computer instructions for performing a method for data backup based on estimates of a quantity of unique identifiers in a deduplicating storage system comprising:
- for each of k times:
  
  associating a bin of an ordered set of bins with each received identifier, wherein each bin in the ordered set of bins has a bin number and each received identifier comprises a fingerprint of a segment of a set of segments stored on a file system of the deduplicating storage system;
  
  determining, using a processor, a minimum bin number associated with each received identifier, the minimum bin number being the bin number that is minimum among the bins associated with the each received identifier;
  
  repeating the k times of associating a bin with a received identifier for n trials, where n is greater than two;
  
  determining an estimate of a quantity of unique identifiers based at least in part on an average minimum associated bin number;
  
  determining a data compression ratio of the segments stored in the file system of the deduplicating storage system based on the estimated quantity of the unique identifiers without having to record a list of the unique identifiers and check the list of the unique identifiers for the each received identifier;
  
  determining a capacity of the deduplicating storage system; and
  
  backing up data to the file system of the deduplicating storage system based on the determined capacity of the deduplicating storage system and the determined data compression ratio of the segments stored therein.
- - 22. The computer program product of claim 21, wherein the bin associated with each received identifier has a number identifying the bin's place in the ordered set of bins.
  - 23. The computer program product of claim 21, wherein determining the estimate of the quantity of unique identifiers comprises dividing a total number of bins by an average minimum associated bin value and subtracting one.

Current Assignee: EMC IP Holding Company LLC (Dell Technologies Inc.)
Original Assignee: EMC IP Holding Company LLC (Dell Technologies Inc.)
Inventors: Botelho, Fabiano
Primary Examiner(s): Rones, Charles
Assistant Examiner(s): Doan, Han
Application Number: US14/677,822
Publication Number: US 20150363438A1
Time in Patent Office: 1,307 Days
CPC Class Codes

G06F 11/1456   Hardware arrangements for b...

G06F 11/2074   Asynchronous techniques

G06F 16/1748   De-duplication implemented ...

G06F 16/215   Improving data quality; Dat...
