Efficiently estimating compression ratio in a deduplicating file system
First Claim
Patent Images
1. A system for estimating a compression ratio of a deduplicating storage system, comprising:
- a processor configured to;
process an incoming stream of data into a set of segments;
for each of k times, associate a bin of an ordered set of bins with each received identifier using a hash function, wherein each received identifier comprises a fingerprint of a segment of the set of segments processed by a data fingerprinter coupled to the deduplicating storage system;
store only a minimum bin number resulting from the k times of hashing each received identifier, wherein the processor only stores one value for the k times of hashing each received identifier;
repeat the k times of associating a bin with a received identifier for n trials, where n is greater than two;
determine an average minimum associated bin number, wherein the average minimum associated bin number comprises an average of the minimum bin number over the n trials;
determining an estimate of a quantity of unique identifiers comprises dividing a total number of bins by the average minimum associated bin value and subtracting one;
determine an estimation of a compression ratio for a deduplicating file system based at least in part on the estimate of the quantity of unique identifiers; and
a memory coupled to the processor and configured to provide the processor with instructions.
9 Assignments
0 Petitions
Accused Products
Abstract
A system for estimating a quantity of unique identifiers comprises a processor and a memory. The processor is configured to, for each of k times, associate a bin of a set of bins with each received identifier. The processor is further configured to determine an estimate of the quantity of unique identifiers based at least in part on an average minimum associated bin value. The memory is coupled to the processor and configured to provide the processor with instructions.
-
Citations
19 Claims
-
1. A system for estimating a compression ratio of a deduplicating storage system, comprising:
a processor configured to; process an incoming stream of data into a set of segments; for each of k times, associate a bin of an ordered set of bins with each received identifier using a hash function, wherein each received identifier comprises a fingerprint of a segment of the set of segments processed by a data fingerprinter coupled to the deduplicating storage system; store only a minimum bin number resulting from the k times of hashing each received identifier, wherein the processor only stores one value for the k times of hashing each received identifier; repeat the k times of associating a bin with a received identifier for n trials, where n is greater than two; determine an average minimum associated bin number, wherein the average minimum associated bin number comprises an average of the minimum bin number over the n trials; determining an estimate of a quantity of unique identifiers comprises dividing a total number of bins by the average minimum associated bin value and subtracting one; determine an estimation of a compression ratio for a deduplicating file system based at least in part on the estimate of the quantity of unique identifiers; and a memory coupled to the processor and configured to provide the processor with instructions. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 17, 18, 19)
-
15. A method for estimating a compression ratio of a deduplicating storage system comprising:
-
process, using a processor, an incoming stream of data into a set of segments; for each of k times, associating, using the processor, a bin of an ordered set of bins with each received identifier using a hash function, wherein each received identifier comprises a fingerprint of a segment of the set of segments processed by a data fingerprinter coupled to the deduplicating storage system; store, in a memory, only a minimum bin number resulting from the k times of hashing each received identifier, wherein the processor only stores in the memory one value for the k times of hashing each received identifier; repeat the k times of associating a bin with a received identifier for n trials, where n is greater than two; determining an average minimum associated bin number, wherein the average minimum associated bin number comprises an average of the minimum bin number over the n trials; determining an estimate of a quantity of unique identifiers comprises dividing a total number of bins by the average minimum associated bin value and subtracting one; and determining an estimation of a compression ratio for a deduplicating file system based at least in part on the estimate of the quantity of unique identifiers.
-
-
16. A computer program product, the computer program product being embedded in a tangible non-transitory computer readable storage medium and comprising computer instructions for:
-
process an incoming stream of data into a set of segments; for each of k times, associating a bin of an ordered set of bins with each received identifier using a hash function, wherein each received identifier comprises a fingerprint of a segment of the set of segments processed by a data fingerprinter coupled to the deduplicating storage system; store only a minimum bin number resulting from the k times of hashing each received identifier, wherein the processor only stores one value for the k times of hashing each received identifier; repeat the k times of associating a bin with a received identifier for n trials, where n is greater than two; determining an average minimum associated bin number, wherein the average minimum associated bin number comprises an average of the minimum bin number over the n trials; determining an estimate of a quantity of unique identifiers comprises dividing a total number of bins by the average minimum associated bin value and subtracting one; and determining an estimation of a compression ratio for a deduplicating file system based at least in part on the estimate of the quantity of unique identifiers.
-
Specification