Predictive probabilistic deduplication of storage
First Claim
1. A method for probability-based deduplication of storage, said method comprising:
- receiving, by a processor, a plurality of input/output (I/O) commands, said plurality of commands including content subdivided into a first plurality of data blocks;
setting the first plurality of data blocks as unique;
writing the first plurality of data blocks to storage;
sampling the first plurality of data blocks based on the first plurality of data blocks being set as unique to check for unique and duplicate blocks in the first plurality of the blocks and updating a key-value table with the sampled blocks;
predicting, by the processor, based on the sampling, whether a second plurality of blocks is expected to be unique or duplicate, wherein said predicting is performed without writing the second plurality of blocks to the storage; and
upon predicting that the second plurality of blocks is duplicate;
updating the key-value table with the duplicate blocks;
tallying unique blocks in the second plurality of blocks;
writing the unique blocks to the storage and updating a value in a uniqueness counter corresponding to the tallying; and
upon the value in the uniqueness counter exceeding a threshold, predicting that a next plurality of blocks is expected to be unique; and
upon predicting that the second plurality of blocks is unique;
writing the second plurality of blocks to the storage; and
continuing to perform said sampling and predicting with blocks of the received plurality of I/O commands, thereby deduplicating the content.
1 Assignment
0 Petitions
Accused Products
Abstract
Examples perform predictive probabilistic deduplication of storage, such as virtualized or physical disks. Incoming input/output (I/O) commands include data, which is written to storage and tracked in a key-value store. The key-value store includes a hash of the data as the key, and a reference counter and the address of the data as the value. When a certain percentage of sampled incoming data is found to be duplicate, it is predicted that the I/O commands have become not unique (e.g., duplicate). Based on the prediction, subsequent incoming data is not written to storage, and instead the reference counter associated with the hash of the data is incremented. In this manner, predictions on the uniqueness of future data is made based on previous data, and extraneous writes and deletions from the chunk store are avoided.
-
Citations
20 Claims
-
1. A method for probability-based deduplication of storage, said method comprising:
-
receiving, by a processor, a plurality of input/output (I/O) commands, said plurality of commands including content subdivided into a first plurality of data blocks; setting the first plurality of data blocks as unique; writing the first plurality of data blocks to storage; sampling the first plurality of data blocks based on the first plurality of data blocks being set as unique to check for unique and duplicate blocks in the first plurality of the blocks and updating a key-value table with the sampled blocks; predicting, by the processor, based on the sampling, whether a second plurality of blocks is expected to be unique or duplicate, wherein said predicting is performed without writing the second plurality of blocks to the storage; and upon predicting that the second plurality of blocks is duplicate; updating the key-value table with the duplicate blocks; tallying unique blocks in the second plurality of blocks; writing the unique blocks to the storage and updating a value in a uniqueness counter corresponding to the tallying; and upon the value in the uniqueness counter exceeding a threshold, predicting that a next plurality of blocks is expected to be unique; and upon predicting that the second plurality of blocks is unique; writing the second plurality of blocks to the storage; and continuing to perform said sampling and predicting with blocks of the received plurality of I/O commands, thereby deduplicating the content. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. A non-transitory computer readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, cause the processor to perform operations for probability-based deduplication of storage by:
-
initializing a first plurality of blocks included in a set of input/output (I/O) commands to unique; writing the first plurality of blocks to storage; sampling the first plurality of blocks based on the first plurality of blocks being initialized to unique; updating a key-value table with the sampled blocks; predicting, based on the sampling, whether other incoming blocks included in the set of I/O commands are unique or duplicate based on the sampling, wherein said predicting is performed without storing the other incoming blocks in the storage; and designating subsequent incoming blocks included in the set of I/O commands as unique or duplicate based on the prediction. - View Dependent Claims (10, 11, 12, 13)
-
-
14. A system for deduplicating storage in a predictive probabilistic manner, said system comprising:
-
an input/output (I/O) stack programmed to; receive a stream of data blocks; select a first plurality of data blocks from the received stream of data blocks; set the first plurality of data blocks as unique; write the first plurality of data blocks to storage; sample the first plurality of data blocks based on the first plurality of data blocks being set as unique to check for unique and duplicate blocks in the first plurality of data blocks; update a key-value table in a content-based chunk store with the sampled data blocks; predict, based on the sampling, whether subsequent data blocks in the stream of data blocks are unique or duplicate, wherein said predicting is performed without writing the subsequent data blocks to the storage; and based on the prediction, designate the subsequent data blocks as unique or duplicate for further deduplication. - View Dependent Claims (15, 16, 17, 18, 19, 20)
-
Specification