Predictive probabilistic deduplication of storage

US 9,940,337 B2
Filed: 05/31/2015
Issued: 04/10/2018
Est. Priority Date: 05/31/2015
Status: Active Grant

Abstract

Examples perform predictive probabilistic deduplication of storage, such as virtualized or physical disks. Incoming input/output (I/O) commands include data, which is written to storage and tracked in a key-value store. The key-value store includes a hash of the data as the key, and a reference counter and the address of the data as the value. When a certain percentage of sampled incoming data is found to be duplicate, it is predicted that the I/O commands have become not unique (e.g., duplicate). Based on the prediction, subsequent incoming data is not written to storage, and instead the reference counter associated with the hash of the data is incremented. In this manner, predictions about the uniqueness of future data are made based on previous data, and extraneous writes to and deletions from the chunk store are avoided.
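
The key-value store described in the abstract can be illustrated with a minimal Python sketch. The names `ChunkStore`, `Entry`, and `put`, the use of SHA-256, the list standing in for storage, and the choice to start the reference counter at 1 are all assumptions made for illustration; the record above does not specify them.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Entry:
    ref_count: int  # logical references to this block (starting at 1 is an assumption)
    address: int    # position of the block in the simulated storage


class ChunkStore:
    """Toy content-addressed store: hash -> (reference counter, address)."""

    def __init__(self):
        self.table = {}    # the key-value table
        self.storage = []  # stand-in for physical storage; list index = address

    def put(self, block: bytes) -> bool:
        """Store a block. Return True if it was already present (duplicate)."""
        key = hashlib.sha256(block).hexdigest()
        entry = self.table.get(key)
        if entry is not None:
            # Duplicate: increment the reference counter, write nothing.
            entry.ref_count += 1
            return True
        # Unique: write the block, then record its hash, counter, and address.
        self.storage.append(block)
        self.table[key] = Entry(ref_count=1, address=len(self.storage) - 1)
        return False
```

This sketch looks up every block, which is the baseline cost that the predictive scheme is meant to reduce.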

20 Claims

1. A method for probability-based deduplication of storage, said method comprising:
- receiving, by a processor, a plurality of input/output (I/O) commands, said plurality of commands including content subdivided into a first plurality of data blocks;
  
  setting the first plurality of data blocks as unique;
  
  writing the first plurality of data blocks to storage;
  
  sampling the first plurality of data blocks based on the first plurality of data blocks being set as unique to check for unique and duplicate blocks in the first plurality of the blocks and updating a key-value table with the sampled blocks;
  
  predicting, by the processor, based on the sampling, whether a second plurality of blocks is expected to be unique or duplicate, wherein said predicting is performed without writing the second plurality of blocks to the storage; and
  
  upon predicting that the second plurality of blocks is duplicate:
  
  updating the key-value table with the duplicate blocks;
  
  tallying unique blocks in the second plurality of blocks;
  
  writing the unique blocks to the storage and updating a value in a uniqueness counter corresponding to the tallying; and
  
  upon the value in the uniqueness counter exceeding a threshold, predicting that a next plurality of blocks is expected to be unique; and
  
  upon predicting that the second plurality of blocks is unique:
  
  writing the second plurality of blocks to the storage; and
  
  continuing to perform said sampling and predicting with blocks of the received plurality of I/O commands, thereby deduplicating the content.
  - 2. The method of claim 1, wherein updating the key-value table further comprises calculating a hash of a unique block, inserting the hash of the unique block as a key into the key-value table, and inserting a reference counter and an address of the block in the storage as the associated value of that key in the key-value table.
  - 3. The method of claim 2, wherein the uniqueness counter is initialized at zero for a first instance of the unique block.
  - 4. The method of claim 3, wherein upon receiving subsequent instances of the unique block a logical storage of that block points to the entry of the block in the key-value table which points to a location of that block on a physical storage.
  - 5. The method of claim 1, wherein tallying unique blocks further comprises initializing the tally at zero upon changing to a prediction that the next plurality of blocks is duplicate from a prediction that the next plurality of blocks is unique.
  - 6. The method of claim 1, wherein quality of predictions is tuned by sampling more frequently, changing the size of the zone, or changing the type of sampling based on previous predictions.
  - 7. The method of claim 1, wherein the sampling is performed in at least one of the following ways:
    - random, stratified, cluster, multistage, systematic, or in accordance with an algorithm.
  - 8. The method of claim 1, wherein the processor performs operations to implement an I/O stack.
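
The flow of claim 1 can be read as a two-state loop: a "unique" state in which blocks are written directly and only a sample is checked, and a "duplicate" state in which every block is looked up and unique blocks are tallied. The following Python sketch illustrates that reading under stated assumptions. The class name, the parameter names and defaults, the fixed-size sampling window, and the every-n-th sampling rule are hypothetical and are not taken from the claims. As a simplification, a duplicate detected while sampling in the unique state leaves its already-written copy in storage.

```python
import hashlib


class PredictiveDedup:
    """Toy two-state deduplicator; all names and defaults are illustrative."""

    def __init__(self, sample_every=4, window=8, dup_fraction=0.5, unique_limit=3):
        self.table = {}       # hash -> [reference count, address]
        self.storage = []     # simulated storage; list index = address
        self.zone = "unique"  # the first blocks are set as unique
        self.sample_every = sample_every
        self.window = window              # sampled blocks per prediction
        self.dup_fraction = dup_fraction  # duplicate share that flips the prediction
        self.unique_limit = unique_limit  # threshold for the uniqueness counter
        self._seen = self._sampled = self._sampled_dups = 0
        self._unique_tally = 0

    @staticmethod
    def _key(block):
        return hashlib.sha256(block).digest()

    def _write(self, block):
        self.storage.append(block)
        return len(self.storage) - 1

    def submit(self, block):
        if self.zone == "unique":
            self._submit_unique(block)
        else:
            self._submit_duplicate(block)

    def _submit_unique(self, block):
        address = self._write(block)  # written without a lookup
        self._seen += 1
        if self._seen % self.sample_every != 0:
            return  # unsampled blocks are never hashed
        key = self._key(block)
        self._sampled += 1
        if key in self.table:
            self._sampled_dups += 1
        else:
            self.table[key] = [1, address]
        if self._sampled == self.window:
            if self._sampled_dups / self._sampled >= self.dup_fraction:
                self.zone = "duplicate"  # predict that the next blocks are duplicate
                self._unique_tally = 0   # tally starts at zero on the switch
            self._sampled = self._sampled_dups = 0

    def _submit_duplicate(self, block):
        key = self._key(block)
        entry = self.table.get(key)
        if entry is not None:
            entry[0] += 1  # duplicate: update the table, skip the write
            return
        self.table[key] = [1, self._write(block)]  # unique: write and tally
        self._unique_tally += 1
        if self._unique_tally > self.unique_limit:
            self.zone = "unique"  # predict that the next blocks are unique
            self._seen = self._sampled = self._sampled_dups = 0
```

For example, with `sample_every=1`, `window=2`, and `unique_limit=1`, submitting the same block twice flips the state to duplicate, after which a third copy only increments its reference count and adds nothing to storage.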

9. A non-transitory computer readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, cause the processor to perform operations for probability-based deduplication of storage by:
- initializing a first plurality of blocks included in a set of input/output (I/O) commands to unique;
  
  writing the first plurality of blocks to storage;
  
  sampling the first plurality of blocks based on the first plurality of blocks being initialized to unique;
  
  updating a key-value table with the sampled blocks;
  
  predicting, based on the sampling, whether other incoming blocks included in the set of I/O commands are unique or duplicate based on the sampling, wherein said predicting is performed without storing the other incoming blocks in the storage; and
  
  designating subsequent incoming blocks included in the set of I/O commands as unique or duplicate based on the prediction.
  - 10. The non-transitory computer readable storage medium of claim 9, wherein the computer-executable instructions further cause the processor to deduplicate the storage without performing read commands on the storage.
  - 11. The non-transitory computer readable storage medium of claim 9, wherein the computer-executable instructions further cause the processor to deduplicate the storage asynchronously or inline.
  - 12. The non-transitory computer readable storage medium of claim 9, wherein the computer-executable instructions further cause the processor to sample blocks more often or less often based on the prediction.
  - 13. The non-transitory computer readable storage medium of claim 9, wherein the computer-executable instructions further cause the processor to sample in at least one of the following ways:
    - random, stratified, cluster, multistage, systematic, or in accordance with an algorithm.
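
Claims 7 and 13 list several sampling methods without defining them. As a rough illustration only, two of them (systematic and simple random sampling) might look as follows in Python. The function names, parameters, and the fixed-fraction rule are assumptions, and the stratified, cluster, and multistage variants are omitted.

```python
import random


def systematic_sample(blocks, step, start=0):
    """Systematic sampling: take every `step`-th block, beginning at `start`."""
    return blocks[start::step]


def random_sample(blocks, fraction, seed=None):
    """Simple random sampling: pick a fixed fraction of blocks without replacement."""
    rng = random.Random(seed)  # a seed makes the choice reproducible
    k = round(len(blocks) * fraction)
    return rng.sample(blocks, k)
```

Systematic sampling is cheap and predictable, while random sampling avoids aliasing with periodic patterns in the input; which one suits a given workload is a tuning decision.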

14. A system for deduplicating storage in a predictive probabilistic manner, said system comprising:
- an input/output (I/O) stack programmed to:
  
  receive a stream of data blocks;
  
  select a first plurality of data blocks from the received stream of data blocks;
  
  set the first plurality of data blocks as unique;
  
  write the first plurality of data blocks to storage;
  
  sample the first plurality of data blocks based on the first plurality of data blocks being set as unique to check for unique and duplicate blocks in the first plurality of data blocks;
  
  update a key-value table in a content-based chunk store with the sampled data blocks;
  
  predict, based on the sampling, whether subsequent data blocks in the stream of data blocks are unique or duplicate, wherein said predicting is performed without writing the subsequent data blocks to the storage; and
  
  based on the prediction, designate the subsequent data blocks as unique or duplicate for further deduplication.
  - 15. The system of claim 14, wherein the I/O stack further assigns the subsequent data blocks to a unique zone or a duplicate zone based on the prediction.
  - 16. The system of claim 15, wherein the I/O stack further tallies unique blocks found in the duplicate zone.
  - 17. The system of claim 16, wherein the I/O stack changes from the duplicate zone to the unique zone when the tallied unique blocks reach a threshold.
  - 18. The system of claim 15, wherein the I/O stack further tracks a size of the unique zone and the duplicate zone and adjusts the sample based on the tracked size.
  - 19. The system of claim 18, wherein the I/O stack adjusts the sample as a function of a weighted average of the tracked duplicate zone size.
  - 20. The system of claim 14, wherein the I/O stack adjusts the sample based on the prediction.
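
Claims 18 and 19 describe adjusting the sample as a function of a weighted average of the tracked duplicate zone size, without giving the function. The Python sketch below shows one possible reading, assuming an exponentially weighted moving average and an assumed rule that longer duplicate zones call for denser sampling. The class name, `alpha`, `base_interval`, and the scaling formula are all hypothetical.

```python
class SampleRateTuner:
    """Derive a sampling interval from a weighted average of duplicate zone sizes."""

    def __init__(self, alpha=0.5, base_interval=64, min_interval=1):
        self.alpha = alpha                  # weight given to the newest zone size
        self.base_interval = base_interval  # interval used before any history exists
        self.min_interval = min_interval
        self.avg_dup_zone = 0.0             # running weighted average, in blocks

    def record_duplicate_zone(self, size):
        """Fold the size of a just-finished duplicate zone into the average."""
        self.avg_dup_zone = self.alpha * size + (1 - self.alpha) * self.avg_dup_zone

    def interval(self):
        """Blocks between samples; shrinks as the average duplicate zone grows."""
        scaled = self.base_interval / (1.0 + self.avg_dup_zone / self.base_interval)
        return max(self.min_interval, int(scaled))
```

With the defaults, recording a duplicate zone of 128 blocks moves the average to 64 and halves the interval from 64 to 32.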

Current Assignee: VMware, Inc. (Broadcom, Inc.); VMware LLC (Broadcom, Inc.)
Original Assignee: VMware, Inc. (Broadcom, Inc.)
Inventors: Wang, Wenguang; Luo, Tian
Primary Examiner: CHANNAVAJJALA, SRIRAMA T
Application Number: US14/726,597
Publication Number: US 20160350324A1
Time in Patent Office: 1,045 Days

CPC Class Codes

G06F 11/1453   using de-duplication of the...
G06F 16/137   Hash-based content-based in...
G06F 16/1748   De-duplication implemented ...
G06F 16/1752   based on file chunks
G06F 16/2255   Hash tables
G06F 3/0608   Saving storage space on sto...
G06F 3/0641   De-duplication techniques
