Scalable post-process deduplication

US 9,946,724 B1
Filed: 03/31/2014
Issued: 04/17/2018
Est. Priority Date: 03/31/2014
Status: Active Grant

Abstract

Implementations are provided herein for data deduplication and, more particularly, for post-process data deduplication on a large scale-out storage system. Multiple techniques and implementations are disclosed that offer greater efficiency, higher performance, and more stability when performing post-process data deduplication at large scale. Disclosed implementations are based on a process for data deduplication involving four main phases: enumeration, commonality, sharing, and update. Multi-level hashing can be used to identify candidates for deduplication during the enumeration phase, providing a more efficient use of compute resources. In addition, datasets can be phase rotated through the post-process deduplication steps, providing a more controllable deduplication environment as well as a more efficient use of resources.
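The multi-level hashing idea in the abstract can be illustrated with a minimal sketch. It assumes the low level hash is a CRC-32 checksum computed on the fly (the claims only say the low level hashes are read and may be 32-bit checksums) and that the high level hash is SHA-1. The function names `low_level_hash` and `enumerate_candidates` are hypothetical, and the sketch only looks for matches inside one dataset.

```python
import hashlib
import zlib
from collections import defaultdict
from typing import Dict


def low_level_hash(block: bytes) -> int:
    """Cheap 32-bit checksum (CRC-32 is an assumption made for illustration)."""
    return zlib.crc32(block) & 0xFFFFFFFF


def enumerate_candidates(blocks: Dict[int, bytes]) -> Dict[int, str]:
    """Return a candidate table: logical block id -> SHA-1 hex digest.

    Only blocks whose cheap checksum collides with another block's checksum
    are treated as potential matches and given the expensive SHA-1 hash.
    """
    by_checksum = defaultdict(list)
    for lbn, data in blocks.items():
        by_checksum[low_level_hash(data)].append(lbn)

    candidate_table = {}
    for lbns in by_checksum.values():
        if len(lbns) < 2:
            continue  # unique checksum: cannot be a duplicate, skip SHA-1
        for lbn in lbns:
            candidate_table[lbn] = hashlib.sha1(blocks[lbn]).hexdigest()
    return candidate_table
```

The point of the two levels is that the 160-bit hash is computed only for blocks that already look like duplicates, which is where the claimed saving in compute resources would come from.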

18 Claims

1. A system comprising:
- a memory that has stored thereon computer executable components; and
  
  at least one processor that executes the following computer executable components stored in the memory:
  
  a phase rotation component that generates a set of datasets of a file system, wherein the set of datasets includes at least a first dataset and a second dataset, and wherein the phase rotation component sends the first dataset to the enumeration component for ingestion;
  
  an enumeration component that ingests a dataset by:
  
  reading a set of low level hashes associated with the dataset, wherein low level hashes in the set of low level hashes are associated with a logical block identifier of the file system;
  
  analyzing the set of low level hashes and determining a set of potential matching candidates;
  
  generating a set of high level hashes based on the set of potential matching candidates and associated logical block identifiers; and
  
  adding the set of high level hashes and associated logical block identifiers to a candidate table;
  
  a disk pool policy component that in response to the enumeration component generating a set of high level hashes, determines and associates a disk pool policy identifier with high level hashes in the set of high level hashes;
  
  a commonality component that determines a set of shareable blocks by comparing high level hashes in the set of high level hashes of the candidate table with other high level hashes of the candidate table and an index table, wherein the index table contains a set of high level hashes, associated disk pool policy identifiers, and associated shadow store logical block identifiers, and wherein the set of shareable blocks is based on common disk pool policy identifiers; and
  
  a sharing component that updates the file system based on the set of shareable blocks, wherein, in response to the sharing component updating the file system, the phase rotation component sends the second dataset to the enumeration component for ingestion.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
  - 2. The system of claim 1, wherein the sharing component updates the file system by at least one of:
    - storing a set of data blocks in the shadow store based on the block address associated with a shareable block in the set of shareable blocks;
      
      generating a shadow store pointer for a shareable block in the set of shareable blocks and updating a metadata structure associated with the shareable block with the shadow store pointer, wherein the shadow store pointer points to a shadow store block address;
      
      or updating the index table based on the high level hash associated with the shareable block and a shadow store logical block identifier.
  - 3. The system of claim 1, wherein the first dataset and the second dataset do not overlap.
  - 4. The system of claim 1, further comprising:
    - a commonality range extension component that determines a largest shareable range for shareable blocks in the set of shareable blocks by analyzing neighboring blocks of the logical block identifiers associated with the shareable blocks.
  - 5. The system of claim 1, wherein the enumeration component ingests the dataset based on at least one of a sampled attribute associated with files of the dataset or an exclude attribute associated with files of the dataset.
  - 6. The system of claim 1, wherein the sharing component updates the file system by at least adding an entry to a reverse mapping table for shareable blocks in the set of shareable blocks wherein the entry includes at least a file identifier and a shadow store logical block identifier.
  - 7. The system of claim 1, wherein the index table is empty.
  - 8. The system of claim 1, wherein the low level hashes are 32-bit checksums.
  - 9. The system of claim 1, wherein the high level hashes are 160 bit SHA1 hashes.
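The commonality step in claim 1 can be sketched as a grouping problem: blocks are shareable only when both their high level hash and their disk pool policy identifier agree, either with another candidate or with an existing index table entry. The sketch below is an illustration under assumptions, not the patented implementation. The `Candidate` type, the `find_shareable` function, and the dictionary layout of the index table are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Tuple


class Candidate(NamedTuple):
    lbn: int        # logical block identifier in the file system
    high_hash: str  # high level hash, e.g. a SHA-1 hex digest
    policy_id: int  # disk pool policy identifier


# Assumed index table layout:
# (high level hash, policy id) -> shadow store logical block identifier
IndexTable = Dict[Tuple[str, int], int]


def find_shareable(
    candidates: List[Candidate], index: IndexTable
) -> Dict[Tuple[str, int], Tuple[Optional[int], List[int]]]:
    """Group candidates by (hash, policy) and keep the shareable groups.

    A group is shareable if the index table already holds the same
    (hash, policy) pair, or if at least two candidates share it.
    """
    groups = defaultdict(list)
    for c in candidates:
        groups[(c.high_hash, c.policy_id)].append(c.lbn)

    shareable = {}
    for key, lbns in groups.items():
        shadow_lbn = index.get(key)  # None if not yet in the shadow store
        if shadow_lbn is not None or len(lbns) > 1:
            shareable[key] = (shadow_lbn, sorted(lbns))
    return shareable
```

Keying on the pair rather than on the hash alone means two byte-identical blocks under different disk pool policies are never merged. An empty index table, as in claim 7, is handled by the same code path: only candidate-to-candidate matches are reported.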

10. A method comprising:
- generating a set of datasets of a file system, wherein the set of datasets includes at least a first dataset and a second dataset;
  
  ingesting the first dataset, wherein ingesting a dataset includes:
  
  scanning the dataset and reading a set of low level hashes based on the scanning, wherein low level hashes in the set of low level hashes are associated with a logical block identifier of the file system;
  
  analyzing the set of low level hashes and determining a set of potential matching candidates;
  
  generating a set of high level hashes based on the set of potential matching candidates and associated logical block identifiers;
  
  in response to the generating the set of high level hashes, determining a disk pool policy identifier for high level hashes in the set of high level hashes;
  
  associating the determined disk pool policy identifier with high level hashes in the set of high level hashes; and
  
  adding the set of high level hashes and associated logical block identifiers to a candidate table;
  
  determining a set of shareable blocks by comparing high level hashes in the set of high level hashes of the candidate table with other high level hashes of the candidate table and an index table, based in part on the set of shareable blocks having common disk pool policy identifiers, wherein the index table contains a set of high level hashes, associated disk pool identifiers and associated shadow store logical block identifiers;
  
  updating the file system based on the set of shareable blocks; and
  
  in response to the updating the file system, ingesting the second dataset.
- Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18)
  - 11. The method of claim 10, wherein the updating the file system includes at least one of:
    - storing a set of data blocks in the shadow store based on the block address associated with a shareable block in the set of shareable blocks;
      
      generating a shadow store pointer for a shareable block in the set of shareable blocks and updating a metadata structure associated with the shareable block with the shadow store pointer, wherein the shadow store pointer points to a shadow store logical block identifier;
      
      or updating the index table based on the high level hash associated with the shareable block and a shadow store block address.
  - 12. The method of claim 10, wherein the first dataset and the second dataset do not overlap.
  - 13. The method of claim 10, further comprising:
    - analyzing neighboring blocks of the block addresses associated with the shareable blocks, determining a largest shareable range for shareable blocks in the set of shareable blocks based on the analyzing, wherein updating the file system is based on the largest shareable range.
  - 14. The method of claim 10, wherein the scanning the dataset is based on at least one of a sampled attribute associated with files of the dataset or an exclude attribute associated with files of the dataset.
  - 15. The method of claim 10, wherein the updating the file system is at least adding an entry to a reverse mapping table for shareable blocks in the set of shareable blocks wherein the entry includes at least a file identifier and a shadow store logical block identifier.
  - 16. The method of claim 10, wherein the index table is empty.
  - 17. The method of claim 10, wherein the low level hashes are 32-bit checksums.
  - 18. The method of claim 10, wherein the high level hashes are 160 bit SHA1 hashes.
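Claims 4 and 13 describe extending a match over neighboring blocks to find the largest shareable range. One way this could work is sketched below, assuming each file is represented as a sequence of per-block hashes and that a single matching pair of blocks has already been found. The function name `extend_range` and this representation are assumptions made for illustration; the claims do not specify how neighbors are compared.

```python
from typing import Sequence, Tuple


def extend_range(
    a: Sequence[str], b: Sequence[str], i: int, j: int
) -> Tuple[int, int, int]:
    """Grow a seed match a[i] == b[j] over neighboring blocks.

    a and b are per-block hash sequences for two distinct files.
    Returns (start_in_a, start_in_b, length) of the largest contiguous
    run of matching blocks that contains the seed.
    """
    if a[i] != b[j]:
        raise ValueError("seed blocks do not match")

    back = 0  # number of matching blocks before the seed
    while i - back - 1 >= 0 and j - back - 1 >= 0 and a[i - back - 1] == b[j - back - 1]:
        back += 1

    fwd = 0  # number of matching blocks after the seed
    while i + fwd + 1 < len(a) and j + fwd + 1 < len(b) and a[i + fwd + 1] == b[j + fwd + 1]:
        fwd += 1

    return i - back, j - back, back + fwd + 1
```

Sharing one long range instead of many single blocks would let the file system be updated once per range, which is consistent with the claim language that the update is based on the largest shareable range.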

Current Assignee: EMC IP Holding Company LLC (Dell Technologies Inc.)
Original Assignee: EMC IP Holding Company LLC (Dell Technologies Inc.)
Inventors: Ghosh, Sourav; Tremaine, Jeffrey; Fleming, Matthew; Lemar, Eric M.; Rajawat, Mayank; Mahuli, Harsha
Primary Examiner(s): Alam, Hosain
Assistant Examiner(s): Allen, Nicholas
Application Number: US14/230,863
Time in Patent Office: 1,478 Days
Field of Search: 707/692
CPC Class Codes: G06F 16/1748 De-duplication implemented ...
