Scalable post-process deduplication
Abstract
Implementations are provided herein for data deduplication and, more particularly, for post-process data deduplication on a large scale-out storage system. Multiple techniques and implementations are disclosed that offer greater efficiency, higher performance, and more stability when performing post-process data deduplication at large scale. Disclosed implementations are based on a process for data deduplication involving four main phases: enumeration, commonality, sharing, and update. Multi-level hashing can be used to identify candidates for deduplication during the enumeration phase, providing a more efficient use of compute resources. In addition, datasets can be phase rotated through the post-process deduplication steps, providing a more controllable deduplication environment as well as a more efficient use of resources.
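The multi-level hashing idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the use of CRC-32 as the low level hash, SHA-256 as the high level hash, 8 KiB blocks, and the names `low_level_hash`, `high_level_hash`, and `enumerate_dataset` are all assumptions made for illustration. The sketch also computes the low level hashes on the fly, whereas the claims describe reading them from the dataset.

```python
import hashlib
import zlib
from collections import defaultdict


def low_level_hash(block: bytes) -> int:
    # Cheap 32-bit checksum standing in for the low level hash (assumption).
    return zlib.crc32(block)


def high_level_hash(block: bytes) -> str:
    # Stronger digest, computed only for potential matches (assumption: SHA-256).
    return hashlib.sha256(block).hexdigest()


def enumerate_dataset(blocks):
    """Return a candidate table: logical block number -> high level hash.

    `blocks` maps a logical block number to the block's bytes. Only blocks
    whose low level hash occurs more than once get a high level hash.
    """
    by_low = defaultdict(list)
    for lbn, data in blocks.items():
        by_low[low_level_hash(data)].append(lbn)

    candidate_table = {}
    for lbns in by_low.values():
        if len(lbns) > 1:  # potential matching candidates
            for lbn in lbns:
                candidate_table[lbn] = high_level_hash(blocks[lbn])
    return candidate_table


# Hypothetical 8 KiB blocks; blocks 0 and 2 hold identical data.
blocks = {0: b"A" * 8192, 1: b"B" * 8192, 2: b"A" * 8192, 3: b"C" * 8192}
candidates = enumerate_dataset(blocks)
```

In this sketch the expensive digest is computed for two of the four blocks, which is the resource saving the abstract attributes to multi-level hashing.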
18 Claims
1. A system comprising:
a memory that has stored thereon computer executable components; and
at least one processor that executes the following computer executable components stored in the memory:
a phase rotation component that generates a set of datasets of a file system, wherein the set of datasets includes at least a first dataset and a second dataset, and wherein the phase rotation component sends the first dataset to the enumeration component for ingestion;
an enumeration component that ingests a dataset by:
reading a set of low level hashes associated with the dataset, wherein low level hashes in the set of low level hashes are associated with a logical block identifier of the file system;
analyzing the set of low level hashes and determining a set of potential matching candidates;
generating a set of high level hashes based on the set of potential matching candidates and associated logical block identifiers; and
adding the set of high level hashes and associated logical block identifiers to a candidate table;
a disk pool policy component that, in response to the enumeration component generating a set of high level hashes, determines and associates a disk pool policy identifier with high level hashes in the set of high level hashes;
a commonality component that determines a set of shareable blocks by comparing high level hashes in the set of high level hashes of the candidate table with other high level hashes of the candidate table and an index table, wherein the index table contains a set of high level hashes, associated disk pool policy identifiers, and associated shadow store logical block identifiers, and wherein the set of shareable blocks is based on common disk pool policy identifiers; and
a sharing component that updates the file system based on the set of shareable blocks, wherein, in response to the sharing component updating the file system, the phase rotation component sends the second dataset to the enumeration component for ingestion.
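The comparison performed by the commonality component can be sketched in Python as follows. This is an illustrative reading of the claim language, not the patented implementation: the table layouts, the policy names `"ssd"` and `"hdd"`, the block numbers, and the function name `find_shareable` are hypothetical. The point shown is that blocks are treated as shareable only when both the high level hash and the disk pool policy identifier agree.

```python
from collections import defaultdict


def find_shareable(candidate_table, index_table):
    """Group candidate blocks that may share storage.

    candidate_table: list of (high_hash, policy_id, lbn) rows for file blocks.
    index_table: dict mapping (high_hash, policy_id) -> shadow store lbn.

    Returns (to_existing, to_new):
      to_existing maps a shadow store lbn to the file lbns that can point at it;
      to_new lists groups of file lbns that match each other but nothing indexed.
    """
    groups = defaultdict(list)
    for high_hash, policy_id, lbn in candidate_table:
        # The key includes the policy, so equal hashes under different
        # disk pool policies never land in the same group.
        groups[(high_hash, policy_id)].append(lbn)

    to_existing, to_new = {}, []
    for key, lbns in groups.items():
        if key in index_table:
            to_existing[index_table[key]] = sorted(lbns)
        elif len(lbns) > 1:
            to_new.append(sorted(lbns))
    return to_existing, to_new


candidate_table = [
    ("h1", "ssd", 10), ("h1", "ssd", 11),  # same hash, same policy
    ("h1", "hdd", 12),                      # same hash, different policy
    ("h2", "hdd", 20),                      # matches an indexed block
]
index_table = {("h2", "hdd"): 900}
to_existing, to_new = find_shareable(candidate_table, index_table)
```

Block 12 carries the same hash as blocks 10 and 11 but is left alone here because its policy identifier differs, which is one way to read "based on common disk pool policy identifiers".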
10. A method comprising:
generating a set of datasets of a file system, wherein the set of datasets includes at least a first dataset and a second dataset;
ingesting the first dataset, wherein ingesting a dataset includes:
scanning the dataset and reading a set of low level hashes based on the scanning, wherein low level hashes in the set of low level hashes are associated with a logical block identifier of the file system;
analyzing the set of low level hashes and determining a set of potential matching candidates;
generating a set of high level hashes based on the set of potential matching candidates and associated logical block identifiers;
in response to the generating the set of high level hashes, determining a disk pool policy identifier for high level hashes in the set of high level hashes;
associating the determined disk pool policy identifier with high level hashes in the set of high level hashes; and
adding the set of high level hashes and associated logical block identifiers to a candidate table;
determining a set of shareable blocks by comparing high level hashes in the set of high level hashes of the candidate table with other high level hashes of the candidate table and an index table, based in part on the set of shareable blocks having common disk pool policy identifiers, wherein the index table contains a set of high level hashes, associated disk pool identifiers and associated shadow store logical block identifiers;
updating the file system based on the set of shareable blocks; and
in response to the updating the file system, ingesting the second dataset.
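The ordering constraint in the method claim, where the second dataset is ingested only in response to the file system update for the first, can be sketched as a simple loop. This Python fragment is an assumption-laden illustration: the helper names `make_datasets` and `run_rotation`, the fixed dataset size, and the pass-through phase functions are hypothetical placeholders, and the phase names are taken from the abstract.

```python
def make_datasets(paths, size):
    """Split a list of file paths into consecutive datasets of at most `size`."""
    return [paths[i:i + size] for i in range(0, len(paths), size)]


def run_rotation(datasets, phases):
    """Run every phase on one dataset before the next dataset is ingested."""
    log = []
    for number, dataset in enumerate(datasets, start=1):
        state = dataset
        for name, phase in phases:
            state = phase(state)      # output of one phase feeds the next
            log.append((number, name))
    return log


# Placeholder phases that pass their input through unchanged (hypothetical).
phases = [
    ("enumeration", lambda d: d),
    ("commonality", lambda d: d),
    ("sharing", lambda d: d),
    ("update", lambda d: d),
]
datasets = make_datasets(["/a", "/b", "/c"], size=2)
log = run_rotation(datasets, phases)
```

The `log` list records the order in which phases ran, so the entry for the first dataset's update always precedes the entry for the second dataset's enumeration.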
Specification