Extensible pipeline for data deduplication
First Claim
1. In a computing environment, a system, comprising, a data deduplication pipeline including a chunking phase configured to split data of files into chunks, in which the chunking phase comprises one or more modules that each correspond to a chunking algorithm, a deduplication detection phase configured to determine for each chunk whether that chunk is already stored in a deduplication system, and a commit phase that commits chunks to the deduplication system that are not determined by the deduplication detection phase to be stored in the deduplication system, and commits reference data for chunks that are already determined to be stored in the deduplication system.
Abstract
The subject disclosure is directed towards data deduplication (optimization) performed by phases/modules of a modular data deduplication pipeline. At each phase, the pipeline allows modules to be replaced, selected or extended, e.g., different algorithms can be used for chunking or compression based upon the type of data being processed. The pipeline facilitates secure data processing, batch processing, and parallel processing. The pipeline is tunable based upon feedback, e.g., by selecting modules to increase deduplication quality, performance and/or throughput. Also described is selecting, filtering, ranking, sorting and/or grouping the files to deduplicate, e.g., based upon properties and/or statistical properties of the files and/or a file dataset and/or internal or external feedback.
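The replaceable chunking modules described in the abstract can be pictured with a minimal sketch. It assumes a registry keyed by file extension; the names `CHUNKERS`, `fixed_size_chunker`, `line_chunker` and `select_chunker`, and the 4-byte chunk size, are hypothetical and chosen only for illustration, not taken from the patent.

```python
import os
from typing import Callable, Dict, Iterator

# A chunking module is any callable that splits bytes into chunks.
Chunker = Callable[[bytes], Iterator[bytes]]


def fixed_size_chunker(data: bytes, size: int = 4) -> Iterator[bytes]:
    """Split data into consecutive chunks of at most `size` bytes."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def line_chunker(data: bytes) -> Iterator[bytes]:
    """Split data at line boundaries, keeping the line endings in each chunk."""
    yield from data.splitlines(keepends=True)


# Hypothetical registry: file extension -> chunking module.
CHUNKERS: Dict[str, Chunker] = {".txt": line_chunker}


def select_chunker(filename: str) -> Chunker:
    """Pick a chunking module by file type, falling back to fixed-size chunks."""
    extension = os.path.splitext(filename)[1].lower()
    return CHUNKERS.get(extension, fixed_size_chunker)


# Extending the chunking phase only requires registering another module.
CHUNKERS[".log"] = line_chunker
```

Because every module shares one call signature, the later phases of such a pipeline would not need to know which algorithm produced the chunks.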
20 Claims
- 1. In a computing environment, a system, comprising, a data deduplication pipeline including a chunking phase configured to split data of files into chunks, in which the chunking phase comprises one or more modules that each correspond to a chunking algorithm, a deduplication detection phase configured to determine for each chunk whether that chunk is already stored in a deduplication system, and a commit phase that commits chunks to the deduplication system that are not determined by the deduplication detection phase to be stored in the deduplication system, and commits reference data for chunks that are already determined to be stored in the deduplication system.
- 14. In a computing environment, a method performed at least in part on at least one processor, comprising,
  receiving files to deduplicate via phases of a data deduplication pipeline;
  processing the data of the files into chunks in a modular chunking phase comprising one or more chunking algorithms;
  providing the chunks to an indexing phase that determines whether each of the chunks already exists in a deduplication system;
  committing each chunk in a chunk storing phase if that chunk was determined to not already exist in the deduplication system, and committing reference data for that chunk if that chunk was determined to exist in the deduplication system; and
  committing reference information to the file corresponding to the chunk or chunks extracted from that file.
  Dependent claims: 15, 16, 17, 18
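The steps of claim 14 can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not the patented implementation: the `DedupStore` class, the use of SHA-256 digests as chunk identifiers, the reference counts standing in for "reference data", and the fixed 4-byte chunk size are all hypothetical choices.

```python
import hashlib
from typing import Dict, List


class DedupStore:
    """Hypothetical in-memory stand-in for the deduplication system."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}         # chunk hash -> chunk data
        self.refcount: Dict[str, int] = {}         # chunk hash -> reference count
        self.file_refs: Dict[str, List[str]] = {}  # file name -> ordered chunk hashes


def chunk_fixed(data: bytes, size: int = 4) -> List[bytes]:
    """Chunking phase (assumed algorithm): fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def deduplicate(store: DedupStore, name: str, data: bytes) -> None:
    hashes = []
    for chunk in chunk_fixed(data):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store.chunks:
            # Indexing phase found no match: commit the new chunk.
            store.chunks[digest] = chunk
            store.refcount[digest] = 1
        else:
            # Chunk already exists: commit reference data only.
            store.refcount[digest] += 1
        hashes.append(digest)
    # Commit reference information for the file itself.
    store.file_refs[name] = hashes


def reconstruct(store: DedupStore, name: str) -> bytes:
    """Rebuild a file from its chunk references."""
    return b"".join(store.chunks[h] for h in store.file_refs[name])
```

Deduplicating `b"AAAABBBBAAAA"` and then `b"BBBBCCCC"` with this sketch stores three distinct chunks for five chunk references, and `reconstruct` returns each file's original bytes.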
- 19. One or more computer-readable media having computer-executable instructions, which when executed perform steps, comprising:
  selecting files for data deduplication;
  queuing the files for batch processing;
  processing the files into chunks in a secure modular chunking phase comprising one or more chunking algorithms;
  queuing the chunks for batch processing;
  processing the chunks to determine whether each chunk already exists in a deduplication system, and if not, storing each chunk that does not already exist to the deduplication system, and if so, storing reference data for each chunk that already exists;
  committing the chunk or chunks, or chunk reference data for a session, or the chunk or chunks and chunk reference data for a session, to the deduplication system, in conjunction with updating an index to each chunk that did not already exist in the deduplication system; and
  updating file metadata to associate the file with references to the chunk or chunks.
  Dependent claims: 20
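The queued, session-based flow of claim 19 can be sketched in the same spirit. The function `run_batch`, its parameters, the SHA-256 chunk identifiers and the 4-byte fixed-size chunking are assumptions made for illustration; the "secure" aspect of the chunking phase is not modeled here.

```python
import hashlib
from collections import deque
from typing import Dict, List, Set


def run_batch(files: Dict[str, bytes], index: Set[str], store: Dict[str, bytes],
              metadata: Dict[str, List[str]], chunk_size: int = 4) -> int:
    """Process one session and return the number of newly stored chunks."""
    file_queue = deque(files.items())  # files queued for batch processing
    chunk_queue = deque()              # (file name, chunk) pairs queued for batch processing

    # Chunking phase (assumed fixed-size algorithm).
    while file_queue:
        name, data = file_queue.popleft()
        for start in range(0, len(data), chunk_size):
            chunk_queue.append((name, data[start:start + chunk_size]))

    # Detection phase: collect new chunks and per-file references for the session.
    pending_chunks: Dict[str, bytes] = {}
    pending_refs: Dict[str, List[str]] = {}
    while chunk_queue:
        name, chunk = chunk_queue.popleft()
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in index and digest not in pending_chunks:
            pending_chunks[digest] = chunk
        pending_refs.setdefault(name, []).append(digest)

    # Commit phase: store new chunks and update the index together,
    # then update file metadata with the chunk references.
    store.update(pending_chunks)
    index.update(pending_chunks)  # adds the digests (dict keys) of the new chunks
    metadata.update(pending_refs)
    return len(pending_chunks)
```

Holding new chunks and references in per-session buffers until the end is one way to keep the index and the chunk store consistent if a session is interrupted before its commit step.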
Specification