Extensible pipeline for data deduplication

US 8,380,681 B2
Filed: 12/16/2010
Issued: 02/19/2013
Est. Priority Date: 12/16/2010
Status: Active Grant

Abstract

The subject disclosure is directed towards data deduplication (optimization) performed by phases/modules of a modular data deduplication pipeline. At each phase, the pipeline allows modules to be replaced, selected or extended, e.g., different algorithms can be used for chunking or compression based upon the type of data being processed. The pipeline facilitates secure data processing, batch processing, and parallel processing. The pipeline is tunable based upon feedback, e.g., by selecting modules to increase deduplication quality, performance and/or throughput. Also described is selecting, filtering, ranking, sorting and/or grouping the files to deduplicate, e.g., based upon properties and/or statistical properties of the files and/or a file dataset and/or internal or external feedback.
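The phases named in the abstract can be pictured with a minimal sketch. It is an illustration only, not the patented implementation: the fixed-size chunker, the use of SHA-256 as the chunk identity, and the names `ChunkStore`, `deduplicate`, and `restore` are all assumptions introduced here.

```python
import hashlib
from typing import Callable, Dict, Iterable, List

# Hypothetical type: a chunking module is any callable that splits bytes into chunks.
Chunker = Callable[[bytes], Iterable[bytes]]


def fixed_size_chunker(size: int) -> Chunker:
    """Return a chunking module that cuts data into blocks of `size` bytes."""
    def chunk(data: bytes) -> Iterable[bytes]:
        for offset in range(0, len(data), size):
            yield data[offset:offset + size]
    return chunk


class ChunkStore:
    """In-memory stand-in for a deduplication system (an assumption)."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}     # chunk hash -> chunk bytes
        self.files: Dict[str, List[str]] = {}  # file name -> ordered chunk hashes

    def contains(self, digest: str) -> bool:
        # Detection phase: is a chunk with this hash already stored?
        return digest in self.chunks


def deduplicate(name: str, data: bytes, chunker: Chunker, store: ChunkStore) -> int:
    """Run one file through chunking, detection, and commit.

    Returns the number of chunks newly written to the store.
    """
    new_chunks = 0
    refs: List[str] = []
    for chunk in chunker(data):                     # chunking phase
        digest = hashlib.sha256(chunk).hexdigest()
        if not store.contains(digest):              # detection phase
            store.chunks[digest] = chunk            # commit a chunk not yet stored
            new_chunks += 1
        refs.append(digest)                         # reference recorded for every chunk
    store.files[name] = refs                        # commit the file's reference list
    return new_chunks


def restore(name: str, store: ChunkStore) -> bytes:
    """Rebuild a file from its committed chunk references."""
    return b"".join(store.chunks[d] for d in store.files[name])
```

Because the chunker is passed in as a parameter, a different chunking module can be swapped in without touching the detection or commit logic, which is the kind of replaceability the abstract describes.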

51 Citations

20 Claims

1. In a computing environment, a system, comprising, a data deduplication pipeline including a chunking phase configured to split data of files into chunks, in which the chunking phase comprises one or more modules that each correspond to a chunking algorithm, a deduplication detection phase configured to determine for each chunk whether that chunk is already stored in a deduplication system, and a commit phase that commits chunks to the deduplication system that are not determined by the deduplication detection phase to be stored in the deduplication system, and commits reference data for chunks that are already determined to be stored in the deduplication system.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
  - 2. The system of claim 1 further comprising a chunking algorithm selector configured to select a chunking algorithm from among a plurality of available chunking algorithms of the chunking phase.
  - 3. The system of claim 1 wherein the chunking phase comprises at least two modules that perform chunking of different subsets of the files in parallel.
  - 4. The system of claim 1 wherein the commit phase comprises at least two modules that store chunks into one or more chunk stores of the deduplication system in parallel.
  - 5. The system of claim 1 further comprising a compression phase including one or more modules that each correspond to a compression algorithm that compresses at least one of the chunks before committing that chunk to the deduplication system, and if the compression phase includes a plurality of available compression algorithms, the system further comprising a compression algorithm selector configured to select a compression algorithm from among the plurality of available compression algorithms.
  - 6. The system of claim 1 further comprising a compression phase comprising at least two modules that perform compression of different subsets of the chunks in parallel, or a hashing phase comprising at least two modules that perform hashing of different subsets of the chunks in parallel, or both at least two modules that perform compression of different subsets of the chunks in parallel, and at least two modules that perform hashing of different subsets of the chunks in parallel.
  - 7. The system of claim 1 further comprising, a scanning phase, including a groveler that selects files for deduplication via the pipeline, the groveler configured to access policy to determine which files to select for deduplication.
  - 8. The system of claim 7 wherein the groveler operates on a snapshot of the files, and processes the snapshot to log selected files for further processing.
  - 9. The system of claim 7 further comprising, a selection phase, the selection phase configured to receive the files identified via the scanning phase or another mechanism, or both, and to access policy to perform filtering, ranking, sorting or grouping of the files, or any combination of filtering, ranking, sorting or grouping of the files before providing the files for further processing via the pipeline.
  - 10. The system of claim 1 wherein the pipeline is configured to perform batch processing on a plurality of files, the plurality of files batched in a file queue or other batched grouping of files.
  - 11. The system of claim 1 wherein the pipeline is configured to perform batch processing on a plurality of chunks, the plurality of chunks batched in a chunk queue or other batched grouping of chunks.
  - 12. The system of claim 1 wherein the pipeline is coupled to a hosting process configured to host a hosted module, the hosting process configured with a data access component that securely accesses data for processing by the hosted module.
  - 13. The system of claim 1 wherein the pipeline is tunable based upon feedback to select at least one module, tune at least one module, configure at least one module, change at least one module, or extend by adding at least one module to the pipeline, or any combination thereof, wherein the feedback comprises internal feedback based on data or file properties or both discovered by the pipeline, or external feedback based on information of previously deduplicated data of one or more other deduplication systems, or both internal feedback and external feedback.
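
Claims 2 and 3 describe selecting among chunking algorithms and chunking different subsets of files in parallel. A rough sketch of that arrangement follows. The extension-based policy table `CHUNKERS`, the two chunkers, and the function names are hypothetical, and a thread pool is used only to show the structure; in standard CPython, threads do not give real CPU parallelism for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Chunker = Callable[[bytes], List[bytes]]


def fixed_chunker(data: bytes, size: int = 8) -> List[bytes]:
    """Cut data into blocks of `size` bytes (the last block may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def line_chunker(data: bytes) -> List[bytes]:
    """Cut data at line boundaries, keeping line endings so chunks rejoin exactly."""
    return data.splitlines(keepends=True)


# Hypothetical policy table: file extension -> chunking module.
CHUNKERS: Dict[str, Chunker] = {".txt": line_chunker, ".log": line_chunker}


def select_chunker(filename: str) -> Chunker:
    """Chunking algorithm selector: choose a module from the file name."""
    lowered = filename.lower()
    for ext, chunker in CHUNKERS.items():
        if lowered.endswith(ext):
            return chunker
    return fixed_chunker  # default when no policy entry matches


def chunk_files(files: Dict[str, bytes], workers: int = 2) -> Dict[str, List[bytes]]:
    """Chunk each file with its selected module, spreading files across workers."""
    def work(item: Tuple[str, bytes]) -> Tuple[str, List[bytes]]:
        name, data = item
        return name, select_chunker(name)(data)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(work, files.items()))
```

For example, `chunk_files({"a.txt": b"x\ny\n", "b.bin": b"0123456789"})` chunks the text file by line and the binary file in 8-byte blocks.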

14. In a computing environment, a method performed at least in part on at least one processor, comprising:
- receiving files to deduplicate via phases of a data deduplication pipeline;
- processing the data of the files into chunks in a modular chunking phase comprising one or more chunking algorithms;
- providing the chunks to an indexing phase that determines whether each of the chunks already exists in a deduplication system;
- committing each chunk in a chunk storing phase if that chunk was determined to not already exist in the deduplication system, and committing reference data for that chunk if that chunk was determined to exist in the deduplication system; and
- committing reference information to the file corresponding to the chunk or chunks extracted from that file.
- Dependent claims (15, 16, 17, 18)
  - 15. The method of claim 14 further comprising, selecting a chunking algorithm from among a plurality of available chunking algorithms based upon the file data to be chunked.
  - 16. The method of claim 14 further comprising, obtaining a snapshot of a set of candidate files to deduplicate, scanning the candidate files to select files to deduplicate and logging the files to deduplicate into a log, and processing the files in the log based upon properties of the files, statistical properties of the files, statistically inferred properties of a file dataset, internal feedback, or external feedback, or any combination of properties of the files, statistical properties of the files, statistically inferred properties of a file dataset, internal feedback, or external feedback, to perform filtering, ranking, sorting or grouping of the files, or any combination of filtering, ranking, sorting or grouping of the files, and outputting the files to be received for further deduplication processing.
  - 17. The method of claim 14 further comprising, compressing at least one chunk, including selecting a compression algorithm from among a plurality of available compression algorithms based upon the chunk's data, chunk metadata, file data or file metadata, or any combination of the chunk's data, chunk metadata, file data or file metadata.
  - 18. The method of claim 14 wherein processing the data of the files into chunks comprises accessing the data via a secure process that contains at least one of the one or more chunking algorithms, including obtaining a duplicate file handle at the secure process, and using the duplicate file handle to access the data.
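
Claim 17 covers choosing a compression algorithm from chunk or file data and metadata. One way this could look is sketched below. The selection rules (file extension and chunk length), the `CODECS` table, and the fallback to storing a chunk uncompressed are assumptions made for illustration; the claim does not prescribe them.

```python
import bz2
import zlib
from typing import Callable, Dict, Tuple

# Each codec is a (compress, decompress) pair from the standard library.
Codec = Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]

CODECS: Dict[str, Codec] = {
    "none": (lambda b: b, lambda b: b),
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
}

# Hypothetical file metadata rules.
ALREADY_COMPRESSED = (".zip", ".jpg", ".mp4")
TEXT_LIKE = (".txt", ".log")


def select_codec(filename: str, chunk: bytes) -> str:
    """Compression algorithm selector based on file name and chunk size."""
    lowered = filename.lower()
    if lowered.endswith(ALREADY_COMPRESSED):
        return "none"   # recompressing such data rarely helps
    if len(chunk) < 64:
        return "none"   # codec overhead likely outweighs any savings
    if lowered.endswith(TEXT_LIKE):
        return "bz2"
    return "zlib"


def compress_chunk(filename: str, chunk: bytes) -> Tuple[str, bytes]:
    """Compress a chunk and return (codec name, payload) for the commit phase."""
    name = select_codec(filename, chunk)
    packed = CODECS[name][0](chunk)
    if len(packed) >= len(chunk):
        return "none", chunk  # keep the raw chunk if compression did not shrink it
    return name, packed


def decompress_chunk(codec_name: str, payload: bytes) -> bytes:
    """Reverse compress_chunk using the codec name stored with the chunk."""
    return CODECS[codec_name][1](payload)
```

Storing the codec name next to each payload lets chunks compressed by different modules coexist in one chunk store.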

19. One or more computer-readable media having computer-executable instructions, which when executed perform steps, comprising:
- selecting files for data deduplication;
- queuing the files for batch processing;
- processing the files into chunks in a secure modular chunking phase comprising one or more chunking algorithms;
- queuing the chunks for batch processing;
- processing the chunks to determine whether each chunk already exists in a deduplication system, and if not, storing each chunk that does not already exist to the deduplication system, and if so, storing reference data for each chunk that already exists;
- committing the chunk or chunks, or chunk reference data for a session, or the chunk or chunks and chunk reference data for a session, to the deduplication system, in conjunction with updating an index to each chunk that did not already exist in the deduplication system; and
- updating file metadata to associate the file with references to the chunk or chunks.
- Dependent claims (20)
  - 20. The one or more computer-readable media of claim 19 wherein processing the files into chunks comprises executing a plurality of chunking algorithms in parallel to process different subsets of the files into chunks in parallel operations.
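
The batch-oriented steps of claim 19 (a file queue, a chunk queue, and a per-session commit that updates the index) can be sketched as follows. This is a simplified illustration under stated assumptions: the index is a plain dictionary keyed by SHA-256 digest, chunks are fixed-size, the secure hosting process is omitted, and `run_session` is a hypothetical name.

```python
import hashlib
from collections import deque
from typing import Deque, Dict, List, Tuple


def run_session(
    files: Dict[str, bytes],
    index: Dict[str, bytes],
    chunk_size: int = 4,
) -> Dict[str, List[str]]:
    """Deduplicate a batch of files in one session.

    `index` maps chunk digest -> chunk bytes and is updated in place at commit.
    Returns file metadata: file name -> ordered list of chunk digests.
    """
    # Queue the selected files for batch processing.
    file_queue: Deque[Tuple[str, bytes]] = deque(files.items())
    chunk_queue: Deque[Tuple[str, bytes]] = deque()

    # Chunking phase: drain the file queue into the chunk queue.
    while file_queue:
        name, data = file_queue.popleft()
        for offset in range(0, len(data), chunk_size):
            chunk_queue.append((name, data[offset:offset + chunk_size]))

    pending: Dict[str, bytes] = {}  # new chunks staged for this session
    metadata: Dict[str, List[str]] = {name: [] for name in files}

    # Detection phase: stage unseen chunks, record a reference for every chunk.
    while chunk_queue:
        name, chunk = chunk_queue.popleft()
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in index and digest not in pending:
            pending[digest] = chunk
        metadata[name].append(digest)

    # Commit phase: publish the session's new chunks and update the index together.
    index.update(pending)
    return metadata
```

Staging new chunks in `pending` and applying them in one step at the end is one simple way to model a session commit; a real system would need durable storage and failure handling, which this sketch does not attempt.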

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Oltean, Paul Adrian; Kalach, Ran; Benton, James Robert; El-Shimi, Ahmed M.
Primary Examiner(s): AL HASHEMI, SANA A
Application Number: US12/970,839
Publication Number: US 20120158672A1
Time in Patent Office: 796 Days
Field of Search: 707/692, 707/770, 707/803, 707/809, 707/811
US Class Current: 707/692
CPC Class Codes:
G06F 16/11 File system administration,...
G06F 16/13 File access structures, e.g...
