Method and system to improve deduplication of structured datasets using hybrid chunking and block header removal
First Claim
1. A computer-implemented method, comprising:
receiving a request at a system to deduplicate a file having a plurality of data blocks, each data block having a header and a data portion, wherein the file is received from a client application of a client device over a network to be stored in the system;
scanning to search a predetermined signature embedded within a header of each data block to identify a block boundary between the header and the data portion;
anchoring the data blocks using first anchors to indicate block boundaries based on the scanning of the predetermined signature, including recognizing a plurality of markers within the data portions of the data blocks, wherein the markers were inserted into the data blocks by the client application prior to receiving the file over the network, removing the recognized markers from the file, and anchoring the data blocks using the first anchors at locations of the removed markers, wherein an anchor denotes a boundary between two data blocks;
adding at least one second anchor within a data portion of at least one data block that has been anchored by two of the first anchors, if the data portion of at least one data block satisfies a predetermined condition, wherein the second anchor is located between two first anchors;
separating data portions of the data blocks from the headers based on the first anchors;
chunking the data portion of the data blocks based on the first anchors and the at least one second anchor, generating a plurality of data chunks; and
deduplicating the data chunks of the data portions of the data blocks.
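The first-anchor steps of the claim (signature scanning, marker removal, and header separation) could be sketched roughly as follows. This is only an illustration under stated assumptions: the signature bytes, the fixed header size, the marker bytes, and every function name are hypothetical and do not come from the patent.

```python
from __future__ import annotations

# All three constants are hypothetical; the claim does not specify them.
SIGNATURE = b"BLKH"        # assumed signature at the start of each header
HEADER_SIZE = 8            # assumed fixed header length, signature included
MARKER = b"\x00MRK\x00"    # assumed marker inserted by the client application


def strip_markers(raw: bytes) -> tuple[bytes, list[int]]:
    """Remove every marker; return the cleaned bytes and where markers were."""
    cleaned = bytearray()
    offsets = []
    pos = 0
    while (hit := raw.find(MARKER, pos)) != -1:
        cleaned += raw[pos:hit]
        offsets.append(len(cleaned))   # offset in the marker-free stream
        pos = hit + len(MARKER)
    cleaned += raw[pos:]
    return bytes(cleaned), offsets


def signature_offsets(data: bytes) -> list[int]:
    """Scan for the header signature and return each offset where it starts."""
    offsets = []
    pos = data.find(SIGNATURE)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(SIGNATURE, pos + len(SIGNATURE))
    return offsets


def first_anchors(raw: bytes) -> tuple[bytes, list[int]]:
    """Place first anchors at header signatures and at removed-marker sites."""
    cleaned, marker_offsets = strip_markers(raw)
    anchors = sorted(set(signature_offsets(cleaned)) | set(marker_offsets))
    return cleaned, anchors


def data_portions(cleaned: bytes, anchors: list[int]) -> list[bytes]:
    """Cut the stream at the first anchors and drop the header of each block."""
    bounds = sorted({0, *anchors, len(cleaned)})
    portions = []
    for start, end in zip(bounds, bounds[1:]):
        block = cleaned[start:end]
        if block.startswith(SIGNATURE):
            block = block[HEADER_SIZE:]   # keep only the data portion
        portions.append(block)
    return portions


raw = (b"BLKH\x00\x00\x00\x01" + b"aa" + MARKER + b"bb"
       + b"BLKH\x00\x00\x00\x02" + b"cccc")
cleaned, anchors = first_anchors(raw)
portions = data_portions(cleaned, anchors)
```

In this sketch a signature that happens to occur inside a data portion would be mistaken for a header; a real block format would presumably need a length field or escaping to rule that out.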
Abstract
Techniques for deduplicating structured datasets using hybrid chunking and header removal. According to one embodiment, a request is received to deduplicate a file having a plurality of data blocks, each data block having a header and a data portion. The data blocks are anchored using first anchors to indicate block boundaries based on their headers. At least one second anchor is added within a data portion of at least one data block if the data portion of at least one data block satisfies a predetermined condition. The data blocks are then deduplicated based on the first and second anchors.
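As a rough illustration of the later stages the abstract describes, the sketch below adds second anchors inside large data portions, chunks at those anchors, and deduplicates the chunks by fingerprint. It assumes the data portions have already been separated from their headers at the first anchors. The "predetermined condition" is not specified here, so a size threshold is assumed; the content-derived CRC test, the constants, and the function names are likewise hypothetical.

```python
from __future__ import annotations

import hashlib
import zlib

# Hypothetical tuning values; none of them come from the patent text.
MAX_PORTION = 16   # assumed condition: portions longer than this get second anchors
WINDOW = 4         # bytes examined before each candidate anchor position
DIVISOR = 8        # on average, roughly one candidate in eight becomes an anchor


def second_anchors(portion: bytes) -> list[int]:
    """Return content-derived offsets strictly inside a large data portion."""
    if len(portion) <= MAX_PORTION:
        return []                       # condition not satisfied: no second anchor
    anchors = []
    last = 0
    for i in range(WINDOW, len(portion)):
        if i - last < WINDOW:
            continue                    # keep chunks at least WINDOW bytes long
        if zlib.crc32(portion[i - WINDOW:i]) % DIVISOR == 0:
            anchors.append(i)
            last = i
    return anchors


def chunk(portion: bytes) -> list[bytes]:
    """Split one data portion at its second anchors."""
    bounds = [0, *second_anchors(portion), len(portion)]
    return [portion[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


def deduplicate(portions: list[bytes]) -> tuple[list[str], dict[str, bytes]]:
    """Store each distinct chunk once; return the recipe and the chunk store."""
    store: dict[str, bytes] = {}
    recipe = []
    for portion in portions:
        for piece in chunk(portion):
            digest = hashlib.sha256(piece).hexdigest()
            store.setdefault(digest, piece)   # keep only the first copy
            recipe.append(digest)
    return recipe, store


portions = [b"x" * 8, bytes(range(64)), bytes(range(64)), b"x" * 8]
recipe, store = deduplicate(portions)
restored = b"".join(store[d] for d in recipe)
```

Because the second anchors depend only on the bytes of the data portion, identical portions are cut at the same places and their chunks collapse to the same fingerprints, which is the property the deduplication step relies on.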
21 Claims
1. A computer-implemented method, comprising:
receiving a request at a system to deduplicate a file having a plurality of data blocks, each data block having a header and a data portion, wherein the file is received from a client application of a client device over a network to be stored in the system;
scanning to search a predetermined signature embedded within a header of each data block to identify a block boundary between the header and the data portion;
anchoring the data blocks using first anchors to indicate block boundaries based on the scanning of the predetermined signature, including recognizing a plurality of markers within the data portions of the data blocks, wherein the markers were inserted into the data blocks by the client application prior to receiving the file over the network, removing the recognized markers from the file, and anchoring the data blocks using the first anchors at locations of the removed markers, wherein an anchor denotes a boundary between two data blocks;
adding at least one second anchor within a data portion of at least one data block that has been anchored by two of the first anchors, if the data portion of at least one data block satisfies a predetermined condition, wherein the second anchor is located between two first anchors;
separating data portions of the data blocks from the headers based on the first anchors;
chunking the data portion of the data blocks based on the first anchors and the at least one second anchor, generating a plurality of data chunks; and
deduplicating the data chunks of the data portions of the data blocks.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A non-transitory computer-readable storage medium having instructions stored therein, which when executed by a processor, cause the processor to perform operations, the operations comprising:
receiving a request at a system to deduplicate a file having a plurality of data blocks, each data block having a header and a data portion, wherein the file is received from a client application of a client device over a network to be stored in the system;
scanning to search a predetermined signature embedded within a header of each data block to identify a block boundary between the header and the data portion;
anchoring the data blocks using first anchors to indicate block boundaries based on the scanning of the predetermined signature, including recognizing a plurality of markers within the data portions of the data blocks, wherein the markers were inserted into the data blocks by the client application prior to receiving the file over the network, removing the recognized markers from the file, and anchoring the data blocks using the first anchors at locations of the removed markers, wherein an anchor denotes a boundary between two data blocks;
adding at least one second anchor within a data portion of at least one data block that has been anchored by two of the first anchors, if the data portion of at least one data block satisfies a predetermined condition, wherein the second anchor is located between two first anchors;
separating data portions of the data blocks from the headers based on the first anchors;
chunking the data portion of the data blocks based on the first anchors and the at least one second anchor, generating a plurality of data chunks; and
deduplicating the data chunks of the data portions of the data blocks.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A storage system, comprising:
a processor;
a memory;
an anchor determination unit loaded in the memory and executed by the processor to receive a request to deduplicate a file having a plurality of data blocks, each data block having a header and a data portion, the anchor determination unit to scan for searching a predetermined signature embedded within a header of each data block to identify a block boundary between the header and the data portion, and to anchor the data blocks using first anchors to indicate block boundaries based on the scanning of the predetermined signature, wherein the file is received from a client application of a client device over a network to be stored in the system, wherein anchoring the data blocks using first anchors comprises recognizing a plurality of markers within the data portions of the data blocks, wherein the markers were inserted into the data blocks by the client application prior to receiving the file over the network, removing the recognized markers from the file, and anchoring the data blocks using the first anchors at locations of the removed markers, wherein an anchor denotes a boundary between two data blocks;
an anchor adder executed by the processor to scan data portions of the data blocks and to add at least one second anchor within a data portion of at least one data block that has been anchored by two of the first anchors, if the data portion of at least one data block satisfies a predetermined condition, wherein the second anchor is located between two first anchors; and
a duplication eliminator executed by the processor to separate data portions of the data blocks from the headers based on the first anchors, to chunk the data portion of the data blocks based on the first anchors and the at least one second anchor, generating a plurality of data chunks, and to deduplicate the data chunks of the data portions of the data blocks.
Dependent claims: 16, 17, 18, 19, 20, 21
Specification