De-duplication of files for continuous data protection with remote storage
Abstract
Technologies are described herein for performing data de-duplication of a version of a data file for backup to a remote storage location. A CDP module executing on a computer creates a collection of files corresponding to the version of the data file by de-duplicating the version against a previous version master file stored locally on the computer. The previous version master file contains one or more unique data blocks of a specific block size from a previous version of the data file. Once the de-duplication against the locally maintained previous version master file is complete, the CDP module stores the collection of files corresponding to the version of the data file to the remote storage location. The remote storage location also contains a master file corresponding to the data file that contains all of the unique data blocks in the previous version master file.
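As a rough illustration of what a previous version master file holds, the sketch below splits a previous version into fixed-size blocks and keeps a single instance of each distinct block. The function name `build_master_blocks`, the in-memory list representation, and the first-seen ordering are assumptions made for this example; the abstract does not specify an on-disk layout.

```python
def build_master_blocks(data: bytes, block_size: int) -> list:
    """Return each distinct block of `data` exactly once, in first-seen order.

    Hypothetical helper: the in-memory list stands in for the master file.
    A trailing block shorter than `block_size` is kept as-is (an assumption).
    """
    seen = set()
    master = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        if block not in seen:
            seen.add(block)
            master.append(block)
    return master


previous_version = b"AAAABBBBAAAACCCC"
master = build_master_blocks(previous_version, block_size=4)
print(master)  # [b'AAAA', b'BBBB', b'CCCC']
```

The repeated block `AAAA` appears only once in the result, which is the "single instance of each unique data block" property the claims rely on.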
15 Claims
1. A computer-implemented method of backing-up a version of a data file to a remote storage, the method comprising executing instructions on a computer to perform the operations of:
- de-duplicating the version of the data file against a previous version master file stored on a local storage device, wherein the previous version master file comprises a single instance of each of one or more unique data blocks of a specific block size from a previous version of the data file, and wherein de-duplicating the version of the data file against the previous version master file comprises determining whether at least one block of data from the version of the data file matches at least one of the one or more unique data blocks of the specific size from the previous version of the data file by:
- maintaining a lightweight checksum for each of the one or more unique data blocks in the previous version master file, and matching a block of data read from the version of the data file to one of the one or more unique data blocks in the previous version master file by calculating a lightweight checksum value for the read block of data and comparing the calculated lightweight checksum value with the lightweight checksums maintained for the one or more unique data blocks in the previous version master file, wherein calculating the lightweight checksum for a subsequently read block of data from the version of the data file comprises subtracting one or more bytes from the lightweight checksum for a previously read block of data and adding one or more bytes from the subsequently read block of data to the lightweight checksum for the previously read block of data;
- creating a supplemental file corresponding to the version of the data file and comprising one or more chunks of data from the version of the data file not matching one of the one or more unique data blocks in the previous version master file;
- creating a version map file corresponding to the version of the data file and comprising one or more references to unique data blocks in the previous version master file and one or more references to chunks of data in the supplemental file, wherein each of the one or more references to unique data blocks in the previous version master file comprise an index to a unique data block and each of the one or more references to chunks of data in the supplemental file comprise a length of a chunk of data; and
- storing the supplemental file and the version map file corresponding to the version of the data file to the remote storage, wherein the remote storage contains a master file corresponding to the data file and comprising each of the unique data blocks referenced in the version map file.
Dependent claims: 2, 3, 4, 5, 6, 7
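The checksum update recited in claim 1 (subtract the bytes that leave the window, add the bytes that enter it) can be sketched as follows. The claim does not name a checksum function, so the plain additive byte sum, the 32-bit mask, and the names `checksum` and `roll` are assumptions chosen only to make the arithmetic concrete.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit checksum width


def checksum(block: bytes) -> int:
    """Lightweight checksum: the sum of the block's bytes (an assumed choice)."""
    return sum(block) & MASK


def roll(prev_sum: int, outgoing: bytes, incoming: bytes) -> int:
    """Derive the next window's checksum from the previous one.

    `outgoing` holds the bytes that slid out of the window and `incoming`
    the bytes that slid in, so the full block is never re-summed.
    """
    return (prev_sum - sum(outgoing) + sum(incoming)) & MASK


data = b"hello world"
block_size = 4
s = checksum(data[:block_size])
for offset in range(1, len(data) - block_size + 1):
    s = roll(s,
             data[offset - 1:offset],
             data[offset + block_size - 1:offset + block_size])
    # The rolled value agrees with a from-scratch computation.
    assert s == checksum(data[offset:offset + block_size])
```

Because an additive sum is weak, distinct blocks can share a checksum; a practical matcher would presumably confirm a candidate with a byte comparison or a stronger hash, which this sketch leaves out.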
8. A non-transitory computer-readable storage medium having computer-executable instructions stored thereon for de-duplicating a current version of a data file that, when executed by a computer, cause the computer to:
- read a first block of data of a particular block size from a current offset in the current version of the data file;
- determine if the first block of data matches one of a plurality of unique data blocks contained in a previous version master file corresponding to a previous version of the data file, wherein the previous version master file is stored on a local storage device of the computer, wherein the previous version master file comprises a single instance of each of the plurality of unique data blocks from the previous version of the data file, wherein determining if the first block of data matches one of the plurality of unique data blocks contained in the previous version master file comprises calculating a lightweight checksum value for the first block of data and comparing the calculated lightweight checksum value with lightweight checksums maintained for the plurality of unique data blocks in the previous version master file, and wherein calculating the lightweight checksum for a subsequently read block of data from the current version of the data file comprises subtracting one or more bytes from the lightweight checksum for a previously read block of data and adding one or more bytes from the subsequently read block of data to the lightweight checksum for the previously read block of data;
- upon determining that the first block of data matches one of the plurality of unique data blocks contained in the previous version master file, appending a reference to the matching unique data block to a version map file corresponding to the current version of the data file;
- upon determining that the first block of data does not match one of the plurality of unique data blocks contained in the previous version master file, creating a supplemental file corresponding to the current version of the data file and comprising a chunk of data from the current version of the data file not matching one of the plurality of unique data blocks contained in the previous version master file, appending a reference to the chunk to the version map file and increasing the current offset by a slide size; and
- reading a next block of data of the particular block size from the current offset in the current version of the data file, wherein the reference to the matching unique data block comprises an index to the matching unique data block and the reference to the chunk comprises a length of the chunk.
Dependent claims: 9, 10, 11
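The read/match/slide loop of claim 8 might look roughly like the sketch below. Several details are assumptions rather than anything the claim states: the additive checksum, confirming a checksum hit with a byte comparison, advancing by the block length after a match, treating each unmatched chunk as exactly the bytes covered by one slide, and the tuple encoding `("block", index)` / `("chunk", length)` for version map entries. In-memory values stand in for the files.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit checksum width


def checksum(block: bytes) -> int:
    return sum(block) & MASK


def deduplicate(current: bytes, master_blocks: list, block_size: int, slide_size: int):
    """Return (version_map, supplemental) for `current` against `master_blocks`."""
    assert 0 < slide_size <= block_size
    # Lightweight checksums maintained for the master blocks.
    by_sum = {}
    for index, block in enumerate(master_blocks):
        by_sum.setdefault(checksum(block), []).append(index)

    version_map = []            # ("block", index) or ("chunk", length)
    supplemental = bytearray()  # unmatched data, in order
    offset = 0
    window_sum = checksum(current[:block_size])
    while offset < len(current):
        block = current[offset:offset + block_size]
        # Checksum hit first, then a byte comparison to rule out collisions.
        match = next((i for i in by_sum.get(window_sum, ())
                      if master_blocks[i] == block), None)
        if match is not None:
            version_map.append(("block", match))
            offset += len(block)
            window_sum = checksum(current[offset:offset + block_size])
        else:
            chunk = current[offset:offset + slide_size]
            supplemental += chunk
            version_map.append(("chunk", len(chunk)))
            # Roll the checksum: subtract outgoing bytes, add incoming bytes.
            incoming = current[offset + block_size:offset + block_size + slide_size]
            window_sum = (window_sum - sum(chunk) + sum(incoming)) & MASK
            offset += slide_size
    return version_map, bytes(supplemental)


master = [b"AAAA", b"BBBB", b"CCCC"]
vmap, supp = deduplicate(b"AAAAxyBBBBCCCC", master, block_size=4, slide_size=1)
print(vmap)  # [('block', 0), ('chunk', 1), ('chunk', 1), ('block', 1), ('block', 2)]
print(supp)  # b'xy'
```

With a slide size of 1, the two inserted bytes `xy` end up in the supplemental data and the blocks after them are still recognized even though they no longer sit on a block boundary.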
12. A system backing-up a version of a data file to a remote storage, the system comprising a continuous data protection (“CDP”) module executing on a user computer and configured to:
- create a collection of files corresponding to the version of the data file by de-duplicating the version of the data file against a previous version master file stored on a local storage device of the user computer, wherein the previous version master file comprises a single instance of each of one or more unique data blocks of a specific block size from a previous version of the data file, and wherein de-duplicating the version of the data file against the previous version master file comprises determining whether at least one block of data from the version of the data file matches at least one of the one or more unique data blocks of the specific size from the previous version of the data file by: maintaining a lightweight checksum for each of the one or more unique data blocks in the previous version master file, and matching a block of data read from the version of the data file to one of the one or more unique data blocks in the previous version master file by calculating a lightweight checksum value for the read block of data and comparing the calculated lightweight checksum value with the lightweight checksums maintained for the one or more unique data blocks in the previous version master file, wherein calculating the lightweight checksum for a subsequently read block of data from the version of the data file comprises subtracting one or more bytes from the lightweight checksum for a previously read block of data and adding one or more bytes from the subsequently read block of data to the lightweight checksum for the previously read block of data; and
- store the collection of files corresponding to the version of the data file to the remote storage, wherein the remote storage contains a master file corresponding to the data file and comprising the one or more unique data blocks in the previous version master file, wherein the collection of files comprises:
- a supplemental file corresponding to the version of the data file and comprising one or more chunks of data from the version of the data file not matching one of the one or more unique data blocks in the previous version master file; and
- a version map file corresponding to the version of the data file and comprising one or more references to unique data blocks in the previous version master file and one or more references to chunks of data in the supplemental file, wherein each of the one or more references to unique data blocks in the previous version master file comprise an index to a unique data block and each of the one or more references to chunks of data in the supplemental file comprise a length of a chunk of data.
Dependent claims: 13, 14, 15
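The claims above do not describe restoration, but a short sketch shows why a block index and a chunk length are enough for each kind of version map reference, given that the remote master file holds the referenced blocks. The function name `restore`, the tuple encoding of map entries, and the assumption that chunks are stored back to back in the supplemental file in map order are all hypothetical.

```python
def restore(master_blocks: list, supplemental: bytes, version_map: list) -> bytes:
    """Rebuild a file version from master blocks, supplemental data and its map.

    Assumed map encoding: ("block", index) points into `master_blocks`;
    ("chunk", length) consumes the next `length` bytes of `supplemental`.
    """
    out = bytearray()
    cursor = 0  # read position within the supplemental data
    for kind, value in version_map:
        if kind == "block":
            out += master_blocks[value]
        else:
            out += supplemental[cursor:cursor + value]
            cursor += value
    return bytes(out)


master = [b"AAAA", b"BBBB", b"CCCC"]
vmap = [("block", 0), ("chunk", 2), ("block", 1), ("block", 2)]
print(restore(master, b"xy", vmap))  # b'AAAAxyBBBBCCCC'
```

A chunk reference needs no offset under this assumption, because the running total of earlier chunk lengths gives the read position in the supplemental data.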
Specification