Method and system for processing checksum of a data stream to optimize deduplication
Abstract
Techniques for deduplicating a data stream with checksum data embedded therein are described. According to one embodiment, a first data stream is received from a client, the stream having a plurality of data regions and a plurality of checksums for verifying integrity of the data regions embedded therein, where the first data stream represents a file or a directory of one or more files of a file system associated with the client. In response, the first data stream, with the checksums removed, is deduplicated into a plurality of deduplicated chunks.
18 Claims
1. A computer-implemented method for deduplicating data, comprising:
- receiving at a storage system over a network from a client a first data stream having a plurality of data regions and a plurality of checksums for verifying integrity of the data regions embedded therein, the first data stream representing a file or a directory of one or more files of a file system associated with the client;
- scanning the first data stream to recognize a plurality of checksum markers that identify the checksums, wherein the checksum markers were inserted into the first data stream by the client prior to receiving the first data stream over the network;
- extracting the checksum markers and the checksums from the first data stream to generate a second data stream without the checksum markers and associated checksum data therein; and
- deduplicating the second data stream into a plurality of deduplicated chunks.
(Dependent claims: 2, 3, 4)
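The claim does not prescribe a concrete on-wire encoding, so the scan/extract/deduplicate steps can only be illustrated under assumptions. The sketch below assumes a hypothetical 4-byte marker pattern `CKSM` followed by a one-byte length field, and uses fixed-size chunking with SHA-256 fingerprints for the deduplication step (production systems typically use variable-size, content-defined chunking):

```python
import hashlib

MARKER = b"CKSM"  # hypothetical predetermined pattern; not specified by the claim

def strip_checksum_markers(first_stream: bytes):
    """Scan the first stream, extract (offset, checksum) pairs, and return
    the second stream with markers and checksum data removed."""
    second = bytearray()
    checksums = []
    i = 0
    while i < len(first_stream):
        if first_stream.startswith(MARKER, i):
            length = first_stream[i + len(MARKER)]      # 1-byte length field
            start = i + len(MARKER) + 1
            checksums.append((len(second), first_stream[start:start + length]))
            i = start + length                          # skip marker + checksum
        else:
            second.append(first_stream[i])
            i += 1
    return bytes(second), checksums

def deduplicate(second_stream: bytes, chunk_size: int = 8):
    """Fixed-size chunking with content-hash dedup: identical chunks are
    stored once in the index; the recipe records the chunk sequence."""
    index, recipe = {}, []
    for off in range(0, len(second_stream), chunk_size):
        chunk = second_stream[off:off + chunk_size]
        fp = hashlib.sha256(chunk).hexdigest()
        index.setdefault(fp, chunk)
        recipe.append(fp)
    return index, recipe
```

Because the markers are stripped before chunking, two clients embedding different checksums over identical data still produce identical second streams, which is what lets the duplicate chunks be detected.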
5. A computer-implemented method for deduplicating data, comprising:
- receiving from a client a first data stream having a plurality of data regions and a plurality of checksums for verifying integrity of the data regions embedded therein, the first data stream representing a file or a directory of one or more files of a file system associated with the client;
- scanning the first data stream to recognize a plurality of checksum markers that identify the checksums;
- extracting the checksum markers and the checksums from the first data stream to generate a second data stream without the checksum markers and associated checksums therein;
- deduplicating the second data stream into a plurality of deduplicated chunks; and
- separately storing the deduplicated chunks and the checksum markers and the associated checksums in a storage device, wherein the separately stored checksum markers and the associated checksums are to be incorporated with the deduplicated chunks during a subsequent restoration of the first data stream.
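Claim 5 adds the separate-storage and restoration elements: the deduplicated chunks go to one region of the storage device, the checksum records to another, and restoration re-inserts each checksum at its recorded position. A minimal sketch, assuming a hypothetical marker layout of a 4-byte pattern `CKSM` plus a one-byte length field, and modelling the two storage regions as in-memory Python containers:

```python
MARKER = b"CKSM"  # hypothetical predetermined pattern; not specified by the claim

def store_separately(dedup_index, recipe, checksum_records):
    """Persist deduplicated chunks and checksum records in distinct
    regions of the storage device (modelled here as two containers)."""
    chunk_store = dict(dedup_index)          # fingerprint -> chunk bytes
    checksum_store = list(checksum_records)  # (offset in second stream, checksum)
    return chunk_store, checksum_store

def restore_first_stream(chunk_store, recipe, checksum_store):
    """Reassemble the second stream from its chunk recipe, then re-insert
    each marker + length field + checksum at its recorded offset to
    reproduce the original first stream."""
    stream = bytearray(b"".join(chunk_store[fp] for fp in recipe))
    # Insert from the highest offset down so earlier offsets stay valid.
    for offset, cksum in sorted(checksum_store, reverse=True):
        stream[offset:offset] = MARKER + bytes([len(cksum)]) + cksum
    return bytes(stream)
```

The design point is that the checksum records carry offsets into the marker-free second stream, so restoration is a pure merge of the two separately stored parts.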
6. A computer-implemented method for deduplicating data, comprising:
- receiving from a client a first data stream having a plurality of data regions and a plurality of checksums identified by a plurality of checksum markers for verifying integrity of the data regions embedded therein, the first data stream representing a file or a directory of one or more files of a file system associated with the client, wherein each checksum marker includes a predetermined pattern and a length field indicating a size of the associated checksum that immediately follows the associated checksum marker, wherein each checksum marker is recognized by matching the predetermined pattern, and wherein each checksum marker further includes a type field specifying that the corresponding marker is a checksum marker that is different from other types of markers; and
- deduplicating the first data stream with the checksums removed into a plurality of deduplicated chunks.
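Claim 6 pins down the marker contents: a predetermined pattern, a type field distinguishing checksum markers from other marker types, and a length field giving the size of the checksum that immediately follows. One hypothetical binary layout (all field widths and values here are assumptions for illustration, not taken from the claim):

```python
import struct

PATTERN = b"\xde\xad\xbe\xef"  # hypothetical predetermined pattern
TYPE_CHECKSUM = 0x01           # hypothetical type-field value for checksum markers

def encode_marker(checksum: bytes, marker_type: int = TYPE_CHECKSUM) -> bytes:
    """Assumed layout: pattern | type (1 byte) | length (2 bytes, big-endian),
    with the checksum bytes immediately following the marker."""
    return PATTERN + struct.pack(">BH", marker_type, len(checksum)) + checksum

def parse_marker(buf: bytes, i: int):
    """Recognize a marker by matching the predetermined pattern at buf[i:].
    Returns (type, checksum, index past the checksum), or None if no match."""
    if not buf.startswith(PATTERN, i):
        return None
    mtype, length = struct.unpack_from(">BH", buf, i + len(PATTERN))
    start = i + len(PATTERN) + 3
    return mtype, buf[start:start + length], start + length
```

The length field is what lets a scanner skip past a checksum of any algorithm (CRC32, SHA-256, etc.) without knowing which one produced it, and the type field lets the same pattern-matching pass coexist with other marker types in the stream.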
7. A non-transitory machine-readable medium having instructions stored therein, which when executed by a processor, cause the processor to perform operations for deduplicating data, the operations comprising:
- receiving at a storage system over a network from a client a first data stream having a plurality of data regions and a plurality of checksums for verifying integrity of the data regions embedded therein, the first data stream representing a file or a directory of one or more files of a file system associated with the client;
- scanning the first data stream to recognize a plurality of checksum markers that identify the checksums, wherein the checksum markers were inserted into the first data stream by the client prior to receiving the first data stream over the network;
- extracting the checksum markers and the checksums from the first data stream to generate a second data stream without the checksum markers and associated checksum data therein; and
- deduplicating the second data stream into a plurality of deduplicated chunks.
(Dependent claims: 8, 9, 10)
11. A non-transitory machine-readable medium having instructions stored therein, which when executed by a processor, cause the processor to perform operations for deduplicating data, the operations comprising:
- receiving from a client a first data stream having a plurality of data regions and a plurality of checksums for verifying integrity of the data regions embedded therein, the first data stream representing a file or a directory of one or more files of a file system associated with the client;
- scanning the first data stream to recognize a plurality of checksum markers that identify the checksums;
- extracting the checksum markers and the checksums from the first data stream to generate a second data stream without the checksum markers and associated checksums therein;
- deduplicating the second data stream into a plurality of deduplicated chunks; and
- separately storing the deduplicated chunks and the checksum markers and the associated checksums in a storage device, wherein the separately stored checksum markers and the associated checksums are to be incorporated with the deduplicated chunks during a subsequent restoration of the first data stream.
12. A non-transitory machine-readable medium having instructions stored therein, which when executed by a processor, cause the processor to perform operations for deduplicating data, the operations comprising:
- receiving from a client a first data stream having a plurality of data regions and a plurality of checksums identified by a plurality of checksum markers for verifying integrity of the data regions embedded therein, the first data stream representing a file or a directory of one or more files of a file system associated with the client, wherein each checksum marker includes a predetermined pattern and a length field indicating a size of the associated checksum that immediately follows the associated checksum marker, wherein each checksum marker is recognized by matching the predetermined pattern, and wherein each checksum marker further includes a type field specifying that the corresponding marker is a checksum marker that is different from other types of markers; and
- deduplicating the first data stream with the checksums removed into a plurality of deduplicated chunks.
13. A data processing system, comprising:
- a processor; and
- a memory coupled to the processor to store instructions, which when executed from the memory, cause the processor to receive from a client over a network a first data stream having a plurality of data regions and a plurality of checksums for verifying integrity of the data regions embedded therein, the first data stream representing a file or a directory of one or more files of a file system associated with the client, scan the first data stream to recognize a plurality of checksum markers that identify the checksums, wherein the checksum markers were inserted into the first data stream by the client prior to receiving the first data stream over the network, extract the checksum markers and the checksums from the first data stream to generate a second data stream without the checksum markers and associated checksum data therein, and deduplicate the second data stream into a plurality of deduplicated chunks.
(Dependent claims: 14, 15, 16)
17. A data processing system, comprising:
- a processor; and
- a memory coupled to the processor to store instructions, which when executed from the memory, cause the processor to receive from a client a first data stream having a plurality of data regions and a plurality of checksums for verifying integrity of the data regions embedded therein, the first data stream representing a file or a directory of one or more files of a file system associated with the client, scan the first data stream to recognize a plurality of checksum markers that identify the checksums, extract the checksum markers and the checksums from the first data stream to generate a second data stream without the checksum markers and associated checksums therein, deduplicate the second data stream into a plurality of deduplicated chunks, and separately store the deduplicated chunks and the checksum markers and the associated checksums in a storage device, wherein the separately stored checksum markers and the associated checksums are to be incorporated with the deduplicated chunks during a subsequent restoration of the first data stream.
18. A data processing system, comprising:
- a processor; and
- a memory coupled to the processor to store instructions, which when executed from the memory, cause the processor to receive from a client a first data stream having a plurality of data regions and a plurality of checksums identified by a plurality of checksum markers for verifying integrity of the data regions embedded therein, the first data stream representing a file or a directory of one or more files of a file system associated with the client, wherein each checksum marker includes a predetermined pattern and a length field indicating a size of the associated checksum that immediately follows the associated checksum marker, wherein each checksum marker is recognized by matching the predetermined pattern, and wherein each checksum marker further includes a type field specifying that the corresponding marker is a checksum marker that is different from other types of markers, and deduplicate the first data stream with the checksums removed into a plurality of deduplicated chunks.
Specification