Method and system for handling object boundaries of a data stream to optimize deduplication
Abstract
Techniques for deduplicating a data stream based on boundary markers embedded therein are described. According to one embodiment, a data stream having a sequence of a plurality of data objects is received from a client, where the data stream represents a file or a directory of one or more files of a file system associated with the client. In response, the data stream is deduplicated into a plurality of deduplicated chunks in view of boundaries of the data objects.
18 Claims
1. A computer-implemented method for deduplicating data, comprising:
receiving at a storage system over a network from a client a data stream having a sequence of a plurality of data objects, the data stream representing a file or a directory of one or more files of a file system associated with the client, wherein the data stream includes a plurality of boundary markers inserted by the client prior to being received at the storage system;
scanning the data stream to recognize a plurality of boundary markers each being associated with each of the data objects, the boundary markers identifying boundaries of the data objects; and
deduplicating the data stream into a plurality of deduplicated chunks in view of boundaries of the data objects marked by the boundary markers, wherein deduplicating the data stream comprises
anchoring the data stream using a predetermined chunking algorithm to create a plurality of anchor points, each anchor point identifying a chunking boundary for deduplication;
relocating at least one of the anchor points to a location that is identified by at least one boundary marker; and
chunking the data stream into the deduplicated chunks based on the anchor points that include at least one relocated anchor point.
Dependent claims: 2, 3, 4.
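The receiving and scanning steps of claim 1 can be illustrated with a minimal Python sketch. The claims do not specify how a boundary marker is encoded, so everything concrete here is an assumption made for illustration: the `MAGIC` tag, the 8-byte length header, and the hypothetical helpers `insert_markers` (client side) and `scan_markers` (storage side). The scanner returns the marker-free payload together with the offset at which each data object begins.

```python
import struct

# Hypothetical marker layout (not taken from the patent):
# a fixed tag followed by the object's length as an 8-byte big-endian integer.
MAGIC = b"\x00BNDY\x00"
HEADER = struct.Struct(">Q")


def insert_markers(objects):
    """Client side: prefix each data object with a marker and its length."""
    return b"".join(MAGIC + HEADER.pack(len(obj)) + obj for obj in objects)


def scan_markers(stream):
    """Storage side: recognize markers and return (payload, boundaries).

    `boundaries` holds the offset in the marker-free payload at which
    each data object starts.
    """
    payload = bytearray()
    boundaries = []
    pos = 0
    while pos < len(stream):
        if stream[pos:pos + len(MAGIC)] != MAGIC:
            raise ValueError(f"expected boundary marker at offset {pos}")
        pos += len(MAGIC)
        (length,) = HEADER.unpack_from(stream, pos)
        pos += HEADER.size
        boundaries.append(len(payload))  # the next object starts here
        payload += stream[pos:pos + length]
        pos += length
    return bytes(payload), boundaries


if __name__ == "__main__":
    stream = insert_markers([b"alpha", b"be", b"gamma!"])
    print(scan_markers(stream))
```

A length-prefixed marker is only one possible design; it lets the scanner skip object contents without searching for the tag, which avoids false matches when the tag bytes happen to occur inside an object.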
5. A computer-implemented method for deduplicating data, comprising:
receiving from a client a data stream having a sequence of a plurality of data objects, the data stream representing a file or a directory of one or more files of a file system associated with the client;
scanning the data stream to recognize a plurality of boundary markers each being associated with each of the data objects, the boundary markers identifying boundaries of the data objects; and
deduplicating the data stream into a plurality of deduplicated chunks in view of boundaries of the data objects, wherein deduplicating the data stream comprises
anchoring the data stream using a predetermined chunking algorithm to create a plurality of anchor points, each anchor point identifying a chunking boundary for deduplication,
relocating at least one of the anchor points to a location that is identified by at least one boundary marker, including determining a distance between a first anchor point and an adjacent data object boundary that is identified by a first boundary marker, and relocating the first anchor point to the adjacent data object boundary within the data stream if the distance is below a predetermined threshold, and
chunking the data stream into the deduplicated chunks based on the anchor points that include at least one relocated anchor point.
Dependent claim: 6.
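The anchoring, relocating, and chunking steps of claim 5 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the claims only require "a predetermined chunking algorithm", so the windowed byte-sum rule in `find_anchors` is a stand-in, and `WINDOW`, `DIVISOR`, the function names, and the SHA-256 fingerprint store are all hypothetical choices. The relocation step follows the claim language: measure the distance from an anchor point to the adjacent object boundary and move the anchor there only if the distance is below a threshold.

```python
import bisect
import hashlib

WINDOW = 8    # bytes per sliding window (assumed value)
DIVISOR = 64  # anchor when the window sum is a multiple of this (assumed value)


def find_anchors(data):
    """Stand-in content-defined anchoring: offsets where a windowed sum matches."""
    anchors = []
    rolling = sum(data[:WINDOW])
    for end in range(WINDOW, len(data)):
        # `rolling` is the sum of data[end - WINDOW:end] at this point.
        if rolling % DIVISOR == 0:
            anchors.append(end)
        rolling += data[end] - data[end - WINDOW]
    return anchors


def relocate_anchors(anchors, boundaries, threshold):
    """Move an anchor to its nearest object boundary if closer than `threshold`."""
    boundaries = sorted(boundaries)
    result = set()
    for anchor in anchors:
        i = bisect.bisect_left(boundaries, anchor)
        neighbours = boundaries[max(i - 1, 0):i + 1]  # boundary before and after
        nearest = min(neighbours, key=lambda b: abs(b - anchor), default=None)
        if nearest is not None and abs(nearest - anchor) < threshold:
            result.add(nearest)   # relocated anchor point
        else:
            result.add(anchor)    # anchor left where the algorithm put it
    return sorted(result)


def chunk_stream(data, anchors):
    """Cut the stream at each anchor point (anchors must be sorted)."""
    cuts = [0] + [a for a in anchors if 0 < a < len(data)] + [len(data)]
    return [data[s:e] for s, e in zip(cuts, cuts[1:]) if s < e]


def deduplicate(chunks, store):
    """Keep one copy of each distinct chunk; return the list of fingerprints."""
    recipe = []
    for chunk in chunks:
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)
        recipe.append(fingerprint)
    return recipe
```

For example, with anchors at offsets 10, 50, and 90, object boundaries at 12 and 100, and a threshold of 8, only the first anchor is within range of a boundary, so `relocate_anchors` returns `[12, 50, 90]`. Raising the threshold to 16 also moves the last anchor, giving `[12, 50, 100]`.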
7. A non-transitory machine-readable medium having instructions stored therein, which when executed by a processor, cause the processor to perform operations for deduplicating data, the operations comprising:
receiving at a storage system over a network from a client a data stream having a sequence of a plurality of data objects, the data stream representing a file or a directory of one or more files of a file system associated with the client, wherein the data stream includes a plurality of boundary markers inserted by the client prior to being received at the storage system;
scanning the data stream to recognize a plurality of boundary markers each being associated with each of the data objects, the boundary markers identifying boundaries of the data objects; and
deduplicating the data stream into a plurality of deduplicated chunks in view of boundaries of the data objects marked by the boundary markers, wherein deduplicating the data stream comprises
anchoring the data stream using a predetermined chunking algorithm to create a plurality of anchor points, each anchor point identifying a chunking boundary for deduplication;
relocating at least one of the anchor points to a location that is identified by at least one boundary marker; and
chunking the data stream into the deduplicated chunks based on the anchor points that include at least one relocated anchor point.
Dependent claims: 8, 9, 10.
11. A non-transitory machine-readable medium having instructions stored therein, which when executed by a processor, cause the processor to perform operations for deduplicating data, the operations comprising:
receiving from a client a data stream having a sequence of a plurality of data objects, the data stream representing a file or a directory of one or more files of a file system associated with the client;
scanning the data stream to recognize a plurality of boundary markers each being associated with each of the data objects, the boundary markers identifying boundaries of the data objects; and
deduplicating the data stream into a plurality of deduplicated chunks in view of boundaries of the data objects, wherein deduplicating the data stream comprises
anchoring the data stream using a predetermined chunking algorithm to create a plurality of anchor points, each anchor point identifying a chunking boundary for deduplication,
relocating at least one of the anchor points to a location that is identified by at least one boundary marker, including determining a distance between a first anchor point and an adjacent data object boundary that is identified by a first boundary marker, and relocating the first anchor point to the adjacent data object boundary within the data stream if the distance is below a predetermined threshold, and
chunking the data stream into the deduplicated chunks based on the anchor points that include at least one relocated anchor point.
Dependent claim: 12.
13. A data processing system, comprising:
a processor; and
a memory storing instructions, which when executed from the memory, cause the processor to
receive over a network from a client a data stream having a sequence of a plurality of data objects, the data stream representing a file or a directory of one or more files of a file system associated with the client, wherein the data stream includes a plurality of boundary markers inserted by the client prior to being received at the data processing system,
scan the data stream to recognize a plurality of boundary markers each being associated with each of the data objects, the boundary markers identifying boundaries of the data objects, and
deduplicate the data stream into a plurality of deduplicated chunks in view of boundaries of the data objects marked by the boundary markers, wherein deduplicating the data stream comprises
anchoring the data stream using a predetermined chunking algorithm to create a plurality of anchor points, each anchor point identifying a chunking boundary for deduplication;
relocating at least one of the anchor points to a location that is identified by at least one boundary marker; and
chunking the data stream into the deduplicated chunks based on the anchor points that include at least one relocated anchor point.
Dependent claims: 14, 15.
16. A data processing system, comprising:
a processor; and
a memory storing instructions, which when executed from the memory, cause the processor to
receive from a client a data stream having a sequence of a plurality of data objects, the data stream representing a file or a directory of one or more files of a file system associated with the client;
scan the data stream to recognize a plurality of boundary markers each being associated with each of the data objects, the boundary markers identifying boundaries of the data objects; and
deduplicate the data stream into a plurality of deduplicated chunks in view of boundaries of the data objects, wherein deduplicating the data stream comprises
anchoring the data stream using a predetermined chunking algorithm to create a plurality of anchor points, each anchor point identifying a chunking boundary for deduplication,
relocating at least one of the anchor points to a location that is identified by at least one boundary marker, including determining a distance between a first anchor point and an adjacent data object boundary that is identified by a first boundary marker, and relocating the first anchor point to the adjacent data object boundary within the data stream if the distance is below a predetermined threshold, and
chunking the data stream into the deduplicated chunks based on the anchor points that include at least one relocated anchor point.
Dependent claims: 17, 18.
Specification