Methods and apparatus for content-aware data de-duplication
Abstract
The systems and methods partition digital data units in a content-aware fashion without relying on any ancestry information, which makes it possible to find duplicate chunks in unrelated units of digital data, even across millions of documents spread across thousands of computer systems.
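As a rough illustration of finding duplicate chunks in unrelated data, the sketch below assumes a toy text format that is not taken from the patent: the first line is an instance-specific header, and the body is split into paragraphs at blank lines. The names `partition` and `block_id` and the choice of SHA-256 are likewise assumptions made for this example.

```python
import hashlib


def partition(document: str):
    """Split a toy document into (additional_data, blocks).

    Assumed format (hypothetical): line 1 is an instance-specific header,
    the remainder is paragraphs separated by blank lines.
    """
    header, _, body = document.partition("\n")
    blocks = [p for p in body.split("\n\n") if p]
    return header, blocks


def block_id(block: str) -> str:
    # Identifier derived only from block content, so equal blocks
    # get equal identifiers regardless of which document they came from.
    return hashlib.sha256(block.encode("utf-8")).hexdigest()


# Two unrelated documents: different headers, different block order.
doc_a = "HDR author=alice id=1\nShared paragraph.\n\nOnly in A."
doc_b = "HDR author=bob id=2\nOnly in B.\n\nShared paragraph."

_, blocks_a = partition(doc_a)
_, blocks_b = partition(doc_b)

shared = {block_id(b) for b in blocks_a} & {block_id(b) for b in blocks_b}
print(len(shared))  # number of blocks common to both documents
```

Because the identifiers depend only on block content, the shared paragraph is detected even though the two documents have no common history and place it at different positions.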
20 Claims
1. A method comprising:
partitioning digital data into a plurality of blocks, including a first block, and additional data, wherein the additional data includes at least one of position-dependent data, instance-dependent data, format-specific headers or footers, and format-specific transformations, and wherein a combination of the plurality of blocks and the additional data together represents all of the digital data;
generating a file identifier based at least in part on the digital data;
associating the file identifier with the digital data;
generating a block identifier based at least in part on the first block;
associating the block identifier with the first block;
determining if the first block has already been stored;
storing the first block if the first block has not already been stored;
determining if a block map associated with the file identifier has already been stored, wherein the block map includes block identifiers associated respectively with each block of which the digital data is comprised, and
if the block map associated with the file identifier has not already been stored:
creating the block map, storing the block map, associating the additional data with the block map, and associating the file identifier with the block map.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
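The steps of claim 1 can be sketched as a small in-memory store. This is a minimal illustration under stated assumptions, not the patented implementation: a fixed-length header stands in for the "additional data", fixed-size slicing stands in for content-aware partitioning, SHA-256 is used for both identifiers, and the names `DedupStore`, `BlockMap`, `put`, and `get` are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class BlockMap:
    block_ids: list          # ordered block identifiers for one file
    additional_data: bytes   # data kept outside the de-duplicated blocks


@dataclass
class DedupStore:
    blocks: dict = field(default_factory=dict)      # block id -> block bytes
    block_maps: dict = field(default_factory=dict)  # file id -> BlockMap

    def put(self, data: bytes, header_len: int, block_size: int) -> str:
        # Partition into additional data (assumed: a fixed-length header)
        # and blocks (assumed: fixed-size slices of the remainder).
        additional, body = data[:header_len], data[header_len:]
        chunks = [body[i:i + block_size] for i in range(0, len(body), block_size)]

        file_id = digest(data)  # file identifier based on the digital data
        ids = []
        for chunk in chunks:
            bid = digest(chunk)          # block identifier based on the block
            if bid not in self.blocks:   # store only blocks not already stored
                self.blocks[bid] = chunk
            ids.append(bid)

        # Create and store the block map only if none exists for this file id.
        if file_id not in self.block_maps:
            self.block_maps[file_id] = BlockMap(ids, additional)
        return file_id

    def get(self, file_id: str) -> bytes:
        # Blocks plus additional data together reproduce all of the data.
        bm = self.block_maps[file_id]
        return bm.additional_data + b"".join(self.blocks[b] for b in bm.block_ids)


store = DedupStore()
a = store.put(b"HDR1" + b"aaaabbbb", header_len=4, block_size=4)
b = store.put(b"HDR2" + b"bbbbcccc", header_len=4, block_size=4)
print(len(store.blocks))            # "bbbb" is stored once, so three blocks
print(store.get(a) == b"HDR1aaaabbbb")
```

Keeping the header in the block map rather than in a block is what lets the two inputs share the `bbbb` block even though their headers differ.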
11. A system comprising:
a memory capable of storing data; and
a processor configured for:
partitioning digital data into a plurality of blocks, including a first block, and additional data, wherein the additional data includes at least one of position-dependent data, instance-dependent data, format-specific headers or footers, and format-specific transformations, and wherein a combination of the plurality of blocks and the additional data together represents all of the digital data;
generating a file identifier based at least in part on the digital data;
associating the file identifier with the digital data;
generating a block identifier based at least in part on the first block;
associating a block identifier with the first block;
determining if the first block has already been stored;
storing the first block in the memory if the first block has not already been stored;
determining if a block map associated with the file identifier has already been stored, wherein the block map includes block identifiers associated respectively with each block of which the digital data is comprised, and
if the block map associated with the file identifier has not already been stored:
creating the block map, storing the block map in the memory, associating the additional data with the block map, and associating the file identifier with the block map.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
Specification