METHODS AND APPARATUS FOR CONTENT-AWARE DATA PARTITIONING
Abstract
The systems and methods partition digital data units in a content-aware fashion without relying on any ancestry information, which enables one to find duplicate chunks in unrelated units of digital data, even across millions of documents spread across thousands of computer systems.
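As a rough illustration of finding duplicate chunks in unrelated units of data, the sketch below treats blank-line-separated paragraphs of plain text as the logical objects and compares their digests across two documents that share no history. The function names, the whitespace normalization, and the use of SHA-256 are assumptions made for this example; the abstract does not prescribe them.

```python
import hashlib


def paragraph_digests(text: str) -> dict:
    """Map SHA-256 digest -> paragraph for each logical object in the text.

    Assumption: in plain text, a "logical object" is a paragraph delimited
    by blank lines, and line wrapping is not part of its content.
    """
    digests = {}
    for para in text.split("\n\n"):
        normalized = " ".join(para.split())  # collapse wrapping and extra spaces
        if normalized:
            key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            digests[key] = normalized
    return digests


def shared_objects(doc_a: str, doc_b: str) -> list:
    """Return the paragraphs that occur in both documents, in doc_a order."""
    a = paragraph_digests(doc_a)
    b = paragraph_digests(doc_b)
    return [a[d] for d in a if d in b]
```

Because the comparison depends only on the content of each paragraph, the shared paragraph is found regardless of where it sits in either document or how its lines were wrapped.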
32 Claims
1. A computer-implemented method for partitioning and storing digital data comprising:
    determining a format of the digital data;
    identifying a source logical object within the digital data, wherein the identification of the source logical object is accomplished at least in part by applying knowledge about the format of the digital data;
    based on the determined format, performing one or more of the following operations to create a resulting object;
        removing position-dependent data from the source logical object,
        removing instance-dependent data from the source logical object,
        removing one or more format-specific headers or footers from the source logical object, and
        removing format-specific transformations from the source logical object;
    determining whether the resulting object has already been stored; and
    storing the resulting object if the resulting object has not already been stored.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
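The steps recited in claim 1 can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the "toydoc" JSON container, its field names (`offset`, `saved_at`, `payload_hex`), zlib as the format-specific transformation, and a SHA-256 digest over an in-memory dictionary as the way to decide whether an object has already been stored. The claim itself does not name any of these.

```python
import hashlib
import json
import zlib


def make_toydoc(payloads, saved_at: str) -> bytes:
    """Build a document in the hypothetical 'toydoc' format used by this sketch."""
    objects = [
        {"offset": i * 100, "saved_at": saved_at,
         "payload_hex": zlib.compress(p).hex()}
        for i, p in enumerate(payloads)
    ]
    return json.dumps({"format": "toydoc", "objects": objects}).encode("utf-8")


def determine_format(data: bytes) -> str:
    """Step 1: determine the format (here, by reading a format tag)."""
    try:
        return json.loads(data).get("format", "unknown")
    except (ValueError, AttributeError):
        return "unknown"


def identify_objects(data: bytes, fmt: str) -> list:
    """Step 2: use knowledge of the format to locate the logical objects."""
    if fmt != "toydoc":
        return []
    return json.loads(data)["objects"]


def normalize(obj: dict) -> bytes:
    """Step 3: create the resulting object.

    The position-dependent field ("offset") and the instance-dependent field
    ("saved_at") are simply left behind, and the format-specific
    transformation (zlib compression) is undone.
    """
    return zlib.decompress(bytes.fromhex(obj["payload_hex"]))


class ObjectStore:
    """Steps 4 and 5: store a resulting object only if it is not already stored."""

    def __init__(self):
        self.objects = {}

    def store_if_new(self, resulting_object: bytes):
        key = hashlib.sha256(resulting_object).hexdigest()
        if key in self.objects:
            return key, False  # already stored, nothing written
        self.objects[key] = resulting_object
        return key, True


def partition_and_store(data: bytes, store: ObjectStore) -> list:
    fmt = determine_format(data)
    return [store.store_if_new(normalize(o)) for o in identify_objects(data, fmt)]
```

Two toydoc files that embed the same payload at different offsets and with different save dates yield the same resulting object, so the second file adds only its genuinely new content to the store.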
16. A system comprising:
    a memory capable of storing data; and
    a processor configured for;
        determining a format of digital data;
        identifying a source logical object within the digital data, wherein the identification of the source logical object is accomplished at least in part by applying knowledge about the format of the digital data;
        based on the determined format, performing one or more of the following operations to create a resulting object;
            removing position-dependent data from the source logical object,
            removing instance-dependent data from the source logical object,
            removing one or more format-specific headers or footers from the source logical object, and
            removing format-specific transformations from the source logical object,
        determining whether the resulting object has already been stored in the memory, and
        storing the resulting object in the memory if the resulting object has not already been stored.
Dependent claims: 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30
31. A computer-implemented method for partitioning digital data comprising:
    determining a format of the digital data;
    identifying a source logical object within the digital data, wherein the identification of the source logical object is accomplished at least in part by applying knowledge about the format of the digital data;
    determining whether the resulting object has already been stored;
    storing the resulting object if the resulting object has not already been stored;
    storing additional data, wherein the additional data comprises one or more of information indicative of the original position of the logical object within the digital data, position-dependent information, instance-dependent information, format-specific headers or footers, and information indicative of format-specific transformations;
    associating the additional data with the resulting object; and
    using the stored resulting object and the associated additional data to recreate the digital data.
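The reconstruction described in claim 31 can be illustrated with a minimal sketch. It assumes a made-up line-oriented container (a `TOY1` header followed by `saved_at;zlib_level;hex` records), keeps the deduplicated objects in a plain dictionary keyed by SHA-256 digest, and records the additional data as a per-object "recipe". All of these names and choices are hypothetical; the claim only requires that additional data be stored, associated with the resulting object, and used to recreate the digital data.

```python
import hashlib
import zlib

MAGIC = b"TOY1\n"  # hypothetical format-specific header


def build(records) -> bytes:
    """Serialize (saved_at, zlib_level, payload) records in the toy format."""
    lines = [f"{ts};{lvl};{zlib.compress(p, lvl).hex()}" for ts, lvl, p in records]
    return MAGIC + "\n".join(lines).encode("ascii")


def split(data: bytes, store: dict) -> list:
    """Store each resulting object once and return the associated additional data."""
    if not data.startswith(MAGIC):  # format determination
        raise ValueError("not a TOY1 document")
    recipe = []
    body = data[len(MAGIC):].decode("ascii")
    for position, line in enumerate(body.split("\n")):
        if not line:
            continue
        ts, lvl, hexed = line.split(";")
        payload = zlib.decompress(bytes.fromhex(hexed))  # undo the transformation
        key = hashlib.sha256(payload).hexdigest()
        store.setdefault(key, payload)  # stored only if not already present
        recipe.append({
            "position": position,    # original position within the data
            "saved_at": ts,          # instance-dependent information
            "zlib_level": int(lvl),  # information about the transformation
            "object": key,           # association with the resulting object
        })
    return recipe


def recreate(recipe: list, store: dict) -> bytes:
    """Rebuild the original bytes from stored objects plus additional data."""
    ordered = sorted(recipe, key=lambda r: r["position"])
    return build([(r["saved_at"], r["zlib_level"], store[r["object"]])
                  for r in ordered])
```

In this sketch the round trip is byte-exact because the compression level is kept as part of the additional data and the same zlib build is used on both sides; a real format would need to record whatever its own transformations require.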
32. A system comprising:
    a memory capable of storing data; and
    a processor configured for;
        determining a format of digital data,
        identifying a source logical object within the digital data, wherein the identification of the source logical object is accomplished at least in part by applying knowledge about the format of the digital data,
        determining whether the resulting object has already been stored,
        storing the resulting object if the resulting object has not already been stored,
        storing additional data, wherein the additional data comprises one or more of information indicative of the original position of the logical object within the digital data, position-dependent information, instance-dependent information associating the additional data with the resulting object, and
        using the stored resulting object and the associated additional data to recreate the digital data.
Specification