Methods and apparatus for content-aware data partitioning
First Claim
1. A computer-implemented method for partitioning and storing a first file in a repository, said repository comprising a plurality of stored objects corresponding to a plurality of stored files, wherein each of said stored objects corresponds to one or more of said stored files, wherein said first file comprises a plurality of data objects and is different from each of said plurality of stored files, the method comprising:
- determining a format of the first file;
- identifying each of the data objects corresponding to the first file at least in part on the basis of the determined format; and
- for each of said identified data objects:
  - processing said identified data object on the basis of the determined format in order to create a further data object, said processing comprising removing, at least in part, information relating to the identified format from said identified data object;
  - comparing the further data object with said plurality of stored objects in order to determine whether the further data object corresponds to one of said plurality of stored objects; and
  - storing the further data object in the repository in dependence on said comparison.
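The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and everything concrete in it is an assumption made for illustration: the format is guessed from the file extension, the data objects of a text file are taken to be its paragraphs, the "information relating to the format" that gets removed is just line endings and repeated whitespace, and the comparison with stored objects is done by SHA-256 digest. The function names and the dictionary used as the repository are hypothetical.

```python
import hashlib
import re


def determine_format(name: str) -> str:
    # Hypothetical format detection: file extension only.
    return "text" if name.endswith(".txt") else "binary"


def identify_objects(data: bytes, fmt: str) -> list[bytes]:
    # Assumed rule: a text file's data objects are its paragraphs
    # (separated by blank lines); any other file is one opaque object.
    if fmt == "text":
        return [p for p in re.split(rb"(?:\r?\n){2,}", data) if p.strip()]
    return [data]


def strip_format(obj: bytes, fmt: str) -> bytes:
    # Assumed "format information": line endings and whitespace runs.
    if fmt == "text":
        return b" ".join(obj.split())
    return obj


def store_file(name: str, data: bytes, repository: dict[str, bytes]) -> list[str]:
    """Partition a file and store only objects the repository lacks."""
    fmt = determine_format(name)
    keys = []
    for obj in identify_objects(data, fmt):
        further = strip_format(obj, fmt)            # the "further data object"
        key = hashlib.sha256(further).hexdigest()   # comparison by digest
        if key not in repository:                   # store depending on comparison
            repository[key] = further
        keys.append(key)
    return keys


repo: dict[str, bytes] = {}
keys_a = store_file("a.txt", b"Hello  world\r\n\r\nSecond para\n", repo)
keys_b = store_file("b.txt", b"Intro\n\nHello\nworld\n", repo)
```

In this example the two files differ in line endings and whitespace, yet after normalization both contain the paragraph "Hello world", so the repository ends up holding three objects rather than four.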
Abstract
The systems and methods partition digital data units in a content-aware fashion without relying on any ancestry information, which makes it possible to find duplicate chunks in unrelated units of digital data, even across millions of documents spread across thousands of computer systems.
28 Claims
1. A computer-implemented method for partitioning and storing a first file in a repository, said repository comprising a plurality of stored objects corresponding to a plurality of stored files, wherein each of said stored objects corresponds to one or more of said stored files, wherein said first file comprises a plurality of data objects and is different from each of said plurality of stored files, the method comprising:
- determining a format of the first file;
- identifying each of the data objects corresponding to the first file at least in part on the basis of the determined format; and
- for each of said identified data objects:
  - processing said identified data object on the basis of the determined format in order to create a further data object, said processing comprising removing, at least in part, information relating to the identified format from said identified data object;
  - comparing the further data object with said plurality of stored objects in order to determine whether the further data object corresponds to one of said plurality of stored objects; and
  - storing the further data object in the repository in dependence on said comparison.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A system comprising:
- a memory comprising a plurality of stored objects corresponding to a plurality of stored files; and
- a processor configured to:
  - determine a format of a first file, wherein said first file comprises a plurality of data objects and is different from each of said plurality of stored files;
  - identify each of the data objects corresponding to the first file at least in part on the basis of the determined format;
  - for each of said identified objects:
    - process said identified data object on the basis of the determined format in order to create a further data object, said processing comprising removing, at least in part, information relating to the identified format from said identified data object;
    - compare said identified data object with said plurality of stored objects in order to determine whether the further data object corresponds to one of said plurality of stored data objects; and
    - store the further data object in the memory in dependence on said comparison.
Dependent claims: 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26
27. A computer-implemented method for partitioning and storing a first file in a repository, said repository comprising a plurality of stored objects corresponding to a plurality of stored files, wherein the first file comprises a plurality of data objects and is different from each of said plurality of stored files, the method comprising:
- determining a format of the first file;
- identifying each of the data objects corresponding to the first file at least in part on the basis of the determined format;
- for each of said identified data objects:
  - processing said identified data object on the basis of the determined format in order to create a further data object, said processing comprising removing, at least in part, information relating to the identified format from said identified data object;
  - comparing the further data object with said plurality of stored objects in order to determine whether the further data object corresponds to one of said plurality of stored objects;
  - storing the further data object in the repository in dependence on said comparison; and
  - storing data indicative of an association between the further data object and the first file in the repository, said data indicative of said association comprising data indicative of at least one of: a position of the identified data object corresponding to the further data object in relation to the first file, metadata associated with the first file and the determined format.
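The association data recited in claim 27 can be pictured as a per-file list of records pointing into the shared object store. The following is a rough sketch under stated assumptions, not the patented design: the `Association` and `Repository` classes, their field names, and the use of SHA-256 digests as object keys are all hypothetical. The claim requires at least one of position, file metadata, and determined format; this sketch happens to record all three.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Association:
    object_key: str   # digest of the further data object (assumed key scheme)
    position: int     # index of the identified object within the file
    fmt: str          # determined format of the file
    metadata: dict    # file-level metadata (here only the file name)


@dataclass
class Repository:
    objects: dict = field(default_factory=dict)       # key -> further object
    associations: dict = field(default_factory=dict)  # file name -> records

    def add(self, file_name, fmt, further_objects):
        records = []
        for position, further in enumerate(further_objects):
            key = hashlib.sha256(further).hexdigest()
            # Store the object only if no equal object is already present.
            self.objects.setdefault(key, further)
            records.append(Association(key, position, fmt, {"name": file_name}))
        self.associations[file_name] = records

    def objects_of(self, file_name):
        # Return a file's further data objects in their recorded order.
        records = sorted(self.associations[file_name], key=lambda r: r.position)
        return [self.objects[r.object_key] for r in records]


repo = Repository()
repo.add("a.txt", "text", [b"Hello world", b"Second para"])
repo.add("b.txt", "text", [b"Intro", b"Hello world"])
```

Here `b"Hello world"` is stored once but referenced by both files, and the recorded positions let each file's sequence of objects be listed in order. Restoring the original bytes would additionally need the removed format information, which this sketch does not keep.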
28. A system comprising:
- a memory comprising a plurality of stored objects corresponding to a plurality of stored files; and
- a processor configured to:
  - determine a format of a first file, wherein said first file comprises a plurality of data objects and is different from each of said plurality of stored files;
  - identify each of said data objects corresponding to the first file at least in part on the basis of the determined format;
  - for each of the identified data objects:
    - process the identified data object on the basis of the determined format in order to create a further data object, said processing comprising removing, at least in part, information relating to the identified format from said identified data object;
    - compare the further data object with said plurality of stored objects in order to determine whether the further data object corresponds to one of said plurality of stored data objects;
    - store the further data object in the repository in dependence on said comparison; and
    - store data indicative of an association between the further data object and the first file in the repository, said data indicative of said association comprising data indicative of at least one of: a position of the identified data object corresponding to the further data object in relation to the first file, metadata associated with the first file and the determined format.
Specification