EFFICIENT NEAR-DUPLICATE DATA IDENTIFICATION AND ORDERING VIA ATTRIBUTE WEIGHTING AND LEARNING
Abstract
A method to efficiently detect, and thus store, approximately duplicate or most likely duplicate files or data sets that will benefit from differencing technology rather than standard compression technology. During archive creation or modification, sets of most likely duplicate files are detected and a reduced number of transformed file segments are stored in whole. During archive expansion, one or more files are recreated from each full or partial copy.
24 Claims
1. A method of reducing redundancy and increasing processing throughput of an archiving process, comprising the steps of:
(a) providing an input data set having a plurality of data elements and/or files;
(b) detecting exact duplicate and approximately duplicate data elements or files that are either exactly similar or most likely similar; and
(c) storing references and/or differences to previously archived data;
wherein step (c) does not include the step of storing the duplicate or matched pairs of data using a standard compression technique.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
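The detection and storage decision in claim 1 can be illustrated with a minimal Python sketch. It is not the patented method: it assumes exact duplicates are found by comparing SHA-256 digests and approximate duplicates by Jaccard similarity over 4-byte shingles. The function names, the shingle length, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import hashlib

def shingles(data: bytes, k: int = 4) -> set:
    """Return the set of all k-byte substrings of data."""
    if len(data) < k:
        return {data}
    return {data[i:i + k] for i in range(len(data) - k + 1)}

def jaccard(a: bytes, b: bytes) -> float:
    """Jaccard similarity of the shingle sets of a and b (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def classify(new: bytes, archived: dict, threshold: float = 0.5):
    """Decide how to store `new` given already archived name -> bytes.

    Returns ("reference", name) for an exact duplicate,
    ("difference", name) for the most similar near duplicate,
    or ("whole", None) when nothing is similar enough.
    The threshold is an assumed value, not taken from the patent.
    """
    digest = hashlib.sha256(new).hexdigest()
    best_name, best_score = None, 0.0
    for name, old in archived.items():
        if hashlib.sha256(old).hexdigest() == digest:
            return ("reference", name)
        score = jaccard(new, old)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return ("difference", best_name)
    return ("whole", None)
```

An exact copy yields a reference, a lightly edited copy yields a difference against its closest archived match, and unrelated data is stored whole.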
15. A method for efficient full or partial duplicate data element detection and archiving, comprising the steps of:
detecting most likely similar data sets; and
encoding the most likely similar data sets using delta encoding, or using the most likely similar data sets to analyze different data sets.
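Delta encoding, as named in claim 15, stores one data set as a list of operations against another. The sketch below is one possible illustration, assuming a simple copy/insert operation format built on Python's standard `difflib.SequenceMatcher`. The operation format and the names `make_delta` and `apply_delta` are hypothetical and are not drawn from the patent.

```python
from difflib import SequenceMatcher

def make_delta(source: bytes, target: bytes) -> list:
    """Encode target as copy/insert operations against source."""
    ops = []
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))           # reuse source[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # literal new bytes
        # "delete" needs no operation: those source bytes are skipped
    return ops

def apply_delta(source: bytes, ops: list) -> bytes:
    """Rebuild the target from source and the operations from make_delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += source[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)
```

Only the inserted literals and the copy ranges need to be stored, so two similar data sets cost little more than one.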
16. A method for efficient full or partial duplicate data element detection and archiving, comprising the steps of:
(a) detecting most likely similar data sets;
(b) encoding the data sets using delta encoding;
(c) using a final weighting to predict the outcome of using a reference/differencing technique rather than a standard compression technique; and
(d) ordering the data sets from the most likely file pairs to the least likely file pairs to benefit from using a differencing technique.
Dependent claims: 17, 18, 19, 20, 21
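Steps (c) and (d) of claim 16 combine attributes into a final weighting and order candidate pairs by it. The following is a rough sketch under stated assumptions: the three attributes (matching extension, size ratio, name similarity), their weights, and the names `FileInfo`, `final_weight`, and `order_pairs` are all hypothetical. The claim text shown here does not say which attributes are used or how weights are learned.

```python
import os
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations

@dataclass
class FileInfo:
    name: str
    size: int

# Hypothetical attribute weights; a real system might learn these.
WEIGHTS = {"same_extension": 0.3, "size_ratio": 0.3, "name_similarity": 0.4}

def attributes(a: FileInfo, b: FileInfo) -> dict:
    """Score each attribute of a file pair in the range 0.0 to 1.0."""
    stem_a, ext_a = os.path.splitext(a.name.lower())
    stem_b, ext_b = os.path.splitext(b.name.lower())
    largest = max(a.size, b.size)
    return {
        "same_extension": 1.0 if ext_a == ext_b else 0.0,
        "size_ratio": min(a.size, b.size) / largest if largest else 1.0,
        "name_similarity": SequenceMatcher(None, stem_a, stem_b).ratio(),
    }

def final_weight(a: FileInfo, b: FileInfo) -> float:
    """Weighted sum of attribute scores: higher means differencing is more likely to pay off."""
    scores = attributes(a, b)
    return sum(WEIGHTS[key] * value for key, value in scores.items())

def order_pairs(files: list) -> list:
    """Return (weight, a, b) tuples sorted from most to least likely pair."""
    pairs = [(final_weight(a, b), a, b) for a, b in combinations(files, 2)]
    return sorted(pairs, key=lambda item: item[0], reverse=True)
```

Processing pairs in this order means the differencing step is attempted first where it is predicted to help most.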
22. A method to extract data/files from an archive using a plurality of encoding methods including at least differencing, references, and standard compression techniques.
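Extraction from an archive that mixes the three encoding methods of claim 22 can be sketched as a dispatch on the entry type. This is an illustrative sketch only: the in-memory archive layout, the entry tags, the copy/insert operation format, and the use of `zlib` as the "standard compression technique" are assumptions, not details from the patent.

```python
import zlib

def extract(archive: dict, name: str) -> bytes:
    """Recreate one file from an archive of mixed-encoding entries.

    Assumed entry formats (hypothetical):
      ("compressed", zlib_bytes)
      ("reference", other_name)
      ("difference", source_name, [("copy", start, end) | ("insert", bytes), ...])
    """
    entry = archive[name]
    kind = entry[0]
    if kind == "compressed":
        return zlib.decompress(entry[1])
    if kind == "reference":
        return extract(archive, entry[1])
    if kind == "difference":
        source = extract(archive, entry[1])
        out = bytearray()
        for op in entry[2]:
            if op[0] == "copy":
                out += source[op[1]:op[2]]
            else:
                out += op[1]
        return bytes(out)
    raise ValueError(f"unknown encoding: {kind}")

archive = {
    "a.txt": ("compressed", zlib.compress(b"hello world")),
    "copy_of_a.txt": ("reference", "a.txt"),
    "b.txt": ("difference", "a.txt", [("copy", 0, 6), ("insert", b"there")]),
}
print(extract(archive, "b.txt"))  # b'hello there'
```

Only `a.txt` is stored in full; the other two entries are recreated from it during expansion.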
24. A combination compression and differencing method for processing a given set of data and/or files that include likely matches, which on the whole may result in a smaller overall result by using a combination of compression and differencing instead of individual compression, comprising the steps of:
(a) using a differencing algorithm to identify one or more of the data/files to be stored and/or compressed;
(b) storing and/or compressing the data/files identified in step (a);
(c) storing the remaining data/files as references to the stored and/or compressed file;
wherein the differencing algorithm employed in step (a) uses one or more of the following substeps:
(a.1) storing and/or compressing the largest file, earliest create date, or other metric, or some combination thereof, as a source file;
(a.2) storing and/or compressing each of the files differenced from the file stored as the source file;
(a.3) attempting to match each of the possible likely match combinations selected from a set of possible matches, with each being used as the potential source file, to determine the best overall result; the best overall combination, producing the smallest overall size of source and differences from that source, is then stored and/or transmitted.
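Substep (a.3) of claim 24 describes trying each candidate as the source file and keeping the combination with the smallest overall size. A minimal sketch of that exhaustive search follows. It assumes the source is stored with `zlib` and that the cost of each difference is estimated from `difflib` opcodes with made-up per-operation overheads (8 bytes per copy, 4 bytes per literal header). The names `delta_size`, `total_size`, and `best_source` are hypothetical.

```python
import zlib
from difflib import SequenceMatcher

def delta_size(source: bytes, target: bytes) -> int:
    """Estimate the stored size of target expressed as a difference from source."""
    size = 0
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            size += 8              # assumed fixed cost of one copy operation
        else:
            size += 4 + (j2 - j1)  # assumed header plus literal bytes
    return size

def total_size(files: dict, candidate: str) -> int:
    """Size of the compressed candidate plus differences for every other file."""
    source = files[candidate]
    deltas = sum(delta_size(source, data)
                 for name, data in files.items() if name != candidate)
    return len(zlib.compress(source)) + deltas

def best_source(files: dict) -> tuple:
    """Try every file as the source; return (name, total) with the smallest total."""
    return min(((name, total_size(files, name)) for name in files),
               key=lambda item: item[1])
```

The search is quadratic in the number of likely matches, so in practice it would be applied only to small groups of files already flagged as similar.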
Specification