EFFICIENT FULL OR PARTIAL DUPLICATE FORK DETECTION AND ARCHIVING
Abstract
A method to efficiently detect, store, modify, and recreate fully or partially duplicate file forks is described. During archive creation or modification, sets of fully or partially duplicate forks are detected and a reduced number of transformed forks or fork segments are stored. During archive expansion, one or more forks are recreated from each full or partial copy.
19 Citations
30 Claims
1. A method of reducing redundancy and increasing processing throughput of an archiving process, including the steps of:

(a) detecting identical or substantially identical files and/or forks;

(b) compressing the first instance of such files and/or forks; and

(c) storing reference information relating to the first compressed copy and bypassing compression of the second and all subsequent occurrences of said identical files and/or forks.

Dependent claims: 2-17.
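The pipeline of claim 1 can be sketched as follows. This is an illustrative reading, not the patent's actual implementation: SHA-256 stands in for the detection step and zlib for the compression transform, and all names are assumptions.

```python
import hashlib
import zlib

def archive_forks(forks):
    """Compress each unique fork once; record references for duplicates.

    `forks` maps fork names to raw bytes. Returns (compressed, references),
    where every duplicate fork points at the first compressed copy instead
    of being recompressed. Illustrative sketch only.
    """
    compressed = {}   # digest -> compressed first instance
    references = {}   # fork name -> digest of the stored copy
    for name, data in forks.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest not in compressed:       # first occurrence: compress it
            compressed[digest] = zlib.compress(data)
        references[name] = digest          # later occurrences: reference only
    return compressed, references
```

Bypassing compression of duplicates is what raises throughput: the cost of a second identical fork drops from a full compression pass to a hash lookup.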
18. A method of detecting file and/or fork differences in which fork data is protected against the injection of duplicate or substantially duplicate forks, comprising the step of comparing fork segments with a cryptographically secure hashing algorithm.
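Segment comparison per claim 18 might look like the sketch below. The claim names no particular hash; SHA-256 is used here as one cryptographically secure choice, and the fixed segment size and helper names are assumptions.

```python
import hashlib

SEGMENT_SIZE = 64 * 1024  # illustrative segment length

def segment_digests(data, size=SEGMENT_SIZE):
    """Split fork data into fixed-size segments and hash each one."""
    return [hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)]

def diff_points(fork_a, fork_b):
    """Return indices of equal-length segments whose digests differ.

    Because the hash is cryptographically secure, an attacker cannot
    feasibly craft a different segment that collides with a stored
    digest, which is the injection protection the claim describes.
    """
    a, b = segment_digests(fork_a), segment_digests(fork_b)
    return [i for i, (da, db) in enumerate(zip(a, b)) if da != db]
```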
27. A method of reducing redundancy and increasing processing throughput of a file archiving process, including the steps of:

(a) creating structural information that describes the sets of unique and duplicate fork segments produced by the archive creation process, reflecting the final lists of fork segments; the structural information includes overall pre- and post-transform fork sizes and/or locations of unique, transformed fork data in the archived data; it describes subsets of fully duplicate forks with identical size and location data for all forks in a subset, and further describes subsets of partially duplicate forks with sizes and/or locations for fork segments corresponding to difference points and lists of segments that, when concatenated in listed order, reconstitute the original forks; and

(b) updating the information created in step (a) as needed.
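The structural information of claim 27 can be modeled as two record types: a table of unique, transformed segments (with pre-/post-transform sizes and archive locations) and, per fork, an ordered list of indices into that table. These class and field names are hypothetical, not the patent's format.

```python
from dataclasses import dataclass

@dataclass
class SegmentEntry:
    """One unique, transformed fork segment stored in the archive."""
    pre_size: int    # size before the transform (e.g. compression)
    post_size: int   # size of the transformed data as stored
    offset: int      # location of the transformed data in the archive

@dataclass
class ForkEntry:
    """Ordered segment indices that, concatenated, reconstitute a fork."""
    segments: list   # indices into the archive's unique-segment table

# Fully duplicate forks share identical ForkEntry contents; partially
# duplicate forks mix shared indices with fork-specific ones at the
# difference points.
```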
28. A method of reducing redundancy in archived digital files, comprising the step of hierarchically structuring and/or encoding structural information with a source coder and/or a statistical model and/or an entropy coder.
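Claim 28 applies coding to the structural information itself, not just the fork data. A minimal sketch, with zlib standing in for the claimed source coder / statistical model / entropy coder pipeline (the claim names no specific algorithm):

```python
import json
import zlib

def encode_structural_info(info):
    """Serialize and entropy-code the archive's structural information."""
    return zlib.compress(json.dumps(info, sort_keys=True).encode())

def decode_structural_info(blob):
    """Inverse of encode_structural_info: decode, then deserialize."""
    return json.loads(zlib.decompress(blob).decode())
```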
29. A method of reducing redundancy in archived data when sequential whole-archive expansion is a desired property of the archive data, comprising the step of positioning all fork structural information prior to the fork data it describes.
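The layout of claim 29 can be illustrated with a length-prefixed header: all structural information is written first, so a reader can expand the archive in a single forward pass. The JSON header and length prefix are assumptions for the sketch, not the patent's wire format.

```python
import io
import json
import struct

def write_archive(structural_info, fork_blobs):
    """Write structural information ahead of the fork data it describes."""
    buf = io.BytesIO()
    header = json.dumps(structural_info).encode()
    buf.write(struct.pack(">I", len(header)))  # 4-byte header length prefix
    buf.write(header)                          # structural info first...
    for blob in fork_blobs:                    # ...then the fork data
        buf.write(blob)
    return buf.getvalue()

def read_archive(data):
    """Single sequential pass: header first, then the fork data payload."""
    (hlen,) = struct.unpack_from(">I", data, 0)
    info = json.loads(data[4:4 + hlen].decode())
    return info, data[4 + hlen:]
```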
-
30. A method of handling duplicate forks during archive expansion, where the structural information for individual forks must be located and interpreted during expansion, said method including at least one of the following steps:
-
(a) when sequential archive consumption is desired during expansion, processing pre-inverse transform data consisting of forks and segments by inverse transform or transforms, and routing and/or concatenating post-inverse transform data, in the form of fully or partially duplicate forks to form one or more forks consisting of one or more fork segments, by writing post-inverse transform data in parallel to multiple files, or by writing to one file corresponding to a full fork or a collection of fork segments, and making copies of the file'"'"'s contents after its corresponding full fork or fork segments have been fully reconstructed by the inverse transform(s); (b) when sequential fork creation is desired and non-sequential archive consumption is also possible or permitted, reconstituting duplicate forks independently by processing pre-inverse transform data consisting of forks and segments by an inverse transform or transforms, wherein segments that form partially duplicate forks are concatenated after the inverse transform application; (c) when only sequential archive consumption is possible or permitted, and sequential fork creation is desired, processing pre-inverse transform data using an inverse transform or transforms and retaining post-inverse transform data with a buffer before routing and concatenating it into output forks; and (d) when differences between forks or fork segments were encoded by a differencing algorithm, using a patch transformation to produce a new fork or fork segment.
-
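The core of the expansion step can be sketched as follows: each unique stored blob is inverse-transformed once, and every fork that references it receives a copy, mirroring step (a)'s "make copies after full reconstruction" strategy. The dict-based archive layout and zlib inverse transform are illustrative assumptions, not the patent's format.

```python
import zlib

def expand_duplicates(stored, references):
    """Recreate every output fork from a reduced set of stored copies.

    `stored` maps keys to transformed (here: compressed) blobs;
    `references` maps each fork name to the key of its stored copy.
    Each blob is inverse-transformed exactly once, then duplicated for
    every fork that references it.
    """
    reconstructed = {}   # key -> inverse-transformed fork data
    forks = {}           # fork name -> recreated contents
    for name, key in references.items():
        if key not in reconstructed:             # inverse transform once
            reconstructed[key] = zlib.decompress(stored[key])
        forks[name] = reconstructed[key]         # duplicates are copies
    return forks
```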
Specification