System and method for differential compression of data from a plurality of binary sources
First Claim
1. A machine implementable method for forming a compressed differentially encoded image of a version file as derived from a base file, said image being defined over a set of file building operations (ADD, COPY, END), descriptors, and address pointers and utilizing the version and base files comprising the steps of:
- (a) recursively forming a hash function first signature set of the base file in a predetermined serial order or direction;
(b) recursively forming a progressively increasing hash function second signature set of the version file in said predetermined serial order or direction, and comparing each signature in the second set contemporaneous with its generation with the signatures in the first set;
(c) upon a comparison match of signatures and verification of contents, encoding a difference file ad seriatim as a portion of the version file contents up to the point of the instant comparison match from the later of either the start of the version file or the last comparison match, followed by a COPY command, a length attribute, and pointer to the base file location of the instant matching contents; and
(d) repeating steps (b) and (c) until the version file becomes exhausted.
2 Assignments
0 Petitions
Accused Products
Abstract
A method and a system are presented for generating differentially compressed output from binary sources. Given two versions of the same file as input streams, a compact encoding of one of the input streams is generated, by representing it as a set of changes with respect to the other input stream. Algorithms for differencing files requiring time linear in the size of the input and a constant amount of space for execution are presented. In addition, advanced techniques for improving existing differencing algorithms are developed and applied to previous methods. These techniques allow algorithms to increase their efficiency without a loss of compression and to accept arbitrarily large inputs without sacrificing correctness or degrading the compression data rate. The differential compression methods provide a computationally efficient compression technique for applications that generate versioned data.
-
Citations
13 Claims
-
1. A machine implementable method for forming a compressed differentially encoded image of a version file as derived from a base file, said image being defined over a set of file building operations (ADD, COPY, END), descriptors, and address pointers and utilizing the version and base files comprising the steps of:
-
(a) recursively forming a hash function first signature set of the base file in a predetermined serial order or direction;
(b) recursively forming a progressively increasing hash function second signature set of the version file in said predetermined serial order or direction, and comparing each signature in the second set contemporaneous with its generation with the signatures in the first set;
(c) upon a comparison match of signatures and verification of contents, encoding a difference file ad seriatim as a portion of the version file contents up to the point of the instant comparison match from the later of either the start of the version file or the last comparison match, followed by a COPY command, a length attribute, and pointer to the base file location of the instant matching contents; and
(d) repeating steps (b) and (c) until the version file becomes exhausted.
-
-
2. A machine implementable method for forming a compressed differentially encoded image of a version file as derived from a base file, said image being defined over a set of file building operations (ADD, COPY, END), descriptors, and address pointers and utilizing the version and base files comprising the steps of:
-
(a) recursively forming a progressively increasing hash function first signature set of the base file in a predetermined serial order or direction and recursively forming a progressively increasing hash function second signature set of the version file in said predetermined serial order or direction;
(b) comparing each signature in the second set contemporaneous with its generation with the signatures in the first set contemporaneous with their generation;
(c) upon a comparison match of signatures and verification of contents, encoding a difference file ad seriatim as a portion of the version file contents up to the point of the comparison match from the later of either the start of the version file or the last comparison match, followed by a COPY command, length attribute, and pointer to the base file location of the instant matching contents; and
(d) repeating steps (a)-(c) starting from the end of the matching contents in the respective base and version files until exhaustion of the respective base and version files.
-
-
3. A machine implementable method for forming a compressed differentially encoded image of a version file as derived from a base file, said image being defined over a set of file building operations (ADD, COPY, END), descriptors, and address pointers and utilizing the version and base files comprising the steps of:
-
(a) recursively scanning the base file using a window of m bytes length including shifting the window in the same direction k<
m bytes/recursion, forming a hash function signature of the window contents for each recursion of the base file, and recording said base file signature in a buffer;
(b) recursively scanning the version file using a window of m bytes length including shifting the window in the same direction k<
m bytes/recursion, forming a hash function signature of the window contents for each recursion of the version file, and comparing each hash function signature as generated from the version file with the buffer stored signatures of the base file;
(c) upon a comparison match of signatures and verification of contents, encoding a difference file ad seriatim as an ADD command plus version file contents from the later of either the start of the version file or the last comparison match up to the point of the instant comparison match, followed by a COPY command, length attribute, and pointer to the base file location of the matching contents; and
(d) repeating steps (b) and (c) until the version file scan is exhausted. - View Dependent Claims (4, 5)
-
-
6. A machine implementable method for forming a compressed differentially encoded image of a version file as derived from a base file, said image being defined over a set of file building operations (ADD, COPY, END), descriptors, and address pointers and utilizing the version and base files comprising the steps of:
-
(a) recursively scanning the base file and version file respectively in time overlapped relation in a predetermined serial order or direction using respective windows of m bytes granularity and k<
m bytes alignment starting from predetermined locations in the respective files, forming a hash function signature of the window contents for each recursion of the respective files, and comparing the signatures of the version file with the signatures of the base file;
(b) upon a comparison match of signatures and verification of contents, encoding a difference file ad seriatim as an ADD command plus version file contents up to the point of the instant comparison match from the later of either the starting point or the last comparison match, followed by a COPY command, length of the matching contents, and pointer to the base file location of the instant matching contents; and
(c) repeating steps (a)-(b) starting from the end of the matching contents in the respective base and version files until exhaustion of the respective base and version files. - View Dependent Claims (7, 8, 9, 10, 11)
(c1) recording successive encodings of the difference file as a linked list in a code generated order into an addressable buffer;
(c2) ascertaining whether the matching portion of the base and version files associated with each copy command is backwards extensible;
(c3) in the event such matching portion is backwards extensible, re-encoding the copy command and its attributes to include the extensible matching portion; and
(c4) replacing the counterpart entry in the linked list.
-
-
9. The method according to claim 6, wherein said method further includes the steps of forming a set of base file signatures from a predetermined subset of all possible base file signatures;
- and comparing the signatures derived over the version file with ones of the predetermined subset of base file signatures.
-
10. The method according to claim 9, wherein the number of base file signatures in the subset should approximate one-half of the number of all possible base file signatures.
-
11. The method according to claim 9, wherein the hash function h(s) is of the form x=h(s)=s modulo q, q being a prime number, and wherein the substep of forming a hash function for an m character substring (sksk+1 . . . sk+m−
- 1) includes the step of transforming the m character string (sksk+1 . . . sk+m−
1) into an integer residue xk=skbm−
1+sk+1bm−
2+. . . +sk+m−
1, b being a radix, the value of xk being determined according to the relation xk+1=(xk−
skbm−
1)b+sk+m.
- 1) includes the step of transforming the m character string (sksk+1 . . . sk+m−
-
12. An apparatus for forming a compressed differentially encoded image of a version file as derived from a base file, said image being defined over a set of file building operations (ADD, COPY, END), descriptors, and address pointers and utilizing the version and base files comprising:
-
first and second shift registers of m-bytes in length for respectively storing a portion of the base file and version file;
first circuits coupling the first register for recursively scanning the base file by shifting the base file in the first register in the same direction k<
m bytes/recursion, said first circuits including an arrangement for forming a hash function signature of the first register contents for each recursion of the base file, and for recording said base file signature in a buffer;
second circuits coupling the second register for recursively scanning the version file by shifting the version file in the second register in the same direction k<
m bytes/recursion, said second circuits including the arrangement for forming a hash function signature of the window contents for each recursion of the version file, and third circuits for comparing each hash function signature as generated from the version file with the buffer stored signatures of the base file; and
fourth circuits coupling the first and second registers, the first, second, and third circuits and responsive to a comparison match of signatures and verification of contents for encoding a difference file ad seriatim as an ADD command plus version file contents up to the point of the instant comparison match from the later of either starting point of the last comparison match, followed by a COPY command, length of the matching contents, and pointer to the base file location of the instant matching contents, and invoking the next recursion of steps (b) and (c) until the version file scan is exhausted.
-
-
13. An article of manufacture comprising a machine readable memory having stored therein a plurality of processor executable control program steps for forming a compressed differentially encoded image of a version file derived from a base file, said image being defined over a set of file building operations (ADD, COPY, END), descriptors, and address pointers and utilizing the version and base files, said control program steps include:
-
(a) a control program step for recursively scanning the base file using a window of m bytes length including shifting the window in the same direction k<
m bytes/recursion and forming a hash function signature of the window contents for each recursion of the base file, and recording said base file signature in a buffer;
(b) a control program step for recursively scanning the version file using a window of m bytes length including shifting the window in the same direction k<
m bytes/recursion forming a hash function signature of the window contents for each recursion of the version file, and comparing each hash function signature as generated from the version file with the buffer stored signatures of the base file; and
(c) a control program step upon a comparison match of signatures and verification of contents for encoding a difference file ad seriatim as an ADD command plus version file contents from the later of either the start of the version file or the last comparison match up to the point of the instant comparison match, followed by a COPY command, length of the matching contents, and pointer to the base file location of the instant matching contents, and for invoking the next recursion of steps (b) and (c) until the version file scan is exhausted.
-
Specification