System and method for unorchestrated determination of data sequences using sticky byte factoring to determine breakpoints in digital sequences
First Claim
1. A method for partitioning a digital sequence comprising:
- performing a hash function on at least a portion of said digital sequence;
monitoring hash values produced by said hash function for a first predetermined numeric pattern found in a range of numeric values; and
marking a breakpoint in said digital sequence when said first predetermined numeric pattern occurs,said step of performing said hash function comprising a rolling hash function adapted to scan portions of said digital sequence combined with adjusting a hash value based on a length of a current partition of said digital sequence to increase a likelihood of identifying a breakpoint in said current partition as a potential length of said current partition increases.
10 Assignments
0 Petitions
Accused Products
Abstract
A system and method for unorchestrated determination of data sequences using “sticky byte” factoring to determine breakpoints in digital sequences such that common sequences can be identified. Sticky byte factoring provides an efficient method of dividing a data set into pieces that generally yields near optimal commonality. This is effectuated by employing a rolling hashsum and, in an exemplary embodiment disclosed herein, a threshold function to deterministically set divisions in a sequence of data. Both the rolling hash and the threshold function are designed to require minimal computation. This low overhead makes it possible to rapidly partition a data sequence for presentation to a factoring engine or other applications that prefer subsequent synchronization across the data set.
148 Citations
58 Claims
-
1. A method for partitioning a digital sequence comprising:
-
performing a hash function on at least a portion of said digital sequence; monitoring hash values produced by said hash function for a first predetermined numeric pattern found in a range of numeric values; and marking a breakpoint in said digital sequence when said first predetermined numeric pattern occurs, said step of performing said hash function comprising a rolling hash function adapted to scan portions of said digital sequence combined with adjusting a hash value based on a length of a current partition of said digital sequence to increase a likelihood of identifying a breakpoint in said current partition as a potential length of said current partition increases. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
-
-
11. A method for determining at least a first breakpoint in at least a first digital sequence comprising:
-
determining a subset group of said first digital sequence; performing a hash function on said subset group of said first digital sequence beginning at a starting position in said first digital sequence until a first predetermined numeric pattern, which is found in a range of numeric values, in said hash value is obtained, wherein said hash function is combined with adjusting said hash value based on a length of a current partition of said first digital sequence to increase a likelihood of identifying said first breakpoint in said current partition as a potential length of said current partition increases; and marking said first breakpoint when said first predetermined numeric pattern in said hash value is obtained. - View Dependent Claims (12, 13, 14, 15, 16, 17)
-
-
18. A computer program product comprising:
-
a computer usable medium having computer readable code embodied therein for determining at least a first breakpoint in a first digital sequence comprising; computer readable program code devices configured to cause a computer to effect determining a subset group of said first digital sequence; computer readable program code devices configured to cause a computer to effect performing a hash function on said subset group of said first digital sequence beginning at a starting position in said first digital sequence until a first predetermined numeric pattern in said hash value is obtained, wherein said hash function is combined with adjusting said hash value based on a length of a current partition of said first digital sequence to increase a likelihood of identifying a first breakpoint in said current partition as a potential length of said current partition increases; computer readable program code devices configured to cause a computer to effect marking said first breakpoint when said first predetermined numeric pattern in said hash value is obtained; computer readable program code devices configured to cause a computer to effect storing said predetermined hash value of said first breakpoint in a data storage, said predetermined hash value of said first breakpoint pointing to a corresponding portion of said first digital data sequence from said starting point to said first breakpoint; computer readable program code devices configured to cause a computer to effect performing a hash function on a subset group beginning at a starting position in a second digital sequence until said first predetermined numeric pattern in said hash value is obtained; computer readable program code devices configured to cause a computer to effect marking a second breakpoint in said second digital sequence when said first predetermined numeric pattern in said hash value is obtained, wherein said hash function is combined with adjusting said hash value based on a length of a current partition of said second digital sequence to increase a likelihood of identifying said second breakpoint in said current partition as a potential length of said current partition increases; and computer readable program code devices configured to cause a computer to effect comparing said predetermined hash value at said first breakpoint with said predetermined hash value at said second breakpoint, wherein if said predetermined hash value of said first breakpoint is not the same as said predetermined hash value of said second breakpoint, replacing said predetermined hash value of said first breakpoint in said data storage, said predetermined hash value of said second breakpoint pointing to a corresponding portion of said second digital data sequence from said starting point to said second breakpoint; and if said predetermined hash value of said first breakpoint is the same as said predetermined hash value of said second breakpoint, storing said predetermined hash value of said second breakpoint in said data storage, said predetermine hash value of said second breakpoint pointing to said predetermined hash value of said first breakpoint. - View Dependent Claims (19, 20, 21, 22, 23)
-
-
24. A method for determining at least a first breakpoint in at least a first digital sequence comprising:
-
determining a subset group of said first digital sequence; performing a hash function on said subset group of said first digital sequence beginning at a starting position in said first digital sequence until a first predetermined numeric pattern is found in the hash values from the said hash function is obtained, wherein the hash function is a 32-bit rolling hash function and the subset group is a 32-bit pattern derived directly or indirectly from the said first digital sequence combined with adjusting said hash value based on a length of a current partition of said first digital sequence to increase a likelihood of identifying said first breakpoint in said current partition as a potential length of said current partition increases, and the performing comprises shifting the 32-bit pattern over one bit, reading a character from the first digital sequence and deriving directly or indirectly another 32-bit pattern, and repeating the shifting and the reading, until there are 32-bits in sequence that can be hashed together to provide one said hash value; marking said first breakpoint when said first predetermined numeric pattern in said hash value is obtained; further performing a hash function on another subset group of said first digital sequence from said first breakpoint until said first predetermined numeric pattern in said hash value is again obtained; and marking another breakpoint in said first digital sequence when said first predetermined numeric pattern in said hash value is again obtained. - View Dependent Claims (25, 26)
-
-
27. A computer program product comprising:
-
a computer usable medium having computer readable code embodied therein for determining a first breakpoint in a first digital sequence comprising; computer readable program code devices configured to cause a computer to effect determining a subset group of said first digital sequence; computer readable program code devices configured to cause a computer to effect performing a hash function on said subset group of said first digital sequence beginning at a starting position in said first digital sequence until a first predetermined numeric pattern in said hash value is obtained combined with adjusting said hash value based on a length of a current partition of said first digital sequence to increase a likelihood of identifying said first breakpoint in said current partition as a potential length of said current partition increases, wherein the hash function is a 32-bit rolling hash function and the subset group is a 32-bit pattern, and the performing comprises shifting the 32-bit pattern over one bit, reading a character from the first digital sequence, and repeating the shifting and the reading; and computer readable program code devices configured to cause a computer to effect marking said first breakpoint when said first predetermined numeric pattern in said hash value is obtained. - View Dependent Claims (28, 29, 30, 31)
-
-
32. A method for partitioning a digital sequence comprising:
-
performing a rolling hash function on at least a portion of a digital sequence to produce binary sequences, said hash function being combined with adjusting a hash value based on a length of a current partition of said digital sequence to increase a likelihood of identifying a breakpoint in said current partition as a potential length of said current partition increases; monitoring said binary sequences produced by said hash function for a predetermined pattern of bits; and marking said breakpoint in said digital sequence when said predetermined pattern occurs. - View Dependent Claims (33, 34, 35)
-
-
36. A computer program product comprising:
-
computer readable program code devices configured to cause a computer to effect performing a rolling hash function on at least a portion of a digital sequence to produce binary sequences, said hash function being combined with adjusting said hash value based on a length of a current partition of said first digital sequence to increase a likelihood of identifying said first breakpoint in said current partition as a potential length of said current partition increases; computer readable program code devices configured to cause a computer to effect monitoring said binary sequences produced by said hash function for a predetermined pattern of bits; and computer readable program code devices configured to cause a computer to effect marking said breakpoint in said digital sequence when said predetermined pattern occurs.
-
-
37. A method for partitioning a digital sequence comprising:
-
performing a mathematical function that produces a statistically infrequent arrangement of n bytes on at least a portion of a digital sequence to produce binary sequences, said mathematical function that produces statistically infrequent arrangement of n bytes being combined with adjusting said mathematical function that produces statistically infrequent arrangement of n bytes based on a length of a current partition of said digital sequence to increase a likelihood of identifying a breakpoint in said current partition as a potential length of said current partition increases; monitoring said binary sequences produced by the mathematical function that produces statistically infrequent arrangement of n bytes for a first predetermined pattern of bits; and marking said breakpoint in the digital sequence when the first predetermined pattern is identified as occurring during the monitoring. - View Dependent Claims (38, 39, 40, 41)
-
-
42. A computer program product comprising:
-
computer readable program code devices configured to cause a computer to effect performing a mathematical function that produces a statistically infrequent arrangement of n bytes on at least a portion of a digital sequence to produce binary sequences, said mathematical function that produces statistically infrequent arrangement of n bytes being combined with adjusting said mathematical function that produces statistically infrequent arrangement of n bytes based on a length of a current partition of said digital sequence to increase a likelihood of identifying a breakpoint in said current partition as a potential length of said current partition increases; computer readable program code devices configured to cause a computer to effect monitoring said binary sequences produced by the mathematical function that produces statistically infrequent arrangement of n bytes for a first predetermined pattern of bits; and computer readable program code devices configured to cause a computer to effect marking said breakpoint in the digital sequence when the first predetermined pattern is identified as occurring during the monitoring. - View Dependent Claims (43, 44, 45, 46)
-
-
47. A method for partitioning a digital sequence comprising:
-
performing a mathematical function that produces statistically infrequent arrangement of n bytes on at least a portion of a digital sequence to produce hash values, said mathematical function that produces statistically infrequent arrangement of n bytes being combined with adjusting said mathematical function that produces statistically infrequent arrangement of n bytes based on a length of a current partition of said digital sequence to increase a likelihood of identifying a breakpoint in said current partition as a potential length of said current partition increases; monitoring said hash values produced by the mathematical function that produces statistically infrequent arrangement of n bytes for a numeric pattern selected from a range of numeric values; and marking said breakpoint in the digital sequence when the numeric pattern occurs according to the monitoring.
-
-
48. A computer program product comprising:
-
computer readable program code devices configured to cause a computer to effect performing a mathematical function that produces statistically infrequent arrangement of n bytes on at least a portion of a digital sequence to produce hash values, said mathematical function that produces statistically infrequent arrangement of n bytes being combined with adjusting said mathematical function that produces statistically infrequent arrangement of n bytes based on a length of a current partition of said digital sequence to increase a likelihood of identifying a breakpoint in said current partition as a potential length of said current partition increases; computer readable program code devices configured to cause a computer to effect monitoring hash values produced by the mathematical function that produces statistically infrequent arrangement of n bytes for a numeric pattern selected from a range of numeric values and computer readable program code devices configured to cause a computer to effect marking said breakpoint in the digital sequence when the numeric pattern occurs according to the monitoring.
-
-
49. A method for partitioning a digital sequence comprising:
-
performing a mathematical function that produces statistically infrequent arrangement of n bytes on at least a portion of said digital sequence, the performing comprising; indexing bytes from the digital sequence into an array of alternative pre-determined bit sequences; performing a hash function on the pre-determined bit sequences, said hash function being combined with adjusting said hash function based on a length of a current partition of said digital sequence to increase a likelihood of identifying a breakpoint in said current partition as a potential length of said current partition increases; monitoring the bit sequences produced by the said hash function for a first predetermined bit pattern; and marking said breakpoint in said digital sequence when said first predetermined bit pattern occurs. - View Dependent Claims (50, 51)
-
-
52. A computer program product comprising:
-
computer readable program code devices configured to cause a computer to effect performing a mathematical function that produces statistically infrequent arrangement of n bytes on at least a portion of said digital sequence, the computer readable program code devices comprising additional computer readable program code devices configured to cause a computer to effect; indexing bytes from the digital sequence into an array of alternative pre-determined bit sequences; performing a hash function on the pre-determined bit sequences, said has function being combined with adjusting said hash function based on a length of a current partition of said digital sequence to increase a likelihood of identifying a breakpoint in said current partition as a potential length of said current partition increases; monitoring the bit sequences produced by the said hash function for a first predetermined bit pattern; and marking said breakpoint in said digital sequence when said first predetermined bit pattern occurs.
-
-
53. A method for partitioning a digital sequence comprising:
-
processing at least a portion of a digital sequence to produce a transformation; performing a rolling hash function on the transformation to produce binary sequences, said hash function being combined with adjusting said hash function based on a length of a current partition of said digital sequence to increase a likelihood of identifying a breakpoint in said current partition as a potential length of said current partition increases; monitoring said binary sequences produced by the hash function for a pattern of bits; and marking said breakpoint in the digital sequence when said first predetermined pattern occurs as determined by the monitoring. - View Dependent Claims (54, 55)
-
-
56. A computer program product comprising:
-
computer readable program code devices configured to cause a computer to effect processing at least a portion of a digital sequence to produce a transformation; computer readable program code devices configured to cause a computer to effect performing a rolling hash function on the transformation to produce binary sequences, said hash function being combined with adjusting said hash function based on a length of a current partition of said digital sequence to increase a likelihood of identifying a breakpoint in said current partition as a potential length of said current partition increases; computer readable program code devices configured to cause a computer to effect monitoring said binary sequences produced by the hash function for a pattern of bits; and computer readable program code devices configured to cause a computer to effect marking said breakpoint in the digital sequence when said first predetermined pattern occurs as determined by the monitoring. - View Dependent Claims (57, 58)
-
Specification