System and method for unorchestrated determination of data sequences using sticky byte factoring to determine breakpoints in digital sequences
First Claim
1. A method for partitioning a digital sequence comprising:
- performing a hash function on at least a portion of said digital sequence;
monitoring hash values produced by said hash function for a first predetermined numeric pattern found in a range of numeric values;
marking a breakpoint in said digital sequence when said first predetermined numeric pattern occurs;
determining a threshold restriction for said step of monitoring said hash values;
establishing at least a second predetermined numeric pattern found in a range of numeric values for said step of monitoring when said threshold restriction is met, wherein said second predetermined numeric pattern is a bit pattern and is a subset of said first predetermined numeric pattern; and
alternatively marking said breakpoint in said digital seguence when said second predetermined numeric pattern occurs.
14 Assignments
0 Petitions
Accused Products
Abstract
A system and method for unorchestrated determination of data sequences using “sticky byte” factoring to determine breakpoints in digital sequences such that common sequences can be identified. Sticky byte factoring provides an efficient method of dividing a data set into pieces that generally yields near optimal commonality. This is effectuated by employing a rolling hashsum and, in an exemplary embodiment disclosed herein, a threshold function to deterministically set divisions in a sequence of data. Both the rolling hash and the threshold function are designed to require minimal computation. This low overhead makes it possible to rapidly partition a data sequence for presentation to a factoring engine or other applications that prefer subsequent synchronization across the data set.
-
Citations
78 Claims
-
1. A method for partitioning a digital sequence comprising:
-
performing a hash function on at least a portion of said digital sequence;
monitoring hash values produced by said hash function for a first predetermined numeric pattern found in a range of numeric values;
marking a breakpoint in said digital sequence when said first predetermined numeric pattern occurs;
determining a threshold restriction for said step of monitoring said hash values;
establishing at least a second predetermined numeric pattern found in a range of numeric values for said step of monitoring when said threshold restriction is met, wherein said second predetermined numeric pattern is a bit pattern and is a subset of said first predetermined numeric pattern; and
alternatively marking said breakpoint in said digital seguence when said second predetermined numeric pattern occurs. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
determining a threshold restriction for said step of monitoring said hash values; and
increasing a probability of said marking of said breakpoint in said digital sequence.
-
-
13. The method of claim 12 wherein said step of increasing said probability of said marking of said breakpoint in said digital sequence is a function of at least a desired chunk size.
-
14. The method of claim 12 wherein said step of increasing said probability of said marking of said breakpoint in said digital sequence is a function of at least a length of a current portion of said digital sequence.
-
15. The method of claim 12 wherein said step of increasing said probability of said marking of said breakpoint in said digital sequence is carried out by the step of:
-
utilizing a second predetermined numeric pattern for said step of monitoring said hash values; and
alternatively marking said breakpoint when said second predetermined numeric pattern occurs.
-
-
16. The method of claim 12 wherein said step of increasing said probability of said marking of said breakpoint in said digital sequence is a function of some content portion of said sequence.
-
17. The method of claim 1 wherein said hash function is selected to produce preferential hashsums that do not uniformly cover the range of possible output values.
-
18. The method of claim 17, wherein said hash function selected to produce preferential hashsums uses an array of predetermined values indexed by the digital sequence to generate a distribution of hash values having a level of indirection from direct influence of the digital sequence.
-
19. The method of claim 1 wherein said hash function is modified dynamically during the execution of said hash function.
-
20. The method of claim 1 wherein said hash function includes the relative or absolute value of the current location as an input value in said hash function resulting in hash values that are affected by the relative or absolute value of the current location.
-
21. A computer program product comprising:
-
a computer usable medium having computer readable code embodied therein for partitioning a digital sequence comprising;
computer readable program code devices configured to cause a computer to effect performing a hash function on at least a portion of said digital sequence;
computer readable program code devices configured to cause a computer to effect monitoring hash values produced by said hash function for a first predetermined numeric pattern found in a range of numeric values;
computer readable program code devices configured to cause a computer to effect marking a breakpoint in said digital sequence when said first predetermined numeric pattern occurs;
computer readable program code devices configured to cause a computer to effect determining a threshold restriction for said step of monitoring said hash values;
computer readable program code devices configured to cause a computer to effect establishing at least a second predetermined numeric pattern found in a range of numeric values for said step of monitoring when said threshold restriction is met, wherein said second predetermined numeric pattern is a bit pattern and is a subset of said first predetermined numeric pattern; and
computer readable program code devices configured to cause a computer to effect alternatively marking said breakpoint in said digital sequence when said second predetermined numeric pattern occurs. - View Dependent Claims (22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35)
computer readable program code devices configured to cause a computer to effect determining a threshold restriction for said step of monitoring said hash values; and
computer readable program code devices configured to cause a computer to effect increasing a probability of said marking of said breakpoint in said digital sequence.
-
-
33. The computer program product of claim 32 wherein said computer readable program code devices configured to cause a computer to effect increasing said probability of said marking of said breakpoint in said digital sequence is a function of at least a desired chunk size.
-
34. The computer program product of claim 32 wherein said computer readable program code devices configured to cause a computer to effect increasing said probability of said marking of said breakpoint in said digital sequence is a function of at least a length of a current portion of said digital sequence.
-
35. The computer program product of claim 32 wherein said computer readable program code devices configured to cause a computer to effect increasing said probability of said marking of said breakpoint in said digital sequence is carried out by:
-
computer readable program code devices configured to cause a computer to effect utilizing a second predetermined numeric pattern found in a range of numeric values for said step of monitoring said hash values; and
computer readable program code devices configured to cause a computer to effect alternatively marking said breakpoint when said second predetermined numeric pattern occurs.
-
-
36. A method for partitioning a digital sequence comprising:
-
performing a hash function on at least a portion of said digital sequence;
monitoring hash values produced by said hash function for a first predetermined numeric pattern;
marking a breakpoint in said digital sequence when said first predetermined numeric pattern occurs;
determining a threshold restriction for said step of monitoring said hash values;
establishing at least a second predetermined numeric pattern for said step of monitoring when said threshold restriction is met; and
alternatively marking said breakpoint in said digital sequence when said second predetermined numeric pattern occurs, wherein said second predetermined numeric pattern is a bit pattern and a subset of said first predetermined numeric pattern. - View Dependent Claims (37, 38, 39, 40, 41, 42, 43, 44, 45, 46)
-
-
47. A computer program product comprising:
-
a computer usable medium having computer readable code embodied therein for partitioning a digital sequence comprising;
computer readable program code devices configured to cause a computer to effect performing a hash function on at least a portion of said digital sequence;
computer readable program code devices configured to cause a computer to effect monitoring hash values produced by said hash function for a first predetermined numeric pattern;
computer readable program code devices configured to cause a computer to effect marking a breakpoint in said digital sequence when said first predetermined numeric pattern occurs;
computer readable program code devices configured to cause a computer to effect determining a threshold restriction for said step of monitoring said hash values;
computer readable program code devices configured to cause a computer to effect establishing at least a second predetermined numeric pattern for said step of monitoring when said threshold restriction is met; and
computer readable program code devices configured to cause a computer to effect alternatively marking said breakpoint in said digital sequence when said second predetermined numeric pattern occurs, wherein said second predetermined numeric pattern is a bit pattern and a subset of said first predetermined numeric pattern. - View Dependent Claims (48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59)
computer readable program code devices configured to cause a computer to effect determining a threshold restriction for said step of monitoring said hash values; and
computer readable program code devices configured to cause a computer to effect increasing a probability of said marking of said breakpoint in said digital sequence.
-
-
57. The computer program product of claim 56 wherein said computer readable program code devices configured to cause a computer to effect increasing said probability of said marking of said breakpoint in said digital sequence is a function of at least a desired chunk size.
-
58. The computer program product of claim 56 wherein said computer readable program code devices configured to cause a computer to effect increasing said probability of said marking of said breakpoint in said digital sequence is a function of at least a length of a current portion of said digital sequence.
-
59. The computer program product of claim 56 wherein said computer readable program code devices configured to cause a computer to effect increasing said probability of said marking of said breakpoint in said digital sequence is carried out by:
-
computer readable program code devices configured to cause a computer to effect utilizing a second predetermined numeric pattern for said step of monitoring said hash values; and
computer readable program code devices configured to cause a computer to effect alternatively marking said breakpoint when said second predetermined numeric pattern occurs.
-
-
60. A method for partitioning a digital sequence comprising:
-
performing a hash function on at least a portion of said digital sequence;
monitoring hash values produced by said hash function for a first predetermined numeric pattern found in a range of numeric values; and
marking a breakpoint in said digital sequence when said first predetermined numeric pattern occurs;
wherein said hash function is selected to produce preferential hashsums that do not uniformly cover a range of possible output values. - View Dependent Claims (61, 62, 63, 64, 65, 66)
-
-
67. A method for partitioning a digital sequence comprising:
-
performing a hash function on at least a portion of said digital sequence, wherein said hash function is modified dynamically during the execution of said hash function;
monitoring hash values produced by said hash function for a first predetermined numeric pattern found in a range of numeric values; and
marking a breakpoint in said digital sequence when said first predetermined numeric pattern occurs. - View Dependent Claims (68, 69, 70, 71, 72)
-
-
73. A method for partitioning a digital sequence comprising:
-
performing a hash function on at least a portion of said digital sequence;
monitoring hash values produced by said hash function for a first predetermined numeric pattern found in a range of numeric values; and
marking a breakpoint in said digital sequence when said first predetermined numeric pattern occurs;
wherein said hash function includes the relative or absolute value of the current location as an input value in said hash function resulting in hash values that are affected by the relative or absolute value of the current location. - View Dependent Claims (75, 76, 77, 78)
-
-
74. The method of claim wherein said step of performing said hash function is carried out by means of a rolling hash function, the rolling hash function scanning portions of said digital sequence and progressively reducing an impact of more distant bits in said digital sequence.
Specification