Generating sketches sensitive to high-overlap estimation
First Claim
1. A computer-implemented method comprising:
- dividing, by a computer, a first collection of data objects into m groups of average size s, wherein a data object of the first collection is assigned to one or more of the m groups;
computing a combined hash result for all members of a respective group, for each hash function in n hash functions;
constructing a first sketch vector with n elements, wherein a respective element is selected, using a selection function, from the combined hash results computed with the hash function corresponding to the element'"'"'s index;
receiving a second sketch vector for a second collection of data objects;
determining a sketch-vector overlap between the first and second sketch vectors; and
computing a data-object overlap between the first and second collections of data objects based on the sketch-vector overlap, wherein computing the data-object overlap comprises entering the sketch-vector overlap into a conversion function;
data-object overlap=(sketch-vector overlap)1/s;
wherein s indicates an average number of data objects per group.
4 Assignments
0 Petitions
Accused Products
Abstract
A versioning system determines an amount by which a first collection and a second collection of data objects overlap. The system divides the first collection of data objects into m possibly overlapping groups of average size s and computes one combined hash result for each group. The system then constructs a first sketch vector with n elements based on the combined hash results. A respective element of the first sketch vector is selected, using a selection function, from the combined hash results that are computed with the hash function corresponding to the element'"'"'s index. Next, the system receives a second sketch vector for the second collection of data objects, and determines a sketch-vector overlap between the first and second sketch vectors. The system then computes a data-object overlap between the first and second collections of data objects based on the sketch-vector overlap.
12 Citations
15 Claims
-
1. A computer-implemented method comprising:
-
dividing, by a computer, a first collection of data objects into m groups of average size s, wherein a data object of the first collection is assigned to one or more of the m groups; computing a combined hash result for all members of a respective group, for each hash function in n hash functions; constructing a first sketch vector with n elements, wherein a respective element is selected, using a selection function, from the combined hash results computed with the hash function corresponding to the element'"'"'s index; receiving a second sketch vector for a second collection of data objects; determining a sketch-vector overlap between the first and second sketch vectors; and computing a data-object overlap between the first and second collections of data objects based on the sketch-vector overlap, wherein computing the data-object overlap comprises entering the sketch-vector overlap into a conversion function;
data-object overlap=(sketch-vector overlap)1/s;wherein s indicates an average number of data objects per group. - View Dependent Claims (2, 3, 4, 5)
-
-
6. A non-transitory computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method comprising:
-
dividing a first collection of data objects into m groups of average size s, wherein a data object of the first collection is assigned to one or more of the m groups; computing a combined hash result for all members of a respective group, for each hash function in n pair-wise independent hash functions; constructing a first sketch vector with n elements, wherein a respective element is selected, using a selection function, from the combined hash results computed with the hash function corresponding to the element'"'"'s index; receiving a second sketch vector for a second collection of data objects; determining a sketch-vector overlap between the first and second sketch vectors; and computing a data-object overlap between the first and second collections of data objects based on the sketch-vector overlap, wherein computing the data-object overlap comprises entering the sketch-vector overlap into a conversion function;
data-object overlap=(sketch-vector overlap)1/s;wherein s indicates an average number of data objects per group. - View Dependent Claims (7, 8, 9, 10)
-
-
11. An apparatus comprising:
-
a processor; a memory; a data-grouping mechanism to divide a first collection of data objects into m groups of average size s, wherein a data object of the first collection is assigned to one or more of the m groups; a hashing mechanism to compute a combined hash result for all members of a respective group, for each hash function in n pair-wise independent hash functions; a sketch-generating mechanism to construct a first sketch vector with n elements, wherein a respective element is selected, using a selection function, from the combined hash results computed with the hash function corresponding to the element'"'"'s index; a communication mechanism to receive a second sketch vector for a second collection of data objects; a comparison mechanism to determine a sketch-vector overlap between the first and second sketch vectors; and a computing mechanism to compute a data-object overlap between the first and second collections of data objects based on the sketch-vector overlap, wherein while computing the data-object overlap, the computing mechanism is further configured to enter the sketch-vector overlap into a conversion function;
data-object overlap=(sketch-vector overlap)1/s;wherein s indicates an average number of data objects per group. - View Dependent Claims (12, 13, 14, 15)
-
Specification