GENERATING SKETCHES SENSITIVE TO HIGH-OVERLAP ESTIMATION
First Claim
1. A computer-implemented method comprising:
- dividing, by a computer, a first collection of data objects into m groups of average size s, wherein a data object of the first collection is assigned to one or more of the m groups;
computing a combined hash result for all members of a respective group, for each hash function in n hash functions;
constructing a first sketch vector with n elements, wherein a respective element is selected, using a selection function, from the combined hash results computed with the hash function corresponding to the element'"'"'s index;
receiving a second sketch vector for a second collection of data objects;
determining a sketch-vector overlap between the first and second sketch vectors; and
computing a data-object overlap between the first and second collections of data objects based on the sketch-vector overlap.
4 Assignments
0 Petitions
Accused Products
Abstract
A versioning system determines an amount by which a first collection and a second collection of data objects overlap. The system divides the first collection of data objects into m possibly overlapping groups of average size s and computes one combined hash result for each group. The system then constructs a first sketch vector with n elements based on the combined hash results. A respective element of the first sketch vector is selected, using a selection function, from the combined hash results that are computed with the hash function corresponding to the element'"'"'s index. Next, the system receives a second sketch vector for the second collection of data objects, and determines a sketch-vector overlap between the first and second sketch vectors. The system then computes a data-object overlap between the first and second collections of data objects based on the sketch-vector overlap.
-
Citations
18 Claims
-
1. A computer-implemented method comprising:
-
dividing, by a computer, a first collection of data objects into m groups of average size s, wherein a data object of the first collection is assigned to one or more of the m groups; computing a combined hash result for all members of a respective group, for each hash function in n hash functions; constructing a first sketch vector with n elements, wherein a respective element is selected, using a selection function, from the combined hash results computed with the hash function corresponding to the element'"'"'s index; receiving a second sketch vector for a second collection of data objects; determining a sketch-vector overlap between the first and second sketch vectors; and computing a data-object overlap between the first and second collections of data objects based on the sketch-vector overlap. - View Dependent Claims (2, 3, 4, 5, 6)
-
-
7. A non-transitory computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method comprising:
-
dividing a first collection of data objects into m groups of average size s, wherein a data object of the first collection is assigned to one or more of the m groups; computing a combined hash result for all members of a respective group, for each hash function in n pair-wise independent hash functions; constructing a first sketch vector with n elements, wherein a respective element is selected, using a selection function, from the combined hash results computed with the hash function corresponding to the element'"'"'s index; receiving a second sketch vector for a second collection of data objects; determining a sketch-vector overlap between the first and second sketch vectors; and computing a data-object overlap between the first and second collections of data objects based on the sketch-vector overlap. - View Dependent Claims (8, 9, 10, 11, 12)
-
-
13. An apparatus comprising:
-
a data-grouping mechanism to divide a first collection of data objects into m groups of average size s, wherein a data object of the first collection is assigned to one or more of the m groups; a hashing mechanism to compute a combined hash result for all members of a respective group, for each hash function in n pair-wise independent hash functions; a sketch-generating mechanism to construct a first sketch vector with n elements, wherein a respective element is selected, using a selection function, from the combined hash results computed with the hash function corresponding to the element'"'"'s index; a communication mechanism to receive a second sketch vector for a second collection of data objects; a comparison mechanism to determine a sketch-vector overlap between the first and second sketch vectors; and a computing mechanism to compute a data-object overlap between the first and second collections of data objects based on the sketch-vector overlap. - View Dependent Claims (14, 15, 16, 17, 18)
-
Specification