Consistent weighted sampling of multisets and distributions
First Claim
Patent Images
1. A method of determining an element from a set of elements, comprising:
- assigning a weight S(x) to each element x in the set of elements S; and
generating a sample in the form (x, y), wherein x is one of the elements in the set and y is a weight between 0 and the weight S(x) corresponding to that element.
2 Assignments
0 Petitions
Accused Products
Abstract
Techniques are provided that identify near-duplicate items in large collections of items. A list of (value, frequency) pairs is received, and a sample (value, instance) is returned. The value is chosen from the values of the first list, and the instance is a value less than frequency, in such a way that the probability of selecting the same sample from two lists is equal to the similarity of the two lists.
-
Citations
20 Claims
-
1. A method of determining an element from a set of elements, comprising:
-
assigning a weight S(x) to each element x in the set of elements S; and generating a sample in the form (x, y), wherein x is one of the elements in the set and y is a weight between 0 and the weight S(x) corresponding to that element. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
-
-
10. A method of determining an element from a set of elements, comprising:
-
assigning a weight S(x) to each element x in the set of elements S; generating a sample in the form (x, y), wherein x is one of the elements in the set and y is a weight between 0 and the weight S(x) corresponding to that element; and determining a plurality of indices that potentially enclose the sample. - View Dependent Claims (11, 12, 13, 14, 15, 16)
-
-
17. A method of determining an element from a set of elements, comprising:
-
assigning a weight S(x) to each element x in the set of elements S; for each element, determining a largest element y below, and a least element z above, the element x; and generating a hash value for the element x; and outputting the x and y corresponding to a maximum hash value. - View Dependent Claims (18, 19, 20)
-
Specification