Consistent weighted sampling of multisets and distributions
First Claim
Patent Images
1. A method of determining a feature from a document comprising a set of features, the method comprising:
- assigning a weight S(x) to each feature in the document comprising the set of features; and
generating a sample in the form (x, y), wherein x is one of the features in the document comprising the set of features and y is a weight between 0 and the weight S(x) corresponding to that feature and wherein y is determined in part by producing a sequence of active indices and identifying a largest one of the active indices that is below the weight S(x) in part by computing log2(S(x)).
2 Assignments
0 Petitions
Accused Products
Abstract
Techniques are provided that identify near-duplicate items in large collections of items. A list of (value, frequency) pairs is received, and a sample (value, instance) is returned. The value is chosen from the values of the first list, and the instance is a value less than frequency, in such a way that the probability of selecting the same sample from two lists is equal to the similarity of the two lists.
-
Citations
18 Claims
-
1. A method of determining a feature from a document comprising a set of features, the method comprising:
-
assigning a weight S(x) to each feature in the document comprising the set of features; and generating a sample in the form (x, y), wherein x is one of the features in the document comprising the set of features and y is a weight between 0 and the weight S(x) corresponding to that feature and wherein y is determined in part by producing a sequence of active indices and identifying a largest one of the active indices that is below the weight S(x) in part by computing log2(S(x)). - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
-
-
10. A method of determining a feature from a document comprising a set of features, the method comprising:
-
assigning a weight S(x) to each feature in the document comprising the set of features; generating a sample in the form (x, y), wherein x is one of the features in the document comprising the set of features and y is a weight between 0 and the weight S(x) corresponding to that feature; and determining a plurality of indices that potentially enclose the sample at least in part by computing log2(S(x)), wherein determining the indices is based on intervals of powers of two; and determining which of the intervals of powers of two are empty using a vector comprising a plurality of bits, wherein each bit indicates whether a corresponding interval is empty, and avoiding determining the indices based on the intervals that are determined to be empty using the vector. - View Dependent Claims (11, 12, 13, 14)
-
-
15. A method of determining a feature from a document comprising a set of features, the method comprising:
-
assigning a weight S(x) to each feature in the document comprising the set of features; for each feature having a non-zero weight S(x), selecting a representative (x, y), where y is a positive weight value that is not greater than S(x), wherein selecting the positive weight value of y comprises producing a sequence of active indices, identifying a largest one of the active indices that is below the non-zero weight S(x) and a smallest one of the active indices that is above the non-zero weight S(x) at least in part by computing log2(S(x)), and selecting the identified largest one of the active indices that is below the non-zero weight S(x), as the positive weight value of y; for each representative (x, y), generating a hash value h(x, y); and
outputting only the representative (x, y) corresponding to a maximum hash value h(x, y). - View Dependent Claims (16, 17, 18)
-
Specification