Consistent weighted sampling of multisets and distributions

US 7,716,144 B2
Filed: 03/22/2007
Issued: 05/11/2010
Est. Priority Date: 03/22/2007
Status: Expired due to Fees

Abstract

Techniques are provided that identify near-duplicate items in large collections of items. A list of (value, frequency) pairs is received, and a sample (value, instance) is returned. The value is chosen from the values of the first list, and the instance is a value less than frequency, in such a way that the probability of selecting the same sample from two lists is equal to the similarity of the two lists.
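
The abstract does not say which similarity measure is meant. A minimal sketch, assuming it is the weighted (generalized) Jaccard similarity, that is, the sum of per-value minimum frequencies divided by the sum of per-value maximum frequencies; the function name and the dict-based input format are illustrative choices, not taken from the patent:

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard similarity of two {value: frequency} mappings.

    Assumed measure: sum of per-value minima over sum of per-value maxima.
    Missing values count as frequency 0.
    """
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    # Convention chosen here: two empty lists have similarity 0.
    return num / den if den > 0 else 0.0


doc_a = {"a": 2, "b": 1}
doc_b = {"a": 1, "c": 1}
similarity = weighted_jaccard(doc_a, doc_b)  # minima sum to 1, maxima sum to 4
```

Under this reading, a consistent sampler should return the same (value, instance) pair for doc_a and doc_b about a quarter of the time, taken over independent choices of the underlying hash function.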

18 Claims

1. A method of determining a feature from a document comprising a set of features, the method comprising:
- assigning a weight S(x) to each feature in the document comprising the set of features; and
  
  generating a sample in the form (x, y), wherein x is one of the features in the document comprising the set of features and y is a weight between 0 and the weight S(x) corresponding to that feature and wherein y is determined in part by producing a sequence of active indices and identifying a largest one of the active indices that is below the weight S(x) in part by computing log₂(S(x)).
- Dependent claims:
  - 2. The method of claim 1, wherein generating the sample comprises selecting the feature with a probability proportional to the weight S(x) corresponding to that feature.
  - 3. The method of claim 2, further comprising uniformly choosing y.
  - 4. The method of claim 1, further comprising obtaining the set of features in response to a search engine query.
  - 5. The method of claim 1, further comprising outputting the sample.
  - 6. The method of claim 1, further comprising generating a hash value for the sample.
  - 7. The method of claim 1, further comprising repeating the generating to obtain a plurality of samples.
  - 8. The method of claim 7, further comprising generating a hash value for each of the samples.
  - 9. The method of claim 8, further comprising outputting only the sample that has the greatest hash value.
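
The behaviour described in claims 2, 3 and 6 to 9 can be illustrated with a brute-force baseline. This is a sketch under stated assumptions, not the claimed active-index method: it assumes integer weights, expands each feature x into the pairs (x, 1) ... (x, S(x)), hashes every pair, and keeps the pair with the greatest hash value. The names pair_hash and naive_consistent_sample, the seed parameter, and the use of SHA-256 are hypothetical choices made for illustration.

```python
import hashlib


def pair_hash(seed, x, y):
    """Deterministic 64-bit hash of (seed, x, y); SHA-256 is an arbitrary choice."""
    data = f"{seed}|{x}|{y}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def naive_consistent_sample(weights, seed=0):
    """Return the pair (x, y), 1 <= y <= S(x), with the greatest hash value.

    Assumes integer weights. Cost is proportional to the total weight,
    which is the expense the claimed method is meant to avoid.
    """
    best_pair, best_hash = None, -1
    for x, s in weights.items():
        for y in range(1, s + 1):
            h = pair_hash(seed, x, y)
            if h > best_hash:  # keep only the sample with the greatest hash
                best_pair, best_hash = (x, y), h
    return best_pair


doc = {"apple": 3, "pear": 1}
sample = naive_consistent_sample(doc, seed=42)
```

Because every pair is equally likely to carry the greatest hash, x is selected with probability proportional to S(x) and y is uniform within the chosen feature. Two documents that both contain the winning pair return the same sample, which is the consistency property.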

10. A method of determining a feature from a document comprising a set of features, the method comprising:
- assigning a weight S(x) to each feature in the document comprising the set of features;
  
  generating a sample in the form (x, y), wherein x is one of the features in the document comprising the set of features and y is a weight between 0 and the weight S(x) corresponding to that feature; and
  
  determining a plurality of indices that potentially enclose the sample at least in part by computing log₂(S(x)), wherein determining the indices is based on intervals of powers of two; and
  
  determining which of the intervals of powers of two are empty using a vector comprising a plurality of bits, wherein each bit indicates whether a corresponding interval is empty, and avoiding determining the indices based on the intervals that are determined to be empty using the vector.
- Dependent claims:
  - 11. The method of claim 10, further comprising determining a lower index y and an upper index z that enclose the sample from the plurality of indices.
  - 12. The method of claim 11, further comprising generating a hash value of the sample.
  - 13. The method of claim 12, wherein the hash value is independent of y and consistent.
  - 14. The method of claim 12, wherein generating the hash value comprises producing the hash value from a cumulative density function.
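
Claim 10 combines two ideas: locating the power-of-two interval that contains S(x) by computing log₂(S(x)), and consulting a bit vector so that empty intervals are skipped. A minimal sketch of that lookup follows. The bit layout is an assumption: bit (k + offset) is set when the interval (2^(k-1), 2^k] is non-empty, which is the inverse of the claim's wording and is chosen only so that the highest set bit can be read off directly. How the bits are generated is outside the sketch, and both function names are hypothetical.

```python
import math


def interval_index(s):
    """Return k such that 2**(k-1) < s <= 2**k, found via log2(s)."""
    if s <= 0:
        raise ValueError("weight must be positive")
    return math.ceil(math.log2(s))


def nearest_nonempty_at_or_below(bits, k, offset=0):
    """Return the largest j <= k whose interval is marked non-empty, or None.

    Assumed layout: bit (j + offset) of `bits` is 1 when the interval
    (2**(j-1), 2**j] is non-empty.
    """
    pos = k + offset
    if pos < 0:
        return None
    masked = bits & ((1 << (pos + 1)) - 1)  # keep bits 0..pos only
    if masked == 0:
        return None  # every interval at or below k is empty
    return masked.bit_length() - 1 - offset  # highest remaining set bit


bits = 0b100101             # intervals 0, 2 and 5 marked non-empty
k = interval_index(12.0)    # 12 lies in (8, 16], so k is 4
j = nearest_nonempty_at_or_below(bits, k)  # intervals 4 and 3 are skipped
```

Masking and reading the highest set bit takes a constant number of integer operations, so no work is spent on the intervals that the vector marks as empty.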

15. A method of determining a feature from a document comprising a set of features, the method comprising:
- assigning a weight S(x) to each feature in the document comprising the set of features;
  
  for each feature having a non-zero weight S(x), selecting a representative (x, y), where y is a positive weight value that is not greater than S(x), wherein selecting the positive weight value of y comprises producing a sequence of active indices, identifying a largest one of the active indices that is below the non-zero weight S(x) and a smallest one of the active indices that is above the non-zero weight S(x) at least in part by computing log₂(S(x)), and selecting the identified largest one of the active indices that is below the non-zero weight S(x), as the positive weight value of y;
  
  for each representative (x, y), generating a hash value h(x, y); and
  
  outputting only the representative (x, y) corresponding to a maximum hash value h(x, y).
- Dependent claims:
  - 16. The method of claim 15, wherein generating the hash value comprises producing the hash value from a cumulative density function based on z and a random number.
  - 17. The method of claim 15, further comprising after generating the hash value, comparing the hash value to a previously stored maximum hash value, and if the hash value is greater than the previously stored hash value, then storing the hash value as the maximum hash value.
  - 18. The method of claim 15, further comprising deferring the determination of the largest one of the active indices that is below the non-zero weight S(x) until immediately before outputting only the representative (x, y) corresponding to the maximum hash value h(x, y).
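
The claims do not define "active index". The sketch below assumes one reading, for integer weights only: index j is active for feature x when its hash value exceeds the hash of every smaller index. Under that assumption it finds the enclosing pair (y, z) by brute force and then applies the running-maximum comparison of claim 17. The helper names, the seed parameter and the use of SHA-256 are hypothetical, and the log₂-based shortcut of the claims is deliberately not reproduced.

```python
import hashlib


def unit_hash(seed, x, j):
    """Deterministic 64-bit hash of (seed, x, j); SHA-256 is an arbitrary choice."""
    data = f"{seed}|{x}|{j}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def active_indices(seed, x, limit):
    """Indices in 1..limit whose hash beats every smaller index (assumed definition)."""
    active, best = [], -1
    for j in range(1, limit + 1):
        h = unit_hash(seed, x, j)
        if h > best:
            active.append(j)
            best = h
    return active


def enclose(active, s):
    """Return (y, z): largest active index <= s and smallest active index > s."""
    y = max((j for j in active if j <= s), default=None)
    z = min((j for j in active if j > s), default=None)
    return y, z


def select_representative(weights, seed=0):
    """One representative (x, y) per non-zero weight; output only the max-hash one."""
    best_pair, best_hash = None, -1
    for x, s in weights.items():
        if s <= 0:
            continue  # only features with non-zero weight get a representative
        y, _ = enclose(active_indices(seed, x, s), s)
        h = unit_hash(seed, x, y)
        if h > best_hash:  # compare against the stored maximum, as in claim 17
            best_pair, best_hash = (x, y), h
    return best_pair


weights = {"a": 5, "b": 3, "c": 0}
chosen = select_representative(weights, seed=7)
```

Under this assumed definition, the largest active index not exceeding S(x) is the index with the greatest hash among 1..S(x), so the result matches what an exhaustive scan over all (x, y) pairs would return.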

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Talwar, Kunal; McSherry, Frank D.; Manasse, Mark Steven
Primary Examiner(s): Vincent, David R
Assistant Examiner(s): Kim, David H
Application Number: US11/726,644
Publication Number: US 20080235201A1
Time in Patent Office: 1,146 Days
Field of Search: 706/15, 707/101
US Class Current: 706/12
CPC Class Codes: G06F 40/194 Calculation of difference b...
