Estimating similarity between two collections of information

  • US 7,702,683 B1
  • Filed: 09/18/2006
  • Issued: 04/20/2010
  • Est. Priority Date: 09/18/2006
  • Status: Active Grant
First Claim

1. A method for estimating similarity between two collections of information, comprising:

  • receiving a first collection of information and a second collection of information;

    hashing data chunks of the first and second collections using a set of hash functions;

    deriving k m-bit hash values from hash values determined from the hashing of the first collection of information, where k > 1 and m > 1;

    determining an index for each of the k m-bit hash values;

    using a computer processor and the indices for the k m-bit hash values to compare a first probabilistic data structure representing a first collection of information and a second probabilistic data structure representing a second collection of information;

    using a computer processor to determine a measure of similarity between the first probabilistic data structure and the second probabilistic data structure based on the comparing; and

    estimating similarity between the two collections of information from the determined measure of similarity for one of efficient data comparison and efficient data management of the two collections of information.
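
The steps of the claim can be illustrated with a minimal sketch. It is not the patented implementation. The fixed-size chunking, the use of SHA-256 as the only hash function, the values of K, M and CHUNK, the choice of a bit-array (Bloom-filter-style) structure as the probabilistic data structure, and the ratio of shared set bits to total set bits as the measure of similarity are all assumptions made for illustration. All function names are hypothetical.

```python
import hashlib

K = 4      # assumed number of m-bit values derived per chunk (claim requires k > 1)
M = 12     # assumed width in bits of each derived value (claim requires m > 1)
CHUNK = 8  # assumed fixed chunk size in bytes

def chunks(data: bytes, size: int = CHUNK):
    """Split a collection into fixed-size data chunks (assumed chunking scheme)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def derive_indices(chunk: bytes, k: int = K, m: int = M):
    """Hash a chunk, then derive k m-bit values from the digest.

    Each m-bit value is used directly as an index into a 2**m-bit array.
    """
    digest = int.from_bytes(hashlib.sha256(chunk).digest(), "big")
    mask = (1 << m) - 1
    return [(digest >> (i * m)) & mask for i in range(k)]

def build_filter(data: bytes) -> int:
    """Build a bit array (stored as a Python int) representing one collection."""
    bits = 0
    for c in chunks(data):
        for idx in derive_indices(c):
            bits |= 1 << idx
    return bits

def similarity(a: bytes, b: bytes) -> float:
    """Compare two bit arrays: shared set bits divided by total set bits."""
    fa, fb = build_filter(a), build_filter(b)
    union = bin(fa | fb).count("1")
    if union == 0:
        return 1.0  # two empty collections are treated as identical
    return bin(fa & fb).count("1") / union
```

Calling similarity(x, x) on any input returns 1.0, and two collections that share only some chunks produce a value between 0 and 1. Because the comparison works on the compact bit arrays rather than on the raw data, its cost depends on the array size (2**M bits here) and not on the size of the collections.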
