Estimating similarity between two collections of information

US 7,702,683 B1
Filed: 09/18/2006
Issued: 04/20/2010
Est. Priority Date: 09/18/2006
Status: Active Grant

First Claim

1. A method for estimating similarity between two collections of information, comprising:

receiving a first collection of information and a second collection of information;

hashing data chunks of the first and second collections using a set of hash functions;

deriving k m-bit hash values from hash values determined from the hashing of the first collection of information, where k > 1 and m > 1;

determining an index for each of the k m-bit hash values;

using a computer processor and the indices for the k m-bit hash values to compare a first probabilistic data structure representing a first collection of information and a second probabilistic data structure representing a second collection of information;

using a computer processor to determine a measure of similarity between the first probabilistic data structure and the second probabilistic data structure based on the comparing; and

estimating similarity between the two collections of information from the determined measure of similarity for one of efficient data comparison and efficient data management of the two collections of information.
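
The hashing, derivation, and indexing steps of claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the use of SHA-256, the helper names `derive_indices` and `build_filter`, and the defaults k = 4 and m = 16 are all assumptions made for the example. Each chunk's digest is sliced into k values of m bits, and each value serves directly as an index into a bit array of 2^m bits, held here in a Python integer.

```python
import hashlib


def derive_indices(chunk: bytes, k: int = 4, m: int = 16) -> list[int]:
    """Derive k m-bit values from one hash of a chunk (hypothetical helper).

    Each m-bit value is used as an index into a filter of 2**m bits.
    """
    if k < 2 or m < 2:
        raise ValueError("the claim requires k > 1 and m > 1")
    if k * m > 256:
        raise ValueError("SHA-256 yields only 256 bits to slice")
    digest = int.from_bytes(hashlib.sha256(chunk).digest(), "big")
    mask = (1 << m) - 1
    # Take k consecutive, non-overlapping m-bit slices of the digest.
    return [(digest >> (i * m)) & mask for i in range(k)]


def build_filter(chunks, k: int = 4, m: int = 16) -> int:
    """Set the indexed bits for every chunk; the int acts as the bit array."""
    bits = 0
    for chunk in chunks:
        for index in derive_indices(chunk, k, m):
            bits |= 1 << index
    return bits
```

Because bits are only ever set, the filter for a union of two chunk sets equals the bitwise OR of their individual filters, which is what makes the later comparison steps cheap.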

Abstract

A method for estimating similarity between two collections of information is described herein. The method includes comparing a first Bloom filter representing a first collection of information and a second Bloom filter representing a second collection of information, and determining a measure of similarity between the first collection of information and the second collection of information based on the comparing.

20 Claims

1. A method for estimating similarity between two collections of information, comprising:
- receiving a first collection of information and a second collection of information;
  
  hashing data chunks of the first and second collections using a set of hash functions;
  
  deriving k m-bit hash values from hash values determined from the hashing of the first collection of information, where k > 1 and m > 1;
  
  determining an index for each of the k m-bit hash values;
  
  using a computer processor and the indices for the k m-bit hash values to compare a first probabilistic data structure representing a first collection of information and a second probabilistic data structure representing a second collection of information;
  
  using a computer processor to determine a measure of similarity between the first probabilistic data structure and the second probabilistic data structure based on the comparing; and
  
  estimating similarity between the two collections of information from the determined measure of similarity for one of efficient data comparison and efficient data management of the two collections of information.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 10, 11, 14, 15)
- - 2. The method of claim 1, wherein the first collection of information includes at least one data element, and the method further comprising:
    - applying a hash function to the at least one data element in the first collection of information to obtain a hash value.
  - 3. The method of claim 2, further comprising:
    - normalizing information in the at least one data element.
  - 4. The method of claim 2, wherein the first probabilistic data structure is a first Bloom filter that includes a bit length having a predetermined number of bits, and the method further comprising:
    - determining from the hash value a plurality of indices to the bits in the first Bloom filter; and
      
      ensuring that the indexed bits in the first Bloom filter are all set to a first bit value.
  - 5. The method of claim 1, wherein the comparing the first probabilistic data structure and the second probabilistic data structure comprises:
    - determining a first number of bits in the second probabilistic data structure that are set to a predetermined bit value;
      
      providing a third probabilistic data structure representing a union of the first probabilistic data structure and the second probabilistic data structure;
      
      determining a second number of bits in the third probabilistic data structure that are set to the predetermined bit value;
      
      computing a logarithm of a ratio of the first number of bits and the second number of bits; and
      
      determining, based on the computed logarithm, a first estimate of an amount of information in the first collection of information that is not contained in the second collection of information.
  - 6. The method of claim 5, wherein:
    - the comparing the first probabilistic data structure and the second probabilistic data structure further comprises determining a second estimate of a size of the first collection of information; and
      
      the determining the measure of similarity further comprises computing the measure of similarity between the first collection of information and the second collection of information based on both the first estimate and the second estimate.
  - 7. The method of claim 6, wherein the first probabilistic data structure is a first Bloom filter and the second probabilistic data structure is a second Bloom filter, and the determining the second estimate of the size of the first collection further comprises:
    - computing a logarithm of a ratio of a total number of bits in the first Bloom filter and a number of bits in the first Bloom filter that are set to the predetermined bit value.
  - 8. The method of claim 1, wherein the first collection of information includes one of:
    - books in a library;
      
      membership records of an organization;
      
      e-mail messages in an electronic mailbox;
      
      photographs in an album;
      
      TV shows stored in an electronic storage medium;
      
      objects in a room;
      
      vehicles observed at a location;
      
      web pages visited;
      
      songs played;
      
      files accessed;
      
      results of a database inquiry; and
      
      extractable chunks of information from one of, a textual document or file, a drawing, a chart, a presentation, a photographic image, a video image, an audio file, a compiled program, a web page, and a description of a configuration of a system or application.
  - 10. The method of claim 1, wherein the first collection of information includes a document representing one of a textual document, a photographic image, a video image, a drawing, an audio file, a chart, a configuration description, a presentation, a web page, and a compiled program.
  - 11. The method of claim 10, wherein the constructing a first plurality of chunks comprises determining a first boundary of one of the plurality of chunks within the first document by one of:
    - noting a predetermined distance of the first boundary from a prior boundary;
      
      searching for a predetermined character sequence;
      
      searching for a match to a predetermined regular expression; and
      
      computing a function over the contents of a sliding window within the first document.
  - 14. The method of claim 10, wherein the second collection of information comprises a second document constructed from the second plurality of chunks of information, and the first document and the second document are contained within a result returned in response to a search query, and wherein a determination is made to not present the second document based on the measure of similarity.
  - 15. The method of claim 10, wherein the second collection of information comprises a second document constructed from the second plurality of chunks of information, and the method further comprising:
    - determining that the first document is largely contained within the second document based on the measure of similarity.
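
The comparison in claims 5 through 7 can be sketched as below, reading the "predetermined bit value" as the unset value (0), which is the reading under which the logarithms come out positive. This is an interpretation, not text from the patent. The function names are hypothetical, and both filters are assumed to be Python integers used as bit arrays of the same `length`.

```python
import math


def zeros(bits: int, length: int) -> int:
    """Count the unset bits in a filter of the given length."""
    return length - bin(bits).count("1")


def fraction_not_contained(a: int, b: int, length: int) -> float:
    """Estimate the fraction of collection A that is absent from B.

    a, b: Bloom filters stored as ints; length: bits per filter.
    """
    z_a = zeros(a, length)
    z_b = zeros(b, length)
    z_union = zeros(a | b, length)  # third structure: union of A and B
    if z_a == length:
        return 0.0  # A is empty, so nothing is missing from B
    if z_union == 0 or z_a == 0:
        raise ValueError("filter is saturated; estimate is undefined")
    # Claim 5: log of (zeros in B / zeros in union), proportional to |A \ B|.
    diff = math.log(z_b / z_union)
    # Claim 7: log of (total bits / zeros in A), proportional to |A|.
    size = math.log(length / z_a)
    # Claim 6: combine both estimates; the shared scale factor cancels.
    return diff / size
```

The result is an estimate rather than an exact fraction, so values slightly above 1 can occur for small or nearly disjoint filters; a caller may wish to clamp it to the range 0 to 1.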

9. A computerized method for efficiently providing an information service to a customer, comprising:
- constructing a first plurality of chunks of information based on a first collection of information; and
  
  constructing a second plurality of chunks of information based on a second collection of information for comparison with the first collection of information;
  
  hashing the plurality of chunks of information of the first and second collections using a set of hash functions;
  
  deriving k m-bit hash values from hash values determined from the hashing of the first collection of information, where k > 1 and m > 1;
  
  determining an index for each of the k m-bit hash values;
  
  constructing a first probabilistic data structure based on the first plurality of chunks of information;
  
  constructing a second probabilistic data structure based on the second plurality of chunks of information; and
  
  comparing in a computerized system, using the indices for the k m-bit hash values, the first probabilistic data structure and the second probabilistic data structure; and
  
  determining in the computerized system a measure of similarity between the first probabilistic data structure and the second probabilistic data structure based on the comparing;
  
  estimating similarity between the first collection of information and the second collection of information based on the determined measure of similarity as part of efficiently providing the information service to the customer.
- View Dependent Claims (12, 13)
- - 12. The method of claim 9, wherein the first probabilistic data structure is a first Bloom filter, the second probabilistic data structure is a second Bloom filter, and the first collection of information comprises a search query received from the customer.
  - 13. The method of claim 9, wherein the first collection of information and the second collection of information are received from the customer.
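
The chunk-construction step of claim 9 leaves the chunking rule open; claim 11 names several boundary rules. Three of them are sketched below under stated assumptions: the function names, the blank-line regular expression, and the byte-sum window function with its `mask` are illustrative choices, not anything specified by the patent.

```python
import re


def chunks_fixed(data: bytes, size: int) -> list[bytes]:
    """Boundary at a predetermined distance from the prior boundary."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def chunks_regex(text: str, pattern: str = r"\n\s*\n") -> list[str]:
    """Boundary wherever a predetermined regular expression matches."""
    return [c for c in re.split(pattern, text) if c]


def chunks_window(data: bytes, window: int = 4, mask: int = 0x0F) -> list[bytes]:
    """Boundary where a function of a sliding window hits a target value.

    The function here is a plain byte sum, chosen only for readability.
    """
    out, start = [], 0
    for end in range(window, len(data) + 1):
        long_enough = end - start >= window
        if long_enough and sum(data[end - window:end]) & mask == 0:
            out.append(data[start:end])
            start = end
    if start < len(data):
        out.append(data[start:])  # trailing bytes form the last chunk
    return out
```

Window-based boundaries depend only on nearby content, so an insertion early in a document shifts chunk positions without changing most chunk contents. That property is why such rules suit similarity estimation better than fixed offsets.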

16. A computer readable storage device on which is stored program code for measuring similarity between two collections of information, the encoded program code comprising:
- program code for receiving a first collection of information and second collection of information;
  
  program code for hashing data chunks of the first and second collections using a set of hash functions;
  
  program code for deriving k m-bit hash values from hash values determined from the hashing of the first collection of information, where k > 1 and m > 1;
  
  program code for determining an index for each of the k m-bit hash values;
  
  program code for comparing, using the indices for the k m-bit hash values, a first probabilistic data structure representing a first collection of information and a second probabilistic data structure representing a second collection of information; and
  
  program code for determining a measure of similarity between the first collection of information and the second collection of information based on the program code for comparing.
- View Dependent Claims (17, 18, 19, 20)
- - 17. The computer readable storage device of claim 16, wherein the encoded program code further comprising:
    - program code for generating the first probabilistic data structure based on the hash values for the first collection of information.
  - 18. The computer readable storage device of claim 16, wherein the encoded program code further comprising:
    - program code for representing the first probabilistic data structure as a first Bloom filter having a first array of bits having a predetermined bit length and the second probabilistic data structure as a second Bloom filter having a second array of bits having the predetermined bit length; and
      
      program code for setting bits in the first array of bits, which correspond to the hash values for the first collection of information, to a first bit value;
      
      program code for setting bits in the second array of bits, which correspond to the hash values for the second collection of information, to the first bit value; and
      
      the program code for comparing the first probabilistic data structure to the probabilistic data structure comprises;
      
      program code for determining a first number of bits in the first array of bits that are set to a second bit value different from the first bit value;
      
      program code for determining a second number of bits that are set to the second bit value in both the first and second array of bits;
      
      program code for determining a third number of bits in the second array of bits that are set to the second bit value;
      
      program code for computing a first logarithm of a ratio of the first number of bits and the second number of bits;
      
      program code for computing a second logarithm of a ratio of the predetermined bit length and the third number of bits; and
      
      program code for computing a ratio of the first computed logarithm and the second computed logarithm as a measure of difference between the first collection of information and the second collection of information.
  - 19. The computer readable storage device of claim 18, wherein the measure of difference provides an estimate of a percentage of the second collection of information that does not appear in the first collection of information.
  - 20. The computer readable storage device of claim 18, wherein the program code for determining the measure of similarity comprises:
    - program code for subtracting the ratio of the first computed logarithm and the second computed logarithm from a value of one to provide the measure of similarity between the first collection of information and the second collection of information.
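
The ratio of logarithms in claims 18 through 20 can be motivated by the standard Bloom filter occupancy approximation. The derivation below is a sketch under that assumption (independent, uniformly distributed indices) and is not quoted from the patent. The symbols are introduced here for illustration: N is the predetermined bit length, k the number of indices per element, z_1 and z_2 the counts of unset bits in the first and second arrays, z_12 the count of bits unset in both, and S_1, S_2 the two collections.

```latex
% Expected unset bits after inserting n elements with k indices each:
z \approx N\left(1 - \tfrac{1}{N}\right)^{kn} \approx N e^{-kn/N}
\quad\Longrightarrow\quad
n \approx \frac{N}{k}\,\ln\frac{N}{z}

% Sizes of the second collection and of the union:
|S_2| \approx \frac{N}{k}\,\ln\frac{N}{z_2},
\qquad
|S_1 \cup S_2| \approx \frac{N}{k}\,\ln\frac{N}{z_{12}}

% Elements of the second collection missing from the first:
|S_2 \setminus S_1| = |S_1 \cup S_2| - |S_1|
\approx \frac{N}{k}\,\ln\frac{z_1}{z_{12}}

% Measure of difference (claims 18 and 19) and similarity (claim 20):
D = \frac{\ln\left(z_1 / z_{12}\right)}{\ln\left(N / z_2\right)}
\approx \frac{|S_2 \setminus S_1|}{|S_2|},
\qquad
\text{similarity} = 1 - D
```

The factor N/k cancels in the ratio, so under this approximation the measure can be computed from the three bit counts and the bit length alone.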

Current Assignee: Hewlett Packard Enterprise Development LP (Hewlett-Packard Enterprise Company)
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Kirshenbaum, Evan R.
Primary Examiner(s): Mahmoudi; Tony
Assistant Examiner(s): Uddin; Md. I
Application Number: US11/522,656
Time in Patent Office: 1,310 Days
Field of Search: 707/3, 707/E17.042, 707/4, 707/5, 707/6, 707/100, 707/102
US Class Current: 707/758
CPC Class Codes: G06F 16/152 using file content signatur...
