Estimating similarity between two collections of information
First Claim
Patent Images
1. A method for estimating similarity between two collections of information, comprising:
- receiving a first collection of information and a second collection of information;
hashing data chunks of the first and second collections using a set of hash functions;
deriving k m-bit hash values from hash values determined from the hashing of the first collection of information, where k>
1 and m>
1;
determining an index for each of the k m-bit hash values;
using a computer processor and the indices for the k m-bit hash values to compare a first probabilistic data structure representing a first collection of information and a second probabilistic data structure representing a second collection of information;
using a computer processor to determine a measure of similarity between the first probabilistic data structure and the second probabilistic data structure based on the comparing; and
estimating similarity between the two collections of information from the determined measure of similarity for one of efficient data comparison and efficient data management of the two collections of information.
2 Assignments
0 Petitions
Accused Products
Abstract
A method for estimating similarity between two collections of information is described herein. The method includes comparing a first Bloom filter representing a first collection of information and a second Bloom filter representing a second collection of information, and determining a measure of similarity between the first collection of information and the second collection of information based on the comparing.
-
Citations
20 Claims
-
1. A method for estimating similarity between two collections of information, comprising:
-
receiving a first collection of information and a second collection of information; hashing data chunks of the first and second collections using a set of hash functions; deriving k m-bit hash values from hash values determined from the hashing of the first collection of information, where k>
1 and m>
1;determining an index for each of the k m-bit hash values; using a computer processor and the indices for the k m-bit hash values to compare a first probabilistic data structure representing a first collection of information and a second probabilistic data structure representing a second collection of information; using a computer processor to determine a measure of similarity between the first probabilistic data structure and the second probabilistic data structure based on the comparing; and estimating similarity between the two collections of information from the determined measure of similarity for one of efficient data comparison and efficient data management of the two collections of information. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 10, 11, 14, 15)
-
-
9. A computerized method for efficiently providing an information service to a customer, comprising:
-
constructing a first plurality of chunks of information based on a first collection of information; and constructing a second plurality of chunks of information based on a second collection of information for comparison with the first collection of information; hashing the plurality of chunks of information of the first and second collections using a set of hash functions; deriving k m-bit hash values from hash values determined from the hashing of the first collection of information, where k>
1 and m>
1;determining an index for each of the k m-bit hash values; constructing a first probabilistic data structure based on the first plurality of chunks of information; constructing a second probabilistic data structure based on the second plurality of chunks of information; and comparing in a computerized system, using the indices for the k m-bit hash values, the first probabilistic data structure and the second probabilistic data structure; and determining in the computerized system a measure of similarity between the first probabilistic data structure and the second probabilistic data structure based on the comparing; estimating similarity between the first collection of information and the second collection of information based on the determined measure of similarity as part of efficiently providing the information service to the customer. - View Dependent Claims (12, 13)
-
-
16. A computer readable storage device on which is stored program code for measuring similarity between two collections of information, the encoded program code comprising:
-
program code for receiving a first collection of information and second collection of information; program code for hashing data chunks of the first and second collections using a set of hash functions; program code for deriving k m-bit hash values from hash values determined from the hashing of the first collection of information, where k>
1 and m>
1;program code for determining an index for each of the k m-bit hash values; program code for comparing, using the indices for the k m-bit hash values, a first probabilistic data structure representing a first collection of information and a second probabilistic data structure representing a second collection of information; and program code for determining a measure of similarity between the first collection of information and the second collection of information based on the program code for comparing. - View Dependent Claims (17, 18, 19, 20)
-
Specification