Data partitioning via bucketing bloom filters
First Claim
1. One or more computer storage media having stored thereon instructions that, when executed by one or more processors of a computing device, cause the one or more processors to:
- identify a first data set of elements and a second data set of elements, wherein the first data set and the second data set are disjoint;
assign each element in the first data set to a bucket of a first set of buckets;
assign each element in the second data set to a bucket of a second set of buckets; and
generate, for each bucket of the first set of buckets, a Bloom filter that indicates that each element assigned to the bucket of the first set of buckets is part of the first data set, and that indicates that each element assigned to a corresponding bucket of the second set of buckets is not part of the first data set,wherein to assign each element in the first data set to the bucket of the first set of buckets is to, for each element in the first data set;
generate a hash value for the element by applying a hash function to the element; and
assign the element to one bucket of the first set of multiple buckets that corresponds to the hash value for the element; and
to assign each element in the second data set to the bucket of the second set of buckets is to, for each element in the second data set;
generate a hash value for the element by applying the hash function to the element; and
assign the element to one bucket of the second set of multiple buckets that corresponds to the hash value for the element.
2 Assignments
0 Petitions
Accused Products
Abstract
Multiple Bloom filters are generated to partition data between first and second disjoint data sets of elements. Each element in the first data set is assigned to a bucket of a first set of buckets, and each element in the second data set is assigned to a bucket of a second set of buckets. A Bloom filter is generated for each bucket of the first set of buckets. The Bloom filter generated for a bucket indicates that each element assigned to that bucket is part of the first data set, and that each element assigned to a corresponding bucket of the second set of buckets is not part of the first data set. Additionally, a Bloom filter corresponding to a subsequently received element can be determined and used to identify whether that subsequently received element is part of the first data set or the second data set.
37 Citations
17 Claims
-
1. One or more computer storage media having stored thereon instructions that, when executed by one or more processors of a computing device, cause the one or more processors to:
-
identify a first data set of elements and a second data set of elements, wherein the first data set and the second data set are disjoint; assign each element in the first data set to a bucket of a first set of buckets; assign each element in the second data set to a bucket of a second set of buckets; and generate, for each bucket of the first set of buckets, a Bloom filter that indicates that each element assigned to the bucket of the first set of buckets is part of the first data set, and that indicates that each element assigned to a corresponding bucket of the second set of buckets is not part of the first data set, wherein to assign each element in the first data set to the bucket of the first set of buckets is to, for each element in the first data set; generate a hash value for the element by applying a hash function to the element; and assign the element to one bucket of the first set of multiple buckets that corresponds to the hash value for the element; and to assign each element in the second data set to the bucket of the second set of buckets is to, for each element in the second data set; generate a hash value for the element by applying the hash function to the element; and assign the element to one bucket of the second set of multiple buckets that corresponds to the hash value for the element. - View Dependent Claims (2, 3, 6, 7, 8, 9)
-
-
4. One or more computer storage media having stored thereon instructions that, when executed by one or more processors of a computing device, cause the one or more processors to:
-
identify a first data set of elements and a second data set of elements, wherein the first data set and the second data set are disjoint; assign each element in the first data set to a bucket of a first set of buckets; assign each element in the second data set to a bucket of a second set of buckets; and generate, for each bucket of the first set of buckets, a Bloom filter that indicates that each element assigned to the bucket of the first set of buckets is part of the first data set, and that indicates that each element assigned to a corresponding bucket of the second set of buckets is not part of the first data set, wherein to generate the Bloom filter for a first bucket of the first set of buckets is to; generate a temporary Bloom filter for the first bucket; for each of one or more elements in a second bucket of the second set of buckets, the first bucket corresponding to the second bucket, check whether the temporary Bloom filter returns a positive result for the element, the positive result indicating that the element was memorized by the temporary Bloom filter; if the temporary Bloom filter returns the positive result for none of the elements of the second bucket, then use the temporary Bloom filter as the Bloom filter for the first bucket; and if the temporary Bloom filter returns the positive result for at least one of the elements of the second bucket, then repeatedly generate new temporary Bloom filters until a temporary Bloom filter that returns the positive result for none of the elements of the second bucket is generated. - View Dependent Claims (5)
-
-
10. A method of creating a set of Bloom filters to partition data into two disjoint data sets, the method comprising:
-
assigning elements of each of the two disjoint data sets to a bucket of a corresponding set of buckets; and generating the set of Bloom filters, each Bloom filter corresponding to one bucket corresponding to a first data set of the two disjoint data sets, each Bloom filter indicating that each element in the first data set is part of the first data set, and further indicating that each element in a second data set of the two data sets is not part of the first data set, wherein assigning elements of each of the two disjoint data sets to a bucket of a corresponding set of buckets comprises; for each element in the first data set; generating a hash value for the element by applying a hash function to the element; and assigning the element to one bucket of a first set of buckets corresponding to the first data set, the one bucket corresponding to the hash value for the element; and for each element in the second data set; generating a hash value for the element by applying the hash function to the element; and assigning the element to one bucket of a second set of buckets corresponding to the second data set, the one bucket corresponding to the hash value for the element. - View Dependent Claims (11, 12, 13)
-
-
14. A method of creating a set of Bloom filters to partition data into two disjoint data sets, the method comprising:
-
assigning elements of each of the two disjoint data sets to a bucket of a corresponding set of buckets; and generating the set of Bloom filters, each Bloom filter corresponding to one bucket corresponding to a first data set of the two disjoint data sets, each Bloom filter indicating that each element in the first data set is part of the first data set, and further indicating that each element in a second data set of the two data sets is not part of the first data set, the generating comprising, for a first bucket in the set of buckets corresponding to the first data set; generating a temporary Bloom filter for the first bucket; for each of one or more elements in a second bucket of the set of buckets corresponding to the second data set, the first bucket corresponding to the second bucket, checking whether the temporary Bloom filter returns a positive result for the element, the positive result indicating that the element was memorized by the temporary Bloom filter; if the temporary Bloom filter returns the positive result for none of the elements of the second bucket, then using the temporary Bloom filter as the Bloom filter for the first bucket; and if the temporary Bloom filter returns the positive result for at least one of the elements of the second bucket, then repeatedly generating new temporary Bloom filters until a temporary Bloom filter that returns the positive result for none of the elements of the second bucket is generated. - View Dependent Claims (15)
-
-
16. One or more computer storage media having stored thereon instructions that, when executed by one or more processors of a computing device, cause the one or more processors to:
-
obtain data that is an element of either a first data set of elements or a second data set of elements, wherein the first data set and the second data set are disjoint; determine a single Bloom filter that corresponds to the element, the single Bloom filter being one of multiple Bloom filters associated with the first data set, each of the multiple Bloom filters corresponding to different elements of the first data set; and use the single Bloom filter to determine whether the element is part of the first data set of elements or the second data set of elements, wherein to determine the single Bloom filter that corresponds to the element is to; generate a hash value for the element using a hash function; and identify as the single Bloom filter a Bloom filter of the multiple Bloom filters that has an index value that is the same as the hash value. - View Dependent Claims (17)
-
Specification