Data partitioning via bucketing bloom filters

US 7,743,013 B2
Filed: 06/11/2007
Issued: 06/22/2010
Est. Priority Date: 06/11/2007
Status: Expired due to Fees

First Claim

Patent Images

1. One or more computer storage media having stored thereon instructions that, when executed by one or more processors of a computing device, cause the one or more processors to:

identify a first data set of elements and a second data set of elements, wherein the first data set and the second data set are disjoint;

assign each element in the first data set to a bucket of a first set of buckets;

assign each element in the second data set to a bucket of a second set of buckets; and

generate, for each bucket of the first set of buckets, a Bloom filter that indicates that each element assigned to the bucket of the first set of buckets is part of the first data set, and that indicates that each element assigned to a corresponding bucket of the second set of buckets is not part of the first data set,wherein to assign each element in the first data set to the bucket of the first set of buckets is to, for each element in the first data set;

generate a hash value for the element by applying a hash function to the element; and

assign the element to one bucket of the first set of multiple buckets that corresponds to the hash value for the element; and

to assign each element in the second data set to the bucket of the second set of buckets is to, for each element in the second data set;

generate a hash value for the element by applying the hash function to the element; and

assign the element to one bucket of the second set of multiple buckets that corresponds to the hash value for the element.

View all claims

2 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

Multiple Bloom filters are generated to partition data between first and second disjoint data sets of elements. Each element in the first data set is assigned to a bucket of a first set of buckets, and each element in the second data set is assigned to a bucket of a second set of buckets. A Bloom filter is generated for each bucket of the first set of buckets. The Bloom filter generated for a bucket indicates that each element assigned to that bucket is part of the first data set, and that each element assigned to a corresponding bucket of the second set of buckets is not part of the first data set. Additionally, a Bloom filter corresponding to a subsequently received element can be determined and used to identify whether that subsequently received element is part of the first data set or the second data set.

37 Citations

View as Search Results

17 Claims

1. One or more computer storage media having stored thereon instructions that, when executed by one or more processors of a computing device, cause the one or more processors to:
- identify a first data set of elements and a second data set of elements, wherein the first data set and the second data set are disjoint;
  
  assign each element in the first data set to a bucket of a first set of buckets;
  
  assign each element in the second data set to a bucket of a second set of buckets; and
  
  generate, for each bucket of the first set of buckets, a Bloom filter that indicates that each element assigned to the bucket of the first set of buckets is part of the first data set, and that indicates that each element assigned to a corresponding bucket of the second set of buckets is not part of the first data set,wherein to assign each element in the first data set to the bucket of the first set of buckets is to, for each element in the first data set;
  
  generate a hash value for the element by applying a hash function to the element; and
  
  assign the element to one bucket of the first set of multiple buckets that corresponds to the hash value for the element; and
  
  to assign each element in the second data set to the bucket of the second set of buckets is to, for each element in the second data set;
  
  generate a hash value for the element by applying the hash function to the element; and
  
  assign the element to one bucket of the second set of multiple buckets that corresponds to the hash value for the element.
- View Dependent Claims (2, 3, 6, 7, 8, 9)
- - 2. One or more computer storage media as recited in claim 1, each bucket in the first set of multiple buckets having one corresponding bucket in the second set of multiple buckets, with corresponding buckets being identified by a same index value.
  - 3. One or more computer storage media as recited in claim 1, wherein to assign each element in the first data set to the bucket of the first set of buckets is further to, for each element in the first data set, store a portion of the hash value for the element in a data structure representing the one bucket.
  - 6. One or more computer storage media as recited in claim 1, wherein to assign each element in the first data set to the bucket of the first set of buckets is to, for each element in the first data set, store the hash value generated by applying the hash function to the element in a data structure representing the bucket to which the element is assigned.
  - 7. One or more computer storage media as recited in claim 1, wherein to assign each element in the first data set to the bucket of the first set of buckets is to, for each element in the first data set, store the element in a data structure representing the bucket to which the element is assigned.
  - 8. One or more computer storage media as recited in claim 1, wherein the instructions further cause the one or more processors to partition data among three or more data sets by repeating the identification, both of the assignments, and the generation recursively with additional data subsets of one or both of the first data set and the second data set.
  - 9. One or more computer storage media as recited in claim 1, wherein the instructions further cause the one or more processors to:
    - obtain data that is an element of either the first data set of elements or the second data set of elements;
      
      determine a single Bloom filter that corresponds to the element, the single Bloom filter being one of multiple Bloom filters associated with the first data set, each of the multiple Bloom filters corresponding to different elements of the first data set; and
      
      use the single Bloom filter to determine whether the element is part of the first data set of elements or the second data set of elements.

4. One or more computer storage media having stored thereon instructions that, when executed by one or more processors of a computing device, cause the one or more processors to:
- identify a first data set of elements and a second data set of elements, wherein the first data set and the second data set are disjoint;
  
  assign each element in the first data set to a bucket of a first set of buckets;
  
  assign each element in the second data set to a bucket of a second set of buckets; and
  
  generate, for each bucket of the first set of buckets, a Bloom filter that indicates that each element assigned to the bucket of the first set of buckets is part of the first data set, and that indicates that each element assigned to a corresponding bucket of the second set of buckets is not part of the first data set,wherein to generate the Bloom filter for a first bucket of the first set of buckets is to;
  
  generate a temporary Bloom filter for the first bucket;
  
  for each of one or more elements in a second bucket of the second set of buckets, the first bucket corresponding to the second bucket, check whether the temporary Bloom filter returns a positive result for the element, the positive result indicating that the element was memorized by the temporary Bloom filter;
  
  if the temporary Bloom filter returns the positive result for none of the elements of the second bucket, then use the temporary Bloom filter as the Bloom filter for the first bucket; and
  
  if the temporary Bloom filter returns the positive result for at least one of the elements of the second bucket, then repeatedly generate new temporary Bloom filters until a temporary Bloom filter that returns the positive result for none of the elements of the second bucket is generated.
- View Dependent Claims (5)
- - 5. One or more computer storage media as recited in claim 4, wherein to generate the new temporary Bloom filters is to use a new seed value for a set of hash functions used to create the new temporary Bloom filters.

10. A method of creating a set of Bloom filters to partition data into two disjoint data sets, the method comprising:
- assigning elements of each of the two disjoint data sets to a bucket of a corresponding set of buckets; and
  
  generating the set of Bloom filters, each Bloom filter corresponding to one bucket corresponding to a first data set of the two disjoint data sets, each Bloom filter indicating that each element in the first data set is part of the first data set, and further indicating that each element in a second data set of the two data sets is not part of the first data set,wherein assigning elements of each of the two disjoint data sets to a bucket of a corresponding set of buckets comprises;
  
  for each element in the first data set;
  
  generating a hash value for the element by applying a hash function to the element; and
  
  assigning the element to one bucket of a first set of buckets corresponding to the first data set, the one bucket corresponding to the hash value for the element; and
  
  for each element in the second data set;
  
  generating a hash value for the element by applying the hash function to the element; and
  
  assigning the element to one bucket of a second set of buckets corresponding to the second data set, the one bucket corresponding to the hash value for the element.
- View Dependent Claims (11, 12, 13)
- - 11. A method as recited in claim 10, further comprising for each element in the first data set, storing a portion of the hash value for the element in a data structure representing the one bucket of the first set of buckets.
  - 12. A method as recited in claim 10, wherein assigning elements of the first data set to a bucket of a corresponding set of buckets further comprises, for each element in the first data set, storing the hash value generated by applying the hash function to the element in a data structure representing the bucket to which the element is assigned.
  - 13. A method as recited in claim 10, wherein assigning elements of the first data set to a bucket of a corresponding set of buckets further comprises, for each element in the first data set, storing the element in a data structure representing the bucket to which the element is assigned.

14. A method of creating a set of Bloom filters to partition data into two disjoint data sets, the method comprising:
- assigning elements of each of the two disjoint data sets to a bucket of a corresponding set of buckets; and
  
  generating the set of Bloom filters, each Bloom filter corresponding to one bucket corresponding to a first data set of the two disjoint data sets, each Bloom filter indicating that each element in the first data set is part of the first data set, and further indicating that each element in a second data set of the two data sets is not part of the first data set,the generating comprising, for a first bucket in the set of buckets corresponding to the first data set;
  
  generating a temporary Bloom filter for the first bucket;
  
  for each of one or more elements in a second bucket of the set of buckets corresponding to the second data set, the first bucket corresponding to the second bucket, checking whether the temporary Bloom filter returns a positive result for the element, the positive result indicating that the element was memorized by the temporary Bloom filter;
  
  if the temporary Bloom filter returns the positive result for none of the elements of the second bucket, then using the temporary Bloom filter as the Bloom filter for the first bucket; and
  
  if the temporary Bloom filter returns the positive result for at least one of the elements of the second bucket, then repeatedly generating new temporary Bloom filters until a temporary Bloom filter that returns the positive result for none of the elements of the second bucket is generated.
- View Dependent Claims (15)
- - 15. A method as recited in claim 14, wherein generating the new temporary Bloom filters comprises using a new seed value for a set of hash functions used to create the new temporary Bloom filters.

16. One or more computer storage media having stored thereon instructions that, when executed by one or more processors of a computing device, cause the one or more processors to:
- obtain data that is an element of either a first data set of elements or a second data set of elements, wherein the first data set and the second data set are disjoint;
  
  determine a single Bloom filter that corresponds to the element, the single Bloom filter being one of multiple Bloom filters associated with the first data set, each of the multiple Bloom filters corresponding to different elements of the first data set; and
  
  use the single Bloom filter to determine whether the element is part of the first data set of elements or the second data set of elements,wherein to determine the single Bloom filter that corresponds to the element is to;
  
  generate a hash value for the element using a hash function; and
  
  identify as the single Bloom filter a Bloom filter of the multiple Bloom filters that has an index value that is the same as the hash value.
- View Dependent Claims (17)
- - 17. One or more computer storage media as recited in claim 16, wherein the hash function is a same hash function as was previously used to assign elements of the first data set to a first set of buckets associated with the multiple Bloom filters during creation of the multiple Bloom filters.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Charles, Denis, Mityagin, Anton, Chellapilla, Kumar
Primary Examiner(s)
To; Baoquoc N

Application Number

US11/811,619
Publication Number

US 20080307189A1
Time in Patent Office

1,107 Days
Field of Search

None
US Class Current

1/1
CPC Class Codes

G06F 16/93 Document management systems

Data partitioning via bucketing bloom filters

First Claim

2 Assignments

0 Petitions

Accused Products

Abstract

37 Citations

17 Claims

Specification

Solutions

Use Cases

Quick Links

Data partitioning via bucketing bloom filters

First Claim

2 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

37 Citations

17 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links