System and method for identifying matching portions of two sets of data in a multiprocessor system

US 10,642,808 B1
Filed: 11/01/2016
Issued: 05/05/2020
Est. Priority Date: 11/01/2015
Status: Active Grant

Abstract

A system and method match data from a first set of data with data from another set of data.

18 Claims

1. A method of joining a first database data set and a second database data set, the method comprising:
- (A) identifying a size of a storage space to be used for joining the first database data set and the second database data set;
  
  (B) identifying a number of a plurality of processor cores to be used for joining the first database data set and the second database data set;
  
  (C) hashing each of a plurality of data elements of the first database data set to produce a first hash result for each of the plurality of data elements, each first hash result comprising a first portion and a second portion, the first and second portions each comprising less than all of the first hash result and not entirely overlapping with each other;
  
  (D) assigning each of the plurality of data elements of the first database data set to one of a plurality of buffers, responsive to the first portion of the first hash result for each of the respective data elements in the plurality;
  
  (E) identifying a number of a plurality of sub buffers responsive to the size of the storage space identified, the number of processor cores identified, and a size to be used substantially as a size for each of the plurality of sub buffers, each sub buffer corresponding to a range of potential first hash results, a plurality of the sub buffers corresponding to each buffer;
  
  (F) by each of the plurality of processor cores, substantially simultaneously with the other processor cores;
  
  (1) selecting a buffer in the plurality not already selected by any of the plurality of processor cores;
  
  (2) assigning each of the plurality of data elements assigned to the selected buffer, to one of the sub buffers in the plurality, responsive to the second portion of the first hash result of each said data element and the range of potential first hash results of said one of the sub buffers;
  
  (3) generating a hash table for each data element assigned to each sub buffer comprising a first alternate hash result for each data element that is generated using, and different from, the first hash result for the data element;
  
  (4) storing in storage other than random access memory each sub buffer corresponding to the selected buffer and the hash table of said sub buffer; and
  
  (5) repeating steps (1)-(4) until all buffers in the plurality have been selected;
  
  (G) receiving a portion, less than all, of a plurality of data elements of the second database data set into a plurality of chunks of memory;
  
  (H) by each of the plurality of processor cores, substantially simultaneously with the other processor cores;
  
  (1) selecting one of the plurality of chunks not already selected by any of the plurality of processor cores; and
  
  (2) for each of the plurality of data elements in the selected chunk;
  
  a. hashing said data element in the selected chunk to produce a second hash result for said data element;
  
  b. assigning the data element in the selected chunk to one of a plurality of sub partitions, each of the sub partitions in the plurality being assigned a range of potential second hash results equal to a range of a different one of the sub buffers, said assigning being responsive to the range of potential second hash results of said sub partition and the second hash result of said data element in the second chunk; and
  
  (3) repeating steps (1) and (2) until all of the chunks have been processed;
  
  I. by each of the plurality of processor cores, substantially simultaneously with the other processor cores;
  
  (1) selecting one of the plurality of sub partitions not already selected by any of the plurality of processor cores;
  
  (2) reading the hash table and data elements of the first database data set of any sub buffer having a range of potential first hash results corresponding to the range of potential second hash results of the selected sub partition;
  
  (3) for each of the plurality of data elements in the selected sub partition;
  
  (a) identifying whether a second alternate hash result, generated using, and different from, the second hash result of said data element corresponds to the first alternate hash result; and
  
  (b) if the second alternate hash result corresponds to the first alternate hash result, comparing said data element in the selected sub partition with the data element in the sub buffer read that corresponds to the corresponding first alternate hash result, and if the comparing results in a match, identifying as matched with said data element in the selected sub partition the data element in the sub buffer read that corresponds to said data element in the selected sub partition; and
  
  (4) repeating steps (1)-(3) until all of the sub partitions have been selected; and
  
  (J) Repeating steps G-I until all of the plurality of data elements of the second database data set have been processed as in steps G-I.
- Dependent claims (2, 3, 4, 5, 6):
- - 2. The method of claim 1, additionally comprising:
    - estimating sizes of each of said two database data sets; and
      
      assigning as the first database data set one of said two database data sets with a smaller estimated size.
  - 3. The method of claim 1, wherein steps A-F are performed prior to specification of the second database data set.
  - 4. The method of claim 1 wherein a second plurality of the sub buffers are each assigned to a same entire range of potential first hash results.
  - 5. The method of claim 4 additionally comprising assigning a label to each of the second plurality of sub buffers, the label responsive to at least a part of the range of potential first hash results corresponding to each said sub buffer.
  - 6. The method of claim 1, wherein the first alternate hash result comprises a different ordering of the first hash result.
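
The build side of claim 1 (steps C through F, with the reordering of claim 6) can be sketched roughly as follows. This is a minimal, single-threaded illustration and not the patented implementation: the hash function (`zlib.crc32`), the bit widths `BUFFER_BITS` and `SUB_BITS`, the half-swap used as the alternate hash, and every function name are assumptions chosen for the example. The claims name no hash function or bit layout, and they write each sub buffer to storage other than random access memory, which this sketch skips.

```python
import zlib
from collections import defaultdict

HASH_BITS = 32
BUFFER_BITS = 2   # hypothetical width of the "first portion" (4 buffers)
SUB_BITS = 2      # hypothetical width of the "second portion" (4 sub buffers per buffer)


def first_hash(key: str) -> int:
    """Stand-in 32-bit hash; the claims do not name a hash function."""
    return zlib.crc32(key.encode("utf-8")) & 0xFFFFFFFF


def buffer_index(h: int) -> int:
    """First portion: the top BUFFER_BITS bits select a buffer (step D)."""
    return h >> (HASH_BITS - BUFFER_BITS)


def sub_buffer_index(h: int) -> int:
    """Second portion: the next SUB_BITS bits select a sub buffer (step F(2))."""
    shift = HASH_BITS - BUFFER_BITS - SUB_BITS
    return (h >> shift) & ((1 << SUB_BITS) - 1)


def alternate_hash(h: int) -> int:
    """Assumed reordering of the hash: swap its 16-bit halves.

    Note that a hash whose halves are equal maps to itself, so a real
    system would need a reordering that always differs.
    """
    return ((h << 16) | (h >> 16)) & 0xFFFFFFFF


def build_side(rows):
    """Return {(buffer, sub buffer): {alternate hash: [(key, payload), ...]}}."""
    buffers = defaultdict(list)
    for key, payload in rows:
        h = first_hash(key)
        buffers[buffer_index(h)].append((h, key, payload))

    sub_buffers = {}
    # In the claim, each core repeatedly picks a not-yet-selected buffer;
    # here the buffers are simply processed one after another.
    for b, items in buffers.items():
        for h, key, payload in items:
            sub_id = (b, sub_buffer_index(h))
            table = sub_buffers.setdefault(sub_id, defaultdict(list))
            table[alternate_hash(h)].append((key, payload))
    return sub_buffers
```

Because the buffer and sub buffer indices come from different bits of the same hash, every element with a given key lands in exactly one sub buffer, which is what lets the later matching step read only the sub buffer whose range corresponds to a given sub partition.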

7. A system for joining a first database data set and a second database data set, the system comprising:
- (A) a partition set up manager for identifying via an input/output a size of a storage space to be used for joining the first database data set and the second database data set, and for identifying via the partition setup manager input/output a number of a plurality of processor cores to be used for joining the first database data set and the second database data set, and for providing at an output the size of the storage space and the number of the plurality of processor cores;
  
  (B) a partition assignment manager having an input for receiving a plurality of data elements of the first database data set, the partition assignment manager for hashing each of the plurality of data elements of the first database data set to produce a first hash result for each of the plurality of data elements, each first hash result comprising a first portion and a second portion, the first and second portions each comprising less than all of the first hash result and not entirely overlapping with each other, and for assigning via an output each of the plurality of data elements and the hash result produced therefrom, to one of a plurality of buffers, responsive to the first portion of the first hash result for each of the respective data elements in the plurality;
  
  (C) a sub partition setup manager having an input coupled to the partition setup manager output for receiving the size of the storage space and the number of the plurality of processors, the sub partition setup manager for identifying at an output a number of a first plurality of sub buffers responsive to the size of the storage space received, the number of processor cores received, and a size to be used substantially as a size for each of the plurality of sub buffers, each sub buffer corresponding to a range of potential first hash results, a second plurality of sub buffers corresponding to each buffer in the plurality;
  
  (D) in each of the plurality of processor cores, operating substantially simultaneously with the other processor cores;
  
  (1) a sub partition assignment manager having an input coupled to the sub partition setup manager output for receiving the identification of the number of the plurality of sub buffers, and to the partition assignment manager output for receiving the second portion of the first hash result of each of the data elements of the first database data set, the sub partition assignment manager for selecting via an input/output a buffer in the plurality not already selected by any of the plurality of processor cores and for assigning via an output each of the plurality of data elements assigned to the selected buffer, to one of the sub buffers in the plurality, responsive to the second portion of the first hash result of each said data element and the range of potential first hash results of the one of the sub buffers, and for storing via the sub partition assignment manager output in storage other than random access memory each sub buffer corresponding to the selected buffer and;
  
  (2) a hash table manager having an input coupled to the partition assignment manager output for receiving at least a portion of the first hash result and to the sub partition assignment manager output for receiving the assignment of the plurality of data elements assigned to the sub buffers in the plurality, the hash table manager for generating a hash table for each data element assigned to each sub buffer comprising a first alternate hash result for each data element that is generated using, and different from, the first hash result for said data element, and for storing in the storage other than random access memory via an output the hash table of said sub buffer, associated with said sub buffer; and
  
  wherein operation of the sub partition assignment manager and the hash table manager is repeated until all buffers in the plurality have been selected and hash tables generated for all data elements;
  
  (E) an ODS setup manager having an input for receiving a portion, less than all, of a plurality of data elements of the second database data set and for storing such portion into a plurality of chunks of memory via an output;
  
  (F) at each of the plurality of processor cores, an ODS assignment manager having an input coupled to the ODS setup manager output for receiving at least the data elements of the second database data set in at least some of the plurality of chunks of memory, the ODS assignment manager for selecting one of the plurality of chunks not already selected by any of the plurality of processor cores, and for each of the plurality of data elements in the selected chunk;
  
  hashing the data element in the selected chunk to produce a second hash result for said data element, and providing an assignment via an output the data element in the selected chunk to one of a plurality of sub partitions, each of the sub partitions in the plurality being assigned a range of potential second hash results equal to a range of a different one of the sub buffers, said assignments being responsive to the range of potential second hash results of the sub partition and the second hash result of said data element in the selected chunk, the ODS assignment manager in one of the plurality of processor cores operating substantially simultaneously with the ODS assignment manager in each of at least one other of the processor cores in the plurality;
  
  (G) at each of the plurality of processor cores, substantially simultaneously with the other processor cores;
  
  an ODS match manager having an input coupled to the ODS setup manager output for receiving some of the plurality of data elements of the second database data set, to the output of at least some of the ODS assignment managers for receiving at least some of the assignments, to the storage other than random access memory for receiving the hash table and data elements of the first database data set of a plurality of the sub buffers, the ODS match manager for:
  
  (1) selecting via an input/output one of the plurality of sub partitions not already selected by any of the plurality of processor cores;
  
  (2) reading the hash table and data elements of the first database data set assigned to any sub buffer having a range of potential first hash results corresponding to the range of potential second hash results of the selected sub partition; and
  
  (3) for each of the plurality of data elements in the selected sub partition;
  
  (a) identifying whether a second alternate hash result, generated using, and different from, the second hash result of said data element corresponds to the first alternate hash result; and
  
  (b) if the second alternate hash result corresponds to the first alternate hash result, comparing said data element in the selected sub partition with the data element in the sub buffer read that corresponds to the corresponding first alternate hash result, and if the comparing results in a match, identifying at an output as matched with said data element in the selected sub partition the data element in the sub buffer read that corresponds to said data element in the selected sub partition; and
  
  (4) repeating operation of (1)-(3) until all sub partitions have been so processed; and
  
  (H) wherein operation of elements (E)-(G) are repeated until all of the plurality of data elements of the second database data set have been processed by elements (E)-(G).
- Dependent claims (8, 9, 10, 11, 12):
- - 8. The system of claim 7:
    - additionally comprising a request receiver having an input for receiving identifiers of the first database data set and the second database data set and information about the first database data set and the second database data set, the request receiver for estimating sizes of each of said two database data sets responsive to said information, assigning as the first database data set one of said two database data sets with a smaller estimated size and for providing at an output the identifiers of the first database data set and the second database data set; and
      
      wherein;
      
      the partition assignment manager input is additionally coupled to the request receiver output for receiving the identifier of the first database data set, and the partition assignment manager receives the plurality of data elements of the first database data set responsive to providing the identifier of the first database data set at the partition assignment manager output; and
      
      the ODS setup manager input is additionally coupled to the request receiver output for receiving the identifier of the second database data set, and the ODS setup manager receives the portions of the plurality of data elements of the second database data set responsive to providing the identifier of the second database data set at the ODS setup manager output.
  - 9. The system of claim 7, wherein operation of elements A-F are performed prior to receipt by any element of the system of a specification of the second database data set.
  - 10. The system of claim 7 wherein a plurality of the sub buffers are each assigned to a same entire range of potential first hash results.
  - 11. The system of claim 10, wherein the hash table manager is additionally for assigning via the hash table manager output a label to each of the plurality of sub buffers, the label responsive to at least a part of the range of potential first hash results corresponding to each said sub buffer.
  - 12. The system of claim 7, wherein each first alternate hash result comprises a different ordering of the first hash result.
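
The probe side handled by the ODS setup, assignment, and match managers (elements E through G) might look roughly like the sketch below. It is an illustration under stated assumptions, not the claimed system: `hash32`, `range_id`, `PART_BITS`, and the other names are hypothetical, worker threads stand in for processor cores, the hash tables stay in memory instead of being read from storage other than random access memory, and the whole second data set is processed in one pass where the claims repeat over portions of it.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

PART_BITS = 4  # hypothetical: top 4 bits of the hash pick one of 16 hash ranges


def hash32(key: str) -> int:
    """Stand-in 32-bit hash used for both data sets."""
    return zlib.crc32(key.encode("utf-8")) & 0xFFFFFFFF


def range_id(h: int) -> int:
    """Identify the hash range shared by a sub buffer and its sub partition."""
    return h >> (32 - PART_BITS)


def alternate_hash(h: int) -> int:
    """Assumed reordering of the hash: swap its 16-bit halves."""
    return ((h << 16) | (h >> 16)) & 0xFFFFFFFF


def build_tables(first_rows):
    """One hash table per range, keyed by alternate hash (build side, simplified)."""
    tables = defaultdict(lambda: defaultdict(list))
    for key, payload in first_rows:
        h = hash32(key)
        tables[range_id(h)][alternate_hash(h)].append((key, payload))
    return tables


def partition_chunk(chunk):
    """Assign each element of one chunk to a sub partition by its hash range."""
    parts = defaultdict(list)
    for key, payload in chunk:
        h = hash32(key)
        parts[range_id(h)].append((h, key, payload))
    return parts


def probe_partition(table, rows):
    """Look up the alternate hash first, then compare the actual keys."""
    matches = []
    for h, key, payload in rows:
        for build_key, build_payload in table.get(alternate_hash(h), ()):
            if build_key == key:  # a hash hit alone is not proof of a match
                matches.append((key, build_payload, payload))
    return matches


def join(first_rows, second_rows, chunk_size=2, workers=2):
    tables = build_tables(first_rows)
    chunks = [second_rows[i:i + chunk_size]
              for i in range(0, len(second_rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = defaultdict(list)
        # Each worker takes a chunk no other worker has taken.
        for chunk_parts in pool.map(partition_chunk, chunks):
            for rid, rows in chunk_parts.items():
                parts[rid].extend(rows)
        # Each worker takes a sub partition and probes the matching table.
        jobs = [(tables.get(rid, {}), rows) for rid, rows in parts.items()]
        results = pool.map(lambda job: probe_partition(*job), jobs)
        return sorted(m for part in results for m in part)
```

For example, `join([("a", 1), ("b", 2)], [("a", 10), ("c", 30), ("b", 20), ("a", 11)])` returns `[("a", 1, 10), ("a", 1, 11), ("b", 2, 20)]`. The explicit key comparison after the hash lookup mirrors the claim's two-stage check: the alternate hash results must correspond, and then the data elements themselves are compared.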

13. A computer program product comprising a nontransitory computer useable medium having computer readable program code embodied therein for joining a first database data set and a second database data set, the computer program product comprising computer readable program code devices configured to cause a computer system to:
- A. identify a size of a storage space to be used for joining the first database data set and the second database data set;
  
  B. identify a number of a plurality of processor cores to be used for joining the first database data set and the second database data set;
  
  C. hash each of a plurality of data elements of the first database data set to produce a first hash result for each of the plurality of data elements, each first hash result comprising a first portion and a second portion, the first and second portions each comprising less than all of the first hash result and not entirely overlapping with each other;
  
  D. assign each of the plurality of data elements of the first database data set to one of a plurality of buffers, responsive to the first portion of the first hash result for each of the respective data elements in the plurality;
  
  E. identify a number of a plurality of sub buffers responsive to the size of the storage space identified, the number of processor cores identified, and a size to be used substantially as a size for each of the plurality of sub buffers, each sub buffer corresponding to a range of potential first hash results, a plurality of sub buffers corresponding to each buffer;
  
  F. by each of the plurality of processor cores, substantially simultaneously with the other processor cores;
  
  (1) select a buffer in the plurality not already selected by any of the plurality of processor cores;
  
  (2) assign each of the plurality of data elements assigned to the selected buffer, to one of the sub buffers in the plurality, responsive to the second portion of the first hash result of each said data element and the range of potential first hash results of said one of the sub buffers;
  
  (3) generate a hash table for each data element assigned to each sub buffer comprising a first alternate hash result for each data element that is generated using, and different from, the first hash result for the data element;
  
  (4) store in storage other than random access memory each sub buffer corresponding to the selected buffer and the hash table of said sub buffer; and
  
  (5) repeat operation of computer readable program code devices (F)(1)-(F)(4) until all buffers in the plurality have been selected;
  
  G. receive a portion, less than all, of a plurality of data elements of the second database data set into a plurality of chunks of memory;
  
  H. by each of the plurality of processor cores, substantially simultaneously with the other processor cores;
  
  (1) select one of the plurality of chunks not already selected by any of the plurality of processor cores; and
  
  (2) for each of the plurality of data elements in the selected chunk;
  
  (a) hash said data element in the selected chunk to produce a second hash result for said data element;
  
  (b) assign the data element in the selected chunk to one of a plurality of sub partitions, each of the sub partitions in the plurality being assigned a range of potential second hash results equal to a range of a different one of the sub buffers, said assigning being responsive to the range of potential second hash results of said sub partition and the second hash result of said data element in the second chunk; and
  
  (3) repeat operation of computer readable program code devices (H)(1) and (H)(2) until all of the chunks have been processed;
  
  (I) by each of the plurality of processor cores, substantially simultaneously with the other processor cores;
  
  (1) select one of the plurality of sub partitions not already selected by any of the plurality of processor cores;
  
  (2) read the hash table and data elements of the first database data set of any sub buffer having a range of potential first hash results corresponding to the range of potential second hash results of the selected sub partition;
  
  (3) for each of the plurality of data elements in the selected sub partition;
  
  (a) identify whether a second alternate hash result, generated using, and different from, the second hash result of said data element corresponds to the first alternate hash result; and
  
  (b) if the second alternate hash result corresponds to the first alternate hash result, compare said data element in the selected sub partition with the data element in the sub buffer read that corresponds to the corresponding first alternate hash result, and if the compare results in a match, identify as matched with said data element in the selected sub partition the data element in the sub buffer read that corresponds to said data element in the selected sub partition; and
  
  (4) repeating operation of (I)(1)-(I)(3) until all of the sub partitions have been selected; and
  
  (J) repeat operation of computer readable program code devices (G)-(I) until all of the plurality of data elements of the second database data set have been processed as specified by (G)-(I).
- Dependent claims (14, 15, 16, 17, 18):
- - 14. The computer program product of claim 13, additionally comprising:
    - estimating sizes of each of said two database data sets; and
      
      assigning as the first database data set one of said two database data sets with a smaller estimated size.
  - 15. The computer program product of claim 13, wherein computer readable program code devices A-F are performed prior to specification of the second database data set.
  - 16. The computer program product of claim 13 wherein a second plurality of the sub buffers are each assigned to a same entire range of potential first hash results.
  - 17. The computer program product of claim 16 additionally comprising computer readable program code devices configured to cause the computer system to assign a label to each of the second plurality of sub buffers, the label responsive to at least a part of the range of potential first hash results corresponding to each said sub buffer.
  - 18. The computer program product of claim 13, wherein the first alternate hash result comprises a different ordering of the first hash result.
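
Claims 2, 8, and 14 choose the first (build-side) data set by estimated size. A minimal sketch of that selection follows. The estimate used here (row count times average row width) and the names `DataSetInfo`, `estimated_size`, and `choose_first_and_second` are assumptions for illustration only; the claims do not say how the sizes are estimated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSetInfo:
    """Hypothetical metadata about a data set, used only for the estimate."""
    name: str
    row_count: int
    avg_row_bytes: int


def estimated_size(info: DataSetInfo) -> int:
    """Assumed estimate: number of rows times average row width in bytes."""
    return info.row_count * info.avg_row_bytes


def choose_first_and_second(a: DataSetInfo, b: DataSetInfo):
    """Return (first, second), where the first has the smaller estimated size."""
    if estimated_size(a) <= estimated_size(b):
        return a, b
    return b, a
```

With `orders = DataSetInfo("orders", 1_000_000, 120)` and `customers = DataSetInfo("customers", 50_000, 200)`, the call `choose_first_and_second(orders, customers)` returns `customers` as the first data set, so the smaller set is the one that gets partitioned into sub buffers and hash tables.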

Current Assignee: Yellowbrick Data Co.
Original Assignee: Yellowbrick Data, Inc.
Inventors: Kejser, Thomas; Gotlieb, Charles E.
Primary Examiner(s): Trujillo, James
Assistant Examiner(s): Mueller, Kurt A
Application Number: US15/340,949
Time in Patent Office: 1,281 Days
CPC Class Codes

G06F 16/215   Improving data quality; Dat...

G06F 16/2255   Hash tables

G06F 16/2282   Tablespace storage structur...

G06F 16/24544   Join order optimisation

G06F 16/24552   Database cache management

G06F 16/2456   Join operations

G06F 16/278   Data partitioning, e.g. hor...