Method and system for performing spatial similarity joins on high-dimensional points

US 5,978,794 A
Filed: 04/09/1996
Issued: 11/02/1999
Est. Priority Date: 04/09/1996
Status: Expired due to Term

First Claim

1. A method for performing similarity joins on high-dimensional points representing data objects of a database, the method comprising the steps of:

generating a multi-dimensional data structure having a plurality of leaf nodes for organizing the points, each leaf node being split into ⌊1/epsilon⌉ child nodes, where epsilon is a similar distance, based on the depth of the leaf node whenever the number of points associated with the leaf node exceeds a predetermined value, the dimensions used for splitting the nodes being in an order of correlation among the dimensions such that one selected for splitting next has the least correlation with previously used dimensions;

traversing the interior nodes of the data structure to select pairs of leaf nodes from which the points are joined; and

joining the points from the selected pairs of leaf nodes based on a joining condition that the distance between any two points to be joined is at most epsilon.
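
The generating step of the claim can be illustrated with a minimal Python sketch. It is not the patent's implementation: it assumes coordinates normalized to [0, 1], reads the claim's bracket around 1/epsilon as a floor, and uses hypothetical names (`Node`, `stripe`, `insert`, `capacity`, `dim_order`). An overflowing leaf at depth i is cut into epsilon-wide stripes along the i-th dimension of a precomputed split order.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    depth: int = 0                               # depth also selects the split dimension
    points: list = field(default_factory=list)   # populated only while the node is a leaf
    children: Optional[list] = None              # None marks a leaf


def stripe(value, eps, n_children):
    """Index of the eps-wide stripe holding value; the last stripe absorbs any remainder."""
    return min(int(value / eps), n_children - 1)


def insert(root, point, eps, capacity, dim_order):
    """Insert one point, splitting the receiving leaf if it overflows."""
    n_children = math.floor(1 / eps)  # assumption: the claim's bracket read as floor
    node = root
    # Descend to the leaf whose stripes contain the point.
    while node.children is not None:
        dim = dim_order[node.depth]
        node = node.children[stripe(point[dim], eps, n_children)]
    node.points.append(point)
    # Split only while an unused split dimension remains for this depth.
    if len(node.points) > capacity and node.depth < len(dim_order):
        dim = dim_order[node.depth]
        node.children = [Node(depth=node.depth + 1) for _ in range(n_children)]
        for p in node.points:
            node.children[stripe(p[dim], eps, n_children)].points.append(p)
        node.points = []
```

Because every stripe is at least epsilon wide, two points within epsilon of each other can only fall in the same stripe or in adjacent stripes, which is what lets the traversal step skip non-adjacent children. In this simplified version a child that is still overfull after a split is split again only on its next insertion.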

Abstract

A method and system are disclosed for performing spatial similarity joins on high-dimensional points that represent data objects of a database. The method comprises the steps of: generating a data structure based on the similarity distance ε for organizing the high-dimensional points, traversing the data structure to select pairs of leaf nodes from which the high-dimensional points are joined, and joining the points from selected pairs of nodes according to a joining condition based on the similarity distance ε. An efficient data structure referred to as an ε-K-D-B tree is disclosed to provide fast access to the high-dimensional points and to minimize system storage requirements. The invention provides algorithms for generating the ε-K-D-B tree using biased splitting to minimize the number of nodes to be examined during join operations. The traversing step includes joining selected pairs of nodes and also self-joining selected nodes. Alternatively, the data structure is an R+tree generated using biased splitting.
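
The biased splitting mentioned in the abstract is spelled out in the dependent R+tree claims: one dimension keeps being split until each new node's extent in it is less than 2 epsilons, and only then is another dimension used. The sketch below illustrates only that dimension-selection policy. The median cut, the stop condition when no dimension qualifies, and the names `mbr`, `choose_split_dim`, and `biased_split` are assumptions made for illustration, not details taken from the patent.

```python
def mbr(points):
    """Minimum bounding rectangle of a non-empty point list, as (low corner, high corner)."""
    d = len(points[0])
    lo = tuple(min(p[i] for p in points) for i in range(d))
    hi = tuple(max(p[i] for p in points) for i in range(d))
    return lo, hi


def choose_split_dim(points, eps):
    """First dimension whose MBR extent is still at least 2*eps, or None if there is none."""
    lo, hi = mbr(points)
    for dim in range(len(lo)):
        if hi[dim] - lo[dim] >= 2 * eps:
            return dim
    return None


def biased_split(points, eps, capacity):
    """Recursively split an overfull point set; returns the resulting leaf point lists."""
    if len(points) <= capacity:
        return [points]
    dim = choose_split_dim(points, eps)
    if dim is None:
        return [points]  # assumption: stop once every extent is below 2*eps
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2  # assumption: cut at the median
    return (biased_split(ordered[:mid], eps, capacity)
            + biased_split(ordered[mid:], eps, capacity))
```

Exhausting one dimension first keeps nodes narrow along it, so a node overlaps few neighbours in that dimension and fewer node pairs need to be examined during the join.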

23 Claims

1. A method for performing similarity joins on high-dimensional points representing data objects of a database, the method comprising the steps of:
- generating a multi-dimensional data structure having a plurality of leaf nodes for organizing the points, each leaf node being split into ⌊1/epsilon⌉ child nodes, where epsilon is a similar distance, based on the depth of the leaf node whenever the number of points associated with the leaf node exceeds a predetermined value, the dimensions used for splitting the nodes being in an order of correlation among the dimensions such that one selected for splitting next has the least correlation with previously used dimensions;
  
  traversing the interior nodes of the data structure to select pairs of leaf nodes from which the points are joined; and
  
  joining the points from the selected pairs of leaf nodes based on a joining condition that the distance between any two points to be joined is at most epsilon.
- Dependent Claims (2, 3, 4, 5, 6, 7)
- - 2. The method as recited in claim 1, wherein:
    - the step of generating a data structure includes sorting the points in each leaf node using one of the dimensions not used for splitting the leaf nodes as a common sort dimension; and
      
      the step of joining the points includes, for each pair of leaf nodes to be joined, the step of sort-merging the points associated with the pair of leaf nodes based on the common sort dimension.
  - 3. The method as recited in claim 1, wherein the step of traversing includes joining a first node and a second node, the step of joining the first and second nodes including the steps of:
    - a) if the first and second nodes are both leaf nodes, then selecting the first and second nodes for joining;
      
      b) if the first node is a leaf node and the second node is not a leaf node, then joining the first node with each child node of the second node; and
      
      c) if neither the first nor the second node is a leaf node, then;
      
      i) joining each n-th child node of the first node with a corresponding n-th child node of the second node, n being from 1 to N, where N is the number of child nodes from each node of the data structure except the leaf nodes;
      
      ii) joining each n-th child node of the first node with an (n+1)-th child node of the second node, n being from 1 to N-1; and
      
      iii) joining each n-th child node of the second node with an (n+1)-th child node of the first node, n being from 1 to N-1.
  - 4. The method as recited in claim 3, wherein:
    - the data structure includes first and second epsilon-KDB trees, the first and second epsilon-KDB trees each representing a set of the points, the root nodes of the first and second epsilon-KDB trees being the first and second nodes to be joined, respectively; and
      
      the step of joining the points includes joining pairs of first and second points based on the joining condition, the first and second points being respectively from the first and second epsilon-KDB trees.
  - 5. The method as recited in claim 1, wherein the step of traversing includes self-joining selected nodes of the data structure, the step of self-joining including, for each examined node in the data structure, the steps of:
    - a) if the examined node is a leaf node, then joining pairs of points of the examined node based on the joining condition; and
      
      b) if the examined node is not a leaf node, then;
      
      i) self-joining each child node of the examined node; and
      
      ii) joining each pair of adjacent child nodes of the examined node.
  - 6. The method as recited in claim 1, wherein:
    - the data structure includes an R+tree that further comprises a root node and a plurality of dimensions;
      
      the step of generating a data structure includes splitting any leaf node of the R+tree into a plurality of child nodes whenever the number of points in the leaf node exceeds a predetermined value; and
      
      each internal node of the R+tree has a minimum bounding rectangle MBR corresponding to each child node of the internal node, the MBR corresponding to the space including the points associated with the child node.
  - 7. The method as recited in claim 6, wherein the splitting is based on biased-splitting such that one dimension of the R+tree is selected for splitting the leaf nodes repeatedly until the length of the MBR of each new interior node in the selected dimension is less than 2 epsilons, before another dimension is used for splitting.
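
The node-join and self-join recursions of claims 3 and 5 can be sketched as follows. This is a hedged illustration rather than the patented implementation: `Node`, `join`, `self_join`, and the output list `out` are hypothetical names, leaves are compared with a plain nested loop under Euclidean distance, and the branch for an interior first node paired with a leaf second node is added by symmetry because claim 3 does not list it.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    points: list = field(default_factory=list)  # used by leaves
    children: Optional[list] = None             # None marks a leaf


def join(a, b, eps, out):
    """Join two nodes that split on the same dimensions at the same depths."""
    if a.children is None and b.children is None:
        # Both leaves: emit every pair within eps.
        out.extend((p, q) for p in a.points for q in b.points
                   if math.dist(p, q) <= eps)
    elif a.children is None:
        for child in b.children:
            join(a, child, eps, out)
    elif b.children is None:  # symmetric case, an assumption not listed in claim 3
        for child in a.children:
            join(child, b, eps, out)
    else:
        n = len(a.children)
        for i in range(n):                      # same-index children
            join(a.children[i], b.children[i], eps, out)
        for i in range(n - 1):                  # adjacent children, both directions
            join(a.children[i], b.children[i + 1], eps, out)
            join(a.children[i + 1], b.children[i], eps, out)


def self_join(node, eps, out):
    """Emit each unordered pair of points within eps exactly once."""
    if node.children is None:
        pts = node.points
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                if math.dist(pts[i], pts[j]) <= eps:
                    out.append((pts[i], pts[j]))
    else:
        for child in node.children:
            self_join(child, eps, out)
        for left, right in zip(node.children, node.children[1:]):
            join(left, right, eps, out)
```

Only same-index and adjacent children are paired because children are stripes at least epsilon wide, so points in non-adjacent stripes are necessarily more than epsilon apart.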

8. A computer program product for use with a computer system for directing the system to perform spatial similarity joins on high-dimensional points, the points representing data objects of a database, the computer program product comprising:
- a computer readable medium;
  
  means, provided on the computer-readable medium, for directing the system to generate a multi-dimensional data structure having a plurality of leaf nodes for organizing the points, each leaf node being split into ⌊1/epsilon⌉ child nodes, where epsilon is a similar distance, based on the depth of the leaf node whenever the number of points associated with the leaf node exceeds a predetermined value, the dimensions used for splitting the nodes being in an order of correlation among the dimensions such that one selected for splitting next has the least correlation with previously used dimensions;
  
  means, provided on the computer-readable medium, for directing the system to traverse the interior nodes of the data structure to select pairs of leaf nodes from which the points are joined; and
  
  means, provided on the computer-readable medium, for directing the system to join the points from the selected pairs of leaf nodes based on a joining condition that the distance between any two points to be joined is at most epsilon.
- Dependent Claims (9, 10, 11, 12, 13, 14, 15)
- - 9. The computer program product as recited in claim 8, wherein:
    - the means for directing to generate a data structure includes means, provided on the computer-readable medium, for directing the system to sort the points in each leaf node using one of the dimensions not used for splitting as a common sort dimension; and
      
      the means for directing to join the points includes means, provided on the computer-readable medium, for directing the system to sort-merge, for each pair of leaf nodes to be joined, the points associated with the pair of leaf nodes based on the common sort dimension.
  - 10. The computer program product as recited in claim 8, wherein the means for directing to traverse includes means, provided on the computer-readable medium, for directing the system to join a first node and a second node, the means for directing to join the first and second nodes including:
    - a) if the first and second nodes are both leaf nodes, then means, provided on the computer-readable medium, for directing the system to select the first and second nodes for joining;
      
      b) if the first node is a leaf node and the second node is not a leaf node, then means, provided on the computer-readable medium, for directing the system to join the first node with each child node of the second node; and
      
      c) if neither the first nor the second node is a leaf node, then;
      
      i) means, provided on the computer-readable medium, for directing the system to join each n-th child node of the first node with a corresponding n-th child node of the second node, n being from 1 to N, where N is the number of child nodes from each node of the data structure except the leaf nodes;
      
      ii) means, provided on the computer-readable medium, for directing the system to join each n-th child node of the first node with an (n+1)-th child node of the second node, n being from 1 to N-1; and
      
      iii) means, provided on the computer-readable medium, for directing the system to join each n-th child node of the second node with an (n+1)-th child node of the first node, n being from 1 to N-1.
  - 11. The computer program product as recited in claim 10, wherein:
    - the data structure includes first and second epsilon-KDB trees, the first and second epsilon-KDB trees each representing a set of the points, the root nodes of the first and second epsilon-KDB trees being the first and second nodes to be joined, respectively; and
      
      the means for directing to join the points includes means, provided on the computer-readable medium, for directing the system to join pairs of first and second points based on the joining condition, the first and second points being respectively from the first and second epsilon-KDB trees.
  - 12. The computer program product as recited in claim 8, wherein the means for directing to traverse includes means, provided on the computer-readable medium, for directing the system to self-join selected nodes of the data structure, the means for directing to self-join including, for each examined node of the data structure:
    - a) if the examined node is a leaf node, then means, provided on the computer-readable medium, for directing the system to join pairs of points of the examined node based on the joining condition; and
      
      b) if the examined node is not a leaf node, then;
      
      i) means, provided on the computer-readable medium, for directing the system to self-join each child node of the examined node; and
      
      ii) means, provided on the computer-readable medium, for directing the system to join each pair of adjacent child nodes of the examined node.
  - 13. The computer program product as recited in claim 8, wherein:
    - the data structure includes an R+tree that further comprises a root node and a plurality of dimensions;
      
      the means for directing to generate a data structure includes means, provided on the computer-readable medium, for directing the system to split any leaf node of the R+tree into a plurality of child nodes whenever the number of points in the leaf node exceeds a predetermined value; and
      
      each internal node of the R+tree has a minimum bounding rectangle MBR corresponding to each child node of the internal node, the MBR corresponding to the space including the points in the child node.
  - 14. The computer program product as recited in claim 13, wherein the splitting is based on biased-splitting such that one dimension of the R+tree is selected for splitting the leaf nodes repeatedly until the length of the MBR of each new interior node in the selected dimension is less than 2 epsilons, before another dimension is used for splitting.
  - 15. The computer program product as recited in claim 8, wherein the means for directing to join the points includes means, provided on the computer-readable medium, for directing the system to perform a nested-loop joining of all pairs of points of a leaf node, based on the joining condition.
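
The two leaf-level strategies named in the dependent claims, sort-merging on a common sort dimension and nested-loop joining, can be compared in a short sketch. The function names, the sliding-window formulation, and the use of Euclidean distance are assumptions made for illustration. Both inputs to the sort-merge variant must already be sorted on `sort_dim`, a dimension not used for splitting.

```python
import math


def sort_merge_join(leaf_a, leaf_b, eps, sort_dim):
    """Join two leaves whose points are pre-sorted on sort_dim."""
    out = []
    start = 0  # first candidate in leaf_b; never moves backwards
    for p in leaf_a:
        # Drop points of leaf_b that lie more than eps below p on the sort dimension.
        while start < len(leaf_b) and leaf_b[start][sort_dim] < p[sort_dim] - eps:
            start += 1
        k = start
        # Scan only the window within eps of p on the sort dimension.
        while k < len(leaf_b) and leaf_b[k][sort_dim] <= p[sort_dim] + eps:
            if math.dist(p, leaf_b[k]) <= eps:
                out.append((p, leaf_b[k]))
            k += 1
    return out


def nested_loop_join(leaf_a, leaf_b, eps):
    """Baseline: test every pair of points."""
    return [(p, q) for p in leaf_a for q in leaf_b if math.dist(p, q) <= eps]
```

The window works because two points within epsilon of each other in full distance are also within epsilon on every single coordinate, so the sort dimension acts as a cheap pre-filter before the full distance test.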

16. A database system for performing similarity joins on high-dimensional points representing data objects of a database, the system comprising:
- means for generating a multi-dimensional data structure having a plurality of leaf nodes for organizing the points, each leaf node being split into ⌊1/epsilon⌉ child nodes, where epsilon is a similar distance, based on the depth of the leaf node whenever the number of points associated with the leaf node exceeds a predetermined value, the dimensions used for splitting the nodes being in an order of correlation among the dimensions such that one selected for splitting next has the least correlation with previously used dimensions;
  
  means for traversing the interior nodes of the data structure to select pairs of leaf nodes from which the points are joined; and
  
  means for joining the points from the selected pairs of leaf nodes based on a joining condition that the distance between any two points to be joined is at most epsilon.
- Dependent Claims (17, 18, 19, 20, 21, 22, 23)
- - 17. The system as recited in claim 16, wherein:
    - the means for generating a data structure includes means for sorting the points in each leaf node using one of the dimensions not used for splitting as a common sort dimension; and
      
      the means for joining the points includes, for each pair of leaf nodes to be joined, means for sort-merging the points associated with the pair of leaf nodes based on the common sort dimension.
  - 18. The system as recited in claim 16, wherein the means for traversing includes means for joining a first node and a second node, the means for joining the first and second nodes including:
    - a) if the first and second nodes are both leaf nodes, then means for selecting the first and second nodes for joining;
      
      b) if the first node is a leaf node and the second node is not a leaf node, then means for joining the first node with each child node of the second node; and
      
      c) if neither the first nor the second node is a leaf node, then;
      
      i) means for joining each n-th child node of the first node with a corresponding n-th child node of the second node, n being from 1 to N, where N is the number of child nodes from each node of the data structure except the leaf nodes;
      
      ii) means for joining each n-th child node of the first node with an (n+1)-th child node of the second node, n being from 1 to N-1; and
      
      iii) means for joining each n-th child node of the second node with an (n+1)-th child node of the first node, n being from 1 to N-1.
  - 19. The system as recited in claim 18, wherein:
    - the data structure includes first and second epsilon-KDB trees, the first and second epsilon-KDB trees each representing a set of the points, the root nodes of the first and second epsilon-KDB trees being the first and second nodes to be joined, respectively; and
      
      the means for joining the points includes means for joining pairs of first and second points based on the joining condition, the first and second points being respectively from the first and second epsilon-KDB trees.
  - 20. The system as recited in claim 16, wherein the means for traversing includes means for self-joining selected nodes of the data structure, the means for self-joining including, for each examined node of the data structure:
    - a) if the examined node is a leaf node, then means for joining pairs of points of the examined node based on the joining condition; and
      
      b) if the examined node is not a leaf node, then;
      
      i) means for self-joining each child node of the examined node; and
      
      ii) means for joining each pair of adjacent child nodes of the examined node.
  - 21. The system as recited in claim 16, wherein:
    - the data structure includes an R+tree that further comprises a root node and a plurality of dimensions;
      
      the means for generating a data structure includes means for splitting any leaf node of the R+tree into a plurality of child nodes whenever the number of points in the leaf node exceeds a predetermined value; and
      
      each internal node of the R+tree has a minimum bounding rectangle MBR corresponding to each child node of the internal node, the MBR corresponding to the space including the points in the child node.
  - 22. The system as recited in claim 21, wherein the splitting is based on biased-splitting such that one dimension of the R+tree is selected for splitting the leaf nodes repeatedly until the length of the MBR of each new interior node in the selected dimension is less than 2 epsilons, before another dimension is used for splitting.
  - 23. The system as recited in claim 16, wherein the means for joining the points includes means for performing a nested-loop joining of all pairs of points of a leaf node based on the joining condition.
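
Each independent claim orders the split dimensions so that the next one chosen has the least correlation with those already used. One possible reading is a greedy selection over absolute Pearson correlations, sketched below. The choice of starting dimension, the use of the maximum correlation against already chosen dimensions, and the names `pearson` and `split_order` are all assumptions, since the claims do not fix them.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def split_order(points, first_dim=0):
    """Greedy order: each next dimension is the one least correlated with those chosen."""
    d = len(points[0])
    cols = [[p[i] for p in points] for i in range(d)]
    order = [first_dim]  # assumption: the starting dimension is supplied by the caller
    remaining = [i for i in range(d) if i != first_dim]
    while remaining:
        # Score a candidate by its strongest correlation with any chosen dimension.
        nxt = min(remaining,
                  key=lambda i: max(abs(pearson(cols[i], cols[j])) for j in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

Splitting on weakly correlated dimensions first tends to spread points across more stripes, so fewer leaf pairs survive to the point-level join.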

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Agrawal, Rakesh; Srikant, Ramakrishnan; Shim, Kyuseok
Primary Examiner(s)
Black, Thomas G.
Assistant Examiner(s)
Jung, David Yiuk

Application Number

US08/629,688
Time in Patent Office

1,302 Days
Field of Search

707/1-206, 707/200, 364/282.1, 364/DIG. 1
US Class Current

1/1
CPC Class Codes

G06F 16/2246   Trees, e.g. B+trees

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99934   Query formulation, input pr...

Y10S 707/99942   Manipulating data structure...
