Method and system for performing spatial similarity joins on high-dimensional points
First Claim
1. A method for performing similarity joins on high-dimensional points representing data objects of a database, the method comprising the steps of:
- generating a multi-dimensional data structure having a plurality of leaf nodes for organizing the points, each leaf node being split into .left brkt-bot.1/epsilon.right brkt-top. child nodes, where epsilon is a similar distance, based on the depth of the leaf node whenever the number of points associated with the leaf node exceeds a predetermined value, the dimensions used for splitting the nodes being in an order of correlation among the dimensions such that one selected for splitting next has the least correlation with previously used dimensions;
traversing the interior nodes of the data structure to select pairs of leaf nodes from which the points are joined; and
joining the points from the selected pairs of leaf nodes based on a joining condition that the distance between any two points to be joined is at most epsilon.
1 Assignment
0 Petitions
Accused Products
Abstract
A method and system are disclosed for performing spatial similarity joins on high-dimensional points that represent data objects of a database. The method comprises the steps of: generating a data structure based on the similarity distance ε for organizing the high-dimensional points, traversing the data structure to select pairs of leaf nodes from which the high-dimensional points are joined, and joining the points from selected pairs of nodes according to a joining condition based on the similarity distance ε. An efficient data structure referred to as an ε-K-D-B tree is disclosed to provide fast access to the high-dimensional points and to minimize system storage requirements. The invention provides algorithms for generating the ε-K-D-B tree using biased splitting to minimize the number of nodes to be examined during join operations. The traversing step includes joining selected pairs of nodes and also self-joining selected nodes. Alternatively, the data structure is an R+tree generated using biased splitting.
-
Citations
23 Claims
-
1. A method for performing similarity joins on high-dimensional points representing data objects of a database, the method comprising the steps of:
-
generating a multi-dimensional data structure having a plurality of leaf nodes for organizing the points, each leaf node being split into .left brkt-bot.1/epsilon.right brkt-top. child nodes, where epsilon is a similar distance, based on the depth of the leaf node whenever the number of points associated with the leaf node exceeds a predetermined value, the dimensions used for splitting the nodes being in an order of correlation among the dimensions such that one selected for splitting next has the least correlation with previously used dimensions; traversing the interior nodes of the data structure to select pairs of leaf nodes from which the points are joined; and joining the points from the selected pairs of leaf nodes based on a joining condition that the distance between any two points to be joined is at most epsilon. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A computer program product for use with a computer system for directing the system to perform spatial similarity joins on high-dimensional points, the points representing data objects of a database, the computer program product comprising:
-
a computer readable medium; means, provided on the computer-readable medium, for directing the system to generate a multi-dimensional data structure having a plurality of leaf nodes for organizing the points, each leaf node being split into .left brkt-bot.1/epsilon.right brkt-top. child nodes, where epsilon is a similar distance, based on the depth of the leaf node whenever the number of points associated with the leaf node exceeds a predetermined value, the dimensions used for splitting the nodes being in an order of correlation among the dimensions such that one selected for splitting next has the least correlation with previously used dimensions; means, provided on the computer-readable medium, for directing the system to traverse the interior nodes of the data structure to select pairs of leaf nodes from which the points are joined; and means, provided on the computer-readable medium, for directing the system to join the points from the selected pairs of leaf nodes based on a joining condition that the distance between any two points to be joined is at most epsilon. - View Dependent Claims (9, 10, 11, 12, 13, 14, 15)
-
-
16. A database system for performing similarity joins on high-dimensional points representing data objects of a database, the system comprising:
-
means for generating a multi-dimensional data structure having a plurality of leaf nodes for organizing the points, each leaf node being split into .left brkt-bot.1/epsilon.right brkt-top. child nodes, where epsilon is a similar distance, based on the depth of the leaf node whenever the number of points associated with the leaf node exceeds a predetermined value, the dimensions used for splitting the nodes being in an order of correlation among the dimensions such that one selected for splitting next has the least correlation with previously used dimensions; means for traversing the interior nodes of the data structure to select pairs of leaf nodes from which the points are joined; and means for joining the points from the selected pairs of leaf nodes based on a joining condition that the distance between any two points to be joined is at most epsilon. - View Dependent Claims (17, 18, 19, 20, 21, 22, 23)
-
Specification