Partition-based high dimensional similarity join method
Abstract
A partition-based high dimensional similarity join method allows similarity to be measured efficiently by dynamically selecting, in advance, the space partitioning dimensions and the number of partitioning dimensions using a dimension selection algorithm. The method performs similarity joins on high dimensional data in a relatively short time without requiring massive storage space. The method according to the present invention comprises the steps of partitioning a high dimensional data space and performing joins between predetermined data sets. The dimensions used to partition the high dimensional data space and the number of partitioning dimensions are determined before the space is partitioned, and the joins are performed only when the respective cells of the data sets overlap or neighbor each other.
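The cell-based join described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation. It assumes points normalized to the unit hypercube, a Euclidean distance threshold `eps`, and a caller-supplied list of partitioning dimensions `part_dims`; the names `cell_of`, `partition`, and `similarity_join` are hypothetical. Each partitioning dimension is cut into ceil(1/eps) cells of width `eps`, so two points within `eps` of each other always fall in the same cell or in neighboring cells.

```python
import math
from collections import defaultdict
from itertools import product


def cell_of(point, part_dims, eps):
    """Return the cell index of a point along the partitioning dimensions."""
    n_cells = math.ceil(1.0 / eps)  # cells per partitioning dimension
    # Clamp so that a coordinate of exactly 1.0 lands in the last cell.
    return tuple(min(int(point[d] // eps), n_cells - 1) for d in part_dims)


def partition(points, part_dims, eps):
    """Group points by cell (assumed layout: dict from cell index to points)."""
    cells = defaultdict(list)
    for p in points:
        cells[cell_of(p, part_dims, eps)].append(p)
    return cells


def similarity_join(R, S, part_dims, eps):
    """Join R and S, comparing only overlapping or neighboring cells."""
    r_cells = partition(R, part_dims, eps)
    s_cells = partition(S, part_dims, eps)
    # Offsets -1, 0, +1 per partitioning dimension cover the cell itself
    # and all of its neighbors.
    offsets = list(product((-1, 0, 1), repeat=len(part_dims)))
    result = []
    for r_key, r_points in r_cells.items():
        for off in offsets:
            s_key = tuple(k + o for k, o in zip(r_key, off))
            for s in s_cells.get(s_key, ()):
                for r in r_points:
                    # Assumption: similarity means Euclidean distance <= eps.
                    if math.dist(r, s) <= eps:
                        result.append((r, s))
    return result
```

Partitioning on only a subset of the dimensions keeps the number of cells at ceil(1/eps) raised to the number of partitioning dimensions, rather than to the full dimensionality of the data, which is why the choice of those dimensions matters.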
8 Claims
1. A partition-based high dimensional similarity join method, comprising:
determining a total number of dimensions dp for use in partitioning a high dimensional data space and a total number of partitioning dimensions;
partitioning the high dimensional data space in accordance with the determined dimensions and the total number of partitioning dimensions;
performing joins between data sets according to the partitioned dimensions; and
counting a number of join computations which occur in the joins between the respective cells of the data sets,
wherein the total number of dimensions dp for use in partitioning the high dimensional data space are determined based on the number of join computations, and
wherein the total number of dimensions dp used in partitioning the high dimensional data space is obtained by comparing a size of the data sets and a size of disk blocks in which the data sets are stored, according to the following equation [equation missing in source];
where |R|block and |S|block are a total numbers of disk blocks in which the data sets R and S are stored, respectively, the Blocksize is the size of the disk blocks, and [1/ε] is a number of the cells.
(Dependent claims: 2, 3, 4)

5. A partition-based high dimensional similarity join method, comprising:
determining a total number of dimensions dp for use in partitioning a high dimensional data space and a total number of partitioning dimensions;
partitioning the high dimensional data space in accordance with the determined dimensions and the total number of partitioning dimensions;
performing joins between data sets according to the partitioned dimensions; and
counting a number of join computations which occur in the joins between the respective cells of the data sets,
wherein the total number of dimensions dp used in partitioning the high dimensional data space is obtained by comparing a size of the data sets and a size of disk blocks in which the data sets are stored, according to the following equation [equation missing in source];
where |R|block and |S|block are a total numbers of disk blocks in which the data sets R and S are stored, respectively, the BlockSize is the size of the disk blocks, and [1/ε] is a number of the cells.
(Dependent claims: 6, 7, 8)

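The claims tie the choice of partitioning dimensions to a count of join computations between cells. One way this might look is sketched below. It is an illustration under stated assumptions, not the claimed algorithm: points are assumed to lie in the unit hypercube, the cost of a dimension is assumed to be the number of point pairs that fall in the same or adjacent cells along that dimension, and the number of dimensions to keep, `d_p`, is passed in by the caller because the claimed equation for dp is not reproduced in this text. The function names `join_cost` and `select_dimensions` are hypothetical.

```python
import math


def join_cost(R, S, dim, eps):
    """Count the pairwise comparisons needed if only `dim` is partitioned."""
    n_cells = math.ceil(1.0 / eps)  # cells along this dimension
    r_hist = [0] * n_cells
    s_hist = [0] * n_cells
    for p in R:
        r_hist[min(int(p[dim] // eps), n_cells - 1)] += 1
    for p in S:
        s_hist[min(int(p[dim] // eps), n_cells - 1)] += 1
    cost = 0
    for i, r_count in enumerate(r_hist):
        # Each R cell is joined with the S cell at the same index
        # and with its left and right neighbors.
        lo, hi = max(i - 1, 0), min(i + 1, n_cells - 1)
        cost += r_count * sum(s_hist[lo:hi + 1])
    return cost


def select_dimensions(R, S, eps, d_p):
    """Pick the d_p dimensions with the fewest counted join computations."""
    n_dims = len(R[0])
    ranked = sorted(range(n_dims), key=lambda d: join_cost(R, S, d, eps))
    return ranked[:d_p]
```

Under this cost model, a dimension along which the data is spread evenly scores lower than one along which the data is clustered, because fewer points share a cell or a neighboring cell.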
Specification