Partition-based high dimensional similarity join method

US 7,167,868 B2
Filed: 08/13/2003
Issued: 01/23/2007
Est. Priority Date: 09/11/2002
Status: Expired due to Fees

Abstract

A partition-based high dimensional similarity join method allows similarity to be measured efficiently by dynamically selecting, in advance, the space partitioning dimensions and the number of partitioning dimensions using a dimension selection algorithm. The method performs a similarity join on high dimensional data in a relatively short time without requiring massive storage space. The method according to the present invention comprises the steps of partitioning a high dimensional data space and performing joins between predetermined data sets. The dimensions used to partition the high dimensional data space and the number of partitioning dimensions are determined before the space is partitioned, and the joins are performed only when the respective cells of the data sets overlap or neighbor each other.
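
As a rough illustration of the last sentence of the abstract, the following Python sketch joins two small in-memory point sets by hashing them into cells of width ε along the chosen partitioning dimensions and comparing only points whose cells coincide or are adjacent. The function names, the Euclidean distance, and the in-memory grid are assumptions made for illustration; the text here does not give an implementation.

```python
import math
from collections import defaultdict
from itertools import product


def cell_of(point, dims, eps):
    """Map a point to its grid cell along the chosen partitioning dimensions."""
    return tuple(int(point[d] // eps) for d in dims)


def similarity_join(R, S, eps, dims):
    """Return index pairs (i, j) with Euclidean distance(R[i], S[j]) <= eps.

    Only points whose cells are identical or adjacent along every
    partitioning dimension are compared (illustrative assumption).
    """
    # Bucket the points of S by cell.
    grid = defaultdict(list)
    for j, s in enumerate(S):
        grid[cell_of(s, dims, eps)].append(j)

    # Offsets of a cell and all of its neighbors: -1, 0, +1 per dimension.
    offsets = list(product((-1, 0, 1), repeat=len(dims)))

    result = []
    for i, r in enumerate(R):
        base = cell_of(r, dims, eps)
        for off in offsets:
            neighbor = tuple(b + o for b, o in zip(base, off))
            for j in grid.get(neighbor, ()):
                # The distance is computed over all dimensions of the points.
                if math.dist(r, S[j]) <= eps:
                    result.append((i, j))
    return result
```

Because the cells are ε wide, two points within distance ε of each other can differ by at most one cell index along any dimension, so restricting the comparison to neighboring cells should not lose any matching pair. The number of neighbor cells grows as 3 to the power of the number of partitioning dimensions, which is one reason the number of dimensions is chosen carefully.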

8 Claims

1. A partition-based high dimensional similarity join method, comprising:
- determining a total number of dimensions d_p for use in partitioning a high dimensional data space and a total number of partitioning dimensions;
  
  partitioning the high dimensional data space in accordance with the determined dimensions and the total number of partitioning dimensions;
  
  performing joins between data sets according to the partitioned dimensions; and
  
  counting a number of join computations which occur in the joins between the respective cells of the data sets,
  
  wherein the total number of dimensions d_p for use in partitioning the high dimensional data space is determined based on the number of join computations, and
  
  wherein the total number of dimensions d_p used in partitioning the high dimensional data space is obtained by comparing a size of the data sets and a size of disk blocks in which the data sets are stored, according to the following equation:
  
  $d_{p} = \frac{\log \frac{\min(|R|_{block}, |S|_{block})}{BlockSize}}{\log \lceil 1/\varepsilon \rceil},$
  
  where |R|_block and |S|_block are the total numbers of disk blocks in which the data sets R and S are stored, respectively, BlockSize is the size of the disk blocks, and ⌈1/ε⌉ is a number of the cells.
- - 2. The method as claimed in claim 1, wherein the number of join computations is obtained by computing a number of entries of the data sets R and S included in the respective cells for respective dimensions and then counting a number of distance computations of joins between the cells for the respective dimensions.
  - 3. The method as claimed in claim 2, wherein the number of join computations is obtained by computing a number of entries of the data sets R and S included in sampled cells among the cells for the respective dimensions and then counting a number of distance computations of joins between the cells for the respective dimensions.
  - 4. The method of claim 1, wherein the joins are performed only when respective cells in the data sets are overlapping with each other or are neighboring each other.
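
The equation in claim 1 can be evaluated directly. The Python sketch below is a literal transcription with hypothetical parameter names. The claim does not say how a non-integer result is rounded, so the function returns the real value and leaves that choice to the caller.

```python
import math


def partition_dimension_count(r_blocks, s_blocks, block_size, eps):
    """Evaluate d_p = log(min(|R|_block, |S|_block) / BlockSize) / log(ceil(1/eps)).

    r_blocks, s_blocks: numbers of disk blocks holding data sets R and S.
    block_size: size of a disk block, in the units the claim intends.
    eps: similarity threshold; ceil(1/eps) is the number of cells.
    """
    cells = math.ceil(1 / eps)
    if cells < 2:
        raise ValueError("eps must be small enough to give at least two cells")
    ratio = min(r_blocks, s_blocks) / block_size
    if ratio <= 0:
        raise ValueError("block counts and block size must be positive")
    # The base of the logarithm cancels in the quotient.
    return math.log(ratio) / math.log(cells)


# Example with made-up numbers: min(100000, 250000) / 10 = 10**4 and
# ceil(1 / 0.1) = 10 cells, so d_p is about 4.
d_p = partition_dimension_count(100_000, 250_000, 10, 0.1)
```

Only the smaller of the two data sets enters the formula, and a smaller ε (more cells) lowers d_p for the same data size.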

5. A partition-based high dimensional similarity join method, comprising:
- determining a total number of dimensions d_p for use in partitioning a high dimensional data space and a total number of partitioning dimensions;
  
  partitioning the high dimensional data space in accordance with the determined dimensions and the total number of partitioning dimensions;
  
  performing joins between data sets according to the partitioned dimensions; and
  
  counting a number of join computations which occur in the joins between the respective cells of the data sets,
  
  wherein the total number of dimensions d_p used in partitioning the high dimensional data space is obtained by comparing a size of the data sets and a size of disk blocks in which the data sets are stored, according to the following equation:
  
  $d_{p} = \frac{\log \frac{\min(|R|_{block}, |S|_{block})}{BlockSize}}{\log \lceil 1/\varepsilon \rceil},$
  
  where |R|_block and |S|_block are the total numbers of disk blocks in which the data sets R and S are stored, respectively, BlockSize is the size of the disk blocks, and ⌈1/ε⌉ is a number of the cells.
- - 6. The method as claimed in claim 5, wherein the number of join computations is obtained by computing a number of entries of the data sets R and S included in the respective cells for respective dimensions and then counting a number of distance computations of joins between the cells for the respective dimensions.
  - 7. The method as claimed in claim 5, wherein the number of join computations is obtained by computing a number of entries of the data sets R and S included in sampled cells among the cells for the respective dimensions and then counting a number of distance computations of joins between the cells for the respective dimensions.
  - 8. The method of claim 5, wherein the joins are performed only when respective cells in the data sets are overlapping with each other or are neighboring each other.
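
The counting step in the dependent claims can be sketched as follows: for each dimension, count how many entries of R and S fall into each cell, then add up the distance computations that joins between a cell and its overlapping or neighboring cells would need. This Python sketch is one possible reading, not the patented procedure. The function names, the optional `sample_cells` argument (a stand-in for the sampled cells of the dependent claims), and the rule of keeping the cheapest dimensions are all assumptions.

```python
from collections import Counter


def join_cost_per_dimension(R, S, eps, sample_cells=None):
    """Estimate, per dimension, the number of distance computations.

    For dimension d, a cell c of R is joined with cells c-1, c and c+1 of S,
    which costs count_R[c] * (count_S[c-1] + count_S[c] + count_S[c+1]).
    If sample_cells is given, only those cell indices are counted.
    """
    n_dims = len(R[0])
    costs = []
    for d in range(n_dims):
        # Number of entries of each data set per cell along dimension d.
        r_counts = Counter(int(p[d] // eps) for p in R)
        s_counts = Counter(int(p[d] // eps) for p in S)
        cells = list(r_counts) if sample_cells is None else sample_cells
        cost = 0
        for c in cells:
            # A Counter returns 0 for cells that hold no entries.
            cost += r_counts[c] * (s_counts[c - 1] + s_counts[c] + s_counts[c + 1])
        costs.append(cost)
    return costs


def select_dimensions(costs, d_p):
    """Assumed selection rule: keep the d_p dimensions with the lowest cost."""
    return sorted(range(len(costs)), key=lambda d: costs[d])[:d_p]
```

In this reading, a dimension along which the data is spread evenly over many cells yields a low count, because each cell is joined with few entries, while a dimension along which most entries share a cell yields a high count and is a poor choice for partitioning.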

Current Assignee: Samsung Electronics Co. Ltd., Paceco Corporation (Tsuneishi Holdings Company Limited)
Original Assignee: Samsung Electronics Co. Ltd., Paceco Corporation (Tsuneishi Holdings Company Limited)
Inventors: Shin, Hyoseop
Primary Examiner(s): Corrielus; Jean M.
Assistant Examiner(s): Woo; Isaac
Application Number: US10/639,597
Publication Number: US 20040093320A1
Time in Patent Office: 1,259 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06F 16/283   Multi-dimensional databases...
G06F 16/40   of multimedia data, e.g. sl...
Y10S 707/99942   Manipulating data structure...
Y10S 707/99943   Generating database or data...
Y10S 707/99945   Object-oriented database st...
Y10S 707/99953   Recoverability
