Method and Apparatus for fast similarity-based query, self-join, and join for massive, high-dimension datasets

US 20070299865A1
Filed: 06/24/2007
Published: 12/27/2007
Est. Priority Date: 06/27/2006
Status: Active Grant

First Claim

Patent Images

1. A computer implemented method comprising:

(a) specifying a maximum of leaf nodes as G items;

(b) inputting n input content items and denoting said n input items as a current collection;

(c) determining if n>

G; and

(d) if not then(d1) building a leaf node;

(d2) filling said leaf node with said n input content items; and

(d3) connecting a link from a parent to said leaf node capable of being stored in hardware on said computer and capable of being displayed to a user.(e) if yes then(e1) computing a vector sum over all items in said current collection, wherein said vector sum is vsplit=sum(i;

x.i)/n;

(e2) computing a vector difference for each item in said current collection, wherein said vector difference is d.i=x.i−

vsplit;

(e3) computing a scalar value for each item in said current collection, wherein said scalar value is p.i=<

d.i, vsplit>

, and making a collection of said computed scalar value for each item;

(e4) determining if p.i<

3; and

(f) if not then(f1) removing a largest p.i from said collection(f2) removing a smallest p.i from said collection; and

(f3) resuming at (e3).(g) if yes then(g1) determining if 1 or 2 computed values remain in said collection; and

(h) if 1 then

(h1) letting p.split be said 1 computed remaining value; and

(h2) resuming at (j);

(i) if 2 then

(i1) letting p.split be an average of said 2 computed remaining values; and

(i2) resuming at (1);

(j) defining a splitter which consists of said vsplit and said p.split;

(k) denoting for each of said content items in said current collection a designation of “

upper”

if p.i>

p.split, otherwise a designation of “

lower”

;

(l) building an interior node, consisting of said splitter, and defining links to said “

lower” and

said “

upper”

nodes;

(m) inputting said “

lower”

nodes as items into a new “

lower”

current collection, letting new “

lower”

n denote the number of items in said new “

lower”

current collection, replacing said current collection with said new “

lower”

current collection, replacing said n with said new “

lower”

n, and resuming at (c);

(n) inputting said “

upper”

nodes as items into a new “

upper”

current collection, letting new “

upper”

n denote the number of items in said new “

upper”

current collection, replacing said current collection with said new “

upper”

current collection, replacing said n with said new “

upper”

n, and resuming at (c);

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A method and apparatus for fast similarity-based query, self-join, and join for massive, high-dimension datasets have been disclosed.

16 Citations

View as Search Results

12 Claims

1. A computer implemented method comprising:
- (a) specifying a maximum of leaf nodes as G items;
  
  (b) inputting n input content items and denoting said n input items as a current collection;
  
  (c) determining if n>
  
  G; and
  
  (d) if not then(d1) building a leaf node;
  
  (d2) filling said leaf node with said n input content items; and
  
  (d3) connecting a link from a parent to said leaf node capable of being stored in hardware on said computer and capable of being displayed to a user.(e) if yes then(e1) computing a vector sum over all items in said current collection, wherein said vector sum is vsplit=sum(i;
  
  x.i)/n;
  
  (e2) computing a vector difference for each item in said current collection, wherein said vector difference is d.i=x.i−
  
  vsplit;
  
  (e3) computing a scalar value for each item in said current collection, wherein said scalar value is p.i=<
  
  d.i, vsplit>
  
  , and making a collection of said computed scalar value for each item;
  
  (e4) determining if p.i<
  
  3; and
  
  (f) if not then(f1) removing a largest p.i from said collection(f2) removing a smallest p.i from said collection; and
  
  (f3) resuming at (e3).(g) if yes then(g1) determining if 1 or 2 computed values remain in said collection; and
  
  (h) if 1 then
  
  (h1) letting p.split be said 1 computed remaining value; and
  
  (h2) resuming at (j);
  
  (i) if 2 then
  
  (i1) letting p.split be an average of said 2 computed remaining values; and
  
  (i2) resuming at (1);
  
  (j) defining a splitter which consists of said vsplit and said p.split;
  
  (k) denoting for each of said content items in said current collection a designation of “
  
  upper”
  
  if p.i>
  
  p.split, otherwise a designation of “
  
  lower”
  
  ;
  
  (l) building an interior node, consisting of said splitter, and defining links to said “
  
  lower” and
  
  said “
  
  upper”
  
  nodes;
  
  (m) inputting said “
  
  lower”
  
  nodes as items into a new “
  
  lower”
  
  current collection, letting new “
  
  lower”
  
  n denote the number of items in said new “
  
  lower”
  
  current collection, replacing said current collection with said new “
  
  lower”
  
  current collection, replacing said n with said new “
  
  lower”
  
  n, and resuming at (c);
  
  (n) inputting said “
  
  upper”
  
  nodes as items into a new “
  
  upper”
  
  current collection, letting new “
  
  upper”
  
  n denote the number of items in said new “
  
  upper”
  
  current collection, replacing said current collection with said new “
  
  upper”
  
  current collection, replacing said n with said new “
  
  upper”
  
  n, and resuming at (c);

2. A computer implemented method comprising:
- (a) inputting a dataset for which a similarity index has been built;
  
  (b) inputting a query q;
  
  (c) inputting a desired similarity threshold s;
  
  (d) setting a current node to a root of said similarity index.(e) determining if said current node is a leaf or an interior node; and
  
  (f) if leaf then(f1) computing a similarity between q and each item in said leaf node; and
  
  (f2) returning one or more of said each item that meets said desired similarity threshold s, said one or more of said each item capable of being stored in hardware on said computer and capable of being displayed to a user.(g) if interior node then(g1) obtaining a splitter (vsplit, p.split) from said interior node; and
  
  (g2) computing r=<
  
  q−
  
  vsplit, vsplit>
  
  ;
  
  (h) determining if r—
  
  p.split>
  
  0; and
  
  (i) if yes then(i1) setting said current node to be a “
  
  upper”
  
  child node; and
  
  (i2) resuming at (e);
  
  (j) if no then(j1) setting said current node to be a “
  
  lower”
  
  child node; and
  
  (j2) resuming at (e);

3. A computer implemented method comprising:
- (a) specifying a maximum cluster size of G;
  
  (b) specifying a minimum cluster similarity of s;
  
  (c) inputting n content items;
  
  (d) building a bulk similarity index for said n content items, yielding a tree with leaf nodes of G or fewer content items per node;
  
  (e) computing for each leaf node a pairwise similarity between items;
  
  (f) determining if there is at least one pair of items with similarity exceeding s; and
  
  (g) if no then(g1) defining a “
  
  oneoff”
  
  cluster;
  
  (g2) putting said items in said “
  
  oneoff”
  
  cluster; and
  
  (g3) resuming at (i);
  
  (h) if yes then(h1) computing a highest pairwise similarity, and designating one item as a “
  
  anchor”
  
  of a cluster;
  
  (h2) defining a “
  
  good”
  
  cluster;
  
  (h3) placing into said “
  
  good”
  
  cluster each item in said current node if its similarity to said “
  
  anchor”
  
  exceeds s, otherwise designate any remaining items as “
  
  residuals” and
  
  collect them separately;
  
  (i) determining if all leaf nodes have been processed; and
  
  (j) if not then(j1) resuming at (e);
  
  (k) if so then(l) determining if any “
  
  residuals”
  
  have been collected at this point; and
  
  (m) if not then(m1) storing in hardware on said computer any “
  
  good”
  
  cluster and any “
  
  oneoff”
  
  cluster, said stored any “
  
  good”
  
  cluster and said stored any “
  
  oneoff”
  
  cluster capable of being displayed to a user.(n) if yes then(n1) gathering said “
  
  residuals”
  
  collected earlier; and
  
  (n2) resuming at (c);

4. A computer implemented method comprising:
- (a) inputting datasets 1 and 2(b) inputting a similarity threshold of s*;
  
  (c) inputting for each said datasets 1 and 2 a “
  
  grouping”
  
  size G.I;
  
  (d) computing, based on datasets 1 and 2 size, an anticipated height of a tree, h.i=ceiling(log 2(N.i/G.i));
  
  (e) building for datasets 1 and 2 a self-join tree using a similarity threshold, s.i=1−
  
  [(1−
  
  s*)/(2*h.i−
  
  1], yielding a tree of good groups and oneoff groups;
  
  (f) designating a “
  
  topmost”
  
  set as an anchor of a root good group, together with members of said oneoff groups;
  
  (g) forming a dataset from said “
  
  topmost”
  
  set;
  
  (h) performing a self-join on said dataset, using said similarity threshold s*;
  
  (i) outputting from said self-join on said dataset “
  
  good”
  
  groups that have at least one member from dataset 1, and at least one member from dataset 2 denoting these a s*join cluster, said s*join cluster capable of being stored in hardware on said computer and capable of being displayed to a user.
- View Dependent Claims (5, 6)
- - 5. The computer implemented method of claim 4 wherein said “
    - grouping”
      
      size G.i is determined in advance.
  - 6. The computer implemented method of claim 4 wherein said “
    - grouping”
      
      size G.i minimizes overall computation time of building a similarity index.

7. A hardware based apparatus comprising:
- (a) means for specifying a maximum of leaf nodes as G items;
  
  (b) means for inputting n input content items and denoting said n input items as a current collection;
  
  (c) means for determining if n>
  
  G; and
  
  (d) if not then(d1) means for building a leaf node;
  
  (d2) means for filling said leaf node with said n input content items; and
  
  (d3) means for connecting a link from a parent to said leaf node capable of being stored in a memory and said memory capable of being displayed to a user.(e) if yes then(e1) means for computing a vector sum over all items in said current collection, wherein said vector sum is vsplit=sum(i;
  
  x.i)/n;
  
  (e2) means for computing a vector difference for each item in said current collection, wherein said vector difference is d.i=x.i−
  
  vsplit;
  
  (e3) means for computing a scalar value for each item in said current collection, wherein said scalar value is p.i=<
  
  d.i, vsplit>
  
  , and making a collection of said computed scalar value for each item;
  
  (e4) means for determining if p.i<
  
  3; and
  
  (f) if not then(f1) means for removing a largest p.i from said collection(f2) means for removing a smallest p.i from said collection; and
  
  (f3) means for resuming at (e3).(g) if yes then(g1) means for determining if 1 or 2 computed values remain in said collection; and
  
  (h) if 1 then
  
  (h1) means for letting p.split be said 1 computed remaining value; and
  
  (h2) means for resuming at (j);
  
  (i) if 2 then
  
  (i1) means for letting p.split be an average of said 2 computed remaining values; and
  
  (i2) means for resuming at (1);
  
  (j) means for defining a splitter which consists of said vsplit and said p.split;
  
  (k) means for denoting for each of said content items in said current collection a designation of “
  
  upper”
  
  if p.i>
  
  p.split, otherwise a designation of “
  
  lower”
  
  ;
  
  (l) means for building an interior node, consisting of said splitter, and defining links to said “
  
  lower” and
  
  said “
  
  upper”
  
  nodes;
  
  (m) means for inputting said “
  
  lower”
  
  nodes as items into a new “
  
  lower”
  
  current collection, letting new “
  
  lower”
  
  n denote the number of items in said new “
  
  lower”
  
  current collection, replacing said current collection with said new “
  
  lower”
  
  current collection, replacing said n with said new “
  
  lower”
  
  n, and resuming at (c);
  
  (n) means for inputting said “
  
  upper”
  
  nodes as items into a new “
  
  upper”
  
  current collection, letting new “
  
  upper”
  
  n denote the number of items in said new “
  
  upper”
  
  current collection, replacing said current collection with said new “
  
  upper”
  
  current collection, replacing said n with said new “
  
  upper”
  
  n, and resuming at (c);

8. A hardware based apparatus comprising:
- (a) means for inputting a dataset for which a similarity index has been built;
  
  (b) means for inputting a query q;
  
  (c) means for inputting a desired similarity threshold s;
  
  (d) means for setting a current node to a root of said similarity index.(e) means for determining if said current node is a leaf or an interior node; and
  
  (f) if leaf then(f1) means for computing a similarity between q and each item in said leaf node; and
  
  (f2) means for returning one or more of said each item that meets said desired similarity threshold s, said one or more of said each item capable of being stored in a memory and said memory capable of being displayed to a user.(g) if interior node then(g1) means for obtaining a splitter (vsplit, p.split) from said interior node; and
  
  (g2) means for computing r=<
  
  q−
  
  vsplit, vsplit>
  
  ;
  
  (h) determining if r−
  
  p.split>
  
  0; and
  
  (i) if yes then(i1) means for setting said current node to be a “
  
  upper”
  
  child node; and
  
  (i2) means for resuming at (e);
  
  (j) if no then(j1) means for setting said current node to be a “
  
  lower”
  
  child node; and
  
  (j2) means for resuming at (e);

9. A hardware based apparatus comprising:
- (a) means for specifying a maximum cluster size of G;
  
  (b) means for specifying a minimum cluster similarity of s;
  
  (c) means for inputting n content items;
  
  (d) means for building a bulk similarity index for said n content items, yielding a tree with leaf nodes of G or fewer content items per node;
  
  (e) means for computing for each leaf node a pairwise similarity between items;
  
  (f) means for determining if there is at least one pair of items with similarity exceeding s; and
  
  (g) if no then(g1) means for defining a “
  
  oneoff”
  
  cluster;
  
  (g2) means for putting said items in said “
  
  oneoff”
  
  cluster; and
  
  (g3) means for resuming at (i);
  
  (h) if yes then(h1) means for computing a highest pairwise similarity, and designating one item as a “
  
  anchor”
  
  of a cluster;
  
  (h2) means for defining a “
  
  good”
  
  cluster;
  
  (h3) means for placing into said “
  
  good”
  
  cluster each item in said current node if its similarity to said “
  
  anchor”
  
  exceeds s, otherwise designate any remaining items as “
  
  residuals” and
  
  collect them separately;
  
  (i) means for determining if all leaf nodes have been processed; and
  
  (j) if not then(j1) means for resuming at (e);
  
  (k) if so then(l) means for determining if any “
  
  residuals”
  
  have been collected at this point; and
  
  (m) if not then(m1) means for storing in a memory any “
  
  good”
  
  cluster and any “
  
  oneoff”
  
  cluster, and said memory capable of being displayed to a user.(n) if yes then(n1) means for gathering said “
  
  residuals”
  
  collected earlier; and
  
  (n2) means for resuming at (c);

10. A hardware based apparatus comprising:
- (a) means for inputting datasets 1 and 2(b) means for inputting a similarity threshold of s*;
  
  (c) means for inputting for each said datasets 1 and 2 a “
  
  grouping”
  
  size G.i;
  
  (d) means for computing, based on datasets 1 and 2 size, an anticipated height of a tree, h.i=ceiling(log 2(N.i/G.i));
  
  (e) means for building for datasets 1 and 2 a self-join tree using a similarity threshold, s.i=1−
  
  [(1−
  
  s*)/(2*h.i−
  
  1], yielding a tree of good groups and oneoff groups;
  
  (f) means for designating a “
  
  topmost”
  
  set as an anchor of a root good group, together with members of said oneoff groups;
  
  (g) means for forming a dataset from said “
  
  topmost”
  
  set;
  
  (h) means for performing a self-join on said dataset, using said similarity threshold s*;
  
  (i) means for outputting from said self-join on said dataset “
  
  good”
  
  groups that have at least one member from dataset 1, and at least one member from dataset 2 denoting these a s*join cluster, said s*join cluster capable of being stored in a memory, and said memory capable of being displayed to a user.
- View Dependent Claims (11, 12)
- - 11. The hardware based apparatus of claim 10 wherein said “
    - grouping”
      
      size G.i is determined in advance.
  - 12. The hardware based apparatus of claim 10 wherein said “
    - grouping”
      
      size G.i minimizes overall computation time of building a similarity index.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Nahava, Inc.
Original Assignee
Nahava, Inc.
Inventors
Nakano, Russell T., Cheng, Stanley

Granted Patent

US 7,644,090 B2
Time in Patent Office

Days
Field of Search
US Class Current

1/1
CPC Class Codes

G06F 16/2264 Multidimensional index stru...

Method and Apparatus for fast similarity-based query, self-join, and join for massive, high-dimension datasets

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

16 Citations

12 Claims

Specification

Solutions

Use Cases

Quick Links

Method and Apparatus for fast similarity-based query, self-join, and join for massive, high-dimension datasets

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

16 Citations

12 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links