Method and apparatus for fast similarity-based query, self-join, and join for massive, high-dimension datasets
First Claim
Patent Images
1. A computer implemented method comprising:
- (a) specifying a maximum of G items per leaf node;
(b) inputting n distinct input content items and denoting said n distinct input content items as a current collection;
(c) determining if n>
G; and
(d) if not then(d1) building a leaf node;
(d2) filling said leaf node with said n distinct input content items; and
(d3) connecting a link from a parent to said leaf node capable of being stored in hardware on said computer and capable of being displayed to a user.(e) if yes then(e1) computing a vector sum over all items in said current collection, wherein said vector sum is denoted vsplit, and vsplit=sum(i;
x.i)/n, where i denotes an index (i=1, . . . , n) and x.i denotes a vector at index i;
(e2) computing a vector difference for each item in said current collection, wherein said vector difference is denoted as d.i, and d.i=x.i−
vsplit;
(e3) computing a scalar value for each item in said current collection, wherein said scalar value is denoted as p.i, and p.i=<
d.i, vsplit>
where <
.>
denotes inner product, and making a collection of said computed scalar value for each item;
(e4) determining if p.i<
3; and
(f) if not then(f1) removing a largest p.i from said collection(f2) removing a smallest p.i from said collection; and
(f3) resuming at (e3)(g) if yes then(g1) determining if 1 or 2 computed values remain in said collection; and
(h) if 1 then(h1) letting p.split be said 1 computed remaining value; and
(h2) resuming at (j);
(i) if 2 then(i1) letting p.split be an average of said 2 computed remaining values; and
(i2) resuming at (j);
(j) defining a splitter which consists of said vsplit and said p.split;
(k) denoting for each of said n distinct input content items in said current collection a designation of “
upper”
if p.i>
p.split, otherwise a designation of “
lower”
;
(l) building an interior node, consisting of said splitter, and defining links to said “
lower” and
said “
upper”
nodes;
(m) inputting said “
lower”
nodes as items into a new “
lower”
current collection, letting new “
lower”
n denote the number of items in said new “
lower”
current collection, replacing said current collection with said new “
lower”
current collection, replacing said n with said new “
lower”
n, and resuming at (c);
(n) inputting said “
upper”
nodes as items into a new “
upper”
current collection, letting new “
upper”
n denote the number of items in said new “
upper”
current collection replacing said current collection with said new “
upper”
current collection, replacing said n with said new “
upper”
n, and resuming at (c).
1 Assignment
0 Petitions
Accused Products
Abstract
A method and apparatus for fast similarity-based query, self-join, and join for massive, high-dimension datasets have been disclosed.
19 Citations
10 Claims
-
1. A computer implemented method comprising:
-
(a) specifying a maximum of G items per leaf node; (b) inputting n distinct input content items and denoting said n distinct input content items as a current collection; (c) determining if n>
G; and(d) if not then (d1) building a leaf node; (d2) filling said leaf node with said n distinct input content items; and (d3) connecting a link from a parent to said leaf node capable of being stored in hardware on said computer and capable of being displayed to a user. (e) if yes then (e1) computing a vector sum over all items in said current collection, wherein said vector sum is denoted vsplit, and vsplit=sum(i;
x.i)/n, where i denotes an index (i=1, . . . , n) and x.i denotes a vector at index i;(e2) computing a vector difference for each item in said current collection, wherein said vector difference is denoted as d.i, and d.i=x.i−
vsplit;(e3) computing a scalar value for each item in said current collection, wherein said scalar value is denoted as p.i, and p.i=<
d.i, vsplit>
where <
.>
denotes inner product, and making a collection of said computed scalar value for each item;(e4) determining if p.i<
3; and(f) if not then (f1) removing a largest p.i from said collection (f2) removing a smallest p.i from said collection; and (f3) resuming at (e3) (g) if yes then (g1) determining if 1 or 2 computed values remain in said collection; and (h) if 1 then (h1) letting p.split be said 1 computed remaining value; and (h2) resuming at (j); (i) if 2 then (i1) letting p.split be an average of said 2 computed remaining values; and (i2) resuming at (j); (j) defining a splitter which consists of said vsplit and said p.split; (k) denoting for each of said n distinct input content items in said current collection a designation of “
upper”
if p.i>
p.split, otherwise a designation of “
lower”
;(l) building an interior node, consisting of said splitter, and defining links to said “
lower” and
said “
upper”
nodes;(m) inputting said “
lower”
nodes as items into a new “
lower”
current collection, letting new “
lower”
n denote the number of items in said new “
lower”
current collection, replacing said current collection with said new “
lower”
current collection, replacing said n with said new “
lower”
n, and resuming at (c);(n) inputting said “
upper”
nodes as items into a new “
upper”
current collection, letting new “
upper”
n denote the number of items in said new “
upper”
current collection replacing said current collection with said new “
upper”
current collection, replacing said n with said new “
upper”
n, and resuming at (c). - View Dependent Claims (2, 3, 4, 5)
-
-
6. A hardware based apparatus comprising:
-
(a) means for specifying a maximum of G items per leaf node; (b) means for inputting n distinct input content items and denoting said n distinct input content items as a current collection; (c) means for determining if n>
G; and(d) if not then (d1) means for building a leaf node; (d2) means for filling said leaf node with said n distinct input content items; and (d3) means for connecting a link from a parent to said leaf node capable of being stored in a memory and said memory capable of being displayed to a user. (e) if yes then (e1) means for computing a vector sum over all items in said current collection, wherein said vector sum is denoted vsplit, and vsplit=sum(i;
x.i)/n, where i denotes an index (i=1, . . . , n) and x.i denotes a vector at index i;(e2) means for computing a vector difference for each item in said current collection, wherein said vector difference is denoted as d.i, and d.i=x.i−
vsplit;(e3) means for computing a scalar value for each item in said current collection, wherein said scalar value is denoted as p.i, and p.i=<
d.i, vsplit>
where <
.>
denotes inner product, and making a collection of said computed scalar value for each item;(e4) means for determining if p.i<
3; and(f) if not then (f1) means for removing a largest p.i from said collection (f2) means for removing a smallest p.i from said collection; and (f3) means for resuming at (e3) (g) if yes then (g1) means for determining if 1 or 2 computed values remain in said collection; and (h) if 1 then (h1) means for letting p.split be said 1 computed remaining value; and (h2) means for resuming at (j); (i) if 2 then (i1) means for letting p.split be an average of said 2 computed remaining values; and (i2) means for resuming at (j); (j) means for defining a splitter which consists of said vsplit and said p.split; (k) means for denoting for each of said n distinct input content items in said current collection a designation of “
upper”
if p.i>
p.split, otherwise a designation of “
lower”
;(l) means for building an interior node, consisting of said splitter, and defining links to said “
lower” and
said “
upper”
nodes;(m) means for inputting said “
lower”
nodes as items into a new “
lower”
current collection, letting new “
lower”
n denote the number of items in said new “
lower”
current collection, replacing said current collection with said new “
lower”
current collection, replacing said n with said new “
lower”
n, and resuming at (c);(n) means for inputting said “
upper”
nodes as items into a new “
upper”
current collection, letting new “
upper”
n denote the number of items in said new “
upper”
current collection, replacing said current collection with said new “
upper”
current collection, replacing said n with said new “
upper”
n, and resuming at (c). - View Dependent Claims (7, 8, 9, 10)
-
Specification