×

Method and apparatus for fast similarity-based query, self-join, and join for massive, high-dimension datasets

  • US 7,644,090 B2
  • Filed: 06/24/2007
  • Issued: 01/05/2010
  • Est. Priority Date: 06/27/2006
  • Status: Active Grant
First Claim
Patent Images

1. A computer implemented method comprising:

  • (a) specifying a maximum of G items per leaf node;

    (b) inputting n distinct input content items and denoting said n distinct input content items as a current collection;

    (c) determining if n>

    G; and

    (d) if not then(d1) building a leaf node;

    (d2) filling said leaf node with said n distinct input content items; and

    (d3) connecting a link from a parent to said leaf node capable of being stored in hardware on said computer and capable of being displayed to a user.(e) if yes then(e1) computing a vector sum over all items in said current collection, wherein said vector sum is denoted vsplit, and vsplit=sum(i;

    x.i)/n, where i denotes an index (i=1, . . . , n) and x.i denotes a vector at index i;

    (e2) computing a vector difference for each item in said current collection, wherein said vector difference is denoted as d.i, and d.i=x.i−

    vsplit;

    (e3) computing a scalar value for each item in said current collection, wherein said scalar value is denoted as p.i, and p.i=<

    d.i, vsplit>

    where <

    .>

    denotes inner product, and making a collection of said computed scalar value for each item;

    (e4) determining if p.i<

    3; and

    (f) if not then(f1) removing a largest p.i from said collection(f2) removing a smallest p.i from said collection; and

    (f3) resuming at (e3)(g) if yes then(g1) determining if 1 or 2 computed values remain in said collection; and

    (h) if 1 then(h1) letting p.split be said 1 computed remaining value; and

    (h2) resuming at (j);

    (i) if 2 then(i1) letting p.split be an average of said 2 computed remaining values; and

    (i2) resuming at (j);

    (j) defining a splitter which consists of said vsplit and said p.split;

    (k) denoting for each of said n distinct input content items in said current collection a designation of “

    upper”

    if p.i>

    p.split, otherwise a designation of “

    lower”

    ;

    (l) building an interior node, consisting of said splitter, and defining links to said “

    lower” and

    said “

    upper”

    nodes;

    (m) inputting said “

    lower”

    nodes as items into a new “

    lower”

    current collection, letting new “

    lower”

    n denote the number of items in said new “

    lower”

    current collection, replacing said current collection with said new “

    lower”

    current collection, replacing said n with said new “

    lower”

    n, and resuming at (c);

    (n) inputting said “

    upper”

    nodes as items into a new “

    upper”

    current collection, letting new “

    upper”

    n denote the number of items in said new “

    upper”

    current collection replacing said current collection with said new “

    upper”

    current collection, replacing said n with said new “

    upper”

    n, and resuming at (c).

View all claims
  • 1 Assignment
Timeline View
Assignment View
    ×
    ×