Vector index preparing method, similar vector searching method, and apparatuses for the methods
First Claim
1. A method of preparing a mechanically searchable index with respect to a vector database in which a finite number of sets each including at least N-dimensional real vector and an identification number of the vector are registered as vector data, said method comprising:
- a first step of vector index preparation of dividing N components into m sets in a predetermined method with respect to the N-dimensional real vector V of each vector data in said vector database, preparing m partial vectors v1 to vm, subsequently tabulating a distribution of a norm of the partial vector vk (k=1 to m), preparing a norm division table in which a norm range of a predetermined D type norm division is determined, calculating a region number d to which said partial vector vk belongs in accordance with predetermined D region center vectors p1 to pD, tabulating a distribution of a cosine (vk·
pd)/(|Vk|*|pd|) of an angle formed by said partial vector vk and the region center vector pd as a declination distribution, and preparing a declination division table in which a declination range of the predetermined C type declination division is recorded;
a second step of the vector index preparation of dividing N components into m sets in the same method as said first step with respect to the N-dimensional real vector V of each vector data in said vector database, preparing m partial vectors v1 to vm, referring to said norm division table to calculate a number r of the norm division to which the norm of said partial vector vb belongs with respect to the partial vector vb (b=1 to m) for the partial space number b, calculating the region number d to which said partial vector vb belongs in accordance with the predetermined D region center vectors pl to pD in the same method as said first step, calculating a declination (vb·
pd)/(|vb|*|pd|) as a cosine of an angle formed by said partial vector vb and the region center vector pd indicating a center direction of the region of said region number d, referring to said declination division table, calculating a number c of the belonging declination division, and calculating index registration data to be registered in a vector index from said partial space number b, said region number d, said declination division number c, said norm division number r, the component of said partial vector vb, and the identification number i; and
a third step of the vector index preparation of constituting the vector index such that the identification number and the component of each partial vector can be searched using a set of the partial space number b, the region number d, the declination division number c and a norm division number range [r1, r2] as a key from said norm division table, said declination division table, and said index registration data, and such that the vector component of each vector data can be searched with the identification number of the vector component.
1 Assignment
0 Petitions
Accused Products
Abstract
In the present invention, a similar vector is searched from a several hundreds dimensional vector database at a high speed, by a single vector index, and in accordance with either measure of an inner product or a distance by designating a similarity search range and maximum obtained pieces number, vector index preparation is performed by decomposing each vector into a plurality of partial vectors and characterizing the vector by a norm division, belonging region and declination division to prepare an index, and similarity search is performed by obtaining a partial query vector and partial search range from a query vector and search range, performing similarity search in each partial space to accumulate a difference from the search range and to obtain an upper limit value, and obtaining a correct measure from a higher upper limit value to obtain a final similarity search result.
27 Citations
29 Claims
-
1. A method of preparing a mechanically searchable index with respect to a vector database in which a finite number of sets each including at least N-dimensional real vector and an identification number of the vector are registered as vector data, said method comprising:
-
a first step of vector index preparation of dividing N components into m sets in a predetermined method with respect to the N-dimensional real vector V of each vector data in said vector database, preparing m partial vectors v1 to vm, subsequently tabulating a distribution of a norm of the partial vector vk (k=1 to m), preparing a norm division table in which a norm range of a predetermined D type norm division is determined, calculating a region number d to which said partial vector vk belongs in accordance with predetermined D region center vectors p1 to pD, tabulating a distribution of a cosine (vk·
pd)/(|Vk|*|pd|) of an angle formed by said partial vector vk and the region center vector pd as a declination distribution, and preparing a declination division table in which a declination range of the predetermined C type declination division is recorded;
a second step of the vector index preparation of dividing N components into m sets in the same method as said first step with respect to the N-dimensional real vector V of each vector data in said vector database, preparing m partial vectors v1 to vm, referring to said norm division table to calculate a number r of the norm division to which the norm of said partial vector vb belongs with respect to the partial vector vb (b=1 to m) for the partial space number b, calculating the region number d to which said partial vector vb belongs in accordance with the predetermined D region center vectors pl to pD in the same method as said first step, calculating a declination (vb·
pd)/(|vb|*|pd|) as a cosine of an angle formed by said partial vector vb and the region center vector pd indicating a center direction of the region of said region number d, referring to said declination division table, calculating a number c of the belonging declination division, and calculating index registration data to be registered in a vector index from said partial space number b, said region number d, said declination division number c, said norm division number r, the component of said partial vector vb, and the identification number i; and
a third step of the vector index preparation of constituting the vector index such that the identification number and the component of each partial vector can be searched using a set of the partial space number b, the region number d, the declination division number c and a norm division number range [r1, r2] as a key from said norm division table, said declination division table, and said index registration data, and such that the vector component of each vector data can be searched with the identification number of the vector component. - View Dependent Claims (3, 4, 5, 6, 7, 8, 9, 28)
-
-
2. A method of preparing a mechanically searchable index with respect to a vector database in which a finite number of sets each including at least N-dimensional real vector and an identification number of the vector are registered as vector data, said method comprising:
-
a first step of vector index preparation of dividing N components into m sets in a predetermined method with respect to the N-dimensional real vector V of each vector data in said vector database, preparing m partial vectors vl to vm, subsequently tabulating a distribution of a norm of the partial vector vb (b=1 to m) for each partial space number b, preparing a norm division table in which a norm range of a predetermined D type norm division is determined, calculating a region number d to which said partial vector vb belongs in accordance with predetermined D region center vectors pl to pD tabulating a distribution of a cosine (vb·
pd)/(|vb|*|pd|) of an angle formed by said partial vector vb and the region center vector pd as a declination distribution, and preparing a declination division table in which a declination range of the predetermined C type declination division is recorded;
a second step of the vector index preparation of dividing N components into m sets in the same method as said first step with respect to the N-dimensional real vector V of each vector data in said vector database, preparing m partial vectors vl to vm, referring to said norm division table to calculate a number r of the norm division to which the norm of said partial vector vb belongs with respect to the partial vector vb (b=1 to m) for said partial space b, calculating the region number d to which said partial vector vb belongs in accordance with the predetermined D region center vectors pl to pD in the same method as said first step, calculating a declination (vb·
pd)/(|vb|*|pd|) as a cosine of an angle formed by said partial vector vb and the region center vector pd indicating a center direction of the region of said region number d, referring to said declination division table, calculating a number c of the belonging declination division, calculating a component division number wj of a predetermined range to which vbj belongs from a maximum value of the norm of the norm division corresponding to said calculated norm division number r with respect to each component vbj of said calculated partial vector vb, and calculating index registration data to be registered in a vector index from said partial space number b, said region number d, said declination division number c, said norm division number r, a string of said component division numbers wj, and the identification number i; and
a third step of the vector index preparation of constituting the vector index such that the identification number and the component of each partial vector can be searched using a set of the partial space number b, the region number d, the declination division number c and a norm division number range [r1, r2] as a key from said norm division table, said declination division table, and said index registration data, and such that the vector component of each vector data can be searched with the identification number of the vector component.
-
-
10. A similar vector searching method in which a query vector Q of an N-dimensional real vector, an inner product lower limit value α
- , and maximum obtained vector number L are designated as search conditions, a vector index prepared from vector data with a finite number of sets of at least N-dimensional real vector and an ID number of the real vector registered therein is searched, and L sets at maximum (i, V·
Q) of an identification number i and an inner product of Q and V are obtained with respect to vector data (i, V) of said vector database whose value V·
Q of the inner product with said query vector Q is larger than said inner product lower limit value α
, said similar vector searching method comprising;
a first step of similar vector search of dividing N components of Q into m sets in the same predetermined method as a method used in preparing said vector index with respect to said query vector Q, preparing m partial query vectors ql to qm, calculating a partial inner product lower limit value fb as a lower limit value of an inner product (hereinafter referred to as “
partial inner product) of each partial query vector qb and the corresponding partial vector from a designated inner product lower limit value α
, calculating a partial space number b, and a set (c, [r1, r2]) of a declination division number c to be searched in a region number d and a norm division range [r1, r2] from a value of an inner product pd·
qb of the region center vector pd and said partial query vector qb, said partial inner product lower limit value fb, and a norm division table and a declination division table in said vector index with respect to each partial query vector qb (b=1 to m) and each region b, searching a range of said vector index using (b, d, c, [r1, r2]) as a search condition based on said calculated (c, [r1, r2]), obtaining the identification number i and the component of the partial vector vb satisfying the condition as an index search result, calculating a partial inner product difference (vb·
qb)−
fb as a difference between a partial inner product vb·
qb of said vb and qb and said partial inner product lower limit value fb, and accumulating (adding) the difference as an inner product difference upper limit value S[i] of the identification number i of an inner product difference table; and
a second step of the similar vector search of searching said vector index with the identification number i in order from a largest value in said inner product difference table S[i] to obtain a vector data component V, calculating an inner product difference value t=V·
Q−
α
by subtracting a from the inner product V·
Q of V and said query vector Q, and outputting a set of at least the identification number i and an inner product t+α
as a search result with respect to L pieces at maximum of vector data with a large inner product difference value when L or more pieces of vector data having the inner product difference value larger than a maximum value of an element having a non-calculated inner product difference value are collected, or when the inner products of all the vector data having a positive inner product difference upper limit value are calculated in said inner product difference table. - View Dependent Claims (12)
- , and maximum obtained vector number L are designated as search conditions, a vector index prepared from vector data with a finite number of sets of at least N-dimensional real vector and an ID number of the real vector registered therein is searched, and L sets at maximum (i, V·
-
11. A similar vector searching method in which a query vector Q of an N-dimensional real vector, a distance upper limit value α
- , and maximum obtained vector number L are designated as search conditions, a vector index prepared from vector data with a finite number of sets of at least N-dimensional real vector and an identification number of the real vector registered therein is searched, and L sets at maximum (i, p) of an identification number i of an N-dimensional real vector V in said vector data and a distance p between Q and V are obtained such that a value of an inner product with said query vector Q is not more than said distance upper limit value α
, said similar vector searching method comprising;
a first step of similar vector search of dividing N components of Q into m sets in the same predetermined method as a method used in preparing said vector index with respect to said query vector Q, preparing m partial query vectors q1 to qm, calculating a partial square distance upper limit value fb as an upper limit value of a square distance |vb−
qb|2 (i.e., square of Euclidean distance, hereinafter referred to as “
partial square distance”
) of each partial query vector (L and the corresponding partial vector vb from a designated distance upper limit value α
, systematically generating a set (b, d, c, [r1, r2]) of a partial space number b to be searched, a region number d, a declination division number c and a norm division range [r1, r2] from said partial query vector qb, said partial square distance upper limit value fb, and a norm division table and a declination division table in said vector index with respect to each partial query vector qb(b=1 to m), searching a range of said vector index using said generated (b, d, c, [r1, r2]) as a search condition, obtaining the identification number i and the component of the partial vector vb satisfying the condition as an index search result, calculating a partial square distance difference fb−
|vb−
qb|2 as a difference between said partial square distance upper limit value fb and a partial square distance |vb−
qb|2 of vb and qb, and accumulating (adding) the difference as a square distance difference upper limit value S[i] of the identification number i of a square distance difference table; and
a second step of the similar vector search of searching said vector index with the identification number i in order from a largest value in said square distance difference table S[i] to obtain a vector data component V, calculating a square distance difference value α
2−
|V−
Q|2 by subtracting a square distance |V−
Q|2 of V and said query vector Q from a squared distance upper limit value α
2, and outputting a set of at least the identification number i and a distance (α
2−
t)½
as a search result with respect to L pieces at maximum of vector data with a large square distance difference value t when L or more pieces of vector data having the square distance difference value larger than a maximum value of an element having a non-calculated square distance difference value are collected, or when the square distance difference values of all the vector data having a positive square distance difference upper limit value are calculated in said square distance difference table. - View Dependent Claims (13, 14, 17, 18, 19, 20, 21, 22, 25, 26, 27, 29)
- , and maximum obtained vector number L are designated as search conditions, a vector index prepared from vector data with a finite number of sets of at least N-dimensional real vector and an identification number of the real vector registered therein is searched, and L sets at maximum (i, p) of an identification number i of an N-dimensional real vector V in said vector data and a distance p between Q and V are obtained such that a value of an inner product with said query vector Q is not more than said distance upper limit value α
-
15. An apparatus for preparing a mechanically searchable index with respect to a vector database in which a finite number of sets each including at least N-dimensional real vector and an identification number of the vector are registered as vector data, said apparatus comprising:
-
partial vector calculation means for dividing N components into m sets in a predetermined method with respect to the N-dimensional real vector V of each vector data in said vector database, and preparing m partial vectors v1 to vm;
norm distribution tabulation means for tabulating a distribution of a norm of the partial vector vk (k=1 to m) among said prepared m partial vectors vI to vm, and preparing a norm division table in which a norm range of a predetermined D type norm division is determined;
region number calculation means for calculating a region number d to which said partial vector vk belongs in accordance with predetermined D region center vectors pl to pD;
declination distribution tabulation means for tabulating a distribution of a cosine (vk·
pd)/(|Vk|*|pd|) of an angle formed by said partial vector vk and the region center vector pd as a declination distribution, and preparing a declination division table in which a declination range of the predetermined C type declination division is recorded;
norm division number calculation means for referring to said norm division table to calculate a number r of the norm division to which the norm of said partial vector vb belongs with respect to the partial vector vb (b=1 to m) for the partial space number b among the m partial vectors v1 to vm prepared by said partial vector calculation means;
declination division number calculation means for calculating a declination (vb·
pd)/(|vb|*|pd|) as a cosine of an angle formed by said partial vector vb and the region center vector pd indicating a center direction of the region of said region number d calculated by said region number calculation means;
index data calculation means for calculating index registration data to be registered in a vector index from said partial space number b, said region number d, said declination division number c, said norm division number r, the component of said partial vector vb, and the identification number i; and
index constituting means for constituting the vector index such that the identification number and the component of each partial vector can be searched using a set of the partial space number b, the region number d, the declination division number c and a norm division number range [r1, r2] as a key from said norm division table, said declination division table, and said index registration data, and such that the vector component of each vector data can be searched with the identification number of the vector component.
-
-
16. An apparatus for preparing a mechanically searchable index with respect to a vector database in which a finite number of sets each including at least N-dimensional real vector and an identification number of the vector are registered as vector data, said apparatus comprising:
-
partial vector calculation means for dividing N components into m sets in a predetermined method with respect to the N-dimensional real vector V of each vector data in said vector database, and preparing m partial vectors v1 to vm;
norm distribution tabulation means for tabulating a distribution of a norm of the partial vector vb (b=1 to m) for a partial space number b among said prepared m partial vectors v1 to vm, and preparing a norm division table in which a norm range of a predetermined D type norm division is determined;
region number calculation means for calculating a region number d to which said partial vector vb belongs in accordance with predetermined D region center vectors pl to pD;
declination distribution tabulation means for tabulating a distribution of a cosine (vb·
pd)/(|vb|*|pd|) of an angle formed by said partial vector vb and the region center vector pd as a declination distribution, and preparing a declination division table in which a declination range of the predetermined C type declination division is recorded;
norm division number calculation means for referring to said norm division table to calculate a number r of the norm division to which the norm of said partial vector vb belongs with respect to the partial vector vb (b=1 to m) for a partial space b among the m partial vectors v1 to vm prepared by said partial vector calculation means;
declination division number calculation means for calculating a declination (vb·
pd)/(|vb|*|pd|) as a cosine of an angle formed by said partial vector vb and the region center vector pd indicating a center direction of the region of the region number d calculated by said region number calculation means;
component division number calculation means for calculating a component division number wj of a predetermined range to which vbj belongs from a maximum value of the norm of the norm division corresponding to said calculated norm division number r with respect to each component vbj of said calculated partial vector vb;
index data calculation means for calculating index registration data to be registered in a vector index from said partial space number b, said region number d, said declination division number c, said norm division number r, a string of said component division numbers wj, and the identification number i; and
index constituting means for constituting the vector index such that the identification number and the component of each partial vector can be searched using a set of the partial space number b, the region number d, the declination division number c and a norm division number range [r1, r2] as a key from said norm division table, said declination division table, and said index registration data, and such that the vector component of each vector data can be searched with the identification number of the vector component.
-
-
23. A similar vector searching apparatus for designating a query vector Q of an N-dimensional real vector, an inner product lower limit value α
- , and maximum obtained vector number L as search conditions, searching a vector index prepared from vector data with a finite number of sets of at least N-dimensional real vector and an ID number of the real vector registered therein, and obtaining L sets at maximum (i, V·
Q) of an identification number i and an inner product of Q and V with respect to vector data (i, V) of said vector database whose value V·
Q of the inner product with said query vector Q is larger than said inner product lower limit value α
, said similar vector searching apparatus comprising;
partial query condition calculation means for dividing N components of Q into m sets in the same predetermined method as a method used in preparing said vector index with respect to said query vector Q, preparing m partial query vectors ql to qm, and calculating a partial inner product lower limit value fb as a lower limit value of an inner product (hereinafter referred to as “
partial inner product) of each partial query vector qb and the corresponding partial vector from a designated inner product lower limit value α
;
search object range generation means for calculating a partial space number b, and a set (c, [r1, r2]) of a declination division number c to be searched in a region number d and a norm division range [r1, r2] from a value of an inner product pd·
qb of the region center vector pd and said partial query vector qb, said partial inner product lower limit value fb, and a norm division table and a declination division table in said vector index with respect to each partial query vector qb (b=1 to m) and each region b;
index search means for searching a range of said vector index using (b, d, c, [r1, r2]) as a search condition based on (c, [r1, r2]) calculated by said search object range generation means, and obtaining the identification number i and the component of the partial vector vb satisfying the condition as an index search result;
inner product difference upper limit calculation means for calculating a partial inner product difference (vb·
qb)−
fb as a difference between a partial inner product vb·
qb of said vb and qb and said partial inner product lower limit value fb, and accumulating (adding) the difference as an inner product difference upper limit value S[i] of the identification number i of an inner product difference table; and
similarity search result determination means for searching said vector index with the identification number i in order from a largest value in said inner product difference table S[i] to obtain a vector data component V, calculating an inner product difference value t=V·
Q−
α
by subtracting α
from the inner product V·
Q of V and said query vector Q, and outputting a set of at least the identification number i and an inner product t+α
as a search result with respect to L pieces at maximum of vector data with a large inner product difference value when L or more pieces of vector data having the inner product difference value larger than a maximum value of an element having a non-calculated inner product difference value are collected, or when the inner products of all the vector data having a positive inner product difference upper limit value are calculated in said inner product difference table.
- , and maximum obtained vector number L as search conditions, searching a vector index prepared from vector data with a finite number of sets of at least N-dimensional real vector and an ID number of the real vector registered therein, and obtaining L sets at maximum (i, V·
-
24. A similar vector searching apparatus for designating a query vector Q of an N-dimensional real vector, a distance upper limit value α
- , and maximum obtained vector number L as search conditions, searching a vector index prepared from vector data with a finite number of sets of at least N-dimensional real vector and an identification number of the real vector registered therein, and obtaining L sets at maximum (i, p) of an identification number i of an N-dimensional real vector V in said vector data and a distance p between Q and V such that a value of an inner product with said query vector Q is not more than said distance upper limit value α
, said similar vector searching apparatus comprising;
partial query condition calculation means for dividing N components of Q into m sets in the same predetermined method as a method used in preparing said vector index with respect to said query vector Q, preparing m partial query vectors ql to qm, calculating a partial square distance upper limit value fb as an upper limit value of a square distance |vb−
qb|2 (i.e., square of Euclidean distance, hereinafter referred to as “
partial square distance”
) of each partial query vector qb and the corresponding partial vector vb from a designated distance upper limit value α
;
search object range generation means for systematically generating a set (b, d, c, [r1, r2]) of a partial space number b to be searched, a region number d, a declination division number c and a norm division range [r1, r2] from said partial query vector qb, said partial square distance upper limit value fb, and a norm division table and a declination division table in said vector index with respect to said partial query vector qb (b=1 to m);
index search means for searching a range of said vector index using (b, d, c, [r1, r2]) generated by said search object range generation means as a search condition, and obtaining the identification number i and the component of the partial vector vb satisfying the condition as an index search result;
square distance difference upper limit calculation means for calculating a partial square distance difference fb−
|vb−
qb l |2 as a difference between said partial square distance upper limit value fb and a partial square distance |vb−
qb|2 of vb and qb, and accumulating (adding) the difference as a square distance difference upper limit value S[i] of the identification number i of a square distance difference table; and
similarity search result determination means for searching said vector index with the identification number i in order from a largest value in said square distance difference table S[i] to obtain a vector data component V, calculating a square distance difference value α
2−
|V−
Q|2 by subtracting a square distance |V−
Q|2 of V and said query vector Q from a squared distance upper limit value α
2, and outputting a set of at least the identification number i and a distance (α
2−
t)½
as a search result with respect to L pieces at maximum of vector data with a large square distance difference value t when L or more pieces of vector data having the square distance difference value larger than a maximum value of an element having a non-calculated square distance difference value are collected, or when the square distance difference values of all the vector data having a positive square distance difference upper limit value are calculated in said square distance difference table.
- , and maximum obtained vector number L as search conditions, searching a vector index prepared from vector data with a finite number of sets of at least N-dimensional real vector and an identification number of the real vector registered therein, and obtaining L sets at maximum (i, p) of an identification number i of an N-dimensional real vector V in said vector data and a distance p between Q and V such that a value of an inner product with said query vector Q is not more than said distance upper limit value α
Specification