Vector index preparing method, similar vector searching method, and apparatuses for the methods
First Claim
1. A method of preparing an index, which is searchable by a computer, with respect to a vector database in which a finite number of ordered lists each including at least N-dimensional real vector and an identification number of the vector are registered as vector data, said index being used for data retrieval using a computer, said method comprising:
- a first step of vector index preparation of dividing N components into m ordered list in a predetermined method with respect to the N-dimensional real vector V of each vector data in said vector database, preparing m partial vectors v1 to vm, subsequently tabulating a distribution of a norm of the partial vector vk (k=1 to m), preparing a norm partition table which contains a predetermined number of norm ranges, calculating a region number d to which said partial vector vk belongs in accordance with predetermined D region center vectors p1 to pD, tabulating a distribution of a cosine (vk·
pd)/(|Vk|*|pd|) of an angle formed by said partial vector vk and the region center vector pd as a declination distribution, and preparing a declination partition table which contains a predetermined number of declination ranges;
a second step of the vector index preparation of dividing N components into m ordered lists in the same method as said first step with respect to the N-dimensional real vector V of each vector data in said vector database, preparing m partial vectors v1 to vm, referring to said norm partition table to calculate a number r of the norm partition to which the norm of said partial vector vb belongs with respect to the partial vector vb (b=1 to m) for the partial space number b, calculating the region number d to which said partial vector vb belongs in accordance with the predetermined D region center vectors p1 to pD in the same method as said first step, calculating a declination (vb·
pd)/(|vb|*|pd|) as a cosine of an angle formed by said partial vector vb and the region center vector pd indicating a center direction of the region of said region number d, referring to said declination partition table, calculating a number c of the belonging declination partition, and calculating index registration data to be registered in a vector index from said partial space number b, said region number d, said declination partition number c, said norm partition number r, the component of said partial vector vb, and the identification number i; and
a third step of the vector index preparation of constituting the vector index such that the identification number and the component of each partial vector can be searched using a ordered list of the partial space number b, the region number d, the declination partition number c and a norm partition number range (r1, r2) as a key from said norm partition table, said declination partition table, and said index registration data, and such that the vector component of each vector data can be searched with the identification number of the vector component.
1 Assignment
0 Petitions
Accused Products
Abstract
In the present invention, a similar vector is searched from a several hundreds dimensional vector database at a high speed, by a single vector index, and in accordance with either measure of an inner product or a distance by designating a similarity search range and maximum obtained pieces number, vector index preparation is performed by decomposing each vector into a plurality of partial vectors and characterizing the vector by a norm division, belonging region and declination division to prepare an index, and similarity search is performed by obtaining a partial query vector and partial search range from a query vector and search range, performing similarity search in each partial space to accumulate a difference from the search range and to obtain an upper limit value, and obtaining a correct measure from a higher upper limit value to obtain a final similarity search result.
32 Citations
29 Claims
-
1. A method of preparing an index, which is searchable by a computer, with respect to a vector database in which a finite number of ordered lists each including at least N-dimensional real vector and an identification number of the vector are registered as vector data, said index being used for data retrieval using a computer, said method comprising:
-
a first step of vector index preparation of dividing N components into m ordered list in a predetermined method with respect to the N-dimensional real vector V of each vector data in said vector database, preparing m partial vectors v1 to vm, subsequently tabulating a distribution of a norm of the partial vector vk (k=1 to m), preparing a norm partition table which contains a predetermined number of norm ranges, calculating a region number d to which said partial vector vk belongs in accordance with predetermined D region center vectors p1 to pD, tabulating a distribution of a cosine (vk·
pd)/(|Vk|*|pd|) of an angle formed by said partial vector vk and the region center vector pd as a declination distribution, and preparing a declination partition table which contains a predetermined number of declination ranges;a second step of the vector index preparation of dividing N components into m ordered lists in the same method as said first step with respect to the N-dimensional real vector V of each vector data in said vector database, preparing m partial vectors v1 to vm, referring to said norm partition table to calculate a number r of the norm partition to which the norm of said partial vector vb belongs with respect to the partial vector vb (b=1 to m) for the partial space number b, calculating the region number d to which said partial vector vb belongs in accordance with the predetermined D region center vectors p1 to pD in the same method as said first step, calculating a declination (vb·
pd)/(|vb|*|pd|) as a cosine of an angle formed by said partial vector vb and the region center vector pd indicating a center direction of the region of said region number d, referring to said declination partition table, calculating a number c of the belonging declination partition, and calculating index registration data to be registered in a vector index from said partial space number b, said region number d, said declination partition number c, said norm partition number r, the component of said partial vector vb, and the identification number i; anda third step of the vector index preparation of constituting the vector index such that the identification number and the component of each partial vector can be searched using a ordered list of the partial space number b, the region number d, the declination partition number c and a norm partition number range (r1, r2) as a key from said norm partition table, said declination partition table, and said index registration data, and such that the vector component of each vector data can be searched with the identification number of the vector component. - View Dependent Claims (3, 4, 5, 6, 7, 8, 9, 28)
-
-
2. A method of preparing an index, which is searchable by a computer, with respect to a vector database in which a finite number of ordered lists each including at least N-dimensional real vector and an identification number of the vector are registered as vector data, said index being used for data retrieval using a computer, said method comprising:
-
a first step of vector index preparation of dividing N components into m ordered list in a predetermined method with respect to the N-dimensional real vector V of each vector data in said vector database, preparing m partial vectors v1 to vm, subsequently tabulating a distribution of a norm of the partial vector vb (b=1 to m) for each partial space number b, preparing a norm partition table which contains a predetermined number of norm ranges, calculating a region number d to which said partial vector vb belongs in accordance with predetermined D region center vectors p1 to pD tabulating a distribution of a cosine (vb·
pd)/(|vb|*|pd|) of an angle formed by said partial vector vb and the region center vector pd as a declination distribution, and preparing a declination partition table which contains a predetermined number of norm ranges;a second step of the vector index preparation of dividing N components into m ordered list in the same method as said first step with respect to the N-dimensional real vector V of each vector data in said vector database, preparing m partial vectors v1 to vm, referring to said norm partition table to calculate a number r of the norm partition to which the norm of said partial vector vb belongs with respect to the partial vector vb (b=1 to m) for said partial space b, calculating the region number d to which said partial vector vb belongs in accordance with the predetermined D region center vectors p1 to pD in the same method as said first step, calculating a declination (vb·
pd)/(|vb|*|pd|) as a cosine of an angle formed by said partial vector vb and the region center vector pd indicating a center direction of the region of said region number d, referring to said declination partition table, calculating a number c of the belonging declination partition, calculating a component partition number wj of a predetermined range to which vbj belongs from a maximum value of the norm of the norm partition corresponding to said calculated norm partition number r with respect to each component vbj of said calculated partial vector vb, and calculating index registration data to be registered in a vector index from said partial space number b, said region number d, said declination partition number c, said norm partition number r, a string of said component partition numbers wj, and the identification number i; anda third step of the vector index preparation of constituting the vector index such that the identification number and the component of each partial vector can be searched using a set of the partial space number b, the region number d, the declination partition number c and a norm partition number range (r1, r2) as a key from said norm partition table, said declination partition table, and said index registration data, and such that the vector component of each vector data can be searched with the identification number of the vector component.
-
-
10. A similarity vector searching method in which a query vector Q of an N-dimensional real vector, an inner product lower limit value α
- , and maximum obtained vector number L are designated as search conditions, a vector index prepared from vector data with a finite number of ordered list of at least N-dimensional real vector and an ID number of the real vector registered therein is searched, and L ordered list at maximum (i, V·
Q) of an identification number i and an inner product of Q and V are obtained with respect to vector data (i, V) of said vector database whose value V·
Q of the inner product with said query vector Q is larger than said inner product lower limit value α
, said similar vector searching method comprising;a first step of similar vector search of dividing N components of Q into m ordered lists in the same predetermined method as a method used in preparing said vector index with respect to said query vector Q, preparing m partial query vectors ql to qm, calculating a partial inner product lower limit value fb as a lower limit value of a partial inner product of each partial query vector qb and the corresponding partial vector from a designated inner product lower limit value α
, calculating a partial space number b, and an ordered list (c, (r1, r2)) of a declination division number c to be searched in a region number d and a norm partition range (r1, r2) from a value of an inner product pd·
qb of the region center vector pd and said partial query vector qb, said partial inner product lower limit value fb, and a norm partition table and a declination partition table in said vector index with respect to each partial query vector qb (b=1 to m) and each region b, searching a range of said vector index using (b, d, c, (r1, r2)) as a search condition based on said calculated (c, (r1, r2)), obtaining the identification number i and the component of the partial vector vb satisfying the condition as an index search result, calculating a partial inner product difference (vb·
qb)−
fb as a difference between a partial inner product vb·
qb of said vb and qb and said partial inner product lower limit value fb, and accumulating (adding) the difference as an inner product difference upper limit value S(i) of the identification number i of an inner product difference table; anda second step of the similar vector search of searching said vector index with the identification number i in order from a largest value in said inner product difference table S(i) to obtain a vector data component V, calculating an inner product difference value t=V·
Q−
α
by subtracting a from the inner product V·
Q of V and said query vector Q, and outputting an ordered list of at least the identification number i and an inner product t+α
as a search result with respect to L pieces at maximum of vector data with a large inner product difference value when L or more pieces of vector data having the inner product difference value larger than a maximum value of an element having a non-calculated inner product difference value are collected, or when the inner products of all the vector data having a positive inner product difference upper limit value are calculated in said inner product difference table. - View Dependent Claims (12)
- , and maximum obtained vector number L are designated as search conditions, a vector index prepared from vector data with a finite number of ordered list of at least N-dimensional real vector and an ID number of the real vector registered therein is searched, and L ordered list at maximum (i, V·
-
11. A similarity vector searching method in which a query vector Q of an N-dimensional real vector, a distance upper limit value α
- , and maximum obtained vector number L are designated as search conditions, a vector index prepared from vector data with a finite number of ordered lists of at least N-dimensional real vector and an identification number of the real vector registered therein is searched, and L ordered lists at maximum (i, p) of an identification number i of an N-dimensional real vector V in said vector data and a distance p between Q and V are obtained such that a value of an inner product with said query vector Q is not more than said distance upper limit value α
, said similar vector searching method comprising;a first step of similar vector search of dividing N components of Q into m ordered lists in the same predetermined method as a method used in preparing said vector index with respect to said query vector Q, preparing m partial query vectors q1 to qm, calculating a partial square distance upper limit value fb as an upper limit value of a partial square distance |vb−
qb|2 (i.e.,) corresponding to square of Euclidean distance of each partial query vector qb and the corresponding partial vector vb from a designated distance upper limit value α
, systematically generating an ordered list (b, d, c, (r1, r2)) of a partial space number b to be searched, a region number d, a declination partition number c and a norm partition range (r1, r2) from said partial query vector qb, said partial square distance upper limit value fb, and a norm partition table and a declination partition table in said vector index with respect to each partial query vector qb (b=1 to m), searching a range of said vector index using said generated (b, d, c, (r1, r2)) as a search condition, obtaining the identification number i and the component of the partial vector vb satisfying the condition as an index search result, calculating a partial square distance difference fb−
|vb−
qb|2 as a difference between said partial square distance upper limit value fb and a partial square distance |vb−
qb|2 of vb and qb, and accumulating (adding) the difference as a square distance difference upper limit value S(i) of the identification number i of a square distance difference table; anda second step of the similar vector search of searching said vector index with the identification number i in order from a largest value in said square distance difference table S(i) to obtain a vector data component V, calculating a square distance difference value α
2−
|V−
Q|2 by subtracting a square distance |V−
Q|2 of V and said query vector Q from a squared distance upper limit value α
2, and outputting an ordered list of at least the identification number i and a distance (α
2−
t)1/2 as a search result with respect to L pieces at maximum of vector data with a large square distance difference value t when L or more pieces of vector data having the square distance difference value larger than a maximum value of an element having a non-calculated square distance difference value are collected, or when the square distance difference values of all the vector data having a positive square distance difference upper limit value are calculated in said square distance difference table. - View Dependent Claims (13, 14)
- , and maximum obtained vector number L are designated as search conditions, a vector index prepared from vector data with a finite number of ordered lists of at least N-dimensional real vector and an identification number of the real vector registered therein is searched, and L ordered lists at maximum (i, p) of an identification number i of an N-dimensional real vector V in said vector data and a distance p between Q and V are obtained such that a value of an inner product with said query vector Q is not more than said distance upper limit value α
-
15. An apparatus for preparing an index, which is searchable by a computer, with respect to a vector database in which a finite number of ordered lists each including at least N-dimensional real vector and an identification number of the vector are registered as vector data, said index being used for data retrieval using a computer, said apparatus comprising:
-
partial vector calculation means for dividing N components into m ordered lists in a predetermined method with respect to the N-dimensional real vector V of each vector data in said vector database, and preparing m partial vectors v1 to vm; norm distribution tabulation means for tabulating a distribution of a norm of the partial vector vk (k=1 to m) among said prepared m partial vectors v1 to vm, and preparing a norm partition table which contains a predetermined number of norm ranges; region number calculation means for calculating a region number d to which said partial vector vk belongs in accordance with predetermined D region center vectors pl to pD; declination distribution tabulation means for tabulating a distribution of a cosine (vk·
pd)/(|Vk|*|pd|) of an angle formed by said partial vector vk and the region center vector pd as a declination distribution, and preparing a declination partition table which contains a predetermined number of declination ranges;norm division number calculation means for referring to said norm partition table to calculate a number r of the norm partition to which the norm of said partial vector vb belongs with respect to the partial vector vb (b=1 to m) for the partial space number b among the m partial vectors v1 to vm prepared by said partial vector calculation means; declination partition number calculation means for calculating a declination (vb·
pd)/(|vb|*|pd|) as a cosine of an angle formed by said partial vector vb and the region center vector pd indicating a center direction of the region of said region number d calculated by said region number calculation means;index data calculation means for calculating index registration data to be registered in a vector index from said partial space number b, said region number d, said declination partition number c, said norm partition number r, the component of said partial vector vb, and the identification number i; and index constituting means for constituting the vector index such that the identification number and the component of each partial vector can be searched using an ordered list of the partial space number b, the region number d, the declination partition number c and a norm partition number range as a key from said norm partition table, said declination partition table, and said index registration data, and such that the vector component of each vector data can be searched with the identification number of the vector component. - View Dependent Claims (17, 18, 19, 20, 21, 22, 29)
-
-
16. An apparatus for preparing an index, which is searchable by a computer, with respect to a vector database in which a finite number of ordered lists each including at least N-dimensional real vector and an identification number of the vector are registered as vector data, said index being used for data retrieval using a computer, said apparatus comprising:
-
partial vector calculation means for dividing N components into m ordered lists in a predetermined method with respect to the N-dimensional real vector V of each vector data in said vector database, and preparing m partial vectors v1 to vm; norm distribution tabulation means for tabulating a distribution of a norm of the partial vector vb (b=1 to m) for a partial space number b among said prepared m partial vectors v1 to vm, and preparing a norm partition table which contains a predetermined number of norm ranges; region number calculation means for calculating a region number d to which said partial vector vb belongs in accordance with predetermined D region center vectors p1 to pD; declination distribution tabulation means for tabulating a distribution of a cosine (vb·
pd)/(|vb|*|pd|) of an angle formed by said partial vector vb and the region center vector pd as a declination distribution, and preparing a declination partition table which contains a predetermined number of declination ranges;norm partition number calculation means for referring to said norm partition table to calculate a number r of the norm partition to which the norm of said partial vector vb belongs with respect to the partial vector vb (b=1 to m) for a partial space b among the m partial vectors v1 to vm prepared by said partial vector calculation means; declination partition number calculation means for calculating a declination (vb·
pd)/(|vb|*|pd|) as a cosine of an angle formed by said partial vector vb and the region center vector pd indicating a center direction of the region of the region number d calculated by said region number calculation means;component partition number calculation means for calculating a component partition number wj of a predetermined range to which vbj belongs from a maximum value of the norm of the norm partition corresponding to said calculated norm partition number r with respect to each component vbj of said calculated partial vector vb; index data calculation means for calculating index registration data to be registered in a vector index from said partial space number b, said region number d, said declination partition number c, said norm partition number r, a string of said component partition numbers wj, and the identification number i; and index constituting means for constituting the vector index such that the identification number and the component of each partial vector can be searched using a ordered list of the partial space number b, the region number d, the declination partition number c and a norm partition number range (r1, r2) as a key from said norm partition table, said declination partition table, and said index registration data, and such that the vector component of each vector data can be searched with the identification number of the vector component.
-
-
23. A similarity vector searching apparatus for designating a query vector Q of an N-dimensional real vector, an inner product lower limit value α
- , and maximum obtained vector number L as search conditions, searching a vector index prepared from vector data with a finite number of ordered lists of at least N-dimensional real vector and an ID number of the real vector registered therein, and obtaining L ordered lists at maximum (i, V·
Q) of an identification number i and an inner product of Q and V with respect to vector data (i, V) of said vector database whose value V·
Q of the inner product with said query vector Q is larger than said inner product lower limit value α
, said similar vector searching apparatus comprising;partial query condition calculation means for dividing N components of Q into m ordered lists in the same predetermined method as a method used in preparing said vector index with respect to said query vector Q, preparing m partial query vectors q1 to qm, and calculating a partial inner product lower limit value fb as a lower limit value of a partial inner product of each partial query vector qb and the corresponding partial vector from a designated inner product lower limit value α
;search object range generation means for calculating a partial space number b, and an ordered list (c, (r1, r2)) of a declination partition number c to be searched in a region number d and a norm partition range (r1, r2) from a value of an inner product pd·
qb of the region center vector pd and said partial query vector qb, said partial inner product lower limit value fb, and a norm partition table and a declination partition table in said vector index with respect to each partial query vector qb (b=1 to m) and each region b;index search means for searching a range of said vector index using (b, d, c, (r1, r2)) as a search condition based on (c, (r1, r2)) calculated by said search object range generation means, and obtaining the identification number i and the component of the partial vector vb satisfying the condition as an index search result; inner product difference upper limit calculation means for calculating a partial inner product difference (vb·
qb)−
fb as a difference between a partial inner product vb·
qb of said vb and qb and said partial inner product lower limit value fb, and accumulating (adding) the difference as an inner product difference upper limit value S(i) of the identification number i of an inner product difference table; andsimilarity search result determination means for searching said vector index with the identification number i in order from a largest value in said inner product difference table S(i) to obtain a vector data component V, calculating an inner product difference value t=V·
Q−
α
by subtracting α
from the inner product V·
Q of V and said query vector Q, and outputting an ordered list of at least the identification number i and an inner product t+α
as a search result with respect to L pieces at maximum of vector data with a large inner product difference value when L or more pieces of vector data having the inner product difference value larger than a maximum value of an element having a non-calculated inner product difference value are collected, or when the inner products of all the vector data having a positive inner product difference upper limit value are calculated in said inner product difference table. - View Dependent Claims (25, 26)
- , and maximum obtained vector number L as search conditions, searching a vector index prepared from vector data with a finite number of ordered lists of at least N-dimensional real vector and an ID number of the real vector registered therein, and obtaining L ordered lists at maximum (i, V·
-
24. A similarity vector searching apparatus for designating a query vector Q of an N-dimensional real vector, a distance upper limit value α
- , and maximum obtained vector number L as search conditions, searching a vector index prepared from vector data with a finite number of ordered lists of at least N-dimensional real vector and an identification number of the real vector registered therein, and obtaining L ordered lists at maximum (i, p) of an identification number i of an N-dimensional real vector V in said vector data and a distance p between Q and V such that a value of an inner product with said query vector Q is not more than said distance upper limit value α
, said similar vector searching apparatus comprising;partial query condition calculation means for dividing N components of Q into m ordered lists in the same predetermined method as a method used in preparing said vector index with respect to said query vector Q, preparing m partial query vectors q1 to qm, calculating a partial square distance upper limit value fb as an upper limit value of a partial square distance |vb−
qb|2 (i.e.,) corresponding to square of Euclidean distance of each partial query vector qb and the corresponding partial vector vb from a designated distance upper limit value α
;search object range generation means for systematically generating an ordered list (b, d, c, (r1, r2)) of a partial space number b to be searched, a region number d, a declination partition number c and a norm partition range (r1, r2) from said partial query vector qb, said partial square distance upper limit value fb, and a norm partition table and a declination partition table in said vector index with respect to said partial query vector qb (b=1 to m); index search means for searching a range of said vector index using (b, d, c, (r1, r2)) generated by said search object range generation means as a search condition, and obtaining the identification number i and the component of the partial vector vb satisfying the condition as an index search result; square distance difference upper limit calculation means for calculating a partial square distance difference fb−
|vb−
qb |2 as a difference between said partial square distance upper limit value fb and a partial square distance |vb−
qb|2 of vb and qb, and accumulating (adding) the difference as a square distance difference upper limit value S(i) of the identification number i of a square distance difference table; andsimilarity search result determination means for searching said vector index with the identification number i in order from a largest value in said square distance difference table S(i) to obtain a vector data component V, calculating a square distance difference value α
2−
|V−
Q|2 by subtracting a square distance |V−
Q|2 of V and said query vector Q from a squared distance upper limit value α
2, and outputting an ordered list of at least the identification number i and a distance (α
2−
t)1/2 as a search result with respect to L pieces at maximum of vector data with a large square distance difference value t when L or more pieces of vector data having the square distance difference value larger than a maximum value of an element having a non-calculated square distance difference value are collected, or when the square distance difference values of all the vector data having a positive square distance difference upper limit value are calculated in said square distance difference table. - View Dependent Claims (27)
- , and maximum obtained vector number L as search conditions, searching a vector index prepared from vector data with a finite number of ordered lists of at least N-dimensional real vector and an identification number of the real vector registered therein, and obtaining L ordered lists at maximum (i, p) of an identification number i of an N-dimensional real vector V in said vector data and a distance p between Q and V such that a value of an inner product with said query vector Q is not more than said distance upper limit value α
Specification