System and method of finding near neighbors in large metric space databases
First Claim
1. A computer implemented method for performing near neighbor database searches comprising:
- a. using a database which represents a large metric space having data items, where each data item in the database represents a point in the large metric space;
b. selecting data items to form a subset;
c. designating a data item in the subset as a current data item;
i. computing a distance between the current data item and at least one other data item in the subset;
ii. using the computed distances, finding at least one near neighbor data item for the current data item, and creating a link from the current data item to each near neighbor data item;
d. repeating step c for all data items within the subset. e. selecting at least one data item from the subset to form a current search set;
f. for all data items in the current search set, computing a distance between each data item and a query;
g. computing a distance between the query and a near neighbor data item linked by a data item in the current search set; and
h. adding this linked near neighbor data item to the current search set if an addition criterion is met.
0 Assignments
0 Petitions
Accused Products
Abstract
The present invention comprises a computer-implemented method of searching large metric space databases. It allows fast near neighbor searches in databases where the data elements in the database are high dimensional and each data element represents a point in a large metric space. Given a query item, which also represents a point in the large metric space, one or more data items in the database which are approximately nearest neighbors of the query item are found. A set of data items is first preprocessed by computing distances between pairs of items and storing links between pairs which are near one another. A search of the database proceeds by following links from item to item, usually by following links to items which are nearest the query Q. In one embodiment, the search terminates upon reaching an item R which is closer to Q than are all the items to which R links.
-
Citations
97 Claims
-
1. A computer implemented method for performing near neighbor database searches comprising:
-
a. using a database which represents a large metric space having data items, where each data item in the database represents a point in the large metric space;
b. selecting data items to form a subset;
c. designating a data item in the subset as a current data item;
i. computing a distance between the current data item and at least one other data item in the subset;
ii. using the computed distances, finding at least one near neighbor data item for the current data item, and creating a link from the current data item to each near neighbor data item;
d. repeating step c for all data items within the subset. e. selecting at least one data item from the subset to form a current search set;
f. for all data items in the current search set, computing a distance between each data item and a query;
g. computing a distance between the query and a near neighbor data item linked by a data item in the current search set; and
h. adding this linked near neighbor data item to the current search set if an addition criterion is met. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 50, 53, 58, 61, 64, 66, 69, 70, 74, 75, 78, 94)
i. selecting a current search item from the data items in the subset;
j. computing a distance between the current search item and a query;
k. for all near neighbor data items linked by the current search item, computing a distance between the query and the near neighbor data item linked by the current search item;
l. for the current search item, selecting one of the near neighbor data items, and if this near neighbor data item is closer to the query than is the current search item, selecting this near neighbor data item as the new current search item and repeating steps k and i; and
m. if the near neighbor data item is not closer to the query than is the current search item, designating the current search item as an approximately nearest neighbor of the query.
-
-
11. The method of claim 2 wherein the addition criterion is met if the computed distance of the linked near neighbor data item to the query is less than the computed distances of all other data items in the current search set to the query.
-
12. The method of claim 1 wherein the subset contains all the data items in the database.
-
13. The method of claim 1 wherein the distances are computed using a user-defined distance metric.
-
14. The method of claim 1 wherein determining the near neighbor data items of the current data item comprises computing the distance between the current data item and each data item in the subset and selecting the near neighbor data items which produce a smallest distance.
-
15. The method of claim 1 wherein a number of near neighbor data items is selected to optimize computational speed of the near neighbor database search.
-
16. The method of claim 1 wherein a number of near neighbor data items is selected to optimize computational data storage resources used for the near neighbor database search.
-
17. The method of claim 1 wherein a number of near neighbor data items that are selected is a default value.
-
50. The method of claim 1 wherein creating a link between data items that are near neighbors comprises creating a link from a data item A to a data item B if A is a nearest neighbor of B.
-
53. The method of claim 1 wherein creating a link between data items that are near neighbors comprises creating a link from a data item A to a data item B if B is a nearest neighbor of A.
-
58. The method of claim 1 further comprising storing a link from a data item A to a data item B if A is a nearest neighbor of data item B.
-
61. The method of claim 1 further comprising storing a link from a data item A to a data item B if B is a nearest neighbor of data item A.
-
64. The method of claim 1 wherein the data items in the subset are selected randomly.
-
66. The method of claim 2 wherein the initial current search set is designated by a user.
-
69. A software program embodied on a computer-readable medium incorporating the method as recited in claim 1.
-
70. A software program embodied on a computer-readable medium incorporating the method as recited in claim 2.
-
74. The method of claim 1 wherein all distances are computed using a same distance metric.
-
75. The method of claim 1, further comprising:
-
i. selecting a current search item from the data items in the subset;
j. computing a distance between the current search item and a query;
k. for all near neighbor data items linked by the current search item, computing a distance between the query and the near neighbor data item linked by the current search item;
l. for the current search item, determining the near neighbor data item having the smallest distance to the query, and if this near neighbor data item is closer to the query than is the current search item, selecting this near neighbor data item as the new current search item and repeating steps k and l; and
m. if the near neighbor data item is not closer to the query than is the current search item, designating the current search item as an approximately nearest neighbor of the query.
-
-
78. The method of claim 2 wherein all distances are computed using a same distance metric.
-
94. The method of claim 1 wherein the distance is computed between the current data item and all the data items in the subset.
-
18. A computer implemented method for performing near neighbor database searches comprising:
-
a. using a database which represents a large metric space having data items, where each data item in the database represents a point in the large metric space;
b. designating all the data items in the database as an initial current subset;
c. selecting data items from the current subset to form a new current subset;
d. designating a data item in the current subset as a current data item;
i. computing a distance between the current data item and at least one other data item in the current subset;
ii. using the computed distances, finding at least one near neighbor data item for the current data item, and creating a link from the current data item to each near neighbor data item;
e. repeating step d until each data item within the current subset has been designated as the current data item; and
f. repeating steps c through e at least once. - View Dependent Claims (19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 51, 54, 59, 62, 65, 67, 71, 72, 76, 77, 79, 80, 81, 82, 83, 95, 96, 97)
g. setting the current subset level to a selected subset;
h. selecting at least one data item from the current subset level to form a current search set;
i. for all data items in the current search set, computing a distance between each data item and a query;
j. computing a distance between the query and a near neighbor data item linked by a data item in the current search set;
k. adding this linked near neighbor data item to the current search set if an addition criterion is met; and
l. repeating steps j and k until a search criterion is met.
-
-
20. The method of claim 19 wherein the selected subset is the subset containing the smallest number of data items.
-
21. The method of claim 19 further comprising when the search criterion is met, designating an item in the current search set as an approximately nearest neighbor of the query.
-
22. The method of claim 19 further comprising designating the current search set as a set of approximately nearest neighbors of the query.
-
23. The method of claim 19 wherein the search criterion is met if all the near neighbor data items linked by all the data items in the current search set fail to meet the addition criterion.
-
24. The method of claim 21 wherein the item in the current search set with the smallest distance to the query is designated as the approximately nearest neighbor of the query.
-
25. The method of claim 19 wherein the addition criterion uses the computed distance between the linked near neighbor data item and the query to determine whether to add the linked near neighbor data item to the current search set.
-
26. The method of claim 19 wherein the addition criterion is met if the computed distance of the linked near neighbor data item to the query is less than the computed distance of at least one other data item in the current search set to the query.
-
27. The method of claim 19 wherein the addition criterion is deterministic.
-
28. The method of claim 19 wherein the addition criterion is probabilistic.
-
29. The method of claim 28 further comprising a probability of adding one of the near neighbor data items to the current search set that is inversely related to the distance between the query and the near neighbor data item.
-
30. The method of claim 19 wherein the addition criterion is not met unless the linked near neighbor data item is in the current subset level.
-
31. The method of claim 19 further comprising:
-
m. the addition criterion is not met unless the linked near neighbor data item is in the current subset level;
n. changing the current subset level to a subset level with a greater number of item; and
o. repeating steps h through n until a final criterion is met.
-
-
32. The method of claim 19 further comprising:
-
m. the addition criterion is not met unless the linked near neighbor data item is in the current subset level;
n. changing the current subset level to a subset level with a greater number of item; and
o. repeating steps h through n until a user-specified number of subset levels is searched.
-
-
33. The method of claim 31 wherein steps h through n are repeated until a default number of subset levels is searched.
-
34. The method of claim 19 wherein the search criterion is met when a specified amount of processing has occurred.
-
35. The method of claim 19 wherein the search criterion is met when the item in the current search set closest to the query is within a user-specified distance from the query.
-
36. The method of claim 19 wherein the search criterion is met when the item in the current search set closest to the query is within a default distance from the query.
-
37. The method of claim 18 wherein steps c through e are repeated until the number of data items in the current subset is less than or equal to a designated value.
-
38. The method of claim 18 wherein steps c through e are repeated until every data item in the current subset links to every other data item in the current subset.
-
39. The method of claim 18 wherein all distances are computed using a same distance metric.
-
40. The method of claim 39 wherein finding the near neighbors of the current data item comprises computing the distance between the current data item and each data item in the subset and selecting p items which produce a smallest distance.
-
41. The method of claim 40 wherein p is selected to optimize the computational speed of the near neighbor database search.
-
42. The method of claim 40 wherein p is selected to optimize the computational data storage resources used for the near neighbor database search.
-
43. The method of claim 40 wherein p is a predetermined default value.
-
51. The method of claim 18 wherein creating a link between data items that are near neighbors comprises creating a link from a data item A to a data item B if A is a nearest neighbor of B.
-
54. The method of claim 18 wherein creating a link between data items that are near neighbors comprises creating a link from a data item A to a data item B if B is a nearest neighbor of A.
-
59. The method of claim 18 further comprising storing a link from a data item A to a data item B if A is a nearest neighbor of data item B.
-
62. The method of claim 18 further comprising storing a link from a data item A to a data item B if B is a nearest neighbor of data item A.
-
65. The method of claim 18 wherein the data items in the subsets are selected randomly.
-
67. The method of claim 19 wherein the initial current search set is designated by a user.
-
71. A software program embodied on a computer-readable medium incorporating the method as recited in claim 18.
-
72. A software program embodied on a computer-readable medium incorporating the method as recited in claim 19.
-
76. The method of claim 18, further comprising:
-
g. setting the current subset level to a selected subset;
h. selecting a current search item from the data items in the current subset level;
i. computing a distance between the current search item and a query;
j. computing a distance between the query and a near neighbor data item linked by the current search item;
k. designating this linked near neighbor data item as the current search item if an addition criterion is met; and
l. repeating steps j and k until a search criterion is met.
-
-
77. The method of claim 76 wherein selecting the current subset level comprises selecting a subset containing a smallest number of data items.
-
79. The method of claim 19 wherein all distances are computed using a same distance metric.
-
80. The method of claim 79 wherein the addition criterion is a probabilistic criterion comprising a probability of adding the near neighbor data item that is inversely related to the distance between the query and the near neighbor data item.
-
81. The method of claim 18 wherein the number of near neighbor data items varies for each data item.
-
82. The method of claim 18 wherein the number of near neighbor data items is the same for each data item.
-
83. The method of claim 18 further comprising:
-
g. setting the current subset level to the subset containing the smallest number of data items;
h. selecting a current search item from the data items in the current subset level;
i. for the current subset level;
i. computing a distance between the current search item and a query;
ii. for all near neighbor data items linked by the current search item, computing a distance between the query and the near neighbor data item;
iii. for the current search item, selecting one of the near neighbor data items, and if this near neighbor data item is closer to the query than is the current search item, selecting this near neighbor data item as the new current search item and repeating steps ii and iii; and
j. if the near neighbor data item is not closer to the query than is the current search item, designating the current search item as an approximately nearest neighbor of the query for the current subset level; and
k. changing the current subset level and repeating steps i and j.
-
-
95. The method of claim 18 wherein steps c through e are repeated until the number of items in the current subset is less than or equal to a designated value.
-
96. The method of claim 18 wherein steps c through e are repeated until each item in the current subset has a link to all other items in the current subset.
-
97. The method of claim 18 wherein steps c through e are repeated at least twice.
-
44. A computer implemented method for finding a data item nearby a query point comprising:
-
a. using a database with data items, where each data item in the database represents a point in a large metric space;
b. creating links between data items that are near neighbors;
c. selecting a data item to be the current search item;
d. computing a distance between the current search item and the query point;
e. computing a near neighbor distance between the query point and a near neighbor item linked by the current search item;
f. if the near neighbor distance is less than the distance between the current search item and the query point, selecting the near neighbor item as the current search item; and
g. repeating steps e and f until a search criterion is met. - View Dependent Claims (45, 46, 47, 48, 49, 52, 55, 56, 60, 63, 68, 73)
-
-
57. A computer implemented method for performing near neighbor database searches comprising:
-
a. using a database which represents a large metric space having data items, where each data item in the database represents a point in the large metric space;
b. computing the distance between data items;
c. storing a link, which is a pointer to another data item, for data items that are near one another;
d. designating an item to be a current search item;
e. computing a distance between the current search item and a query point in the large metric space;
f. computing a distance between the query point and linked items to which the current search item links;
g. if the distance from the query point to the linked item is less than the distance between the query point and the current search item, designating the linked item as the current search item and repeating steps f and g; and
h. if the distance from the query point to the linked item is greater than the distance between the query point and the current search item, designating the current search item as a search result.
-
-
84. A computer implemented method for performing near neighbor database searches comprising:
-
a. using a database containing data items having links between near neighbor data items;
b. selecting at least one data item from the database to form a current search set;
c. for each data item in the current search set, computing a distance between the data item and a query;
d. computing a distance between the query and a near neighbor data item linked by a data item in the current search set;
e. adding this linked near neighbor data item to the current search set if an addition criterion is met;
f. repeating steps d and e until a search criterion is met; and
g. designating a data item within the current search set as an approximately nearest neighbor of the query. - View Dependent Claims (85, 86, 87, 88, 89, 90, 91, 92, 93)
-
Specification