Fast query search in large dimension database
First Claim
1. A computer including a data storage device including a computer usable medium having computer usable code means for identifying data points that are similar to a query, the data points being stored in a database accessible by the computer, the computer usable code means having:
- computer readable code means for arranging the database in a tree structure having at least an internal node level including internal nodes, a leaf-parent node level including leaf-parent nodes, and a leaf level including leaves, each internal node and leaf-parent node representing clusters of data points, each cluster being defined by (1) a respective randomly selected representative data point and (2) respective dependent data points, the distance between each representative data point in a level and its dependent data points being less than the distance between the dependent data point and the respective representative data point of any other cluster in the level;
computer readable code means for receiving a query Q and in response determining respective query distances between the query Q and at least some of the representative data points;
computer readable code means for arranging the internal nodes of the respective at least some of the representative data points in a sequence from smallest query distance to greatest query distance; and
computer readable code means for returning a subsequence having a predetermined number of internal nodes, starting with the first node in the sequence.
1 Assignment
0 Petitions
Accused Products
Abstract
A computer-implemented database search method includes arranging data points in a tree structure, with upper nodes being labeled by respective randomly selected representative data point and with the distance between each data point which is related to a first node and the label of the first node being less than the distance between the data point and the label of nodes in other branches. When a query is received, the distance between the query and the label of each node in the upper-most level is determined, and the nodes arranged in sequence, shortest distance first. Then, the process is repeated for the first "f" nodes in the sequence, and so on, until a sequence of leaves (i.e., data points having no dependent nodes or leaves) is obtained. The first "k" leaves are returned as the "k" closest database matches to the query. Alternatively, geometric information pertaining to the data points is recorded when the database is populated, and then, for query execution, nodes of data are ranked according to the geometric information as it relates to the query, with the node rankings terminated when a high bound for the geometric relationship between the query and a node is reached.
-
Citations
29 Claims
-
1. A computer including a data storage device including a computer usable medium having computer usable code means for identifying data points that are similar to a query, the data points being stored in a database accessible by the computer, the computer usable code means having:
-
computer readable code means for arranging the database in a tree structure having at least an internal node level including internal nodes, a leaf-parent node level including leaf-parent nodes, and a leaf level including leaves, each internal node and leaf-parent node representing clusters of data points, each cluster being defined by (1) a respective randomly selected representative data point and (2) respective dependent data points, the distance between each representative data point in a level and its dependent data points being less than the distance between the dependent data point and the respective representative data point of any other cluster in the level; computer readable code means for receiving a query Q and in response determining respective query distances between the query Q and at least some of the representative data points; computer readable code means for arranging the internal nodes of the respective at least some of the representative data points in a sequence from smallest query distance to greatest query distance; and computer readable code means for returning a subsequence having a predetermined number of internal nodes, starting with the first node in the sequence. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A computer-implemented method for arranging data points in a database, comprising:
-
defining a first level of nodes including internal nodes, each internal node being labeled by a representative data point, the representative data points being randomly selected; defining a second level of nodes including leaf-parent nodes, each leaf-parent node depending from an internal node, each leaf-parent node being labeled by a representative data point, each leaf-parent node including at least one leaf labeled by a data point, wherein the distance between a leaf and the label of its leaf-parent node is less than the distance between the leaf and the label of any other leaf-parent node. - View Dependent Claims (9, 10, 11, 12, 13, 14, 15, 16)
-
-
17. A computer program device comprising:
-
a computer program storage device readable by a digital processing apparatus; and a program means on the program storage device and including instructions for causing the digital processing apparatus to return "k" close matches in a database to a query Q, the database being arranged in a tree having nodes, each node being labeled by a respective representative data point, by; receiving a query Q and in response determining respective query distances between the query Q and each representative data point in a predetermined number of test nodes; arranging the test nodes in a sequence from smallest query distance to greatest query distance; and returning a subsequence having a predetermined number of test nodes, starting with the first test node in the sequence. - View Dependent Claims (18, 19, 20, 21, 22, 23)
-
-
24. A computer including a data storage device including a computer usable medium having computer usable code means for identifying data points that are similar to a query, the data points being stored in a database accessible by the computer, the computer usable code means having:
-
computer readable code means for arranging the database in a tree structure having data nodes, each node containing clusters of data points, each cluster being defined by (1) a respective randomly selected representative data point and (2) respective dependent data points; computer readable code means for recording geometric information for each node; computer readable code means for receiving a query Q and in response listing the nodes in a sequence of increasing order of lower bounds, the lower bounds being determined based on the geometric information; and computer readable code means for returning the sequence of the nodes for searching thereof. - View Dependent Claims (25, 26, 27, 28, 29)
-
Specification