Density-based indexing method for efficient execution of high dimensional nearest-neighbor queries on large databases
First Claim
1. A method for evaluating data records contained within a database wherein each record has multiple data attributes;
- the method comprising the steps of;
a) clustering the data records contained in the database into multiple data clusters wherein each of the multiple data clusters is characterized by a cluster model; and
b) building a new database of records having an augmented record format that contains the original record attributes and an additional record attribute containing a cluster identifier for each record based on the clustering step.
2 Assignments
0 Petitions
Accused Products
Abstract
Method and apparatus for efficiently performing nearest neighbor queries on a database of records wherein each record has a large number of attributes by automatically extracting a multidimensional index from the data. The method is based on first obtaining a statistical model of the content of the data in the form of a probability density function. This density is then used to decide how data should be reorganized on disk for efficient nearest neighbor queries. At query time, the model decides the order in which data should be scanned. It also provides the means for evaluating the probability of correctness of the answer found so far in the partial scan of data determined by the model. In this invention a clustering process is performed on the database to produce multiple data clusters. Each cluster is characterized by a cluster model. The set of clusters represent a probability density function in the form of a mixture model. A new database of records is built having an augmented record format that contains the original record attributes and an additional record attribute containing a cluster number for each record based on the clustering step. The cluster model uses a probability density function for each cluster so that the process of augmenting the attributes of each record is accomplished by evaluating each record'"'"'s probability with respect to each cluster. Once the augmented records are used to build a database the augmented attribute is used as an index into the database so that nearest neighbor query analysis can be very efficiently conducted using an indexed look up process. As the database is queried, the probability density function is used to determine the order clusters or database pages are scanned. The probability density function is also used to determine when scanning can stop because the nearest neighbor has been found with high probability.
-
Citations
39 Claims
-
1. A method for evaluating data records contained within a database wherein each record has multiple data attributes;
- the method comprising the steps of;
a) clustering the data records contained in the database into multiple data clusters wherein each of the multiple data clusters is characterized by a cluster model; and
b) building a new database of records having an augmented record format that contains the original record attributes and an additional record attribute containing a cluster identifier for each record based on the clustering step. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
- the method comprising the steps of;
-
11. A method for evaluating data records stored on a storage medium wherein each record has multiple data attributes that have been characterized by a probability function found by clustering the data in the database to produce a clustering model;
- said method comprising the steps of assigning a cluster number for each of the records on the storage medium based upon the probability function, and finding a nearest neighbor from the data records for a query record by scanning data records from more than one based upon the probability function of the cluster model used to assign cluster numbers to the data records.
- View Dependent Claims (12, 13, 14, 15, 16, 17, 18, 19, 20)
-
21. A process for use in answering queries comprising the steps of:
-
a) clustering data stored in a database using a clustering technique to provide an estimate of the probability density function of the sample data;
b) adding an additional column attribute to the database that represents the predicted cluster membership for each data record within the database; and
c) rebuilding a data table of the database using the newly added column as an index to records in the table. - View Dependent Claims (22, 23, 24, 25, 26)
-
-
27. In a computer data mining system, apparatus for evaluating data in a database comprising:
-
a) one or more data storage devices for storing data records on a storage medium;
the data records including data attributes; and
b) a computer having a rapid access memory and an interface to the storage devices for reading data from the storage medium and bringing the data records from the storage medium into the rapid access memory for subsequent evaluation;
c) the computer comprising a processing unit for evaluating at least some of the data records and for determining a probability density function for the records based on a clustering of data from data in the database into multiple numbers of data clusters, and said computer programmed to build an index for the data records in the database based on the probability density function, wherein said computer performs an approximate nearest neighbor analysis by choosing a specified cluster based on the index and evaluating records of the specified cluster for nearness to a given data record. - View Dependent Claims (28, 29, 30)
-
-
31. A computer-readable medium having computer-executable components comprising:
-
a) a database component for interfacing with a database that stores data records made up of multiple data attributes;
b) a modeling component for constructing and storing a clustering model that characterizes multiple data clusters wherein the modeling component constructs a model of data clustering that corresponds to a mixture of probability functions; and
c) an analysis component for indexing the database on a cluster number for each record in the database wherein the indexing is performed based on a probability assessment of each record to the mixture of probability functions and for approximating a nearest neighbor query by determining an index for a sample record and scanning data records previously assigned a similar index. - View Dependent Claims (32, 33, 34)
-
-
35. A method for evaluating data records stored on a storage medium wherein each record has multiple data attributes that have been clustered to define a probability function of the data records stored on the storage medium;
- said method comprising the steps of evaluating the clusters of a clustering model and if the cluster separation between cluster centroids is of a sufficient size, assigning an index for each of the records on the storage medium based upon the probability function that is derived from the clustering model.
-
36. A method for evaluating data records stored on a storage medium wherein each record has multiple data attributes that have been characterized by a probability function found by clustering the data in the database to produce a clustering model;
- said method comprising the steps of assigning a cluster number for each of the records on the storage medium based upon the probability function, and finding a nearest neighbor from the data records for a query record by choosing between a scan of a subset of the database and a complete scan of the database based on the probability estimate generated by a statistical model of the data.
-
37. A method for performing an approximate nearest neighbor search of data records in a database stored on a storage medium wherein each record has multiple data attributes comprising:
-
a) clustering the data records to define a cluster model having multiple clusters made up of probability functions to form a compressed characterization of the data records of the database;
b) assigning clusters numbers as indexes to the data records based on the probability functions; and
c) searching data records from a cluster to find a nearest neighbor within said cluster of a sample data record based on a nearness of the sample data record to the clusters that make up the cluster model. - View Dependent Claims (38, 39)
-
Specification