METHOD FOR PERFORMING EFFICIENT SIMILARITY SEARCH
First Claim
1. A method embodied on a computer readable medium for retrieving k approximate nearest neighbors, with respect to a query object and a distance function, from a data set having a plurality of objects, comprising:
- using a set of uniquely identified reference objects selected from the same domain of the objects of said data set;
using a computer to implement the steps ofrepresenting each object of said data set and said query object with a sequence of identifiers of the l closests objects belonging to said set of reference objects, measuring the distance between any object of said data set and any object of said set of reference objects using said distance function;
maintaining a prefix tree to organize said sequences;
maintaining a data storage to organize the data entries representing all the object in said data set, wherein a data entry stores the information required to compute the distance of the object it represents, using said distance function, with respect to any other object in the domain;
maintaining in every leaf of said prefix tree the pointers to the locations of said data storage containing the data entries relative to the objects of said data set that are represented by the sequence identified by the path going from the root of said prefix tree to said leaf;
maintaining the data entries in said data storage sequentially sorted in the order resulting from performing a depth first visit of said prefix tree;
using said prefix tree to identify a set of at least z objects of said data set whose representing sequences have the longest possible prefix match with the sequence representing said query object;
using the pointers in the leaves of said prefix tree to retreive all the data entries associated to said candidate objects;
using the data entry of each object in said set of candidate objects to compute the distance, using said distance function, with respect to said query object;
selecting the k nearest objects in said set of candidate objects, with respect to said query object, as the approximate k nearest neighbors search result.
0 Assignments
0 Petitions
Accused Products
Abstract
The present invention provides systems and methods for performing efficient k-NN approximate similarity search on a database of objects. The invention is based on the definition of an index data structure that enables to have fast searches and very good scalability with respect to the database size. Such index makes efficient use of both the main and secondary memory of the computer, taking advantage of the specific properties of both kinds of memories.
A prefix tree is built on all the sequences assigned to the database objects by a sequence generation function. The prefix tree is stored in the main memory.
The information required to identify each database object and to compute the similarity between database objects and query objects are stored in a data storage kept in the secondary memory.
Given a query object and the request for the k nearest neighbors, the search functionality of the invention uses the prefix tree to quickly identify a set of candidate objects. The organization of the data storage is then used to efficiently retrieve the information relative to the candidate objects. Such information is used to compute the similarity of candidate object with the query, in order to select the k most similar ones, which are thus returned as the result.
44 Citations
6 Claims
-
1. A method embodied on a computer readable medium for retrieving k approximate nearest neighbors, with respect to a query object and a distance function, from a data set having a plurality of objects, comprising:
-
using a set of uniquely identified reference objects selected from the same domain of the objects of said data set; using a computer to implement the steps of representing each object of said data set and said query object with a sequence of identifiers of the l closests objects belonging to said set of reference objects, measuring the distance between any object of said data set and any object of said set of reference objects using said distance function; maintaining a prefix tree to organize said sequences; maintaining a data storage to organize the data entries representing all the object in said data set, wherein a data entry stores the information required to compute the distance of the object it represents, using said distance function, with respect to any other object in the domain; maintaining in every leaf of said prefix tree the pointers to the locations of said data storage containing the data entries relative to the objects of said data set that are represented by the sequence identified by the path going from the root of said prefix tree to said leaf; maintaining the data entries in said data storage sequentially sorted in the order resulting from performing a depth first visit of said prefix tree; using said prefix tree to identify a set of at least z objects of said data set whose representing sequences have the longest possible prefix match with the sequence representing said query object; using the pointers in the leaves of said prefix tree to retreive all the data entries associated to said candidate objects; using the data entry of each object in said set of candidate objects to compute the distance, using said distance function, with respect to said query object; selecting the k nearest objects in said set of candidate objects, with respect to said query object, as the approximate k nearest neighbors search result. - View Dependent Claims (2, 3, 4, 5, 6)
-
Specification