METHOD FOR PERFORMING EFFICIENT SIMILARITY SEARCH

US 20100106713A1
Filed: 09/24/2009
Published: 04/29/2010
Est. Priority Date: 10/28/2008
Status: Abandoned Application

Abstract

The present invention provides systems and methods for performing efficient approximate k-NN similarity search on a database of objects. The invention is based on an index data structure that enables fast searches and scales well with the database size. The index makes efficient use of both the main and secondary memory of the computer, taking advantage of the specific properties of each kind of memory.

A prefix tree is built on all the sequences assigned to the database objects by a sequence generation function. The prefix tree is stored in the main memory.
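The construction described above can be sketched as follows. This is a minimal illustration, not the applicants' implementation: the names `make_sequence` and `build_prefix_tree`, the nested-dict tree, the Euclidean distance, and the "nearest first" ordering of identifiers are all assumptions made for the example. The leaves here hold object identifiers as a stand-in for pointers into the data storage.

```python
from math import dist  # Euclidean distance, assumed here as the distance function


def make_sequence(obj, references, l):
    """Return the ids of the l reference objects closest to obj, nearest first.

    Ties are broken by reference id so the result is deterministic.
    """
    ranked = sorted(references, key=lambda rid: (dist(obj, references[rid]), rid))
    return tuple(ranked[:l])


def build_prefix_tree(sequences):
    """Build a nested-dict prefix tree over the sequences.

    Inner keys are reference ids (ints); the string key "objects" marks a leaf
    and lists the ids of the objects represented by that root-to-leaf path.
    """
    root = {}
    for obj_id, seq in sequences.items():
        node = root
        for ref_id in seq:
            node = node.setdefault(ref_id, {})
        node.setdefault("objects", []).append(obj_id)
    return root


# Hypothetical toy data: three reference objects and five 2-D points.
references = {0: (0.0, 0.0), 1: (10.0, 0.0), 2: (0.0, 10.0)}
data = {"a": (2.0, 1.0), "b": (1.0, 2.0), "c": (9.0, 1.0),
        "d": (1.0, 9.0), "e": (3.0, 1.0)}

sequences = {oid: make_sequence(p, references, l=2) for oid, p in data.items()}
tree = build_prefix_tree(sequences)
```

Objects that are close to the same reference objects in the same order end up under the same leaf, which is what lets the tree act as a coarse filter before any exact distance is computed.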

The information required to identify each database object and to compute the similarity between database objects and query objects is stored in a data storage kept in the secondary memory.
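One way to picture the data storage is sketched below, following the claim language that entries are kept sorted in the order of a depth-first visit of the prefix tree. The function name `layout_storage`, the in-memory list standing in for secondary memory, and the (start, end) slices standing in for leaf pointers are assumptions for illustration. The sketch also assumes children are visited in increasing identifier order, so that a lexicographic sort of the sequences reproduces the depth-first order.

```python
def layout_storage(sequences, payloads):
    """Lay out data entries in depth-first order and record each leaf's slice.

    sequences: obj_id -> tuple of reference ids
    payloads:  obj_id -> data needed to compute distances (here, the point)
    Returns (storage, leaf_ranges), where leaf_ranges maps a full sequence to
    the half-open [start, end) range of its entries in storage.
    """
    # Lexicographic order of sequences matches a depth-first visit that takes
    # children in increasing reference-id order.
    order = sorted(sequences, key=lambda oid: (sequences[oid], oid))
    storage = [(oid, payloads[oid]) for oid in order]

    leaf_ranges = {}
    for pos, oid in enumerate(order):
        start, _ = leaf_ranges.get(sequences[oid], (pos, pos))
        leaf_ranges[sequences[oid]] = (start, pos + 1)
    return storage, leaf_ranges


# Hypothetical sequences and payloads for five objects.
sequences = {"a": (0, 1), "b": (0, 2), "c": (1, 0), "d": (2, 0), "e": (0, 1)}
payloads = {"a": (2.0, 1.0), "b": (1.0, 2.0), "c": (9.0, 1.0),
            "d": (1.0, 9.0), "e": (3.0, 1.0)}

storage, leaf_ranges = layout_storage(sequences, payloads)
```

Because of this ordering, all entries under any subtree occupy one contiguous range (here, every object whose sequence starts with reference 0 sits in `storage[0:3]`), so a candidate set can be read with a sequential scan rather than scattered accesses.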

Given a query object and a request for the k nearest neighbors, the search functionality of the invention uses the prefix tree to quickly identify a set of candidate objects. The organization of the data storage is then used to efficiently retrieve the information relative to the candidate objects. Such information is used to compute the similarity of each candidate object to the query, in order to select the k most similar ones, which are returned as the result.
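The search flow above can be sketched end to end as follows. This is a simplified illustration under stated assumptions: `knn_search` and `make_sequence` are hypothetical names, the distance is Euclidean, and a linear scan over the sorted entries stands in for the descent of the prefix tree. The parameter z is the minimum candidate-set size named in the claims.

```python
import heapq
from math import dist  # Euclidean distance, assumed here as the distance function


def make_sequence(obj, references, l):
    """Ids of the l reference objects closest to obj, nearest first."""
    ranked = sorted(references, key=lambda rid: (dist(obj, references[rid]), rid))
    return tuple(ranked[:l])


def knn_search(query, k, z, references, storage, l):
    """Approximate k-NN: longest-prefix candidate selection, then exact ranking.

    storage is a list of (sequence, obj_id, point) sorted by sequence.
    Returns up to k (distance, obj_id) pairs, nearest first.
    """
    q_seq = make_sequence(query, references, l)
    candidates = storage
    # Shorten the prefix until at least z objects match (the empty prefix
    # matches every object).
    for plen in range(l, -1, -1):
        prefix = q_seq[:plen]
        candidates = [e for e in storage if e[0][:plen] == prefix]
        if len(candidates) >= z:
            break
    # Exact distances are computed only for the candidates.
    return heapq.nsmallest(k, ((dist(query, p), oid) for _, oid, p in candidates))


# Hypothetical toy data.
references = {0: (0.0, 0.0), 1: (10.0, 0.0), 2: (0.0, 10.0)}
data = {"a": (2.0, 1.0), "b": (1.0, 2.0), "c": (9.0, 1.0),
        "d": (1.0, 9.0), "e": (3.0, 1.0)}
storage = sorted((make_sequence(p, references, 2), oid, p) for oid, p in data.items())

nearest = knn_search((2.2, 1.0), k=1, z=2, references=references, storage=storage, l=2)
wider = knn_search((2.2, 1.0), k=3, z=3, references=references, storage=storage, l=2)
```

With z=2 the full two-element prefix already yields enough candidates; with z=3 the search falls back to the one-element prefix, trading a larger candidate set for better recall.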

6 Claims

1. A method embodied on a computer readable medium for retrieving k approximate nearest neighbors, with respect to a query object and a distance function, from a data set having a plurality of objects, comprising:
- using a set of uniquely identified reference objects selected from the same domain of the objects of said data set;
- using a computer to implement the steps of representing each object of said data set and said query object with a sequence of identifiers of the l closest objects belonging to said set of reference objects, measuring the distance between any object of said data set and any object of said set of reference objects using said distance function;
- maintaining a prefix tree to organize said sequences;
- maintaining a data storage to organize the data entries representing all the objects in said data set, wherein a data entry stores the information required to compute the distance of the object it represents, using said distance function, with respect to any other object in the domain;
- maintaining in every leaf of said prefix tree the pointers to the locations of said data storage containing the data entries relative to the objects of said data set that are represented by the sequence identified by the path going from the root of said prefix tree to said leaf;
- maintaining the data entries in said data storage sequentially sorted in the order resulting from performing a depth first visit of said prefix tree;
- using said prefix tree to identify a set of at least z objects of said data set whose representing sequences have the longest possible prefix match with the sequence representing said query object;
- using the pointers in the leaves of said prefix tree to retrieve all the data entries associated to said candidate objects;
- using the data entry of each object in said set of candidate objects to compute the distance, using said distance function, with respect to said query object;
- selecting the k nearest objects in said set of candidate objects, with respect to said query object, as the approximate k nearest neighbors search result.

2. The method of claim 1, wherein said set of reference objects is defined by randomly sampling the objects of said data set.

3. The method of claim 1, wherein said set of reference objects is defined by randomly sampling the objects of a different data set, which may have a non-empty intersection with the data set being indexed.

4. The method of claim 1, wherein said set of reference objects is defined by selecting relevant objects from a log of query objects used in previous nearest neighbor searches.

5. The method of claim 1, wherein some of the objects of said data set are represented by more than one sequence, generating the additional sequences by permuting some of the elements of the original sequence representing each of said objects.

6. The method of claim 1, wherein more than one set of candidate objects is identified by representing the query object with more than one sequence, generating the additional sequences by permuting some of the elements of the original sequence representing said query object.
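
The additional sequences mentioned in claims 5 and 6 could be generated as in the sketch below. The claims do not say which permutations are used, so swapping one adjacent pair of identifiers at a time is purely an assumption of this example, as is the function name `permuted_sequences`.

```python
def permuted_sequences(seq):
    """Return the original sequence followed by its adjacent-swap variants.

    Each variant differs from seq by exchanging one pair of neighboring
    identifiers; duplicates (possible when ids repeat) are skipped.
    """
    variants = [tuple(seq)]
    for i in range(len(seq) - 1):
        swapped = list(seq)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        if tuple(swapped) not in variants:
            variants.append(tuple(swapped))
    return variants


# Hypothetical sequence of three reference ids.
variants = permuted_sequences((0, 1, 2))
```

Swapping neighbors models the case where two reference objects are at nearly the same distance from an object, so a small perturbation would have reversed their order. Each variant can be inserted into the index (claim 5) or used as an extra query sequence whose candidate sets are merged before the exact ranking (claim 6).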

Current Assignee: Andrea Esuli, Cristina Galeotti
Original Assignee: Andrea Esuli, Cristina Galeotti
Inventors: Esuli, Andrea; Galeotti, Cristina
Application Number: US12/565,869
Publication Number: US 20100106713A1
US Class Current: 707/716
CPC Class Codes:
G06F 16/9027 Trees
G06F 18/24147 Distances to closest patter...
