Method and system for similarity search and clustering
First Claim
1. A method for searching a collection of items, wherein each item in the collection has a set of properties, comprising the steps of:
- obtaining a query composed of a first set of one or more properties; and
obtaining a result based on applying a distance function to one or more of the items in the collection, wherein the distance function determines a distance between the query and an item in the collection based on the number of items in the collection that are associated with all of the properties in the intersection of the first set of properties and the set of properties for the item.
1 Assignment
0 Petitions
Accused Products
Abstract
Provided is a similarity search method that makes use of a localized distance metric. The data includes a collection of items, wherein each item is associated with a set of properties. The distance between two items is defined in terms of the number of items in the collection that are associated with the set of properties common to the two items. A query is generally composed of a set of properties. The distance between a query and an item is defined in terms of the number of items in the collection that are associated with the set of properties common to the query and the item. The properties can be of various types, such as binary, partially ordered, or numeric. The distance metric may be applied explicitly or implicitly for similarity search. One embodiment of this invention uses random walks such that the similarity search can be performed exactly or approximately, trading-off between accuracy and performance. The distance metric of the present invention can also be the basis for matching and clustering applications. In these contexts, the distance metric of the present invention may be used to build a graph, to which matching or clustering algorithms can be applied.
-
Citations
41 Claims
-
1. A method for searching a collection of items, wherein each item in the collection has a set of properties, comprising the steps of:
-
obtaining a query composed of a first set of one or more properties; and
obtaining a result based on applying a distance function to one or more of the items in the collection, wherein the distance function determines a distance between the query and an item in the collection based on the number of items in the collection that are associated with all of the properties in the intersection of the first set of properties and the set of properties for the item. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)
-
-
31. A method for analyzing two sets of properties from a plurality of sets of properties, comprising the steps of:
-
determining a set of common properties in the intersection of the two sets of properties;
determining the number of sets of properties from the plurality of sets of properties that include the set of common properties; and
assessing the distance between the two sets of properties as a function of the number of sets of properties that include the set of common properties.
-
-
32. A method for analyzing the relationship between two items in a collection of items, wherein each item in the collection is associated with a set of properties, comprising the steps of:
-
obtaining a set of properties with which the two items are commonly associated; and
determining the degree of commonality between the two items as a function of the number of items in the collection that are associated with all of the properties with which the two items are commonly associated.
-
-
33. A computer program product, residing on a computer readable medium, for use in searching a collection of items, the computer program product comprising instructions for causing a computer to:
-
receive a query composed of one or more properties; and
obtain a result based on applying a distance function to one or more items in the collection, wherein the distance function determines a distance between the query and an item in the collection based on the number of items in the collection that are associated with all of the properties in the intersection of the first set of properties and the set of properties for the item. - View Dependent Claims (34, 35, 36, 37)
-
-
38. A computer system for managing data records comprising:
-
an information retrieval subsystem that stores and retrieves data records, each data record being associated with a set of properties; and
a similarity search subsystem that receives similarity search queries and processes similarity search queries based on a distance function, a similarity search query being associated with a first set of properties, wherein the distance function determines a distance between the query and a data record in the collection based on the number of data records in the collection that are associated with all of the properties in the intersection of the first set of properties and the set of properties for the data record. - View Dependent Claims (39)
-
-
40. A method for applying a matching algorithm to a collection of items, each item being associated with a set of properties, comprising the steps of:
-
constructing a graph having nodes that correspond to items, and having edges that correspond to pairs of items, wherein each edge has a cost correlated to the number of items in the collection that are associated with all of the properties in the intersection of the sets of properties for the two items that the edge links; and
identifying a subset of the edges that constitutes a minimum-cost matching with respect to the graph.
-
-
41. A method for applying a clustering algorithm to a collection of items, each item being associated with a set of properties, comprising the steps of:
-
constructing a graph having nodes that correspond to items, and having edges that correspond to pairs of items, wherein each edge has a cost correlated to the number of items in the collection that are associated with all of the properties in the intersection of the sets of properties for the two items that the edge links; and
identifying a collection of subsets of the edges that constitutes a minimum-cost clustering with respect to the graph.
-
Specification