Informationtheory based measure of similarity between instances in ontology

0Associated
Cases 
0Associated
Defendants 
0Accused
Products 
6Forward
Citations 
0
Petitions 
1
Assignment
First Claim
1. A method of measuring similarity between instances in an ontology for use in an information retrieval system, the method comprising the steps of:
 obtaining a set of instances from the ontology;
computing a first similarity metric that measures similarity between instances in the set of instances with respect to ontology concepts to which the instances belong; and
storing at least one taxonomy induced by the first similarity metric, wherein the at least one induced taxonomy is usable for responding to requests submitted to the information retrieval system;
wherein the first similarity metric measures similarity of instances i and j in the set of instances based on the similarity of C(i) and C(j), where the C(i) and the C(j) represent sets of concepts to which the instances belong; and
wherein the first similarity metric considers concept membership statements of the instances in the set of instances by defining a description of an individual and a commonality between the instances based on the ontology concepts to which the instances belong;
and further wherein the obtaining, computing and storing steps are performed by a processor and memory.
1 Assignment
0 Petitions
Accused Products
Abstract
Improved information processing techniques for measuring similarity between instances in an ontology are disclosed. For example, a method of measuring similarity between instances in an ontology for use in an information retrieval system includes the following steps. A set of instances from the ontology is obtained. At least one of the following similarity metrics for the set of instances is computed: (i) a first metric that measures similarity between instances in the set of instances with respect to ontology concepts to which the instances belong; (ii) a second metric which measures similarity between instances in the set of instances where the instances are subjects in statements involving a given ontology property; and (iii) a third metric which measures similarity between instances in the set of instances where the instances are objects in statements involving a given ontology property. At least one taxonomy induced by the at least one computed similarity metric is stored, wherein the at least one induced taxonomy is usable for responding to requests submitted to an information retrieval system. When two or more of the first metric, the second metric and the third metric are computed, and two or more induced taxonomies corresponding to the two or more computed similarity metrics are stored, the method may include merging the two or more induced taxonomies to form a combined taxonomy, wherein the combined taxonomy is usable for responding to requests submitted to an information retrieval system.
17 Citations
View as Search Results
Method and system for managing single and multiple taxonomies  
Patent #
US 7,974,984 B2
Filed 04/19/2007

Current Assignee
Blackbird Tech LLC

Sponsoring Entity
Mobile Content Networks Incorporated

TAXONOMY BASED INDEXING AND SEARCHING  
Patent #
US 20090089270A1
Filed 09/26/2008

Current Assignee
Autodesk Inc.

Sponsoring Entity
Autodesk Inc.

METHOD AND SYSTEM FOR MANAGING SINGLE AND MULTIPLE TAXONOMIES  
Patent #
US 20070250487A1
Filed 04/19/2007

Current Assignee
Blackbird Tech LLC

Sponsoring Entity
Mobile Content Networks Incorporated

Taxonomy based indexing and searching  
Patent #
US 8,131,757 B2
Filed 09/26/2008

Current Assignee
Autodesk Inc.

Sponsoring Entity
Autodesk Inc.

Digitization of a catalog of retail products  
Patent #
US 9,928,530 B2
Filed 01/19/2017

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Userguided term suggestions  
Patent #
US 10,459,957 B2
Filed 12/29/2016

Current Assignee
Google LLC

Sponsoring Entity
Google LLC

Method and system for analyzing similarity of concept sets  
Patent #
US 20080195587A1
Filed 02/13/2007

Current Assignee
SAP SE

Sponsoring Entity
SAP SE

Casebased reasoning similarity metrics implementation using user defined functions  
Patent #
US 7,136,852 B1
Filed 11/27/2001

Current Assignee
Teradata US Inc.

Sponsoring Entity
NCR Corporation

System and method for generating a taxonomy from a plurality of documents  
Patent #
US 6,665,681 B1
Filed 04/09/1999

Current Assignee
Amobee Inc.

Sponsoring Entity
Entrieva Inc.

System and method for capturing knowledge for integration into one or more multirelational ontologies  
Patent #
US 20060053099A1
Filed 05/05/2005

Current Assignee
BioWisdom Ltd.

Sponsoring Entity
BioWisdom Ltd.

Method, system, and computer program product for searching for, navigating among, and ranking of documents in a personal web  
Patent #
US 20060059144A1
Filed 09/16/2005

Current Assignee
Telenor Asa

Sponsoring Entity
Telenor Asa

System and method for arranging concept clusters in thematic neighborhood relationships in a shaped twodimensional visual display space  
Patent #
US 20050192956A1
Filed 08/03/2004

Current Assignee
Nuix North America Inc.

Sponsoring Entity
Attenex Corp.

System and method for performing similarity searching using pointer optimization  
Patent #
US 6,738,759 B1
Filed 07/07/2000

Current Assignee
Fair Isaac Corporation

Sponsoring Entity
Infoglide Software Corp.

Catalog taxonomy for storing product information and system and method using same  
Patent #
US 20040254950A1
Filed 06/13/2003

Current Assignee
CBS Interactive Inc.

Sponsoring Entity
CBS Interactive Inc.

Method and system for collaborative ontology modeling  
Patent #
US 20030163597A1
Filed 05/25/2001

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Method and apparatus for developing a transfer dictionary used in transferbased machine translation system  
Patent #
US 20030233226A1
Filed 06/05/2003

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Information system with knowledge base and data base  
Patent #
US 5,418,943 A
Filed 10/23/1991

Current Assignee
ATT Inc.

Sponsoring Entity
ATT Inc.

25 Claims
 1. A method of measuring similarity between instances in an ontology for use in an information retrieval system, the method comprising the steps of:
obtaining a set of instances from the ontology; computing a first similarity metric that measures similarity between instances in the set of instances with respect to ontology concepts to which the instances belong; and storing at least one taxonomy induced by the first similarity metric, wherein the at least one induced taxonomy is usable for responding to requests submitted to the information retrieval system; wherein the first similarity metric measures similarity of instances i and j in the set of instances based on the similarity of C(i) and C(j), where the C(i) and the C(j) represent sets of concepts to which the instances belong; and wherein the first similarity metric considers concept membership statements of the instances in the set of instances by defining a description of an individual and a commonality between the instances based on the ontology concepts to which the instances belong; and further wherein the obtaining, computing and storing steps are performed by a processor and memory.  View Dependent Claims (2, 3, 4, 5, 6, 7)
 8. A method of measuring similarity between instances in an ontology for use in an information retrieval system, the method comprising the steps of:
obtaining a set of instances from the ontology; computing at least one of the following similarity metrics for the set of instances; a first metric that measures similarity between instances in the set of instances with respect to ontology concepts to which the instances belong; a second metric which measures similarity between instances in the set of instances where the instances are subjects in statements involving a given ontology property; and a third metric which measures similarity between instances in the set of instances where the instances are objects in statements involving a given ontology property; and storing at least one taxonomy induced by the at least one computed similarity metric, wherein the at least one induced taxonomy is usable for responding to requests submitted to an information retrieval system; wherein the first metric, the second metric and the third metric comprise information theorybased measurements.
 9. A method of measuring similarity between instances in an ontology for use in an information retrieval system, the method comprising the steps of:
obtaining a set of instances from the ontology; computing a similarity metric which measures similarity between instances in the set of instances where the instances are subjects in statements involving at least one given ontology property; and storing at least one taxonomy induced by the similarity metric, wherein the at least one induced taxonomy is usable for responding to requests submitted to the information retrieval system; wherein the similarity metric measures similarity of instances i and j in the set of instances based on the similarity of sets of objects in statements where the instances are subjects in the statements; and further wherein the obtaining, computing and storing steps are performed by a processor and memory.  View Dependent Claims (10, 11, 12, 13, 14, 15)
 16. A method of measuring similarity between instances in an ontology for use in an information retrieval system, the method comprising the steps of:
obtaining a set of instances from the ontology; computing a similarity metric which measures similarity between instances in the set of instances where the instances are objects in statements involving at least one given ontology property; and storing at least one taxonomy induced by the similarity metric, wherein the at least one induced taxonomy is usable for responding to requests submitted to the information retrieval system; wherein the similarity metric measures similarity of instances i and j in the set of instances based on the similarity of sets of subjects in statements where the instances are objects in the statements; and further wherein the obtaining, computing and storing steps are performed by a processor and memory.  View Dependent Claims (17, 18, 19, 20, 21, 22)
 23. An article of manufacture for measuring similarity between instances in an ontology for use in an information retrieval system, comprising a computer readable storage medium containing one or more programs which when executed by a processor implement the steps of:
obtaining a set of instances from the ontology; computing at least one of the following similarity metrics for the set of instances; a first similarity metric that measures similarity between first instances in the set of instances with respect to ontology concepts to which the first instances belong, wherein the first similarity metric measures similarity of first instances i1 and j1 in the set of instances based on the similarity of C(i1) and C(j1), wherein the C(i1) and the C(j1) represent sets of concepts to which the first instances belong, and wherein the first similarity metric considers concept membership statements of the first instances in the set of instances by defining a description of an individual and a commonality between the first instances based on the ontology concepts to which the first instances belong; a second similarity metric which measures similarity between second instances in the set of instances, wherein the second instances are subjects in first statements involving at least one given first ontology property, wherein the second similarity metric measures similarity of second instances i2 and j2 in the set of instances based on similarity of sets of objects in the first statements where the second instances are the subjects in the first statements; and a third similarity metric which measures similarity between third instances in the set of instances, wherein the third instances are objects in second statements involving at least one given second ontology property, wherein the third similarity metric measures similarity of third instances i3 and j3 in the set of instances based on similarity of sets of subjects in the second statements where the third instances are objects in the second statements; and storing at least one taxonomy induced by at least one of the first, the second and the third similarity metric, wherein the at least one induced taxonomy is usable for responding to requests submitted to the information retrieval system.
 24. Apparatus for measuring similarity between instances in an ontology for use in an information retrieval system, the apparatus comprising:
a memory; and a processor coupled to the memory and operative to; (i) obtain a set of instances from the ontology; (ii) compute at least one of the following similarity metrics for the set of instances; a first similarity metric that measures similarity between first instances in the set of instances with respect to ontology concepts to which the first instances belong, wherein the first similarity metric measures similarity of first instances i1 and j1 in the set of instances based on the similarity of C(i1) and C(j1), wherein the C(i1) and the C(j1) represent sets of concepts to which the first instances belong, and wherein the first similarity metric considers concept membership statements of the first instances in the set of instances by defining a description of an individual and a commonality between the first instances based on the ontology concepts to which the first instances belong; a second similarity metric which measures similarity between second instances in the set of instances where the second instances are subjects in first statements involving at least one given first ontology property, wherein the second similarity metric measures similarity of second instances i2 and j2 in the set of instances based on similarity of sets of objects in the first statements where the second instances are the subjects in the first statements; and a third similarity metric which measures similarity between third instances in the set of instances where the third instances are objects in second statements involving at least one given second ontology property, wherein the third similarity metric measures similarity of third instances i3 and j3 in the set of instances based on similarity of sets of subjects in the second statements where the third instances are objects in the second statements; and (iii) store at least one taxonomy induced by at least one of the first, the second and the third similarity metrics, wherein the at least one induced taxonomy is usable for responding to requests submitted to an information retrieval system.
 25. An information retrieval system, comprising a similarity measurement system comprising a memory and a processor coupled to the memory, the information retrieval system configured to:
(i) obtain a set of instances from the ontology; (ii) compute at least one of the following similarity metrics for the set of instances; a first similarity metric that measures similarity between first instances in the set of instances with respect to ontology concepts to which the first instances belong, wherein the first similarity metric measures similarity of first instances i1 and j1 in the set of instances based on the similarity of C(i1) and C(j1), wherein the C(i1) and the C(j1) represent sets of concepts to which the first instances belong, and wherein the first similarity metric considers concept membership statements of the first instances in the set of instances by defining a description of an individual and a commonality between the first instances based on the ontology concepts to which the first instances belong; a second similarity metric which measures similarity between second instances in the set of instances where the second instances are subjects in first statements involving at least one given first ontology property, wherein the second similarity metric measures similarity of second instances i2 and j2 in the set of instances based on similarity of sets of objects in the first statements where the second instances are the subjects in the first statements; and a third similarity metric which measures similarity between third instances in the set of instances where the third instances are objects in second statements involving at least one given second ontology property, wherein the third similarity metric measures similarity of third instances i3 and j3 in the set of instances based on similarity of sets of subjects in the second statements where the third instances are objects in the second statements; and (iii) store at least one taxonomy induced by at least one of the first, the second and the third similarity metrics, wherein the at least one induced taxonomy is usable for responding to requests submitted to an information retrieval system.
1 Specification
The present invention relates to information processing techniques and, more particularly, to techniques for measuring similarity between instances in an ontology.
It is known that the “Semantic Web” is an evolving extension of the World Wide Web in which web content can be expressed not only in natural (human) language, but also in a form that can be understood, interpreted and used by machines (e.g., computing devices) that are executing software programs (e.g., applications), thus permitting the applications to find, share and integrate information more easily. Accordingly, the growth of the Semantic Web has seen increasing amounts of knowledge in different domains being expressed using ontology languages such as the OWL Web Ontology Language (or simply “OWL”).
As is known, OWL is intended to be used when the information contained in documents needs to be processed by applications (i.e., needs to be machineinterpretable), as opposed to situations where the content only needs to be presented to humans (i.e., humaninterpretable). OWL can be used to explicitly represent the meaning of terms in vocabularies and the relationships between those terms. This representation of terms and their interrelationships is referred to as an “ontology.”
Ontologies in OWL define the “concepts” (or classes), “properties” and “individuals” (or instances) relevant to some area of interest. The concepts are usually organized in a taxonomy based on a subclass relationship. Properties are associated with a domain and a range. Individuals belong to one or more concepts, and may be related to other individuals or literals through properties.
A key challenge in a number of search and information retrieval systems is finding the similarity between concepts in a taxonomy. The problem of finding the similarity between terms in a taxonomy has been widely studied. Some of these approaches use the structure of the taxonomy to derive a measure of similarity. Others make use of informationtheory based approaches.
However, none of the existing approaches address the specific problem of combining taxonomic and relationship knowledge of instances (i.e., individuals) to measure their similarity.
Accordingly, improved information processing techniques are needed for measuring similarity between instances in an ontology.
Principles of the present invention provide improved information processing techniques for measuring similarity between instances in an ontology.
For example, a method of measuring similarity between instances in an ontology for use in an information retrieval system includes the following steps. A set of instances from the ontology is obtained. At least one of the following similarity metrics for the set of instances is computed: (i) a first metric that measures similarity between instances in the set of instances with respect to ontology concepts to which the instances belong; (ii) a second metric which measures similarity between instances in the set of instances where the instances are subjects in statements involving a given ontology property; and (iii) a third metric which measures similarity between instances in the set of instances where the instances are objects in statements involving a given ontology property. At least one taxonomy induced by the at least one computed similarity metric is stored, wherein the at least one induced taxonomy is usable for responding to requests submitted to an information retrieval system.
When two or more of the first metric, the second metric and the third metric are computed, and two or more induced taxonomies corresponding to the two or more computed similarity metrics are stored, the method may include merging the two or more induced taxonomies to form a combined taxonomy, wherein the combined taxonomy is usable for responding to requests submitted to an information retrieval system.
The first metric, the second metric and the third metric may include information theorybased measurements. The first metric may measure similarity of instances i and j in the set of instances based on the similarity of C(i) and C(j), where C(i) and C(j) represent sets of concepts to which the instances belong. The second metric may measure similarity of instances i and j in the set of instances based on the similarity of sets of objects in statements where the instances are subjects in the statements. The third metric may measure similarity of instances i and j in the set of instances based on the similarity of sets of subjects in statements where the instances are objects in the statements.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
It is to be understood that while principles of the invention will be described below in the context of the OWL ontology language, principles of the invention are not so limited. Rather, principles of the invention are more generally applicable to any ontology based environment in which it would be desirable to provide a measure of similarity between instances in the ontology. In accordance with an OWLbased embodiment, the remainder of the detailed description refers to “instance(s)” as “individual(s).”
Before describing principles of the present invention, some information theorybased similarity principles will be described.
In information theory, the information content in an event e is I(e)=−log(p(e)), where p(e) is the probability for e to occur. Lin (D. Lin, “An informationtheoretic definition of similarity,” In Proc. 15th International Conf on Machine Learning, pp. 296304, Morgan Kaufmann, San Francisco, Calif., 1998) defined semantic similarity between two objects based on information theory. According to Lin'"'"'s definition, what we know about two objects is their “description.” The description is composed of two things: “commonality” and “difference.” Similarity is measured based on the following three intuitions:
(1) The more information content there is in the commonality between two objects, the more similar they are. If there is no commonality, similarity is zero.
(2) The more information content there is in the difference between two objects, the less similar they are.
(3) Maximum similarity is between an object to itself only.
sim(a, b), the similarity between a and b, is defined to be:
This similarity measure is a number between zero (0) and one (1). One (1) is the similarity measure between an object and itself, a case where the information content in the description is equal to the information content in the commonality. Zero (0) is the similarity measure between two objects whose commonality does not have any information content. The measure is commutative.
Resnik (P. Resnik, “Semantic similarity in a taxonomy: An informationbased measure and its application to problems of ambiguity in natural language,” Journal of Artificial Intelligence Research, Vol. 11, pp. 95130, 1999) proposed that p(A), the “probability” associated with A, a concept in a hierarchical taxonomy, is the probability that a random individual belongs to A (in other words, the number of individuals in A divided by the total number of individuals populating the taxonomy). Hence, the information contents of the description of a class A is −log(p(A)). Lin used this definition to say that given two classes, A and B, the information content of the description of the two classes is the sum of their information contents, i.e., (−log(p(a)))+(−log(p(b))).
The information content of the commonality of A and B can be defined in terms of the information content of their least common ancestor, represented as lca(A, B). The information content of the commonality of classes A and B is −log([p(lca(a, b))]^{2})=−2 log(p(lca(a, b))). This is the probability for individuals i and j to satisfy iεlca(a, b)^jεlca(a, b). Therefore:
We now explain why the least common ancestor represents the commonality of A and B. The key intuition is that if an individual i belongs to a class A, it also belongs to A'"'"'s ancestors (represented as A_{1}, A_{2}, . . . , A_{m}). The description of a class can be defined as a set of class membership statements, i.e., Desc(A)={iεA, iεA_{1}, iεA_{2}, . . . iεA_{m}}. Similarly, Desc(B)={iεB, iεB_{1}, iεB_{2}, . . . , iεB_{n}}.
The set of statements describing commonality of A and B is Comm(A, B)={(iεq): ((iεq)εDesc(A))^((iεq)εDesc(B))}. The information content of this set is based on the probability for two random individuals to satisfy the statements in it. If the taxonomy is a tree, then this commonality set contains statements describing membership in the least common ancestor of A and B, along with all its superclasses. Hence, the least common ancestor represents the commonality of A and B.
Principles of the invention realize that ontologies may provide more information about individuals than just the concepts to which they belong. Individuals in the Semantic Web may also be related to other individuals based on various properties. Accordingly, principles of the invention provide an approach for measuring the similarity of individuals in the Semantic Web by combining the knowledge about the taxonomy of concepts the individuals belong to, and the knowledge about the relationships between different individuals.
A core idea behind the inventive approach is to consider similarity between two individuals based on different, independent aspects. Each aspect is essentially one dimension for comparing the individuals. The fast aspect involves measuring the similarity between two individuals in terms of the similarity of the concepts they belong to. In addition, for each property defined in the ontology, there are two separate aspects that measure the similarity between the two individuals in terms of the similarity of the individuals they are related to, based on this property. One aspect considers relationships where the two individuals are the subjects in statements involving that property, while the other aspect considers relationships where the two individuals are the objects.
An algorithm according to an embodiment of the invention uses an informationtheoretic approach to find similarity between two individuals for any aspect. For each aspect, it finds the commonality and the description in the information available about the individuals for that aspect. It uses the commonality and description to obtain similarity values for the individuals, using the above described Lin approach. It finally combines the information from different aspects to obtain an overall similarity measure.
As will be evident below, we describe a number of useful properties of similarity metrics and show how the inventive algorithm satisfies these properties. We also compare our metric with a pure taxonomydriven metric on a sample ontology, and show a number of cases where the inventive algorithm is better than a pure taxonomydriven metric. In particular, the inventive algorithm allows differentiating between individuals that may belong to the similar concepts, but which do not share much commonality in their relationships with other individuals. It also allows detecting the similarity between individuals of distant concepts in the taxonomy, which share a high degree of similarity in their relationships with other individuals.
As will be evident, principles of the invention provide many advantages. By way of example only, such advantages include: (i) the identification of a number of useful features for a similarity metric between individuals; (ii) generalization of Lin'"'"'s similarity measure to find the similarity of two individuals in an ontology based on a DAGtaxonomy of concepts; (iii) an informationtheoretic measurement of the similarity of two individuals taking into account their relationships with other individuals based on a specific property; and (iv) an information theoretic approach for combining different aspects for measuring similarity between two individuals.
Referring initially to
The taxonomy of the wine ontology shown in
Ontologies describe terms using highly expressive and wellstructured semantic graphs. This facilitates the discovery of interesting and unanticipated relationships between different concepts and individuals. One of these relationships, which is similarity of terms (concepts and individuals), is the problem that principles of the invention address. We now formally define the similarity problem.
An Ontology O is defines as a fourtuple (T, P, I, S):
 T, which is O'"'"'s taxonomy, is a hierarchy of C, the set of classes. Being an “isa” or subclassbased hierarchy, every class in C represents a concept which is a specialization of the concept represented by its parent class. T is the organization of the elements in C as a DAG, but with one root only. This root represents the most general concept in the ontology (typically, “Thing”).
 P is a set of properties defined in O. Every property pεP has a set of domain classes (denoted p.d) and a set of range classes (denoted p.r) explicitly defined for it, In this embodiment, we consider only OWL object properties, though the inventive algorithm can be extended to datatype properties as well.
 I is the set of individuals in O. Each individual explicitly belongs to at least one class.
 S is the set of statements in O. Elements in S describe class membership and relationships between individuals. Class membership statements are of the form (iεX), where iεI, XεC Relations between individuals are represented as a triple (s, p, o), containing a subject, a predicate and an object. Here, s, oεI (specifically, sεp.d and oεp.r), and pεP.
For example, in the context of the wine ontology of
We are interested in finding a measure for the semantic similarity of individuals in an ontology O. That is, a function that maps pairs of individuals to a totally ordered set and whose values are in line with intuition for similarity. Without loss of generality, the similarity can be considered to belong to the realvalued range [0,1]. Formally, we are looking for a function ƒ:I×I→[0,1]. For the sake of comparison, most similarity measures described below are proposals of the function g:C×C→[0,1].
Finding similar individuals is motivated by many kinds of investigative tasks; for example, to find how similar two products or patents are. Properties express a lot of information about individuals. Consider a concept in some ontology, in which every individual is associated with a color using a property. We would like to be able to say that two blue individuals are more similar to each other than a blue one and a red one. Moreover, we would like to consider similarity in the values so that light blue and dark blue individuals would be closer to each other than a black individual to a white one.
We now describe similarity between individuals in accordance with illustrative embodiments of the invention.
An OWL ontology gives different kinds of information about individuals. The information includes the concepts to which they belong and their relationships to other individuals based on different properties. In many scenarios, it is useful to consider the similarity between two individuals based on only a subset of all the information contained in the ontology. For example, it is sometimes useful to consider similarity based only on the values of a certain property, or a certain set of properties. In other cases, only the concept membership information may be considered relevant for calculating the similarity.
In order to achieve such fine control over the calculation of similarity, we first partition the space of information about different individuals into a number of different aspects. Each aspect only considers a portion of the information in the ontology. We then measure the similarity between individuals based on a certain aspect, or on a certain set of aspects. Below we describe the different kinds of aspects and some desired properties of the similarity measures for an aspect and for a set of aspects.
Classes are the representation of abstract concepts. We use “concept” to emphasize semantics and “class” to emphasize the hierarchical structure. In most cases both terms can be used interchangeably.
Principles of the invention provide for three kinds of aspects:
(1) ConceptSetSimilarity. This aspect measures the similarity between individuals based on the similarity of the sets of classes (or concepts) they belong to in the taxonomy. That is, it measures the similarity of individuals i and j based on the similarity of C(i) and C(j), where C(i) and C(j) represent the sets of classes that the individuals belong to. We denote this aspect by the symbol CS.
(2) ObjectSetSimilarity. For a property, p, this aspect measures the similarity between two individuals, i and j, based on the similarity of the sets of objects in statements where these individuals are the subjects.
Note that an ontologybased statement typically includes a subject, a relationship, and an object (e.g., (banana, hasColor, yellow), where banana is an individual that is the subject of the statement, yellow is an individual that is the object of the statement, and hasColor is a relationship that represents a property in the ontology.)
Let O_{p}(i)={o:(i, p, o)εS} and O_{p}(j)={o:(j, p, o)εS}. Then, this aspect measures the similarity of i and j in terms of the similarity between the object sets O_{p}(i) and O_{p}(j). We denote this aspect by the symbol OS(p).
(3) SubjectSetSimilarity. For a property, p, this aspect measures the similarity between two individuals i and j based on the similarity of the sets of subjects in statements where these individuals are the objects. Let S_{p}(i)={s:(s, p, i)εS} and S_{p}(j)={s:(s, p, j)εS}. Then, this aspect measures the similarity of i and j in terms of the similarity between the subject sets S_{p}(i) and S_{p}(j). We denote this aspect by SS(p).
For any ontology, O(T, P, I, S), we can define 1+2P aspects. We refer to this set of aspects as A. If there are n properties in P of the form p_{1}, . . . , p_{n}, then A={CS, OS(p_{1}), . . . , OS(p_{n}), SS(p_{1}), . . . , SS(p_{n})}.
We are interested in finding a measure for the similarity of individuals in an ontology O based only on a specific aspect. That is, a function ƒ_{A }s.t. (such that): ƒ_{A}:I×I→[0,1], which measures the similarity according to aspect A, s.t. AεA.
In addition to the similarity based on a single aspect, we are also interested in finding a combined similarity measure for any set of aspects. We define the combined similarity function as ƒ_{x}*:I×I→[0,1] where X is a set of aspects, i.e., Xε2^{A}.
We now describe some desired features for a similarity metric between individuals. These features try to capture the intuition for similarity between individuals in an OWL ontology. Let i and j be two individuals. For any aspect, we would like to preserve Lin'"'"'s intuitions on similarity: (i) increase with commonality, i.e., the more commonality i and j share with respect to aspect A, the greater the value of ƒ_{A}(i, j); (ii) decrease with difference, i.e., the more differences i and j have, with respect to aspect A, the lesser the value of ƒ_{A}(i, j); and (iii) maximum similarity under equality, i.e., the maximum similarity is between an individual and itself only; in other words, (ƒ_{A}(i,j)=1)(i=j).
The same intuitions also hold for a combination of aspects and for the combined similarity metric. Based on these intuitions, there are a number (listed 19 below) of other desired features for the similarity metric.
(1) Propagation of ConceptSimilarity
If two concepts are similar, then this similarity propagates to individuals that belong to these concepts.
Consider two concepts C_{i }and C_{j}. Let individual i belong to C_{i }and individual j belong to C_{j}. According to this intuition, the greater the commonality between C_{i }and C_{j}, the higher is the similarity between i and j.
Let ƒ be the similarity metric defined for given classes. Consider a third concept, C_{k}, to which the individual k belongs. If i, j, k and l are not members of any other concepts, then, according to this intuition,
If ƒ(C_{i}, C_{j})≧ƒ(C_{i}, C_{k}),
 then ƒ_{cs}(i,j)≧ƒ_{cs}(i, k)
Example. Let an ontology have the statements:
 (bananaεFruit), (carrotεVegetable), (ravenεBird).
Assuming the individuals belong to no other concepts, then, in an ontology where
 ƒ(Fruit, Vegetable)≧ƒ(Fruit, Bird), we expect
 ƒ_{cs}(banana, carrot)≧ƒ_{cs}(banana, raven).
The same intuition can be easily extended to the case where an individual belongs to more than one concept. This can be done by considering the intersection of all the concepts to which the individual belongs.
(2) Propagation of ObjectSimilarity
If two individuals, a and b, are similar, then this similarity propagates to any other pair of individuals, i and j, that are related to a and b, respectively, by a property p. In particular, if a and b have a high conceptsetsimilarity, then i and j have high objectsetsimilarity for the property, p.
Assume the ontology has the following three statements on the property p, which are (i, p, a), (j, p, b) and (k, p, c). If i, j and k do not have other objects associated with them for property p, then this feature states the following intuition about ObjectSetSimilarity:
If ƒ_{cs}(a, b)≧ƒ_{cs}(a, c)
 then ƒ_{os}_{(p)}(i,j)≧ƒ_{os}_{(p)}(i, k)
Example. Consider the statements
 (banana, hasColor, yellow),
 (carrot, hasColor, orange),
 (raven, hasColor, black),
Assume the subjects above have no other statements with the hasColor property. If in an ontology:
 ƒ_{cs}(yellow, orange)≧ƒ_{cs}(yellow, black), then
 ƒ_{os}_{(p)}(banana, carrot)≧ƒ_{os}_{(p)}(banana, raven).
The same feature can be extended to the case where the subjects have multiple object values for a property.
(3) Propagation of SubjectSimilarity
This is a dual of the above feature. If two subjects are very similar, then this similarity propagates to the objects they are related to by a property. Again, consider the statements: (i, p, a), (j, p, b) and (k, p, c). If a, b and c do not have other objects associated with them for property p,
If ƒ_{cs}(i,j)≧ƒ_{cs}(i, k), then ƒ_{ss}_{(p)}(a, b)≧ƒ_{ss}_{(p)}(a, c)
Example. Consider the statements
 (cream, usedInMaking, coffee),
 (milk, usedInMaking, hotchocolate),
 (glass, usedInMaking, window).
Assume the objects above have no other statements with the hasColor property. In an ontology where
 ƒ_{cs}(cream, milk)≧ƒ_{cs}(cream, glass), then
 ƒ_{ss}_{(p)}(coffee, hotchocolate)≧ƒ_{ss}_{(p)}(coffee, window).
(4) Inverse Dependence on Concept Cardinality
If i and j belong to a concept represented by a class C, that has a small cardinality (small number of elements), then the individuals have a high conceptsetsimilarity. Intuitively, this means that they share a piece of information which is rare (since there are only a few individuals that belong to C), and therefore, the information content in their commonality is higher.
Let iεC, jεC, kεD, lεD. Assuming that i, j, k and l do not belong to any other class,
If C<D, then ƒ_{cs}(i, j)>ƒ_{cs}(k, l) where C is the cardinality of C.
Example. Assume that in an ontology there are 100 individuals in the concept Drink, but only 3 individuals in the concept Fruit. In the context of this ontology, two individuals which are fruits share greater commonality than individuals which are drinks.
(5) Inverse Dependence on SubjectSet Cardinality
If i and j share a “rare” object value, x, for a property p, then the information content in their commonality is higher. In other words, if the cardinality of the set of individuals that are the subjects of statements with the property p and the object o, is small, then i and j share a rare piece of information, and are hence their objectset similarity is higher.
Let (i, p, x), (j, p, x), (k, p, y), (l, p, y) be statements in the ontology. Assuming that i, j, k and l are not the subjects of any other triples with the property p,
If S_{p}(x)<S_{p}(y), then ƒ_{os}_{(p)}(i, j)>ƒ_{os}_{(p)}(k, l) where S_{p}(x) is the number of individuals that are the subjects of statements with the property p and the object x.
Example. Assume that an ontology of cars describes that the color of many cars is white, but very few cars have the color purple. Then assuming that a car has just one color value, two purple cars are more similar to each other than two white cars in a similarity measure based on this aspect.
(6) Inverse Dependence on ObjectSet Cardinality
This is the dual of the previous feature. Let (i, p, x), (i, p, y), (j, p, z), (j, p, w) be statements in the ontology. Assuming that x, y, z and w are not the objects of any other triples with the property p.
If O_{p}(i)<O_{p}(j), then ƒ_{ss}_{(p)}(x, y)>ƒ_{ss}_{(p)}(z, w).
(7) Commutativity
ƒ_{A}(i,j)=ƒ_{A}(j, i) for any aspect, A.
(8) Minimum Similarity Under Disjointness
When there is no information content in the commonality between i and j with respect to some aspect A, then ƒ_{A}(i,j)=0.
Following from this point, for a property p, if O_{p}(i)=0 or if O_{p}(j)=0 then ƒ_{os}_{(p)}(i, j)=0. This is because if either individual is not the subject of any statement on the property p, then the commonality based on this aspect is 0. Similarity, if S_{p}(i)=0 or if S_{p}(j)=0 then ƒ_{ss}_{(p)}(i, j)=0.
(9) Monotonic Nature of Combined Similarity
The similarity between two individuals can only increase as more aspects are considered. This follows from the open world assumption on which OWL is based, and the monotonic nature of description logics. That is, new information from new aspects cannot cause existing information to become false or invalid (unless it causes the ontology to become logically inconsistent; however, we assume that the ontologies are consistent).
Formally, let X and X′ be two sets of aspects. If X⊂X′, then ƒ_{x}(i,j)≦ƒ_{x′}(i,j).
This does not mean that individuals become more similar as more aspects are considered. Relative similarity can change as more aspects are considered.
We now describe similarity based on a single aspect.
In order to find the similarity of two individuals based on a single aspect, we need to come up with sufficient measures that capture the description and commonality of two individuals based on the aspect. Once we have these measures, we can compute the similarity of the individuals.
A. ConceptSetSimilarity Measure
ConceptSetSimilarity considers only concept membership statements of individuals. We define the description of an individual and commonality between two individuals based on the concepts to which they belong.
In order to compute the description of an individual, we define a new virtual class of size one (1), and make the individual the only member of this virtual class. We call this new class a virtual class since it did not exist in the original ontology.
For an individual, i, let its virtual class be denoted by V_{i}. If, in the ontology, i belongs to a set of concepts, C(i), then each concept in C(i) is a superclass of V_{i}.
For example, consider two individuals in taxonomy 200 of the wine ontology in
The definition of the virtual class helps in capturing the fact that an individual has a unique identity. The description of the individual is in terms of its membership to this virtual class, i.e., Desc(i)={iεV_{i}}. The information content of the description is the probability that a random individual satisfies the statement in the description, i.e., the probability that a random individual belongs to V_{i}. Since the size of V_{i }is 1, I(Desc(i))=−log(1/I).
In order to calculate the commonality between two individuals i and j, we first expand the descriptions to include other concept membership statements that can be inferred based on the concept taxonomy.
ExpDesc(i)={(iεC): C is an ancestor of V_{i}}, and
ExpDesc(j)={(jεC): C is an ancestor of V_{j}}.
In the example in
We define commonality as a pair of class membership statements for classes that appear in both the expanded descriptions.
Comm(i,j)={(iεC), (jεC): (iεC)εExpDesc(i)(jεC)εExpDesc(j)}
i.e., Comm(i,j)={(iεC), (jεC): C is an ancestor of V_{i }and V_{j}}.
The information content in the commonality is the probability that a pair of random individuals satisfies the pair of class membership statements in the commonality. Let C(i,j) denote the set of classes that are the least common ancestors of V_{i }and V_{j}. It is sufficient to consider the classes in C(i,j) for, calculating commonality since membership to all common ancestors can be derived from the set of least common ancestors.
Let V_{i,j }denote the intersection of all the classes in C(i, j). We call V_{i,j }the lcaintersection class (or least common ancestors intersection class) for individuals i and j for the aspect CS.
I(Comm(i,j))=−log(p[(xεV_{i,j})^(yεV_{i,j})]), or
I(Comm(i,j))=−2·log(V_{i,j}/I)
In the example discussed above, C(i, j)={RedTableWine, DryRedWine}. The number of individuals in V_{i,j}, the intersection of these two classes is 25. The number of individuals in the wine ontology is 206. Therefore, I(Comm(i, j))=−2·log(25/206). Thus, the measure of conceptsetsimilarity is:
The concept V_{i,j }contains at least two individuals, i and j. If there are only a few other individuals that belong to V_{i,j }then the information content in the commonality is large, and, thus, the similarity between i and j is high.
We show that the conceptsetsimilarity metric satisfies several of the desired features described above.
It is easy to prove that the similarity measure is inversely dependent on concept cardinality, which is one of the desired features of a similarity metric. Consider four individuals, i, j, k and l, such that, iεC, jεC, kεD, lεD. Assuming that i, j, k and I do not belong to any other class, it can be seen that V_{i,j}=C and V_{k,l}=D.
Thus, if C<D, then ƒ_{cs}(i, j)>ƒ_{cs}(k, l).
The metric is commutative since it does not depend on the order of considering i and j. Also, the maximum value of similarity occurs when we compare an individual to itself The lcaintersection class in this case is the same as the virtual class defined for the individual. Thus, ƒ_{cs}(i, i)=1.
In addition, we can prove that this similarity metric satisfies the propagation of conceptsimilarity feature in the case of a treebased taxonomy. Consider concepts C_{i}, C_{j }and C_{k}, to which individuals i, j and k respectively belong. Assume that i, j and k do not belong to any other concepts. Since, we are considering a treebased taxonomy, C_{i }and C_{j }have exactly one least common ancestor, which is, in fact, the same as V_{i,j}. Similarly, the least common ancestor of C_{i }and C_{k }is V_{i,k}.
If ƒ(C_{i}, C_{j})≧ƒ(C_{i}, C_{k}), then, in a treebased taxonomy, it means that either V_{i,j }is the same as V_{i,k }or V_{i,j }is below V_{i,k }in the taxonomy. This is because, in such a taxonomy, all the ancestors of C_{i }are along the path from C_{i }to the root. Thus, V_{i,k }is an ancestor of V_{i,j}; hence any instance of V_{i,k }is also an instance of V_{i,j}. As a result, V_{i,k}≦V_{i,j}. Since the conceptsetsimilarity of two individuals depends on the cardinality of their virtual lcaintersection class, ƒ_{cs}(i, j)≧ƒ_{cs}(k, l).
B. ObjectSetSimilarity Measure
We now consider the similarity between i and j based on statements where i and j are the subjects and the predicate is a certain property, p.
Just as in the case of conceptsetsimilarity, the individual can be considered to belong to a virtual class of size one (1), and the description can be defined in terms of its membership to this virtual class. However, this virtual class is different from the V_{i }that we defined for conceptsetsimilarity. It is not associated with the class taxonomy defined in the ontology, but is associated with a different taxonomy that can be defined based on the range of property p.
We now describe how this new taxonomy can be built. Let us first consider the information known about i for this aspect. Let O_{p}(i) be the set of objects in statements where i is the subject and p is the predicate. O_{p}(i)={g_{1}, g_{2}, . . . , g_{m}}. For each g_{k}, k=1, . . . , m, we construct a new virtual class, V(p, g_{k}), that represents the set of all individuals that are related by property p to object g_{k}. Note that these virtual classes did not exist in the original ontology.
For an individual, i, let the virtual class it belongs to, based on the property p, be denoted by V_{i}(p). The superclasses of V_{i}(p) are V(p, g_{k}) where g_{k}εO_{p}(i).
Now, Desc(i)={iεV_{i}(p)}. The information content of the description is the probability that a random individual satisfies the statement in the description, i.e., the probability that a random individual belongs to V_{i}(p). Since the size of V_{i}(p) is 1, I(Desc(i))=−log(1/I).
If individual i is not the subject of any statement with property p, i.e., O_{p}(i)=0, then the only superclass of its virtual class, V_{i}(p), is the root class of the taxonomy.
We illustrate this in
In order to calculate the commonality, we first expand the descriptions to include other statements that can be inferred based on the objectset of i for property p and the taxonomy associated with the range of p.
The classes V_{i}(p) and V_{j}(p) will be a part of a taxonomy that we now build, called OS(p)induced taxonomy. As described earlier, the superclasses of V_{i}(p) are V(p, g_{k}) where g_{k}εO_{p}(i). Similarly, the superclasses of V_{j}(p) are V(p, h_{l}) where h_{l}εO_{p}(j).
Consider any individual, g_{k}, that satisfies (i, p, g_{k}). Now, g_{k }itself belongs to one or more classes, some subset of which lies in the range of p. Let one such classes to which g_{k }directly belongs be C. Then, V(p, g_{k}) has a superclass whose members consists of all those individuals that are related by property p to some individual that happens to be of type C. We denote this superclass as V(p, C) (and as pC in the figures). To state it more formally,
V(p,C)={s:∃x(s,p,x)xεC}.
For example, in
Hence, we can construct a taxonomy based on the virtual classes. This taxonomy exactly reflects the portion of the concept taxonomy that is rooted at the classes in the range of p, i.e., under p.r. We call this taxonomy the OS(p)induced taxonomy since it is induced based on the set of object values of the property p, as shown in
The commonality between two individuals based on the objectsetsimilarity aspect can now be defined based on this induced taxonomy. The similarity between virtual classes in this taxonomy can be calculated in the same way as was done for classes in the original taxonomy. That is, we construct expanded descriptions of i and j based on the OS(p)induced taxonomy:
ExpDesc(i)={(iεC): C is an ancestor of V_{i}(p)}, and
ExpDesc(j)={(jεC): C is an ancestor of V_{j}(p)}
The commonality is defined as a pair of class membership statements on classes that appear in the expanded descriptions of i and j, or:
Comm(i, j)={(iεC), (iεC): C is an ancestor of V_{i}(p) and V_{j}(p)}.
The information content in the commonality is the probability that a pair of random individuals satisfies the pairs of class membership statements in the commonality. Let C(i,j) denote the set of classes that are the least common ancestors of V_{i}(p) and V_{j}(p). Note that these classes lie in the OS(p)induced taxonomy. And let V_{i,j}(p) denote the intersection of all the classes in C(i, j). We call V_{i,j}(p) the lcaintersection class for i and j for the aspect OS(p).
I(Comm(i,j))=−log(p[(xεV_{i,j}(p))^(yεV_{i,j}(p))]), or
I(Comm(i, j))=−2·log(V_{i,j}(p)/I)
To summarize, to calculate the similarity of two individuals i and j, we first define new virtual classes for these individuals. We then construct the relevant portions of the OS(p)induced taxonomy in order to find the set of least common ancestors of the new virtual classes. Finally, we find the intersection of these classes, and use that to calculate the commonality.
The OS(p)induced taxonomy helps in organizing all individuals based on the value of their property, p. For example, based on this aspect, two wines that have the same color, red, are more similar to each other than a red wine and white wine. Also, a red wine and a white wine are more similar to each other than two individuals that do not have any color associated with them.
In
For Merlot and ChateauMorgonBeaujolais, that are both red, the lcaintersection class is V(hasColor, Red), the cardinality of which is 26. Thus
Thus, according to our metric, the similarity between two red wines is more than the similarity between a red wine and a white wine, which, in fact, meets our intuition.
Note that the set V_{i,j}(p) contains at least two individuals, i and j. If there are only a few other individuals that belong to V_{i,j}(p), then it means that relatively few individuals have an object value of a property p that belongs to the least common ancestors of V_{i}(p) and V_{j}(p). In this case, the information content in the commonality is large, and hence the similarity between i and j is high.
We show that the objectsetsimilarity metric satisfies several of the desired features described above.
It is easy to prove that the similarity measure is inversely dependent on subjectset cardinality. Let (i, p, x), (i, p, x), (k, p, y), (l, p, y) be statements in the ontology. Assume that i, j, k and l are not the subjects of any other triples with the property p. Then, it can be seen that V_{i,j}(p)=S_{p}(x), where S_{p}(x) represents the set of all subjects that are related via property p to the object x. Similarly, V_{k,l}(p)=S_{p}(y). Thus, if S_{p}(x)<S_{p}(y), then ƒ_{os}_{(p)}(i, j)>ƒ_{os}_{(p)}(k, l).
It is also clear that the metric is commutative. Also, ƒ_{os}_{(p)}(i, i)=1 because V_{i,i}(p) is the same as V_{i}(p). Hence, the maximum value of similarity occurs when we compare an individual to itself.
In addition, we can prove that this similarity metric satisfies the propagation of objectsimilarity feature in the case of an ontology that has a treebased concept taxonomy. Assume the ontology has the following statements on the property p, which are (i, p, a), (i, p, b) and (k, p, c). Assume that i, j and k do not have other objects associated with them for property p. Since the OS(p)induced taxonomy is based on the concept taxonomy under p.r (the range of p), the lcaintersection concept for the OS(p) aspect for i and j (denoted by V_{i,j}(p)) is defined in terms of the lcaintersection concept for the CS aspect for a and b (denoted by V_{a,b}), i.e., V_{i,j}(p)=V(p,V_{a,b})={s:∃x, (s, p, x)xεV_{a,b}}
If ƒ_{cs}(a, b)≧ƒ_{cs}(a, c), then, in a treebased taxonomy, it means that either V_{a,b }is the same as V_{a,c }or V_{a,b }is below V_{a,c }in the taxonomy. This implies that in the OS(p)induced taxonomy, either V_{i,j}(p) is the same as V_{i,k}(p) or V_{i,j}(p) is below V_{i,k}(p). Hence, V_{i,j}(p)≦V_{i,k}(p) and, ƒ_{os}_{(p)}(i,j)≧ƒ_{os}_{(p)}(i, k).
C. SubjectSetSimilarity Measure
We now describe subjectsetsimilarity for a certain property.
The SubjectSetSimilarity for two individuals i and j for a property p is calculated in a similar manner to the ObjectSetSimilarity.
D. Combining Similarity Measures
We now consider the similarity between i and j based on information from multiple aspects, i.e., a combined similarity measure that considers two or more of the single measures described above.
Above, we showed how, for the conceptsetsimilarity aspect and the subjectset and objectset similarity aspects, we were able to reduce the description of an individual to a statement indicating its membership to a virtual class of size one. This virtual class was associated with different taxonomies for the different aspects. For the conceptsetsimilarity aspect, the taxonomy was the one defined in the ontology (with individualassociated virtual classes). For the objectset and subjectset similarity aspects, we induced new taxonomies by defining new virtual classes based on the relationships between individuals based on some property.
In order to combine information from different aspects, we merge the taxonomies associated with the different aspects. We now describe how we generate the merged taxonomy.
Let us first consider the case of two aspects, A_{1 }and A_{2}. Let the virtual classes defined for an individual, i, for these aspects be Vi(A_{1}) and Vi(A_{2}). In the new merged taxonomy, we combine the virtual classes to create a new virtual class Vi(A_{1}, A_{2}). This size of this virtual class is again one; it only contains the individual, i. The superclasses of this new virtual class includes all the superclasses of Vi(A_{1}) and Vi(A_{2}). The description of an individual is now a statement indicating its membership to this new virtual class.
Desc(i)={iεV_{1}(A_{1},A_{2})}
For example, consider the individuals Merlot and CRiesling in taxonomy 500 in
In order to calculate the commonality between two individuals, we first expand the descriptions to include other statements that can be inferred based on the taxonomies of the two aspects.
ExpDesc(i)={(iεC): C is an ancestor of V_{i}(A_{1}, A_{2})},
ExpDesc(j)={(jεC): C is an ancestor of V_{j}(A_{1}, A_{2})}
As before, the commonality is defined as pairs of class membership statements:
Comm(i,j)={(iεC), (jεC): C is an ancestor of V_{i}(A_{1}, A_{2}) and V_{j}(A_{1}, A_{2})}
The information content in the commonality is the probability that a pair of random individuals satisfies the pair of class membership statements in the commonality. Let C_{i,j}(A_{1}, A_{2}) denote the set of classes that are the least common ancestors of V_{i}(A_{1}, A_{2}) and V_{j}(A_{1}, A_{2}). And let V_{i,j}(A_{1}, A_{2}) denote the intersection of the individuals in the classes in C_{i,j}(A_{1}, A_{2}). We call V_{i,j}(A_{1}, A_{2}) the lcaintersection class for i and j based on aspects A_{1 }and A_{2}.
I(Comm(i,j))=−log(p[(xεV_{i,j}(A_{1},A_{2}))^(yεV_{i,j}(A_{1},A_{2})))
i.e., I(Comm(i,j))=−2·log(V_{i,j}(A_{1},A_{2})/I).
A naive way of obtaining the intersection class V_{i,j}(A_{1}, A_{2}) is to create intersection classes for every pair of classes in the two taxonomies. However, a very useful optimization is based on the insight that the intersection class V_{i,j}(A_{1}, A_{2}) can be calculated by intersecting the virtual intersection classes for the two aspects.
Let the virtual intersection class that were generated for aspects A_{1 }and A_{2 }be V_{i,j}(A_{1}) and V_{i,j}(A_{2}). Then V_{i,j}(A_{1}, A_{2})=V_{i,j}(A_{1})∩V_{i,j}(A_{2}). The reason for this is if any individual, x, belongs to both V_{i,j}(A_{1}) and V_{i,j}(A_{2}), then it has to belong to V_{i,j}(A_{1}, A_{2}).
The above method can be extended for any number of aspects, say A_{1}, A_{2}, . . . , A_{n}. The taxonomy for each additional aspect can be incrementally merged with the existing combined taxonomy.
For example,
Now, let us consider, a third aspect, the objectsetsimilarity based on the property, locatedIn (not shown in the figure). The two individuals have the same value, NewZealandRegion, for this property. In the induced taxonomy for the property locatedIn, seven individuals are in the class locatedInNewZealandRegion. When we merge the three induced taxonomies, the new lcaintersection class has six individuals. Based on the three properties only, the similarity is:
The example above showed one of the desired features of a similarity metric, viz. the monotonic nature of combined similarity. We now prove that our metric does indeed satisfy the monotonicity feature in the general case.
Let X and X′ be sets of aspects, and let X⊂X′. Since our algorithm for constructing the merged taxonomy is incremental, we can construct the merged taxonomy corresponding to X′ by first constructing the merged taxonomy for X and then incrementally adding in the remaining aspects in X′−X. Now, the individuals in the lcaintersection class for X′ will be a subset of the individuals in the lcaintersection class for X. Hence, V_{i,j}(X)>V_{i,j}(X′), and thus ƒ_{x}(i,j)≦ƒ_{x′}(i,j).
Referring now to
It is to be appreciated that similarity measures associated with the individuals of the subject ontology may be used by an information retrieval system. For example, as shown in
Referring now to
In step 702, one or more sets of individuals (instances) are obtained.
In step 704, a conceptsetsimilarity measure is computed, as described above, for the one or more sets of individuals.
In step 706, a taxonomy resulting from the computation of the conceptsetsimilarity measure is stored.
In step 708, an objectsetsimilarity measure is computed, as described above, for the one or more sets of individuals.
In step 710, a taxonomy resulting from the computation of the objectsetsimilarity measure is stored.
In step 712, a subjectsetsimilarity measure is computed, as described above, for the one or more sets of individuals.
In step 714, a taxonomy resulting from the computation of the subjectsetsimilarity measure is stored.
In step 716, two or more of the resulting taxonomies are merged, as described above.
In step 718, the merged taxonomy is stored.
In step 720, the merged taxonomy (from step 716) or one or more individual taxonomies (from steps 704, 708 and 712) are utilized for information retrieval in response to an application query (see
Referring lastly to
Further, it is to be understood that the individual components/steps may be implemented on one such computer system, or more preferably, on more than one such computer system. In the case of an implementation on a distributed system, the individual computer systems and/or devices may be connected via a suitable network, e.g., the Internet or World Wide Web. However, the system may be realized via private or local networks. The invention is not limited to any particular network.
As shown, the computer system 800 may be implemented in accordance with a processor 802, a memory 804, I/O devices 806, and a network interface 808, coupled via a computer bus 810 or alternate connection arrangement.
It is to be appreciated that the term “processor” as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit) and/or other processing circuitry. It is also to be understood that the term “processor” may refer to more than one processing device and that various elements associated with a processing device may be shared by other processing devices.
The term “memory” as used herein is intended to include memory associated with a processor or CPU, such as, for example, RAM, ROM, a fixed memory device (e.g., hard drive), a removable memory device (e.g., diskette), flash memory, etc.
In addition, the phrase “input/output devices” or “I/O devices” as used herein is intended to include, for example, one or more input devices (e.g., keyboard, mouse, etc.) for entering data to the processing unit, and/or one or more output devices (e.g., speaker, display, etc.) for presenting results associated with the processing unit.
Still further, the phrase “network interface” as used herein is intended to include, for example, one or more transceivers to permit the computer system to communicate with another computer system via an appropriate communications protocol.
Accordingly, software components including instructions or code for performing the methodologies described herein may be stored in one or more of the associated memory devices (e.g., ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole (e.g., into RAM) and executed by a CPU.
We now present an evaluation of the time complexity of the inventive algorithm. We also describe some of the results of applying this similarity metric on a sample ontology.
The complexity of our algorithm for computing the similarity between two individuals depends on the number of individuals (I), classes (C), properties (P) and statements (S) of property values and class membership in the ontology. Creating virtual individual classes for the two individuals under consideration takes constant time. Finding the set of least common ancestors takes O(C) time. Finding their members is O(S). Computing the intersection costs O(I^{2}) steps. If we consider the combined similarity based on all properties, this process must, potentially, be repeated twice for each property (to consider both objectset and subjectset similarity). Computing the log values for the description and the commonality, as well as dividing the results are neglected. Hence, the worst case complexity is therefore O(P)·(C+S+I^{2})).
The main element of the complexity comes from the computation of the intersection, where lookup of a single element could take linear time. Using a bash set to represent the individuals in a concepts reduces the amortized lookup time to O(1). Therefore, real life (amortized) complexity is O(P)·(C+S+I)). Note that in a complete implementation, there is great scope for optimization since induced taxonomies produced when calculating similarity for one pair of individuals can be reused for another pair.
While there are a number of publicly available ontologies, only a few of them have both a taxonomy and a rich set of relationships between individuals, on which we can apply our algorithm. For purposes of evaluation, we chose the wine ontology [W3C, “The wine ontology,” In http://www.w3.org/TR/2004/RECowlguide0040210/wine.rdf], which has both these features. In addition, since the wine ontology describes well known entities (like wines, colors and locations), it is possible to gain an intuition for the results returned by the similarity metric.
In the following tables, we show the similarity measure, ƒ, for a few pairs of individuals from the wine ontology. The first data row in every table gives the similarity between the individuals based on a single aspect only. The aspects are conceptset similarity and objectset similarity of different properties. The second row gives the combined measure for a certain aspect and all the aspects to its left. The number of individuals in the lcaintersection of an aspect (or for a combination of aspects) is given in brackets.
The individuals StonleighSauvignonBlanc and SelaksIceWine have the following measures:
The lcaintersection of conceptsimilarity has 25 individuals, and similarity is 0.395. The objectset similarity of hasColor is also 0.395. Combining the two, the lcaintersection has 14 individuals, and the combined measure is 0.505. This is an example of how the combined similarity increases as more aspects are considered.
The locatedIn property contributes a significant amount of information content, having only 7 individuals in the lcaintersection. That is, very few wines share the same location (or the same class of location) as these two wines. This is an example of the inverse dependence of our metric on subjectset cardinality. A similarity measure that does not take the locatedIn property into consideration misses the significance of the relatively rare value for this property. Combining the different aspects gives a total similarity of 0.698, while considering taxonomy only gives 0.395.
The individuals ChateauMorgon and LaneTannerPinotNoir have the following measures:
The individuals StonleighSauvignonBlanc and CongressSemillon have the following measures:
The two tables above show that even though adding aspects increases the absolute values of similarity, the relative similarity between two pairs of individuals may change as more information is considered. In one case, adding aspects increased similarity from 0.395 to 0.869. In the other case, similarity increased from 0.663 to 0.739.
The individuals ChateauChevalBlancStEmilion and WhitehallLanePrimavera have the following measures:
Conceptset similarity is 0.255 since there are relatively many individuals in the corresponding lcaintersection (inverse dependence on conceptset cardinality). In this example, there is no similarity for the aspect hasColor, since WhitehallLanePrimavera does not have a color (minimum similarity under disjointness). The lcaintersection of hasDescriptor has the exact same set as the lcaintersection associated with concept similarity. Thus, the combined similarity of the three aspects remains 0.255. The lcaintersection of locatedIn has 79 individuals, but intersecting it with the other lcaintersections gives 50 individuals. The combined similarity is 0.265.
Our results show that there are a number of cases where considering the relationships between individuals, based on different properties, is very useful while calculating the similarity between the individuals. Our proposed similarity metric is able to capture a number of useful features about the different aspects, and give a combined value of similarity for the different aspects.
Advantageously, as described above, based on an informationtheoretic definition of similarity between classes in a taxonomy, we developed a method to measure the similarity between individuals. Description and commonality are computed using an extended taxonomy which includes virtual classes, and which is obtained by reorganizing domain individuals of a property according to the range'"'"'s subtaxonomy and vice versa. We also proved that our metric satisfies a number of useful properties of similarity between individuals. We considered object properties, though our algorithm can be extended to data type properties using an appropriate definition of similarity between literals.
Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope or spirit of the invention.