Document comparison using multiple similarity measures
First Claim
1. A method of comparing a plurality of documents, said method comprising:
determining a plurality of similarity measures for said plurality of documents; and
determining an overall similarity measure for said plurality of documents, based on said plurality of similarity measures,
wherein said plurality of similarity measures are chosen from the group of similarity measures comprising:
a semantic similarity measure based on similarity of terms contained in said plurality of documents;
a structural similarity measure based on the structures of chemical terms described in said plurality of documents; and
a reference similarity measure based on references contained in said plurality of documents;
wherein said plurality of similarity measures include a semantic similarity measure and a reference similarity measure, and
further wherein two documents $P_i$ and $P_j$ being compared are represented by corresponding vectors:
$d_i = \{w_{i1}, w_{i2}, \ldots, w_{in}\}$, where $w_{ik}$ ($1 < k < n$) is a non-negative value denoting the weight of the term $k$ in the document $i$, and
$d_j = \{w_{j1}, w_{j2}, \ldots, w_{jn}\}$, where $w_{il}$ ($1 < l < n$) is a non-negative value denoting the weight of the term $l$ in the document $j$,
and the semantic similarity measure SemSim( ) is given by the equation:
$$\mathrm{SemSim}(P_i, P_j) = \frac{d_i \cdot d_j}{\sqrt{d_i \cdot d_j}\,\sqrt{d_j \cdot d_j}},$$
in which $d_i \cdot d_j$ is the dot product between the vectors, calculated as
$$\sum_{k=0}^{n} \sum_{l=0}^{n} w_{ik} \cdot w_{il};$$
and wherein if terms $k$ and $l$ belong to ontology classes $C_k$ and $C_l$, respectively, $w_{ik} \cdot w_{il}$ is calculated as:
$$tf_{ik} \times tf_{jl} \times idf(C_k) \times idf(C_l) \times sim(C_k, C_l) \times WT_{ont},$$
wherein $tf_{ik}$ is the frequency of the term $k$ in the document $i$, $tf_{jl}$ is the frequency of the term $l$ in the document $j$,
$idf(C_k)$ represents the inverse document frequency of class $C_k$ in document corpus of document $i$ and is calculated as:
$$\log_2 (N / n_{C_k}),$$
where $N$ is the number of documents in a collection, and $n_{C_k}$ is the number of documents in which a term of class $C_k$ occurs at least once,
$idf(C_l)$ represents the inverse document frequency of class $C_l$ in document corpus of document $j$ and is calculated as:
$$\log_2 (N / n_{C_l}),$$
where $N$ is the number of documents in a collection, and $n_{C_l}$ is the number of documents in which a term of class $C_l$ occurs at least once,
$sim(C_k, C_l)$ is calculated as:
$$sim(C_k, C_l) = \frac{2 \times \ln(p_{min})}{\ln(p_{C_k}) + \ln(p_{C_l})},$$
where $p_{C_k}$ is the probability of encountering the class $C_k$ or a child of the term in a given taxonomy, $p_{C_l}$ is the probability of encountering the class $C_l$ or a child of the term in a given taxonomy and $p_{min}$ is the minimum probability among common ancestors of classes $C_k$ and $C_l$, and
$WT_{ont}$ is a predefined constant in the range $0 < WT_{ont} < 1$.
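The per-pair weight recited in the claim can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the function names (`idf`, `class_similarity`, `pair_weight`), the toy document counts and class probabilities, and the value picked for WTont are all hypothetical, and the sim(Ck, Cl) expression is taken as printed in claim 9.

```python
import math


def idf(num_docs: int, docs_with_class: int) -> float:
    """Inverse document frequency of an ontology class: log2(N / n_C)."""
    return math.log2(num_docs / docs_with_class)


def class_similarity(p_ck: float, p_cl: float, p_min: float) -> float:
    """sim(Ck, Cl) = 2 * ln(p_min) / (ln(p_Ck) + ln(p_Cl)), as printed in claim 9."""
    return 2.0 * math.log(p_min) / (math.log(p_ck) + math.log(p_cl))


def pair_weight(tf_ik: float, tf_jl: float, idf_ck: float, idf_cl: float,
                sim_kl: float, wt_ont: float) -> float:
    """Contribution of one (k, l) term pair: tf * tf * idf * idf * sim * WTont."""
    return tf_ik * tf_jl * idf_ck * idf_cl * sim_kl * wt_ont


# Toy inputs, all assumed for illustration only.
idf_ck = idf(num_docs=8, docs_with_class=2)   # class Ck appears in 2 of 8 documents
idf_cl = idf(num_docs=8, docs_with_class=4)   # class Cl appears in 4 of 8 documents
sim_kl = class_similarity(p_ck=0.25, p_cl=0.25, p_min=0.5)
weight = pair_weight(3, 2, idf_ck, idf_cl, sim_kl, wt_ont=0.5)
print(idf_ck, idf_cl, sim_kl, weight)
```

Note that `class_similarity` divides by zero when both class probabilities equal 1 (for example, two root classes), so a real implementation would need to decide how to handle that case.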
1 Assignment
0 Petitions
Abstract
Disclosed herein is a method for comparing documents. The method includes the steps of: determining a plurality of similarity measures; and determining an overall similarity measure for the plurality of documents, based on the plurality of similarity measures. In one embodiment, the similarity measures are chosen from the group of similarity measures consisting of semantic and reference similarity measures. When comparing documents from the chemical, biochemical or pharmaceutical domains, the determination of the similarity utilizes a determination of structural similarity of the chemical formulas described in the plurality of documents.
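The abstract does not say how the individual measures are merged into the overall similarity measure. One simple possibility, assumed here purely for illustration, is a weighted average. The function name `overall_similarity`, the measure names, and the weights below are hypothetical and do not come from the patent.

```python
from typing import Mapping


def overall_similarity(measures: Mapping[str, float],
                       weights: Mapping[str, float]) -> float:
    """Weighted average of the individual similarity measures (an assumed rule)."""
    total_weight = sum(weights[name] for name in measures)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(measures[name] * weights[name] for name in measures)
    return weighted_sum / total_weight


# Hypothetical scores for one document pair.
scores = {"semantic": 0.5, "reference": 0.25}
weights = {"semantic": 3.0, "reference": 1.0}
combined = overall_similarity(scores, weights)
print(combined)
```

A structural similarity score for chemical documents could be added as a third key without changing the function.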
39 Citations
9 Claims
wherein tfik is the frequency of the term k in the document i, tfjl is the frequency of the term l in the document j, N is the number of documents in a collection, nk is the number of documents in which the term k occurs at least once, and nl is the number of documents in which the term l occurs at least once.
8. The method according to claim 1, further comprising:
determining said terms contained in said plurality of documents by utilizing at least one of an ontology, a taxonomy, and a dictionary.
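Claim 8 leaves the term-determination step open. A minimal sketch, assuming a dictionary that maps single-word terms to ontology classes and a naive lowercase tokenizer, might look like this. The names `extract_terms` and `class_frequencies` and the sample dictionary are hypothetical; multi-word terms and chemical nomenclature would need a more capable matcher.

```python
import re
from collections import Counter
from typing import Mapping


def extract_terms(text: str, dictionary: Mapping[str, str]) -> Counter:
    """Count how often each dictionary term occurs in the text (term frequency)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tok for tok in tokens if tok in dictionary)


def class_frequencies(term_counts: Mapping[str, int],
                      dictionary: Mapping[str, str]) -> Counter:
    """Aggregate term counts by the ontology class each term belongs to."""
    classes: Counter = Counter()
    for term, count in term_counts.items():
        classes[dictionary[term]] += count
    return classes


# Hypothetical term-to-class dictionary and sample sentence.
dictionary = {"aspirin": "Analgesic", "ibuprofen": "Analgesic", "ethanol": "Alcohol"}
term_counts = extract_terms("Aspirin and ibuprofen; aspirin dissolves in ethanol.", dictionary)
class_counts = class_frequencies(term_counts, dictionary)
print(term_counts, class_counts)
```

The resulting term counts are the kind of frequency values that the tf factors in claim 1 refer to.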
9. A computer program product having a computer readable storage medium having a computer program recorded therein for comparing documents, said computer program product comprising a method comprising:
determining a plurality of similarity measures for said plurality of documents; and
determining an overall similarity measure for said plurality of documents, based on said plurality of similarity measures,
wherein said plurality of similarity measures are chosen from the group of similarity measures comprising:
a semantic similarity measure based on similarity of terms contained in said plurality of documents;
a structural similarity measure based on the structures of chemical terms described in said plurality of documents; and
a reference similarity measure based on references contained in said plurality of documents;
wherein said plurality of similarity measures include a semantic similarity measure and a reference similarity measure, and
further wherein two documents $P_i$ and $P_j$ being compared are represented by corresponding vectors:
$d_i = \{w_{i1}, w_{i2}, \ldots, w_{in}\}$, where $w_{ik}$ ($1 < k < n$) is a non-negative value denoting the weight of the term $k$ in the document $i$, and
$d_j = \{w_{j1}, w_{j2}, \ldots, w_{jn}\}$, where $w_{il}$ ($1 < l < n$) is a non-negative value denoting the weight of the term $l$ in the document $j$,
and the semantic similarity measure SemSim( ) is given by the equation:
$$\mathrm{SemSim}(P_i, P_j) = \frac{d_i \cdot d_j}{\sqrt{d_i \cdot d_j}\,\sqrt{d_j \cdot d_j}},$$
in which $d_i \cdot d_j$ is the dot product between the vectors, calculated as
$$\sum_{k=0}^{n} \sum_{l=0}^{n} w_{ik} \cdot w_{il};$$
and wherein if terms $k$ and $l$ belong to ontology classes $C_k$ and $C_l$, respectively, $w_{ik} \cdot w_{il}$ is calculated as:
$$tf_{ik} \times tf_{jl} \times idf(C_k) \times idf(C_l) \times sim(C_k, C_l) \times WT_{ont},$$
wherein $tf_{ik}$ is the frequency of the term $k$ in the document $i$, $tf_{jl}$ is the frequency of the term $l$ in the document $j$,
$idf(C_k)$ represents the inverse document frequency of class $C_k$ in document corpus of document $i$ and is calculated as:
$$\log_2 (N / n_{C_k}),$$
where $N$ is the number of documents in a collection, and $n_{C_k}$ is the number of documents in which a term of class $C_k$ occurs at least once,
$idf(C_l)$ represents the inverse document frequency of class $C_l$ in document corpus of document $j$ and is calculated as:
$$\log_2 (N / n_{C_l}),$$
where $N$ is the number of documents in a collection, and $n_{C_l}$ is the number of documents in which a term of class $C_l$ occurs at least once,
$sim(C_k, C_l)$ is calculated as:
$$sim(C_k, C_l) = \frac{2 \times \ln(p_{min})}{\ln(p_{C_k}) + \ln(p_{C_l})},$$
where $p_{C_k}$ is the probability of encountering the class $C_k$ or a child of the term in a given taxonomy, $p_{C_l}$ is the probability of encountering the class $C_l$ or a child of the term in a given taxonomy and $p_{min}$ is the minimum probability among common ancestors of classes $C_k$ and $C_l$, and
$WT_{ont}$ is a predefined constant in the range $0 < WT_{ont} < 1$.
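The double sum and the normalisation in SemSim can be sketched as below. Several assumptions are made and should not be read as the patent's own method: documents are held as sparse term-frequency maps, the sums run over the terms actually present, and the first root in the denominator uses the self-product of document i (the usual cosine form), whereas the claim as printed shows the cross product of i and j there. The toy `toy_pair_weight` also drops the idf and WTont factors for brevity, and every name and number is hypothetical.

```python
import math
from typing import Callable, Mapping

# (term k, tf of k, term l, tf of l) -> weight of the pair
PairWeight = Callable[[str, float, str, float], float]


def dot(a: Mapping[str, float], b: Mapping[str, float], pair_weight: PairWeight) -> float:
    """Double sum over every term k in a and every term l in b."""
    return sum(pair_weight(k, tf_k, l, tf_l)
               for k, tf_k in a.items()
               for l, tf_l in b.items())


def sem_sim(d_i: Mapping[str, float], d_j: Mapping[str, float],
            pair_weight: PairWeight) -> float:
    """Cosine-style ratio; ASSUMES sqrt(d_i.d_i) * sqrt(d_j.d_j) as the denominator."""
    denom = math.sqrt(dot(d_i, d_i, pair_weight)) * math.sqrt(dot(d_j, d_j, pair_weight))
    return dot(d_i, d_j, pair_weight) / denom if denom else 0.0


# Hypothetical ontology: term -> class, plus a symmetric class-similarity table.
TERM_CLASS = {"aspirin": "Analgesic", "ethanol": "Alcohol"}
CLASS_SIM = {("Analgesic", "Analgesic"): 1.0, ("Alcohol", "Alcohol"): 1.0,
             ("Analgesic", "Alcohol"): 0.5, ("Alcohol", "Analgesic"): 0.5}


def toy_pair_weight(k: str, tf_k: float, l: str, tf_l: float) -> float:
    """tf * tf * sim(Ck, Cl); idf and WTont are omitted (treated as 1) in this toy."""
    return tf_k * tf_l * CLASS_SIM[(TERM_CLASS[k], TERM_CLASS[l])]


doc_i = {"aspirin": 2.0}
doc_j = {"ethanol": 1.0}
cross = sem_sim(doc_i, doc_j, toy_pair_weight)
same = sem_sim(doc_i, doc_i, toy_pair_weight)
print(cross, same)
```

Because every term pair contributes, two documents with no terms in common can still receive a non-zero score when their terms fall in related ontology classes.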
Specification