Document comparison using multiple similarity measures

US 7,472,121 B2
Filed: 12/15/2005
Issued: 12/30/2008
Est. Priority Date: 12/15/2005
Status: Expired due to Fees

Abstract

Disclosed herein is a method for comparing documents. The method includes the steps of: determining a plurality of similarity measures; and determining an overall similarity measure for the plurality of documents, based on the plurality of similarity measures. In one embodiment, the similarity measures are chosen from the group of similarity measures consisting of semantic and reference similarity measures. When comparing documents from the chemical, biochemical or pharmaceutical domains, the determination of the similarity utilizes a determination of structural similarity of the chemical formulas described in the plurality of documents.
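
The abstract does not say how the individual measures are combined. As a minimal sketch, assuming the reference similarity of claim 6 (the inverse of the minimum path length in a reference graph) and a weighted average for the weighting step of claim 3, the two could be put together as follows. The function names, the undirected treatment of the graph, the values returned for identical or unconnected documents, the semantic score of 0.8, and the example weights are all illustrative assumptions, not taken from the patent.

```python
from collections import deque


def reference_similarity(references, a, b):
    """Inverse of the minimum path length between documents a and b.

    `references` maps a document id to the ids it cites. Edges are treated
    as undirected (assumption). Returns 1.0 when a == b and 0.0 when no
    path exists (both conventions are assumptions).
    """
    if a == b:
        return 1.0
    # Build an undirected adjacency map from the citation lists.
    adjacent = {}
    for source, targets in references.items():
        for target in targets:
            adjacent.setdefault(source, set()).add(target)
            adjacent.setdefault(target, set()).add(source)
    # Breadth-first search finds the shortest path in an unweighted graph.
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in adjacent.get(node, ()):
            if neighbour == b:
                return 1.0 / (dist + 1)
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return 0.0


def overall_similarity(measures, weights):
    """Weighted average of the named measures (hypothetical combination rule)."""
    total_weight = sum(weights[name] for name in measures)
    return sum(weights[name] * value for name, value in measures.items()) / total_weight


# P1 cites P2 and P2 cites P3, so P1 and P3 are two hops apart.
citations = {"P1": ["P2"], "P2": ["P3"], "P3": []}
ref = reference_similarity(citations, "P1", "P3")
score = overall_similarity(
    {"semantic": 0.8, "reference": ref},      # 0.8 is a made-up semantic score
    {"semantic": 0.6, "reference": 0.4},      # hypothetical weights
)
print(ref, score)
```

A weighted average keeps the overall score in the same range as its inputs; any other monotone combination would serve the same illustrative purpose.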

9 Claims

1. A method of comparing a plurality of documents, said method comprising:
- determining a plurality of similarity measures for said plurality of documents; and
  
  determining an overall similarity measure for said plurality of documents, based on said plurality of similarity measures, wherein said plurality of similarity measures are chosen from the group of similarity measures comprising:
  
  a semantic similarity measure based on similarity of terms contained in said plurality of documents;
  
  a structural similarity measure based on the structures of chemical terms described in said plurality of documents; and
  
  a reference similarity measure based on references contained in said plurality of documents;
  
  wherein said plurality of similarity measures include a semantic similarity measure and a reference similarity measure, and further wherein two documents $P_i$ and $P_j$ being compared are represented by corresponding vectors:
  
  $d_i = \{w_{i1}, w_{i2}, \ldots, w_{in}\}$, where $w_{ik}$ ($1 \le k \le n$) is a non-negative value denoting the weight of the term k in the document i, and $d_j = \{w_{j1}, w_{j2}, \ldots, w_{jn}\}$, where $w_{jl}$ ($1 \le l \le n$) is a non-negative value denoting the weight of the term l in the document j, and the semantic similarity measure SemSim( ) is given by the equation:
  
  $SemSim (P_{i}, P_{j}) = \frac{d_{i} \cdot d_{j}}{\sqrt{d_{i} \cdot d_{i}} \sqrt{d_{j} \cdot d_{j}}},$
  
  where $d_i \cdot d_j$ is the dot product between the vectors, calculated as $\sum_{k=0}^{n} \sum_{l=0}^{n} w_{ik} \cdot w_{jl}$;
  
  and wherein if terms k and l belong to ontology classes Ck and Cl, respectively, $w_{ik} \cdot w_{jl}$ is calculated as:
  
  $tf_{ik} \times tf_{jl} \times idf(Ck) \times idf(Cl) \times sim(Ck, Cl) \times WT_{ont},$
  
  wherein $tf_{ik}$ is the frequency of the term k in the document i, $tf_{jl}$ is the frequency of the term l in the document j, idf(Ck) represents the inverse document frequency of class Ck in document corpus of document i and is calculated as:
  
  $\log_{2}(N / n_{Ck}),$ where N is the number of documents in a collection, and $n_{Ck}$ is the number of documents in which a term of class Ck occurs at least once, idf(Cl) represents the inverse document frequency of class Cl in document corpus of document j and is calculated as:
  
  $\log_{2}(N / n_{Cl}),$ where N is the number of documents in a collection, and $n_{Cl}$ is the number of documents in which a term of class Cl occurs at least once, sim(Ck, Cl) is calculated as:
  
  $sim (Ck, Cl) = \frac{2 \times [\ln (p_{\min})]}{\ln (p_{Ck}) + \ln (p_{Cl})},$ where $p_{Ck}$ is the probability of encountering the class Ck or a child of the term in a given taxonomy, $p_{Cl}$ is the probability of encountering the class Cl or a child of the term in a given taxonomy and $p_{\min}$ is the minimum probability among common ancestors of classes Ck and Cl, and $WT_{ont}$ is a predefined constant in the range $0 < WT_{ont} < 1$.

2. The method according to claim 1, wherein said plurality of similarity measures include each of a semantic, a structural, and a reference similarity measure.

3. The method according to claim 1, further comprising:
- weighting the plurality of similarity measures to determine said overall similarity measure.

4. The method according to claim 1, wherein said documents being compared comprise chemical documents, and one of said similarity measures is a semantic similarity measure, wherein determination of said semantic similarity measure utilizes a determination of chemical structural similarity of said terms contained in said plurality of documents.

5. The method according to claim 4, wherein said terms contained in said plurality of documents are associated with chemical substructures represented as strings.

6. The method according to claim 1, wherein said reference similarity measure is determined utilizing a reference graph, each of said plurality of documents associated with a corresponding node in said reference graph, said reference similarity measure comprising the inverse of the minimum path length between nodes in the reference graph associated with said plurality of documents.

7. The method according to claim 1, wherein if terms k and l are the same, $w_{ik} \cdot w_{jl}$ is calculated as:

8. The method according to claim 1, further comprising:
- determining said terms contained in said plurality of documents by utilizing at least one of an ontology, a taxonomy, and a dictionary.

9. A computer program product having a computer readable storage medium having a computer program recorded therein for comparing documents, said computer program product comprising a method comprising:
- determining a plurality of similarity measures for said plurality of documents; and
  
  determining an overall similarity measure for said plurality of documents, based on said plurality of similarity measures, wherein said plurality of similarity measures are chosen from the group of similarity measures comprising:
  
  a semantic similarity measure based on similarity of terms contained in said plurality of documents;
  
  a structural similarity measure based on the structures of chemical terms described in said plurality of documents; and
  
  a reference similarity measure based on references contained in said plurality of documents;
  
  wherein said plurality of similarity measures include a semantic similarity measure and a reference similarity measure, and further wherein two documents $P_i$ and $P_j$ being compared are represented by corresponding vectors:
  
  $d_i = \{w_{i1}, w_{i2}, \ldots, w_{in}\}$, where $w_{ik}$ ($1 \le k \le n$) is a non-negative value denoting the weight of the term k in the document i, and $d_j = \{w_{j1}, w_{j2}, \ldots, w_{jn}\}$, where $w_{jl}$ ($1 \le l \le n$) is a non-negative value denoting the weight of the term l in the document j, and the semantic similarity measure SemSim( ) is given by the equation:
  
  $SemSim (P_{i}, P_{j}) = \frac{d_{i} \cdot d_{j}}{\sqrt{d_{i} \cdot d_{i}} \sqrt{d_{j} \cdot d_{j}}},$
  
  where $d_i \cdot d_j$ is the dot product between the vectors, calculated as $\sum_{k=0}^{n} \sum_{l=0}^{n} w_{ik} \cdot w_{jl}$;
  
  and wherein if terms k and l belong to ontology classes Ck and Cl, respectively, $w_{ik} \cdot w_{jl}$ is calculated as:
  
  $tf_{ik} \times tf_{jl} \times idf(Ck) \times idf(Cl) \times sim(Ck, Cl) \times WT_{ont},$
  
  wherein $tf_{ik}$ is the frequency of the term k in the document i, $tf_{jl}$ is the frequency of the term l in the document j, idf(Ck) represents the inverse document frequency of class Ck in document corpus of document i and is calculated as:
  
  $\log_{2}(N / n_{Ck}),$ where N is the number of documents in a collection, and $n_{Ck}$ is the number of documents in which a term of class Ck occurs at least once, idf(Cl) represents the inverse document frequency of class Cl in document corpus of document j and is calculated as:
  
  $\log_{2}(N / n_{Cl}),$ where N is the number of documents in a collection, and $n_{Cl}$ is the number of documents in which a term of class Cl occurs at least once, sim(Ck, Cl) is calculated as:
  
  $sim (Ck, Cl) = \frac{2 \times [\ln (p_{\min})]}{\ln (p_{Ck}) + \ln (p_{Cl})},$ where $p_{Ck}$ is the probability of encountering the class Ck or a child of the term in a given taxonomy, $p_{Cl}$ is the probability of encountering the class Cl or a child of the term in a given taxonomy and $p_{\min}$ is the minimum probability among common ancestors of classes Ck and Cl, and $WT_{ont}$ is a predefined constant in the range $0 < WT_{ont} < 1$.
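
The semantic similarity recited in the independent claims can be illustrated with a short sketch. It is only one reading of the claim language, under several assumptions: every term is mapped to an ontology class, the ontology-based weight is applied to every term pair (including identical terms, which the claims treat separately), a class counts as its own common ancestor, and the toy corpus counts, class probabilities, and the value of `WT_ONT` are invented for illustration. All function and variable names are hypothetical.

```python
import math


def idf(n_docs, n_docs_with_class):
    """idf(C) = log2(N / n_C), as recited in the claims."""
    return math.log2(n_docs / n_docs_with_class)


def class_sim(p_ck, p_cl, p_min):
    """sim(Ck, Cl) = 2 * ln(p_min) / (ln(p_Ck) + ln(p_Cl))."""
    denom = math.log(p_ck) + math.log(p_cl)
    if denom == 0.0:          # both probabilities are 1; returning 1.0 is an assumption
        return 1.0
    return 2.0 * math.log(p_min) / denom


def dot(doc_a, doc_b, term_class, class_idf, class_prob, p_min, wt_ont):
    """Double sum over term pairs of tf * tf * idf * idf * sim * WT_ont."""
    total = 0.0
    for k, tf_k in doc_a.items():
        for l, tf_l in doc_b.items():
            ck, cl = term_class[k], term_class[l]
            sim = class_sim(class_prob[ck], class_prob[cl], p_min(ck, cl))
            total += tf_k * tf_l * class_idf[ck] * class_idf[cl] * sim * wt_ont
    return total


def sem_sim(doc_i, doc_j, *ctx):
    """Dot product normalised by the square roots of the self dot products."""
    denom = math.sqrt(dot(doc_i, doc_i, *ctx)) * math.sqrt(dot(doc_j, doc_j, *ctx))
    return dot(doc_i, doc_j, *ctx) / denom if denom else 0.0


# Toy data (all values invented): documents are term -> frequency maps.
doc_a = {"ethanol": 2, "acetone": 1}
doc_b = {"methanol": 1}
term_class = {"ethanol": "alcohol", "methanol": "alcohol", "acetone": "ketone"}
class_idf = {"alcohol": idf(8, 2), "ketone": idf(8, 4)}
class_prob = {"alcohol": 0.2, "ketone": 0.1}
WT_ONT = 0.5


def p_min(ck, cl):
    # Same class: the class itself (assumption). Otherwise a single shared
    # ancestor with probability 0.5 is assumed.
    return class_prob[ck] if ck == cl else 0.5


ctx = (term_class, class_idf, class_prob, p_min, WT_ONT)
print(sem_sim(doc_a, doc_b, *ctx))
```

Because this sketch applies `WT_ONT` to every pair, the constant cancels in the normalised ratio; it would only affect the result if some pairs were weighted by a different rule.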

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Kothari, Ravi; Mukherjea, Sougata
Primary Examiner(s): Mofiz, Apu
Assistant Examiner(s): Padmanabhan, Kavita
Application Number: US11/304,029
Publication Number: US 20070143322A1
Time in Patent Office: 1,111 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes: G06F 40/194 Calculation of difference b...
