Software and method for recognizing similarity of documents written in different languages based on a quantitative measure of similarity

US 6,519,557 B1
Filed: 06/06/2000
Issued: 02/11/2003
Est. Priority Date: 06/06/2000
Status: Expired due to Term


Abstract

A system for identifying different language versions of the same structured format document (e.g., an HTML web page) detects the language of the two candidate documents and, if necessary, translates one or both into a preferred language. It then parses the two documents and builds a hierarchical data structure for each. The data structures are used to compare the hierarchical structure of the two documents and also to access text portions in congruent positions in the two documents. A fuzzy measure of similarity of a set of text portions occupying congruent positions in the two documents is then obtained, to induce a measure of the similarity of the two documents, which is compared to a fuzzy threshold.

16 Claims

1. A method of identifying different versions of the same structured document comprising steps of:
- reading a first file including text;
  
  reading a second file including text;
  
  generating from the first file a first hierarchical structured document using formatting codes and leaf content in the first file, wherein the first hierarchical document includes the text of the first file;
  
  generating from the second file a second hierarchical structured document using formatting codes and leaf content in the second file, wherein the second hierarchical document includes the text of the second file;
  
  reading a first portion of text which occupies a first position in the first hierarchical structured document;
  
  reading a second portion of text which occupies a second position which is congruent to the first position in the second hierarchical structured document; and
  
  obtaining a quantitative measure of similarity of the first and the second portions of text.
2. A method according to claim 1 further comprising steps of:
3. A method according to claim 2 further comprising steps of:
- adjusting a measure of similarity of the first and the second hierarchical structured documents according to the quantitative measure of similarity of the first and the second portions of text; and
  
  comparing the measure of similarity of the first and the second hierarchical structured documents to a bound.
4. A method according to claim 3 further comprising steps of:
- reading the first hierarchical data structure to identify a first set of children of a first node, reading the second hierarchical data structure to identify a second set of children of second node which occupies a position congruent to the first node in the first hierarchical data structure; and
  
  comparing the first set of children to the second set of children to obtain a quantitative measure of the degree of match.
5. A method according to claim 4 further comprising steps of:
- adjusting a measure representing a degree of match of the first and the second structured hierarchical documents in accordance with the quantitative measure of the degree of match; and
  
  comparing the measure representing the degree of match of the first and the second hierarchical documents with a threshold value.
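
The steps of claims 1 and 3 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes well-nested HTML without void elements, assumes both files are already in the same language, and uses `difflib.SequenceMatcher` as a stand-in similarity measure. The names `Node`, `TreeBuilder`, `text_by_position`, `document_similarity`, and the value of `BOUND` are all hypothetical.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from html.parser import HTMLParser


@dataclass
class Node:
    tag: str                      # formatting code, or "#text" for leaf content
    text: str = ""
    children: list = field(default_factory=list)


class TreeBuilder(HTMLParser):
    """Builds a Node tree: tags become internal nodes, text becomes leaves."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("#text", data.strip()))


def parse(html: str) -> Node:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def text_by_position(node, path=()):
    """Yield (path, text) per text leaf; a path is a tuple of child indices."""
    if node.tag == "#text":
        yield path, node.text
    for i, child in enumerate(node.children):
        yield from text_by_position(child, path + (i,))


def document_similarity(html_a: str, html_b: str) -> float:
    """Average text similarity over positions; unmatched positions score 0."""
    a = dict(text_by_position(parse(html_a)))
    b = dict(text_by_position(parse(html_b)))
    positions = a.keys() | b.keys()
    if not positions:
        return 0.0
    shared = a.keys() & b.keys()   # equal paths are treated as congruent
    total = sum(SequenceMatcher(None, a[p], b[p]).ratio() for p in shared)
    return total / len(positions)


BOUND = 0.8  # assumed value; the claims do not specify a bound

v1 = "<html><body><h1>Welcome</h1><p>Hello world</p></body></html>"
v2 = "<html><body><h1>Welcome</h1><p>Hello there world</p></body></html>"
other = "<html><body><h1>Prices</h1><p>Call us today</p></body></html>"

print(document_similarity(v1, v2) >= BOUND)
print(document_similarity(v1, other) >= BOUND)
```

Here a position is the path of child indices from the root, so two text portions are congruent when their paths are equal. Other definitions of congruence are possible.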

6. A computer readable medium containing programming instructions for identifying different versions of the same structured document including programming instructions for:
- reading a first file including text;
  
  reading a second file including text;
  
  generating from the first file a first hierarchical structured document using formatting codes and leaf content in the first file, wherein the first hierarchical document includes the text of the first file;
  
  generating from the second file a second hierarchical structured document using formatting codes and leaf content in the second file, wherein the second hierarchical document includes the text of the second file;
  
  reading a first portion of text which occupies a first position in the first hierarchical structured document;
  
  reading a second portion of text which occupies a second position which is congruent to the first position in the second hierarchical structured document; and
  
  obtaining a quantitative measure of similarity of the first and the second portions of text.
7. A computer readable medium according to claim 6 further comprising programming instructions for:
8. A computer readable medium according to claim 7 further comprising programming instructions for:
- adjusting a measure of similarity of the first and the second hierarchical structured documents according to the quantitative measure of similarity of the first and the second portions of text; and
  
  comparing the measure of similarity of the first and the second hierarchical structured documents to a bound.
9. A computer readable medium according to claim 8 further comprising programming instructions for:
- reading the first hierarchical data structure to identify a first set of children of a first node, reading the second hierarchical data structure to identify a second set of children of second node which occupies a position congruent to the first node in the first hierarchical data structure; and
  
  comparing the first set of children to the second set of children to obtain a quantitative measure of the degree of match.
10. A computer readable medium according to claim 9 further comprising programming instructions for:
- adjusting a measure representing a degree of match of the first and the second structured hierarchical documents in accordance with the quantitative measure of the degree of match; and
  
  comparing the measure representing the degree of match of the first and the second hierarchical documents with a threshold value.
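
The child-set comparison recited in claims 9 and 10 (mirroring claims 4 and 5) might be sketched as follows in Python. This is an illustration under stated assumptions, not the patent's method: the degree of match is taken to be the multiset overlap of child tags, the document-level measure is a plain average over congruent node pairs, and `Node`, `child_match`, `structure_match`, and `THRESHOLD` are hypothetical names and values.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


def child_match(a: Node, b: Node) -> float:
    """Degree of match between the children of two congruent nodes, in [0, 1].

    Assumed metric: size of the multiset intersection of child tags divided
    by the size of their multiset union.
    """
    ca = Counter(c.tag for c in a.children)
    cb = Counter(c.tag for c in b.children)
    total = sum((ca | cb).values())
    if total == 0:
        return 1.0  # neither node has children
    return sum((ca & cb).values()) / total


def structure_match(a: Node, b: Node) -> float:
    """Average child_match over congruent node pairs that have children."""
    scores = []
    stack = [(a, b)]
    while stack:
        x, y = stack.pop()
        if x.children or y.children:
            scores.append(child_match(x, y))
        # Children at the same index are treated as congruent.
        stack.extend(zip(x.children, y.children))
    return sum(scores) / len(scores) if scores else 1.0


THRESHOLD = 0.9  # assumed value; the claims do not specify one

page = Node("html", [Node("body", [Node("h1"), Node("p"), Node("p")])])
same = Node("html", [Node("body", [Node("h1"), Node("p"), Node("p")])])
diff = Node("html", [Node("body", [Node("table"), Node("ul")])])

print(structure_match(page, same) >= THRESHOLD)
print(structure_match(page, diff) >= THRESHOLD)
```

Averaging is only one way to adjust the document-level measure in accordance with each per-node score; a weighted or fuzzy aggregation would fit the claim language equally well.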

11. A system for identifying different versions of the same structured document comprising:
- means for reading a first file including text;
  
  means for reading a second file including text;
  
  means for generating from the first file a first hierarchical structured document using formatting codes and leaf content in the first file, wherein the first hierarchical document includes the text of the first file;
  
  means for generating from the second file a second hierarchical structured document using formatting codes and leaf content in the second file, wherein the second hierarchical document includes the text of the second file;
  
  means for reading a first portion of text which occupies a first position in the first hierarchical structured document;
  
  means for reading a second portion of text which occupies a second position which is congruent to the first position in the second hierarchical structured document; and
  
  means for obtaining a quantitative measure of similarity of the first and the second portions of text.

12. A method of identifying different versions of the same structured document comprising:
- reading a first file and a second file, wherein the first file and the second file include language data;
  
  generating from the first file a first hierarchical structured document using formatting codes and leaf content in the first file, wherein the first hierarchical structured document includes the language data of the first file;
  
  generating for the second file a second hierarchical structured document using formatting codes and leaf content in the second file, wherein the second hierarchical structured document includes the language data of the second file;
  
  comparing the hierarchical structure of the first hierarchical structured document with the hierarchical structure of the second hierarchical structured document; and
  
  calculating a quantitative measure of similarity between the hierarchical structure of the first hierarchical structured document and the hierarchical structure of the second hierarchical structured document.
13. The method of claim 12, further comprising:
14. The method of claim 13, further comprising:
- determining whether the language data of the second hierarchical structured document is in the preferred format; and
  
  wherein if the language data of the second hierarchical structured document is not in the preferred format, then transforming the language data of the second hierarchical structured document into the preferred format.
15. The method of claim 14, further comprising:
- reading a first portion of language data which occupies a first position in the first hierarchical structured document;
  
  reading a second portion of language data which occupies a second position in the second hierarchical structured document wherein the second position is congruent to the first position in the first hierarchical structured document; and
  
  calculating a quantitative measure of similarity of the first and the second portions of language data.
16. The method of claim 12, further comprising:
- wherein if the hierarchical structure of the first hierarchical structured document and the hierarchical structure of the second hierarchical structured document are substantially similar, then;
  
  reading a first portion of language data which occupies a first position in the first hierarchical structured document;
  
  reading a second portion of language data which occupies a second position in the second hierarchical structured document, wherein the second position is congruent to the first position in the first hierarchical structured document; and
  
  calculating a quantitative measure of similarity of the first and the second portions of language data.
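
The flow of claims 12 to 16 (compare structure first, transform language data into a preferred format, then compare congruent language data) could look roughly like the Python sketch below. It assumes well-formed XHTML input so that `xml.etree.ElementTree` can parse it. The toy `GLOSSARY` and `to_preferred` function stand in for a real translation step, and `shape`, `compare`, and the `structure_bound` default are hypothetical.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# Toy Spanish-to-English glossary standing in for a real translation step.
GLOSSARY = {"hola": "hello", "mundo": "world", "bienvenido": "welcome"}


def to_preferred(text: str) -> str:
    """Hypothetical transform into the preferred language (English here).

    Words missing from the glossary pass through unchanged, so text that is
    already in the preferred language is only lowercased.
    """
    return " ".join(GLOSSARY.get(w, w) for w in text.lower().split())


def shape(elem) -> list:
    """Flatten the tag hierarchy into a list of (depth, tag) pairs."""
    out = []

    def walk(e, depth):
        out.append((depth, e.tag))
        for child in e:
            walk(child, depth + 1)

    walk(elem, 0)
    return out


def compare(xml_a: str, xml_b: str, structure_bound: float = 0.9):
    """Return (structure_score, text_score); text_score is None if gated out."""
    a, b = ET.fromstring(xml_a), ET.fromstring(xml_b)
    structure = SequenceMatcher(None, shape(a), shape(b)).ratio()
    if structure < structure_bound:
        return structure, None  # structures differ too much to compare text
    # Elements visited in the same document order are treated as congruent.
    pairs = [(x.text or "", y.text or "") for x, y in zip(a.iter(), b.iter())]
    pairs = [(to_preferred(s), to_preferred(t))
             for s, t in pairs if s.strip() or t.strip()]
    if not pairs:
        return structure, 1.0
    scores = [SequenceMatcher(None, s, t).ratio() for s, t in pairs]
    return structure, sum(scores) / len(scores)


en = "<html><body><h1>Welcome</h1><p>Hello world</p></body></html>"
es = "<html><body><h1>Bienvenido</h1><p>Hola mundo</p></body></html>"
menu = "<html><body><ul><li>a</li><li>b</li><li>c</li></ul></body></html>"

print(compare(en, es))
print(compare(en, menu))
```

Gating the text comparison on the structural score follows claim 16; the word-for-word glossary is only a placeholder for whatever transformation an actual system would use.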

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: International Business Machines Corporation
Inventors: Yim, Peter Chi-Shing; Kraft, Reiner; Emens, Michael L.
Primary Examiner(s): EDOUARD, PATRICK NESTOR
Application Number: US09/588,250
Time in Patent Office: 980 Days
Field of Search: 704/2, 704/3, 704/4, 704/5, 704/6, 704/7, 704/8, 707/500, 707/501, 707/513, 707/514, 707/531, 707/536
US Class Current: 704/8
CPC Class Codes:
G06F 40/194 Calculation of difference b...
G06F 40/58 Use of machine translation,...
