Software and method for recognizing similarity of documents written in different languages based on a quantitative measure of similarity
Abstract
A system for identifying different language versions of the same structured format document (e.g., an HTML web page) detects the language of each of two candidate documents and, if necessary, translates one or both into a preferred language. The system parses the two documents and builds a hierarchical data structure for each. The data structures are used to compare the hierarchical structure of the two documents and to access text portions in congruent positions in the two documents. A fuzzy measure of similarity is then obtained for a set of text portions occupying congruent positions, yielding a measure of the similarity of the two documents that is compared to a fuzzy threshold.
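The parsing and congruent-position lookup described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the names Node, TreeBuilder, parse, text_at and portion_similarity are hypothetical, the input is assumed to be well-formed HTML in which every tag is closed, and difflib.SequenceMatcher from the Python standard library stands in for the fuzzy similarity measure. Language detection and translation are left out.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class Node:
    """One element of the hierarchical structured document (hypothetical type)."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []  # child Node objects, in document order
        self.text = ""      # leaf content found directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a Node tree from formatting codes (tags) and leaf content (text)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumes well-formed input: each end tag closes the innermost open element.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def text_at(node, path):
    """Return the text at a position given as a tuple of child indexes, or None."""
    for index in path:
        if index >= len(node.children):
            return None
        node = node.children[index]
    return node.text


def portion_similarity(a, b):
    """Similarity in [0, 1]; SequenceMatcher is a stand-in for the fuzzy measure."""
    return SequenceMatcher(None, a, b).ratio()


first = parse("<html><body><h1>Welcome</h1><p>Good morning</p></body></html>")
second = parse("<html><body><h1>Welcome</h1><p>Good evening</p></body></html>")
path = (0, 0, 1)  # html -> body -> second child of body (the <p>)
score = portion_similarity(text_at(first, path), text_at(second, path))
```

Using the same index path in both trees is one simple way to model "congruent positions"; a real system would also need to tolerate small structural differences between the two documents.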
16 Claims
-
1. A method of identifying different versions of the same structured document comprising steps of:
-
reading a first file including text;
reading a second file including text;
generating from the first file a first hierarchical structured document using formatting codes and leaf content in the first file, wherein the first hierarchical document includes the text of the first file;
generating from the second file a second hierarchical structured document using formatting codes and leaf content in the second file, wherein the second hierarchical document includes the text of the second file;
reading a first portion of text which occupies a first position in the first hierarchical structured document;
reading a second portion of text which occupies a second position which is congruent to the first position in the second hierarchical structured document; and
obtaining a quantitative measure of similarity of the first and the second portions of text.
-
-
2. A method according to claim 1 further comprising steps of:
-
detecting a first language of the first portion of text; and
translating the first portion of text into a second language prior to obtaining the quantitative measure of similarity of the first and the second portions of text.
-
-
3. A method according to claim 2 further comprising steps of:
-
adjusting a measure of similarity of the first and the second hierarchical structured documents according to the quantitative measure of similarity of the first and the second portions of text; and
comparing the measure of similarity of the first and the second hierarchical structured documents to a bound.
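The adjust-then-compare steps of claim 3 can be sketched as follows. This is only an illustration under stated assumptions: the claim does not say how the document-level measure is adjusted, so taking the minimum (a common fuzzy-logic AND) is an assumption, and the function names and the BOUND value are hypothetical.

```python
from difflib import SequenceMatcher


def portion_similarity(a, b):
    """Similarity in [0, 1] between two text portions (stand-in measure)."""
    return SequenceMatcher(None, a, b).ratio()


def document_similarity(pairs):
    """Fold per-portion scores into one document-level measure.

    Starts from full similarity (1.0) and lowers it with each pair of
    congruent text portions. Using min is an assumption, not the claimed rule.
    """
    measure = 1.0
    for first_text, second_text in pairs:
        measure = min(measure, portion_similarity(first_text, second_text))
    return measure


BOUND = 0.6  # hypothetical bound, chosen only for illustration

pairs = [("Welcome", "Welcome"), ("Good morning", "Good evening")]
measure = document_similarity(pairs)
same_document = measure >= BOUND
```

With min as the aggregation rule, a single badly matching portion is enough to pull the whole document below the bound; an average would be more forgiving.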
-
-
4. A method according to claim 3 further comprising steps of:
-
reading the first hierarchical data structure to identify a first set of children of a first node, reading the second hierarchical data structure to identify a second set of children of a second node which occupies a position congruent to the first node in the first hierarchical data structure; and
comparing the first set of children to the second set of children to obtain a quantitative measure of the degree of match.
-
-
5. A method according to claim 4 further comprising steps of:
-
adjusting a measure representing a degree of match of the first and the second structured hierarchical documents in accordance with the quantitative measure of the degree of match; and
comparing the measure representing the degree of match of the first and the second hierarchical documents with a threshold value.
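The child-set comparison of claims 4 and 5 can be sketched in the same spirit. The nested (tag, children) representation, the function names and the THRESHOLD value are all hypothetical, and both the per-node measure (a sequence match over child tags) and the aggregation by minimum are assumptions rather than the claimed method.

```python
from difflib import SequenceMatcher

# A node is (tag, [child nodes]); this form is a hypothetical stand-in
# for the hierarchical data structure.


def children_match(first_node, second_node):
    """Degree of match in [0, 1] between the child tag sequences of two nodes."""
    first_tags = [child[0] for child in first_node[1]]
    second_tags = [child[0] for child in second_node[1]]
    if not first_tags and not second_tags:
        return 1.0  # two leaves match trivially
    return SequenceMatcher(None, first_tags, second_tags).ratio()


def structure_match(first_node, second_node):
    """Lowest children_match over all congruent node pairs (min is an assumption)."""
    measure = children_match(first_node, second_node)
    for first_child, second_child in zip(first_node[1], second_node[1]):
        measure = min(measure, structure_match(first_child, second_child))
    return measure


english = ("html", [("body", [("h1", []), ("p", []), ("p", [])])])
french = ("html", [("body", [("h1", []), ("p", []), ("p", [])])])
other = ("html", [("body", [("table", []), ("ul", [])])])

THRESHOLD = 0.8  # hypothetical threshold value
is_candidate = structure_match(english, french) >= THRESHOLD
is_unrelated = structure_match(english, other) < THRESHOLD
```

Here children at the same index are treated as congruent, so the recursion only descends into pairs that exist in both trees.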
-
-
6. A computer readable medium containing programming instructions for identifying different versions of the same structured document including programming instructions for:
-
reading a first file including text;
reading a second file including text;
generating from the first file a first hierarchical structured document using formatting codes and leaf content in the first file, wherein the first hierarchical document includes the text of the first file;
generating from the second file a second hierarchical structured document using formatting codes and leaf content in the second file, wherein the second hierarchical document includes the text of the second file;
reading a first portion of text which occupies a first position in the first hierarchical structured document;
reading a second portion of text which occupies a second position which is congruent to the first position in the second hierarchical structured document; and
obtaining a quantitative measure of similarity of the first and the second portions of text.
-
-
7. A computer readable medium according to claim 6 further comprising programming instructions for:
-
detecting a first language of the first portion of text; and
translating the first portion of text into a second language prior to obtaining the quantitative measure of similarity of the first and the second portions of text.
-
-
8. A computer readable medium according to claim 7 further comprising programming instructions for:
-
adjusting a measure of similarity of the first and the second hierarchical structured documents according to the quantitative measure of similarity of the first and the second portions of text; and
comparing the measure of similarity of the first and the second hierarchical structured documents to a bound.
-
-
9. A computer readable medium according to claim 8 further comprising programming instructions for:
-
reading the first hierarchical data structure to identify a first set of children of a first node, reading the second hierarchical data structure to identify a second set of children of a second node which occupies a position congruent to the first node in the first hierarchical data structure; and
comparing the first set of children to the second set of children to obtain a quantitative measure of the degree of match.
-
-
10. A computer readable medium according to claim 9 further comprising programming instructions for:
-
adjusting a measure representing a degree of match of the first and the second structured hierarchical documents in accordance with the quantitative measure of the degree of match; and
comparing the measure representing the degree of match of the first and the second hierarchical documents with a threshold value.
-
-
11. A system for identifying different versions of the same structured document comprising:
-
means for reading a first file including text;
means for reading a second file including text;
means for generating from the first file a first hierarchical structured document using formatting codes and leaf content in the first file, wherein the first hierarchical document includes the text of the first file;
means for generating from the second file a second hierarchical structured document using formatting codes and leaf content in the second file, wherein the second hierarchical document includes the text of the second file;
means for reading a first portion of text which occupies a first position in the first hierarchical structured document;
means for reading a second portion of text which occupies a second position which is congruent to the first position in the second hierarchical structured document; and
means for obtaining a quantitative measure of similarity of the first and the second portions of text.
-
-
12. A method of identifying different versions of the same structured document comprising:
-
reading a first file and a second file, wherein the first file and the second file include language data;
generating from the first file a first hierarchical structured document using formatting codes and leaf content in the first file, wherein the first hierarchical structured document includes the language data of the first file;
generating from the second file a second hierarchical structured document using formatting codes and leaf content in the second file, wherein the second hierarchical structured document includes the language data of the second file;
comparing the hierarchical structure of the first hierarchical structured document with the hierarchical structure of the second hierarchical structured document; and
calculating a quantitative measure of similarity between the hierarchical structure of the first hierarchical structured document and the hierarchical structure of the second hierarchical structured document.
-
-
13. The method of claim 12, further comprising:
-
wherein if the hierarchical structure of the first hierarchical structured document and the hierarchical structure of the second hierarchical structured document are substantially similar, then determining whether the language data of the first hierarchical structured document is in a preferred format; and
wherein if the language data of the first hierarchical structured document is not in a preferred format, then transforming the language data of the first hierarchical structured document into the preferred format.
-
-
14. The method of claim 13, further comprising:
-
determining whether the language data of the second hierarchical structured document is in the preferred format; and
wherein if the language data of the second hierarchical structured document is not in the preferred format, then transforming the language data of the second hierarchical structured document into the preferred format.
-
-
15. The method of claim 14, further comprising:
-
reading a first portion of language data which occupies a first position in the first hierarchical structured document;
reading a second portion of language data which occupies a second position in the second hierarchical structured document wherein the second position is congruent to the first position in the first hierarchical structured document; and
calculating a quantitative measure of similarity of the first and the second portions of language data.
-
-
16. The method of claim 12, further comprising:
-
wherein if the hierarchical structure of the first hierarchical structured document and the hierarchical structure of the second hierarchical structured document are substantially similar, then:
reading a first portion of language data which occupies a first position in the first hierarchical structured document;
reading a second portion of language data which occupies a second position in the second hierarchical structured document, wherein the second position is congruent to the first position in the first hierarchical structured document; and
calculating a quantitative measure of similarity of the first and the second portions of language data.
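The flow of claims 12 to 16 (compare structure first, then bring the language data into a preferred format, then compare congruent portions) can be sketched as below. Everything here is illustrative: GLOSSARY and to_preferred are a toy word lookup standing in for a real translation step, strict equality of tree shapes stands in for "substantially similar", and averaging the per-portion scores is an assumption.

```python
from difflib import SequenceMatcher

# Hypothetical word-for-word glossary standing in for a real translation step.
GLOSSARY = {"bonjour": "hello", "monde": "world"}


def to_preferred(text, language, preferred="en"):
    """Return text in the preferred language; only a toy lookup is shown."""
    if language == preferred:
        return text
    return " ".join(GLOSSARY.get(word, word) for word in text.lower().split())


def shape(node):
    """Reduce a (tag, text, children) node to its tags only, as a nested tuple."""
    tag, _text, children = node
    return (tag, tuple(shape(child) for child in children))


def texts(node):
    """Yield the text of every node in document order (empty strings included)."""
    _tag, text, children = node
    yield text
    for child in children:
        yield from texts(child)


def compare(first, first_lang, second, second_lang):
    """Return None if the structures differ, else the mean text similarity."""
    if shape(first) != shape(second):  # strict equality is an assumption
        return None
    # Equal shapes mean equal indexes are congruent; skip pairs with no text.
    pairs = [(a, b) for a, b in zip(texts(first), texts(second)) if a or b]
    if not pairs:
        return 1.0
    scores = [
        SequenceMatcher(
            None, to_preferred(a, first_lang), to_preferred(b, second_lang)
        ).ratio()
        for a, b in pairs
    ]
    return sum(scores) / len(scores)


french = ("html", "", [("p", "Bonjour monde", [])])
english = ("html", "", [("p", "hello world", [])])
result = compare(french, "fr", english, "en")
```

Checking structure before touching the text means the comparatively expensive translation step only runs for document pairs that already look like versions of each other.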
-
Specification