Method and device to estimate similarity between documents having multiple segments
Abstract
Described herein are methods for finding substantially similar or different sources (files and documents) and for estimating the similarity or difference between given sources. Similarity and difference may be found across a variety of formats. Sources may be in one or more languages, so similarity and difference may be found across any number and type of languages. A variety of characteristics may be used to arrive at an overall measure of similarity or difference, including syntactic roles, semantic roles, and semantic classes determined or identified in reference to the sources.
23 Claims
1. A method for comparing a first document and a second document, the method comprising:
- associating, by a processor, a respective weight with each of a plurality of information types including text-based information, graphical information, audio information, or video information;
- identifying, for each of the first document and the second document, by the processor, one or more segments each corresponding to one of the plurality of information types; and
- estimating, by the processor, a similarity value between the first document and the second document, by comparing each segment of the first document with a segment of the second document that corresponds to a same information type, wherein the similarity value is based on a distance, in a semantic hierarchy, between a first semantic class associated with the first document and a common ancestor, in the semantic hierarchy, of the first semantic class and a second semantic class associated with a second document; and
- combining results of the comparison based on the respective associated weights.
Dependent claims: 2 to 19.
20. An electronic device for comparing documents, the electronic device comprising:
- a memory; and
- a processor communicatively coupled to the memory, the processor configured to:
  - associate a respective weight with each of a plurality of information types including text-based information, graphical information, audio information, or video information;
  - identify, for each of the first document and the second document, one or more segments each corresponding to one of the plurality of information types; and
  - estimate a similarity value between the first document and the second document, by comparing each segment of the first document with a segment of the second document that corresponds to a same information type, wherein the similarity value is based on a distance, in a semantic hierarchy, between a first semantic class associated with the first document and a common ancestor, in the semantic hierarchy, of the first semantic class and a second semantic class associated with a second document; and
  - combine results of the comparison based on the respective associated weights.
Dependent claims: 21 to 23.
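The steps recited in the independent claims can be illustrated with a minimal Python sketch. It is not the patented implementation. Everything concrete in it is an assumption made for illustration: the toy hierarchy in `PARENT`, the `Segment` type, the `1 / (1 + distance)` mapping from distance to similarity, the rule of picking the best-matching segment of the same information type, and the weighted average used to combine results. The claims only require that the similarity be based on the distance between the first semantic class and a common ancestor, and that results be combined using the per-type weights.

```python
from dataclasses import dataclass

# Hypothetical semantic hierarchy, stored as child -> parent links.
PARENT = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "bird": "animal",
    "animal": "entity",
    "car": "vehicle",
    "vehicle": "entity",
}


@dataclass(frozen=True)
class Segment:
    info_type: str       # e.g. "text", "graphical", "audio", "video"
    semantic_class: str  # a node of the hierarchy above


def ancestors(semantic_class):
    """Return the path [class, parent, grandparent, ..., root]."""
    path = [semantic_class]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def distance_to_common_ancestor(first, second):
    """Count edges from `first` up to the nearest ancestor shared with `second`."""
    second_path = set(ancestors(second))
    for steps, node in enumerate(ancestors(first)):
        if node in second_path:
            return steps
    raise ValueError("classes share no common ancestor")


def class_similarity(first, second):
    """Assumed mapping: distance 0 gives 1.0, larger distances give less."""
    return 1.0 / (1.0 + distance_to_common_ancestor(first, second))


def document_similarity(first_doc, second_doc, weights):
    """Compare same-type segments and combine the scores by type weight."""
    weighted_sum = 0.0
    weight_total = 0.0
    for seg in first_doc:
        candidates = [s for s in second_doc if s.info_type == seg.info_type]
        # Assumption: a segment with no same-type counterpart scores 0.
        score = max(
            (class_similarity(seg.semantic_class, c.semantic_class)
             for c in candidates),
            default=0.0,
        )
        weight = weights[seg.info_type]
        weighted_sum += weight * score
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0


weights = {"text": 0.6, "graphical": 0.4}
doc_a = [Segment("text", "dog"), Segment("graphical", "car")]
doc_b = [Segment("text", "cat"), Segment("graphical", "car")]
print(document_similarity(doc_a, doc_b, weights))
```

In this example the text segments ("dog" and "cat") meet at "mammal", one edge above "dog", so they score 0.5 under the assumed mapping. The graphical segments are identical and score 1.0. The weighted combination is therefore 0.6 × 0.5 + 0.4 × 1.0 = 0.7. Because the distance is measured from the first document's class, the measure in this sketch is not symmetric: comparing "dog" against "mammal" gives a distance of 1, while comparing "mammal" against "dog" gives 0.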
Specification