Method of text similarity measurement
First Claim
1. A method for estimating the similarity between at least two portions of text, said method comprising the steps of:
receiving said at least two portions of text;
forming a set of syntactic tuples from said portions of text, each tuple comprising two terms and a relation between the two terms;
classifying the relation between the terms in the tuples according to a predefined set of relations;
predefining classes of agreement between tuples under comparison, comprising a class of full agreement wherein tuples under comparison are identical, a class of partial agreement wherein only two of corresponding elements in tuples under comparison are identical, and a class of term agreement wherein only one of corresponding terms in tuples under comparison are identical;
determining a respective class of relative agreement between each pair of syntactic tuples from the portions of text under comparison according to the predefined classes of agreement;
calculating a value representative of the similarity between the portions of text for each of the classes of agreement, based on the plurality of tuples determined to belong to the respective class of agreement; and
determining and outputting a measure of the similarity between the portions of text by calculating a weighted sum of the values representative of the similarity between the portions of text for each of the classes of agreement.
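The steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the tuple field names, the class weights, and the per-class value (the fraction of compared tuple pairs falling in each class) are all assumptions chosen for illustration, and the tuples are supplied by hand rather than produced by a parser.

```python
from collections import Counter
from itertools import product
from typing import NamedTuple, Optional


class SyntacticTuple(NamedTuple):
    """Two terms and the relation between them (field names are illustrative)."""
    head: str
    relation: str
    dependent: str


# Hypothetical weights for the weighted sum; the claim does not fix any values.
DEFAULT_WEIGHTS = {"full": 1.0, "partial": 0.5, "term": 0.25}


def agreement_class(a: SyntacticTuple, b: SyntacticTuple) -> Optional[str]:
    """Classify a pair of tuples into one of the three classes of agreement."""
    matches = sum(x == y for x, y in zip(a, b))
    if matches == 3:
        return "full"      # tuples are identical
    if matches == 2:
        return "partial"   # exactly two corresponding elements are identical
    if matches == 1 and a.relation != b.relation:
        return "term"      # the single matching element is one of the terms
    return None            # no agreement, or only the relation matches


def similarity(tuples_a, tuples_b, weights=None) -> float:
    """Weighted sum of per-class values.

    Assumption: the per-class value is the share of all compared pairs
    that fall into that class.
    """
    weights = weights or DEFAULT_WEIGHTS
    total_pairs = len(tuples_a) * len(tuples_b)
    if total_pairs == 0:
        return 0.0
    counts = Counter()
    for a, b in product(tuples_a, tuples_b):
        cls = agreement_class(a, b)
        if cls is not None:
            counts[cls] += 1
    return sum(w * counts[cls] / total_pairs for cls, w in weights.items())


if __name__ == "__main__":
    text_a = [SyntacticTuple("dog", "subj", "bark")]
    text_b = [
        SyntacticTuple("dog", "subj", "bark"),  # full agreement
        SyntacticTuple("dog", "subj", "run"),   # partial agreement
        SyntacticTuple("cat", "obj", "bark"),   # term agreement
        SyntacticTuple("cat", "obj", "run"),    # no agreement
    ]
    print(similarity(text_a, text_b))
```

Comparing every tuple of one text against every tuple of the other follows the claim's wording ("each pair of syntactic tuples"); a different normalisation would change the per-class values but not the overall structure of classify, aggregate per class, then sum with weights.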
Abstract
In one aspect, the present invention provides a method for estimating the similarity between at least two portions of text, including the steps of forming a set of syntactic tuples, each tuple including at least two terms and a relation between the two terms; classifying the relation between the terms in the tuples according to a predefined set of relations; establishing the relative agreement between syntactic tuples from the portions of text under comparison according to predefined classes of agreement; calculating a value representative of the similarity between the portions of text for each of the classes of agreement; and establishing a value for the similarity between the portions of text by calculating a weighted sum of the values representative of the similarity between the portions of text for each of the classes of agreement. Preferably, the step of calculating a value representative of the similarity between the portions of text for each of the classes of agreement includes a weighting based upon the number of matched terms occurring in particular parts of speech in which the text occurs. It is also preferred that the step of calculating a value representative of the similarity between the portions of text for each of the classes of agreement includes the application of a weighting factor to the estimate of similarity for each of the classes of agreement and the parts of speech in which matched terms occur.
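The preferred part-of-speech weighting can be sketched in the same spirit. The abstract gives no formula, so the following is only one possible reading: the part-of-speech labels, the weight values, and the function name are hypothetical, and the per-class value is taken to be the count of matched terms in each part of speech multiplied by a weight for that part of speech.

```python
from collections import Counter
from typing import Iterable, Mapping, NamedTuple


class Term(NamedTuple):
    """A matched term with its part-of-speech label (labels are illustrative)."""
    word: str
    pos: str


# Hypothetical weights per part of speech; the source does not specify values.
POS_WEIGHTS = {"noun": 1.0, "verb": 0.5}
DEFAULT_POS_WEIGHT = 0.25  # assumed fallback for unlisted parts of speech


def pos_weighted_value(
    matched_terms: Iterable[Term],
    pos_weights: Mapping[str, float] = POS_WEIGHTS,
    default: float = DEFAULT_POS_WEIGHT,
) -> float:
    """Sum, over parts of speech, of (matched-term count) x (weight)."""
    counts = Counter(term.pos for term in matched_terms)
    return sum(pos_weights.get(pos, default) * n for pos, n in counts.items())


if __name__ == "__main__":
    matched = [
        Term("dog", "noun"),
        Term("cat", "noun"),
        Term("bark", "verb"),
        Term("quickly", "adverb"),
    ]
    print(pos_weighted_value(matched))
```

A value computed this way for each class of agreement could then feed the weighted sum over classes described in the abstract.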
19 Claims
Specification