Methods and systems for identifying paraphrases from an index of information items and associated sentence fragments
First Claim
1. A system comprising:
- one or more computers programmed to implement a paraphrase engine to create an index of paraphrases, paraphrases being groups of one or more words in a same language, the groups having a same or a similar meaning but not being identical, the paraphrase engine to perform operations including;
identifying a first sentence fragment and a second sentence fragment that, in text of one or more electronic documents, are both associated with a same date or entity name, the first sentence fragment and the second sentence fragment each comprising two or more like-words found in both the first sentence fragment and the second sentence fragment and one or more dissimilar-words found in only one of the first sentence fragment or the second sentence fragment;
aligning the like-words in the first sentence fragment with the like-words in the second sentence fragment;
determining that the alignment satisfies a threshold frequency value; and
in response to determining that the alignment satisfies the threshold frequency value, extracting the dissimilar-words from the first sentence fragment and the dissimilar-words from the second sentence fragment; and
outputting the index of paraphrases after the extracting, the index identifying the dissimilar-words from the first sentence fragment as paraphrasing the dissimilar-words from the second sentence fragment.
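The align-then-extract steps recited above can be illustrated with a minimal Python sketch. This is not the patented implementation: the function names, the use of `difflib.SequenceMatcher` for alignment, and the reading of the "threshold frequency value" as "the same dissimilar-word pair was produced by at least `min_count` aligned fragment pairs" are all assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def dissimilar_spans(frag_a, frag_b, min_like_words=2):
    """Align two fragments on shared words; return the word spans that differ.

    Hypothetical helper: difflib stands in for whatever alignment the
    claimed system actually uses.
    """
    a, b = frag_a.lower().split(), frag_b.lower().split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    # Count like-words: words that the alignment matched in both fragments.
    like_words = sum(block.size for block in matcher.get_matching_blocks())
    if like_words < min_like_words:
        return []
    # A "replace" opcode marks words present in only one fragment each.
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag == "replace"
    ]


def build_paraphrase_index(fragment_pairs, min_count=2):
    """Keep dissimilar-word pairs seen in at least min_count alignments.

    Assumption: min_count plays the role of the threshold frequency value.
    """
    counts = Counter()
    for frag_a, frag_b in fragment_pairs:
        for left, right in dissimilar_spans(frag_a, frag_b):
            counts[tuple(sorted((left, right)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Toy fragment pairs, each assumed to share a date or entity name.
pairs = [
    ("the company acquired the startup", "the company bought the startup"),
    ("google acquired the firm in 2006", "google bought the firm in 2006"),
]
index = build_paraphrase_index(pairs, min_count=2)
```

In this toy run both fragment pairs yield the same dissimilar-word pair, so it clears the assumed threshold of two and is kept in the index.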
Abstract
Methods and systems for identification of paraphrases from an index of information items and associated sentence fragments are described. One method described comprises identifying a pair of sentence fragments each having a same associated information item from an index, wherein the index comprises a plurality of information items and associated sentence fragments, and identifying a paraphrase pair from the pair of sentence fragments.
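As a rough illustration of the index described in the abstract, the sketch below groups sentence fragments by their associated information item and enumerates fragment pairs that share an item. The record layout, the function names, and the use of an ISO date string as the information item are assumptions, not details taken from the patent.

```python
from collections import defaultdict
from itertools import combinations


def build_item_index(records):
    """Group fragments by information item.

    records: iterable of (information_item, sentence_fragment) tuples
    (a hypothetical input format).
    """
    index = defaultdict(list)
    for item, fragment in records:
        if fragment not in index[item]:  # skip exact duplicate fragments
            index[item].append(fragment)
    return dict(index)


def candidate_pairs(index):
    """Yield (item, fragment_a, fragment_b) for fragments sharing an item."""
    for item, fragments in index.items():
        for frag_a, frag_b in combinations(fragments, 2):
            yield item, frag_a, frag_b


records = [
    ("1989-11-09", "the berlin wall fell"),
    ("1989-11-09", "the berlin wall came down"),
    ("1969-07-20", "apollo 11 landed on the moon"),
]
item_index = build_item_index(records)
pairs = list(candidate_pairs(item_index))
```

Only fragments filed under the same item are paired, so the single fragment for the second date produces no candidates.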
51 Claims
14. A method implemented by a system of one or more computers, the method comprising:
identifying, by the system, a first sentence fragment and a second sentence fragment that, in text of one or more electronic documents, are both associated with a same date or entity name, the first sentence fragment and the second sentence fragment each comprising two or more like-words found in both the first sentence fragment and the second sentence fragment and one or more dissimilar-words found in only one of the first sentence fragment or the second sentence fragment;
aligning, by the system, the like-words in the first sentence fragment with the like-words in the second sentence fragment;
determining, by the system, that the alignment satisfies a threshold frequency value; and
in response to determining that the alignment satisfies the threshold frequency value, extracting, by the system, the dissimilar-words from the first sentence fragment and the dissimilar-words from the second sentence fragment; and
outputting, to memory, the index of paraphrases after the extracting, the index identifying the dissimilar-words from the first sentence fragment as paraphrasing the dissimilar-words from the second sentence fragment.
25. A method implemented by a system of one or more computers, the method comprising:
identifying, by the system, different sentence fragments in different electronic documents, wherein the sentence fragments each comprise two or more words that are alike and are all associated within the different electronic documents with a same date, with a same entity name, or with a same concept, and wherein the sentence fragments further comprising words that are dissimilar and not found in all the sentence fragments;
aligning, by the system, the different sentence fragments based on positioning of the words that are alike within the different sentence fragments;
determining, by the system, that some of the alignments of the words that are alike satisfy a criterion;
in response to determining that some of the alignments of the words that are alike satisfy the criterion, extracting, by the system, the words that are dissimilar from the at least some of the different sentence fragments;
determining, by the system, frequencies at which the words that are dissimilar appear between the words that are alike within the different sentence fragments;
determining, by the system, that some of the words that are dissimilar satisfy a threshold frequency value;
in response to the determining that at least some of the words that are dissimilar satisfy the threshold frequency value, adding, by the system, the words that are dissimilar and that satisfy the threshold frequency value to an index of paraphrases; and
outputting, by the system, the index of paraphrases after the adding, the index identifying that the words that are dissimilar and that satisfy the threshold frequency value are paraphrases, wherein paraphrases are groups of one or more words in a same language, the groups having a same or a similar meaning but not being identical.
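The frequency steps of claim 25 (counting how often dissimilar words appear between the words that are alike, then keeping those that satisfy a threshold) might look roughly like the following Python sketch. It makes strong simplifying assumptions that are not in the claim: the like-words are taken to be the first and last `n` words of each fragment, all fragments passed in are assumed to share the same date, entity name, or concept, and `min_count` stands in for the threshold frequency value. All names are hypothetical.

```python
from collections import Counter, defaultdict


def middle_frequencies(fragments, n=1):
    """Map (first n words, last n words) -> Counter of the words in between."""
    table = defaultdict(Counter)
    for fragment in fragments:
        words = fragment.lower().split()
        if len(words) <= 2 * n:
            continue  # nothing lies between the boundary words
        context = (tuple(words[:n]), tuple(words[-n:]))
        table[context][" ".join(words[n:-n])] += 1
    return table


def frequent_paraphrases(fragments, n=1, min_count=2):
    """Return groups of in-between phrases that each occur >= min_count times."""
    groups = []
    for middles in middle_frequencies(fragments, n).values():
        kept = sorted(m for m, c in middles.items() if c >= min_count)
        if len(kept) >= 2:  # a paraphrase group needs at least two members
            groups.append(kept)
    return groups


fragments = [
    "alice was born in paris",
    "alice was born in paris",
    "alice comes from paris",
    "alice comes from paris",
    "alice visited paris",
]
groups = frequent_paraphrases(fragments, n=1, min_count=2)
```

Here "visited" occurs only once between the shared boundary words, so it falls below the assumed threshold and is left out of the group.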
30. A system comprising:
- one or more computers programmed to implement a paraphrase engine to create an index of paraphrases, paraphrases being groups of one or more words in a same language, the groups having a same or a similar meaning but not being identical, the paraphrase engine to perform operations including;
identifying different sentence fragments in different electronic documents, wherein the sentence fragments each comprise two or more words that are alike and are all associated within the different electronic documents with a same date, with a same entity name, or with a same concept, and wherein the sentence fragments further comprising words that are dissimilar and not found in all the sentence fragments;
aligning the different sentence fragments based on positioning of the words that are alike within the different sentence fragments;
determining that some of the alignments of the words that are alike satisfy a criterion;
in response to determining that some of the alignments of the words that are alike satisfy the criterion, extracting the words that are dissimilar from the at least some of the different sentence fragments;
determining frequencies at which the words that are dissimilar appear between the words that are alike within the different sentence fragments;
determining that some of the words that are dissimilar satisfy a threshold frequency value;
in response to the determining that at least some of the words that are dissimilar satisfy the threshold frequency value, adding the words that are dissimilar and that satisfy the threshold frequency value to an index of paraphrases; and
outputting the index of paraphrases after the adding, the index identifying that the words that are dissimilar and that satisfy the threshold frequency value are paraphrases, wherein paraphrases are groups of one or more words in a same language, the groups having a same or a similar meaning but not being identical.
35. An article comprising one or more computer-readable data storage media storing program code operable to cause one or more machines to perform operations, the operations comprising:
identifying different sentence fragments in different electronic documents, wherein the sentence fragments each comprise two or more words that are alike and are all associated within the different electronic documents with a same date, with a same entity name, or with a same concept, and wherein the sentence fragments further comprising words that are dissimilar and not found in all the sentence fragments;
aligning the different sentence fragments based on positioning of the words that are alike within the different sentence fragments;
determining that some of the alignments of the words that are alike satisfy a criterion;
in response to determining that some of the alignments of the words that are alike satisfy the criterion, extracting the words that are dissimilar from the at least some of the different sentence fragments;
determining frequencies at which the words that are dissimilar appear between the words that are alike within the different sentence fragments;
determining that some of the words that are dissimilar satisfy a threshold frequency value;
in response to the determining that at least some of the words that are dissimilar satisfy the threshold frequency value, adding the words that are dissimilar and that satisfy the threshold frequency value to an index of paraphrases; and
outputting the index of paraphrases after the adding, the index identifying that the words that are dissimilar and that satisfy the threshold frequency value are paraphrases, wherein paraphrases are groups of one or more words in a same language, the groups having a same or a similar meaning but not being identical.
40. An article comprising one or more computer-readable data storage media storing program code operable to cause one or more machines to perform operations implementing a paraphrase engine to create an index of paraphrases, paraphrases being groups of one or more words in a same language, the groups having a same or a similar meaning but not being identical, the operations comprising:
identifying a first sentence fragment and a second sentence fragment that, in text of one or more electronic documents, are both associated with a same date or entity name, the first sentence fragment and the second sentence fragment each comprising two or more like-words found in both the first sentence fragment and the second sentence fragment and one or more dissimilar-words found in only one of the first sentence fragment or the second sentence fragment;
aligning the like-words in the first sentence fragment with the like-words in the second sentence fragment;
determining that the alignment satisfies a threshold frequency value; and
in response to determining that the alignment satisfies the threshold frequency value, extracting the dissimilar-words from the first sentence fragment and the dissimilar-words from the second sentence fragment; and
outputting the index of paraphrases after the extracting, the index identifying the dissimilar-words from the first sentence fragment as paraphrasing the dissimilar-words from the second sentence fragment.
Specification