Methods and systems for identifying paraphrases from an index of information items and associated sentence fragments
Abstract
Methods and systems for identification of paraphrases from an index of information items and associated sentence fragments are described. One method described comprises identifying a pair of sentence fragments each having a same associated information item from an index, wherein the index comprises a plurality of information items and associated sentence fragments, and identifying a paraphrase pair from the pair of sentence fragments.
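The first step in the abstract, finding pairs of sentence fragments that share an information item, can be sketched as below. This is a minimal illustration under stated assumptions, not the patented implementation: the index is modeled as a list of hypothetical `(information_item, sentence_fragment)` tuples, and the function name `fragment_pairs` and the sample entries are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def fragment_pairs(index_entries):
    """Yield (item, fragment_a, fragment_b) for fragments sharing an item.

    index_entries is an assumed representation of the index: an iterable
    of (information_item, sentence_fragment) tuples.
    """
    by_item = defaultdict(list)
    for item, fragment in index_entries:
        by_item[item].append(fragment)
    for item, fragments in by_item.items():
        # Every unordered pair of fragments filed under the same item
        # is a candidate for paraphrase extraction.
        for a, b in combinations(fragments, 2):
            if a != b:
                yield item, a, b


# Hypothetical sample entries; "1989" is a date, "Mozart" an entity name.
entries = [
    ("1989", "the Berlin Wall fell in"),
    ("1989", "the Berlin Wall came down in"),
    ("Mozart", "was born in Salzburg"),
]
pairs = list(fragment_pairs(entries))
```

Only the two fragments filed under the same date form a candidate pair here; the single fragment under the entity name has nothing to pair with.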
22 Claims
1. A system comprising:
a machine-readable index;
one or more computers programmed to perform operations in accordance with computer-readable instructions, the operations comprising:
identifying, in the machine-readable index, a first sentence fragment and a second sentence fragment that are both associated with a same first information item, wherein the first information item is one of a date, an entity name, and a concept, wherein the machine-readable index comprises a plurality of information items and sentence fragments associated with respective of the information items;
in response to identifying that the first sentence fragment and the second sentence fragment are both associated with a same first information item, identifying a paraphrase pair in the first and second sentence fragments;
repeating the identifying of the first sentence fragment and the second sentence fragment and the identifying of the paraphrase pair to identify a plurality of paraphrase pairs; and
determining a frequency of occurrence value for each of the paraphrase pairs in the plurality of paraphrase pairs, wherein the frequency of occurrence value embodies the frequency at which each paraphrase pair appears in the plurality of paraphrase pairs,
wherein
the paraphrase pair comprises a first paraphrase and a second paraphrase,
the first paraphrase comprises a proper subset of the words in the first sentence fragment,
the second paraphrase comprises a proper subset of the words in the second sentence fragment, and
the first paraphrase and the second paraphrase are in a same language, have a same or a similar meaning, and are not identical.
Dependent claims: 2, 3, 4, 5
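The structural conditions in the closing "wherein" clause of claim 1 (each paraphrase is a proper subset of the words of its fragment, and the two paraphrases are not identical) can be checked mechanically. The sketch below is illustrative only: the function name `satisfies_structural_limits` is hypothetical, words are assumed to be whitespace-separated, and the "same language" and "same or similar meaning" conditions are not tested because the claim text gives no procedure for them.

```python
def satisfies_structural_limits(paraphrase_a, paraphrase_b, fragment_a, fragment_b):
    """Check the subset and non-identity conditions on a paraphrase pair.

    All arguments are plain strings; words are split on whitespace
    (an assumption made for this illustration).
    """
    words_pa, words_pb = set(paraphrase_a.split()), set(paraphrase_b.split())
    words_fa, words_fb = set(fragment_a.split()), set(fragment_b.split())
    return (
        words_pa < words_fa              # proper subset of the first fragment
        and words_pb < words_fb          # proper subset of the second fragment
        and paraphrase_a != paraphrase_b  # paraphrases must not be identical
    )


ok = satisfies_structural_limits(
    "fell", "came down",
    "the Berlin Wall fell in", "the Berlin Wall came down in",
)
whole_fragment = satisfies_structural_limits(
    "the Berlin Wall fell in", "came down",
    "the Berlin Wall fell in", "the Berlin Wall came down in",
)
```

In the second call the first paraphrase is the entire fragment, so it is a subset but not a proper subset, and the check fails.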
6. A system comprising:
a machine-readable index that associates information items and sentence fragments, wherein the information items are each one of a date, an entity name, and a concept; and
one or more computers programmed to perform operations in accordance with computer-readable instructions, the operations comprising:
identifying a collection of paraphrase pairs from the machine-readable index;
determining a frequency of occurrence value for a first paraphrase pair of the collection of paraphrase pairs, wherein the frequency of occurrence value embodies the frequency at which the first paraphrase pair appears in the collection; and
adding the first paraphrase pair to a data collection based at least in part on the frequency of occurrence value meeting a criterion,
wherein
each paraphrase pair comprises a first paraphrase and a second paraphrase,
the first paraphrase comprises a proper subset of the words in a first sentence fragment,
the second paraphrase comprises a proper subset of the words in a second sentence fragment, and
each of the first paraphrase and the second paraphrase in a paraphrase pair are in a same language, have a same or a similar meaning, and are not identical.
Dependent claims: 7, 8
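The frequency-of-occurrence step and the criterion in claim 6 can be illustrated with a simple counter. This is a sketch under assumptions, not the claimed implementation: the criterion is assumed to be a minimum count (`min_count`, a hypothetical parameter), and a pair is treated as the same pair regardless of the order of its two paraphrases, which the claim text does not specify.

```python
from collections import Counter


def frequent_pairs(paraphrase_pairs, min_count=2):
    """Count each paraphrase pair and keep those meeting the criterion.

    paraphrase_pairs is an iterable of (paraphrase_a, paraphrase_b) tuples.
    Sorting each tuple makes (x, y) and (y, x) count as one pair
    (an assumption made for this illustration).
    """
    counts = Counter(tuple(sorted(pair)) for pair in paraphrase_pairs)
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical collection of extracted pairs.
collection = [
    ("fell", "came down"),
    ("came down", "fell"),
    ("fell", "collapsed"),
]
data_collection = frequent_pairs(collection, min_count=2)
```

With these sample pairs, only the pair seen twice meets the assumed criterion and is added to the data collection.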
9. A method performed by a system of one or more computers, the method comprising:
identifying, by the system in a machine-readable index, a first sentence fragment and a second sentence fragment that are both associated with a same first information item, wherein the first information item is one of a date, an entity name, and a concept, wherein the machine-readable index comprises a plurality of information items and sentence fragments associated with respective of the information items;
in response to identifying that the first sentence fragment and the second sentence fragment are both associated with a same first information item, identifying, by the system, a paraphrase pair in the first and second sentence fragments;
repeating, by the system, the identifying of the first sentence fragment and the second sentence fragment and the identifying of the paraphrase pair to identify a plurality of paraphrase pairs; and
determining, by the system, a frequency of occurrence value for each of the paraphrase pairs in the plurality of paraphrase pairs, wherein the frequency of occurrence value embodies the frequency at which each paraphrase pair appears in the plurality of paraphrase pairs,
wherein
the paraphrase pair comprises a first paraphrase and a second paraphrase,
the first paraphrase comprises a proper subset of the words in the first sentence fragment,
the second paraphrase comprises a proper subset of the words in the second sentence fragment, and
the first paraphrase and the second paraphrase are in a same language, have a same or a similar meaning, and are not identical.
Dependent claims: 10, 11
12. A method performed by a system of one or more computers, the method comprising:
identifying, by the system, a collection of paraphrase pairs that each associates information items and sentence fragments, wherein the information items are each one of a date, an entity name, and a concept;
determining, by the system, a frequency of occurrence value for a first paraphrase pair of the collection of paraphrase pairs, wherein the frequency of occurrence value embodies the frequency at which the first paraphrase pair appears in the collection;
adding, by the system, the first paraphrase pair to a machine-readable index based at least in part on the frequency of occurrence value meeting a criterion,
wherein
each paraphrase pair comprises a first paraphrase and a second paraphrase,
the first paraphrase comprises a proper subset of the words in a first sentence fragment in the index,
the second paraphrase comprises a proper subset of the words in a second sentence fragment in the index, and
each of the first paraphrase and the second paraphrase in a paraphrase pair are in a same language, have a same or a similar meaning, and are not identical.
13. A system comprising:
a machine-readable index that associates information items and sentence fragments; and
one or more computers programmed to perform operations in accordance with computer-readable instructions, the operations comprising:
repeatedly
identifying a first sentence fragment and a second sentence fragment, each comprising a plurality of tokens and each associated with a same information item in the machine-readable index,
aligning the first sentence fragment and the second sentence fragment so that tokens in the first sentence fragment match tokens in the second sentence fragment,
determining a number of matched non-stop tokens in the aligned first and second sentence fragments,
determining a number of dissimilar tokens in the aligned first and second sentence fragments, and
identifying a paraphrase pair in the dissimilar tokens based at least in part on the number of matched non-stop tokens and the number of dissimilar tokens, wherein paraphrases in the paraphrase pair are in a same language;
determining a frequency of occurrence value for each of the paraphrase pairs in the plurality of paraphrase pairs, wherein the frequency of occurrence value embodies the frequency at which each paraphrase pair appears in a collection of the identified paraphrase pairs;
identifying a subset of the plurality of paraphrase pairs, wherein each paraphrase pair in the subset has a frequency of occurrence value that is above a criteria; and
adding the subset of the plurality of paraphrase pairs to a machine-readable index.
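The align-and-count steps of claim 13 can be sketched with Python's standard `difflib.SequenceMatcher`, which is swapped in here as a generic alignment technique; the claim does not name an alignment algorithm. The stop-word list, the thresholds `min_matched` and `max_dissimilar`, and the rule of accepting only a single substituted span are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Assumed stop-word list; the claim does not enumerate stop tokens.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "was", "is"}


def extract_paraphrase(fragment_a, fragment_b, min_matched=2, max_dissimilar=4):
    """Align two fragments token by token and return a candidate pair.

    Returns (span_a, span_b) when the fragments share enough non-stop
    tokens and differ in exactly one substituted span; otherwise None.
    """
    a, b = fragment_a.lower().split(), fragment_b.lower().split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    matched_non_stop = 0
    dissimilar = 0
    substitutions = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Count aligned tokens that are not stop words.
            matched_non_stop += sum(1 for tok in a[i1:i2] if tok not in STOP_WORDS)
        else:
            # Tokens on either side that have no aligned counterpart.
            dissimilar += (i2 - i1) + (j2 - j1)
            if tag == "replace":
                substitutions.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    if (matched_non_stop >= min_matched
            and dissimilar <= max_dissimilar
            and len(substitutions) == 1):
        return substitutions[0]
    return None


pair = extract_paraphrase(
    "the Berlin Wall fell in November",
    "the Berlin Wall came down in November",
)
no_pair = extract_paraphrase("was born in Salzburg", "the Berlin Wall fell")
```

In the first call three non-stop tokens align ("berlin", "wall", "november") and three tokens are dissimilar, so the substituted spans are returned as a candidate pair. The second call shares no tokens and yields no candidate.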
14. A system comprising:
a machine-readable index that includes a collection of index entries, wherein each of the index entries comprises a sentence fragment and an associated date; and
one or more computers programmed to perform operations in accordance with computer-readable instructions, the operations comprising:
accessing the index;
repeatedly selecting, from the index, first index entries comprising a date and second index entries comprising the same date;
identifying whether first portions of first sentence fragments from the first index entries paraphrase second portions of second sentence fragments from the second index entries, including
determining a frequency of occurrence value for each first portion and second portion in the selected index entries, wherein the frequency of occurrence value embodies the frequency at which the first portions and the second portions are in the sentence fragments of the selected entries and
identifying a subset of the first portions and the second portions having a frequency of occurrence value above a threshold; and
in response to identifying that the first portions paraphrase the second portions, storing the first portions and the second portions in a machine-readable data collection,
wherein paraphrases are in a same language, have a same or a similar meaning, and are not identical.
15. An article comprising one or more computer-readable data storage media storing program code operable to cause one or more machines to perform operations, the operations comprising:
identifying, in a machine-readable index, a first sentence fragment and a second sentence fragment that are both associated with a same first information item, wherein the first information item is an entity name, wherein the machine-readable index comprises a plurality of information items and sentence fragments associated with respective of the information items;
in response to identifying that the first sentence fragment and the second sentence fragment are both associated with the same first information item, identifying a paraphrase pair in the first and second sentence fragments;
repeating the identifying of the first sentence fragment and the second sentence fragment and the identifying of the paraphrase pair to identify a plurality of paraphrase pairs; and
determining a frequency of occurrence value for each of the paraphrase pairs in the plurality of paraphrase pairs, wherein the frequency of occurrence value embodies the frequency at which each paraphrase pair appears in the plurality of paraphrase pairs,
wherein
the paraphrase pair comprises a first paraphrase and a second paraphrase,
the first paraphrase comprises a proper subset of the words in the first sentence fragment,
the second paraphrase comprises a proper subset of the words in the second sentence fragment, and
the first paraphrase and the second paraphrase are in a same language, have a same or a similar meaning, and are not identical.
Dependent claims: 16, 17, 18
19. An article comprising one or more computer-readable data storage media storing program code operable to cause one or more machines to perform operations, the operations comprising:
identifying, in a machine-readable index, a first sentence fragment and a second sentence fragment that are both associated with a same first information item, wherein the first information item is a concept, wherein the machine-readable index comprises a plurality of information items and sentence fragments associated with respective of the information items;
in response to identifying that the first sentence fragment and the second sentence fragment are both associated with a same first information item, identifying a paraphrase pair in the first and second sentence fragments;
repeating the identifying of the first sentence fragment and the second sentence fragment and the identifying of the paraphrase pair to identify a plurality of paraphrase pairs; and
determining a frequency of occurrence value for each of the paraphrase pairs in the plurality of paraphrase pairs, wherein the frequency of occurrence value embodies the frequency at which each paraphrase pair appears in the plurality of paraphrase pairs,
wherein
the paraphrase pair comprises a first paraphrase and a second paraphrase,
the first paraphrase comprises a proper subset of the words in the first sentence fragment,
the second paraphrase comprises a proper subset of the words in the second sentence fragment, and
the first paraphrase and the second paraphrase are in a same language, have a same or a similar meaning, and are not identical.
Dependent claims: 20, 21
22. An article comprising one or more computer-readable data storage media storing program code operable to cause one or more machines to perform operations, the operations comprising:
repeatedly
identifying a first sentence fragment and a second sentence fragment, each comprising a plurality of tokens and each associated with a same information item in a machine-readable index that associates information items and sentence fragments,
aligning the first sentence fragment and the second sentence fragment so that tokens in the first sentence fragment match tokens in the second sentence fragment,
determining a number of matched non-stop tokens in the aligned first and second sentence fragments,
determining a number of dissimilar tokens in the aligned first and second sentence fragments, and
identifying a paraphrase pair in the dissimilar tokens based at least in part on the number of matched non-stop tokens and the number of dissimilar tokens, wherein paraphrases in the paraphrase pair are in a same language;
determining a frequency of occurrence value for each of the paraphrase pairs in the plurality of paraphrase pairs, wherein the frequency of occurrence value embodies the frequency at which each paraphrase pair appears in a collection of the identified paraphrase pairs;
identifying a subset of the plurality of paraphrase pairs, wherein each paraphrase pair in the subset has a frequency of occurrence value that is above a criteria; and
adding the subset of the plurality of paraphrase pairs to a machine-readable index.
Specification