Paraphrase acquisition
First Claim
1. A method comprising:
- receiving textual input in data processing apparatus;
- identifying, by operation of the data processing apparatus, a plurality of ngrams, each ngram being a sequence of words within the textual input;
- dividing, by operation of the data processing apparatus, each identified ngram into three portions: a beginning constant portion containing a first number of words at the beginning of the ngram, an ending constant portion containing a second number of words at the end of the ngram, and a middle portion containing the words of the ngram between the beginning constant portion and the ending constant portion;
- determining, by operation of the data processing apparatus, an anchor for each ngram, the anchor comprising the beginning constant portion and the ending constant portion of the ngram; and
- identifying, by operation of the data processing apparatus, a plurality of potential paraphrase pairs, wherein if the anchor of a first ngram is the same as the anchor of a second ngram in the plurality of ngrams, the middle portion of the first ngram and the middle portion of the second ngram is identified as being a potential paraphrase pair, wherein the middle portion of the first ngram is textually different from the middle portion of the second ngram.
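The steps of the first claim can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, whitespace tokenization, lowercasing, and the default portion sizes (`n_begin`, `n_end`, `max_middle`) are all hypothetical choices made for the example.

```python
from collections import defaultdict
from itertools import combinations


def extract_ngrams(text, min_len, max_len):
    """Return every word sequence of length min_len..max_len in the text.

    Assumption: words are separated by whitespace and compared in lowercase.
    """
    words = text.lower().split()
    ngrams = []
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            ngrams.append(tuple(words[i:i + n]))
    return ngrams


def split_ngram(ngram, n_begin, n_end):
    """Divide an ngram into (beginning, middle, ending) portions."""
    cut = len(ngram) - n_end
    return ngram[:n_begin], ngram[n_begin:cut], ngram[cut:]


def potential_paraphrase_pairs(text, n_begin=2, n_end=2, max_middle=3):
    """Pair up textually different middles that share the same anchor."""
    middles_by_anchor = defaultdict(set)
    min_len = n_begin + n_end + 1  # require at least one word in the middle
    max_len = n_begin + n_end + max_middle
    for ngram in extract_ngrams(text, min_len, max_len):
        beginning, middle, ending = split_ngram(ngram, n_begin, n_end)
        # The anchor is the beginning and ending constant portions together.
        middles_by_anchor[(beginning, ending)].add(middle)

    pairs = set()
    for middles in middles_by_anchor.values():
        # A set holds each middle once, so every combination is a pair of
        # textually different middle portions.
        for first, second in combinations(sorted(middles), 2):
            pairs.add((" ".join(first), " ".join(second)))
    return pairs
```

For the input "the city fell under the control of rome . the city fell under the rule of rome .", the anchor ("under the", "of rome") is shared by two ngrams, so ("control", "rule") is among the returned pairs. The sketch also returns longer overlapping pairs such as ("the control", "the rule"); any filtering or ranking of candidates is outside what this example covers.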
Abstract
Methods and apparatus, including systems and computer program products, to acquire potential paraphrases from textual input. In one aspect, textual input is received and three maps are generated. In the first map, each key is an ngram identified in the textual input and the associated value is a unique identifier. In the second map, each key is an anchor identified from the ngram and the associated value is the one or more middle portions associated with that anchor. In the third map, each key is a potential paraphrase pair identified from the middle portions and the associated value is the one or more unique anchors associated with that pair.
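The three maps described in the abstract can be illustrated with ordinary Python dictionaries. This is a sketch under assumptions, not the patented implementation: the function name `build_maps`, the use of a running integer as the unique identifier, and the portion sizes `n_begin` and `n_end` are hypothetical, and the ngrams are assumed to have been identified already.

```python
from collections import defaultdict
from itertools import combinations


def build_maps(ngrams, n_begin=1, n_end=1):
    """Build the three maps from a list of ngram strings."""
    # First map: ngram -> unique identifier (here, a running integer).
    ngram_ids = {}
    for ngram in ngrams:
        if ngram not in ngram_ids:
            ngram_ids[ngram] = len(ngram_ids)

    # Second map: anchor -> set of middle portions seen with that anchor.
    anchor_middles = defaultdict(set)
    for ngram in ngram_ids:
        words = ngram.split()
        if len(words) <= n_begin + n_end:
            continue  # too short to have a middle portion
        cut = len(words) - n_end
        anchor = (" ".join(words[:n_begin]), " ".join(words[cut:]))
        anchor_middles[anchor].add(" ".join(words[n_begin:cut]))

    # Third map: potential paraphrase pair -> set of anchors shared by the pair.
    pair_anchors = defaultdict(set)
    for anchor, middles in anchor_middles.items():
        for pair in combinations(sorted(middles), 2):
            pair_anchors[pair].add(anchor)

    return ngram_ids, dict(anchor_middles), dict(pair_anchors)


ngram_ids, anchor_middles, pair_anchors = build_maps(
    [
        "was ruled by rome",
        "was controlled by rome",
        "is ruled by him",
        "is controlled by him",
    ],
    n_begin=1,
    n_end=2,
)
```

In this example the pair ("controlled", "ruled") maps to two anchors, ("was", "by rome") and ("is", "by him"), because both middles occur between each of those anchors.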
101 Citations
24 Claims
12. A computer program product, encoded on a computer readable medium, operable to cause data processing apparatus to perform operations comprising:
- receiving textual input;
- identifying a plurality of ngrams, each ngram being a sequence of words within the textual input;
- dividing each identified ngram into three portions: a beginning constant portion containing a first number of words at the beginning of the ngram, an ending constant portion containing a second number of words at the end of the ngram, and a middle portion containing the words of the ngram between the beginning constant portion and the ending constant portion;
- determining an anchor for each ngram, the anchor comprising the beginning constant portion and the ending constant portion of the ngram; and
- identifying a plurality of potential paraphrase pairs, wherein if the anchor of a first ngram is the same as the anchor of a second ngram in the plurality of ngrams, the middle portion of the first ngram and the middle portion of the second ngram is identified as being a potential paraphrase pair, wherein the middle portion of the first ngram is textually different from the middle portion of the second ngram.
13. A system comprising:
- one or more computers, where the one or more computers are configured to perform operations comprising:
  - receiving textual input;
  - identifying a plurality of ngrams, each ngram being a sequence of words within the textual input;
  - dividing each identified ngram into three portions: a beginning constant portion containing a first number of words at the beginning of the ngram, an ending constant portion containing a second number of words at the end of the ngram, and a middle portion containing the words of the ngram between the beginning constant portion and the ending constant portion;
  - determining an anchor for each ngram, the anchor comprising the beginning constant portion and the ending constant portion of the ngram;
  - identifying a plurality of potential paraphrase pairs, wherein if the anchor of a first ngram is the same as the anchor of a second ngram in the plurality of ngrams, the middle portion of the first ngram and the middle portion of the second ngram is identified as being a potential paraphrase pair, wherein the middle portion of the first ngram is textually different from the middle portion of the second ngram;
  - receiving a query entered as an input in a search engine by a user;
  - identifying one or more suggestions, each suggestion being an alternative term for replacing a sequence of one or more words in the query, where the sequence is one paraphrase member of a potential paraphrase pair and the suggestion is the other paraphrase member of the potential paraphrase pair; and
  - generating one or more alternative queries using the one or more suggestions.
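The query-suggestion steps of claim 13 can be sketched as a lookup over previously identified paraphrase pairs. This is an illustrative sketch only: the function name `alternative_queries`, the whitespace tokenization, and the choice to substitute one word sequence at a time are assumptions made for the example, and the claim does not specify them.

```python
def alternative_queries(query, paraphrase_pairs):
    """Return alternative queries built by swapping in paraphrase suggestions.

    paraphrase_pairs is an iterable of (member, member) string pairs.
    """
    # Index the pairs in both directions, since either member of a pair
    # may appear in the query and the other member is then the suggestion.
    substitutes = {}
    for first, second in paraphrase_pairs:
        substitutes.setdefault(first, set()).add(second)
        substitutes.setdefault(second, set()).add(first)

    words = query.lower().split()
    alternatives = set()
    # Try every contiguous sequence of one or more words in the query.
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            phrase = " ".join(words[start:end])
            for suggestion in substitutes.get(phrase, ()):
                rewritten = words[:start] + [suggestion] + words[end:]
                alternatives.add(" ".join(rewritten))
    return sorted(alternatives)


pairs = {("controlled", "ruled"), ("car hire", "car rental")}
suggested = alternative_queries("cheap car hire london", pairs)
```

Here `suggested` contains the single alternative query "cheap car rental london", because "car hire" is one member of a pair and "car rental" is the other.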
Specification