Paraphrase acquisition

US 7,937,265 B1
Filed: 09/27/2005
Issued: 05/03/2011
Est. Priority Date: 09/27/2005
Status: Active Grant

Abstract

Methods and apparatus, including systems and computer program products, to acquire potential paraphrases from textual input. In one aspect, textual input is received and three maps are generated. The key of the first map is an ngram identified in the textual input, and the value associated with that key is a unique identifier. The key of the second map is an anchor identified from the ngram, and the value associated with that key is one or more middle portions associated with the anchor. The key of the third map is a potential paraphrase pair identified from the middle portions, and the value associated with that key is the one or more unique anchors associated with the potential paraphrase pair.
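
The three maps described in the abstract can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function name `build_maps`, the parameters `n_begin` and `n_end`, the use of insertion-order integers as unique identifiers, and the representation of ngrams as word tuples are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

def build_maps(ngrams, n_begin=3, n_end=3):
    """Build three maps from an iterable of ngrams (sequences of words).

    All names and defaults here are hypothetical, chosen for illustration.
    """
    # First map: ngram -> unique identifier (assumed: an insertion-order integer).
    ngram_ids = {}
    for ng in ngrams:
        ngram_ids.setdefault(tuple(ng), len(ngram_ids))

    # Second map: anchor -> set of middle portions seen with that anchor.
    anchor_middles = defaultdict(set)
    for ng in ngram_ids:
        if len(ng) <= n_begin + n_end:
            continue  # too short to have a middle portion
        cut = len(ng) - n_end
        anchor = (ng[:n_begin], ng[cut:])
        anchor_middles[anchor].add(ng[n_begin:cut])

    # Third map: potential paraphrase pair -> set of unique anchors.
    pair_anchors = defaultdict(set)
    for anchor, middles in anchor_middles.items():
        # Set elements are distinct, so every pair is textually different.
        for a, b in combinations(sorted(middles), 2):
            pair_anchors[(a, b)].add(anchor)

    return ngram_ids, dict(anchor_middles), dict(pair_anchors)

ids, by_anchor, by_pair = build_maps([
    "the tower was built in the year 1889".split(),
    "the tower was completed in the year 1889".split(),
])
# by_pair holds one pair, ("built", "in") / ("completed", "in"),
# supported by the single anchor ("the tower was", "the year 1889").
```

Sorting the middle portions before pairing gives each pair a single canonical key, so the same pair found under different anchors accumulates into one entry of the third map.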

24 Claims

1. A method comprising:
- receiving textual input in data processing apparatus;
  
  identifying, by operation of the data processing apparatus, a plurality of ngrams, each ngram being a sequence of words within the textual input;
  
  dividing, by operation of the data processing apparatus, each identified ngram into three portions: a beginning constant portion containing a first number of words at the beginning of the ngram, an ending constant portion containing a second number of words at the end of the ngram, and a middle portion containing the words of the ngram between the beginning constant portion and the ending constant portion;
  
  determining, by operation of the data processing apparatus, an anchor for each ngram, the anchor comprising the beginning constant portion and the ending constant portion of the ngram; and
  
  identifying, by operation of the data processing apparatus, a plurality of potential paraphrase pairs, wherein if the anchor of a first ngram is the same as the anchor of a second ngram in the plurality of ngrams, the middle portion of the first ngram and the middle portion of the second ngram is identified as being a potential paraphrase pair, wherein the middle portion of the first ngram is textually different from the middle portion of the second ngram.
- - 2. The method of claim 1, wherein determining the anchor comprises including a named entity as part of the anchor.
  - 3. The method of claim 2, wherein the anchor further comprises a remainder of an adverbial relative clause modifying the named entity.
  - 4. The method of claim 3, further comprising:
    - counting the number of unique anchors associated with each identified potential paraphrase pair to determine the quality of each potential paraphrase pair; and
      
      identifying potential paraphrase pairs that are associated with a larger number of anchors as being of higher quality than potential paraphrase pairs that are associated with a smaller number of anchors.
  - 5. The method of claim 2, further comprising:
    - counting the number of unique anchors associated with each identified potential paraphrase pair to determine the quality of each potential paraphrase pair; and
      
      identifying potential paraphrase pairs that are associated with a larger number of anchors as being of higher quality than potential paraphrase pairs that are associated with a smaller number of anchors.
  - 6. The method of claim 1, further comprising:
    - counting the number of unique anchors associated with each identified potential paraphrase pair to determine the quality of each potential paraphrase pair; and
      
      identifying potential paraphrase pairs that are associated with a larger number of anchors as being of higher quality than potential paraphrase pairs that are associated with a smaller number of anchors.
  - 7. The method of claim 6, further comprising:
    - identifying each potential paraphrase pair as being a higher quality paraphrase pair if the number of unique anchors associated with the potential paraphrase pair is equal to or greater than a threshold value.
  - 8. The method of claim 7, further comprising:
    - receiving a term, the term being a sequence of one or more words;
      
      identifying one or more higher quality paraphrase pairs in each of which the term is identical to a paraphrase member; and
      
      adding the non-identical paraphrase member of each identified higher quality paraphrase pair to a set of suggested alternatives for the term.
  - 9. The method of claim 1, wherein each ngram has between seven and ten words.
  - 10. The method of claim 1, wherein the first number of words is three words, and the second number of words is three words.
  - 11. The method of claim 1, further comprising:
    - identifying one or more sentences in the textual input, wherein each identified ngram is a sequence of one or more words within an identified sentence.
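
The steps recited in claims 1 and 6 through 11 can be illustrated with a short Python sketch. It is a minimal reading of the claim language under stated assumptions, not the patentee's implementation: whitespace tokenization, the function names `sentence_ngrams`, `split_ngram`, `score_pairs`, and `suggest`, and the default threshold of 2 are all hypothetical. The ngram length range (seven to ten words) and the three-word constant portions follow claims 9 and 10.

```python
from collections import defaultdict
from itertools import combinations

def sentence_ngrams(sentence, min_len=7, max_len=10):
    """Yield every ngram of min_len..max_len words within one sentence."""
    words = sentence.split()  # assumption: plain whitespace tokenization
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])

def split_ngram(ngram, n_begin=3, n_end=3):
    """Return (anchor, middle), where anchor = (beginning, ending) portions."""
    cut = len(ngram) - n_end
    return (ngram[:n_begin], ngram[cut:]), ngram[n_begin:cut]

def score_pairs(sentences):
    """Map each pair of differing middles to the set of anchors they share."""
    middles_by_anchor = defaultdict(set)
    for sentence in sentences:
        for ngram in sentence_ngrams(sentence):
            anchor, middle = split_ngram(ngram)
            middles_by_anchor[anchor].add(middle)
    anchors_by_pair = defaultdict(set)
    for anchor, middles in middles_by_anchor.items():
        for a, b in combinations(sorted(middles), 2):
            anchors_by_pair[frozenset((a, b))].add(anchor)
    return anchors_by_pair

def suggest(term, anchors_by_pair, threshold=2):
    """Return alternatives for term from pairs with >= threshold unique anchors."""
    key = tuple(term.split())
    alternatives = set()
    for pair, anchors in anchors_by_pair.items():
        if len(anchors) >= threshold and key in pair:
            alternatives |= {" ".join(m) for m in pair if m != key}
    return alternatives

sentences = [
    "the old tower was built in the year 1889 by engineers",
    "the old tower was completed in the year 1889 by engineers",
    "the new bridge was built in the spring of 1932",
    "the new bridge was completed in the spring of 1932",
]
pairs = score_pairs(sentences)
print(suggest("built in", pairs))  # {'completed in'}
```

In this toy input the pair "built in" / "completed in" is supported by two distinct anchors, one from the tower sentences and one from the bridge sentences, so it meets the threshold. With only one sentence pair it would have a single anchor and be filtered out.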

12. A computer program product, encoded on a computer readable medium, operable to cause data processing apparatus to perform operations comprising:
- receiving textual input;
  
  identifying a plurality of ngrams, each ngram being a sequence of words within the textual input;
  
  dividing each identified ngram into three portions: a beginning constant portion containing a first number of words at the beginning of the ngram, an ending constant portion containing a second number of words at the end of the ngram, and a middle portion containing the words of the ngram between the beginning constant portion and the ending constant portion;
  
  determining an anchor for each ngram, the anchor comprising the beginning constant portion and the ending constant portion of the ngram; and
  
  identifying a plurality of potential paraphrase pairs, wherein if the anchor of a first ngram is the same as the anchor of a second ngram in the plurality of ngrams, the middle portion of the first ngram and the middle portion of the second ngram is identified as being a potential paraphrase pair, wherein the middle portion of the first ngram is textually different from the middle portion of the second ngram.

13. A system comprising:
- one or more computers;
  
  where the one or more computers are configured to perform operations comprising:
  
  receiving textual input;
  
  identifying a plurality of ngrams, each ngram being a sequence of words within the textual input;
  
  dividing each identified ngram into three portions: a beginning constant portion containing a first number of words at the beginning of the ngram, an ending constant portion containing a second number of words at the end of the ngram, and a middle portion containing the words of the ngram between the beginning constant portion and the ending constant portion;
  
  determining an anchor for each ngram, the anchor comprising the beginning constant portion and the ending constant portion of the ngram;
  
  identifying a plurality of potential paraphrase pairs, wherein if the anchor of a first ngram is the same as the anchor of a second ngram in the plurality of ngrams, the middle portion of the first ngram and the middle portion of the second ngram is identified as being a potential paraphrase pair, wherein the middle portion of the first ngram is textually different from the middle portion of the second ngram;
  
  receiving a query entered as an input in a search engine by a user;
  
  identifying one or more suggestions, each suggestion being an alternative term for replacing a sequence of one or more words in the query, where the sequence is one paraphrase member of a potential paraphrase pair and the suggestion is the other paraphrase member of the potential paraphrase pair; and
  
  generating one or more alternative queries using the one or more suggestions.
- - 14. The system of claim 13, wherein the operations further comprise:
    - counting the number of times each identified potential paraphrase pair occurs in a set of documents; and
      
      determining rankings for the identified potential paraphrase pairs according to the number of times each identified potential paraphrase pair occurs in the set of documents.
  - 15. The system of claim 13, wherein determining the anchor comprises including a named entity as part of the anchor.
  - 16. The system of claim 15, wherein the anchor further comprises a remainder of an adverbial relative clause modifying the named entity.
  - 17. The system of claim 16, the operations further comprising:
    - counting the number of unique anchors associated with each identified potential paraphrase pair to determine the quality of each potential paraphrase pair; and
      
      identifying potential paraphrase pairs that are associated with a larger number of anchors as being of higher quality than potential paraphrase pairs that are associated with a smaller number of anchors.
  - 18. The system of claim 15, the operations further comprising:
    - counting the number of unique anchors associated with each identified potential paraphrase pair to determine the quality of each potential paraphrase pair; and
      
      identifying potential paraphrase pairs that are associated with a larger number of anchors as being of higher quality than potential paraphrase pairs that are associated with a smaller number of anchors.
  - 19. The system of claim 13, the operations further comprising:
    - identifying one or more sentences in the textual input, wherein each identified ngram is a sequence of one or more words within an identified sentence.
  - 20. The system of claim 13, the operations further comprising:
    - counting the number of unique anchors associated with each identified potential paraphrase pair to determine the quality of each potential paraphrase pair; and
      
      identifying potential paraphrase pairs that are associated with a larger number of anchors as being of higher quality than potential paraphrase pairs that are associated with a smaller number of anchors.
  - 21. The system of claim 20, the operations further comprising:
    - identifying each potential paraphrase pair as being a higher quality paraphrase pair if the number of unique anchors associated with the potential paraphrase pair is equal to or greater than a threshold value.
  - 22. The system of claim 21, the operations further comprising:
    - receiving a term, the term being a sequence of one or more words;
      
      identifying one or more higher quality paraphrase pairs in each of which the term is identical to a paraphrase member; and
      
      adding the non-identical paraphrase member of each identified higher quality paraphrase pair to a set of suggested alternatives for the term.
  - 23. The system of claim 13, wherein each ngram has between seven and ten words.
  - 24. The system of claim 13, wherein the first number of words is three words, and the second number of words is three words.
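
The query-rewriting operations at the end of claim 13 can be sketched as below. This is a hedged illustration only: `build_lookup` and `alternative_queries` are hypothetical names, paraphrase pairs are assumed to be plain string pairs, and the query is assumed to be tokenized on whitespace. Each word sequence in the query that matches one member of a pair is replaced by the other member to form an alternative query.

```python
def build_lookup(pairs):
    """Index each pair in both directions, since either member may appear in a query."""
    lookup = {}
    for a, b in pairs:
        lookup.setdefault(a, []).append(b)
        lookup.setdefault(b, []).append(a)
    return lookup

def alternative_queries(query, lookup):
    """Return queries formed by replacing one word sequence with its paraphrase."""
    words = query.split()  # assumption: whitespace tokenization
    results = []
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            phrase = " ".join(words[start:end])
            for suggestion in lookup.get(phrase, []):
                rewritten = words[:start] + suggestion.split() + words[end:]
                results.append(" ".join(rewritten))
    return results

lookup = build_lookup([("built in", "completed in")])
print(alternative_queries("tower built in 1889", lookup))
# ['tower completed in 1889']
```

Only one substitution is applied per alternative query here; combining several substitutions in one query is a design choice the claim language leaves open.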

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Pasca, Alexandru Marius; Dienes, Peter Szabolcs
Primary Examiner(s): Han, Qi
Application Number: US11/238,623
Time in Patent Office: 2,044 Days
Field of Search: 704/9, 704/1, 704/257, 704/10, 704/8, 704/7
US Class Current: 704/9
CPC Class Codes:
G06F 40/247   Thesauruses; Synonyms
G06F 40/289   Phrasal analysis, e.g. fini...
G06F 40/295   Named entity recognition