Paraphrasing the web by search-based data collection

  • US 8,244,521 B2
  • Filed: 03/16/2007
  • Issued: 08/14/2012
  • Est. Priority Date: 01/11/2007
  • Status: Active Grant
First Claim

1. A method for supporting a language processing application, the method comprising:

    obtaining a starting target phrase of words that includes a first word, a last word, and a group of other words between the first and the last words;

    identifying n-grams that correspond to the starting target phrase of words, each of the n-grams including a portion of the words in the starting target phrase, each n-gram being different from the other n-grams, at least some of the n-grams having different numbers of words, a first group of the n-grams including the first word and combinations of the other words, a second group of the n-grams including the last word and different combinations of the other words;

    utilizing each of the n-grams to search a web index to identify other phrases of words that are different than the starting target phrase of words and that include the n-grams;

    simultaneously identifying a plurality of left contexts and a plurality of right contexts for each of the n-grams by identifying words in the other phrases that precede and follow the n-grams in the other phrases, the plurality of left contexts including the words that precede the n-grams, and the plurality of right contexts including the words that follow the n-grams;

    combining the plurality of left contexts and the plurality of right contexts for each of the n-grams with a wildcard to search for a list of phrases of words that are distributionally similar to the starting target phrase of words, the phrases of words including phrases having different numbers of words and including some phrases that are semantically similar to the starting target phrase and some phrases that are not semantically similar to the starting target phrase;

    utilizing a computer processor that is a component of a computer to determine, based on results of multiple index queries, whether each phrase in the list of distributionally similar phrases of words is semantically equivalent to the starting target phrase of words, the multiple index queries including a query that explicitly identifies a presence or an absence of a synonymy relationship between the starting target phrase of words and the list of distributionally similar phrases of words in a string of text; and

    adding a portion of the list of phrases of words that are determined to be semantically equivalent to the starting target phrase of words to a lattice of replacement candidates for the starting target phrase of words.
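
The first steps of claim 1 (identifying n-grams, collecting left and right contexts, and combining the contexts with a wildcard) can be illustrated with a small sketch. This is one possible reading of the claim language, not the patented implementation. Everything named here is an assumption made for illustration: the function names, the tiny in-memory `CORPUS` that stands in for a web index, the restriction to contiguous n-grams anchored at the first or last word, and the use of single-word contexts. The semantic-equivalence test and the lattice from the final two steps are not modeled.

```python
from collections import Counter

# Hypothetical stand-in for a web index: a handful of sentences.
CORPUS = [
    "prices rose but on the other hand wages fell",
    "prices rose but in contrast wages fell",
    "prices rose but last year wages fell",
]


def anchored_ngrams(phrase):
    """Return (prefixes, suffixes): contiguous n-grams of varying length,
    one group anchored at the first word, the other at the last word."""
    words = phrase.lower().split()
    prefixes = [tuple(words[:k]) for k in range(1, len(words))]
    suffixes = [tuple(words[-k:]) for k in range(1, len(words))]
    return prefixes, suffixes


def find_ngram(tokens, ngram):
    """Return every start index at which ngram occurs in tokens."""
    n = len(ngram)
    return [i for i in range(len(tokens) - n + 1)
            if tuple(tokens[i:i + n]) == ngram]


def collect_contexts(phrase, corpus):
    """Collect the word preceding each first-word n-gram (left contexts)
    and the word following each last-word n-gram (right contexts)."""
    prefixes, suffixes = anchored_ngrams(phrase)
    lefts, rights = set(), set()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for ngram in prefixes:
            for i in find_ngram(tokens, ngram):
                if i > 0:
                    lefts.add(tokens[i - 1])
        for ngram in suffixes:
            for i in find_ngram(tokens, ngram):
                end = i + len(ngram)
                if end < len(tokens):
                    rights.add(tokens[end])
    return lefts, rights


def wildcard_candidates(phrase, corpus, max_len=5):
    """Emulate a 'left * right' wildcard query: return the word sequences
    (1 to max_len words) found between a left and a right context,
    excluding the starting phrase itself."""
    lefts, rights = collect_contexts(phrase, corpus)
    target = phrase.lower().split()
    found = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token not in lefts:
                continue
            for j in range(i + 2, min(i + 2 + max_len, len(tokens))):
                if tokens[j] in rights:
                    filler = tokens[i + 1:j]
                    if filler != target:
                        found[" ".join(filler)] += 1
    return found


if __name__ == "__main__":
    print(wildcard_candidates("on the other hand", CORPUS))
```

On this toy corpus the wildcard step returns both "in contrast" and "last year" for the starting phrase "on the other hand". Both fill the same slot, yet only the first is a plausible paraphrase, which is why the claim follows the distributional search with a separate semantic-equivalence check before anything is added to the lattice.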
