Paraphrasing the web by search-based data collection
First Claim
1. A method for supporting a language processing application, the method comprising:
obtaining a starting target phrase of words that includes a first word, a last word, and a group of other words between the first and the last words;
identifying n-grams that correspond to the starting target phrase of words, each of the n-grams including a portion of the words in the starting target phrase, each n-gram being different from the other n-grams, at least some of the n-grams having different numbers of words, a first group of the n-grams including the first word and combinations of the other words, a second group of the n-grams including the last word and different combinations of the other words;
utilizing each of the n-grams to search a web index to identify other phrases of words that are different than the starting target phrase of words and that include the n-grams;
simultaneously identifying a plurality of left contexts and a plurality of right contexts for each of the n-grams by identifying words in the other phrases that precede and follow the n-grams in the other phrases, the plurality of left contexts including the words that precede the n-grams, and the plurality of right contexts including the words that follow the n-grams;
combining the plurality of left contexts and the plurality of right contexts for each of the n-grams with a wildcard to search for a list of phrases of words that are distributionally similar to the starting target phrase of words, the phrases of words including phrases having different numbers of words and including some phrases that are semantically similar to the starting target phrase and some phrases that are not semantically similar to the starting target phrase;
utilizing a computer processor that is a component of a computer to determine, based on results of multiple index queries, whether each phrase in the list of distributionally similar phrases of words is semantically equivalent to the starting target phrase of words, the multiple index queries including a query that explicitly identifies a presence or an absence of a synonymy relationship between the starting target phrase of words and the list of distributionally similar phrases of words in a string of text; and
adding a portion of the list of phrases of words that are determined to be semantically equivalent to the starting target phrase of words to a lattice of replacement candidates for the starting target phrase of words.
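The n-gram, context, and wildcard steps of claim 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: a small in-memory list of sentences stands in for the web index, the two n-gram groups are read as contiguous prefixes (anchored on the first word) and suffixes (anchored on the last word), and the names `anchored_ngrams`, `find_contexts`, and `wildcard_search` are hypothetical, not taken from the patent.

```python
from collections import Counter

def anchored_ngrams(phrase):
    """Return (first-word group, last-word group) of n-grams of varying length.
    Assumption: the groups are the proper prefixes and suffixes of the phrase."""
    words = phrase.split()
    first_group = [" ".join(words[:k]) for k in range(1, len(words))]
    last_group = [" ".join(words[-k:]) for k in range(1, len(words))]
    return first_group, last_group

def find_contexts(ngram, corpus, width=1):
    """Collect the words that precede and follow each occurrence of the n-gram."""
    target = ngram.split()
    lefts, rights = Counter(), Counter()
    for sentence in corpus:
        toks = sentence.split()
        for i in range(len(toks) - len(target) + 1):
            if toks[i:i + len(target)] == target:
                end = i + len(target)
                left = " ".join(toks[max(0, i - width):i])
                right = " ".join(toks[end:end + width])
                if left:
                    lefts[left] += 1
                if right:
                    rights[right] += 1
    return lefts, rights

def wildcard_search(left, right, corpus, max_len=3):
    """Emulate the query 'left * right': count fillers of 1..max_len words.
    Both contexts are assumed to be non-empty."""
    l, r = left.split(), right.split()
    fillers = Counter()
    for sentence in corpus:
        toks = sentence.split()
        for i in range(len(toks) - len(l) + 1):
            if toks[i:i + len(l)] != l:
                continue
            start = i + len(l)
            for length in range(1, max_len + 1):
                end = start + length
                if toks[end:end + len(r)] == r:
                    fillers[" ".join(toks[start:end])] += 1
    return fillers

# Toy stand-in for a web index (an assumption of this sketch).
corpus = [
    "he will buy a car today",
    "he will purchase a car today",
    "he will sell a car today",
]
lefts, rights = find_contexts("buy a car", corpus)
candidates = wildcard_search("will", "today", corpus)
```

With this toy corpus the wildcard query returns "purchase a car" alongside "sell a car", which mirrors the claim's point that distributionally similar phrases include some that are not semantically similar and therefore need a separate equivalence check.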
Abstract
String-oriented web queries are utilized as a tool to examine the fabric of how words, phrases and/or n-grams alternate in a language. This fabric is exploited in order to build up a matrix of semantically equivalent pieces of language. In one embodiment, the Distributional Hypothesis is utilized, along with strategies for confirming synonymy, to systematically build up a picture of what words/phrases can be legitimately substituted for one another.
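One way to picture a synonymy-confirming string query is the sketch below. It is an assumption-laden illustration, not the patented procedure: the query templates ("also known as", "rather than", and so on) are invented for this example, exact substring counts over a list of strings stand in for index hit counts, and `hit_count` and `synonymy_evidence` are hypothetical names.

```python
# Illustrative templates only; the source does not list the actual query strings.
PRESENCE_TEMPLATES = ("{a}, also known as {b}", "{a} is another word for {b}")
ABSENCE_TEMPLATES = ("{a} rather than {b}", "{a} as opposed to {b}")

def hit_count(query, corpus):
    """Count exact occurrences of a query string (stand-in for an index hit count)."""
    return sum(text.count(query) for text in corpus)

def synonymy_evidence(a, b, corpus):
    """Return (presence hits, absence hits) for the pair, trying both word orders."""
    def total(templates):
        return sum(
            hit_count(t.format(a=x, b=y), corpus)
            for t in templates
            for x, y in ((a, b), (b, a))
        )
    return total(PRESENCE_TEMPLATES), total(ABSENCE_TEMPLATES)

corpus = [
    "automobile is another word for car",
    "people drink tea rather than coffee",
]
car_evidence = synonymy_evidence("car", "automobile", corpus)
tea_evidence = synonymy_evidence("tea", "coffee", corpus)
```

A caller might accept a substitution only when presence hits outnumber absence hits; that decision rule is likewise an assumption of this sketch.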
20 Claims
18. A language processing system, comprising:
an index query engine; and
a processing component that utilizes the index query engine to perform a first left and right context search to identify an item of text that is distributionally similar to a target item of text, the first left and right context search including generating n-grams based on the target item of text and utilizing the n-grams to identify left contexts and right contexts for each of the n-grams, the item of text being identified based on it having one or more of the left and right contexts in common with the target item of text, the processing component applying multiple tests to the item of text to determine whether the item of text is semantically equivalent to the target item of text, one of the multiple tests including a second left and right context search that utilizes the item of text and the target item of text to identify additional left and right contexts for the item of text and the target item of text, the processing component making a determination that the item of text and the target item of text are semantically equivalent based on the item of text having one or more of the additional left and right contexts in common with target item of text, another one of the multiple tests including a context grouping test, the context grouping test including performing a web search to identify lists of words that are found proximate to the item of text and the target item of text in a world wide web, each word in the lists of words being associated with a web count that identifies a number of times the word is found in the web search, the context grouping test making a determination that the item of text and the target item of text are semantically equivalent based at least in part on the web counts, the item of text being added to a lattice of words for the target item of text based on the determinations from the context grouping test and from the second left and right context search that the item of text is semantically equivalent to the target item of text, and the lattice of words including at least one complete sentence and groups of words that are semantically related to portions of the at least one complete sentence.
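The two tests named in claim 18 (a shared left/right context check and a context grouping test based on counts of proximate words) could be combined along the lines of the sketch below. Several things here are assumptions made for illustration: a list of sentences replaces the web index, cosine similarity with a 0.5 threshold is one possible way to use the web counts (the claim says only that the decision is based at least in part on them), the lattice is reduced to a dictionary of alternatives, and all function names are hypothetical.

```python
import math
from collections import Counter

def occurrences(words, tokens):
    """Yield start indexes where the word list occurs in the token list."""
    n = len(words)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == words:
            yield i

def context_pairs(item, corpus):
    """Set of (left word, right word) pairs that surround the item."""
    words, pairs = item.split(), set()
    for sentence in corpus:
        toks = sentence.split()
        for i in occurrences(words, toks):
            j = i + len(words)
            if i > 0 and j < len(toks):
                pairs.add((toks[i - 1], toks[j]))
    return pairs

def proximate_counts(item, corpus, window=2):
    """Count words found within `window` tokens of the item."""
    words, counts = item.split(), Counter()
    for sentence in corpus:
        toks = sentence.split()
        for i in occurrences(words, toks):
            j = i + len(words)
            counts.update(toks[max(0, i - window):i])
            counts.update(toks[j:j + window])
    return counts

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def is_equivalent(item, target, corpus, threshold=0.5):
    """Require both tests to pass (threshold is an arbitrary assumption)."""
    shared = context_pairs(item, corpus) & context_pairs(target, corpus)
    score = cosine(proximate_counts(item, corpus),
                   proximate_counts(target, corpus))
    return bool(shared) and score >= threshold

def add_to_lattice(lattice, item, target, corpus):
    """Record the item as an alternative for the target if it passes."""
    if is_equivalent(item, target, corpus):
        lattice.setdefault(target, set()).add(item)
    return lattice

corpus = [
    "he will buy a car today",
    "he will purchase a car today",
    "they buy a car often",
    "birds sing loudly today",
]
lattice = {}
add_to_lattice(lattice, "purchase", "buy", corpus)
add_to_lattice(lattice, "sing", "buy", corpus)
```

In this toy run "purchase" shares the context pair ("will", "a") with "buy" and has a similar proximate-word profile, so it enters the lattice, while "sing" shares no context and is rejected.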
20. A method for supporting a language processing application, the method comprising:
obtaining a target item of text;
utilizing a context associated with the target item of text to identify another item of text that is semantically equivalent to the target item of text;
combining the target item of text and the another item of text into a slot;
generating an additional context by adding the slot to the context;
utilizing a computer processor that is a component of a computer to perform one or more index queries utilizing the additional context to identify additional items of text that are semantically equivalent to the target item of text and to the another item of text;
collapsing the target item of text, the another item of text, and the additional items of text into another slot;
utilizing the another slot to perform additional index queries to identify strings of words in a web index that correspond to the another slot;
generating a count of a number of occurrences for each of the strings of words in the web index; and
utilizing the counts of the number of occurrences to identify syntactic boundaries.
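The slot expansion and counting steps of claim 20 might look like the following sketch. The claim does not say how the counts are turned into syntactic boundaries, so the scoring rule here (the fraction of slot alternatives attested in a given context) is purely an assumption for illustration, as are the `*` placeholder convention, the in-memory corpus, and the function names.

```python
from collections import Counter

def expand_slot(context, slot):
    """Fill the '*' placeholder in the context with each alternative in the slot."""
    return [context.replace("*", alt) for alt in sorted(slot)]

def count_strings(strings, corpus):
    """Count occurrences of each string (stand-in for web index counts)."""
    return Counter({s: sum(text.count(s) for text in corpus) for s in strings})

def boundary_score(context, slot, corpus):
    """Assumed heuristic: share of slot alternatives that occur in this context.
    A high share suggests the slot edges line up with a syntactic unit."""
    counts = count_strings(expand_slot(context, slot), corpus)
    attested = sum(1 for c in counts.values() if c > 0)
    return attested / len(slot)

corpus = [
    "he will buy a car today",
    "he will purchase a car today",
    "he will acquire a car soon",
]
slot = {"buy", "purchase", "acquire"}  # collapsed equivalents
tight = boundary_score("will * a car", slot, corpus)
loose = boundary_score("will * a car today", slot, corpus)
```

Here every alternative is attested in "will * a car", while extending the context to "today" loses one of them; under the assumed heuristic that drop hints that "today" lies outside the unit the slot participates in.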
Specification