Paraphrasing the web by search-based data collection

US 8,244,521 B2
Filed: 03/16/2007
Issued: 08/14/2012
Est. Priority Date: 01/11/2007
Status: Active Grant


Abstract

String-oriented web queries are utilized as a tool to examine the fabric of how words, phrases and/or n-grams alternate in a language. This fabric is exploited in order to build up a matrix of semantically equivalent pieces of language. In one embodiment, the Distributional Hypothesis is utilized, along with strategies for confirming synonymy, to systematically build up a picture of what words/phrases can be legitimately substituted for one another.
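The Distributional Hypothesis mentioned in the abstract can be illustrated with a minimal sketch. The toy corpus, the one-word context width, and the function names below are assumptions made for illustration; the patent describes queries against a web index, not an in-memory list.

```python
from collections import defaultdict

# Toy stand-in for a web index (hypothetical data, not from the patent).
CORPUS = [
    "the car would not start this morning",
    "the automobile would not start this morning",
    "she parked the car near the station",
    "she parked the automobile near the station",
    "she parked the banana near the station",
]

def contexts(corpus, width=1):
    """Map each word to the set of (left, right) context pairs it occurs in."""
    table = defaultdict(set)
    for sentence in corpus:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            left = tuple(tokens[max(0, i - width):i])
            right = tuple(tokens[i + 1:i + 1 + width])
            table[tok].add((left, right))
    return table

def shared_contexts(table, a, b):
    """Return the contexts in which both words have been observed."""
    return table[a] & table[b]

table = contexts(CORPUS)
print(len(shared_contexts(table, "car", "automobile")))
print(len(shared_contexts(table, "car", "banana")))
```

In this toy data "car" and "automobile" share every context, while "banana" shares one context with "car" without being a synonym. That is the reason the abstract pairs the hypothesis with separate strategies for confirming synonymy.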

20 Claims

1. A method for supporting a language processing application, the method comprising:
- obtaining a starting target phrase of words that includes a first word, a last word, and a group of other words between the first and the last words;
  
  identifying n-grams that correspond to the starting target phrase of words, each of the n-grams including a portion of the words in the starting target phrase, each n-gram being different from the other n-grams, at least some of the n-grams having different numbers of words, a first group of the n-grams including the first word and combinations of the other words, a second group of the n-grams including the last word and different combinations of the other words;
  
  utilizing each of the n-grams to search a web index to identify other phrases of words that are different than the starting target phrase of words and that include the n-grams;
  
  simultaneously identifying a plurality of left contexts and a plurality of right contexts for each of the n-grams by identifying words in the other phrases that precede and follow the n-grams in the other phrases, the plurality of left contexts including the words that precede the n-grams, and the plurality of right contexts including the words that follow the n-grams;
  
  combining the plurality of left contexts and the plurality of right contexts for each of the n-grams with a wildcard to search for a list of phrases of words that are distributionally similar to the starting target phrase of words, the phrases of words including phrases having different numbers of words and including some phrases that are semantically similar to the starting target phrase and some phrases that are not semantically similar to the starting target phrase;
  
  utilizing a computer processor that is a component of a computer to determine, based on results of multiple index queries, whether each phrase in the list of distributionally similar phrases of words is semantically equivalent to the starting target phrase of words, the multiple index queries including a query that explicitly identifies a presence or an absence of a synonymy relationship between the starting target phrase of words and the list of distributionally similar phrases of words in a string of text; and
  
  adding a portion of the list of phrases of words that are determined to be semantically equivalent to the starting target phrase of words to a lattice of replacement candidates for the starting target phrase of words.
- Dependent claims 2-17:
  - 2. The method of claim 1, wherein identifying the plurality of right contexts and the plurality of left contexts for each of the n-grams includes utilizing a search engine to execute a query against content of a world wide web, wherein the multiple index queries also include performing additional left and right context searches utilizing each of the phrases of words, and wherein phrases within the phrases of words are determined to be semantically similar to the starting target phrase of words based at least in part on the phrases having one or more left or right contexts in common with the starting target phrase of words.
  - 3. The method of claim 1, wherein determining whether each phrase in the list of distributionally similar phrases of words is semantically equivalent to the starting target phrase of words comprises utilizing an index query engine to identify whether a string of text indicates that the starting target phrase of words and one of the distributionally similar phrases of words are equivalents.
  - 4. The method of claim 1, wherein the web index includes web pages, text documents, and multimedia files.
  - 5. The method of claim 1, wherein determining whether each phrase in the list of distributionally similar phrases of words is semantically equivalent to the starting target phrase of words includes utilizing a statistical classifier to make a confirm/deny decision.
  - 6. The method of claim 5, wherein the statistical classifier utilizes heuristic query results as features.
  - 7. The method of claim 1, wherein determining whether each phrase in the list of distributionally similar phrases of words is semantically equivalent to the starting target phrase of words includes identifying neologram reinforcements.
  - 8. The method of claim 1, wherein determining whether each phrase in the list of distributionally similar phrases of words is semantically equivalent to the starting target phrase of words includes utilizing an index query engine to perform at least one world-wide-web query.
  - 9. The method of claim 1, wherein determining whether each phrase in the list of distributionally similar phrases of words is semantically equivalent to the starting target phrase of words includes utilizing a set of templatic queries that reflect association heuristics.
  - 10. The method of claim 1, wherein determining whether each phrase in the list of distributionally similar phrases of words is semantically equivalent to the starting target phrase of words includes utilizing a predetermined number of wildcard n-grams as a basis for inferring semantic similarity.
  - 11. The method of claim 1, wherein determining whether each phrase in the list of distributionally similar phrases of words is semantically equivalent to the starting target phrase of words includes utilizing a coordination pattern.
  - 12. The method of claim 1, wherein determining whether each phrase in the list of distributionally similar phrases of words is semantically equivalent to the starting target phrase of words includes utilizing a negative coordination pattern.
  - 13. The method of claim 1, wherein determining whether each phrase in the list of distributionally similar phrases of words is semantically equivalent to the starting target phrase of words includes identifying a presence or an absence of strings signaling an explicit synonymy relationship.
  - 14. The method of claim 1, wherein determining whether each phrase in the list of distributionally similar phrases of words is semantically equivalent to the starting target phrase of words includes determining whether there is a coordination pattern indicative of a morphological alteration.
  - 15. The method of claim 1, wherein determining whether each phrase in the list of distributionally similar phrases of words is semantically equivalent to the starting target phrase of words includes utilizing context grouping.
  - 16. The method of claim 15, wherein the context grouping is performed based on the plurality of left contexts for each of the n-grams.
  - 17. The method of claim 15, wherein the context grouping is performed based on the plurality of right contexts for each of the n-grams.
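The n-gram identification and wildcard query steps of claim 1 can be sketched as follows. This is one possible reading, not the patented implementation: it assumes the n-grams are contiguous prefixes (anchored on the first word) and suffixes (anchored on the last word) of the target phrase, and it assumes a quoted-string query syntax with `*` as the wildcard. The function names are hypothetical.

```python
def anchored_ngrams(phrase):
    """Split a phrase into two groups of proper sub-phrases of varying length.

    Assumption: the first group is every prefix that starts at the first
    word, and the second group is every suffix that ends at the last word.
    The full phrase itself is excluded from both groups.
    """
    words = phrase.split()
    n = len(words)
    first_group = [" ".join(words[:k]) for k in range(1, n)]
    second_group = [" ".join(words[k:]) for k in range(1, n)]
    return first_group, second_group

def wildcard_queries(left_contexts, right_contexts, wildcard="*"):
    """Pair every left context with every right context around a wildcard."""
    return [
        f'"{left} {wildcard} {right}"'
        for left in left_contexts
        for right in right_contexts
    ]

first, second = anchored_ngrams("the quick brown fox")
print(first)
print(second)
print(wildcard_queries(["bought a"], ["last week", "for cash"]))
```

Whatever fills the wildcard position in the returned results is only distributionally similar to the target, so each candidate still has to pass the equivalence checks described in the final steps of claim 1 and in the dependent claims.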

18. A language processing system, comprising:
- an index query engine; and
  
  a processing component that utilizes the index query engine to perform a first left and right context search to identify an item of text that is distributionally similar to a target item of text, the first left and right context search including generating n-grams based on the target item of text and utilizing the n-grams to identify left contexts and right contexts for each of the n-grams, the item of text being identified based on it having one or more of the left and right contexts in common with the target item of text, the processing component applying multiple tests to the item of text to determine whether the item of text is semantically equivalent to the target item of text, one of the multiple tests including a second left and right context search that utilizes the item of text and the target item of text to identify additional left and right contexts for the item of text and the target item of text, the processing component making a determination that the item of text and the target item of text are semantically equivalent based on the item of text having one or more of the additional left and right contexts in common with target item of text, another one of the multiple tests including a context grouping test, the context grouping test including performing a web search to identify lists of words that are found proximate to the item of text and the target item of text in a world wide web, each word in the lists of words being associated with a web count that identifies a number of times the word is found in the web search, the context grouping test making a determination that the item of text and the target item of text are semantically equivalent based at least in part on the web counts, the item of text being added to a lattice of words for the target item of text based on the determinations from the context grouping test and from the second left and right context search that the item of text is semantically equivalent to the target item of text, and 
  
  the lattice of words including at least one complete sentence and groups of words that are semantically related to portions of the at least one complete sentence.
- Dependent claim 19:
  - 19. The system of claim 18, wherein a third one of the multiple tests includes a negative coordination pattern test.
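The context grouping test in claim 18 compares lists of nearby words, each carrying a web count. The claim does not say how the counts are combined, so the sketch below substitutes a weighted Jaccard overlap as one plausible scoring rule. The counts, the scoring function, and any decision threshold are assumptions for illustration only.

```python
from collections import Counter

def weighted_overlap(a: Counter, b: Counter) -> float:
    """Weighted Jaccard overlap of two word-count tables, between 0 and 1."""
    keys = set(a) | set(b)
    shared = sum(min(a[k], b[k]) for k in keys)  # Counter gives 0 for missing keys
    total = sum(max(a[k], b[k]) for k in keys)
    return shared / total if total else 0.0

# Hypothetical web counts of words found near each item of text.
target = Counter({"engine": 40, "driver": 30, "parked": 30})
candidate = Counter({"engine": 35, "driver": 25, "parked": 20, "dealer": 20})
unrelated = Counter({"peel": 50, "ripe": 40, "parked": 10})

print(weighted_overlap(target, candidate))
print(weighted_overlap(target, unrelated))
```

A high score would count as evidence for equivalence and a low score as evidence against it. Under claim 18 this test is only one of several whose determinations are combined before an item is added to the lattice.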

20. A method for supporting a language processing application, the method comprising:
- obtaining a target item of text;
  
  utilizing a context associated with the target item of text to identify another item of text that is semantically equivalent to the target item of text;
  
  combining the target item of text and the another item of text into a slot;
  
  generating an additional context by adding the slot to the context;
  
  utilizing a computer processor that is a component of a computer to perform one or more index queries utilizing the additional context to identify additional items of text that are semantically equivalent to the target item of text and to the another item of text;
  
  collapsing the target item of text, the another item of text, and the additional items of text into another slot;
  
  utilizing the another slot to perform additional index queries to identify strings of words in a web index that correspond to the another slot;
  
  generating a count of a number of occurrences for each of the strings of words in the web index; and
  
  utilizing the counts of the number of occurrences to identify syntactic boundaries.
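The counting step near the end of claim 20 can be sketched as below. The sketch assumes a slot is a set of single words treated as interchangeable, uses a small in-memory list in place of a web index, and counts only the strings that follow a slot member. How the counts are turned into syntactic boundaries is not spelled out in the claim, so the sketch stops at counting. All names and data are hypothetical.

```python
from collections import Counter

# Toy stand-in for a web index (hypothetical data).
CORPUS = [
    "the car would not start today",
    "the automobile would not start again",
    "my car would not move",
    "the car would never stop",
]

def slot_extensions(corpus, slot, max_len=3):
    """Count the word strings (1 to max_len words) that follow any slot member."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok not in slot:
                continue
            for k in range(1, max_len + 1):
                ext = tokens[i + 1:i + 1 + k]
                if len(ext) == k:  # skip extensions cut short by the sentence end
                    counts[" ".join(ext)] += 1
    return counts

counts = slot_extensions(CORPUS, {"car", "automobile"})
for text, n in counts.most_common():
    print(n, text)
```

Because both slot members feed the same tally, strings such as "would not" accumulate counts from either word. The way counts fall off as the strings lengthen is the kind of signal that could then be inspected for boundaries.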

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Dolan, William
Primary Examiner(s)
Hudspeth, David R
Assistant Examiner(s)
Spooner, Lamont

Application Number
US11/724,703
Publication Number
US 20080172378A1
Time in Patent Office
1,978 Days
Field of Search
704/1, 704/9, 704/10, 707/706-708, 707/718
US Class Current
704/9
CPC Class Codes
G06F 16/951 Indexing; Web crawling tech...
