Document-based synonym generation

US 8,161,041 B1
Filed: 02/10/2011
Issued: 04/17/2012
Est. Priority Date: 02/07/2007
Status: Active Grant

Abstract

One embodiment of the present invention provides a system that automatically generates synonyms for words from documents. During operation, this system determines co-occurrence frequencies for pairs of words in the documents. The system also determines closeness scores for pairs of words in the documents, wherein a closeness score indicates whether a pair of words are located so close to each other that the words are likely to occur in the same sentence or phrase. Finally, the system determines whether pairs of words are synonyms based on the determined co-occurrence frequencies and the determined closeness scores. While making this determination, the system can additionally consider correlations between words in a title or an anchor of a document and words in the document as well as word-form scores for pairs of words in the documents.
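
As a rough illustration of the co-occurrence signal the abstract mentions, the sketch below counts how often two words appear in the same document. The function name `cooccurrence_frequency`, the set-of-words document representation, and the choice to normalize by the number of documents containing the first word are all assumptions made for illustration; the abstract does not define the measure.

```python
from typing import Iterable, Set


def cooccurrence_frequency(docs: Iterable[Set[str]], first: str, second: str) -> float:
    """Fraction of documents containing `first` that also contain `second`.

    Assumption: each document is represented as the set of its words.
    Returns 0.0 when `first` never appears.
    """
    with_first = 0
    with_both = 0
    for words in docs:
        if first in words:
            with_first += 1
            if second in words:
                with_both += 1
    return with_both / with_first if with_first else 0.0


docs = [{"car", "automobile"}, {"car", "road"}, {"bike"}]
print(cooccurrence_frequency(docs, "car", "automobile"))
```

A document-level count like this says nothing about how near the words are to each other, which is the gap the closeness score is meant to cover.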

33 Claims

1. A computer-implemented method comprising:
- receiving a pair of words comprising a first word and a second word, where each word appears in a collection of documents;
  
  generating a word-form score for the pair of words based on a consistency of the pair of words with word-form rules, wherein a word-form rule indicates how words with a common portion can vary;
  
  computing a probability that the first word occurs within a first number of words of the second word in the one or more documents in the collection;
  
  computing a probability that the first word occurs within a second number of words of the second word in the one or more documents in the collection, wherein the second number is greater than the first number;
  
  generating a closeness score for the pair of words by dividing the first number by the second number;
  
  computing a relative frequency of occurrence for the first word and the second word in the collection of documents;
  
  generating a correlation between occurrences of a first word in the title or the anchor of the documents and occurrences of a second word in a same document; and
  
  determining that the first word and the second word are synonyms based at least on the correlation, the relative frequency of the first word and the second word, the closeness score, and the word-form score.
- - 2. The method of claim 1, further comprising:
    - determining a co-occurrence frequency for the pair of words in the collection of documents; and
      
      determining that the pair of words are synonyms based at least on the determined the co-occurrence frequency.
  - 3. The method of claim 1, further comprising:
    - generating an alternative search query for a search query that includes one of the words of the pair of words using another word of the pair of words.
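
The alternative-query step recited in claim 3 can be pictured with a minimal sketch. It assumes a hypothetical `synonyms` mapping from a word to its accepted synonyms and a whitespace-tokenized query; neither detail comes from the claims, which say only that one word of the pair is used in place of the other.

```python
from typing import Dict, List, Set


def alternative_queries(query: str, synonyms: Dict[str, Set[str]]) -> List[str]:
    """Return queries that swap exactly one term of `query` for one of its synonyms."""
    terms = query.split()
    alternatives = []
    for i, term in enumerate(terms):
        # sorted() only makes the output order deterministic
        for synonym in sorted(synonyms.get(term, ())):
            alternatives.append(" ".join(terms[:i] + [synonym] + terms[i + 1:]))
    return alternatives


print(alternative_queries("cheap car rental", {"car": {"automobile", "auto"}}))
# → ['cheap auto rental', 'cheap automobile rental']
```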

4. A computer-implemented method comprising:
- receiving a pair of words;
  
  generating a word-form score for the pair of words based on a consistency of the pair of words with word-form rules, wherein a word-form rule indicates how words with a common portion can vary;
  
  determining that the pair of words is a synonym pair based at least on the generated word-form score; and
  
  generating an alternative search query for a search query that includes one of the words of the pair of words using another word of the pair of words.
- - 5. The method of claim 4, further comprising:
    - receiving one or more synonyms; and
      
      generating one or more word-form rules from the synonyms.
  - 6. The method of claim 5, wherein the synonyms share common portions.
  - 7. The method of claim 4, wherein receiving pairs of words comprises receiving pairs of words in which each word in a pair occurs in a document.
  - 8. The method of claim 4, further comprising:
    - determining a co-occurrence frequency for the pair of words in a collection of documents; and
      
      determining that the pair of words are synonyms based at least on the determined co-occurrence frequency.
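
Claims 4 to 6 describe word-form rules that capture how words with a common portion can vary, and rules generated from known synonyms. One way to picture this, offered only as a sketch: treat the common portion as a shared prefix, record the pair of differing suffixes as a rule, and score a candidate pair by how often its rule was seen among known synonyms. The prefix-only view, the `min_prefix` threshold, and the relative-frequency scoring are assumptions, not details taken from the claims.

```python
from collections import Counter
from os.path import commonprefix
from typing import Iterable, Tuple

Rule = Tuple[str, ...]


def extract_rule(a: str, b: str) -> Rule:
    """Differing suffixes of `a` and `b` after their shared prefix, order-independent."""
    prefix = commonprefix([a, b])
    return tuple(sorted((a[len(prefix):], b[len(prefix):])))


def learn_rules(synonyms: Iterable[Tuple[str, str]], min_prefix: int = 3) -> Counter:
    """Count suffix-variation rules over known synonym pairs that share a prefix."""
    rules: Counter = Counter()
    for a, b in synonyms:
        if len(commonprefix([a, b])) >= min_prefix:
            rules[extract_rule(a, b)] += 1
    return rules


def word_form_score(a: str, b: str, rules: Counter, min_prefix: int = 3) -> float:
    """Relative frequency of the pair's rule among the learned rules (0.0 if unseen)."""
    if not rules or len(commonprefix([a, b])) < min_prefix:
        return 0.0
    return rules[extract_rule(a, b)] / sum(rules.values())


known = [("walk", "walking"), ("talk", "talking"), ("city", "cities")]
rules = learn_rules(known)
print(word_form_score("jump", "jumping", rules))  # rule ('', 'ing') seen in 2 of 3 pairs
print(word_form_score("cat", "dog", rules))       # no shared prefix, so 0.0
```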

9. A computer-implemented method comprising:
- receiving a pair of words that includes first word and a second word;
  
  computing a probability that the first word occurs within a first number of words of the second word in a collection of documents;
  
  computing a probability that the first word occurs within a second number of words of the second word in the one or more documents, wherein the second number is greater than the first number;
  
  generating a closeness score for the pair of words by dividing the first number by the second number;
  
  determining a co-occurrence frequency for the pair of words in a collection of documents; and
  
  determining that the pair of words are not synonyms based at least on the generated closeness score and the co-occurrence frequency.
- - 10. The method of claim 9, further comprising:
    - excluding the pair of words from a list of synonyms used to generate alternative search queries.
  - 11. The method of claim 9, wherein the first number is 4 words, and the second number is 100 words.
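
Claims 9 to 11 recite two window probabilities (for example, within 4 words and within 100 words) and a closeness score derived from them. The sketch below is one possible reading, not the patented implementation. It assumes each probability is estimated as the fraction of occurrences of the second word that have the first word within the window, and it assumes the score is the ratio of the near-window probability to the far-window probability. The claim text itself literally recites dividing the first number by the second number. The function names and the token-list document format are hypothetical.

```python
from typing import List, Sequence


def prob_within(docs: Sequence[List[str]], first: str, second: str, window: int) -> float:
    """Fraction of occurrences of `second` with `first` at most `window` tokens away."""
    hits = 0
    total = 0
    for tokens in docs:
        first_positions = [i for i, t in enumerate(tokens) if t == first]
        for j, t in enumerate(tokens):
            if t != second:
                continue
            total += 1
            if any(0 < abs(i - j) <= window for i in first_positions):
                hits += 1
    return hits / total if total else 0.0


def closeness_score(docs: Sequence[List[str]], first: str, second: str,
                    near: int = 4, far: int = 100) -> float:
    """Ratio of the near-window probability to the far-window probability (assumed reading)."""
    p_near = prob_within(docs, first, second, near)
    p_far = prob_within(docs, first, second, far)
    return p_near / p_far if p_far else 0.0


docs = [
    ["car", "and", "automobile"],
    ["car"] + ["x"] * 10 + ["automobile"],
]
print(closeness_score(docs, "car", "automobile"))  # → 0.5
```

Under this reading, a score near 1 means the words almost always sit in the same phrase when they co-occur at all, which the abstract treats as a closeness signal to weigh alongside co-occurrence frequency.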

12. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  receiving a pair of words comprising a first word and a second word, where each word appears in a collection of documents;
  
  generating a word-form score for the pair of words based on a consistency of the pair of words with word-form rules, wherein a word-form rule indicates how words with a common portion can vary;
  
  computing a probability that the first word occurs within a first number of words of the second word in the one or more documents in the collection;
  
  computing a probability that the first word occurs within a second number of words of the second word in the one or more documents in the collection, wherein the second number is greater than the first number;
  
  generating a closeness score for the pair of words by dividing the first number by the second number;
  
  computing a relative frequency of occurrence for the first word and the second word in the collection of documents;
  
  generating a correlation between occurrences of a first word in the title or the anchor of the documents and occurrences of a second word in a same document; and
  
  determining that the first word and the second word are synonyms based at least on the correlation, the relative frequency of the first word and the second word, the closeness score, and the word-form score.
- - 13. The system of claim 12, further comprising:
    - determining a co-occurrence frequency for the pair of words in the collection of documents; and
      
      determining that the pair of words are synonyms based at least on the determined the co-occurrence frequency.
  - 14. The system of claim 12, further comprising:
    - generating an alternative search query for a search query that includes one of the words of the pair of words using another word of the pair of words.

15. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  receiving a pair of words;
  
  generating a word-form score for the pair of words based on a consistency of the pair of words with word-form rules, wherein a word-form rule indicates how words with a common portion can vary;
  
  determining that the pair of words is a synonym pair based at least on the generated word form score; and
  
  generating an alternative search query for a search query that includes one of the words of the pair of words using another word of the pair of words.
- - 16. The system of claim 15, wherein the operations further comprise:
    - receiving one or more synonyms; and
      
      generating one or more word-form rules from the synonyms.
  - 17. The system of claim 16, wherein the synonyms share common portions.
  - 18. The system of claim 15, wherein receiving pairs of words comprises receiving pairs of words in which each word in a pair occurs in a document.
  - 19. The system of claim 15, wherein the operations further comprise:
    - determining a co-occurrence frequency for the pair of words in a collection of documents; and
      
      determining that the pair of words are synonyms based at least on the determined co-occurrence frequency.

20. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  receiving a pair of words that includes first word and a second word;
  
  computing a probability that the first word occurs within a first number of words of the second word in a collection of documents;
  
  computing a probability that the first word occurs within a second number of words of the second word in the one or more documents, wherein the second number is greater than the first number;
  
  generating a closeness score for the pair of words by dividing the first number by the second number;
  
  determining a co-occurrence frequency for the pair of words in a collection of documents; and
  
  determining that the pair of words are not synonyms based at least on the generated closeness score and the co-occurrence frequency.
- - 21. The system of claim 20, wherein the operations further comprise:
    - excluding the pair of words from a list of synonyms used to generate alternative search queries.
  - 22. The system of claim 20, wherein the first number is 4 words, and the second number is 100 words.

23. A computer program product, encoded on one or more non-transitory computer storage media, comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
- receiving a pair of words comprising a first word and a second word, where each word appears in a collection of documents;
  
  generating a word-form score for the pair of words based on a consistency of the pair of words with word-form rules, wherein a word-form rule indicates how words with a common portion can vary;
  
  computing a probability that the first word occurs within a first number of words of the second word in the one or more documents in the collection;
  
  computing a probability that the first word occurs within a second number of words of the second word in the one or more documents in the collection, wherein the second number is greater than the first number;
  
  generating a closeness score for the pair of words by dividing the first number by the second number;
  
  computing a relative frequency of occurrence for the first word and the second word in the collection of documents;
  
  generating a correlation between occurrences of a first word in the title or the anchor of the documents and occurrences of a second word in a same document; and
  
  determining that the first word and the second word are synonyms based at least on the correlation, the relative frequency of the first word and the second word, the closeness score, and the word-form score.
- - 24. The computer program product of claim 23, further comprising:
    - determining a co-occurrence frequency for the pair of words in the collection of documents; and
      
      determining that the pair of words are synonyms based at least on the determined the co-occurrence frequency.
  - 25. The computer program product of claim 23, further comprising:
    - generating an alternative search query for a search query that includes one of the words of the pair of words using another word of the pair of words.

26. A computer program product, encoded on one or more non-transitory computer storage media, comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
- receiving a pair of words;
  
  generating a word-form score for the pair of words based on a consistency of the pair of words with word-form rules, wherein a word-form rule indicates how words with a common portion can vary; and
  
  determining that the pair of words is a synonym pair based at least on the generated word-form score; and
  
  generating an alternative search query for a search query that includes one of the words of the pair of words using another word of the pair of words.
- - 27. The computer program product of claim 26, wherein the operations further comprise:
    - receiving one or more synonyms; and
      
      generating one or more word-form rules from the synonyms.
  - 28. The computer program product of claim 27, wherein the synonyms share common portions.
  - 29. The computer program product of claim 26, wherein receiving pairs of words comprises receiving pairs of words in which each word in a pair occurs in a document.
  - 30. The computer program product of claim 26, wherein the operations further comprise:
    - determining a co-occurrence frequency for the pair of words in a collection of documents; and
      
      determining that the pair of words are synonyms based at least on the determined co-occurrence frequency.

31. A computer program product, encoded on one or more non-transitory computer storage media, comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
- receiving a pair of words that includes first word and a second word;
  
  computing a probability that the first word occurs within a first number of words of the second word in a collection of documents;
  
  computing a probability that the first word occurs within a second number of words of the second word in the one or more documents, wherein the second number is greater than the first number;
  
  generating a closeness score for the pair of words by dividing the first number by the second number;
  
  determining a co-occurrence frequency for the pair of words in a collection of documents; and
  
  determining that the pair of words are not synonyms based at least on the generated closeness score and the co-occurrence frequency.
- - 32. The computer program product of claim 31, wherein the operations further comprise:
    - excluding the pair of words from a list of synonyms used to generate alternative search queries.
  - 33. The computer program product of claim 31, wherein the first number is 4 words, and the second number is 100 words.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Grushetskyy, Oleksandr; Baker, Steven D.
Primary Examiner(s): Breene, John E
Assistant Examiner(s): Ly, Anh
Application Number: US13/024,731
Time in Patent Office: 432 Days
Field of Search: 707/727, 707/728, 707/730, 707/748, 707/750, 707/E17.096, 707/E17.072, 704/205, 704/9, 715/260
US Class Current: 707/727
CPC Class Codes:
G06F 16/9532   Query formulation
G06F 16/9535   Search customisation based ...
G06F 40/247   Thesauruses; Synonyms
