Document-based synonym generation
Abstract
One embodiment of the present invention provides a system that automatically generates synonyms for words from documents. During operation, this system determines co-occurrence frequencies for pairs of words in the documents. The system also determines closeness scores for pairs of words in the documents, wherein a closeness score indicates whether a pair of words are located so close to each other that the words are likely to occur in the same sentence or phrase. Finally, the system determines whether pairs of words are synonyms based on the determined co-occurrence frequencies and the determined closeness scores. While making this determination, the system can additionally consider correlations between words in a title or an anchor of a document and words in the document as well as word-form scores for pairs of words in the documents.
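The co-occurrence step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not define how the frequency is measured, so the sketch assumes a co-occurrence frequency is the fraction of documents that contain both words. The function name `cooccurrence_frequencies` and the whitespace tokenization are hypothetical choices.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_frequencies(documents):
    """Map each unordered word pair to the fraction of documents containing both.

    Assumption: "co-occurrence frequency" is measured at document level.
    """
    pair_counts = Counter()
    for text in documents:
        # Deduplicate and sort so each pair has one canonical (a, b) ordering.
        vocabulary = sorted(set(text.lower().split()))
        pair_counts.update(combinations(vocabulary, 2))
    total = len(documents)
    return {pair: count / total for pair, count in pair_counts.items()}


docs = ["cheap car rental", "cheap auto rental", "car and auto repair"]
freqs = cooccurrence_frequencies(docs)
```

Here `freqs[("cheap", "rental")]` is 2/3 because two of the three documents contain both words, while `freqs[("auto", "car")]` is 1/3.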
33 Claims
1. A computer-implemented method comprising:
receiving a pair of words comprising a first word and a second word, where each word appears in a collection of documents;
generating a word-form score for the pair of words based on a consistency of the pair of words with word-form rules, wherein a word-form rule indicates how words with a common portion can vary;
computing a probability that the first word occurs within a first number of words of the second word in the one or more documents in the collection;
computing a probability that the first word occurs within a second number of words of the second word in the one or more documents in the collection, wherein the second number is greater than the first number;
generating a closeness score for the pair of words by dividing the first number by the second number;
computing a relative frequency of occurrence for the first word and the second word in the collection of documents;
generating a correlation between occurrences of a first word in the title or the anchor of the documents and occurrences of a second word in a same document; and
determining that the first word and the second word are synonyms based at least on the correlation, the relative frequency of the first word and the second word, the closeness score, and the word-form score.
Dependent claims: 2, 3
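One way to picture the word-form score in claim 1 is a check of whether two words share a stem and differ only by a known suffix alternation. The sketch below is an assumption-laden illustration: the claim does not list any rules, so `SUFFIX_RULES`, the `min_stem` parameter, and the binary 1.0/0.0 scoring are all hypothetical.

```python
import os

# Hypothetical word-form rules: suffix pairs that may alternate on a shared stem.
SUFFIX_RULES = {("", "s"), ("", "es"), ("y", "ies"), ("", "ing"), ("e", "ing"), ("", "ed")}


def word_form_score(first, second, min_stem=3):
    """Return 1.0 if the pair is consistent with some rule, else 0.0.

    Assumption: a pair is "consistent" when the words share a stem of at
    least `min_stem` characters and their remaining suffixes match a rule.
    """
    common = len(os.path.commonprefix([first, second]))
    # Try every stem length, since the common prefix may include suffix characters.
    for stem_len in range(min_stem, common + 1):
        suffixes = (first[stem_len:], second[stem_len:])
        if suffixes in SUFFIX_RULES or suffixes[::-1] in SUFFIX_RULES:
            return 1.0
    return 0.0
```

Under these assumed rules, `word_form_score("carry", "carries")` and `word_form_score("make", "making")` both return 1.0, while `word_form_score("car", "cat")` returns 0.0 because the shared stem is too short. A real system would likely use a graded score rather than a binary one.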
4. A computer-implemented method comprising:
receiving a pair of words;
generating a word-form score for the pair of words based on a consistency of the pair of words with word-form rules, wherein a word-form rule indicates how words with a common portion can vary;
determining that the pair of words is a synonym pair based at least on the generated word-form score; and
generating an alternative search query for a search query that includes one of the words of the pair of words using another word of the pair of words.
Dependent claims: 5, 6, 7, 8
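The final step of claim 4, generating an alternative search query, can be illustrated by substituting one word of a synonym pair for the other. This is a sketch only: the function name `alternative_queries`, the whitespace tokenization, and the one-substitution-per-alternative behavior are assumptions, not details given in the claim.

```python
def alternative_queries(query, synonym_pairs):
    """Return queries formed by swapping one term for its paired synonym.

    Assumption: each alternative replaces exactly one term of the original query.
    """
    terms = query.split()
    alternatives = []
    for index, term in enumerate(terms):
        for first, second in synonym_pairs:
            if term == first:
                replacement = second
            elif term == second:
                replacement = first
            else:
                continue
            rewritten = terms[:index] + [replacement] + terms[index + 1:]
            alternatives.append(" ".join(rewritten))
    return alternatives
```

For example, `alternative_queries("cheap car rental", [("car", "auto")])` returns `["cheap auto rental"]`, and the substitution works in either direction of the pair.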
9. A computer-implemented method comprising:
receiving a pair of words that includes a first word and a second word;
computing a probability that the first word occurs within a first number of words of the second word in a collection of documents;
computing a probability that the first word occurs within a second number of words of the second word in the one or more documents, wherein the second number is greater than the first number;
generating a closeness score for the pair of words by dividing the first number by the second number;
determining a co-occurrence frequency for the pair of words in a collection of documents; and
determining that the pair of words are not synonyms based at least on the generated closeness score and the co-occurrence frequency.
Dependent claims: 10, 11
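The two windowed probabilities and the closeness score in claim 9 can be sketched as below. Two assumptions are made and should be read as such. First, the probability is estimated as the fraction of occurrences of the second word that have the first word within the window. Second, although the claim text speaks of dividing "the first number by the second number", this sketch assumes the intended quantity is the ratio of the narrow-window probability to the wide-window probability. All function names are hypothetical.

```python
def windowed_probability(documents, first, second, window):
    """Fraction of occurrences of `second` with `first` within `window` positions."""
    hits = 0
    total = 0
    for text in documents:
        tokens = text.lower().split()
        first_positions = [i for i, token in enumerate(tokens) if token == first]
        for j, token in enumerate(tokens):
            if token != second:
                continue
            total += 1
            if any(abs(i - j) <= window for i in first_positions):
                hits += 1
    return hits / total if total else 0.0


def closeness_score(documents, first, second, small_window, large_window):
    """Assumed closeness: P(within small window) / P(within large window)."""
    if small_window >= large_window:
        raise ValueError("the second window must be larger than the first")
    p_small = windowed_probability(documents, first, second, small_window)
    p_large = windowed_probability(documents, first, second, large_window)
    return p_small / p_large if p_large else 0.0


docs = ["new york city hotels", "york is far north of the new bridge"]
score = closeness_score(docs, "new", "york", small_window=2, large_window=8)
```

In this toy collection, "new" falls within 2 words of "york" for one of the two occurrences of "york" and within 8 words for both, so `score` is 0.5. A score near 1.0 suggests the words tend to sit in the same phrase. The thresholds that turn this score and the co-occurrence frequency into a "not synonyms" decision are not given in the claim and are left out of the sketch.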
12. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
receiving a pair of words comprising a first word and a second word, where each word appears in a collection of documents;
generating a word-form score for the pair of words based on a consistency of the pair of words with word-form rules, wherein a word-form rule indicates how words with a common portion can vary;
computing a probability that the first word occurs within a first number of words of the second word in the one or more documents in the collection;
computing a probability that the first word occurs within a second number of words of the second word in the one or more documents in the collection, wherein the second number is greater than the first number;
generating a closeness score for the pair of words by dividing the first number by the second number;
computing a relative frequency of occurrence for the first word and the second word in the collection of documents;
generating a correlation between occurrences of a first word in the title or the anchor of the documents and occurrences of a second word in a same document; and
determining that the first word and the second word are synonyms based at least on the correlation, the relative frequency of the first word and the second word, the closeness score, and the word-form score.
Dependent claims: 13, 14
15. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
receiving a pair of words;
generating a word-form score for the pair of words based on a consistency of the pair of words with word-form rules, wherein a word-form rule indicates how words with a common portion can vary;
determining that the pair of words is a synonym pair based at least on the generated word-form score; and
generating an alternative search query for a search query that includes one of the words of the pair of words using another word of the pair of words.
Dependent claims: 16, 17, 18, 19
20. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
receiving a pair of words that includes a first word and a second word;
computing a probability that the first word occurs within a first number of words of the second word in a collection of documents;
computing a probability that the first word occurs within a second number of words of the second word in the one or more documents, wherein the second number is greater than the first number;
generating a closeness score for the pair of words by dividing the first number by the second number;
determining a co-occurrence frequency for the pair of words in a collection of documents; and
determining that the pair of words are not synonyms based at least on the generated closeness score and the co-occurrence frequency.
Dependent claims: 21, 22
23. A computer program product, encoded on one or more non-transitory computer storage media, comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
receiving a pair of words comprising a first word and a second word, where each word appears in a collection of documents;
generating a word-form score for the pair of words based on a consistency of the pair of words with word-form rules, wherein a word-form rule indicates how words with a common portion can vary;
computing a probability that the first word occurs within a first number of words of the second word in the one or more documents in the collection;
computing a probability that the first word occurs within a second number of words of the second word in the one or more documents in the collection, wherein the second number is greater than the first number;
generating a closeness score for the pair of words by dividing the first number by the second number;
computing a relative frequency of occurrence for the first word and the second word in the collection of documents;
generating a correlation between occurrences of a first word in the title or the anchor of the documents and occurrences of a second word in a same document; and
determining that the first word and the second word are synonyms based at least on the correlation, the relative frequency of the first word and the second word, the closeness score, and the word-form score.
Dependent claims: 24, 25
26. A computer program product, encoded on one or more non-transitory computer storage media, comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
receiving a pair of words;
generating a word-form score for the pair of words based on a consistency of the pair of words with word-form rules, wherein a word-form rule indicates how words with a common portion can vary;
determining that the pair of words is a synonym pair based at least on the generated word-form score; and
generating an alternative search query for a search query that includes one of the words of the pair of words using another word of the pair of words.
Dependent claims: 27, 28, 29, 30
31. A computer program product, encoded on one or more non-transitory computer storage media, comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
receiving a pair of words that includes a first word and a second word;
computing a probability that the first word occurs within a first number of words of the second word in a collection of documents;
computing a probability that the first word occurs within a second number of words of the second word in the one or more documents, wherein the second number is greater than the first number;
generating a closeness score for the pair of words by dividing the first number by the second number;
determining a co-occurrence frequency for the pair of words in a collection of documents; and
determining that the pair of words are not synonyms based at least on the generated closeness score and the co-occurrence frequency.
Dependent claims: 32, 33
Specification