Document-based synonym generation
Abstract
One embodiment of the present invention provides a system that automatically generates synonyms for words from documents. During operation, this system determines co-occurrence frequencies for pairs of words in the documents. The system also determines closeness scores for pairs of words in the documents, wherein a closeness score indicates whether a pair of words are located so close to each other that the words are likely to occur in the same sentence or phrase. Finally, the system determines whether pairs of words are synonyms based on the determined co-occurrence frequencies and the determined closeness scores. While making this determination, the system can additionally consider correlations between words in a title or an anchor of a document and words in the document as well as word-form scores for pairs of words in the documents.
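The two signals named in the abstract, co-occurrence frequency and closeness, can be illustrated with a minimal sketch. The abstract does not say how the scores are computed or combined, so everything below is an assumption made for illustration: the whitespace tokenizer, the function names, the token `window`, and the choice to define closeness as the fraction of co-occurring documents in which the two words fall within that window.

```python
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (a stand-in for real tokenization)."""
    return text.lower().split()


def cooccurrence_count(docs: Sequence[List[str]], a: str, b: str) -> int:
    """Count the documents that contain both words."""
    return sum(1 for tokens in docs if a in tokens and b in tokens)


def closeness_score(docs: Sequence[List[str]], a: str, b: str,
                    window: int = 5) -> float:
    """Fraction of co-occurring documents in which `a` and `b` appear
    within `window` tokens of each other (assumed definition)."""
    both = 0
    close = 0
    for tokens in docs:
        pos_a = [i for i, t in enumerate(tokens) if t == a]
        pos_b = [i for i, t in enumerate(tokens) if t == b]
        if not pos_a or not pos_b:
            continue  # the pair does not co-occur in this document
        both += 1
        if any(abs(i - j) <= window for i in pos_a for j in pos_b):
            close += 1
    return close / both if both else 0.0
```

A caller would compute both values for each candidate pair and pass them to whatever decision rule is in use; the sketch deliberately stops short of that rule because the abstract does not specify one.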
18 Claims
1. A computer-implemented method comprising:
- selecting a pair of words;
- receiving a document having an associated title or anchor;
- determining that a word of the pair of words occurs in the title or anchor of the document and that a different word of the pair of words occurs within the document;
- determining a co-occurrence frequency of the word of the pair of words and the different word of the pair of words in a collection of documents; and
- determining that the word and the different word are synonyms based at least upon determining that the word of the pair of words occurs in the title or anchor of the document and that the different word of the pair of words occurs within the document and based at least on the determined co-occurrence frequency of the word of the pair of words and the different word of the pair of words.
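The steps of claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the `Document` type, the helper names, the whitespace tokenization, the definition of co-occurrence frequency as the share of documents whose body contains both words, and the `min_frequency` threshold are all hypothetical. The claim only says the decision is based at least on the title-or-anchor match and on the co-occurrence frequency.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Set, Tuple


@dataclass
class Document:
    title: str
    body: str
    anchors: List[str] = field(default_factory=list)  # anchor texts, if any


def _tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens (assumed tokenization)."""
    return set(text.lower().split())


def title_anchor_match(doc: Document, pair: Tuple[str, str]) -> bool:
    """True if one word of the pair is in the title or an anchor and the
    other word is in the document body."""
    header = _tokens(doc.title)
    for anchor in doc.anchors:
        header |= _tokens(anchor)
    body = _tokens(doc.body)
    a, b = pair
    return (a in header and b in body) or (b in header and a in body)


def cooccurrence_frequency(collection: Sequence[Document],
                           pair: Tuple[str, str]) -> float:
    """Share of documents whose body contains both words (assumed definition)."""
    if not collection:
        return 0.0
    a, b = pair
    hits = sum(1 for d in collection
               if a in _tokens(d.body) and b in _tokens(d.body))
    return hits / len(collection)


def are_synonyms(doc: Document, collection: Sequence[Document],
                 pair: Tuple[str, str], min_frequency: float = 0.1) -> bool:
    """Combine both signals; the threshold is a hypothetical parameter."""
    return (title_anchor_match(doc, pair)
            and cooccurrence_frequency(collection, pair) >= min_frequency)
```

Requiring both signals at once is only one way to read "based at least on"; a weighted score would fit the claim language equally well.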
4. A computer-implemented method comprising:
- receiving a first word and a second word;
- computing a frequency with which the first word occurs in a collection of documents and a frequency with which the second word occurs in the collection of documents;
- determining that the frequency with which the first word occurs in the collection of documents is greater than the frequency with which the second word occurs in the collection of documents by at least a certain amount; and
- determining that the pair of words are not synonyms based at least on determining that the frequency with which the first word occurs in the collection of documents is greater than the frequency with which the second word occurs in the collection of documents by at least the certain amount.
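The frequency test of claim 4 can be sketched in a few lines. The claim does not define "a certain amount", so this sketch assumes an absolute difference in raw occurrence counts; the function names and the `min_gap` parameter are hypothetical, and a ratio-based test would be an equally plausible reading.

```python
from collections import Counter
from typing import Iterable


def word_frequencies(collection: Iterable[str]) -> Counter:
    """Count occurrences of each lowercased whitespace token."""
    counts: Counter = Counter()
    for text in collection:
        counts.update(text.lower().split())
    return counts


def ruled_out_by_frequency(counts: Counter, first: str, second: str,
                           min_gap: int) -> bool:
    """True if `first` is more frequent than `second` by at least `min_gap`
    occurrences, in which case the pair is treated as not synonymous."""
    # Counter returns 0 for words that never occur in the collection.
    return counts[first] - counts[second] >= min_gap
```

For example, with counts built from a small collection, a very common word such as "the" would be ruled out as a synonym of a rare word once the gap in counts reaches `min_gap`.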
7. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
- selecting a pair of words;
- receiving a document having an associated title or anchor;
- determining that a word of the pair of words occurs in the title or anchor of the document and that a different word of the pair of words occurs within the document;
- determining a co-occurrence frequency of the word of the pair of words and the different word of the pair of words in a collection of documents; and
- determining that the word and the different word are synonyms based at least upon determining that the word of the pair of words occurs in the title or anchor of the document and that the different word of the pair of words occurs within the document and based at least on the determined co-occurrence frequency of the word of the pair of words and the different word of the pair of words.
10. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
- receiving a first word and a second word;
- computing a frequency with which the first word occurs in a collection of documents and a frequency with which the second word occurs in the collection of documents;
- determining that the frequency with which the first word occurs in the collection of documents is greater than the frequency with which the second word occurs in the collection of documents by at least a certain amount; and
- determining that the pair of words are not synonyms based at least on determining that the frequency with which the first word occurs in the collection of documents is greater than the frequency with which the second word occurs in the collection of documents by at least the certain amount.
13. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
- selecting a pair of words;
- receiving a document having an associated title or anchor;
- determining that a word of the pair of words occurs in the title or anchor of the document and that a different word of the pair of words occurs within the document;
- determining a co-occurrence frequency of the word of the pair of words and the different word of the pair of words in a collection of documents; and
- determining that the word and the different word are synonyms based at least upon determining that the word of the pair of words occurs in the title or anchor of the document and that the different word of the pair of words occurs in text within the document and based at least on the determined co-occurrence frequency of the word of the pair of words and the different word of the pair of words.
16. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
- receiving a first word and a second word;
- computing a frequency with which the first word occurs in a collection of documents and a frequency with which the second word occurs in the collection of documents;
- determining that the frequency with which the first word occurs in the collection of documents is greater than the frequency with which the second word occurs in the collection of documents by at least a certain amount; and
- determining that the pair of words are not synonyms based at least on determining that the frequency with which the first word occurs in the collection of documents is greater than the frequency with which the second word occurs in the collection of documents by at least the certain amount.
Specification