Document-based synonym generation
Abstract
One embodiment of the present invention provides a system that automatically generates synonyms for words from documents. During operation, this system determines co-occurrence frequencies for pairs of words in the documents. The system also determines closeness scores for pairs of words in the documents, wherein a closeness score indicates whether a pair of words are located so close to each other that the words are likely to occur in the same sentence or phrase. Finally, the system determines whether pairs of words are synonyms based on the determined co-occurrence frequencies and the determined closeness scores. While making this determination, the system can additionally consider correlations between words in a title or an anchor of a document and words in the document as well as word-form scores for pairs of words in the documents.
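The first two signals named in the abstract can be illustrated with a minimal Python sketch. The text above does not give formulas, so everything here is an assumption made for illustration: the whitespace tokenizer, document-level counting for co-occurrence, and a fixed token window (the hypothetical `window` parameter) as a stand-in for "likely to occur in the same sentence or phrase".

```python
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption, not from the source).
    return text.lower().split()


def cooccurrence_frequencies(docs):
    """Count, for each unordered word pair, how many documents contain both words."""
    counts = Counter()
    for text in docs:
        vocab = sorted(set(tokenize(text)))
        # Sorting makes every pair key alphabetical, e.g. ("automobile", "car").
        counts.update(combinations(vocab, 2))
    return counts


def closeness_score(docs, a, b, window=5):
    """Among documents containing both words, the fraction in which some
    occurrence of `a` lies within `window` tokens of some occurrence of `b`."""
    both = close = 0
    for text in docs:
        toks = tokenize(text)
        pos_a = [i for i, t in enumerate(toks) if t == a]
        pos_b = [i for i, t in enumerate(toks) if t == b]
        if pos_a and pos_b:
            both += 1
            if any(abs(i - j) <= window for i in pos_a for j in pos_b):
                close += 1
    return close / both if both else 0.0


docs = [
    "cheap car rental deals in town and many pages later a note on automobile insurance for every driver",
    "used car prices rise while analysts say the automobile market will cool next year",
    "car wash open daily",
]
freqs = cooccurrence_frequencies(docs)
print(freqs[("automobile", "car")])              # documents containing both words
print(closeness_score(docs, "car", "automobile"))  # far apart in every document
print(closeness_score(docs, "car", "rental"))      # adjacent, so phrase-like
```

In this toy corpus, "car" and "automobile" share documents without ever appearing near each other, which is the pattern the abstract associates with synonyms; "car" and "rental" sit side by side and behave like parts of one phrase.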
81 Citations
22 Claims
1. A method for automatically determining whether pairs of words are synonym pairs, comprising:
determining co-occurrence frequencies for pairs of words in documents; determining closeness scores for pairs of words in the documents, wherein a closeness score indicates whether words in a pair of words are located so close to each other in the documents that the words are likely to occur in the same sentence or phrase; generating correlations between (i) words in a title or an anchor of each document in the documents and (ii) words in the document; and determining whether each of the pairs of words is a synonym pair based on the determined co-occurrence frequencies, the determined closeness scores, and the correlations, wherein a high closeness score is a negative indicator that a pair of words is a synonym pair, a high co-occurrence frequency is a positive indicator that the pair of words is a synonym pair, and a high correlation is a positive indicator that the pair of words is a synonym pair; wherein the determining co-occurrence frequencies, determining closeness scores, generating correlations, and determining whether each of the pairs of words is a synonym pair is performed by a computer system. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
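Claim 1 fixes only the direction in which each signal counts: co-occurrence frequency and title/anchor correlation are positive indicators, closeness is a negative indicator. One way to respect those directions is a signed linear score, sketched below. The linear form, the weights, the threshold, and the assumption that each signal has already been normalized to the range 0 to 1 are all hypothetical; the claim does not specify how the signals are combined.

```python
from dataclasses import dataclass


@dataclass
class PairSignals:
    cooccurrence: float  # assumed normalized to [0, 1]
    closeness: float     # assumed normalized to [0, 1]
    correlation: float   # assumed normalized to [0, 1]


def synonym_score(s, w_cooc=1.0, w_close=1.0, w_corr=1.0):
    # Positive weights on co-occurrence and correlation, negative on closeness,
    # matching the polarity stated in the claim. Weight values are hypothetical.
    return w_cooc * s.cooccurrence - w_close * s.closeness + w_corr * s.correlation


def is_synonym_pair(s, threshold=1.0):
    # Hypothetical decision threshold.
    return synonym_score(s) >= threshold


likely = PairSignals(cooccurrence=0.8, closeness=0.1, correlation=0.7)
unlikely = PairSignals(cooccurrence=0.8, closeness=0.9, correlation=0.2)
print(is_synonym_pair(likely), is_synonym_pair(unlikely))
```

Both example pairs co-occur equally often; the second is rejected because its words tend to sit next to each other and its title/anchor correlation is weak.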
9. A method for automatically determining whether pairs of words are synonym pairs, comprising:
determining co-occurrence frequencies for pairs of words in documents; determining closeness scores for pairs of words in the documents, wherein a closeness score indicates whether words in a pair of words are located so close to each other in the documents that the words are likely to occur in the same sentence or phrase; determining word-form scores for pairs of words in the documents, wherein a high word-form score results when a pair of words share common parts, which can include prefixes, suffixes and/or middle sections, and wherein remaining parts of the pair of words are consistent with word-form rules; generating correlations between (i) words in a title or an anchor of each document in the documents and (ii) words in the document; and determining whether pairs of words are synonyms based on the determined co-occurrence frequencies, the determined closeness scores, the determined word-form scores, and the generated correlations, wherein a high co-occurrence frequency is a positive indicator that the pair of words is a synonym pair, a high closeness score is a negative indicator that a pair of words is a synonym pair, a high word-form score is a positive indicator that the pair of words is a synonym pair, and a high correlation is a positive indicator that the pair of words is a synonym pair; wherein the determining co-occurrence frequencies, determining closeness scores, generating correlations, and determining whether each of the pairs of words is a synonym pair is performed by a computer system.
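The word-form score in claim 9 can be sketched as follows, under strong simplifications. The rule table `WORD_FORM_RULES`, the `min_shared` cutoff, and the partial score for unmatched remainders are hypothetical, and this sketch handles only a shared prefix, whereas the claim also allows shared suffixes and middle sections.

```python
import os

# Hypothetical word-form rules: pairs of endings treated as acceptable
# variations once the shared prefix is removed.
WORD_FORM_RULES = {
    frozenset(("", "s")),
    frozenset(("y", "ies")),
    frozenset(("", "ing")),
    frozenset(("e", "ing")),
    frozenset(("", "ed")),
}


def split_common_prefix(a, b):
    # os.path.commonprefix compares character by character, so it works on plain words.
    prefix = os.path.commonprefix([a, b])
    return prefix, a[len(prefix):], b[len(prefix):]


def word_form_score(a, b, min_shared=3):
    """Return 1.0 when the words share a prefix and the remainders match a rule,
    a smaller partial score when only the prefix is shared, and 0.0 otherwise."""
    prefix, rest_a, rest_b = split_common_prefix(a, b)
    if a == b or len(prefix) < min_shared:
        return 0.0
    if frozenset((rest_a, rest_b)) in WORD_FORM_RULES:
        return 1.0
    # Shared part exists but the remainders fit no rule: keep the score low.
    return 0.5 * len(prefix) / max(len(a), len(b))


print(word_form_score("city", "cities"))
print(word_form_score("walk", "walking"))
print(word_form_score("contract", "contrast"))
print(word_form_score("car", "cat"))
```

The pairs "city"/"cities" and "walk"/"walking" match a rule and score highest, while "contract"/"contrast" share a long prefix but their remainders fit no rule, so the score stays low.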
10. A computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform operations comprising:
determining co-occurrence frequencies for pairs of words in documents; determining closeness scores for pairs of words in the documents, wherein a closeness score indicates whether words in a pair of words are located so close to each other in the documents that the words are likely to occur in the same sentence or phrase; generating correlations between (i) words in a title or an anchor of each document in the documents and (ii) words in the document; and determining whether each of the pairs of words is a synonym pair based on the determined co-occurrence frequencies, the determined closeness scores, and the correlations, wherein a high closeness score is a negative indicator that a pair of words is a synonym pair, a high co-occurrence frequency is a positive indicator that the pair of words is a synonym pair, and a high correlation is a positive indicator that the pair of words is a synonym pair. - View Dependent Claims (11, 12, 13, 14, 15, 16, 17)
18. A computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform operations comprising:
determining co-occurrence frequencies for pairs of words in documents; determining closeness scores for pairs of words in the documents, wherein a closeness score indicates whether words in a pair of words are located so close to each other in the documents that the words are likely to occur in the same sentence or phrase; determining word-form scores for pairs of words in the documents, wherein a high word-form score results when a pair of words share common parts, which can include prefixes, suffixes and/or middle sections, and wherein remaining parts of the pair of words are consistent with word-form rules; generating correlations between (i) words in a title or an anchor of each document in the documents and (ii) words in the document; and determining whether pairs of words are synonyms based on the determined co-occurrence frequencies, the determined closeness scores, the determined word-form scores, and the generated correlations, wherein a high co-occurrence frequency is a positive indicator that the pair of words is a synonym pair, a high closeness score is a negative indicator that a pair of words is a synonym pair, a high word-form score is a positive indicator that the pair of words is a synonym pair, and a high correlation is a positive indicator that the pair of words is a synonym pair.
19. A system, comprising:
one or more computers configured to perform operations comprising: determining co-occurrence frequencies for pairs of words in documents; determining closeness scores for pairs of words in the documents, wherein a closeness score indicates whether words in a pair of words are located so close to each other in the documents that the words are likely to occur in the same sentence or phrase; generating correlations between (i) words in a title or an anchor of each document in the documents and (ii) words in the document; and determining whether each of the pairs of words is a synonym pair based on the determined co-occurrence frequencies, the determined closeness scores, and the correlations, wherein a high closeness score is a negative indicator that a pair of words is a synonym pair, a high co-occurrence frequency is a positive indicator that the pair of words is a synonym pair, and a high correlation is a positive indicator that the pair of words is a synonym pair; wherein the determining co-occurrence frequencies, determining closeness scores, generating correlations, and determining whether each of the pairs of words is a synonym pair is performed by a computer system.
20. A system, comprising:
one or more computers configured to perform operations comprising: determining co-occurrence frequencies for pairs of words in documents; determining closeness scores for pairs of words in the documents, wherein a closeness score indicates whether words in a pair of words are located so close to each other in the documents that the words are likely to occur in the same sentence or phrase; determining word-form scores for pairs of words in the documents, wherein a high word-form score results when a pair of words share common parts, which can include prefixes, suffixes and/or middle sections, and wherein remaining parts of the pair of words are consistent with word-form rules; generating correlations between (i) words in a title or an anchor of each document in the documents and (ii) words in the document; and determining whether pairs of words are synonyms based on the determined co-occurrence frequencies, the determined closeness scores, the determined word-form scores, and the generated correlations, wherein a high co-occurrence frequency is a positive indicator that the pair of words is a synonym pair, a high closeness score is a negative indicator that a pair of words is a synonym pair, a high word-form score is a positive indicator that the pair of words is a synonym pair, and a high correlation is a positive indicator that the pair of words is a synonym pair; wherein the determining co-occurrence frequencies, determining closeness scores, generating correlations, and determining whether each of the pairs of words is a synonym pair is performed by a computer system.
21. A computer-implemented method for automatically determining whether pairs of words are synonym pairs, comprising:
determining, for a pair of words, a co-occurrence frequency of the pair of words in a plurality of documents; determining a closeness score for the pair of words in the plurality of documents, wherein the closeness score is derived from how close a first and a second word appear to each other in the plurality of documents; generating a correlation between (i) one of the words in each pair of words appearing in a title or anchor of each of one or more documents and (ii) another word in the pair of words appearing in each of the one or more documents; determining whether the pair of words is a synonym pair according to the co-occurrence frequency, the closeness score, and the correlation, wherein a high closeness score is a negative indicator that the pair of words is a synonym pair, a high co-occurrence frequency is a positive indicator that the pair of words is a synonym pair, and a high correlation is a positive indicator that the pair of words is a synonym pair.
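The title/anchor correlation in claim 21 pairs one word found in a document's title or anchor text with the other word found in the document itself. A minimal sketch is shown below, assuming a conditional fraction as the correlation measure and a hypothetical record layout with `title` and `body` fields; the claim does not specify either.

```python
def title_body_correlation(docs, title_word, body_word):
    """Among documents whose title (or anchor text) contains `title_word`,
    return the fraction whose body contains `body_word`."""
    with_title = with_both = 0
    for doc in docs:
        if title_word in doc["title"].lower().split():
            with_title += 1
            if body_word in doc["body"].lower().split():
                with_both += 1
    return with_both / with_title if with_title else 0.0


# Toy records; the "title" field could equally hold anchor text pointing at the page.
pages = [
    {"title": "Used automobile listings", "body": "find a car near you"},
    {"title": "Automobile repair guide", "body": "how to fix your car at home"},
    {"title": "Automobile history", "body": "early engines and inventors"},
    {"title": "Gardening tips", "body": "car parking at the garden centre"},
]
print(title_body_correlation(pages, "automobile", "car"))
```

Here two of the three pages titled with "automobile" use "car" in the body, so the correlation for that ordered pair is two thirds; the gardening page does not affect it because its title lacks the first word.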
22. A computer-implemented method for automatically determining whether pairs of words are synonym pairs, comprising:
determining co-occurrence frequencies for pairs of words in documents; determining closeness scores for pairs of words in the documents, wherein a closeness score indicates whether words in a pair of words are located so close to each other in the documents that the words are likely to occur in the same sentence or phrase; determining word-form scores for pairs of words in the documents, wherein a word-form score specifies an extent to which a pair of words share common parts and an extent to which remaining parts of the pair of words are consistent with word-form rules, wherein the common parts include one or more prefixes, suffixes, and middle sections, and wherein a word-form rule designates acceptable variations among words that share a common stem; and determining whether each of the pairs of words is a synonym pair based on the determined co-occurrence frequencies, the determined closeness scores, and an analysis of the word-form scores.
Specification