Diverse topic phrase extraction
Abstract
Systems and methods for implementing diverse topic phrase extraction are disclosed. According to one implementation, multi-word candidate phrases are extracted from a corpus and weighted. One or more documents are re-weighted using latent semantic analysis (LSA) to identify less obvious candidate topics. Phrase diversification is then used to remove redundancy and select informative, distinct topic phrases.
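The candidate extraction and diversification steps named in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the stopword list, the use of document frequency as the candidate weight, the Jaccard word-overlap test for redundancy, the tie-break that prefers longer phrases, and the helper names `candidate_phrases`, `weight_candidates`, and `diversify`.

```python
from collections import Counter

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = frozenset({"a", "an", "and", "for", "in", "of", "on", "the", "to"})


def candidate_phrases(tokens, max_len=3):
    """Yield multi-word n-grams (2..max_len words) that do not start or end with a stopword."""
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram[0] not in STOPWORDS and gram[-1] not in STOPWORDS:
                yield gram


def weight_candidates(docs, max_len=3):
    """Weight each candidate by document frequency (an assumed stand-in for the patent's weighting)."""
    weights = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        # set() so a phrase counts once per document
        weights.update(set(candidate_phrases(tokens, max_len)))
    return weights


def diversify(weights, k=3, max_overlap=0.5):
    """Greedily keep high-weight phrases whose Jaccard word overlap with every kept phrase is <= max_overlap."""
    # Assumed tie-break: higher weight first, then longer phrases, then alphabetical order.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], -len(kv[0]), kv[0]))
    selected = []
    for phrase, _ in ranked:
        words = set(phrase)
        if all(len(words & set(s)) / len(words | set(s)) <= max_overlap for s in selected):
            selected.append(phrase)
        if len(selected) == k:
            break
    return selected


corpus = [
    "solar panel cost",
    "solar panel installation cost",
    "wind turbine noise",
    "wind turbine noise levels",
]
for phrase in diversify(weight_candidates(corpus), k=2):
    print(" ".join(phrase))
```

In this sketch, near-duplicate candidates such as "turbine noise" and "wind turbine noise" compete for a single slot, which is one simple way to obtain distinct topic phrases; the overlap threshold is a tunable assumption.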
18 Claims
1. A method, implemented by one or more computing devices, of summarizing search results from a corpus, the method comprising:
re-weighting, by the one or more computing devices, documents based at least in part on a latent topic to identify topic phrases associated with candidate phrases occurring within one or more of the documents, the re-weighting comprising strengthening a subset of the documents that are indicated by the latent topic by adjusting document weights to increase a term frequency of one or more terms within the subset of the documents associated with the latent topic;
evaluating, by the one or more computing devices, based in part on the increase of the term frequency of the one or more terms, a modified latent semantic analysis (LSA)-weighted frequency of one or more documents of the subset of the documents to identify the topic phrases; and
filtering, by the one or more computing devices, the documents having similar topic phrases to remove redundancy of the similar topic phrases.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system comprising:
one or more processors;
a memory;
a re-weighting module configured to perform acts, the acts comprising;
identifying topic phrases associated with documents;
determining a latent topic by associating the documents with weight factors based on an association of terms within constituent phrases that appear in the documents, at least one concept being characterized by the latent topic; and
re-weighting the documents, the re-weighting comprising strengthening a subset of the documents that are indicated by the latent topic by adjusting document weights to increase a term frequency of one or more terms within the subset of the documents associated with the latent topic;
a diversification module configured to perform acts comprising;
evaluating, based in part on the increase of the term frequency of the one or more terms, a modified latent semantic analysis (LSA)-weighted frequency of one or more documents of the subset of the documents;
removing documents associated with one or more similar concepts to result in a list of documents with unlike concepts.
Dependent claims: 9, 10, 11, 12
13. One or more computer-readable storage devices comprising computer executable instructions that, when executed, direct a computing system to perform acts, the acts comprising:
re-weighting documents in a corpus based at least in part on a latent topic to identify one or more concepts associated with the documents, the re-weighting comprising strengthening a subset of the documents that are indicated by the latent topic by adjusting document weights to increase a term frequency of each of one or more terms within the subset of the documents associated with the latent topic;
evaluating, based in part on the increase of the term frequency of each of the one or more terms, a modified latent semantic analysis (LSA)-weighted frequency of one or more documents of the subset of the documents to identify one or more topic phrases associated with the latent topic; and
filtering the documents associated with similar concepts to result in a document collection with unlike concepts.
Dependent claims: 14, 15, 16, 17, 18
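The re-weighting and evaluating steps recited in the independent claims can be sketched in Python as follows. This is one possible reading, not the claimed implementation. The assumptions are: latent topics are taken as singular vector pairs of a raw term-document count matrix, approximated here by power iteration with deflation rather than a full SVD; a document's weight is 1 plus its normalized loading on the chosen topic; and the "modified LSA-weighted frequency" of a phrase is the sum of its per-document counts multiplied by those weights. All function names are hypothetical.

```python
import math


def term_document_matrix(token_docs):
    """Rows are vocabulary terms, columns are documents, entries are raw counts."""
    vocab = sorted({t for doc in token_docs for t in doc})
    return [[doc.count(term) for doc in token_docs] for term in vocab]


def leading_singular_triplet(matrix, iters=200):
    """Approximate the largest singular value and its vectors by power iteration."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iters):
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]  # u = A v
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # w = A^T u
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma > 0.0:
        u = [x / sigma for x in u]
    return sigma, u, v


def latent_topics(matrix, k):
    """Return k (sigma, term_loadings, doc_loadings) triplets, deflating after each one."""
    residual = [list(row) for row in matrix]
    topics = []
    for _ in range(k):
        sigma, u, v = leading_singular_triplet(residual)
        topics.append((sigma, u, v))
        for i in range(len(residual)):
            for j in range(len(residual[0])):
                residual[i][j] -= sigma * u[i] * v[j]
    return topics


def document_weights(doc_loadings):
    """Assumed scheme: weights in [1, 2], larger for documents the topic indicates."""
    peak = max(abs(x) for x in doc_loadings) or 1.0
    return [1.0 + abs(x) / peak for x in doc_loadings]


def weighted_phrase_frequency(phrase, token_docs, weights):
    """Sum of per-document phrase counts, each scaled by that document's weight."""
    n = len(phrase)
    total = 0.0
    for tokens, weight in zip(token_docs, weights):
        hits = sum(1 for i in range(len(tokens) - n + 1) if tuple(tokens[i:i + n]) == phrase)
        total += weight * hits
    return total


docs = [d.split() for d in [
    "solar panel cost",
    "solar panel install",
    "solar panel cost guide",
    "wind turbine noise",
]]
for sigma, _, doc_loadings in latent_topics(term_document_matrix(docs), k=2):
    weights = document_weights(doc_loadings)
    print(round(sigma, 3), weighted_phrase_frequency(("wind", "turbine"), docs, weights))
```

With this toy corpus, the second latent topic loads almost entirely on the single wind-turbine document, so re-weighting by that topic raises the weighted frequency of "wind turbine" relative to the dominant solar-panel topic. This is the sense in which re-weighting can surface less obvious topics.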
Specification