Modeling Topics Using Statistical Distributions
Abstract
In one embodiment, modeling topics includes accessing a corpus comprising documents that include words. Words of a document are selected as keywords of the document. The documents are clustered according to the keywords to yield clusters, where each cluster corresponds to a topic. A statistical distribution is generated for a cluster from words of the documents of the cluster. A topic is modeled using the statistical distribution generated for the cluster corresponding to the topic.
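The steps in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the source does not say how keywords are chosen or how documents are clustered, so the TF-IDF ranking, the greedy shared-keyword clustering, and every function name here (`select_keywords`, `cluster_by_keywords`, `cluster_distribution`, `model_topics`) are hypothetical. The statistical distribution is assumed to be the relative frequency of each word within a cluster.

```python
import math
from collections import Counter

def select_keywords(docs, k=2):
    """Pick the k highest TF-IDF words of each document (assumed criterion)."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency
    keywords = []
    for d in docs:
        tf = Counter(d)
        score = {w: tf[w] * math.log(n / df[w]) for w in tf}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(score, key=lambda w: (-score[w], w))
        keywords.append(set(ranked[:k]))
    return keywords

def cluster_by_keywords(keywords):
    """Greedy clustering (assumption): a document joins the first
    cluster with which it shares at least one keyword."""
    clusters = []  # list of (cluster keyword set, member document indices)
    for i, kw in enumerate(keywords):
        for cluster_kw, members in clusters:
            if cluster_kw & kw:
                cluster_kw.update(kw)
                members.append(i)
                break
        else:
            clusters.append((set(kw), [i]))
    return [members for _, members in clusters]

def cluster_distribution(docs, members):
    """Relative word frequencies over a cluster's documents.
    All words are used here; a real system might use only a subset."""
    counts = Counter(w for i in members for w in docs[i])
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def model_topics(docs, k=2):
    """Return one word distribution per cluster, i.e. one per topic."""
    clusters = cluster_by_keywords(select_keywords(docs, k))
    return [cluster_distribution(docs, m) for m in clusters]

docs = [
    ["cat", "dog", "cat", "pet"],
    ["dog", "cat", "pet", "pet"],
    ["stock", "bond", "stock", "market"],
    ["bond", "market", "market", "stock"],
]
topics = model_topics(docs)
```

On this toy corpus the two pet documents and the two finance documents fall into separate clusters, and each resulting topic is represented by a word distribution that sums to 1.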
123 Citations
26 Claims
1. A method comprising:
accessing a corpus stored in one or more tangible media, the corpus comprising a plurality of documents, a document comprising a plurality of words;
selecting one or more words of each document as one or more keywords of the each document;
clustering the documents according to the keywords to yield a plurality of clusters, each cluster corresponding to a topic;
generating a statistical distribution for each cluster from a subset of the words of the documents of the each cluster to yield a plurality of statistical distributions; and
modeling each topic using the statistical distribution generated for the cluster corresponding to the each topic.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8, 19, 20, 21)
9. One or more computer-readable tangible media encoding software operable when executed to:
access a corpus stored in one or more tangible media, the corpus comprising a plurality of documents, a document comprising a plurality of words;
select one or more words of each document as one or more keywords of the each document;
cluster the documents according to the keywords to yield a plurality of clusters, each cluster corresponding to a topic;
generate a statistical distribution for each cluster from a subset of the words of the documents of the each cluster to yield a plurality of statistical distributions; and
model each topic using the statistical distribution generated for the cluster corresponding to the each topic.
(Dependent claims: 10, 11, 12, 13, 14, 15, 16)
17. A system comprising:
means for accessing a corpus stored in one or more tangible media, the corpus comprising a plurality of documents, a document comprising a plurality of words;
means for selecting one or more words of each document as one or more keywords of the each document;
means for clustering the documents according to the keywords to yield a plurality of clusters, each cluster corresponding to a topic;
means for generating a statistical distribution for each cluster from a subset of the words of the documents of the each cluster to yield a plurality of statistical distributions; and
means for modeling each topic using the statistical distribution generated for the cluster corresponding to the each topic.
18. A method comprising:
accessing a document and a plurality of clusters stored in one or more tangible media, the document comprising a plurality of words, the clusters associated with a plurality of topics;
performing the following for each word of a subset of the words to yield a plurality of statistical distributions:
establishing a frequency of the each word in each cluster of the clusters to yield a plurality of frequencies; and
generating a statistical distribution from the frequencies; and
consolidating the statistical distributions to yield a consolidated statistical distribution, the consolidated statistical distribution indicating the frequencies of the subset of words in the topics.
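The per-word procedure of claim 18 can be illustrated with a short Python sketch. It rests on assumptions the claim text does not settle: "frequency" is taken to be a raw occurrence count, each per-word distribution is that count normalized across clusters, and consolidation is a normalized sum of the per-word distributions. The function names (`word_distribution`, `consolidate`, `topic_profile`) are hypothetical.

```python
from collections import Counter

def word_distribution(word, cluster_counts):
    """Distribution of one word across clusters, from its per-cluster counts."""
    freqs = [counts[word] for counts in cluster_counts]
    total = sum(freqs)
    if total == 0:
        return [0.0] * len(freqs)  # word appears in no cluster
    return [f / total for f in freqs]

def consolidate(distributions):
    """Combine per-word distributions by summing and renormalizing (assumption)."""
    if not distributions:
        return []
    n_clusters = len(distributions[0])
    summed = [sum(d[c] for d in distributions) for c in range(n_clusters)]
    total = sum(summed)
    return [s / total if total else 0.0 for s in summed]

def topic_profile(doc_words, clusters):
    """clusters: one list of documents per cluster, each document a list of words."""
    cluster_counts = [Counter(w for d in docs for w in d) for docs in clusters]
    dists = [word_distribution(w, cluster_counts) for w in doc_words]
    return consolidate(dists)

clusters = [
    [["cat", "dog", "cat"]],   # cluster 0
    [["stock", "cat"]],        # cluster 1
]
profile = topic_profile(["cat", "stock"], clusters)
```

Here "cat" is split two to one between the clusters while "stock" occurs only in the second, so the consolidated distribution weights the second cluster's topic more heavily.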
22. One or more computer-readable tangible media encoding software operable when executed to:
access a document and a plurality of clusters stored in one or more tangible media, the document comprising a plurality of words, the clusters associated with a plurality of topics;
perform the following for each word of a subset of the words to yield a plurality of statistical distributions:
establishing a frequency of the each word in each cluster of the clusters to yield a plurality of frequencies; and
generating a statistical distribution from the frequencies; and
consolidate the statistical distributions to yield a consolidated statistical distribution, the consolidated statistical distribution indicating the frequencies of the subset of words in the topics.
(Dependent claims: 23, 24, 25)
26. A system comprising:
means for accessing a document and a plurality of clusters stored in one or more tangible media, the document comprising a plurality of words, the clusters associated with a plurality of topics;
means for performing the following for each word of a subset of the words to yield a plurality of statistical distributions:
establishing a frequency of the each word in each cluster of the clusters to yield a plurality of frequencies; and
generating a statistical distribution from the frequencies; and
means for consolidating the statistical distributions to yield a consolidated statistical distribution, the consolidated statistical distribution indicating the frequencies of the subset of words in the topics.
Specification