Modeling topics using statistical distributions
First Claim
1. A computer-implemented method comprising:
accessing a corpus stored in one or more tangible media, the corpus comprising a plurality of documents, a document comprising a plurality of words;
selecting one or more words of each document as one or more keywords of the each document;
clustering the documents according to the keywords to yield a plurality of clusters, each cluster corresponding to a different topic;
generating a statistical distribution for each cluster from a subset of the words of the documents of the each cluster to yield a plurality of statistical distributions, wherein generating the statistical distribution for the each cluster comprises:
determining a co-occurrence value indicating a co-occurrence of the topic of the each cluster with the topics of the other clusters in the plurality of documents; and
generating a co-occurrence distribution from the co-occurrence values;
modeling each topic using the statistical distribution generated for the cluster corresponding to the each topic;
organizing the clusters according to the statistical distributions; and
assigning the topics of the organized clusters to the documents in the organized clusters.
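The co-occurrence steps of the claim can be illustrated with a minimal sketch. The claim does not define how a co-occurrence value is computed, so the sketch assumes it is the number of documents in which two topics both appear, and that the co-occurrence distribution for a topic is those counts normalized to sum to 1. The function name `cooccurrence_distributions` and the input shape (one set of topic labels per document) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_distributions(doc_topics):
    """Return, for each topic, a distribution over the other topics.

    doc_topics: list of sets of topic labels, one set per document
    (hypothetical input shape, not taken from the patent).
    """
    # Assumed co-occurrence value: number of documents containing both topics.
    counts = {}
    for topics in doc_topics:
        for a, b in permutations(sorted(topics), 2):
            counts.setdefault(a, Counter())[b] += 1

    # Normalize each topic's counts so they sum to 1.
    dists = {}
    for topic, row in counts.items():
        total = sum(row.values())
        dists[topic] = {other: n / total for other, n in row.items()}
    return dists


doc_topics = [
    {"sports", "health"},
    {"sports", "finance"},
    {"sports", "health"},
    {"finance"},
]
dists = cooccurrence_distributions(doc_topics)
```

In this toy input, "sports" co-occurs with "health" in two documents and with "finance" in one, so its distribution puts two thirds of the mass on "health". A topic that never co-occurs with another topic has no entry in the result.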
Abstract
In one embodiment, modeling topics includes accessing a corpus comprising documents that include words. Words of a document are selected as keywords of the document. The documents are clustered according to the keywords to yield clusters, where each cluster corresponds to a topic. A statistical distribution is generated for a cluster from words of the documents of the cluster. A topic is modeled using the statistical distribution generated for the cluster corresponding to the topic.
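The flow described in the abstract might be sketched as follows. This is an illustration under stated assumptions, not the patented method: keywords are assumed to be the most frequent non-stopword words, documents are assumed to cluster together when they share the same top keyword, and the statistical distribution of a cluster is assumed to be the relative frequency of its words. The stopword list and the names `keywords`, `cluster_by_top_keyword`, and `word_distribution` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}


def keywords(doc, k=2):
    """Pick the k most frequent non-stopword words as keywords (assumed rule)."""
    words = [w for w in doc.lower().split() if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(k)]


def cluster_by_top_keyword(docs):
    """Group documents that share the same top keyword (a deliberately naive clustering).

    Assumes every document contains at least one non-stopword word.
    """
    clusters = defaultdict(list)
    for doc in docs:
        clusters[keywords(doc, k=1)[0]].append(doc)
    return dict(clusters)


def word_distribution(cluster_docs):
    """Relative frequency of each non-stopword word across a cluster's documents."""
    counts = Counter(
        w for d in cluster_docs for w in d.lower().split() if w not in STOPWORDS
    )
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}


docs = [
    "stocks rise as stocks rally",
    "stocks fall in the market stocks",
    "team wins the match team celebrates",
]
clusters = cluster_by_top_keyword(docs)
# Each topic (here labeled by its top keyword) is modeled by its cluster's distribution.
models = {topic: word_distribution(cluster) for topic, cluster in clusters.items()}
```

Here the two documents about stocks land in one cluster and the sports document in another, and each topic is represented by the word distribution of its cluster. A real system would likely use a stronger keyword measure and clustering algorithm; the sketch only shows how the steps connect.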
12 Claims
1. A computer-implemented method comprising:
accessing a corpus stored in one or more tangible media, the corpus comprising a plurality of documents, a document comprising a plurality of words;
selecting one or more words of each document as one or more keywords of the each document;
clustering the documents according to the keywords to yield a plurality of clusters, each cluster corresponding to a different topic;
generating a statistical distribution for each cluster from a subset of the words of the documents of the each cluster to yield a plurality of statistical distributions, wherein generating the statistical distribution for the each cluster comprises:
determining a co-occurrence value indicating a co-occurrence of the topic of the each cluster with the topics of the other clusters in the plurality of documents; and
generating a co-occurrence distribution from the co-occurrence values;
modeling each topic using the statistical distribution generated for the cluster corresponding to the each topic;
organizing the clusters according to the statistical distributions; and
assigning the topics of the organized clusters to the documents in the organized clusters.
Dependent claims: 2, 3, 4, 5, 6
7. One or more non-transitory computer-readable tangible media encoding software operable when executed to:
access a corpus stored in one or more tangible media, the corpus comprising a plurality of documents, a document comprising a plurality of words;
select one or more words of each document as one or more keywords of the each document;
cluster the documents according to the keywords to yield a plurality of clusters, each cluster corresponding to a different topic;
generate a statistical distribution for each cluster from a subset of the words of the documents of the each cluster to yield a plurality of statistical distributions, wherein generating the statistical distribution for the each cluster comprises:
determining a co-occurrence value indicating a co-occurrence of the topic of the each cluster with the topics of the other clusters in the plurality of documents; and
generating a co-occurrence distribution from the co-occurrence values; and
model each topic using the statistical distribution generated for the cluster corresponding to the each topic;
organize the clusters according to the statistical distributions; and
assign the topics of the organized clusters to the documents in the organized clusters.
Dependent claims: 8, 9, 10, 11, 12
Specification