DATA-PARALLEL PARAMETER ESTIMATION OF THE LATENT DIRICHLET ALLOCATION MODEL BY GREEDY GIBBS SAMPLING
First Claim
1. A method for identifying sets of correlated words comprising:
- receiving information for a set of documents;
- wherein the set of documents comprises a plurality of words;
- running a partially-collapsed Gibbs sampler over a Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprising:
- calculating a mean of the Dirichlet distribution;
- determining, from the sampler result data, one or more sets of correlated words;
- wherein the method is performed by one or more computing devices.
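The "calculating a mean of the Dirichlet distribution" step above is the deterministic operation that replaces a random draw. For a Dirichlet distribution with concentration parameters α₁…αₖ, the mean of component i is simply αᵢ / Σⱼ αⱼ. A minimal sketch (the function name and the example numbers are illustrative, not from the patent):

```python
import numpy as np

def dirichlet_mean(alpha):
    """Mean of a Dirichlet(alpha) distribution: alpha_i / sum(alpha).

    Computing this mean is deterministic, unlike drawing a random
    sample from the distribution, which is the substitution the
    claimed method relies on.
    """
    alpha = np.asarray(alpha, dtype=float)
    return alpha / alpha.sum()

# For a posterior Dirichlet whose concentration is (counts + prior):
counts = np.array([5.0, 2.0, 1.0])     # hypothetical topic-word counts
prior = 0.5                            # hypothetical symmetric prior
point_estimate = dirichlet_mean(counts + prior)
```

Because the mean is a closed-form normalization of counts, it can be computed for all topic-word entries at once, with no sequential sampling dependency.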
1 Assignment
0 Petitions
Abstract
A novel data-parallel algorithm is presented for topic modeling on highly parallel hardware architectures. The algorithm is a Markov chain Monte Carlo algorithm that estimates the parameters of the LDA topic model. It is based on a highly parallel partially-collapsed Gibbs sampler, but replaces a stochastic step that draws from a distribution with an optimization step that computes the mean of the distribution directly and deterministically. The algorithm is correct, statistically performant, and faster than state-of-the-art algorithms because it can exploit massive amounts of parallelism on a highly parallel architecture, such as a GPU. Furthermore, the partially-collapsed Gibbs sampler converges about as fast as the collapsed Gibbs sampler and identifies solutions that are as good as, or better than, those of the collapsed Gibbs sampler.
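The abstract's key idea — keep resampling topic assignments, but replace the random draw of the topic-word distributions with their Dirichlet posterior mean — can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation; all names, hyperparameter defaults, and the plain-Python loop structure are assumptions, and a real data-parallel version would vectorize the per-token updates (e.g. on a GPU), which is possible precisely because each token's update depends only on the fixed mean.

```python
import numpy as np

def greedy_gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
                     iterations=50, seed=0):
    """Sketch of a partially-collapsed ("greedy") Gibbs sampler for LDA.

    Instead of drawing the topic-word distributions phi from their
    Dirichlet posterior, the stochastic draw is replaced by the
    posterior MEAN, computed deterministically. Topic assignments z
    are still resampled; with phi fixed, every token's update is
    independent, which exposes the data parallelism.
    """
    rng = np.random.default_rng(seed)
    # Random initial topic assignment for every token.
    z = [rng.integers(num_topics, size=len(d)) for d in docs]

    for _ in range(iterations):
        # Rebuild count matrices from the current assignments.
        topic_word = np.zeros((num_topics, vocab_size))
        doc_topic = np.zeros((len(docs), num_topics))
        for d, (words, zs) in enumerate(zip(docs, z)):
            for w, k in zip(words, zs):
                topic_word[k, w] += 1
                doc_topic[d, k] += 1

        # Greedy step: use the mean of the Dirichlet posterior for phi,
        # i.e. (count + beta) normalized per topic, instead of a draw.
        phi = topic_word + beta
        phi /= phi.sum(axis=1, keepdims=True)

        # Resample each token's topic given the fixed phi; these updates
        # are mutually independent and could run in parallel.
        for d, words in enumerate(docs):
            theta = doc_topic[d] + alpha  # unnormalized doc-topic weights
            for i, w in enumerate(words):
                p = theta * phi[:, w]
                z[d][i] = rng.choice(num_topics, p=p / p.sum())

    return phi, z
```

A usage sketch: `phi, z = greedy_gibbs_lda([[0, 1, 2, 2], [3, 4, 3]], num_topics=2, vocab_size=5)` returns the estimated topic-word distributions and the per-token topic assignments; words that dominate the same row of `phi` form a set of correlated words.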
14 Citations
19 Claims
1. A method for identifying sets of correlated words comprising:
- receiving information for a set of documents;
- wherein the set of documents comprises a plurality of words;
- running a partially-collapsed Gibbs sampler over a Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprising:
- calculating a mean of the Dirichlet distribution;
- determining, from the sampler result data, one or more sets of correlated words;
- wherein the method is performed by one or more computing devices.
- View Dependent Claims (2, 3, 4, 5, 6, 7)
8. One or more computer-readable media storing instructions which, when executed by one or more processors, cause performance of:
- receiving information for a set of documents;
- wherein the set of documents comprises a plurality of words;
- running a partially-collapsed Gibbs sampler over a Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprising:
- calculating a mean of the Dirichlet distribution;
- determining, from the sampler result data, one or more sets of correlated words.
- View Dependent Claims (9, 10, 11, 12, 13, 14)
15. A computer system comprising:
- one or more processors; and
- one or more computer-readable media storing instructions which, when executed by the one or more processors, cause performance of:
- receiving information for a set of documents;
- wherein the set of documents comprises a plurality of words;
- running a partially-collapsed Gibbs sampler over a Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprising:
- calculating a mean of the Dirichlet distribution;
- determining, from the sampler result data, one or more sets of correlated words.
- View Dependent Claims (16, 17, 18, 19)
Specification