Method and system for latent dirichlet allocation computation using approximate counters
First Claim
1. A method for identifying sets of correlated words comprising:
- running an uncollapsed Gibbs sampler over a Dirichlet distribution of a plurality of words in a set of documents to produce sampler result data, further comprising;
representing one or more counts in the uncollapsed Gibbs sampler using one or more approximate counters, andusing one or more probabilistic techniques to increment the one or more approximate counters; and
determining, from the sampler result data, one or more sets of correlated words;
wherein the method is performed by one or more computing devices.
1 Assignment
0 Petitions
Accused Products
Abstract
Herein is described a data-parallel algorithm for topic modeling in which the memory requirements are streamlined for implementation on a highly-parallel architecture, such as a GPU. Specifically, approximate counters are used in a large mixture model or clustering algorithm (e.g., an uncollapsed Gibbs sampler) to decrease memory usage over what is required when conventional counters are used. The decreased memory usage of the approximate counters allows a highly-parallel architecture with limited memory to process more computations for the large mixture model more efficiently. Embodiments describe binary Morris approximate counters, general Morris approximate counters, and Csrös approximate counters in the context of an uncollapsed Gibbs sampler, and, more specifically, for a Greedy Gibbs sampler.
22 Citations
20 Claims
-
1. A method for identifying sets of correlated words comprising:
-
running an uncollapsed Gibbs sampler over a Dirichlet distribution of a plurality of words in a set of documents to produce sampler result data, further comprising; representing one or more counts in the uncollapsed Gibbs sampler using one or more approximate counters, and using one or more probabilistic techniques to increment the one or more approximate counters; and determining, from the sampler result data, one or more sets of correlated words; wherein the method is performed by one or more computing devices. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
-
-
11. One or more non-transitory computer-readable media storing one or more sequences of instructions for identifying sets of correlated words, wherein said one or more sequences of instructions, when executed by one or more processors, cause:
-
running an uncollapsed Gibbs sampler over a Dirichlet distribution of a plurality of words in a set of documents to produce sampler result data, further comprising; representing one or more counts in the uncollapsed Gibbs sampler using one or more approximate counters, and using one or more probabilistic techniques to increment the one or more approximate counters; and determining, from the sampler result data, one or more sets of correlated words. - View Dependent Claims (12, 13, 14, 15, 16, 17, 18, 19, 20)
-
Specification