Learning topics by simulation of a stochastic cellular automaton
First Claim
1. A method for identifying sets of correlated words comprising:
- receiving information for a set of documents;
  wherein the set of documents comprises a plurality of words;
  wherein a particular document of the set of documents comprises a particular word of the plurality of words;
- running an inference algorithm over a Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprising:
  - retrieving a first counter value from a first data structure,
  - based, at least in part, on the first counter value, assigning a particular topic, of a plurality of topics, to the particular word in the particular document to produce a topic assignment for the particular word,
  - after assigning the particular topic to the particular word, updating a second counter value in a second data structure to produce an updated second counter value,
  wherein the updated second counter value reflects the topic assignment, and
  wherein the first data structure is stored and accessed independently from the second data structure; and
- determining, from the sampler result data, one or more sets of correlated words;
wherein the method is performed by one or more computing devices.
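The read-then-write counter scheme recited in the claim can be illustrated with a small Python sketch. This is not the patented implementation. It assumes the standard collapsed LDA conditional as the sampling weight, symmetric hyperparameters `alpha` and `beta`, and plain nested lists as the two counter structures. The function names, the toy corpus, and the parameter values are all hypothetical.

```python
import random

def init_counts(docs, num_topics, vocab_size, rng):
    """Assign a random topic to every word and build the initial counters."""
    doc_topic = [[0] * num_topics for _ in docs]
    word_topic = [[0] * num_topics for _ in range(vocab_size)]
    topic_total = [0] * num_topics
    for d, doc in enumerate(docs):
        for w in doc:
            k = rng.randrange(num_topics)
            doc_topic[d][k] += 1
            word_topic[w][k] += 1
            topic_total[k] += 1
    return doc_topic, word_topic, topic_total

def sweep(docs, read, num_topics, vocab_size, alpha, beta, rng):
    """One sweep: read counters only from `read`, write only to fresh counters."""
    r_dt, r_wt, r_tot = read                      # first data structure (read-only)
    w_dt = [[0] * num_topics for _ in docs]       # second data structure (write-only)
    w_wt = [[0] * num_topics for _ in range(vocab_size)]
    w_tot = [0] * num_topics
    for d, doc in enumerate(docs):
        for w in doc:
            # Assumed update rule: the usual LDA conditional, up to normalization.
            weights = [
                (r_dt[d][k] + alpha) * (r_wt[w][k] + beta)
                / (r_tot[k] + vocab_size * beta)
                for k in range(num_topics)
            ]
            k = rng.choices(range(num_topics), weights=weights)[0]
            # Record the topic assignment in the second structure.
            w_dt[d][k] += 1
            w_wt[w][k] += 1
            w_tot[k] += 1
    return w_dt, w_wt, w_tot

def top_words(word_topic, topic, n):
    """Word ids with the highest count for `topic` (a set of correlated words)."""
    ranked = sorted(range(len(word_topic)),
                    key=lambda w: word_topic[w][topic], reverse=True)
    return ranked[:n]

# Toy corpus of word ids (hypothetical data).
rng = random.Random(0)
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 2, 2], [3, 5, 4, 4]]
K, V = 2, 6
counts = init_counts(docs, K, V, rng)
for _ in range(50):
    # The written counters become the read counters of the next sweep.
    counts = sweep(docs, counts, K, V, alpha=0.1, beta=0.1, rng=rng)
print(top_words(counts[1], 0, 3))
```

Because no sample within a sweep depends on a counter written during that same sweep, the per-word updates are independent of one another. That independence is what makes the scheme resemble a cellular automaton update.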
Abstract
Herein is described an unsupervised learning method to discover topics and reduce the dimensionality of documents by designing and simulating a stochastic cellular automaton. A key formula that appears in many inference methods for LDA is used as the local update rule of the cellular automaton. Approximate counters may be used to represent counter values being tracked by the inference algorithms. Also, sparsity may be used to reduce the amount of computation needed for sampling a topic for particular words in the corpus being analyzed.
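To make the idea of an approximate counter concrete, here is a minimal Python sketch of a Morris-style probabilistic counter. The abstract does not say which approximate counter is used, so the choice of the Morris scheme is an assumption, and the class name and its methods are hypothetical.

```python
import random

class MorrisCounter:
    """Approximate counter (assumed Morris scheme): stores a small exponent x
    and estimates the true count as 2**x - 1."""

    def __init__(self, rng):
        self.x = 0        # stored exponent, grows roughly like log2(count)
        self.rng = rng

    def increment(self):
        # Bump the exponent with probability 2**-x, so larger counts
        # change the stored value less and less often.
        if self.rng.random() < 2.0 ** -self.x:
            self.x += 1

    def estimate(self):
        # Unbiased estimate of the number of increments seen so far.
        return 2 ** self.x - 1

rng = random.Random(0)
counter = MorrisCounter(rng)
for _ in range(10_000):
    counter.increment()
print(counter.x, counter.estimate())
```

The stored exponent needs only a few bits even for large counts. The cost is a noisy estimate for any single counter, although it is unbiased on average.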
14 Claims
1. A method for identifying sets of correlated words comprising:
- receiving information for a set of documents;
  wherein the set of documents comprises a plurality of words;
  wherein a particular document of the set of documents comprises a particular word of the plurality of words;
- running an inference algorithm over a Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprising:
  - retrieving a first counter value from a first data structure,
  - based, at least in part, on the first counter value, assigning a particular topic, of a plurality of topics, to the particular word in the particular document to produce a topic assignment for the particular word,
  - after assigning the particular topic to the particular word, updating a second counter value in a second data structure to produce an updated second counter value,
  wherein the updated second counter value reflects the topic assignment, and
  wherein the first data structure is stored and accessed independently from the second data structure; and
- determining, from the sampler result data, one or more sets of correlated words;
wherein the method is performed by one or more computing devices.
Dependent claims: 2, 3, 4, 5, 6, 7
8. One or more non-transitory computer-readable media storing instructions, which, when executed by one or more processors, cause:
- receiving information for a set of documents;
  wherein the set of documents comprises a plurality of words;
  wherein a particular document of the set of documents comprises a particular word of the plurality of words;
- running an inference algorithm over a Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprising:
  - retrieving a first counter value from a first data structure,
  - based, at least in part, on the first counter value, assigning a particular topic, of a plurality of topics, to the particular word in the particular document to produce a topic assignment for the particular word,
  - after assigning the particular topic to the particular word, updating a second counter value in a second data structure to produce an updated second counter value,
  wherein the updated second counter value reflects the topic assignment, and
  wherein the first data structure is stored and accessed independently from the second data structure; and
- determining, from the sampler result data, one or more sets of correlated words.
Dependent claims: 9, 10, 11, 12, 13, 14
Specification