Learning topics by simulation of a stochastic cellular automaton

US 10,394,872 B2
Filed: 11/04/2015
Issued: 08/27/2019
Est. Priority Date: 05/29/2015
Status: Active Grant

1 Assignment

0 Petitions

Abstract

Herein is described an unsupervised learning method to discover topics and reduce the dimensionality of documents by designing and simulating a stochastic cellular automaton. A key formula that appears in many inference methods for LDA is used as the local update rule of the cellular automaton. Approximate counters may be used to represent counter values being tracked by the inference algorithms. Also, sparsity may be used to reduce the amount of computation needed for sampling a topic for particular words in the corpus being analyzed.
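
As an illustration of the approximate counters mentioned above, the following is a minimal sketch of a Morris-style probabilistic counter. The abstract does not say which approximate-counter scheme is used, so the class name `MorrisCounter`, the base-2 update rule, and the estimator `2**x - 1` are assumptions chosen for illustration only.

```python
import random


class MorrisCounter:
    """Hypothetical approximate counter: stores a small exponent instead of a count."""

    def __init__(self, rng=None):
        self.exponent = 0
        self.rng = rng or random.Random()

    def increment(self):
        # Bump the exponent with probability 2**(-exponent), so the exponent
        # grows roughly like log2 of the number of increments.
        if self.rng.random() < 2.0 ** (-self.exponent):
            self.exponent += 1

    def estimate(self):
        # Unbiased estimate of the number of increments seen so far.
        return 2 ** self.exponent - 1


counter = MorrisCounter(random.Random(0))
for _ in range(10_000):
    counter.increment()
print(counter.exponent, counter.estimate())
```

The stored exponent needs only a few bits even for very large counts, which is the usual motivation for such counters; the price is a noisy estimate.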

19 Citations

14 Claims

1. A method for identifying sets of correlated words comprising:
- receiving information for a set of documents;
  
  wherein the set of documents comprises a plurality of words;
  
  wherein a particular document of the set of documents comprises a particular word of the plurality of words;
  
  running an inference algorithm over a Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprising:
  
  retrieving a first counter value from a first data structure,
  
  based, at least in part, on the first counter value, assigning a particular topic, of a plurality of topics, to the particular word in the particular document to produce a topic assignment for the particular word,
  
  after assigning the particular topic to the particular word, updating a second counter value in a second data structure to produce an updated second counter value,
  
  wherein the updated second counter value reflects the topic assignment, and
  
  wherein the first data structure is stored and accessed independently from the second data structure; and
  
  determining, from the sampler result data, one or more sets of correlated words;
  
  wherein the method is performed by one or more computing devices.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1, wherein the inference algorithm is one of: Mean-for-Mode Gibbs sampling, and collapsed variational Bayesian inference.
  - 3. The method of claim 1, further comprising:
    - performing one or more of a particular group of steps in parallel with one or more of the steps of retrieving, assigning, and updating;
      
      wherein the particular group of steps comprises:
      
      retrieving a third counter value from the first data structure;
      
      based, at least in part, on the third counter value, assigning a second topic, of the plurality of topics, to a second word, of the plurality of words, to produce a topic assignment for the second word; and
      
      after assigning the second topic to the second word, updating a fourth counter value, in the second data structure, to reflect the topic assignment for the second word;
      
      wherein the fourth counter value reflects the topic assignment for the second word.
  - 4. The method of claim 1, wherein one or more of said first counter value and said second counter value are represented as approximate counter values.
  - 5. The method of claim 1, further comprising:
    - prior to performing an iteration of the inference algorithm that includes said steps of retrieving, assigning, and updating;
      
      calculating values for one or more tables based on counts in the first data structure;
      
      wherein assigning the particular topic to the particular word comprises performing Walker's “alias method” based on the one or more tables.
  - 6. The method of claim 1, wherein assigning the particular topic to the particular word is performed without respect to one or more of parameters theta and phi.
  - 7. The method of claim 1, wherein assigning the particular topic to the particular word is performed without respect to stored values indicating previous assignments of words to topics.
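
The retrieve, assign, and update steps of claim 1 can be sketched as one pass that reads counts from one set of arrays and writes fresh counts into a separate set. This is only a sketch under stated assumptions: the claims do not give the sampling formula, so the weight `(doc_topic + alpha) * (word_topic + beta) / (topic_total + V * beta)` is the standard collapsed LDA conditional, used here as a stand-in. The function name `sca_iteration` and the hyperparameters `alpha` and `beta` are hypothetical.

```python
import random


def sca_iteration(docs, num_topics, vocab_size, read_counts,
                  alpha=0.1, beta=0.01, rng=None):
    """One pass: read from `read_counts`, write topic counts into new arrays."""
    rng = rng or random.Random()
    doc_topic, word_topic, topic_total = read_counts  # first data structure
    # Second data structure, stored separately and only written in this pass.
    new_doc_topic = [[0] * num_topics for _ in docs]
    new_word_topic = [[0] * num_topics for _ in range(vocab_size)]
    new_topic_total = [0] * num_topics
    assignments = []
    for d, doc in enumerate(docs):
        row = []
        for w in doc:
            # Assumed update rule: standard collapsed LDA conditional.
            # No theta/phi parameters and no previous assignments are consulted.
            weights = [
                (doc_topic[d][k] + alpha) * (word_topic[w][k] + beta)
                / (topic_total[k] + vocab_size * beta)
                for k in range(num_topics)
            ]
            k = rng.choices(range(num_topics), weights=weights)[0]
            new_doc_topic[d][k] += 1
            new_word_topic[w][k] += 1
            new_topic_total[k] += 1
            row.append(k)
        assignments.append(row)
    return assignments, (new_doc_topic, new_word_topic, new_topic_total)


# Two tiny documents over a 4-word vocabulary, starting from all-zero counts.
docs = [[0, 1, 2], [2, 3]]
zero = ([[0, 0], [0, 0]], [[0, 0] for _ in range(4)], [0, 0])
assignments, counts = sca_iteration(docs, 2, 4, zero, rng=random.Random(0))
print(assignments, counts[2])
```

Because every word reads only from the old counts and writes only to the new ones, the inner loop has no read-after-write dependency between words, which is what makes the parallel variant of claim 3 straightforward. The returned counts would be passed in as `read_counts` on the next pass.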

8. One or more non-transitory computer-readable media storing instructions, which, when executed by one or more processors, cause:
- receiving information for a set of documents;
  
  wherein the set of documents comprises a plurality of words;
  
  wherein a particular document of the set of documents comprises a particular word of the plurality of words;
  
  running an inference algorithm over a Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprising:
  
  retrieving a first counter value from a first data structure,
  
  based, at least in part, on the first counter value, assigning a particular topic, of a plurality of topics, to the particular word in the particular document to produce a topic assignment for the particular word,
  
  after assigning the particular topic to the particular word, updating a second counter value in a second data structure to produce an updated second counter value,
  
  wherein the updated second counter value reflects the topic assignment, and
  
  wherein the first data structure is stored and accessed independently from the second data structure; and
  
  determining, from the sampler result data, one or more sets of correlated words.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The one or more non-transitory computer-readable media of claim 8, wherein the inference algorithm is one of: Mean-for-Mode Gibbs sampling, and collapsed variational Bayesian inference.
  - 10. The one or more non-transitory computer-readable media of claim 8, wherein the instructions further comprise instructions, which, when executed by one or more processors, cause:
    - performing one or more of a particular group of steps in parallel with one or more of the steps of retrieving, assigning, and updating;
      
      wherein the particular group of steps comprises:
      
      retrieving a third counter value from the first data structure;
      
      based, at least in part, on the third counter value, assigning a second topic, of the plurality of topics, to a second word, of the plurality of words, to produce a topic assignment for the second word; and
      
      after assigning the second topic to the second word, updating a fourth counter value, in the second data structure, to reflect the topic assignment for the second word;
      
      wherein the fourth counter value reflects the topic assignment for the second word.
  - 11. The one or more non-transitory computer-readable media of claim 8, wherein one or more of said first counter value and said second counter value are represented as approximate counter values.
  - 12. The one or more non-transitory computer-readable media of claim 8, wherein the instructions further comprise instructions, which, when executed by one or more processors, cause:
    - prior to performing an iteration of the inference algorithm that includes said steps of retrieving, assigning, and updating;
      
      calculating values for one or more tables based on counts in the first data structure;
      
      wherein assigning the particular topic to the particular word comprises performing Walker's “alias method” based on the one or more tables.
  - 13. The one or more non-transitory computer-readable media of claim 8, wherein assigning the particular topic to the particular word is performed without respect to one or more of parameters theta and phi.
  - 14. The one or more non-transitory computer-readable media of claim 8, wherein assigning the particular topic to the particular word is performed without respect to stored values indicating previous assignments of words to topics.
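
Claims 5 and 12 recite building tables before an iteration and then sampling with Walker's alias method. The sketch below shows the general technique using Vose's table construction; it is not taken from the specification, and the function names `build_alias_tables` and `alias_sample` are hypothetical. In the claimed setting, the input weights would presumably be derived from the counts in the first data structure.

```python
import random


def build_alias_tables(weights):
    """Precompute probability and alias tables for O(1) sampling."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]  # average bucket mass is 1.0
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]   # bucket s keeps its own mass...
        alias[s] = l          # ...and is topped up from bucket l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:   # leftovers are full buckets (up to rounding)
        prob[i] = 1.0
        alias[i] = i
    return prob, alias


def alias_sample(prob, alias, rng):
    """Draw one index: pick a bucket uniformly, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


prob, alias = build_alias_tables([1.0, 2.0, 3.0, 4.0])
rng = random.Random(0)
print([alias_sample(prob, alias, rng) for _ in range(10)])
```

Building the tables costs time linear in the number of topics, but each subsequent draw takes constant time, so the construction pays off when the same distribution is sampled many times within an iteration.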

Current Assignee: Oracle International Corporation (Oracle Corporation)
Original Assignee: Oracle International Corporation (Oracle Corporation)
Inventors: Tristan, Jean-Baptiste; Green, Stephen J.; Steele, Jr., Guy L.; Zaheer, Manzil
Primary Examiner(s): Reyes, Mariela
Assistant Examiner(s): Harmon, Courtney
Application Number: US14/932,825
Publication Number: US 20160350411A1
Time in Patent Office: 1,392 Days
CPC Class Codes:
G06F 16/3344   using natural language anal...
G06F 16/353   into predefined classes
G06F 16/951   Indexing; Web crawling tech...
