DATA-PARALLEL PARAMETER ESTIMATION OF THE LATENT DIRICHLET ALLOCATION MODEL BY GREEDY GIBBS SAMPLING
First Claim
1. A method for identifying sets of correlated words comprising:
- receiving information for a set of documents;
- wherein the set of documents comprises a plurality of words;
- running a partially-collapsed Gibbs sampler over a Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprising:
- calculating a mean of the Dirichlet distribution;
- determining, from the sampler result data, one or more sets of correlated words;
- wherein the method is performed by one or more computing devices.
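The "calculating a mean of the Dirichlet distribution" step above is the deterministic operation that replaces a random draw. For a Dirichlet distribution with concentration parameters α₁…αₖ, the mean of component i is simply αᵢ / Σⱼ αⱼ. A minimal sketch (the function name and the example numbers are illustrative, not from the patent):

```python
import numpy as np

def dirichlet_mean(alpha):
    """Mean of a Dirichlet(alpha) distribution: alpha_i / sum(alpha).

    Computing this mean is deterministic, unlike drawing a random
    sample from the distribution, which is the substitution the
    claimed method relies on.
    """
    alpha = np.asarray(alpha, dtype=float)
    return alpha / alpha.sum()

# For a posterior Dirichlet whose concentration is (counts + prior):
counts = np.array([5.0, 2.0, 1.0])     # hypothetical topic-word counts
prior = 0.5                            # hypothetical symmetric prior
point_estimate = dirichlet_mean(counts + prior)
```

Because the mean is a closed-form normalization of counts, it can be computed for all topic-word entries at once, with no sequential sampling dependency.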
1 Assignment
0 Petitions
Abstract
A novel data-parallel algorithm is presented for topic modeling on highly parallel hardware architectures. The algorithm is a Markov chain Monte Carlo algorithm that estimates the parameters of the LDA topic model. It is based on a highly parallel partially-collapsed Gibbs sampler, but replaces a stochastic step that draws from a distribution with an optimization step that computes the mean of the distribution directly and deterministically. The algorithm is correct, statistically performant, and faster than state-of-the-art algorithms because it can exploit massive amounts of parallelism on a highly parallel architecture, such as a GPU. Furthermore, the partially-collapsed Gibbs sampler converges about as fast as the collapsed Gibbs sampler and identifies solutions that are as good as, or better than, those of the collapsed Gibbs sampler.
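The abstract's key idea — keep resampling topic assignments, but replace the random draw of the topic-word distributions with their Dirichlet posterior mean — can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation; all names, hyperparameter defaults, and the plain-Python loop structure are assumptions, and a real data-parallel version would vectorize the per-token updates (e.g. on a GPU), which is possible precisely because each token's update depends only on the fixed mean.

```python
import numpy as np

def greedy_gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
                     iterations=50, seed=0):
    """Sketch of a partially-collapsed ("greedy") Gibbs sampler for LDA.

    Instead of drawing the topic-word distributions phi from their
    Dirichlet posterior, the stochastic draw is replaced by the
    posterior MEAN, computed deterministically. Topic assignments z
    are still resampled; with phi fixed, every token's update is
    independent, which exposes the data parallelism.
    """
    rng = np.random.default_rng(seed)
    # Random initial topic assignment for every token.
    z = [rng.integers(num_topics, size=len(d)) for d in docs]

    for _ in range(iterations):
        # Rebuild count matrices from the current assignments.
        topic_word = np.zeros((num_topics, vocab_size))
        doc_topic = np.zeros((len(docs), num_topics))
        for d, (words, zs) in enumerate(zip(docs, z)):
            for w, k in zip(words, zs):
                topic_word[k, w] += 1
                doc_topic[d, k] += 1

        # Greedy step: use the mean of the Dirichlet posterior for phi,
        # i.e. (count + beta) normalized per topic, instead of a draw.
        phi = topic_word + beta
        phi /= phi.sum(axis=1, keepdims=True)

        # Resample each token's topic given the fixed phi; these updates
        # are mutually independent and could run in parallel.
        for d, words in enumerate(docs):
            theta = doc_topic[d] + alpha  # unnormalized doc-topic weights
            for i, w in enumerate(words):
                p = theta * phi[:, w]
                z[d][i] = rng.choice(num_topics, p=p / p.sum())

    return phi, z
```

A usage sketch: `phi, z = greedy_gibbs_lda([[0, 1, 2, 2], [3, 4, 3]], num_topics=2, vocab_size=5)` returns the estimated topic-word distributions and the per-token topic assignments; words that dominate the same row of `phi` form a set of correlated words.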
14 Citations
19 Claims
1. A method for identifying sets of correlated words comprising:
- receiving information for a set of documents;
- wherein the set of documents comprises a plurality of words;
- running a partially-collapsed Gibbs sampler over a Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprising:
- calculating a mean of the Dirichlet distribution;
- determining, from the sampler result data, one or more sets of correlated words;
- wherein the method is performed by one or more computing devices.
- View Dependent Claims (2, 3, 4, 5, 6, 7)
8. One or more computer-readable media storing instructions which, when executed by one or more processors, cause performance of:
- receiving information for a set of documents;
- wherein the set of documents comprises a plurality of words;
- running a partially-collapsed Gibbs sampler over a Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprising:
- calculating a mean of the Dirichlet distribution;
- determining, from the sampler result data, one or more sets of correlated words.
- View Dependent Claims (9, 10, 11, 12, 13, 14)
15. A computer system comprising:
- one or more processors; and
- one or more computer-readable media storing instructions which, when executed by the one or more processors, cause performance of:
- receiving information for a set of documents;
- wherein the set of documents comprises a plurality of words;
- running a partially-collapsed Gibbs sampler over a Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprising:
- calculating a mean of the Dirichlet distribution;
- determining, from the sampler result data, one or more sets of correlated words.
- View Dependent Claims (16, 17, 18, 19)
Specification