Collapsed Gibbs sampler for sparse topic models and discrete matrix factorization
Abstract
In an inference system for organizing a corpus of objects, feature representations are generated comprising distributions over a set of features corresponding to the objects. A topic model defining a set of topics is inferred by performing latent Dirichlet allocation (LDA) with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution. The inference is performed using a collapsed Gibbs sampling algorithm by iteratively sampling (1) topic allocation variables of the LDA and (2) binary activation variables of the IBP compound Dirichlet prior. In some embodiments the inference is configured such that each inferred topic model is a clean topic model with topics defined as distributions over sub-sets of the set of features selected by the prior. In some embodiments the inference is configured such that the inferred topic model associates a focused sub-set of the set of topics to each object of the training corpus.
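As a concrete reading of the model the abstract describes, the generative side can be sketched as follows: each object draws its topic proportions from a Dirichlet restricted to the topics its binary activation row selects, then draws each word from its allocated topic. This is an illustrative sketch only, assuming a fixed binary activation matrix `b`; all function and variable names are hypothetical, not taken from the patent.

```python
import numpy as np

def generate_corpus(topic_word, b, alpha, doc_lengths, rng):
    """Illustrative sketch: draw each object's topic proportions from a
    Dirichlet restricted to the topics its binary activation row b[d]
    selects, then draw each word from its allocated topic."""
    K, V = topic_word.shape
    docs = []
    for d, n_d in enumerate(doc_lengths):
        active = np.flatnonzero(b[d])              # focused sub-set of topics
        theta = np.zeros(K)
        theta[active] = rng.dirichlet(np.full(active.size, alpha))
        words = []
        for _ in range(n_d):
            k = rng.choice(K, p=theta)             # topic allocation variable
            words.append(int(rng.choice(V, p=topic_word[k])))
        docs.append(words)
    return docs
```

Because inactive topics receive zero probability mass, each generated object uses only the "focused sub-set" of topics the prior activates for it, which is the sparsity property the abstract highlights.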
21 Claims
1. A non-transitory storage medium storing instructions executable by a processor to perform a method comprising:
generating feature representations comprising distributions over a set of features corresponding to objects of a training corpus of objects; and
inferring a topic model defining a set of topics by performing latent Dirichlet allocation (LDA) with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution, the inferring being performed using a collapsed Gibbs sampling algorithm by iteratively sampling (1) topic allocation variables of the LDA and (2) binary activation variables of the IBP compound Dirichlet prior probability distribution;
wherein the inferring performed using a collapsed Gibbs sampling algorithm does not iteratively sample any parameters other than topic allocation variables of the LDA and binary activation variables of the IBP compound Dirichlet prior probability distribution.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.
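The iterative sampling recited in claim 1 can be illustrated with a minimal collapsed Gibbs sweep over the topic allocation variables, in which each candidate topic's probability is masked by the binary activation variables. This is a sketch under stated assumptions: the activation matrix `b` is held fixed (resampling of the binary activation variables themselves, the second quantity the claim recites, is omitted), the count statistics are the standard collapsed-LDA ones, and every name is hypothetical.

```python
import numpy as np

def gibbs_sweep(z, docs, b, n_dk, n_kw, n_k, alpha, beta, rng):
    """One sweep of collapsed Gibbs over topic allocations z.

    b[d, k] is the binary activation variable: topic k may only be
    assigned in document d when b[d, k] == 1.  n_dk, n_kw, n_k are the
    usual document-topic, topic-word, and topic counts.
    """
    V = n_kw.shape[1]
    for d in range(len(docs)):
        for i, w in enumerate(docs[d]):
            k_old = z[d][i]
            # remove the current assignment from the counts
            n_dk[d, k_old] -= 1
            n_kw[k_old, w] -= 1
            n_k[k_old] -= 1
            # collapsed conditional, masked to the active topics of d
            p = b[d] * (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            p /= p.sum()
            k_new = rng.choice(len(p), p=p)
            # record the new assignment and restore the counts
            z[d][i] = k_new
            n_dk[d, k_new] += 1
            n_kw[k_new, w] += 1
            n_k[k_new] += 1
    return z
```

Collapsing (integrating out the topic proportions and topic-word distributions) is what lets the sampler iterate over only the allocation and activation variables, matching the "does not iteratively sample any parameters other than" limitation.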
13. A method comprising:
generating feature representations comprising distributions over a set of features corresponding to objects d = 1, …, D of a training corpus of D objects; and
inferring a generative topic model defining a set of topics k = 1, …, K by performing a latent generative topic model allocation using a collapsed Gibbs sampling algorithm with an Indian Buffet Process (IBP) compound prior probability distribution having binary activation variables Θ ∈ {0, 1}^(K×D) and object-specific topic proportions θ_d | Θ_d ~ Dirichlet(α ∘ Θ_d) with weights α that are the same for all the objects d = 1, …, D;
wherein the inferring includes iterative sampling of (1) topic allocation variables of the generative topic model allocation and (2) the binary activation variables Θ of the IBP compound prior probability distribution; and
wherein the generating and inferring are performed by a digital processor.
Dependent claims: 14, 15, 16, 17, 18, 19, 20.
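The compound prior recited in claim 13 can be illustrated directly: multiplying the shared weights α elementwise by a binary activation column Θ_d zeroes out inactive topics, so the draw reduces to a Dirichlet over the active topics only. A minimal sketch, with all names hypothetical:

```python
import numpy as np

def sample_topic_proportions(alpha, theta_col, rng):
    """Draw theta_d ~ Dirichlet(alpha * Theta_d): the elementwise product
    of the shared weights alpha and the binary activation column Theta_d
    gives inactive topics zero weight, hence zero probability mass."""
    theta = np.zeros(alpha.shape[0])
    active = np.flatnonzero(theta_col)
    theta[active] = rng.dirichlet(alpha[active])
    return theta
```

Note that the weights α are shared across all objects d = 1, …, D, as the claim requires; only the binary column Θ_d varies per object.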
21. An apparatus comprising:
a digital processor configured to perform a method including:
generating feature representations comprising distributions over a set of features corresponding to documents d = 1, …, D of a training corpus of D documents; and
inferring a generative latent Dirichlet allocation (LDA) or probabilistic latent semantic analysis (PLSA) topic model defining a set of topics k = 1, …, K by performing a latent generative topic model allocation using a collapsed Gibbs sampling algorithm with an Indian Buffet Process (IBP) compound prior probability distribution having binary activation variables Θ ∈ {0, 1}^(K×D) and document-specific topic proportions θ_d | Θ_d ~ Dirichlet(α ∘ Θ_d) with weights α that are the same for all the documents d = 1, …, D;
wherein the inferring includes iterative sampling of (1) topic allocation variables of the generative topic model allocation and (2) the binary activation variables Θ of the IBP compound prior probability distribution.
Specification