COLLAPSED GIBBS SAMPLER FOR SPARSE TOPIC MODELS AND DISCRETE MATRIX FACTORIZATION

0 Associated Cases
0 Associated Defendants
0 Accused Products
49 Forward Citations
0 Petitions
1 Assignment
First Claim
1. A storage medium storing instructions executable by a processor to perform a method comprising:
 generating feature representations comprising distributions over a set of features corresponding to objects of a training corpus of objects; and
inferring a topic model defining a set of topics by performing latent Dirichlet allocation (LDA) with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution, the inferring being performed using a collapsed Gibbs sampling algorithm by iteratively sampling (1) topic allocation variables of the LDA and (2) binary activation variables of the IBP compound Dirichlet prior probability distribution.
Abstract
In an inference system for organizing a corpus of objects, feature representations are generated comprising distributions over a set of features corresponding to the objects. A topic model defining a set of topics is inferred by performing latent Dirichlet allocation (LDA) with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution. The inference is performed using a collapsed Gibbs sampling algorithm by iteratively sampling (1) topic allocation variables of the LDA and (2) binary activation variables of the IBP compound Dirichlet prior probability distribution. In some embodiments the inference is configured such that each inferred topic model is a clean topic model with topics defined as distributions over subsets of the set of features selected by the prior. In some embodiments the inference is configured such that the inferred topic model associates a focused subset of the set of topics to each object of the training corpus.
56 Citations
Parallel Processing Of Data Sets  
Patent #
US 20120117008A1
Filed 11/09/2010

Current Assignee
Microsoft Technology Licensing LLC

Sponsoring Entity
Microsoft Technology Licensing LLC

Automated Contextual Information Retrieval Based on MultiTiered User Modeling and Dynamic Retrieval Strategy  
Patent #
US 20120209871A1
Filed 02/10/2011

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

MULTITIERED APPROACH TO EMAIL PRIORITIZATION  
Patent #
US 20130212047A1
Filed 06/15/2012

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

APPARATUS FOR CLUSTERING A PLURALITY OF DOCUMENTS  
Patent #
US 20130212106A1
Filed 02/14/2013

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

System, method and apparatus for increasing speed of hierarchial latent dirichlet allocation model  
Patent #
US 8,527,448 B2
Filed 12/20/2012

Current Assignee
Huawei Technologies Co. Ltd.

Sponsoring Entity
Huawei Technologies Co. Ltd.

MULTITIERED APPROACH TO EMAIL PRIORITIZATION  
Patent #
US 20130339276A1
Filed 06/20/2012

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Knowledge discovery from citation networks  
Patent #
US 8,630,975 B1
Filed 12/02/2011

Current Assignee
The Research Foundation for The State University of New York

Sponsoring Entity
The Research Foundation for The State University of New York

KNOWLEDGE DISCOVERY FROM CITATION NETWORKS  
Patent #
US 20140188780A1
Filed 01/14/2014

Current Assignee
The Research Foundation for The State University of New York

Sponsoring Entity
The Research Foundation for The State University of New York

SYSTEM AND METHOD FOR COMPUTERIZED SEMANTIC PROCESSING OF ELECTRONIC DOCUMENTS INCLUDING THEMES  
Patent #
US 20140207782A1
Filed 01/22/2014

Current Assignee
Microsoft Israel Research And Development 2002 Ltd

Sponsoring Entity
Equivio Ltd.

MAILBOX SEARCH ENGINE USING QUERY MULTIMODAL EXPANSION AND COMMUNITYBASED SMOOTHING  
Patent #
US 20140280207A1
Filed 03/15/2013

Current Assignee
Xerox Corporation

Sponsoring Entity
Xerox Corporation

Parallel processing of data sets  
Patent #
US 8,868,470 B2
Filed 11/09/2010

Current Assignee
Microsoft Technology Licensing LLC

Sponsoring Entity
Microsoft Technology Licensing LLC

Knowledge discovery from citation networks  
Patent #
US 8,930,304 B2
Filed 01/14/2014

Current Assignee
The Research Foundation for The State University of New York

Sponsoring Entity
The Research Foundation for The State University of New York

SYSTEM AND METHOD FOR IDENTIFYING AND VISUALISING TOPICS AND THEMES IN COLLECTIONS OF DOCUMENTS  
Patent #
US 20150046151A1
Filed 03/22/2013

Current Assignee
BAE Systems Australia Limited

Sponsoring Entity
BAE Systems Australia Limited

METHOD OF AUTOMATED DISCOVERY OF TOPICS RELATEDNESS  
Patent #
US 20150154305A1
Filed 12/02/2014

Current Assignee
Qbase LLC

Sponsoring Entity
Qbase LLC

KNOWLEDGE DISCOVERY FROM CITATION NETWORKS  
Patent #
US 20150186789A1
Filed 12/29/2014

Current Assignee
The Research Foundation for The State University of New York

Sponsoring Entity
The Research Foundation for The State University of New York

Multitiered approach to Email prioritization  
Patent #
US 9,152,953 B2
Filed 06/15/2012

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

DYNAMIC CREATION OF DOMAIN SPECIFIC CORPORA  
Patent #
US 20150347467A1
Filed 05/27/2014

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Multitiered approach to Email prioritization  
Patent #
US 9,256,862 B2
Filed 06/20/2012

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Topic Model For Comments Analysis And Use Thereof  
Patent #
US 20160048768A1
Filed 08/15/2014

Current Assignee
HERE Global B.V.

Sponsoring Entity
HERE Global B.V.

Knowledge discovery from citation networks  
Patent #
US 9,269,051 B2
Filed 12/29/2014

Current Assignee
The Research Foundation for The State University of New York

Sponsoring Entity
The Research Foundation for The State University of New York

Mailbox search engine using query multimodal expansion and communitybased smoothing  
Patent #
US 9,280,587 B2
Filed 03/15/2013

Current Assignee
Xerox Corporation

Sponsoring Entity
Xerox Corporation

Labeling product identifiers and navigating products  
Patent #
US 9,323,838 B2
Filed 09/03/2013

Current Assignee
Alibaba Group Services Limited

Sponsoring Entity
Alibaba Group Services Limited

Apparatus for clustering a plurality of documents  
Patent #
US 9,342,591 B2
Filed 02/14/2013

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

JOINT APPROACH TO FEATURE AND DOCUMENT LABELING  
Patent #
US 20160203209A1
Filed 01/12/2015

Current Assignee
Xerox Corporation

Sponsoring Entity
Xerox Corporation

DATAPARALLEL PARAMETER ESTIMATION OF THE LATENT DIRICHLET ALLOCATION MODEL BY GREEDY GIBBS SAMPLING  
Patent #
US 20160210718A1
Filed 01/16/2015

Current Assignee
Oracle International Corporation

Sponsoring Entity
Oracle International Corporation

INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, AND NONTRANSITORY COMPUTER READABLE MEDIUM  
Patent #
US 20160259774A1
Filed 08/19/2015

Current Assignee
Fuji Xerox Company Limited

Sponsoring Entity
Fuji Xerox Company Limited

Pluggable storage system for distributed file systems  
Patent #
US 9,454,548 B1
Filed 03/15/2013

Current Assignee
EMC Corporation

Sponsoring Entity
EMC Corporation

Access and presentation of files based on semantic proximity to current interests  
Patent #
US 9,460,191 B1
Filed 04/14/2016

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Method of automated discovery of topics relatedness  
Patent #
US 9,542,477 B2
Filed 12/02/2014

Current Assignee
Qbase LLC

Sponsoring Entity
Qbase LLC

ACCESS AND PRESENTATION OF FILES BASED ON SEMANTIC PROXIMITY TO CURRENT INTERESTS  
Patent #
US 20170026489A1
Filed 07/22/2015

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Access and presentation of files based on semantic proximity to current interests  
Patent #
US 9,576,043 B2
Filed 09/01/2016

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Automated contextual information retrieval based on multitiered user modeling and dynamic retrieval strategy  
Patent #
US 9,633,140 B2
Filed 02/10/2011

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Pluggable storage system for parallel query engines  
Patent #
US 9,805,053 B1
Filed 03/15/2013

Current Assignee
Emc IP Holding Company LLC

Sponsoring Entity
Emc IP Holding Company LLC

Access and presentation of files based on semantic proximity to current interests  
Patent #
US 9,824,097 B2
Filed 12/05/2016

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Knowledge discovery from citation networks  
Patent #
US 9,892,367 B2
Filed 02/22/2016

Current Assignee
The Research Foundation for The State University of New York

Sponsoring Entity
The Research Foundation for The State University of New York

Tiering with pluggable storage system for parallel query engines  
Patent #
US 9,898,475 B1
Filed 03/15/2013

Current Assignee
Emc IP Holding Company LLC

Sponsoring Entity
Emc IP Holding Company LLC

Information analysis system, information analysis method, and information analysis program  
Patent #
US 9,940,319 B2
Filed 05/25/2015

Current Assignee
Nippon Telegraph and Telephone Corporation

Sponsoring Entity
Nippon Telegraph and Telephone Corporation

Pluggable storage system for parallel query engines across nonnative file systems  
Patent #
US 9,984,083 B1
Filed 03/29/2013

Current Assignee
Emc IP Holding Company LLC

Sponsoring Entity
Emc IP Holding Company LLC

System and method for computerized identification and effective presentation of semantic themes occurring in a set of electronic documents  
Patent #
US 10,002,182 B2
Filed 01/22/2014

Current Assignee
Microsoft Israel Research And Development 2002 Ltd

Sponsoring Entity
Microsoft Israel Research And Development 2002 Ltd

Access and presentation of files based on semantic proximity to current interests  
Patent #
US 10,025,799 B2
Filed 07/22/2015

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Joint approach to feature and document labeling  
Patent #
US 10,055,479 B2
Filed 01/12/2015

Current Assignee
Xerox Corporation

Sponsoring Entity
Xerox Corporation

Method and system for distributed latent dirichlet allocation computation using addition of approximate counters  
Patent #
US 10,140,281 B2
Filed 08/07/2015

Current Assignee
Oracle International Corporation

Sponsoring Entity
Oracle International Corporation

Method and system for latent dirichlet allocation computation using approximate counters  
Patent #
US 10,147,044 B2
Filed 08/06/2015

Current Assignee
Oracle International Corporation

Sponsoring Entity
Oracle International Corporation

Parallel Gibbs sampler using butterflypatterned partial sums  
Patent #
US 10,157,346 B2
Filed 05/15/2015

Current Assignee
Oracle International Corporation

Sponsoring Entity
Oracle International Corporation

KNOWLEDGE DISCOVERY FROM CITATION NETWORKS  
Patent #
US 20160171391A1
Filed 02/22/2016

Current Assignee
The Research Foundation for The State University of New York

Sponsoring Entity
The Research Foundation for The State University of New York

Learning topics by simulation of a stochastic cellular automaton  
Patent #
US 10,394,872 B2
Filed 11/04/2015

Current Assignee
Oracle International Corporation

Sponsoring Entity
Oracle International Corporation

Apparatus and method for extracting semantic topic  
Patent #
US 10,423,723 B2
Filed 06/03/2015

Current Assignee
Korea University Research and Business Foundation

Sponsoring Entity
Korea University Research and Business Foundation

Pluggable storage system for distributed file systems  
Patent #
US 10,459,917 B2
Filed 05/09/2016

Current Assignee
Emc IP Holding Company LLC

Sponsoring Entity
Emc IP Holding Company LLC

Dataparallel probabilistic inference  
Patent #
US 10,496,929 B2
Filed 06/26/2014

Current Assignee
Oracle International Corporation

Sponsoring Entity
Oracle International Corporation

MATERIAL RECOGNITION FROM AN IMAGE  
Patent #
US 20110243450A1
Filed 04/01/2010

Current Assignee
Microsoft Technology Licensing LLC

Sponsoring Entity
Microsoft Corporation

Mining Multilingual Topics  
Patent #
US 20110258229A1
Filed 04/15/2010

Current Assignee
Microsoft Technology Licensing LLC

Sponsoring Entity
Microsoft Technology Licensing LLC

METHOD AND SYSTEM TO PREDICT THE LIKELIHOOD OF TOPICS  
Patent #
US 20100280985A1
Filed 01/13/2009

Current Assignee
United States Navy

Sponsoring Entity
United States Navy

CLUSTERING USING NONNEGATIVE MATRIX FACTORIZATION ON SPARSE GRAPHS  
Patent #
US 20090271433A1
Filed 04/25/2008

Current Assignee
Xerox Corporation

Sponsoring Entity
Xerox Corporation

Interactive cleaning for automatic document clustering and categorization  
Patent #
US 20080249999A1
Filed 04/06/2007

Current Assignee
Xerox Corporation

Sponsoring Entity
Xerox Corporation

METHODS AND SYSTEMS FOR UTILIZING CONTENT, DYNAMIC PATTERNS, AND/OR RELATIONAL INFORMATION FOR DATA ANALYSIS  
Patent #
US 20070118498A1
Filed 11/22/2006

Current Assignee
NEC Corporation

Sponsoring Entity
NEC Corporation

Apparatus and method for retrieving and grouping images representing text files based on the relevance of key words extracted from a selected file to the text files  
Patent #
US 5,598,557 A
Filed 09/22/1992

Current Assignee
Nuance Communications Inc.

Sponsoring Entity
Caere Corp. Mexico

22 Claims
 1. A storage medium storing instructions executable by a processor to perform a method comprising:
generating feature representations comprising distributions over a set of features corresponding to objects of a training corpus of objects; and inferring a topic model defining a set of topics by performing latent Dirichlet allocation (LDA) with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution, the inferring being performed using a collapsed Gibbs sampling algorithm by iteratively sampling (1) topic allocation variables of the LDA and (2) binary activation variables of the IBP compound Dirichlet prior probability distribution.  View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
 14. A method comprising:
generating feature representations comprising distributions over a set of features corresponding to objects of a training corpus of objects; and inferring a generative topic model defining a set of topics by performing a latent generative topic model allocation using a collapsed Gibbs sampling algorithm with an Indian Buffet Process (IBP) compound prior probability distribution; wherein the inferring includes iterative sampling of (1) topic allocation variables of the generative topic model allocation and (2) binary activation variables of the IBP compound prior probability distribution; and wherein the generating and inferring are performed by a digital processor.  View Dependent Claims (15, 16, 17, 18, 19, 20, 21, 22)
Specification
The following relates to the document or object processing arts, clustering arts, classification arts, retrieval arts, and so forth.
In processing documents (or, more generally, objects, a term intended to encompass text documents, images, audio/video content, and so forth), it is useful to generate a statistical topic model defining a set of topics. For text documents represented using "bag-of-words" representations, a topic of the topic model is suitably represented as a statistical distribution over words that are typical of the topic. A new document can be associated with topics with varying association strengths based on similarities between the topics and the distribution of words in the document. As another application, given an input document selected from a corpus of documents already modeled using the topic model, similar documents can be rapidly identified by comparing topic association strengths.
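The bag-of-words idea above can be sketched as follows; the tiny vocabulary, example sentence, and normalization scheme are illustrative assumptions, not part of the disclosure:

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Represent a document as a normalized distribution over a fixed vocabulary."""
    counts = Counter(w for w in text.lower().split() if w in vocabulary)
    total = sum(counts.values())
    return {w: counts[w] / total for w in counts}

vocab = {"topic", "model", "word", "image", "patch"}  # hypothetical vocabulary
doc = "A topic model relates each word to a topic"
rep = bag_of_words(doc, vocab)
print(rep)  # {'topic': 0.5, 'model': 0.25, 'word': 0.25}
```

Out-of-vocabulary words ("relates", "each") are simply dropped, so the representation is a discrete distribution over vocabulary words only.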
The word distributions of a text document can be considered features of the text document, and the topics of the topic model are statistical distributions of these features that are typical of the topics. For other types of objects, features of the objects are derived and topics of the topic model are generated as statistical distributions of features that are typical of the topic. As an example, an image can be characterized by visual features extracted from spatial regions, or “patches”, of the image.
Various approaches can be employed for generating the topic model. Nonnegative matrix factorization techniques such as Latent Dirichlet Allocation (LDA) or probabilistic latent semantic analysis (PLSA) are known approaches, and have been used in applications such as text clustering, dimensionality reduction of large sparse arrays, or so forth. Underlying the LDA and PLSA models is the observation that a large matrix containing positive values can be approximated by a sum of rank-one positive matrices. Compared to more classical matrix factorization techniques such as Singular Value Decomposition, which are rotationally invariant, the low-rank matrices obtained by nonnegative decompositions are often nearly sparse, i.e., they contain few large positive values and many small values close to zero. The large values correspond in general to clusters of rows and columns of the original matrices, and are identifiable with topics. These topic models can be formalized as generative models for large sparse positive matrices, e.g. large sets of documents. Topic models are typically used to organize these documents according to themes (that is, topics) in an unsupervised way, that is, without reliance upon document topic annotations or other a priori information about the topics. In this framework, topics are defined as discrete distributions over vocabulary words (for text documents; more generally, distributions over features of objects) and topics are associated to each document according to a relative weighting (proportions).
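As an illustrative sketch of the sum-of-rank-one observation (the values below are arbitrary, not from the disclosure), a nonnegative matrix with cluster structure can be written as a sum of rank-one nonnegative matrices, each linking a cluster of rows (words) to a cluster of columns (documents):

```python
import numpy as np

# Two rank-one nonnegative "topics": rows 0-1 co-occur with columns 0-1,
# rows 2-3 with columns 1-2. Their sum shows the block/cluster structure.
u1, v1 = np.array([1.0, 1.0, 0.0, 0.0]), np.array([1.0, 1.0, 0.0])
u2, v2 = np.array([0.0, 0.0, 1.0, 1.0]), np.array([0.0, 1.0, 1.0])
M = np.outer(u1, v1) + np.outer(u2, v2)
print(M)
# [[1. 1. 0.]
#  [1. 1. 0.]
#  [0. 1. 1.]
#  [0. 1. 1.]]
```

M has rank two even though it is 4 x 3; each rank-one term is identifiable with one topic.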
The following sets forth improved methods and apparatuses.
In some illustrative embodiments disclosed as illustrative examples herein, a storage medium is disclosed storing instructions executable by a processor to perform a method comprising: generating feature representations comprising distributions over a set of features corresponding to objects of a training corpus of objects; and inferring a topic model defining a set of topics by performing latent Dirichlet allocation (LDA) with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution (that is, LIDA), the inferring being performed using a collapsed Gibbs sampling algorithm by iteratively sampling (1) topic allocation variables of the LDA and (2) binary activation variables of the IBP compound Dirichlet prior probability distribution.
In some embodiments as set forth in the immediately preceding paragraph, the LIDA is performed with an IBP compound Dirichlet prior configured such that the inferred topic model is a clean topic model in which each topic is defined as a distribution over a subset of the set of features selected by the IBP compound Dirichlet prior probability distribution. In some embodiments as set forth in the immediately preceding paragraph, the LDA is performed with an IBP compound Dirichlet prior configured such that the inferred topic model associates a focused subset of the set of topics to each object of the training corpus.
In some illustrative embodiments disclosed as illustrative examples herein, a method comprises: generating feature representations comprising distributions over a set of features corresponding to objects of a training corpus of objects; and inferring a generative topic model defining a set of topics by performing a latent generative topic model allocation using a collapsed Gibbs sampling algorithm with an Indian Buffet Process (IBP) compound prior probability distribution; wherein the inferring includes iterative sampling of (1) topic allocation variables of the generative topic model allocation and (2) binary activation variables of the IBP compound prior probability distribution; and wherein the generating and inferring are performed by a digital processor.
In some illustrative embodiments disclosed as illustrative examples herein, a processor is disclosed, which is configured to perform a method as set forth in the immediately preceding paragraph.
It is recognized herein that the degree of sparsity of the decomposition is a parameter of interest in nonnegative matrix factorization applications. For example, in the context of document clustering, clusters correspond to themes or topics, which in turn are implicitly defined by discrete distributions over a set of vocabulary words used in bag-of-words document representations (or, more generally, over a set of features used in feature representations of objects). Hence, topics can be interpreted by considering the nonzero entries, that is, the set of words to which positive weight is assigned.
However, it is recognized herein that topics obtained by latent generative topic model allocation techniques such as PLSA or LDA are, in many practical applications, not sufficiently sparse. Stated another way, the topics are not "clean" because they contain small (but nonzero) values for infrequent words (or, more generally, features). The overall collection of topics is also sometimes not focused, meaning that all topics are assigned positive weight in each document; infrequently occurring topics then contribute less to the organization of documents based on the set of topics, yet the occurrence of an infrequently occurring topic in a specific document may be highly informative for that specific document.
Disclosed herein are approaches that overcome these difficulties by a latent generative topic model allocation that is tunable to yield clean topics, each comprising a distribution over a subset of the set of vocabulary words (or, more generally, over a subset of the set of features). The latent generative topic model allocation is also independently (and optionally concurrently) tunable such that the inferred topic model produces document models that associate a focused subset of the set of topics to each document (or, more generally, object) of a training corpus. The approaches employ latent generative topic model allocation using a collapsed Gibbs sampling algorithm with an Indian Buffet Process (IBP) compound prior probability distribution. The prior has a tunable parameter for tuning the "cleanness" of the clean topics by suitably adjusting the sparsity of the prior, and hence the sparsity of the generated clean topics. The prior also has an independent tunable parameter for tuning how focused is the focused subset of the set of topics (i.e., the document model) associated to each document (or object), again by suitably adjusting the sparsity of the prior and hence the sparsity of the focused topic allocations.
The foregoing is disclosed herein with reference to an illustrative embodiment in which the objects undergoing processing comprise documents including text, the feature representations comprise bag-of-words representations, and the set of features comprises a set of vocabulary words. More generally, however, the objects may be any type of object (e.g., text documents, images, multimodal documents, audio-video content units, or so forth) in which each object is characterized by values for a set of features (i.e., a feature representation).
In the illustrative embodiment latent Dirichlet allocation (LDA) is used as the latent generative topic model, and the LDA is implemented using a collapsed (also known as Rao-Blackwellised) Gibbs sampling algorithm with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution. More generally, however, other latent generative topic model allocation approaches, such as probabilistic latent semantic analysis (PLSA) allocation, can be employed in conjunction with a suitable IBP compound prior probability distribution. Implementing the latent generative topic model allocation using a Gibbs sampling algorithm with a compound prior probability distribution is convenient; however, other implementations of the latent generative topic model allocation are also contemplated. The latent generative topic model allocation infers a generative topic model defining a set of topics, each topic comprising a distribution over the set of vocabulary words (or, more generally, over the set of features used in the feature representations of the objects). In embodiments in which the cleanness tuning parameter is used, the topics of the inferred topic model are clean topics, each comprising a distribution over a subset of the set of vocabulary words. In embodiments in which the focusedness tuning parameter is used, the inferred topic model is focused such that documents are represented by subsets of the set of topics.
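For concreteness, a minimal collapsed Gibbs sampler for conventional LDA (with an ordinary symmetric Dirichlet prior, not the IBP compound prior of the disclosed LIDA) can be sketched as follows; the hyperparameter values and the toy corpus are illustrative assumptions:

```python
import numpy as np

def collapsed_gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampler for conventional LDA.

    docs: list of token-id lists; V: vocabulary size; K: number of topics.
    The topic distributions and topic proportions are integrated out
    (collapsed); only the topic allocation variables z_id are sampled.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_dk = np.zeros((D, K))   # topic counts per document
    n_kv = np.zeros((K, V))   # word counts per topic
    n_k = np.zeros(K)         # total word counts per topic
    z = [rng.integers(K, size=len(d)) for d in docs]  # random initialization
    for d, doc in enumerate(docs):
        for i, v in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                k = z[d][i]
                n_dk[d, k] -= 1; n_kv[k, v] -= 1; n_k[k] -= 1
                # full conditional: p(z=k | rest) proportional to
                # (n_dk + alpha) * (n_kv + beta) / (n_k + V*beta)
                p = (n_dk[d] + alpha) * (n_kv[:, v] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1
    phi = (n_kv + beta) / (n_kv.sum(1, keepdims=True) + V * beta)
    theta = (n_dk + alpha) / (n_dk.sum(1, keepdims=True) + K * alpha)
    return phi, theta

docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]  # hypothetical token-id corpus
phi, theta = collapsed_gibbs_lda(docs, V=4, K=2)
```

The disclosed LIDA sampler additionally iterates over the binary activation variables of the IBP compound prior; the loop over topic allocation variables above is the part the two samplers share.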
With reference to
A modeling module 20 processes the training corpus of BOW representations 16 to generate a topic model. The illustrative modeling module 20 employs latent IBP compound Dirichlet allocation (LIDA) as disclosed herein. Unlike conventional LDA, which is performed with a conventional Dirichlet prior probability distribution, the LDA of the modeling module 20 is performed with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution. (Note that the term "prior probability distribution" is sometimes shortened to "prior" herein, in accordance with conventional notation used in the art.) The technique disclosed herein of performing LDA with an IBP compound Dirichlet prior is sometimes referred to herein as latent IBP Dirichlet allocation (LIDA). In the illustrative example of
The modeling module 20 outputs a topic model 22 (which is optionally clean, with each topic comprising a distribution over a subset of the set of vocabulary words 14). The vocabulary word subsets of the clean topics of the clean topic model are selected by the IBP compound Dirichlet prior probability distribution, with the degree of "cleanness" or sparsity of the clean topics being controlled by a tuning parameter of the LIDA. The topic model 22 defines a set of topics, in that each topic is semantically represented by its distribution over the set of vocabulary words 14 (or over a subset of the vocabulary words in the case of clean topics).
The modeling module 20 also outputs document models 24 for the documents of the training corpus 10. The document models 24 represent the strength of association of the document with each topic defined by the topic model 22. In some embodiments the document models 24 are focused document models in which the inferred topic model 22 associates a focused subset of the set of topics to each document of the training corpus 10. In other words, in the case of a focused document model for a document there are some topics of the set of topics having zero association with the document. As with the cleanness of the topics of the clean topic model, the focusing of the focused document models 24 is selected by the IBP compound Dirichlet prior probability distribution, with the degree of focus or sparsity of the document models being controlled by a tuning parameter of the LIDA.
It is to be understood that the tuning parameter for tuning the cleanness or sparsity of the clean topics is different from and independent of the tuning parameter for tuning the focus or sparsity of the focused document models. The tuning parameter for tuning the cleanness or sparsity of the clean topics is denoted herein as γ (the lowercase Greek letter "gamma"). The tuning parameter for tuning the focus or sparsity of the focused document models is denoted herein as η (the lowercase Greek letter "eta").
With continuing reference to
It will also be appreciated that the disclosed inference methods for generating the topic model 22 for organizing a training corpus 10 of documents can be embodied as a storage medium storing instructions executable by the computer C or another digital processor or digital processing device to perform the disclosed inference methods. The storage medium may, by way of illustrative example, comprise a hard disk or other magnetic storage medium, an optical disk or other optical storage medium, random access memory (RAM), readonly memory (ROM), flash memory, or another electronic storage medium, various combinations thereof, or so forth.
In the illustrative example of
In the following, an illustrative embodiment of the modeling module 20 is described, which operates on objects comprising documents including text, and employs LDA implemented using a collapsed Gibbs sampler with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution.
Consider a corpus of D documents. LDA models these documents as a mixture of K discrete distributions over vocabulary words (called tokens); the distributions are called topics. The model employs a bag-of-words feature representation.
Formally, let w_{id} ∈ {1, . . . , V} denote the token corresponding to the i^{th} word observed in document d and let z_{id} ∈ {1, . . . , K} be the topic associated to this word. The generative model of LDA is defined as follows:

z_{id}|θ_{d}˜Discrete(θ_{d}),

w_{id}|z_{id}, {φ_{k}}˜Discrete(φ_{z_{id}}),
where d={1, . . . , D}, k={1, . . . , K} and i={1, . . . , N_{d}}. It is assumed for this illustrative example that the number of words in document d, denoted N_{d}, is generated from a Poisson distribution.
With brief reference to
The generative process for LDA using a conventional Dirichlet prior can be described as follows. For each document, draw a set of topic proportions according to a (symmetric) Dirichlet distribution. For each word in the document: select a topic by drawing it from a discrete distribution whose parameter equals the document-specific topic proportions; and select a word from the vocabulary by drawing from the discrete distribution defining the selected topic, whose parameters are also drawn from a (symmetric) Dirichlet distribution. The following will be noted for LDA using a conventional Dirichlet prior: (i) topics are defined by distributions over all vocabulary words; and (ii) nonzero proportions are associated to all topics in each document. Condition (i) corresponds to the topics not being sparse, while condition (ii) corresponds to the document models not being sparse.
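The generative process just described can be sketched in code; the hyperparameter values, corpus sizes, and mean document length below are illustrative assumptions:

```python
import numpy as np

def generate_lda_corpus(D=5, K=3, V=10, alpha=0.5, beta=0.1, mean_len=20, seed=0):
    """Draw a synthetic corpus from the conventional-Dirichlet LDA generative process."""
    rng = np.random.default_rng(seed)
    # Topics: K discrete distributions over all V vocabulary words.
    phi = rng.dirichlet(np.full(V, beta), size=K)
    docs = []
    for _ in range(D):
        theta = rng.dirichlet(np.full(K, alpha))  # document-specific topic proportions
        N = rng.poisson(mean_len)                 # document length N_d ~ Poisson
        z = rng.choice(K, size=N, p=theta)        # topic allocation for each word
        w = [rng.choice(V, p=phi[k]) for k in z]  # word drawn from the selected topic
        docs.append(w)
    return docs, phi

docs, phi = generate_lda_corpus()
```

Note that every row of phi has positive mass on all V words and every theta has positive mass on all K topics, illustrating nonsparsity conditions (i) and (ii).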
The LDA with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution is next described. This approach can be used to alleviate the topic nonsparsity issue (i), the document model nonsparsity issue (ii), or both nonsparsity issues at the same time. Toward this end, disclosed herein is a family of sparse topic models able to (i) learn clean topics, that is, topics defined by discrete distributions over a subset of vocabulary words, and (ii) associate a focused set of topics to each document, that is, allocate nonzero proportions to only a subset of topics for a given document. Both features can be combined. All parameters are suitably estimated from the training corpus 10.
Prior to describing the collapsed Gibbs sampler algorithm, the IBP compound Dirichlet distribution and the degenerate Dirichlet distribution are introduced. Next, the latent IBP compound Dirichlet allocation (LIDA) model is described. This model is a hierarchical extension of LDA. All parameters are suitably estimated from the data.
The illustrative IBP compound Dirichlet prior is described as follows. Let
where the Dirichlet distribution is degenerate; it is defined over the simplex of dimension Σ_{k}
where Γ(·) is the gamma function. By convention it is assumed that θ_{kd}=0 if it does not belong to the support (that is, if
It is assumed that α is shared by all rows; its value is made relatively small such that the individual Dirichlet priors will be peaked. The IBP compound Dirichlet prior probability distribution is given by:
From Equation (5) it is seen that the IBP compound Dirichlet prior is a mixture of degenerate Dirichlet distributions over simplices of different dimensions. Used as a prior, the compound distribution can be viewed as an instantiation of the spike and slab, which from a Bayesian perspective is the ideal sparsity-inducing prior. For η→∞ the nondegenerate Dirichlet distribution is recovered.
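To make the spike-and-slab character concrete, a single draw from a (truncated) IBP compound Dirichlet prior can be sketched as follows. This is a minimal sketch under a finite-K stick-breaking approximation; the Beta(η/K, 1) parameterization, the guard against an empty support, and all names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_compound_dirichlet(K, alpha=0.1, eta=2.0):
    # Finite approximation to the IBP: activation probabilities pi_k ~ Beta(eta/K, 1).
    pi = rng.beta(eta / K, 1.0, size=K)
    b = rng.binomial(1, pi)              # binary activation variables (the "spike")
    if b.sum() == 0:                     # guard: keep at least one active component
        b[rng.integers(K)] = 1
    active = np.flatnonzero(b)
    theta = np.zeros(K)
    # Degenerate Dirichlet (the "slab"): a Dirichlet restricted to the
    # lower-dimensional simplex spanned by the active components.
    theta[active] = rng.dirichlet(np.full(active.size, alpha))
    return b, theta

b, theta = draw_compound_dirichlet(K=10)
```

The returned `theta` is exactly zero wherever the binary activation variable is zero, and the surviving entries form a discrete distribution, matching the mixture-of-degenerate-Dirichlets reading of Equation (5).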
The latent IBP compound Dirichlet allocation (LIDA) model is next described. Implementing LIDA entails factorization of V≈ΦΘ, where Φ ∈ R^{V×K} and Θ ∈ R^{K×D} with K<<D. A (truncated) IBP compound Dirichlet prior probability distribution is imposed on both matrices, yielding:
The remainder of the generative model is suitably defined as:
z_{id}|θ_{d}˜Discrete(θ_{d}), w_{id}|z_{id}, {φ_{k}}˜Discrete(φ_{z_{id}}) (8).
To see that a sparse probabilistic matrix is obtained, the topic indicator variables {z_{id}} are integrated out, leading to:
If we assume the columns of V, denoted by {v_{d}}, are iid, it follows that v_{d}|θ_{d}, Φ˜Multinomial(Φθ_{d}, N_{d}), where N_{d} is the number of words in document d. From this expression it is evident that the model corresponds to a sparse nonnegative matrix factorization, as many of the entries of {θ_{d}} and {φ_{k}} are zero and the remaining entries are normalized to form discrete probability distributions.
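The matrix-factorization reading can be checked numerically: when the columns of Φ and Θ are discrete distributions, each column of ΦΘ is again a distribution over the vocabulary, so the multinomial draw above is well defined. The following is an illustrative sketch; the dimensions, concentration values, and renormalization guard are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

V, K, D = 12, 4, 6
# Column-stochastic factors; small concentrations give peaked (near-sparse) columns.
Phi = rng.dirichlet(np.full(V, 0.05), size=K).T     # V x K, columns sum to 1
Theta = rng.dirichlet(np.full(K, 0.05), size=D).T   # K x D, columns sum to 1
P = Phi @ Theta                                     # V x D; each column a word distribution

N_d = 40
p = P[:, 0] / P[:, 0].sum()                         # renormalize against float drift
v_d = rng.multinomial(N_d, p)                       # word counts for one document
```

Every column of `P` sums to one, and `v_d` is a draw of N_{d} word counts from the first column, as in the multinomial expression above.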
The intuition behind the foregoing generative model is appealing. For every document, a pool of topics is selected from the topics appearing in other documents together with a small number (e.g., a couple) of new topics. When generating words, a topic is first selected from this pool, and then a word is drawn either from the set of vocabulary words previously associated with this topic or from a couple of new candidate words to be associated with this topic.
Integrating out π=(π_{1}, . . . , π_{K}), κ=(κ_{1}, . . . , κ_{V}), {θ_{d}}, and {φ_{k}} leads to the following marginals:
where n_{νkd} is the number of times word ν was assigned to topic k in document d. The notation • means that the sum is taken over the corresponding index. Note that n_{•kd}=0 and n_{νk•}=0 when respectively
With reference to
where ∝ indicates “proportional to”, B(•,•) is the beta function, and 𝟙{•} is the indicator function. The variables
The processing block 30 is tuned respective to the focus or sparsity of the focused document models using a tuning parameter η (that is, the lowercase Greek letter “eta”) 32. The processing block 30 is tuned respective to the cleanness or sparsity of the topics using a tuning parameter γ (that is, the lowercase Greek letter “gamma”) 34.
When eta 32 approaches infinity (i.e., η→∞), sampling of {
On the other hand, when gamma 34 approaches infinity (i.e., γ→∞), sampling of {
If both eta 32 and gamma 34 are finite, then Equation (11) applies and the resulting modeling implements both the clean topics and focused document model aspects. Finally, for completeness, if both γ→∞ and η→∞ then the collapsed Gibbs sampler for standard LDA is recovered.
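In that doubly-infinite limit (γ→∞ and η→∞), the update reduces to the familiar collapsed Gibbs sweep for standard LDA, in which each topic allocation is resampled after removing the counts of the current word. The following minimal sketch illustrates that limiting case only; the data structures and names are assumptions, and the sparse updates of Equations (11)-(13) are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(3)

def gibbs_sweep(docs, z, n_kv, n_dk, n_k, alpha, beta):
    """One collapsed Gibbs sweep for standard LDA (the gamma, eta -> infinity limit).

    docs: list of word-id arrays; z: matching arrays of topic allocations;
    n_kv: K x V topic-word counts; n_dk: D x K document-topic counts;
    n_k: length-K per-topic totals.
    """
    K, V = n_kv.shape
    for d, words in enumerate(docs):
        for i, v in enumerate(words):
            k_old = z[d][i]
            # Decrement counts: condition on all allocations except the current one.
            n_kv[k_old, v] -= 1; n_dk[d, k_old] -= 1; n_k[k_old] -= 1
            # p(z_id = k | rest) is proportional to
            # (n_dk + alpha) * (n_kv + beta) / (n_k + V * beta)
            p = (n_dk[d] + alpha) * (n_kv[:, v] + beta) / (n_k + V * beta)
            k_new = rng.choice(K, p=p / p.sum())
            z[d][i] = k_new
            n_kv[k_new, v] += 1; n_dk[d, k_new] += 1; n_k[k_new] += 1
    return z

# Tiny usage example: two documents over a 4-word vocabulary, K = 2 topics.
docs = [np.array([0, 1, 2, 1]), np.array([2, 3, 3])]
K, V, D = 2, 4, 2
z = [rng.integers(K, size=len(w)) for w in docs]
n_kv = np.zeros((K, V)); n_dk = np.zeros((D, K)); n_k = np.zeros(K)
for d, w in enumerate(docs):
    for i, v in enumerate(w):
        n_kv[z[d][i], v] += 1; n_dk[d, z[d][i]] += 1; n_k[z[d][i]] += 1
z = gibbs_sweep(docs, z, n_kv, n_dk, n_k, alpha=0.1, beta=0.01)
```

The count matrices remain consistent after each sweep (every word contributes exactly one unit to each of the three count structures), which is the invariant the sampler relies on.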
The processing block 30 performs the collapsed Gibbs sampling of Equations (9)-(11), or of Equations (9), (10), and (12) in the case of a clean-topics-only model, or of Equations (9), (10), and (13) in the case of a focused-topics-only model. Additionally, the processing block 30 updates the parameters. Type 2 maximum likelihood (ML) estimates of the parameters α and β can be computed by imposing Gamma priors on α and β and using Minka's fixed point iteration (see T. Minka, “Estimating a Dirichlet Distribution”, Technical Report, Microsoft Research, Cambridge, 2000). This leads to:
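The shape of a Minka-style fixed-point update can be illustrated for a single symmetric Dirichlet parameter. The sketch below omits the Gamma-prior terms of the disclosed updates and uses a stdlib digamma approximation; the function names, starting value, and iteration count are assumptions:

```python
import math
import numpy as np

def digamma(x):
    """Numeric digamma via the recurrence psi(x) = psi(x+1) - 1/x and an
    asymptotic series (a stand-in for scipy.special.digamma)."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def update_alpha(n_dk, alpha=0.5, iters=50):
    """Fixed-point iteration for a symmetric Dirichlet parameter alpha given
    document-topic counts n_dk (D x K), after Minka (2000), without priors."""
    D, K = n_dk.shape
    n_d = n_dk.sum(axis=1)
    for _ in range(iters):
        num = sum(digamma(n_dk[d, k] + alpha) - digamma(alpha)
                  for d in range(D) for k in range(K))
        den = K * sum(digamma(n_d[d] + K * alpha) - digamma(K * alpha)
                      for d in range(D))
        alpha *= num / den
    return alpha

alpha_hat = update_alpha(np.array([[5., 1.], [4., 2.], [6., 0.]]))
```

The multiplicative update keeps alpha positive at every step; imposing Gamma priors, as in the disclosed embodiments, adds prior terms to the numerator and denominator of the same ratio.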
Similarly, Gamma priors may be imposed on the parameters eta 32 and gamma 34 in the case of the truncated IBP, leading to the following updates:
With continuing reference to
The resulting model can be variously utilized.
With reference to
With reference to
Again, the applications of
The disclosed inference system of
Perplexity is a standard measure of quality, and is closely related to the heldout likelihood. The lower the perplexity, the better. The posterior expectations are approximated as follows:
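For concreteness, perplexity over held-out documents can be sketched as the exponential of the negative average per-word log-likelihood. This is an illustrative sketch; using point estimates of Φ and Θ directly is an assumption, standing in for the posterior-expectation approximation referenced above:

```python
import numpy as np

def perplexity(docs, Phi, Theta):
    """Perplexity of held-out documents: exp(-(total log-likelihood)/(total words)).

    docs: list of word-id arrays; Phi: V x K topic matrix (columns are word
    distributions); Theta: K x D document topic proportions. Lower is better.
    """
    log_lik, n_words = 0.0, 0
    for d, words in enumerate(docs):
        p_w = Phi @ Theta[:, d]              # mixture word distribution for document d
        log_lik += np.log(p_w[words]).sum()
        n_words += len(words)
    return np.exp(-log_lik / n_words)

# Sanity check: under a uniform model over V = 4 words, perplexity equals V.
Phi = np.full((4, 1), 0.25)
Theta = np.ones((1, 2))
pp = perplexity([np.array([0, 1]), np.array([2, 3])], Phi, Theta)
```

The uniform-model check reflects the usual interpretation of perplexity as an effective vocabulary size: a model no better than uniform scores V, and sparser, better-fitting models score lower.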
During the test,
With reference to
Topic models are dimensionality reduction techniques having a clustering effect. In general, the sparsity-enforcing priors disclosed herein improve the quality of the clustering by reducing the number of noisy topics. Disclosed herein are probabilistic topic modeling embodiments using IBP compound Dirichlet priors and having a closed-form formulation of the degenerate Dirichlet distribution. Collapsed Gibbs sampling algorithms based on these reformulations are also disclosed.
Embodiments of the generative model for sparse topic modeling disclosed herein can be described as follows. For each document, choose a subset of active topics. This is achieved by drawing the set of topics from a truncated IBP. Given this set of topics, draw the topic proportions for that document according to a degenerate Dirichlet distribution, that is, a Dirichlet distribution defined on the simplex of dimension K_{d}−1, where K_{d} is the number of active topics for document d. Each word in the document is then drawn from one of the selected topics, which are defined as discrete distributions over a subset of words previously associated with this topic. Again, the active words are drawn from a truncated IBP and the word probabilities from a degenerate Dirichlet distribution, that is, a Dirichlet distribution defined on the simplex of dimension V_{k}−1, where V_{k} is the number of active words for the selected topic k.
Equations (9)-(11) herein set forth a collapsed Gibbs sampling algorithm for learning sparse topic models assuming a generative process. Equations (12) and (13) set forth the special cases of a clean topic model and a focused topic model, respectively. As per Equation (11), it is also possible to combine these two sparsity aspects. The parameters α and β and eta 32 and gamma 34 can be estimated from the data using Equations (14)-(17). The latter two parameters 32, 34 enable tuning or turning off the document model sparsity and topic sparsity, respectively.
It will be appreciated that various of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. It will also be appreciated that various presently unforeseen or unanticipated alternatives, modifications, variations, or improvements therein may be subsequently made by those skilled in the art, which are also intended to be encompassed by the following claims.