COLLAPSED GIBBS SAMPLER FOR SPARSE TOPIC MODELS AND DISCRETE MATRIX FACTORIZATION

0 Associated Cases
0 Associated Defendants
0 Accused Products
49 Forward Citations
0 Petitions
1 Assignment
First Claim
1. A storage medium storing instructions executable by a processor to perform a method comprising:
 generating feature representations comprising distributions over a set of features corresponding to objects of a training corpus of objects; and
inferring a topic model defining a set of topics by performing latent Dirichlet allocation (LDA) with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution, the inferring being performed using a collapsed Gibbs sampling algorithm by iteratively sampling (1) topic allocation variables of the LDA and (2) binary activation variables of the IBP compound Dirichlet prior probability distribution.
Abstract
In an inference system for organizing a corpus of objects, feature representations are generated comprising distributions over a set of features corresponding to the objects. A topic model defining a set of topics is inferred by performing latent Dirichlet allocation (LDA) with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution. The inference is performed using a collapsed Gibbs sampling algorithm by iteratively sampling (1) topic allocation variables of the LDA and (2) binary activation variables of the IBP compound Dirichlet prior probability distribution. In some embodiments the inference is configured such that each inferred topic model is a clean topic model with topics defined as distributions over subsets of the set of features selected by the prior. In some embodiments the inference is configured such that the inferred topic model associates a focused subset of the set of topics to each object of the training corpus.
56 Citations
Parallel Processing Of Data Sets  
Patent #
US 20120117008A1
Filed 11/09/2010

Current Assignee
Microsoft Technology Licensing LLC

Sponsoring Entity
Microsoft Technology Licensing LLC

Automated Contextual Information Retrieval Based on MultiTiered User Modeling and Dynamic Retrieval Strategy  
Patent #
US 20120209871A1
Filed 02/10/2011

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

MULTITIERED APPROACH TO EMAIL PRIORITIZATION  
Patent #
US 20130212047A1
Filed 06/15/2012

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

APPARATUS FOR CLUSTERING A PLURALITY OF DOCUMENTS  
Patent #
US 20130212106A1
Filed 02/14/2013

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

System, method and apparatus for increasing speed of hierarchial latent dirichlet allocation model  
Patent #
US 8,527,448 B2
Filed 12/20/2012

Current Assignee
Huawei Technologies Co. Ltd.

Sponsoring Entity
Huawei Technologies Co. Ltd.

MULTITIERED APPROACH TO EMAIL PRIORITIZATION  
Patent #
US 20130339276A1
Filed 06/20/2012

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Knowledge discovery from citation networks  
Patent #
US 8,630,975 B1
Filed 12/02/2011

Current Assignee
The Research Foundation for The State University of New York

Sponsoring Entity
The Research Foundation for The State University of New York

KNOWLEDGE DISCOVERY FROM CITATION NETWORKS  
Patent #
US 20140188780A1
Filed 01/14/2014

Current Assignee
The Research Foundation for The State University of New York

Sponsoring Entity
The Research Foundation for The State University of New York

SYSTEM AND METHOD FOR COMPUTERIZED SEMANTIC PROCESSING OF ELECTRONIC DOCUMENTS INCLUDING THEMES  
Patent #
US 20140207782A1
Filed 01/22/2014

Current Assignee
Microsoft Israel Research And Development 2002 Ltd

Sponsoring Entity
Equivio Ltd.

MAILBOX SEARCH ENGINE USING QUERY MULTIMODAL EXPANSION AND COMMUNITYBASED SMOOTHING  
Patent #
US 20140280207A1
Filed 03/15/2013

Current Assignee
Xerox Corporation

Sponsoring Entity
Xerox Corporation

Parallel processing of data sets  
Patent #
US 8,868,470 B2
Filed 11/09/2010

Current Assignee
Microsoft Technology Licensing LLC

Sponsoring Entity
Microsoft Technology Licensing LLC

Knowledge discovery from citation networks  
Patent #
US 8,930,304 B2
Filed 01/14/2014

Current Assignee
The Research Foundation for The State University of New York

Sponsoring Entity
The Research Foundation for The State University of New York

SYSTEM AND METHOD FOR IDENTIFYING AND VISUALISING TOPICS AND THEMES IN COLLECTIONS OF DOCUMENTS  
Patent #
US 20150046151A1
Filed 03/22/2013

Current Assignee
BAE Systems Australia Limited

Sponsoring Entity
BAE Systems Australia Limited

METHOD OF AUTOMATED DISCOVERY OF TOPICS RELATEDNESS  
Patent #
US 20150154305A1
Filed 12/02/2014

Current Assignee
Qbase LLC

Sponsoring Entity
Qbase LLC

KNOWLEDGE DISCOVERY FROM CITATION NETWORKS  
Patent #
US 20150186789A1
Filed 12/29/2014

Current Assignee
The Research Foundation for The State University of New York

Sponsoring Entity
The Research Foundation for The State University of New York

Multitiered approach to Email prioritization  
Patent #
US 9,152,953 B2
Filed 06/15/2012

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

DYNAMIC CREATION OF DOMAIN SPECIFIC CORPORA  
Patent #
US 20150347467A1
Filed 05/27/2014

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Multitiered approach to Email prioritization  
Patent #
US 9,256,862 B2
Filed 06/20/2012

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Topic Model For Comments Analysis And Use Thereof  
Patent #
US 20160048768A1
Filed 08/15/2014

Current Assignee
HERE Global B.V.

Sponsoring Entity
HERE Global B.V.

Knowledge discovery from citation networks  
Patent #
US 9,269,051 B2
Filed 12/29/2014

Current Assignee
The Research Foundation for The State University of New York

Sponsoring Entity
The Research Foundation for The State University of New York

Mailbox search engine using query multimodal expansion and communitybased smoothing  
Patent #
US 9,280,587 B2
Filed 03/15/2013

Current Assignee
Xerox Corporation

Sponsoring Entity
Xerox Corporation

Labeling product identifiers and navigating products  
Patent #
US 9,323,838 B2
Filed 09/03/2013

Current Assignee
Alibaba Group Services Limited

Sponsoring Entity
Alibaba Group Services Limited

Apparatus for clustering a plurality of documents  
Patent #
US 9,342,591 B2
Filed 02/14/2013

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

JOINT APPROACH TO FEATURE AND DOCUMENT LABELING  
Patent #
US 20160203209A1
Filed 01/12/2015

Current Assignee
Xerox Corporation

Sponsoring Entity
Xerox Corporation

DATAPARALLEL PARAMETER ESTIMATION OF THE LATENT DIRICHLET ALLOCATION MODEL BY GREEDY GIBBS SAMPLING  
Patent #
US 20160210718A1
Filed 01/16/2015

Current Assignee
Oracle International Corporation

Sponsoring Entity
Oracle International Corporation

INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, AND NONTRANSITORY COMPUTER READABLE MEDIUM  
Patent #
US 20160259774A1
Filed 08/19/2015

Current Assignee
Fuji Xerox Company Limited

Sponsoring Entity
Fuji Xerox Company Limited

Pluggable storage system for distributed file systems  
Patent #
US 9,454,548 B1
Filed 03/15/2013

Current Assignee
EMC Corporation

Sponsoring Entity
EMC Corporation

Access and presentation of files based on semantic proximity to current interests  
Patent #
US 9,460,191 B1
Filed 04/14/2016

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Method of automated discovery of topics relatedness  
Patent #
US 9,542,477 B2
Filed 12/02/2014

Current Assignee
Qbase LLC

Sponsoring Entity
Qbase LLC

ACCESS AND PRESENTATION OF FILES BASED ON SEMANTIC PROXIMITY TO CURRENT INTERESTS  
Patent #
US 20170026489A1
Filed 07/22/2015

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Access and presentation of files based on semantic proximity to current interests  
Patent #
US 9,576,043 B2
Filed 09/01/2016

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Automated contextual information retrieval based on multitiered user modeling and dynamic retrieval strategy  
Patent #
US 9,633,140 B2
Filed 02/10/2011

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Pluggable storage system for parallel query engines  
Patent #
US 9,805,053 B1
Filed 03/15/2013

Current Assignee
Emc IP Holding Company LLC

Sponsoring Entity
Emc IP Holding Company LLC

Access and presentation of files based on semantic proximity to current interests  
Patent #
US 9,824,097 B2
Filed 12/05/2016

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Knowledge discovery from citation networks  
Patent #
US 9,892,367 B2
Filed 02/22/2016

Current Assignee
The Research Foundation for The State University of New York

Sponsoring Entity
The Research Foundation for The State University of New York

Tiering with pluggable storage system for parallel query engines  
Patent #
US 9,898,475 B1
Filed 03/15/2013

Current Assignee
Emc IP Holding Company LLC

Sponsoring Entity
Emc IP Holding Company LLC

Information analysis system, information analysis method, and information analysis program  
Patent #
US 9,940,319 B2
Filed 05/25/2015

Current Assignee
Nippon Telegraph and Telephone Corporation

Sponsoring Entity
Nippon Telegraph and Telephone Corporation

Pluggable storage system for parallel query engines across nonnative file systems  
Patent #
US 9,984,083 B1
Filed 03/29/2013

Current Assignee
Emc IP Holding Company LLC

Sponsoring Entity
Emc IP Holding Company LLC

System and method for computerized identification and effective presentation of semantic themes occurring in a set of electronic documents  
Patent #
US 10,002,182 B2
Filed 01/22/2014

Current Assignee
Microsoft Israel Research And Development 2002 Ltd

Sponsoring Entity
Microsoft Israel Research And Development 2002 Ltd

Access and presentation of files based on semantic proximity to current interests  
Patent #
US 10,025,799 B2
Filed 07/22/2015

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

Joint approach to feature and document labeling  
Patent #
US 10,055,479 B2
Filed 01/12/2015

Current Assignee
Xerox Corporation

Sponsoring Entity
Xerox Corporation

Method and system for distributed latent dirichlet allocation computation using addition of approximate counters  
Patent #
US 10,140,281 B2
Filed 08/07/2015

Current Assignee
Oracle International Corporation

Sponsoring Entity
Oracle International Corporation

Method and system for latent dirichlet allocation computation using approximate counters  
Patent #
US 10,147,044 B2
Filed 08/06/2015

Current Assignee
Oracle International Corporation

Sponsoring Entity
Oracle International Corporation

Parallel Gibbs sampler using butterflypatterned partial sums  
Patent #
US 10,157,346 B2
Filed 05/15/2015

Current Assignee
Oracle International Corporation

Sponsoring Entity
Oracle International Corporation

KNOWLEDGE DISCOVERY FROM CITATION NETWORKS  
Patent #
US 20160171391A1
Filed 02/22/2016

Current Assignee
The Research Foundation for The State University of New York

Sponsoring Entity
The Research Foundation for The State University of New York

Learning topics by simulation of a stochastic cellular automaton  
Patent #
US 10,394,872 B2
Filed 11/04/2015

Current Assignee
Oracle International Corporation

Sponsoring Entity
Oracle International Corporation

Apparatus and method for extracting semantic topic  
Patent #
US 10,423,723 B2
Filed 06/03/2015

Current Assignee
Korea University Research and Business Foundation

Sponsoring Entity
Korea University Research and Business Foundation

Pluggable storage system for distributed file systems  
Patent #
US 10,459,917 B2
Filed 05/09/2016

Current Assignee
Emc IP Holding Company LLC

Sponsoring Entity
Emc IP Holding Company LLC

Dataparallel probabilistic inference  
Patent #
US 10,496,929 B2
Filed 06/26/2014

Current Assignee
Oracle International Corporation

Sponsoring Entity
Oracle International Corporation

MATERIAL RECOGNITION FROM AN IMAGE  
Patent #
US 20110243450A1
Filed 04/01/2010

Current Assignee
Microsoft Technology Licensing LLC

Sponsoring Entity
Microsoft Corporation

Mining Multilingual Topics  
Patent #
US 20110258229A1
Filed 04/15/2010

Current Assignee
Microsoft Technology Licensing LLC

Sponsoring Entity
Microsoft Technology Licensing LLC

METHOD AND SYSTEM TO PREDICT THE LIKELIHOOD OF TOPICS  
Patent #
US 20100280985A1
Filed 01/13/2009

Current Assignee
United States Navy

Sponsoring Entity
United States Navy

CLUSTERING USING NONNEGATIVE MATRIX FACTORIZATION ON SPARSE GRAPHS  
Patent #
US 20090271433A1
Filed 04/25/2008

Current Assignee
Xerox Corporation

Sponsoring Entity
Xerox Corporation

Interactive cleaning for automatic document clustering and categorization  
Patent #
US 20080249999A1
Filed 04/06/2007

Current Assignee
Xerox Corporation

Sponsoring Entity
Xerox Corporation

METHODS AND SYSTEMS FOR UTILIZING CONTENT, DYNAMIC PATTERNS, AND/OR RELATIONAL INFORMATION FOR DATA ANALYSIS  
Patent #
US 20070118498A1
Filed 11/22/2006

Current Assignee
NEC Corporation

Sponsoring Entity
NEC Corporation

Apparatus and method for retrieving and grouping images representing text files based on the relevance of key words extracted from a selected file to the text files  
Patent #
US 5,598,557 A
Filed 09/22/1992

Current Assignee
Nuance Communications Inc.

Sponsoring Entity
Caere Corp. Mexico

22 Claims
 1. A storage medium storing instructions executable by a processor to perform a method comprising:
generating feature representations comprising distributions over a set of features corresponding to objects of a training corpus of objects; and inferring a topic model defining a set of topics by performing latent Dirichlet allocation (LDA) with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution, the inferring being performed using a collapsed Gibbs sampling algorithm by iteratively sampling (1) topic allocation variables of the LDA and (2) binary activation variables of the IBP compound Dirichlet prior probability distribution.  View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
 14. A method comprising:
generating feature representations comprising distributions over a set of features corresponding to objects of a training corpus of objects; and inferring a generative topic model defining a set of topics by performing a latent generative topic model allocation using a collapsed Gibbs sampling algorithm with an Indian Buffet Process (IBP) compound prior probability distribution; wherein the inferring includes iterative sampling of (1) topic allocation variables of the generative topic model allocation and (2) binary activation variables of the IBP compound prior probability distribution; and wherein the generating and inferring are performed by a digital processor.  View Dependent Claims (15, 16, 17, 18, 19, 20, 21, 22)
Specification
The following relates to the document or object processing arts, clustering arts, classification arts, retrieval arts, and so forth.
In processing documents (or, more generally, objects, a term intended to encompass text documents, images, audio/video content, and so forth), it is useful to generate a statistical topic model defining a set of topics. For text documents represented using "bag-of-words" representations, a topic of the topic model is suitably represented as a statistical distribution over words that are typical of the topic. A new document can be associated with topics with varying association strengths based on similarities between the topics and the distribution of words in the document. As another application, given an input document selected from a corpus of documents already modeled using the topic model, similar documents can be rapidly identified by comparing topic association strengths.
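The bag-of-words idea above can be sketched as follows; the tiny vocabulary, example sentence, and normalization scheme are illustrative assumptions, not part of the disclosure:

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Represent a document as a normalized distribution over a fixed vocabulary."""
    counts = Counter(w for w in text.lower().split() if w in vocabulary)
    total = sum(counts.values())
    return {w: counts[w] / total for w in counts}

vocab = {"topic", "model", "word", "image", "patch"}  # hypothetical vocabulary
doc = "A topic model relates each word to a topic"
rep = bag_of_words(doc, vocab)
print(rep)  # {'topic': 0.5, 'model': 0.25, 'word': 0.25}
```

Out-of-vocabulary words ("relates", "each") are simply dropped, so the representation is a discrete distribution over vocabulary words only.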
The word distributions of a text document can be considered features of the text document, and the topics of the topic model are statistical distributions of these features that are typical of the topics. For other types of objects, features of the objects are derived and topics of the topic model are generated as statistical distributions of features that are typical of the topic. As an example, an image can be characterized by visual features extracted from spatial regions, or “patches”, of the image.
Various approaches can be employed for generating the topic model. Nonnegative matrix factorization techniques such as Latent Dirichlet Allocation (LDA) or probabilistic latent semantic analysis (PLSA) are known approaches, and have been used in applications such as text clustering, dimensionality reduction of large sparse arrays, or so forth. Underlying the LDA and PLSA models is the observation that a large matrix containing positive values can be approximated by a sum of rank-one positive matrices. Compared to more classical matrix factorization techniques such as Singular Value Decomposition, which are rotationally invariant, the low-rank matrices obtained by nonnegative decompositions are often nearly sparse, i.e., they contain few large positive values and many small values close to zero. The large values correspond in general to clusters of rows and columns of the original matrices, and are identifiable with topics. These topic models can be formalized as generative models for large sparse positive matrices, e.g. large sets of documents. Topic models are typically used to organize these documents according to themes (that is, topics) in an unsupervised way, that is, without reliance upon document topic annotations or other a priori information about the topics. In this framework, topics are defined as discrete distributions over vocabulary words (for text documents; more generally, distributions over features of objects) and topics are associated to each document according to a relative weighting (proportions).
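As an illustrative sketch of the sum-of-rank-one observation (the values below are arbitrary, not from the disclosure), a nonnegative matrix with cluster structure can be written as a sum of rank-one nonnegative matrices, each linking a cluster of rows (words) to a cluster of columns (documents):

```python
import numpy as np

# Two rank-one nonnegative "topics": rows 0-1 co-occur with columns 0-1,
# rows 2-3 with columns 1-2. Their sum shows the block/cluster structure.
u1, v1 = np.array([1.0, 1.0, 0.0, 0.0]), np.array([1.0, 1.0, 0.0])
u2, v2 = np.array([0.0, 0.0, 1.0, 1.0]), np.array([0.0, 1.0, 1.0])
M = np.outer(u1, v1) + np.outer(u2, v2)
print(M)
# [[1. 1. 0.]
#  [1. 1. 0.]
#  [0. 1. 1.]
#  [0. 1. 1.]]
```

M has rank two even though it is 4 x 3; each rank-one term is identifiable with one topic.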
The following sets forth improved methods and apparatuses.
In some illustrative embodiments disclosed as illustrative examples herein, a storage medium is disclosed storing instructions executable by a processor to perform a method comprising: generating feature representations comprising distributions over a set of features corresponding to objects of a training corpus of objects; and inferring a topic model defining a set of topics by performing latent Dirichlet allocation (LDA) with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution (that is, LIDA), the inferring being performed using a collapsed Gibbs sampling algorithm by iteratively sampling (1) topic allocation variables of the LDA and (2) binary activation variables of the IBP compound Dirichlet prior probability distribution.
In some embodiments as set forth in the immediately preceding paragraph, the LIDA is performed with an IBP compound Dirichlet prior configured such that the inferred topic model is a clean topic model in which each topic is defined as a distribution over a subset of the set of features selected by the IBP compound Dirichlet prior probability distribution. In some embodiments as set forth in the immediately preceding paragraph, the LDA is performed with an IBP compound Dirichlet prior configured such that the inferred topic model associates a focused subset of the set of topics to each object of the training corpus.
In some illustrative embodiments disclosed as illustrative examples herein, a method comprises: generating feature representations comprising distributions over a set of features corresponding to objects of a training corpus of objects; and inferring a generative topic model defining a set of topics by performing a latent generative topic model allocation using a collapsed Gibbs sampling algorithm with an Indian Buffet Process (IBP) compound prior probability distribution; wherein the inferring includes iterative sampling of (1) topic allocation variables of the generative topic model allocation and (2) binary activation variables of the IBP compound prior probability distribution; and wherein the generating and inferring are performed by a digital processor.
In some illustrative embodiments disclosed as illustrative examples herein, a processor is disclosed, which is configured to perform a method as set forth in the immediately preceding paragraph.
It is recognized herein that the degree of sparsity of the decomposition is a parameter of interest in nonnegative matrix factorization applications. For example, in the context of document clustering, clusters correspond to themes or topics, which in turn are implicitly defined by discrete distributions over a set of vocabulary words used in bag-of-words document representations (or, more generally, over a set of features used in feature representations of objects). Hence, topics can be interpreted by considering the nonzero entries, that is, the set of words to which positive weight is assigned.
However, it is recognized herein that topics obtained by latent generative topic model allocation techniques such as PLSA or LDA are, in many practical applications, not sufficiently sparse. Stated another way, the topics are not "clean" because they contain small (but nonzero) values for infrequent words (or, more generally, features). The overall collection of topics is also sometimes not focused, meaning that all topics are assigned positive weight in each document; infrequently occurring topics then contribute less to the organization of documents based on the set of topics, yet the occurrence of an infrequently occurring topic in a specific document may be highly informative for that specific document.
Disclosed herein are approaches that overcome these difficulties by a latent generative topic model allocation that is tunable to yield clean topics, each comprising a distribution over a subset of the set of vocabulary words (or, more generally, over a subset of the set of features). The latent generative topic model allocation is also independently (and optionally concurrently) tunable such that the inferred topic model produces document models that associate a focused subset of the set of topics to each document (or, more generally, object) of a training corpus. The approaches employ latent generative topic model allocation using a collapsed Gibbs sampling algorithm with an Indian Buffet Process (IBP) compound prior probability distribution. The prior has a tunable parameter for tuning the "cleanness" of the clean topics by suitably adjusting the sparsity of the prior, and hence the sparsity of the generated clean topics. The prior also has an independent tunable parameter for tuning how focused is the focused subset of the set of topics (i.e., the document model) associated to each document (or object), again by suitably adjusting the sparsity of the prior and hence the sparsity of the focused topic allocations.
The foregoing is disclosed herein with reference to an illustrative embodiment in which the objects undergoing processing comprise documents including text, the feature representations comprise bag-of-words representations, and the set of features comprises a set of vocabulary words. More generally, however, the objects may be any type of object (e.g., text documents, images, multimodal documents, audio-video content units, or so forth) in which each object is characterized by values for a set of features (i.e., a feature representation).
In the illustrative embodiment latent Dirichlet allocation (LDA) is used as the latent generative topic model, and the LDA is implemented using a collapsed (also known as Rao-Blackwellised) Gibbs sampling algorithm with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution. More generally, however, other latent generative topic model allocation approaches, such as probabilistic latent semantic analysis (PLSA) allocation, can be employed in conjunction with a suitable IBP compound prior probability distribution. Implementing the latent generative topic model allocation using a Gibbs sampling algorithm with a compound prior probability distribution is convenient; however, other implementations of the latent generative topic model allocation are also contemplated. The latent generative topic model allocation infers a generative topic model defining a set of topics, each topic comprising a distribution over the set of vocabulary words (or, more generally, over the set of features used in the feature representations of the objects). In embodiments in which the cleanness tuning parameter is used, the topics of the inferred topic model are clean topics, each comprising a distribution over a subset of the set of vocabulary words. In embodiments in which the focusedness tuning parameter is used, the inferred topic model is focused such that documents are represented by subsets of the set of topics.
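For concreteness, a minimal collapsed Gibbs sampler for conventional LDA (with an ordinary symmetric Dirichlet prior, not the IBP compound prior of the disclosed LIDA) can be sketched as follows; the hyperparameter values and the toy corpus are illustrative assumptions:

```python
import numpy as np

def collapsed_gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampler for conventional LDA.

    docs: list of token-id lists; V: vocabulary size; K: number of topics.
    The topic distributions and topic proportions are integrated out
    (collapsed); only the topic allocation variables z_id are sampled.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_dk = np.zeros((D, K))   # topic counts per document
    n_kv = np.zeros((K, V))   # word counts per topic
    n_k = np.zeros(K)         # total word counts per topic
    z = [rng.integers(K, size=len(d)) for d in docs]  # random initialization
    for d, doc in enumerate(docs):
        for i, v in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                k = z[d][i]
                n_dk[d, k] -= 1; n_kv[k, v] -= 1; n_k[k] -= 1
                # full conditional: p(z=k | rest) proportional to
                # (n_dk + alpha) * (n_kv + beta) / (n_k + V*beta)
                p = (n_dk[d] + alpha) * (n_kv[:, v] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1
    phi = (n_kv + beta) / (n_kv.sum(1, keepdims=True) + V * beta)
    theta = (n_dk + alpha) / (n_dk.sum(1, keepdims=True) + K * alpha)
    return phi, theta

docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]  # hypothetical token-id corpus
phi, theta = collapsed_gibbs_lda(docs, V=4, K=2)
```

The disclosed LIDA sampler additionally iterates over the binary activation variables of the IBP compound prior; the loop over topic allocation variables above is the part the two samplers share.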
With reference to
A modeling module 20 processes the training corpus of BOW representations 16 to generate a topic model. The illustrative modeling module 20 employs latent IBP compound Dirichlet allocation (LIDA) as disclosed herein. Unlike conventional LDA, which is performed with a conventional Dirichlet prior probability distribution, the LDA of the modeling module 20 is performed with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution. (Note that the term "prior probability distribution" is sometimes shortened to "prior" herein, in accordance with conventional notation used in the art.) The technique disclosed herein of performing LDA with an IBP compound Dirichlet prior is sometimes referred to herein as latent IBP Dirichlet allocation (LIDA). In the illustrative example of
The modeling module 20 outputs a topic model 22 (which is optionally clean, with each topic comprising a distribution over a subset of the set of vocabulary words 14). The vocabulary word subsets of the clean topics of the clean topic model are selected by the IBP compound Dirichlet prior probability distribution, with the degree of "cleanness" or sparsity of the clean topics being controlled by a tuning parameter of the LIDA. The topic model 22 defines a set of topics, in that each topic is semantically represented by its distribution over the set of vocabulary words 14 (or over a subset of the vocabulary words in the case of clean topics).
The modeling module 20 also outputs document models 24 for the documents of the training corpus 10. The document models 24 represent the strength of association of the document with each topic defined by the topic model 22. In some embodiments the document models 24 are focused document models in which the inferred topic model 22 associates a focused subset of the set of topics to each document of the training corpus 10. In other words, in the case of a focused document model for a document there are some topics of the set of topics having zero association with the document. As with the cleanness of the topics of the clean topic model, the focusing of the focused document models 24 is selected by the IBP compound Dirichlet prior probability distribution, with the degree of focus or sparsity of the document models being controlled by a tuning parameter of the LIDA.
It is to be understood that the tuning parameter for tuning the cleanness or sparsity of the clean topics is different from and independent of the tuning parameter for tuning the focus or sparsity of the focused document models. The tuning parameter for tuning the cleanness or sparsity of the clean topics is denoted herein as γ (the lowercase Greek letter "gamma"). The tuning parameter for tuning the focus or sparsity of the focused document models is denoted herein as η (the lowercase Greek letter "eta").
With continuing reference to
It will also be appreciated that the disclosed inference methods for generating the topic model 22 for organizing a training corpus 10 of documents can be embodied as a storage medium storing instructions executable by the computer C or another digital processor or digital processing device to perform the disclosed inference methods. The storage medium may, by way of illustrative example, comprise a hard disk or other magnetic storage medium, an optical disk or other optical storage medium, random access memory (RAM), readonly memory (ROM), flash memory, or another electronic storage medium, various combinations thereof, or so forth.
In the illustrative example of
In the following, an illustrative embodiment of the modeling module 20 is described, which operates on objects comprising documents including text, and employs LDA implemented using a collapsed Gibbs sampler with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution.
Consider a corpus of D documents. LDA models these documents as a mixture of K discrete distributions over vocabulary words (called tokens); the distributions are called topics. The model employs a bag-of-words feature representation.
Formally, let w_{id} ∈ {1, . . . , V} denote the token corresponding to the i^{th} word observed in document d and let z_{id} ∈ {1, . . . , K} be the topic associated to this word. The generative model of LDA is defined as follows:

z_{id}|θ_{d}˜Discrete(θ_{d}),

w_{id}|z_{id}, {φ_{k}}˜Discrete(φ_{z_{id}}),
where d={1, . . . , D}, k={1, . . . , K} and i={1, . . . , N_{d}}. It is assumed for this illustrative example that the number of words in document d, denoted N_{d}, is generated from a Poisson distribution.
With brief reference to
The generative process for LDA using a conventional Dirichlet prior can be described as follows. For each document, draw a set of topic proportions according to a (symmetric) Dirichlet distribution. For each word in the document: select a topic by drawing it from a discrete distribution whose parameter equals the document-specific topic proportions; and select a word from the vocabulary by drawing from the discrete distribution defining the selected topic, whose parameters are also drawn from a (symmetric) Dirichlet distribution. The following will be noted for LDA using a conventional Dirichlet prior: (i) topics are defined by distributions over all vocabulary words; and (ii) nonzero proportions are associated to all topics in each document. Condition (i) corresponds to the topics not being sparse, while condition (ii) corresponds to the document models not being sparse.
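The generative process just described can be sketched in code; the hyperparameter values, corpus sizes, and mean document length below are illustrative assumptions:

```python
import numpy as np

def generate_lda_corpus(D=5, K=3, V=10, alpha=0.5, beta=0.1, mean_len=20, seed=0):
    """Draw a synthetic corpus from the conventional-Dirichlet LDA generative process."""
    rng = np.random.default_rng(seed)
    # Topics: K discrete distributions over all V vocabulary words.
    phi = rng.dirichlet(np.full(V, beta), size=K)
    docs = []
    for _ in range(D):
        theta = rng.dirichlet(np.full(K, alpha))  # document-specific topic proportions
        N = rng.poisson(mean_len)                 # document length N_d ~ Poisson
        z = rng.choice(K, size=N, p=theta)        # topic allocation for each word
        w = [rng.choice(V, p=phi[k]) for k in z]  # word drawn from the selected topic
        docs.append(w)
    return docs, phi

docs, phi = generate_lda_corpus()
```

Note that every row of phi has positive mass on all V words and every theta has positive mass on all K topics, illustrating nonsparsity conditions (i) and (ii).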
The LDA with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution is next described. This approach can be used to alleviate the topic nonsparsity issue (i), the document model nonsparsity issue (ii), or both nonsparsity issues at the same time. Toward this end, disclosed herein is a family of sparse topic models able to (i) learn clean topics, that is, topics defined by discrete distributions over a subset of vocabulary words, and (ii) associate a focused set of topics to each document, that is, allocate nonzero proportions to only a subset of topics for a given document. Both features can be combined. All parameters are suitably estimated from the training corpus 10.
Prior to describing the collapsed Gibbs sampler algorithm, the IBP compound Dirichlet distribution and the degenerate Dirichlet distribution are introduced. Next, the latent IBP compound Dirichlet allocation (LIDA) model is described. This model is a hierarchical extension of LDA. All parameters are suitably estimated from the data.
The illustrative IBP compound Dirichlet prior is described as follows. Let
where the Dirichlet distribution is degenerate; it is defined over the simplex of dimension Σ_{k}
where Γ(·) is the gamma function. By convention it is assumed that θ_{kd}=0 if it does not belong to the support (that is, if
It is assumed that α is shared by all rows; its value is made relatively small such that the individual Dirichlet priors will be peaked. The IBP compound Dirichlet prior probability distribution is given by:
From Equation (5) it is seen that the IBP compound Dirichlet prior is a mixture of degenerate Dirichlet distributions over simplices of different dimensions. Used as a prior, the compound distribution can be viewed as an instantiation of the spike and slab, which from a Bayesian perspective is the ideal sparsity-inducing prior. For η→∞ the nondegenerate Dirichlet distribution is recovered.
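To make the spike-and-slab character concrete, a single draw from a (truncated) IBP compound Dirichlet prior can be sketched as follows. This is a minimal sketch under a finite-K stick-breaking approximation; the Beta(η/K, 1) parameterization, the guard against an empty support, and all names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_compound_dirichlet(K, alpha=0.1, eta=2.0):
    # Finite approximation to the IBP: activation probabilities pi_k ~ Beta(eta/K, 1).
    pi = rng.beta(eta / K, 1.0, size=K)
    b = rng.binomial(1, pi)              # binary activation variables (the "spike")
    if b.sum() == 0:                     # guard: keep at least one active component
        b[rng.integers(K)] = 1
    active = np.flatnonzero(b)
    theta = np.zeros(K)
    # Degenerate Dirichlet (the "slab"): a Dirichlet restricted to the
    # lower-dimensional simplex spanned by the active components.
    theta[active] = rng.dirichlet(np.full(active.size, alpha))
    return b, theta

b, theta = draw_compound_dirichlet(K=10)
```

The returned `theta` is exactly zero wherever the binary activation variable is zero, and the surviving entries form a discrete distribution, matching the mixture-of-degenerate-Dirichlets reading of Equation (5).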
The latent IBP compound Dirichlet allocation (LIDA) model is next described. Implementing LIDA entails factorization of V≈ΦΘ, where Φ ∈ R^{V×K} and Θ ∈ R^{K×D} with K<<D. A (truncated) IBP compound Dirichlet prior probability distribution is imposed on both matrices, yielding:
The remainder of the generative model is suitably defined as:
z_{id}|θ_{d}˜Discrete(θ_{d}), w_{id}|z_{id}, {φ_{k}}˜Discrete(φ_{z_{id}}) (8).
To see that a sparse probabilistic matrix is obtained, the topic indicator variables {z_{id}} are integrated out, leading to:
If we assume the columns of V, denoted by {v_{d}}, are iid, it follows that v_{d}|θ_{d}, Φ˜Multinomial(Φθ_{d}, N_{d}), where N_{d} is the number of words in document d. From this expression it is evident that the model corresponds to a sparse nonnegative matrix factorization, as many of the entries of {θ_{d}} and {φ_{k}} are zero and the remaining entries are normalized to form discrete probability distributions.
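The matrix-factorization reading can be checked numerically: when the columns of Φ and Θ are discrete distributions, each column of ΦΘ is again a distribution over the vocabulary, so the multinomial draw above is well defined. The following is an illustrative sketch; the dimensions, concentration values, and renormalization guard are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

V, K, D = 12, 4, 6
# Column-stochastic factors; small concentrations give peaked (near-sparse) columns.
Phi = rng.dirichlet(np.full(V, 0.05), size=K).T     # V x K, columns sum to 1
Theta = rng.dirichlet(np.full(K, 0.05), size=D).T   # K x D, columns sum to 1
P = Phi @ Theta                                     # V x D; each column a word distribution

N_d = 40
p = P[:, 0] / P[:, 0].sum()                         # renormalize against float drift
v_d = rng.multinomial(N_d, p)                       # word counts for one document
```

Every column of `P` sums to one, and `v_d` is a draw of N_{d} word counts from the first column, as in the multinomial expression above.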
The intuition behind the foregoing generative model is appealing. For every document, a pool of topics is selected from the topics appearing in other documents together with a small number (e.g., a couple) of new topics. When generating words, a topic is first selected from this pool, and then a word is drawn either from the set of vocabulary words previously associated with this topic or from a couple of new candidate words to be associated with this topic.
Integrating out π=(π_{1}, . . . , π_{K}), κ=(κ_{1}, . . . , κ_{V}), {θ_{d}}, and {φ_{k}} leads to the following marginals:
where n_{νkd} is the number of times word ν was assigned to topic k in document d. The notation • means that the sum is taken over the corresponding index. Note that n_{•kd}=0 and n_{νk•}=0 when respectively
With reference to
where ∝ indicates “proportional to”, B(•,•) is the beta function, and 𝟙{•} is the indicator function. The variables
The processing block 30 is tuned respective to the focus or sparsity of the focused document models using a tuning parameter η (that is, the lowercase Greek letter “eta”) 32. The processing block 30 is tuned respective to the cleanness or sparsity of the topics using a tuning parameter γ (that is, the lowercase Greek letter “gamma”) 34.
When eta 32 approaches infinity (i.e., η→∞), sampling of {
On the other hand, when gamma 34 approaches infinity (i.e., γ→∞), sampling of {
If both eta 32 and gamma 34 are finite, then Equation (11) applies and the resulting modeling implements both the clean topics and focused document model aspects. Finally, for completeness, if both γ→∞ and η→∞ then the collapsed Gibbs sampler for standard LDA is recovered.
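In that doubly-infinite limit (γ→∞ and η→∞), the update reduces to the familiar collapsed Gibbs sweep for standard LDA, in which each topic allocation is resampled after removing the counts of the current word. The following minimal sketch illustrates that limiting case only; the data structures and names are assumptions, and the sparse updates of Equations (11)-(13) are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(3)

def gibbs_sweep(docs, z, n_kv, n_dk, n_k, alpha, beta):
    """One collapsed Gibbs sweep for standard LDA (the gamma, eta -> infinity limit).

    docs: list of word-id arrays; z: matching arrays of topic allocations;
    n_kv: K x V topic-word counts; n_dk: D x K document-topic counts;
    n_k: length-K per-topic totals.
    """
    K, V = n_kv.shape
    for d, words in enumerate(docs):
        for i, v in enumerate(words):
            k_old = z[d][i]
            # Decrement counts: condition on all allocations except the current one.
            n_kv[k_old, v] -= 1; n_dk[d, k_old] -= 1; n_k[k_old] -= 1
            # p(z_id = k | rest) is proportional to
            # (n_dk + alpha) * (n_kv + beta) / (n_k + V * beta)
            p = (n_dk[d] + alpha) * (n_kv[:, v] + beta) / (n_k + V * beta)
            k_new = rng.choice(K, p=p / p.sum())
            z[d][i] = k_new
            n_kv[k_new, v] += 1; n_dk[d, k_new] += 1; n_k[k_new] += 1
    return z

# Tiny usage example: two documents over a 4-word vocabulary, K = 2 topics.
docs = [np.array([0, 1, 2, 1]), np.array([2, 3, 3])]
K, V, D = 2, 4, 2
z = [rng.integers(K, size=len(w)) for w in docs]
n_kv = np.zeros((K, V)); n_dk = np.zeros((D, K)); n_k = np.zeros(K)
for d, w in enumerate(docs):
    for i, v in enumerate(w):
        n_kv[z[d][i], v] += 1; n_dk[d, z[d][i]] += 1; n_k[z[d][i]] += 1
z = gibbs_sweep(docs, z, n_kv, n_dk, n_k, alpha=0.1, beta=0.01)
```

The count matrices remain consistent after each sweep (every word contributes exactly one unit to each of the three count structures), which is the invariant the sampler relies on.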
The processing block 30 performs the collapsed Gibbs sampling of Equations (9)-(11), or of Equations (9), (10), and (12) in the case of a clean-topics-only model, or of Equations (9), (10), and (13) in the case of a focused-topics-only model. Additionally, the processing block 30 updates the parameters. Type 2 maximum likelihood (ML) estimates of the parameters α and β can be computed by imposing Gamma priors on α and β and using Minka's fixed point iteration (see T. Minka, “Estimating a Dirichlet Distribution”, Technical Report, Microsoft Research, Cambridge, 2000). This leads to:
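The shape of a Minka-style fixed-point update can be illustrated for a single symmetric Dirichlet parameter. The sketch below omits the Gamma-prior terms of the disclosed updates and uses a stdlib digamma approximation; the function names, starting value, and iteration count are assumptions:

```python
import math
import numpy as np

def digamma(x):
    """Numeric digamma via the recurrence psi(x) = psi(x+1) - 1/x and an
    asymptotic series (a stand-in for scipy.special.digamma)."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def update_alpha(n_dk, alpha=0.5, iters=50):
    """Fixed-point iteration for a symmetric Dirichlet parameter alpha given
    document-topic counts n_dk (D x K), after Minka (2000), without priors."""
    D, K = n_dk.shape
    n_d = n_dk.sum(axis=1)
    for _ in range(iters):
        num = sum(digamma(n_dk[d, k] + alpha) - digamma(alpha)
                  for d in range(D) for k in range(K))
        den = K * sum(digamma(n_d[d] + K * alpha) - digamma(K * alpha)
                      for d in range(D))
        alpha *= num / den
    return alpha

alpha_hat = update_alpha(np.array([[5., 1.], [4., 2.], [6., 0.]]))
```

The multiplicative update keeps alpha positive at every step; imposing Gamma priors, as in the disclosed embodiments, adds prior terms to the numerator and denominator of the same ratio.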
Similarly, Gamma priors may be imposed on the parameters eta 32 and gamma 34 in the case of the truncated IBP, leading to the following updates:
With continuing reference to
The resulting model can be variously utilized.
With reference to
With reference to
Again, the applications of
The disclosed inference system of
Perplexity is a standard measure of quality, and is closely related to the heldout likelihood. The lower the perplexity, the better. The posterior expectations are approximated as follows:
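For concreteness, perplexity over held-out documents can be sketched as the exponential of the negative average per-word log-likelihood. This is an illustrative sketch; using point estimates of Φ and Θ directly is an assumption, standing in for the posterior-expectation approximation referenced above:

```python
import numpy as np

def perplexity(docs, Phi, Theta):
    """Perplexity of held-out documents: exp(-(total log-likelihood)/(total words)).

    docs: list of word-id arrays; Phi: V x K topic matrix (columns are word
    distributions); Theta: K x D document topic proportions. Lower is better.
    """
    log_lik, n_words = 0.0, 0
    for d, words in enumerate(docs):
        p_w = Phi @ Theta[:, d]              # mixture word distribution for document d
        log_lik += np.log(p_w[words]).sum()
        n_words += len(words)
    return np.exp(-log_lik / n_words)

# Sanity check: under a uniform model over V = 4 words, perplexity equals V.
Phi = np.full((4, 1), 0.25)
Theta = np.ones((1, 2))
pp = perplexity([np.array([0, 1]), np.array([2, 3])], Phi, Theta)
```

The uniform-model check reflects the usual interpretation of perplexity as an effective vocabulary size: a model no better than uniform scores V, and sparser, better-fitting models score lower.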
During the test,
With reference to
Topic models are dimensionality reduction techniques having a clustering effect. In general, the sparsity-enforcing priors disclosed herein improve the quality of the clustering by reducing the number of noisy topics. Disclosed herein are probabilistic topic modeling embodiments using IBP compound Dirichlet priors and having a closed-form formulation of the degenerate Dirichlet distribution. Collapsed Gibbs sampling algorithms based on these reformulations are also disclosed.
Embodiments of the generative model for sparse topic modeling disclosed herein can be described as follows. For each document, choose a subset of active topics. This is achieved by drawing the set of topics from a truncated IBP. Given this set of topics, draw the topic proportions for that document according to a degenerate Dirichlet distribution, that is, a Dirichlet distribution defined on the simplex of dimension K_{d}−1, where K_{d} is the number of active topics for document d. Each word in the document is then drawn from one of the selected topics, which are defined as discrete distributions over a subset of words previously associated with this topic. Again, the active words are drawn from a truncated IBP and the word probabilities from a degenerate Dirichlet distribution, that is, a Dirichlet distribution defined on the simplex of dimension V_{k}−1, where V_{k} is the number of active words for the selected topic k.
Equations (9)-(11) herein set forth a collapsed Gibbs sampling algorithm for learning sparse topic models assuming a generative process. Equations (12) and (13) set forth the special cases of a clean topic model and a focused topic model, respectively. As per Equation (11), it is also possible to combine these two sparsity aspects. The parameters α and β and eta 32 and gamma 34 can be estimated from the data using Equations (14)-(17). The latter two parameters 32, 34 enable tuning or turning off the document model sparsity and topic sparsity, respectively.
It will be appreciated that various of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. It will also be appreciated that various presently unforeseen or unanticipated alternatives, modifications, variations, or improvements therein may be subsequently made by those skilled in the art, which are also intended to be encompassed by the following claims.