Regularized latent semantic indexing for topic modeling
Abstract
Electronic documents are retrieved from a database and/or from a network of servers. The documents are topic modeled in accordance with a Regularized Latent Semantic Indexing approach. The Regularized Latent Semantic Indexing approach may allow an equation involving an approximation of a term-document matrix to be solved in parallel by multiple calculating units. The equation may include terms that are regularized via an l1 norm and/or an l2 norm. The Regularized Latent Semantic Indexing approach may be applied to a set, or a fixed number, of documents such that the set of documents is topic modeled. Alternatively, the Regularized Latent Semantic Indexing approach may be applied to a variable number of documents such that, over time, the variable number of documents is topic modeled.
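As a reading aid, one objective consistent with the abstract can be written as below. This is an illustrative sketch, not the equation recited in the patent: the squared Frobenius loss, the number of topics K, the weights \lambda_1 and \lambda_2, and the choice of an l1 penalty on the columns u_k of U and a squared l2 penalty on the columns v_n of V are all assumptions. The abstract states only that the terms may be regularized via an l1 norm and/or an l2 norm.

```latex
% Assumed shapes: D is M x N, U is M x K, V is K x N.
\min_{U,\,V} \;
  \lVert D - U V \rVert_F^2
  \;+\; \lambda_1 \sum_{k=1}^{K} \lVert u_k \rVert_1
  \;+\; \lambda_2 \sum_{n=1}^{N} \lVert v_n \rVert_2^2
```

With U held fixed, the loss and the penalty on V both split into a sum over the N columns of V, which is one way to see why the per-document subproblems could be handed to separate calculating units.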
20 Claims
1. A topic modeling system, comprising:
at least one calculating unit; and
at least one computer readable medium in communication with the at least one calculating unit and having instructions and a first equation stored therein, the first equation having terms including a term-document matrix D, a term-topic matrix U, a topic-document matrix V, a regularization of vectors of the term-topic matrix U and a regularization of vectors of the topic-document matrix V, the term-document matrix D having N columns, N>1, each column of the term-document matrix D representing a respective document and having M (M>1) members in which each member represents a respective term of the respective document, the term-topic matrix U and the topic-document matrix V are related such that the term-document matrix D is approximated by a matrix multiplication of the term-topic matrix U and the topic-document matrix V, when executed by the at least one calculating unit, cause the at least one calculating unit to perform acts comprising;
for a number of iterations, minimizing the first equation while holding the topic-document matrix V fixed;
updating the term-topic matrix U based at least on values of the topic-document matrix V calculated in a most recent minimization of the first equation;
minimizing the first equation while holding the term-topic matrix U fixed; and
updating the topic-document matrix V based at least on values of the term-topic matrix U calculated in a most recent minimization of the first equation.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A computer-implemented method for topic modeling, comprising:
defining a first equation having terms including a term-document matrix D, a term-topic matrix U, a topic-document matrix V, a regularization of vectors of the term-topic matrix U and a regularization of vectors of the topic-document matrix V, the term-document matrix D having N columns, N>1, each column of the term-document matrix D representing a respective document and having M (M>1) members in which each member represents a respective term of the respective document, the term-topic matrix U and the topic-document matrix V are related such that a matrix multiplication of the term-topic matrix U and the topic-document matrix V is approximated as the term-document matrix D;
retrieving a number (N) of electronic documents;
representing each retrieved document as a respective vector of the term-document matrix D; and
for a number of iterations, minimizing the first equation while holding the topic-document matrix V fixed, updating the term-topic matrix U based at least on values of the topic-document matrix V calculated in a most recent minimization of the first equation, minimizing the first equation while holding the term-topic matrix U fixed, and updating the topic-document matrix V based at least on values of the term-topic matrix U calculated in a most recent minimization of the first equation; and
storing, at a computer readable storage medium, at least one of the most recently updated term-topic matrix U and the topic-document matrix V.
Dependent claims: 13, 14, 15, 16, 17
18. One or more computer-readable storage media storing computer-executable instructions that, when executed on one or more processors, causes the one or more processors to perform acts comprising:
retrieving a number (N) of electronic documents;
representing each retrieved document as a respective vector of a term-document matrix D;
defining a first equation having terms including the term-document matrix D, a term-topic matrix U, a topic-document matrix V, the term-topic matrix U and the topic-document matrix V are related such that a matrix multiplication of the term-topic matrix U and the topic-document matrix V is approximated as the term-document matrix D;
independently solving in parallel for vectors of the topic-document matrix V and the term-topic matrix U; and
updating the topic-document matrix V and the term-topic matrix U based at least in part of the solved vectors of the document matrix V and the term-topic matrix U; and
storing at least one of the document matrix V and the term-topic matrix U.
Dependent claims: 19, 20
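The alternating updates recited in the independent claims can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes a squared Frobenius loss with squared l2 penalties on both U and V, so that each half-step has a closed-form ridge solution; an l1 penalty would instead need an iterative solver such as coordinate descent. The function names, the weights `lam_u` and `lam_v`, the random initialization, and the use of a thread pool are all hypothetical choices made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def objective(D, U, V, lam_u, lam_v):
    """Assumed objective: ||D - UV||_F^2 + lam_u*||U||_F^2 + lam_v*||V||_F^2."""
    return (np.linalg.norm(D - U @ V) ** 2
            + lam_u * np.sum(U ** 2)
            + lam_v * np.sum(V ** 2))


def alternating_ridge_factorization(D, num_topics, lam_u=0.1, lam_v=0.1,
                                    num_iters=50, seed=0):
    """Approximate D (M x N) by U (M x K) @ V (K x N) via alternating updates."""
    rng = np.random.default_rng(seed)
    M, N = D.shape
    U = rng.random((M, num_topics))
    V = rng.random((num_topics, N))
    eye = np.eye(num_topics)
    for _ in range(num_iters):
        # Hold V fixed and minimize over U:
        #   U = D V^T (V V^T + lam_u I)^{-1}
        U = np.linalg.solve(V @ V.T + lam_u * eye, V @ D.T).T
        # Hold U fixed and minimize over V:
        #   V = (U^T U + lam_v I)^{-1} U^T D
        V = np.linalg.solve(U.T @ U + lam_v * eye, U.T @ D)
    return U, V


def update_V_in_parallel(D, U, lam_v, max_workers=4):
    """Solve each document column of V independently, given a fixed U."""
    A = U.T @ U + lam_v * np.eye(U.shape[1])

    def solve_column(n):
        # Column n of V depends only on column n of D once U is fixed.
        return np.linalg.solve(A, U.T @ D[:, n])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        columns = list(pool.map(solve_column, range(D.shape[1])))
    return np.column_stack(columns)


if __name__ == "__main__":
    # Toy term-document matrix: 5 terms (rows) by 4 documents (columns).
    D = np.array([[2.0, 0.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0, 0.0],
                  [0.0, 3.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0, 2.0],
                  [1.0, 1.0, 1.0, 1.0]])
    U, V = alternating_ridge_factorization(D, num_topics=2)
    print("objective:", objective(D, U, V, 0.1, 0.1))
```

Because each half-step exactly minimizes the assumed objective over one matrix while the other is held fixed, the objective cannot increase from one iteration to the next. `update_V_in_parallel` returns the same matrix as the batched V update and only shows that the per-document solves do not depend on one another; the rows of U can be treated the same way when V is held fixed.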
Specification