Scalable probabilistic latent semantic analysis
Abstract
A scalable two-pass probabilistic latent semantic analysis (PLSA) methodology is disclosed that may perform more efficiently, and in some cases more accurately, than traditional PLSA, especially where large and/or sparse data sets are provided for analysis. The improved methodology can greatly reduce the storage and/or computational costs of training a PLSA model. In the first pass of the two-pass methodology, objects are clustered into groups, and PLSA is performed on the groups instead of on the original individual objects. In the second pass, the conditional probability of a latent class, given an object, is obtained. This may be done by extending the training results of the first pass. During the second pass, the most likely latent classes for each object are identified.
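To make the two-pass idea concrete, the following is a minimal sketch, not the patented implementation. It assumes the group assignment is already available (the hypothetical `group_of` array), fits a textbook EM-based PLSA on the aggregated group counts, and then uses standard PLSA fold-in as a stand-in for "extending the training results of the first pass". All function names, parameters, and the toy data are illustrative assumptions.

```python
import numpy as np

def plsa_em(counts, n_classes, n_iter=50, seed=0):
    """Textbook PLSA fitted by EM on a (rows x features) count matrix.

    Returns P(z), P(row | z) and P(feature | z).
    """
    rng = np.random.default_rng(seed)
    n_rows, n_feats = counts.shape
    p_z = np.full(n_classes, 1.0 / n_classes)
    p_r_z = rng.random((n_rows, n_classes))
    p_r_z /= p_r_z.sum(axis=0)
    p_f_z = rng.random((n_feats, n_classes))
    p_f_z /= p_f_z.sum(axis=0)
    for _ in range(n_iter):
        # E-step: responsibilities P(z | row, feature), shape (rows, feats, classes).
        joint = p_r_z[:, None, :] * p_f_z[None, :, :] * p_z
        resp = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-12)
        # M-step: re-estimate all three distributions from expected counts.
        weighted = counts[:, :, None] * resp
        p_r_z = weighted.sum(axis=1)
        p_f_z = weighted.sum(axis=0)
        class_mass = p_r_z.sum(axis=0)
        p_r_z = p_r_z / np.maximum(class_mass, 1e-12)
        p_f_z = p_f_z / np.maximum(class_mass, 1e-12)
        p_z = class_mass / class_mass.sum()
    return p_z, p_r_z, p_f_z

def first_pass(object_counts, group_of, n_groups, n_classes):
    """Aggregate object counts per group, then run PLSA on the groups."""
    group_counts = np.zeros((n_groups, object_counts.shape[1]))
    np.add.at(group_counts, group_of, object_counts)
    return plsa_em(group_counts, n_classes)

def fold_in(object_counts, p_f_z, p_z, n_iter=20):
    """Estimate P(z | object) with P(feature | z) held fixed (standard fold-in).

    This is an assumed way to extend first-pass results to individual objects.
    """
    p_z_d = np.tile(p_z, (object_counts.shape[0], 1))
    for _ in range(n_iter):
        joint = p_z_d[:, None, :] * p_f_z[None, :, :]
        resp = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-12)
        p_z_d = (object_counts[:, :, None] * resp).sum(axis=1)
        p_z_d /= np.maximum(p_z_d.sum(axis=1, keepdims=True), 1e-12)
    return p_z_d

# Toy data (made up): 4 objects, 4 features, 2 pre-computed groups.
object_counts = np.array([[2, 1, 0, 0],
                          [1, 1, 0, 0],
                          [0, 0, 1, 2],
                          [0, 0, 1, 1]])
group_of = np.array([0, 0, 1, 1])

p_z, p_g_z, p_f_z = first_pass(object_counts, group_of, n_groups=2, n_classes=2)
p_z_d = fold_in(object_counts, p_f_z, p_z)
most_likely_class = p_z_d.argmax(axis=1)  # one latent class index per object
```

The dense three-dimensional responsibility array keeps the sketch short but only suits toy inputs; a large or sparse data set of the kind the abstract targets would need a sparse representation.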
10 Claims
1. A computer-readable medium storing computer-executable instructions for performing operations comprising:
- clustering a set of data objects into a plurality of groups;
- performing a first pass of performing probabilistic latent semantic analysis on the groups;
- identifying a plurality of latent classes of the set of data objects;
- calculating a first conditional probability of a data object of the set of data objects given a latent class of the plurality of latent classes;
- estimating a ranking of each latent class;
- eliminating low probability links between the set of data objects and the latent classes based on the rankings, the low probability links being determined based on a predetermined probability threshold;
- determining remaining links between the set of data objects and the latent classes;
- performing a second pass of probabilistic latent semantic analysis on a result of the first pass based on the remaining links between the set of data objects and the latent classes; and
- calculating a second conditional probability of a data object of the set of data objects given the remaining links between the set of data objects and the latent classes.
Dependent claims: 2, 3, 4, 5
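The ranking and link-elimination steps recited in claim 1 could be sketched roughly as follows. This is an illustration under stated assumptions rather than the claimed method itself: it assumes the ranking is taken per object over the posterior P(z | d) obtained by Bayes' rule from the first-pass P(d | z) and P(z), and that a link is "low probability" when this posterior falls below the threshold. The function name `prune_links`, its parameters, and the choice to always retain each object's top-ranked class are hypothetical.

```python
import numpy as np

def prune_links(p_d_given_z, p_z, threshold):
    """Return (keep, ranking) for object-to-class links.

    p_d_given_z: array (n_objects, n_classes); column z holds P(d | z).
    p_z:         array (n_classes,) of class priors P(z).
    threshold:   assumed predetermined probability threshold on P(z | d).
    """
    joint = p_d_given_z * p_z                              # P(d, z)
    posterior = joint / joint.sum(axis=1, keepdims=True)   # P(z | d)
    # Rank the latent classes for each object, most probable first.
    ranking = np.argsort(-posterior, axis=1)
    # Eliminate links whose posterior falls below the threshold.
    keep = posterior >= threshold
    # Assumption: keep each object's best class so no object loses every link.
    keep[np.arange(len(keep)), ranking[:, 0]] = True
    return keep, ranking

# Made-up first-pass output for 3 objects and 2 latent classes.
p_d_given_z = np.array([[0.5, 0.1],
                        [0.4, 0.1],
                        [0.1, 0.8]])
p_z = np.array([0.5, 0.5])

keep, ranking = prune_links(p_d_given_z, p_z, threshold=0.18)
# keep marks the remaining links that a second pass would be restricted to.
```

In a second pass, the boolean `keep` mask could be used to zero out the eliminated responsibilities so that only the remaining links are updated, which is where the storage and computation savings would come from.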
6. A method of performing two-pass probabilistic latent semantic analysis, the method comprising:
- receiving a set of data objects at an input of a computing device;
- clustering the set of data objects into a plurality of groups;
- performing a first pass of performing probabilistic latent semantic analysis on the groups;
- estimating a ranking of each latent class;
- eliminating low probability links between the set of data objects and the latent classes based on the rankings, the low probability links being determined based on a predetermined probability threshold;
- determining remaining links between the set of data objects and the latent classes; and
- performing a second pass of probabilistic latent semantic analysis, using a processor of the computing device, on a result of the first pass based on the remaining links between the set of data objects and the latent classes.
Dependent claims: 7, 8, 9, 10
Specification