HYBRID TENSOR-BASED CLUSTER ANALYSIS
First Claim
1. A method for identifying clusters within data sets in a document processing system, the method comprising:
receiving, from an electronic document storage system, a plurality of digital documents for which multi-dimensional probabilistic relationships are to be determined;
parsing, with a computer processor, the plurality of digital documents to identify multi-dimensional count data within each of the plurality of digital documents, the multi-dimensional count data comprising at least three dimensions with each dimension comprising a respective data class comprising one of metadata associated with a respective document and text within a respective document;
producing a data set comprising at least a three dimensional tensor representing the multi-dimensional count data;
defining cluster definition matrices comprising estimated cluster membership probability of each element of each dimension of the multi-dimensional count data within the data set, the estimated cluster membership probability indicating a probability of membership of each element in a respective data cluster;
setting the cluster definition matrices to initial cluster definition matrices comprising random entry values for the data set;
setting a pre-defined convergence criteria for iterative cluster definition refinement processing;
iteratively processing the cluster definition matrices until the convergence criteria has been satisfied, the iteratively processing comprising:
processing the data set and the cluster definition matrices using a first tensor factorization model of a first tensor factorization technique to produce updated cluster definition matrices; and
processing the data set and the updated cluster definition matrices using a second tensor factorization model of a second tensor factorization technique to refine the updated cluster definition matrices, wherein the second tensor factorization model is used to refine the results of the first tensor factorization model, and wherein the second factorization technique is different than the first factorization technique;
determining at least one likely cluster membership for any of the multi-dimensional count data within the data set based upon refinements made to the cluster definition matrices, a likely cluster membership comprising an indicator of membership of an element of the multi-dimensional count data in a respective cluster; and
outputting the at least one likely cluster membership.
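The parsing, tensor-producing, and initialization steps recited above can be illustrated with a minimal sketch. It assumes three hypothetical data classes (author and year as metadata, words as text), a toy `docs` list, and helper names `build_count_tensor` and `init_cluster_matrices`; none of these come from the claim, which leaves the choice of dimensions and the distribution of the random entries open.

```python
from collections import Counter
import numpy as np

# Hypothetical toy documents: two metadata fields plus body text.
docs = [
    {"author": "ann", "year": 2008, "text": "tensor factorization tensor"},
    {"author": "bob", "year": 2009, "text": "cluster analysis"},
    {"author": "ann", "year": 2009, "text": "tensor cluster"},
]

def build_count_tensor(docs):
    """Return a 3-way count tensor (author x word x year) and its labels."""
    authors = sorted({d["author"] for d in docs})
    words = sorted({w for d in docs for w in d["text"].split()})
    years = sorted({d["year"] for d in docs})
    a_ix = {a: i for i, a in enumerate(authors)}
    w_ix = {w: j for j, w in enumerate(words)}
    y_ix = {y: k for k, y in enumerate(years)}
    X = np.zeros((len(authors), len(words), len(years)))
    for d in docs:
        # Count each word once per occurrence in the document text.
        for w, n in Counter(d["text"].split()).items():
            X[a_ix[d["author"]], w_ix[w], y_ix[d["year"]]] += n
    return X, (authors, words, years)

def init_cluster_matrices(shape, n_clusters, seed=0):
    """One matrix per tensor dimension, filled with random entries in [0, 1)."""
    rng = np.random.default_rng(seed)
    return [rng.random((dim, n_clusters)) for dim in shape]

X, labels = build_count_tensor(docs)
A, B, C = init_cluster_matrices(X.shape, n_clusters=2)
```

Each row of `A`, `B`, and `C` corresponds to one element of a dimension (one author, word, or year) and each column to one candidate cluster, which matches the shape of the cluster definition matrices described in the claim.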
Abstract
What is disclosed is a novel system and method for analyzing multi-dimensional cluster data sets to identify clusters of related documents in an electronic document storage system. Digital documents, for which multi-dimensional probabilistic relationships are to be determined, are received and then parsed to identify multi-dimensional count data with at least three dimensions. Multi-dimensional tensors representing the count data and estimated cluster membership probabilities are created. The tensors are then iteratively processed using a first and a complementary second tensor factorization model to refine the cluster definition matrices until a convergence criteria has been satisfied. Likely cluster memberships for the count data are determined based upon the refinements made to the cluster definition matrices by the alternating tensor factorization models. The present method advantageously extends to the field of tensor analysis a combination of Non-negative Matrix Factorization and Probabilistic Latent Semantic Analysis to decompose non-negative data.
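The alternating refinement described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the patented models: the first step is stood in for by a multiplicative update that reduces squared (Frobenius) error for a nonnegative PARAFAC-form model, and the second by a multiplicative update that reduces KL divergence, which is the objective PLSA-style models optimize. The function names, the tolerance `tol`, and the relative-change stopping rule are all hypothetical choices.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero

def reconstruct(A, B, C):
    """Rebuild the 3-way tensor from the three cluster definition matrices."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)

def frobenius_step(X, A, B, C):
    """One sweep of NMF-style multiplicative updates (squared-error objective)."""
    A = A * np.einsum("ijk,jr,kr->ir", X, B, C) / (A @ ((B.T @ B) * (C.T @ C)) + EPS)
    B = B * np.einsum("ijk,ir,kr->jr", X, A, C) / (B @ ((A.T @ A) * (C.T @ C)) + EPS)
    C = C * np.einsum("ijk,ir,jr->kr", X, A, B) / (C @ ((A.T @ A) * (B.T @ B)) + EPS)
    return A, B, C

def kl_step(X, A, B, C):
    """One sweep of PLSA-style multiplicative updates (KL-divergence objective)."""
    R = X / (reconstruct(A, B, C) + EPS)
    A = A * np.einsum("ijk,jr,kr->ir", R, B, C) / (B.sum(0) * C.sum(0) + EPS)
    R = X / (reconstruct(A, B, C) + EPS)
    B = B * np.einsum("ijk,ir,kr->jr", R, A, C) / (A.sum(0) * C.sum(0) + EPS)
    R = X / (reconstruct(A, B, C) + EPS)
    C = C * np.einsum("ijk,ir,jr->kr", R, A, B) / (A.sum(0) * B.sum(0) + EPS)
    return A, B, C

def hybrid_factorize(X, n_clusters, tol=1e-6, max_iter=500, seed=0):
    """Alternate the two update styles until the error stops changing."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.random((n, n_clusters)) for n in X.shape)  # random start
    prev = np.inf
    for _ in range(max_iter):
        A, B, C = frobenius_step(X, A, B, C)  # first technique
        A, B, C = kl_step(X, A, B, C)         # second technique refines the first
        err = np.linalg.norm(X - reconstruct(A, B, C))
        if abs(prev - err) <= tol * max(err, EPS):  # assumed convergence criterion
            break
        prev = err
    return A, B, C, err
```

Because both updates are multiplicative, nonnegative starting matrices stay nonnegative throughout. Handing the output of one step directly to the other reflects the abstract's description of alternating models; whether the two objectives are weighted or scheduled differently in practice is not stated in this section.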
19 Claims
1. A method for identifying clusters within data sets in a document processing system, as recited in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A system for identifying clusters within data sets in a document processing system, the system comprising:
a memory;
a storage medium for storing data; and
a processor in communication with said storage medium and said memory, said processor executing machine readable instructions for performing the method of:
receiving, from an electronic document storage system, a plurality of digital documents for which multi-dimensional probabilistic relationships are to be determined;
parsing, with a computer processor, the plurality of digital documents to identify multi-dimensional count data within each of the plurality of digital documents, the multi-dimensional count data comprising at least three dimensions with each dimension comprising a respective data class comprising one of metadata associated with a respective document and text within a respective document;
producing a data set comprising at least a three dimensional tensor representing the multi-dimensional count data;
defining cluster definition matrices comprising estimated cluster membership probability of each element of each dimension of the multi-dimensional count data within the data set, the estimated cluster membership probability indicating a probability of membership of each element in a respective data cluster;
setting the cluster definition matrices to initial cluster definition matrices comprising random entry values for the data set;
setting a pre-defined convergence criteria for iterative cluster definition refinement processing;
iteratively processing the cluster definition matrices until the convergence criteria has been satisfied, the iteratively processing comprising:
processing the data set and the cluster definition matrices using a first tensor factorization model of a first tensor factorization technique to produce updated cluster definition matrices; and
processing the data set and the updated cluster definition matrices using a second tensor factorization model of a second tensor factorization technique to refine the updated cluster definition matrices, wherein the second tensor factorization model is a complementary model of the first tensor factorization model, and wherein the second factorization technique is different than the first factorization technique;
determining at least one likely cluster membership for any of the multi-dimensional count data within the data set based upon refinements made to the cluster definition matrices, a likely cluster membership comprising an indicator of membership of an element of the multi-dimensional count data in a respective cluster; and
outputting the at least one likely cluster membership.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
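The final determining and outputting steps can be illustrated with a short sketch. The claim does not say how a likely membership is derived from a refined cluster definition matrix, so the rule used here (normalize each row, then pick the highest-probability cluster) is an assumption, as are the function name `likely_memberships` and the example matrix.

```python
import numpy as np

def likely_memberships(M, labels):
    """Map each element (one row of M) to its most probable cluster.

    Rows are normalized so the entries for one element sum to 1; this
    normalization and the argmax rule are illustrative assumptions.
    """
    row_sums = M.sum(axis=1, keepdims=True)
    P = M / np.where(row_sums > 0, row_sums, 1.0)  # avoid dividing by zero
    best = P.argmax(axis=1)  # ties resolve to the lowest cluster index
    return {lab: (int(c), float(P[i, c])) for i, (lab, c) in enumerate(zip(labels, best))}

# Hypothetical refined matrix for a "word" dimension: 3 words x 2 clusters.
words = ["tensor", "cluster", "analysis"]
B = np.array([[0.8, 0.2],
              [0.1, 0.3],
              [0.0, 0.5]])

memberships = likely_memberships(B, words)
for word, (cluster, prob) in memberships.items():
    print(f"{word}: cluster {cluster} (p={prob:.2f})")
```

Returning the probability alongside the cluster index keeps the output usable both as a hard assignment and as the "indicator of membership" the claim mentions.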
17. A method for identifying clusters within data sets in a document processing system, the method comprising:
receiving, from an electronic document storage system, a plurality of digital documents for which multi-dimensional probabilistic relationships are to be determined;
parsing, with a computer processor, the plurality of digital documents to identify multi-dimensional count data within each of the plurality of digital documents, the multi-dimensional count data comprising at least three dimensions with each dimension comprising a respective data class comprising one of metadata associated with a respective document and text within a respective document;
categorizing the multi-dimensional count data into at least three categories;
producing a data set comprising at least a three dimensional tensor representing the multi-dimensional count data;
defining cluster definition matrices comprising estimated cluster membership probability of each element of each dimension of the multi-dimensional count data within the data set, the estimated cluster membership probability indicating a probability of membership of each element in a respective data cluster;
populating the at least three dimensional tensor within the data set with the multi-dimensional count data, wherein each category within the at least three categories populates a respective dimension of the at least three dimensional tensor within the data set;
setting the cluster definition matrices to initial cluster definition matrices comprising random entry values for the data set;
setting a pre-defined convergence criteria for iterative cluster definition refinement processing;
iteratively processing the cluster definition matrices until the convergence criteria has been satisfied, the iteratively processing comprising:
processing the data set and the cluster definition matrices using one of an NParafac Factorization Model and an NTucker3 Tensor Decomposition Model to produce updated cluster definition matrices; and
processing the data set and the updated cluster definition matrices using one of a ParaAspect Factorization model and a TuckAspect Tensor Decomposition Model to refine the updated cluster definition matrices, wherein the one of the ParaAspect Factorization model and the TuckAspect Tensor Decomposition Model is a complementary model of the NParafac Factorization Model and the NTucker3 Tensor Factorization Model;
determining at least one likely cluster membership for any of the multi-dimensional count data within the data set based upon refinements made to the cluster definition matrices, a likely cluster membership comprising an indicator of membership of an element of the multi-dimensional count data in a respective cluster; and
outputting the at least one likely cluster membership.
Dependent claims: 18, 19
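The four model names in this claim are not defined in this section. As a reading aid only, the forms below assume that NParafac and NTucker3 denote nonnegative versions of the standard PARAFAC (CP) and Tucker3 decompositions, and that ParaAspect and TuckAspect are their probabilistic (PLSA-style, "aspect model") counterparts, in line with the abstract's reference to Non-negative Matrix Factorization and Probabilistic Latent Semantic Analysis. The symbols X, A, B, C, G and the index names are illustrative and do not come from the claim.

```latex
% Nonnegative PARAFAC (CP) form: one set of R clusters shared by all three modes
X_{ijk} \approx \sum_{r=1}^{R} A_{ir}\, B_{jr}\, C_{kr},
\qquad A, B, C \ge 0

% Nonnegative Tucker3 form: a core tensor G links separate per-mode clusters
X_{ijk} \approx \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R}
    G_{pqr}\, A_{ip}\, B_{jq}\, C_{kr},
\qquad G, A, B, C \ge 0

% Assumed aspect-model reading of the same two structures
P(i, j, k) = \sum_{r} P(r)\, P(i \mid r)\, P(j \mid r)\, P(k \mid r)

P(i, j, k) = \sum_{p, q, r} P(p, q, r)\, P(i \mid p)\, P(j \mid q)\, P(k \mid r)
```

Under this reading, each pair shares the same algebraic structure and differs in the objective being optimized (squared error versus a likelihood or KL-divergence objective), which would be consistent with the claim calling the second model "complementary" to the first.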
Specification