HYBRID TENSOR-BASED CLUSTER ANALYSIS

US 20100312797A1
Filed: 06/05/2009
Published: 12/09/2010
Est. Priority Date: 06/05/2009
Status: Active Grant

Abstract

What is disclosed is a novel system and method for analyzing multi-dimensional cluster data sets to identify clusters of related documents in an electronic document storage system. Digital documents, for which multi-dimensional probabilistic relationships are to be determined, are received and then parsed to identify multi-dimensional count data with at least three dimensions. Multi-dimensional tensors representing the count data and estimated cluster membership probabilities are created. The tensors are then iteratively processed using a first and a complementary second tensor factorization model to refine the cluster definition matrices until a convergence criteria has been satisfied. Likely cluster memberships for the count data are determined based upon the refinements made to the cluster definition matrices by the alternating tensor factorization models. The present method advantageously extends to the field of tensor analysis a combination of Non-negative Matrix Factorization and Probabilistic Latent Semantic Analysis to decompose non-negative data.
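
As a rough illustration of the "multi-dimensional count data with at least three dimensions" described above, the sketch below builds a three-way count tensor with NumPy. The choice of dimensions (author, word, year), the record format, and the function name `build_count_tensor` are assumptions made for this example; the abstract does not prescribe them.

```python
import numpy as np

def build_count_tensor(records):
    """Build a 3-way count tensor from a list of (author, word, year) triples.

    Each triple stands for one word occurrence in one document. The three
    data classes used here are hypothetical examples, not taken from the patent.
    """
    # One sorted vocabulary per dimension, plus a value -> index lookup.
    axes = [sorted({r[d] for r in records}) for d in range(3)]
    index = [{v: i for i, v in enumerate(ax)} for ax in axes]

    # Count how often each (author, word, year) combination occurs.
    X = np.zeros([len(ax) for ax in axes], dtype=np.int64)
    for r in records:
        X[tuple(index[d][r[d]] for d in range(3))] += 1
    return X, axes
```

Each axis of the returned tensor corresponds to one data class, so `X[i, j, l]` is the number of times author `i` used word `j` in year `l` under these assumptions.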

19 Claims

1. A method for identifying clusters within data sets in a document processing system, the method comprising:
- receiving, from an electronic document storage system, a plurality of digital documents for which multi-dimensional probabilistic relationships are to be determined;
  
  parsing, with a computer processor, the plurality of digital documents to identify multi-dimensional count data within each of the plurality of digital documents, the multi-dimensional count data comprising at least three dimensions with each dimension comprising a respective data class comprising one of metadata associated with a respective document and text within a respective document;
  
  producing a data set comprising at least a three dimensional tensor representing the multi-dimensional count data;
  
  defining cluster definition matrices comprising estimated cluster membership probability of each element of each dimension of the multi-dimensional count data within the data set, the estimated cluster membership probability indicating a probability of membership of each element in a respective data cluster;
  
  setting the cluster definition matrices to initial cluster definition matrices comprising random entry values for the data set;
  
  setting a pre-defined convergence criteria for iterative cluster definition refinement processing;
  
  iteratively processing the cluster definition matrices until the convergence criteria has been satisfied, the iteratively processing comprising;
  
  processing the data set and the cluster definition matrices using a first tensor factorization model of a first tensor factorization technique to produce updated cluster definition matrices; and
  
  processing the data set and the updated cluster definition matrices using a second tensor factorization model of a second tensor factorization technique to refine the updated cluster definition matrices, wherein the second tensor factorization model is used to refine the results of the first tensor factorization model, and wherein the second factorization technique is different than the first factorization technique;
  
  determining at least one likely cluster membership for any of the multi-dimensional count data within the data set based upon refinements made to the cluster definition matrices, a likely cluster membership comprising an indicator of membership of an element of the multi-dimensional count data in a respective cluster; and
  
  outputting the at least one likely cluster membership.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1, wherein the first and second tensor factorization models each comprise one of:
    - a NParafac Factorization Model and a ParaAspect Factorization Model, the first tensor factorization model being different from the second tensor factorization model.
  - 3. The method of claim 1, wherein the first and second tensor factorization models each comprise one of:
    - a NTucker3 Tensor Decomposition Model and a TuckAspect Tensor Decomposition Model, the first tensor factorization model being different from the second tensor factorization model.
  - 4. The method of claim 1, wherein parsing the plurality of digital documents to identify multi-dimensional count data within each of the plurality of digital documents comprises performing one of data mining and text mining on each digital document within the plurality of digital documents.
  - 5. The method of claim 1, wherein producing a data set comprising at least a three dimensional tensor representing the multi-dimensional count data comprises:
    - categorizing the multi-dimensional count data into at least three categories; and
      
      populating the at least three dimensional tensor within the data set with the multi-dimensional count data, wherein each category within the at least three categories populates a respective dimension of the at least three dimensional tensor within the data set.
  - 6. The method of claim 1, wherein determining at least one likely cluster membership for any of the multi-dimensional count data within the data set based upon refinements made to the cluster definition matrices comprises determining a highest probability of cluster membership for the any of the multi-dimensional count data.
  - 7. The method of claim 1, further comprising scanning, prior to the receiving, a plurality of printed documents into the electronic document storage system.
  - 8. The method of claim 1, further comprising:
    - crawling, prior to the receiving, a plurality of electronic documents available over a computer network; and
      
      storing the plurality of electronic documents into the electronic document storage system.
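
The iterative refinement recited in claim 1 (random initial matrices, a first factorization update, a second and different factorization update, repeated until a convergence criterion is met) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented NParafac or ParaAspect procedure: the first technique is stood in for by a least-squares multiplicative update for a non-negative PARAFAC model, and the second by a KL-divergence multiplicative update of the kind used in PLSA-style models. All function names, the tolerance, and the stopping rule are hypothetical.

```python
import numpy as np

def _fro_step(X, A, B, C, eps=1e-12):
    # Least-squares multiplicative update for the first-mode factor A.
    num = np.einsum('ijl,jk,lk->ik', X, B, C)
    den = A @ ((B.T @ B) * (C.T @ C)) + eps
    return A * num / den

def _kl_step(X, A, B, C, eps=1e-12):
    # KL-divergence multiplicative update for the first-mode factor A.
    Xhat = np.einsum('ik,jk,lk->ijl', A, B, C) + eps
    num = np.einsum('ijl,jk,lk->ik', X / Xhat, B, C)
    den = B.sum(axis=0) * C.sum(axis=0) + eps
    return A * num / den

def hybrid_cp(X, rank, tol=1e-6, max_iter=200, seed=0):
    """Alternate two different update rules on shared non-negative factors."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Random non-negative initial cluster definition matrices, one per mode.
    F = [rng.random((n, rank)) + 0.1 for n in X.shape]
    # Axis orders that bring each mode to the front in turn.
    perms = [(0, 1, 2), (1, 0, 2), (2, 0, 1)]
    prev = np.inf
    for _ in range(max_iter):
        for step in (_fro_step, _kl_step):       # first, then second technique
            for m, p in enumerate(perms):
                others = [F[q] for q in p[1:]]
                F[m] = step(X.transpose(p), F[m], *others)
        err = np.linalg.norm(X - np.einsum('ik,jk,lk->ijl', *F))
        # Assumed convergence criterion: reconstruction error stops changing.
        if abs(prev - err) <= tol * max(err, 1.0):
            break
        prev = err
    return F
```

Calling `hybrid_cp(X, rank=2)` returns one non-negative matrix per tensor dimension, with one column per cluster. Because the two update rules optimize different objectives, neither objective is guaranteed to decrease monotonically, which is why the sketch tests the absolute change in error rather than a strict decrease.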

9. A system for identifying clusters within data sets in a document processing system, the system comprising:
- a memory;
  
  a storage medium for storing data; and
  
  a processor in communication with said storage medium and said memory, said processor executing machine readable instructions for performing the method of;
  
  receiving, from an electronic document storage system, a plurality of digital documents for which multi-dimensional probabilistic relationships are to be determined;
  
  parsing, with a computer processor, the plurality of digital documents to identify multi-dimensional count data within each of the plurality of digital documents, the multi-dimensional count data comprising at least three dimensions with each dimension comprising a respective data class comprising one of metadata associated with a respective document and text within a respective document;
  
  producing a data set comprising at least a three dimensional tensor representing the multi-dimensional count data;
  
  defining cluster definition matrices comprising estimated cluster membership probability of each element of each dimension of the multi-dimensional count data within the data set, the estimated cluster membership probability indicating a probability of membership of each element in a respective data cluster;
  
  setting the cluster definition matrices to initial cluster definition matrices comprising random entry values for the data set;
  
  setting a pre-defined convergence criteria for iterative cluster definition refinement processing;
  
  iteratively processing the cluster definition matrices until the convergence criteria has been satisfied, the iteratively processing comprising;
  
  processing the data set and the cluster definition matrices using a first tensor factorization model of a first tensor factorization technique to produce updated cluster definition matrices; and
  
  processing the data set and the updated cluster definition matrices using a second tensor factorization model of a second tensor factorization technique to refine the updated cluster definition matrices, wherein the second tensor factorization model is a complementary model of the first tensor factorization model, and wherein the second factorization technique is different than the first factorization technique;
  
  determining at least one likely cluster membership for any of the multi-dimensional count data within the data set based upon refinements made to the cluster definition matrices, a likely cluster membership comprising an indicator of membership of an element of the multi-dimensional count data in a respective cluster; and
  
  outputting the at least one likely cluster membership.
- Dependent claims (10, 11, 12, 13, 14, 15, 16):
  - 10. The system of claim 9, wherein the first and second tensor factorization models each comprise one of:
    - a NParafac Factorization Model and a ParaAspect Factorization Model, the first tensor factorization model being different from the second tensor factorization model.
  - 11. The system of claim 9, wherein the first and second tensor factorization models each comprise one of:
    - a NTucker3 Tensor Decomposition Model and a TuckAspect Tensor Decomposition Model, the first tensor factorization model being different from the second tensor factorization model.
  - 12. The system of claim 9, wherein parsing the plurality of digital documents to identify multi-dimensional count data within each of the plurality of digital documents comprises performing one of data mining and text mining on each digital document within the plurality of digital documents.
  - 13. The system of claim 9, wherein producing a data set comprising at least a three dimensional tensor representing the multi-dimensional count data comprises:
    - categorizing the multi-dimensional count data into at least three categories; and
      
      populating the at least three dimensional tensor within the data set with the multi-dimensional count data, wherein each category within the at least three categories populates a respective dimension of the at least three dimensional tensor within the data set.
  - 14. The system of claim 9, wherein determining at least one likely cluster membership for any of the multi-dimensional count data within the data set based upon refinements made to the cluster definition matrices comprises determining a highest probability of cluster membership for the any of the multi-dimensional count data.
  - 15. The system of claim 9, wherein the method further comprises scanning, prior to the receiving, a plurality of printed documents into the electronic document storage system.
  - 16. The system of claim 9, wherein the method further comprises:
    - crawling, prior to the receiving, a plurality of electronic documents available over a computer network; and
      
      storing, prior to the receiving, the plurality of electronic documents into the electronic document storage system.
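
Claims 6 and 14 determine a likely cluster membership by taking the highest membership probability. A minimal sketch of that step is shown below, assuming the cluster definition matrix is a non-negative array with one row per element and one column per cluster; the function name `likely_memberships` and the row normalization are assumptions for illustration.

```python
import numpy as np

def likely_memberships(factor):
    """Return per-element membership probabilities and the most likely cluster.

    factor: non-negative array of shape (n_elements, n_clusters).
    """
    # Normalize each row so its entries can be read as probabilities.
    row_sums = np.maximum(factor.sum(axis=1, keepdims=True), 1e-12)
    probs = factor / row_sums
    # The likely cluster is the column with the highest probability.
    return probs, probs.argmax(axis=1)
```

An all-zero row carries no membership information; in this sketch it falls back to cluster 0, which a real system would probably want to flag separately.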

17. A method for identifying clusters within data sets in a document processing system, the method comprising:
- receiving, from an electronic document storage system, a plurality of digital documents for which multi-dimensional probabilistic relationships are to be determined;
  
  parsing, with a computer processor, the plurality of digital documents to identify multi-dimensional count data within each of the plurality of digital documents, the multi-dimensional count data comprising at least three dimensions with each dimension comprising a respective data class comprising one of metadata associated with a respective document and text within a respective document;
  
  categorizing the multi-dimensional count data into at least three categories;
  
  producing a data set comprising at least a three dimensional tensor representing the multi-dimensional count data;
  
  defining cluster definition matrices comprising estimated cluster membership probability of each element of each dimension of the multi-dimensional count data within the data set, the estimated cluster membership probability indicating a probability of membership of each element in a respective data cluster;
  
  populating the at least three dimensional tensor within the data set with the multi-dimensional count data, wherein each category within the at least three categories populates a respective dimension of the at least three dimensional tensor within the data set;
  
  setting the cluster definition matrices to initial cluster definition matrices comprising random entry values for the data set;
  
  setting a pre-defined convergence criteria for iterative cluster definition refinement processing;
  
  iteratively processing the cluster definition matrices until the convergence criteria has been satisfied, the iteratively processing comprising;
  
  processing the data set and the cluster definition matrices using one of an NParafac Factorization Model and an NTucker3 Tensor Decomposition Model to produce updated cluster definition matrices; and
  
  processing the data set and the updated cluster definition matrices using one of a ParaAspect Factorization model and a TuckAspect Tensor Decomposition Model to refine the updated cluster definition matrices, wherein the one of the ParaAspect Factorization model and the TuckAspect Tensor Decomposition Model is a complementary model of the NParafac Factorization Model and the NTucker3 Tensor Factorization Model;
  
  determining at least one likely cluster membership for any of the multi-dimensional count data within the data set based upon refinements made to the cluster definition matrices, a likely cluster membership comprising an indicator of membership of an element of the multi-dimensional count data in a respective cluster; and
  
  outputting the at least one likely cluster membership.
- Dependent claims (18, 19):
  - 18. The method of claim 17, further comprising scanning, prior to the receiving, a plurality of printed documents into the electronic document storage system.
  - 19. The method of claim 17, further comprising:
    - crawling, prior to the receiving, a plurality of electronic documents available over a computer network; and
      
      storing, prior to the receiving, the plurality of electronic documents into the electronic document storage system.
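
Claim 17 names both PARAFAC-style and Tucker-style models. The sketch below shows how the two model families reconstruct a three-way tensor, using the standard textbook definitions of PARAFAC and Tucker3; whether the patent's NParafac and NTucker3 models match these forms exactly is an assumption, and the function names are hypothetical.

```python
import numpy as np

def parafac_reconstruct(A, B, C):
    # PARAFAC: a sum of rank-one terms, one shared component index k per mode.
    return np.einsum('ik,jk,lk->ijl', A, B, C)

def tucker3_reconstruct(G, A, B, C):
    # Tucker3: a core tensor G lets every component of one mode interact
    # with every component of the other modes.
    return np.einsum('pqr,ip,jq,lr->ijl', G, A, B, C)
```

With a superdiagonal core (ones at `G[k, k, k]`, zeros elsewhere) the Tucker3 form reduces to the PARAFAC form, so PARAFAC can be read as the special case in which each mode has the same number of clusters and clusters are matched one to one across modes.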

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: PENG, WEI
Granted Patent: US 8,060,512 B2
US Class Current: 707/776
CPC Class Codes:
G06F 16/93   Document management systems
G06N 20/00   Machine learning
G06N 7/01   Probabilistic graphical mod...
