
HYBRID TENSOR-BASED CLUSTER ANALYSIS

  • US 20100312797A1
  • Filed: 06/05/2009
  • Published: 12/09/2010
  • Est. Priority Date: 06/05/2009
  • Status: Active Grant
First Claim

1. A method for identifying clusters within data sets in a document processing system, the method comprising:

    receiving, from an electronic document storage system, a plurality of digital documents for which multi-dimensional probabilistic relationships are to be determined;

    parsing, with a computer processor, the plurality of digital documents to identify multi-dimensional count data within each of the plurality of digital documents, the multi-dimensional count data comprising at least three dimensions with each dimension comprising a respective data class comprising one of metadata associated with a respective document and text within a respective document;

    producing a data set comprising at least a three dimensional tensor representing the multi-dimensional count data;

    defining cluster definition matrices comprising estimated cluster membership probability of each element of each dimension of the multi-dimensional count data within the data set, the estimated cluster membership probability indicating a probability of membership of each element in a respective data cluster;

    setting the cluster definition matrices to initial cluster definition matrices comprising random entry values for the data set;

    setting a pre-defined convergence criteria for iterative cluster definition refinement processing;

    iteratively processing the cluster definition matrices until the convergence criteria has been satisfied, the iteratively processing comprising:

        processing the data set and the cluster definition matrices using a first tensor factorization model of a first tensor factorization technique to produce an updated cluster definition matrices; and

        processing the data set and the updated cluster definition matrices using a second tensor factorization model of a second tensor factorization technique to refine the updated cluster definition matrices, wherein the second tensor factorization model is used to refine the results of the first tensor factorization model, and wherein the second factorization technique is different than the first factorization technique;

    determining at least one likely cluster membership for any of the multi-dimensional count data within the data set based upon refinements made to the cluster definition matrices, a likely cluster membership comprising an indicator of membership of an element of the multi-dimensional count data in a respective cluster; and

    outputting the at least one likely cluster membership.
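
The steps recited in claim 1 can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the patented implementation: the claim does not name the two factorization techniques, so the code assumes a rank-R nonnegative CP (PARAFAC-style) model and uses multiplicative updates for least-squares loss as a stand-in "first technique" and multiplicative updates for Kullback-Leibler divergence as a stand-in "second technique". The function names (`build_count_tensor`, `hybrid_cluster`, and the helpers), the parameters `tol` and `max_iter`, and the row normalization used to read factor matrices as membership probabilities are all hypothetical choices made for this example.

```python
import numpy as np

def build_count_tensor(records, shape):
    """Accumulate (i, j, k) index triples into a dense 3-D count tensor."""
    X = np.zeros(shape)
    for i, j, k in records:
        X[i, j, k] += 1.0
    return X

def reconstruct(A, B, C):
    """Rank-R CP model: X_hat[i,j,k] = sum_r A[i,r] * B[j,r] * C[k,r]."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)

def euclidean_step(X, A, B, C, eps=1e-12):
    """Assumed first technique: multiplicative updates for least-squares loss."""
    A = A * np.einsum("ijk,jr,kr->ir", X, B, C) / (A @ ((B.T @ B) * (C.T @ C)) + eps)
    B = B * np.einsum("ijk,ir,kr->jr", X, A, C) / (B @ ((A.T @ A) * (C.T @ C)) + eps)
    C = C * np.einsum("ijk,ir,jr->kr", X, A, B) / (C @ ((A.T @ A) * (B.T @ B)) + eps)
    return A, B, C

def kl_step(X, A, B, C, eps=1e-12):
    """Assumed second technique: multiplicative updates for KL divergence."""
    ratio = X / (reconstruct(A, B, C) + eps)
    A = A * np.einsum("ijk,jr,kr->ir", ratio, B, C) / (B.sum(0) * C.sum(0) + eps)
    ratio = X / (reconstruct(A, B, C) + eps)
    B = B * np.einsum("ijk,ir,kr->jr", ratio, A, C) / (A.sum(0) * C.sum(0) + eps)
    ratio = X / (reconstruct(A, B, C) + eps)
    C = C * np.einsum("ijk,ir,jr->kr", ratio, A, B) / (A.sum(0) * B.sum(0) + eps)
    return A, B, C

def hybrid_cluster(X, rank, tol=1e-6, max_iter=200, seed=0, eps=1e-12):
    """Alternate the two update rules until the fit error stops changing."""
    rng = np.random.default_rng(seed)
    # Random initial cluster definition matrices, one per tensor dimension.
    A, B, C = (rng.uniform(0.1, 1.1, size=(n, rank)) for n in X.shape)
    prev = np.linalg.norm(X - reconstruct(A, B, C))
    for _ in range(max_iter):
        A, B, C = euclidean_step(X, A, B, C)   # first model: update
        A, B, C = kl_step(X, A, B, C)          # second model: refine
        err = np.linalg.norm(X - reconstruct(A, B, C))
        if abs(prev - err) <= tol * max(prev, eps):  # convergence criterion
            break
        prev = err
    # Row-normalize so each row reads as membership probabilities (assumption).
    probs = [M / np.maximum(M.sum(axis=1, keepdims=True), eps) for M in (A, B, C)]
    labels = [P.argmax(axis=1) for P in probs]  # most likely cluster per element
    return probs, labels

if __name__ == "__main__":
    # Toy data: two groups of indices that only co-occur within their own group.
    recs = [(i, j, k) for i in range(3) for j in range(3) for k in range(3)]
    recs += [(i, j, k) for i in range(3, 6) for j in range(3, 6) for k in range(3, 6)]
    probs, labels = hybrid_cluster(build_count_tensor(recs, (6, 6, 6)), rank=2)
    print(labels)
```

In a document setting, the three index positions might stand for, say, author, term, and publication year extracted by the parsing step; those dimension choices are likewise assumptions, since the claim only requires each dimension to be a class of metadata or text.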
