Document clustering method and apparatus based on common information of documents

  • US 7,499,923 B2
  • Filed: 03/04/2004
  • Issued: 03/03/2009
  • Est. Priority Date: 03/05/2003
  • Status: Expired due to Fees
First Claim

1. A computer implemented method of clustering documents, each clustering document having one or plural document segments in an input document, said method comprising steps:

  • (a) obtaining a co-occurrence matrix for the input document by using a computer, the co-occurrence matrix is a matrix reflecting occurrence frequencies of terms and co-occurrence frequencies of term pairs, and obtaining an input document frequency matrix for a set of input documents based on occurrence frequencies of terms or term pairs appearing in the set of input documents wherein said step (a) further includes;

    generating an input document segment vector for each input document segment of said input document segments based on occurrence frequencies of terms appearing in said each input document segment;

    obtaining the co-occurrence matrix for the input document from input document segment vectors; and

    obtaining the input document frequency matrix from the co-occurrence matrix for each document;

  • (b) selecting a seed document from a set of remaining documents that are not included in any cluster existing, and constructing a current cluster of an initial state based on the seed document, wherein said selecting and said constructing comprises;

    constructing a remaining document common co-occurrence matrix for the set of the remaining documents based on a product of corresponding components of co-occurrence matrices of all documents in the set of remaining documents; and

    obtaining a document commonality of each remaining document to the set of the remaining documents based on a product sum between every component of the co-occurrence matrix of each remaining document and the corresponding component of the remaining document common co-occurrence matrix;

    extracting a document having highest document commonality to the set of the remaining documents; and

    constructing initial cluster by including the seed document and neighbor documents similar to the seed document;

  • (c) making documents, which have document commonality to a current cluster higher than a threshold, belong temporarily to the current cluster;

    wherein said making comprising;

    constructing a current cluster common co-occurrence matrix for the current cluster and a current cluster document frequency matrix of the current cluster based on occurrence frequencies of terms or term pairs appearing in the documents of the current cluster;

    obtaining a distinctiveness value of each term and each term pair for the current cluster by comparing the input document frequency matrix with the current cluster document frequency matrix;

    obtaining weights of each term and each term pair from the distinctiveness values;

    obtaining a document commonality to the current cluster for each document in an input document set based on a product sum between every component of the co-occurrence matrix of the input document and the corresponding component of the current cluster common co-occurrence matrix while applying the weights to said components; and

    making the documents having document commonality to the current cluster higher than the threshold belong temporarily to the current cluster;

  • (d) repeating step (c) until number of documents temporarily belonging to the current cluster does not increase;

  • (e) repeating steps (b) through (d) until a given convergence condition is satisfied; and

  • (f) deciding, on a basis of the document commonality of each document to each cluster, a cluster to which each document belongs and outputting said cluster.
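
The core computations recited in steps (a) through (c) can be illustrated with a minimal Python sketch. It is not the patented implementation. Several details are assumptions made only for illustration: segment vectors are treated as binary, the co-occurrence matrix is taken as the sum of outer products of those vectors, the distinctiveness weight is a simple ratio of cluster document frequency to input document frequency, and the initial cluster and threshold are picked by hand. All function names, the toy vocabulary, and the toy documents are hypothetical.

```python
from math import prod


def cooccurrence_matrix(segments, vocab):
    """Step (a): diagonal counts segments containing a term, off-diagonal
    counts segments containing both terms (assumed binary segment vectors)."""
    index = {term: i for i, term in enumerate(vocab)}
    n = len(vocab)
    matrix = [[0] * n for _ in range(n)]
    for segment in segments:
        present = {index[t] for t in segment if t in index}
        for i in present:
            for j in present:
                matrix[i][j] += 1
    return matrix


def document_frequency_matrix(matrices):
    """Number of documents in which each term or term pair occurs."""
    n = len(matrices[0])
    return [[sum(1 for m in matrices if m[i][j] > 0) for j in range(n)]
            for i in range(n)]


def common_cooccurrence_matrix(matrices):
    """Product of corresponding components over all documents in a set."""
    n = len(matrices[0])
    return [[prod(m[i][j] for m in matrices) for j in range(n)]
            for i in range(n)]


def commonality(matrix, common, weights=None):
    """Product sum between a document matrix and a common matrix,
    optionally weighted per component (unnormalized, for illustration)."""
    n = len(matrix)
    total = 0.0
    for i in range(n):
        for j in range(n):
            w = 1.0 if weights is None else weights[i][j]
            total += w * matrix[i][j] * common[i][j]
    return total


def distinctiveness_weights(cluster_df, input_df):
    """Assumed weighting: cluster document frequency over input document
    frequency, so components concentrated in the cluster weigh more."""
    n = len(input_df)
    return [[cluster_df[i][j] / input_df[i][j] if input_df[i][j] else 0.0
             for j in range(n)] for i in range(n)]


# Toy input: each document is a list of segments, each segment a list of terms.
vocab = ["cluster", "document", "matrix"]
docs = [
    [["cluster", "document"], ["document", "matrix"], ["document"]],
    [["document", "matrix"], ["document"]],
    [["cluster", "document"]],
]

# Step (a): per-document co-occurrence matrices and input document frequencies.
mats = [cooccurrence_matrix(d, vocab) for d in docs]
input_df = document_frequency_matrix(mats)

# Step (b): the seed is the document with the highest commonality to the set.
common_all = common_cooccurrence_matrix(mats)
seed = max(range(len(mats)), key=lambda r: commonality(mats[r], common_all))

# Step (c), one pass: the initial cluster and threshold are chosen by hand here.
cluster = [0, 1]
cluster_mats = [mats[r] for r in cluster]
cluster_common = common_cooccurrence_matrix(cluster_mats)
weights = distinctiveness_weights(document_frequency_matrix(cluster_mats), input_df)
scores = [commonality(m, cluster_common, weights) for m in mats]
threshold = 5.0
members = [r for r, s in enumerate(scores) if s > threshold]
print(seed, members)
```

Steps (d) and (e) would wrap the last part in loops: recompute the cluster matrices from `members` until the membership stops growing, then pick a new seed from the documents not yet assigned. Step (f) would compare each document's commonality across all clusters found.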
