Document clustering method and apparatus based on common information of documents
First Claim
1. A computer implemented method of clustering documents, each clustering document having one or plural document segments in an input document, said method comprising steps:
- (a) obtaining a co-occurrence matrix for the input document by using a computer, the co-occurrence matrix is a matrix reflecting occurrence frequencies of terms and co-occurrence frequencies of term pairs, and obtaining an input document frequency matrix for a set of input documents based on occurrence frequencies of terms or term pairs appearing in the set of input documents wherein said step (a) further includes;
generating an input document segment vector for each input document segment of said input document segments based on occurrence frequencies of terms appearing in said each input document segment;
obtaining the co-occurrence matrix for the input document from input document segment vectors; and
obtaining the input document frequency matrix from the co-occurrence matrix for each document;
(b) selecting a seed document from a set of remaining documents that are not included in any cluster existing, and constructing a current cluster of an initial state based on the seed document, wherein said selecting and said constructing comprises;
constructing a remaining document common co-occurrence matrix for the set of the remaining documents based on a product of corresponding components of co-occurrence matrices of all documents in the set of remaining documents; and
obtaining a document commonality of each remaining document to the set of the remaining documents based on a product sum between every component of the co-occurrence matrix of each remaining document and the corresponding component of the remaining document common co-occurrence matrix;
extracting a document having highest document commonality to the set of the remaining documents; and
constructing initial cluster by including the seed document and neighbor documents similar to the seed document;
(c) making documents, which have document commonality to a current cluster higher than a threshold, belong temporarily to the current cluster;
wherein said making comprising;
constructing a current cluster common co-occurrence matrix for the current cluster and a current cluster document frequency matrix of the current cluster based on occurrence frequencies of terms or term pairs appearing in the documents of the current cluster;
obtaining a distinctiveness value of each term and each term pair for the current cluster by comparing the input document frequency matrix with the current cluster document frequency matrix;
obtaining weights of each term and each term pair from the distinctiveness values;
obtaining a document commonality to the current cluster for each document in a input document set based on a product sum between every component of the co-occurrence matrix of the input document and the corresponding component of the current cluster common co-occurrence matrix while applying the weights to said components; and
making the documents having document commonality to the current cluster higher than the threshold belong temporarily to the current cluster;
(d) repeating step (c) until number of documents temporarily belonging to the current cluster does not increase;
(e) repeating steps (b) through (d) until a given convergence condition is satisfied; and
(f) deciding, on a basis of the document commonality of each document to each cluster, a cluster to which each document belongs and outputting said cluster.
2 Assignments
0 Petitions
Accused Products
Abstract
In document (or pattern) clustering, the correct number of clusters and accurate assignment of each document (or pattern) to the correct cluster are attained. Documents (or patterns) describing the same topic (or object) are grouped, so a document (or pattern) group belonging to the same cluster has some commonality. Each topic (or object) has distinctive terms (or object features) or term (or object feature) pairs. When the closeness of each document (or pattern) to a given cluster is obtained, common information about the given cluster is extracted and used while the influence of terms (or object features) or term (or object feature) pairs not distinctive to the given cluster is excluded.
-
Citations
9 Claims
-
1. A computer implemented method of clustering documents, each clustering document having one or plural document segments in an input document, said method comprising steps:
-
(a) obtaining a co-occurrence matrix for the input document by using a computer, the co-occurrence matrix is a matrix reflecting occurrence frequencies of terms and co-occurrence frequencies of term pairs, and obtaining an input document frequency matrix for a set of input documents based on occurrence frequencies of terms or term pairs appearing in the set of input documents wherein said step (a) further includes; generating an input document segment vector for each input document segment of said input document segments based on occurrence frequencies of terms appearing in said each input document segment; obtaining the co-occurrence matrix for the input document from input document segment vectors; and obtaining the input document frequency matrix from the co-occurrence matrix for each document; (b) selecting a seed document from a set of remaining documents that are not included in any cluster existing, and constructing a current cluster of an initial state based on the seed document, wherein said selecting and said constructing comprises; constructing a remaining document common co-occurrence matrix for the set of the remaining documents based on a product of corresponding components of co-occurrence matrices of all documents in the set of remaining documents; and obtaining a document commonality of each remaining document to the set of the remaining documents based on a product sum between every component of the co-occurrence matrix of each remaining document and the corresponding component of the remaining document common co-occurrence matrix; extracting a document having highest document commonality to the set of the remaining documents; and constructing initial cluster by including the seed document and neighbor documents similar to the seed document; (c) making documents, which have document commonality to a current cluster higher than a threshold, belong temporarily to the current cluster;
wherein said making comprising;constructing a current cluster common co-occurrence matrix for the current cluster and a current cluster document frequency matrix of the current cluster based on occurrence frequencies of terms or term pairs appearing in the documents of the current cluster; obtaining a distinctiveness value of each term and each term pair for the current cluster by comparing the input document frequency matrix with the current cluster document frequency matrix; obtaining weights of each term and each term pair from the distinctiveness values; obtaining a document commonality to the current cluster for each document in a input document set based on a product sum between every component of the co-occurrence matrix of the input document and the corresponding component of the current cluster common co-occurrence matrix while applying the weights to said components; and making the documents having document commonality to the current cluster higher than the threshold belong temporarily to the current cluster; (d) repeating step (c) until number of documents temporarily belonging to the current cluster does not increase; (e) repeating steps (b) through (d) until a given convergence condition is satisfied; and (f) deciding, on a basis of the document commonality of each document to each cluster, a cluster to which each document belongs and outputting said cluster. - View Dependent Claims (2, 3, 4)
-
-
5. A computer implemented method of clustering documents each having one or plural document segments in an input document, said method comprising steps:
-
(a) using a computer to obtain a co-occurrence matrix for the input document, obtaining a co-occurrence matrix Sr for a input document Dr based on occurrence frequencies of terms or term pairs appearing in the set of input documents; wherein in step (a), each mn component Srmn of the co-occurrence matrix Sr of the document Dr is determined in accordance with; where; m and n denote mth and nth terms, respectively, among M terms appearing in the set of input documents, Dr is rth document in a document set D consisting of R documents; Yr is number of document segments in the document Dr, wherein said drym and dryn denote existence or absence of the mth and nth terms, respectively, in yth document segment of the document Dr, and Srmm represents number of document segments in which the mth term occurs and Srmm represents co-occurrence counts of document segments in which the mth and nth terms co-occur; (b) selecting a seed document from a set of remaining documents that are not included in any cluster existing, and constructing a current cluster of an initial state based on the seed document, wherein said selecting and said constructing comprise; constructing a remaining document common co-occurrence matrix TA for a set of the remaining documents based on co-occurrence matrices of all documents in the set of remaining documents; obtaining a document commonality of each remaining document to the set of the remaining documents based on the co-occurrence matrix Sr of each remaining document and the remaining document common co-occurrence matrix TA; extracting the document having a highest document commonality to the set of the remaining documents; and constructing a initial cluster by including the seed document and neighbor documents similar to the seed document; (c) making documents having document commonality higher than a threshold belong temporarily to the current cluster; (d) repeating step (c) until a number of documents temporarily belonging to the current cluster does not increase; (e) repeating steps (b) through (d) until a given convergence condition is satisfied; and (f) deciding, on basis of the document commonality of each document to each cluster, a cluster to which each document belongs and outputting said cluster. - View Dependent Claims (6, 7, 8, 9)
the matrix TA has an mn component determined by
TAmn=Tmn when Umn>
A,
TAmn=0 otherwise,where Umn represents an mn component of a document frequency matrix of the set of remaining documents wherein Umm denotes the number of remaining documents in which the mth term occurs and Umn denotes the number of remaining documents in which the mth and nth terms co-occur; and A denotes a predetermined threshold.
-
-
7. The method according to claim 6, further comprising:
-
determining a modified common co-occurrence matrix QA on the basis of TA; and
in step (b), obtaining the document commonality of each remaining document to the set of the remaining documents based on the co-occurrence matrix Sr of each remaining document and the modified common co-occurrence matrix QA;the matrix QA having an mn component determined by
QAmn=log TAmn when TAmn>
1,
QAmn=0 otherwise.
-
-
8. The method according to claim 7, wherein in step (b), the document commonality of each remaining document P having a co-occurrence matrix SP with respect to the set of remaining documents is given by
-
( D ′ , P ; Q A ) = ∑ m = 1 M ∑ n = 1 M Q mn A S mn P ∑ m = 1 M ∑ n = 1 M ( Q mn A ) 2 ∑ m = 1 M ∑ n = 1 M ( S mn P ) 2 .
-
-
9. The method according to claim 7, wherein in step (b), the document commonality of each remaining document P having a co-occurrence matrix SP with respect to the set of remaining documents is given by
-
( D ′ , P ; Q A ) = ∑ m = 1 M ∑ n = 1 M T mn A S mn P ∑ m = 1 M ∑ n = 1 M ( T mn A ) 2 ∑ m = 1 M ∑ n = 1 M ( S mn P ) 2 .
-
Specification