Document and pattern clustering method and apparatus
Abstract
In document (or pattern) clustering, the correct number of clusters and accurate assignment of each document (or pattern) to the correct cluster are attained. Documents (or patterns) describing the same topic (or object) are grouped, so a document (or pattern) group belonging to the same cluster has some commonality. Each topic (or object) has distinctive terms (or object features) or term (or object feature) pairs. When the closeness of each document (or pattern) to a given cluster is obtained, common information about the given cluster is extracted and used while the influence of terms (or object features) or term (or object feature) pairs not distinctive to the given cluster is excluded.
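As an illustration only, the idea of term pairs common to a cluster can be sketched in Python. The sketch assumes each document is a list of segment strings, builds a co-occurrence matrix per document by counting the segments in which both terms appear, and takes the element-wise minimum across documents as the common co-occurrence matrix. The function names and the minimum rule are assumptions made for this example; the abstract does not define them.

```python
def cooccurrence(segments, vocab):
    """Count, for each term pair, the segments of one document containing both terms."""
    index = {term: i for i, term in enumerate(vocab)}
    n = len(vocab)
    matrix = [[0] * n for _ in range(n)]
    for segment in segments:
        # Binary presence: a term counts once per segment.
        present = sorted({index[t] for t in segment.split() if t in index})
        for i in present:
            for j in present:
                matrix[i][j] += 1
    return matrix


def common_cooccurrence(matrices):
    """Assumed rule: keep a term pair only to the extent every document has it."""
    n = len(matrices[0])
    return [[min(m[i][j] for m in matrices) for j in range(n)] for i in range(n)]


# Hypothetical two-document cluster over a three-term vocabulary.
vocab = ["engine", "road", "wheel"]
doc_a = ["engine wheel", "engine road"]
doc_b = ["engine wheel", "wheel"]
common = common_cooccurrence([cooccurrence(doc_a, vocab), cooccurrence(doc_b, vocab)])
print(common)  # [[1, 0, 1], [0, 0, 0], [1, 0, 1]]
```

In this example "road" appears in only one of the two documents, so every entry involving it is zero in the common matrix, which is one way the influence of a term not distinctive to the cluster could be excluded.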
29 Claims
1. A method of clustering documents (or patterns) each having one or plural document (or pattern) segments in an input document (or pattern) set, based on a relation among them, comprising:
(a) obtaining a document (or pattern) frequency matrix for the set of input documents (or patterns), based on occurrence frequencies of terms appearing in each document (or pattern);
(b) selecting a seed document (or pattern) from remaining documents (or patterns) that are not included in any cluster existing at that moment and constructing a current cluster of the initial state using the seed document (or pattern);
(c) obtaining the document (or pattern) commonality to the current cluster for each document (or pattern) in the input document (or pattern) set by using information based on the document (or pattern) frequency matrix for the input document (or pattern) set, information based on the document (or pattern) frequency matrix for documents (or patterns) in the current cluster and information based on the common co-occurrence matrix of the current cluster, and making documents (or patterns) having the document commonality higher than a threshold belong temporarily to the current cluster;
(d) repeating step (c) until the number of documents (or patterns) temporarily belonging to the current cluster becomes the same as that in the previous repetition;
(e) repeating steps (b) through (d) until a given convergence condition is satisfied; and
(f) deciding, on the basis of the document (or pattern) commonality of each document (or pattern) to each cluster, a cluster to which each document (or pattern) belongs.
Dependent claims: 2 through 28.
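Steps (a) through (f) can be sketched in Python as follows. This is a simplified illustration, not the claimed method: cosine similarity to the cluster's mean term-frequency vector stands in for the document commonality (no co-occurrence information is used), the seed is the lowest-index remaining document, and the convergence condition of step (e) is assumed to be that no document remains outside every cluster. The function names, the threshold value and the iteration cap are hypothetical.

```python
from collections import Counter
from math import sqrt


def term_frequency_matrix(docs):
    """Step (a): one row of term counts per document over a shared vocabulary."""
    vocab = sorted({t for d in docs for t in d.split()})
    rows = []
    for d in docs:
        counts = Counter(d.split())
        rows.append([counts[t] for t in vocab])
    return vocab, rows


def cosine(u, v):
    """Stand-in for the document commonality (an assumption of this sketch)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(rows, members):
    """Mean term-frequency vector of the documents in a cluster."""
    return [sum(rows[i][j] for i in members) / len(members)
            for j in range(len(rows[0]))]


def cluster(docs, threshold=0.3, max_inner=20):
    _, rows = term_frequency_matrix(docs)
    clusters = []
    unassigned = set(range(len(docs)))
    while unassigned:                       # step (e), assumed stopping rule
        seed = min(unassigned)              # step (b), assumed seed choice
        members = {seed}
        for _ in range(max_inner):          # step (d), capped for safety
            center = centroid(rows, sorted(members))
            # Step (c): every input document is tested, not only unassigned ones.
            candidates = {i for i in range(len(docs))
                          if cosine(rows[i], center) > threshold} | {seed}
            unchanged = len(candidates) == len(members)
            members = candidates
            if unchanged:
                break
        clusters.append(members)
        unassigned -= members
    # Step (f): final assignment to the cluster with the highest score.
    centers = [centroid(rows, sorted(m)) for m in clusters]
    return [max(range(len(centers)), key=lambda k: cosine(rows[i], centers[k]))
            for i in range(len(docs))]


docs = ["apple banana apple", "banana apple fruit",
        "car engine wheel", "engine car road"]
labels = cluster(docs)
print(labels)  # [0, 0, 1, 1]
```

Because step (c) scores every document in the input set, a document may temporarily belong to more than one cluster; the sketch resolves this only in step (f).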
29. A clustering apparatus for clustering documents (or patterns) each having one or plural document (or pattern) segments in an input document (or pattern) set based on the relation among them, the apparatus comprising:
(a) means for obtaining a document (or pattern) frequency matrix for the set of input documents (or patterns), based on occurrence frequencies of terms appearing in each document (or pattern);
(b) means for selecting a seed document (or pattern) from remaining documents (or patterns) that are not included in any cluster existing at that moment and constructing a current cluster of the initial state using the seed document (or pattern);
(c) means for obtaining the document (or pattern) commonality to the current cluster for each document (or pattern) in the input document (or pattern) set using information based on the document (or pattern) frequency matrix for the input document (or pattern) set, information based on the document (or pattern) frequency matrix for documents (or patterns) in the current cluster and information based on the common co-occurrence matrix of the current cluster and means for making documents (or patterns) having the document (or pattern) commonality higher than a threshold belong temporarily to the current cluster;
(d) means for repeating the operations of means (c) until the number of documents (or patterns) temporarily belonging to the current cluster becomes the same as that in the previous repetition;
(e) means for repeating the operations of means (b) through (d) until given convergence conditions are satisfied; and
(f) means for deciding, on the basis of the document (or pattern) commonality of each document (or pattern) to each cluster, a cluster to which each document (or pattern) belongs.
Specification