Document clustering with cluster refinement and model selection capabilities
First Claim
1. A method for clustering a plurality of documents into a specified number of clusters, comprising the steps of:
- (a) using a set of features to represent each document;
(b) generating a set of the specified number of document clusters from the plurality of documents using a Gaussian Mixture Model and an Expectation-Maximization algorithm and said set of features.
2 Assignments
0 Petitions
Accused Products
Abstract
A document partitioning (flat clustering) method clusters documents with high accuracy and accurately estimates the number of clusters in the document corpus (i.e. provides a model selection capability). To accurately cluster the given document corpus, a richer feature set is employed to represent each document, and the Gaussian Mixture Model (GMM) together with the Expectation-Maximization (EM) algorithm is used to conduct an initial document clustering. From this initial result, a set of discriminative features is identified for each cluster, and the initially obtained document clusters are refined by voting on the cluster label for each document using this discriminative feature set. This self refinement process of discriminative feature identification and cluster label voting is iteratively applied until the convergence of document clusters. Furthermore, a model selection capability is achieved by introducing randomness in the cluster initialization stage, and then discovering a value C for the number of clusters N by which running the document clustering process for a fixed number of times yields sufficiently similar results.
102 Citations
21 Claims
-
1. A method for clustering a plurality of documents into a specified number of clusters, comprising the steps of:
-
(a) using a set of features to represent each document;
(b) generating a set of the specified number of document clusters from the plurality of documents using a Gaussian Mixture Model and an Expectation-Maximization algorithm and said set of features. - View Dependent Claims (2, 3, 4, 5)
-
-
6. A method for clustering a plurality of documents into a specified number of clusters, comprising the steps of:
-
(a) using a set of features to represent each document; and
(b) generating the specified number of document clusters from the plurality of documents using any method of document clustering and said set of features;
wherein said set of features comprises at least two of the following;
a frequency of one or more names within the document, a frequency of one or more word pairs within the document, and a frequency of all unique terms within the document.
-
-
7. A method for clustering a plurality of documents into a specified number of clusters, comprising the steps of:
-
(a) using a set of features to represent each document; and
(b) generating the specified number of document clusters from the plurality of documents using any method of document clustering and said set of features;
wherein said set of features comprises a frequency of one or more names within the document, a frequency of one or more word pairs within the document, and a frequency of one or more unique terms within the document.
-
-
8. A method for refining a document clustering accuracy, comprising the steps of:
-
(a) obtaining a current set of a specified number of document clusters for a plurality of documents;
(b) determining a set of discriminative features from the current set of document clusters;
(c) refining the current set of document clusters using the set of discriminative features; and
(d) repeating steps (b) and (c) until a predetermined measure of the document clustering accuracy is achieved. - View Dependent Claims (9, 10, 11)
-
-
12. A method for refining a document clustering accuracy, comprising the steps of:
-
(a) obtaining a current set of a specified number of document clusters for a plurality of documents;
(b) determining a set of discriminative features from the current set of document clusters;
(c) performing a document clustering using the set of discriminative features to obtain a refined set of the specified number of document clusters;
(d) computing a change between the current set of document clusters and the refined set of document clusters, and when the change is below a predefined threshold, terminating the process, otherwise setting the refined set of document clusters as the current set of document clusters and returning to step (b). - View Dependent Claims (13)
-
-
14. A method for determining a number of clusters in an unknown data corpus, comprising the ordered steps of:
-
(a) obtaining from a user, an input range within which to guess a number of document clusters;
(b) guessing the number of document clusters is the lowest value of the input range;
(c) clustering the documents into the guessed number of document clusters;
(d) repeating step (c) with a different cluster initialization for a specified number of times;
(e) measuring a similarity between each pair of generated document cluster sets;
(f) when the guessed number of document clusters is less than the maximum value of the input range, incrementing the guessed number of document clusters by one, and returning to step (c); and
(g) when the guessed number of document clusters equals the maximum value of the input range, selecting the guessed number of document clusters that yielded the greatest measured similarity between generated document cluster sets. - View Dependent Claims (15, 16, 17, 18)
-
-
19. A computer program product for enabling a computer to cluster a plurality of documents into a specified number of clusters, comprising:
-
software instructions for enabling the computer to perform predetermined operations, and a computer readable medium bearing the software instructions;
wherein the predetermined operations include the steps of;
(a) using a set of features to represent each document; and
(b) generating a set of a specified number of document clusters using a Gaussian Mixture Model and an Expectation-Maximization algorithm and said set of features.
-
-
20. A computer program product for enabling a computer to refine a document clustering accuracy, comprising:
-
software instructions for enabling the computer to perform predetermined operations, and a computer readable medium bearing the software instructions;
wherein the predetermined operations include the steps of;
(a) obtaining a current set of a specified number of document clusters for a plurality of documents;
(b) determining a set of discriminative features from the current set of document clusters;
(c) refining the current set of document clusters using the set of discriminative features; and
(d) repeating steps (b) and (c) until a predetermined measure of the document clustering accuracy is achieved.
-
-
21. A computer program product for determining a number of clusters in an unknown data corpus, comprising:
-
software instructions for enabling the computer to perform predetermined operations, and a computer readable medium bearing the software instructions;
wherein the predetermined operations include the ordered steps of;
(a) obtaining from a user, an input range within which to guess the number of clusters;
(b) guessing the number of clusters is the lowest value of the input range;
(c) clustering the data corpus into a set of the guessed number of document clusters;
(d) repeating step (c) with a different cluster initialization for a specified number of times;
(e) measuring a similarity between each pair of generated document cluster sets;
(f) when the guessed number of document clusters is less than the maximum value of the input range, incrementing the guessed number of document clusters by one, and returning to step (c); and
(g) when the guessed number of document clusters equals the maximum value of the input range, selecting the guessed number of document clusters that yielded the greatest measured similarity between generated document cluster sets.
-
Specification