Method and apparatus for automatically generating hierarchical categories from large document collections
Abstract
A top-down clustering method and apparatus recursively processes clusters of documents by first extracting features from the documents in the cluster, then using the extracted features to generate sub-clusters, and finally using the generated sub-clusters to develop topics and identifiers for each sub-cluster. This process is repeated recursively for each cluster and sub-cluster, so that each round of sub-clustering uses features extracted from the documents in the cluster being subdivided. Feature extraction is performed by using frequency counts of terms taken from each document in a cluster and discarding terms that fall outside predetermined bounds computed from the total number of documents in the cluster. After bounding, the number of tokens is reduced prior to clustering by means of a correlation technique, such as principal component analysis (PCA).
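The bounding step described in the abstract can be illustrated with a minimal sketch. It is not taken from the patent: the function name `bound_tokens`, the use of document frequency as the "frequency count", and the choice of bounds as simple fractions of the cluster size (`lower_frac`, `upper_frac`) are all assumptions made for illustration. The abstract says only that the bounds are computed from the number of documents in the cluster.

```python
from collections import Counter


def bound_tokens(docs, lower_frac=0.05, upper_frac=0.5):
    """Return the tokens whose document frequency lies within the bounds.

    docs: list of token lists, one per document in the cluster.
    lower_frac, upper_frac: hypothetical fractions of the cluster size;
    the patent only requires bounds that depend on the document count.
    """
    n = len(docs)
    # Assumed frequency count: number of documents containing each token.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    lower, upper = lower_frac * n, upper_frac * n
    return {tok for tok, count in doc_freq.items() if lower <= count <= upper}


# Usage: "a" appears in every document and "c"/"d" in only one,
# so with these bounds only "b" survives.
sample = [["a", "b"], ["a", "c"], ["a", "b"], ["a", "d"]]
print(bound_tokens(sample, lower_frac=0.3, upper_frac=0.75))  # {'b'}
```

Because the bounds scale with the cluster size, the same token may be discarded at the top level and retained in a smaller sub-cluster, which is consistent with the abstract's point that features are re-extracted for each cluster.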
46 Claims
1. A method for automatically generating a cluster hierarchy from a large number of documents, the method comprising the steps of:
A. generating a set of unique tokens from the documents;
B. modeling each document in a cluster with one or more of the tokens;
C. extracting features from the modeled documents in the cluster;
D. clustering the documents using the extracted features so that the documents in the cluster are subdivided into further clusters; and
E. repeating steps B, C and D for each cluster generated in step D until a predetermined limit is reached.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A method for automatically generating a cluster hierarchy from a large number of documents, the method comprising the steps of:
A. generating a set of unique tokens from the documents;
B. preprocessing the set of unique tokens to remove tokens according to predetermined rules;
C. forming a token frequency count for each token used in the documents in a cluster and removing tokens whose frequency count falls outside of upper and lower bounds that are functions of the number of documents in the cluster;
D. modeling the documents in the cluster with the remaining tokens;
E. using a PCA analysis to extract features from the modeled documents;
F. clustering the extracted features so that documents in the cluster are apportioned to additional clusters; and
G. repeating steps C-F for each cluster generated in step F until a predetermined limit is reached.
Dependent claims: 15, 16, 17, 18, 19, 20, 21, 22, 23
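The recursive loop of steps C through G can be sketched as follows. This is an illustrative simplification under stated assumptions, not the claimed implementation: documents are modeled as binary token vectors, the PCA step is approximated by power iteration for the first principal direction only, and the clustering step is replaced by a two-way split on the sign of each document's projection. The names `first_component` and `split_cluster`, and the parameters `max_depth`, `min_size`, `lower_frac` and `upper_frac`, are hypothetical.

```python
from collections import Counter


def first_component(rows, iters=50):
    """Approximate the first principal direction of centered rows (power iteration)."""
    dim = len(rows[0])
    v = [1.0 / (i + 1) for i in range(dim)]  # deterministic, non-uniform start
    for _ in range(iters):
        # w = X^T (X v): the scatter matrix of the centered data applied to v.
        proj = [sum(r[j] * v[j] for j in range(dim)) for r in rows]
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            return None  # degenerate case: no direction to split along
        v = [x / norm for x in w]
    return v


def split_cluster(docs, depth=0, max_depth=3, min_size=4,
                  lower_frac=0.1, upper_frac=0.9):
    """Recursively subdivide docs (lists of tokens) into a nested dict."""
    node = {"size": len(docs), "docs": docs, "children": []}
    if depth >= max_depth or len(docs) < min_size:  # assumed "predetermined limit"
        return node
    n = len(docs)
    # Step C: frequency count per token, bounded by functions of cluster size.
    doc_freq = Counter(t for d in docs for t in set(d))
    vocab = sorted(t for t, c in doc_freq.items()
                   if lower_frac * n <= c <= upper_frac * n)
    if not vocab:
        return node
    # Step D: model each document with the remaining tokens (binary vector).
    rows = [[1.0 if t in d else 0.0 for t in vocab] for d in docs]
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(r, means)] for r in rows]
    # Step E (simplified): a single principal direction as the extracted feature.
    v = first_component(centered)
    if v is None:
        return node
    scores = [sum(a * b for a, b in zip(r, v)) for r in centered]
    # Step F (simplified): two-way split on the sign of the projection.
    left = [d for d, s in zip(docs, scores) if s < 0]
    right = [d for d, s in zip(docs, scores) if s >= 0]
    if not left or not right:
        return node
    # Step G: repeat for each generated cluster.
    node["children"] = [
        split_cluster(part, depth + 1, max_depth, min_size, lower_frac, upper_frac)
        for part in (left, right)
    ]
    return node
```

A practical system would likely keep several principal components and use a general clustering algorithm in step F; the sign split is used here only to keep the sketch short and dependency-free.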
24. Apparatus for automatically generating a cluster hierarchy from a large number of documents, the apparatus comprising:
means for generating a set of unique tokens from the documents;
means for modeling each document in a cluster with one or more of the tokens;
means for extracting features from the modeled documents in the cluster;
means for clustering the documents using the extracted features so that the documents in the cluster are subdivided into further clusters; and
a mechanism for controlling the modeling means, the extracting means and the clustering means to process each cluster generated by the clustering means until a predetermined limit is reached.
Dependent claims: 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36
37. A computer program product for automatically generating a cluster hierarchy from a large number of documents, the computer program product comprising a computer usable medium having computer readable program code thereon including:
program code for generating a set of unique tokens from the documents;
program code for preprocessing the set of unique tokens to remove tokens according to predetermined rules;
program code for forming a token frequency count for each token used in the documents in a cluster and removing tokens whose frequency count falls outside of upper and lower bounds that are functions of the number of documents in the cluster;
program code for modeling the documents in the cluster with the remaining tokens;
program code for using a PCA analysis to extract features from the modeled documents;
program code for clustering the extracted features so that documents in the cluster are apportioned to additional clusters; and
program code for controlling the forming program code, modeling program code, extraction program code, and clustering program code to process each cluster generated by the clustering program code until a predetermined limit is reached.
Dependent claims: 38, 39, 40, 41, 42, 43, 44, 45, 46
Specification