Method and apparatus for automatically generating hierarchical categories from large document collections

US 5,819,258 A
Filed: 03/07/1997
Issued: 10/06/1998
Est. Priority Date: 03/07/1997
Status: Expired due to Term

Abstract

A top-down clustering method and apparatus recursively processes clusters of documents by first extracting features from the documents comprising the cluster, then using the extracted features to generate sub-clusters and finally using the generated sub-clusters to develop topics and identifiers for each sub-cluster. This process is repeated for each cluster and sub-cluster in a recursive manner so that clustering is performed using features extracted from each document in a cluster to perform sub-clustering. Feature extraction is performed by using frequency counts of terms taken from each document in a cluster and discarding terms falling outside of predetermined boundaries computed based on the total number of documents in the cluster. After bounding, the number of tokens is reduced prior to clustering by means of a correlation technique, such as a PCA model.
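The recursion described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `build_hierarchy`, `halve` and `count_leaves`, the dictionary tree layout, and the `min_size` limit are all assumptions made for illustration. `halve` is only a placeholder for the model, feature-extraction and clustering steps.

```python
from typing import Callable, Dict, List

# Assumption: a cluster is represented as a plain list of document ids.
Cluster = List[str]


def build_hierarchy(
    docs: Cluster,
    split: Callable[[Cluster], List[Cluster]],
    min_size: int = 2,
) -> Dict:
    """Recursively subdivide `docs` until a cluster holds at most `min_size` documents."""
    node = {"docs": docs, "children": []}
    if len(docs) <= min_size:
        return node  # the predetermined limit (assumed here: cluster size) is reached
    parts = [p for p in split(docs) if p]
    if len(parts) < 2:
        return node  # the split made no progress, so stop to guarantee termination
    node["children"] = [build_hierarchy(p, split, min_size) for p in parts]
    return node


def halve(docs: Cluster) -> List[Cluster]:
    """Placeholder splitter (hypothetical): stands in for model -> extract -> cluster."""
    mid = len(docs) // 2
    return [docs[:mid], docs[mid:]]


def count_leaves(node: Dict) -> int:
    """Count the clusters that were not subdivided further."""
    if not node["children"]:
        return 1
    return sum(count_leaves(child) for child in node["children"])


tree = build_hierarchy([f"d{i}" for i in range(8)], halve, min_size=2)
print(count_leaves(tree))  # 4
```

Passing the splitter in as a function keeps the recursive control flow separate from whichever feature-extraction and clustering technique is plugged in.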

46 Claims

1. A method for automatically generating a cluster hierarchy from a large number of documents, the method comprising the steps of:
- A. generating a set of unique tokens from the documents;
  
  B. modeling each document in a cluster with one or more of the tokens;
  
  C. extracting features from the modeled documents in the cluster;
  
  D. clustering the documents using the extracted features so that the documents in the cluster are subdivided into further clusters; and
  
  E. repeating steps B, C and D for each cluster generated in step D until a predetermined limit is reached.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
- - 2. The method according to claim 1 wherein step A comprises the steps of:
    - A1. separating each document into tokens with a predetermined set of delimiters to generate a pool of tokens;
      
      A2. removing duplicates from the pool of tokens; and
      
      A3. preprocessing the pool of tokens to eliminate selected tokens which do not represent meaningful data.
  - 3. The method according to claim 1 wherein step B comprises the steps of:
    - B1. generating a token frequency count for tokens used in documents in the cluster;
      
      B2. eliminating tokens whose frequency count falls above a predetermined upper limit and below a predetermined lower limit; and
      
      B3. modeling the documents in the cluster using the remaining tokens.
  - 4. The method according to claim 3 wherein the predetermined upper limit is a function of a number of documents in the cluster.
  - 5. The method according to claim 3 wherein the predetermined lower limit is a function of a number of documents in the cluster.
  - 6. The method according to claim 1 wherein step C comprises the steps of:
    - C1. performing a PCA analysis on the modeled documents.
  - 7. The method according to claim 6 wherein step C1 comprises the steps of:
    - C1a. computing eigenvectors from a matrix derived from the modeled documents to form a matrix; and
      
      C1b. reducing the dimensions of the matrix.
  - 8. The method according to claim 1 wherein step D comprises the steps of:
    - D1. applying a clustering algorithm to the extracted features to generate one or more clusters each containing documents from the cluster; and
      
      D2. selecting topic tokens for each of the clusters determined in step D1 from the tokens associated with the each cluster.
  - 9. The method according to claim 8 wherein step D1 comprises the step of:
    - D1a. applying a clustering algorithm to the extracted features to generate one or more clusters each containing documents from the cluster.
  - 10. The method according to claim 1 wherein step E comprises the step of:
    - E1. repeating steps B, C and D until the number of documents in each cluster reaches or falls below a predetermined threshold.
  - 11. The method according to claim 1 wherein step E comprises the step of:
    - E2. repeating steps B, C and D until the total number of clusters exceeds a predetermined threshold.
  - 12. The method according to claim 1 further comprising the step of:
    - F. calculating nearest neighbors of each document in a cluster.
  - 13. The method according to claim 1 further comprising the step of:
    - G. calculating distances between clusters.
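As an illustration of the token-generation steps in claim 2 (A1 to A3), the following Python sketch splits documents on a delimiter set, removes duplicates, and filters the pool. The delimiter set, the stop-word list, the lower-casing, and the rule that tokens containing digits are not meaningful are assumptions chosen for the example; the claim itself does not fix them.

```python
import re
from typing import Iterable, List, Set

# Assumptions for illustration: the delimiter set and the stop-word list.
DELIMITERS = re.compile(r"[\s.,;:!?()]+")
STOP_WORDS = {"the", "a", "of", "and", "to"}


def split_tokens(text: str) -> List[str]:
    """A1: separate one document into tokens using the delimiter set."""
    return [t.lower() for t in DELIMITERS.split(text) if t]


def unique_tokens(documents: Iterable[str]) -> Set[str]:
    """A1-A3: build the token pool, drop duplicates, then filter it."""
    pool: List[str] = []
    for doc in documents:
        pool.extend(split_tokens(doc))
    unique = set(pool)  # A2: remove duplicates
    # A3 (assumed rule): drop stop words and tokens containing digits.
    return {
        t for t in unique
        if t not in STOP_WORDS and not any(ch.isdigit() for ch in t)
    }


docs = ["The cat sat.", "A cat and 2 dogs!", "Dogs chase the cat"]
vocab = unique_tokens(docs)
print(sorted(vocab))  # ['cat', 'chase', 'dogs', 'sat']
```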

14. A method for automatically generating a cluster hierarchy from a large number of documents, the method comprising the steps of:
- A. generating a set of unique tokens from the documents;
  
  B. preprocessing the set of unique tokens to remove tokens according to predetermined rules;
  
  C. forming a token frequency count for each token used in the documents in a cluster and removing tokens whose frequency count falls outside of upper and lower bounds that are functions of the number of documents in the cluster;
  
  D. modeling the documents in the cluster with the remaining tokens;
  
  E. using a PCA analysis to extract features from the modeled documents;
  
  F. clustering the extracted features so that documents in the cluster are apportioned to additional clusters; and
  
  G. repeating steps C-F for each cluster generated in step F until a predetermined limit is reached.
- Dependent Claims (15, 16, 17, 18, 19, 20, 21, 22, 23)
- - 15. The method according to claim 14 wherein step A comprises the step of:
    - A1. separating each document into tokens using a predetermined set of delimiters.
  - 16. The method according to claim 14 wherein step B comprises the step of:
    - B1. removing tokens with numerical characters and tokens in a predefined list of terms.
  - 17. The method according to claim 14 wherein step C comprises the step of:
    - C1. removing tokens whose frequency count is higher than an upper bound equal to the number of documents in the cluster divided by ten.
  - 18. The method according to claim 14 wherein step C comprises the step of:
    - C2. removing tokens whose frequency count is lower than a lower bound equal to the number of documents in the cluster divided by one hundred.
  - 19. The method according to claim 14 wherein step D comprises the steps of:
    - D1. forming a vector space model of each document in the cluster with the remaining tokens.
  - 20. The method according to claim 19 wherein step E comprises the steps of:
    - E1. computing eigenvectors from the sum-squared-of-products matrix, covariance matrix or correlation matrix of the modelled documents to form a matrix; and
      
      E2. reducing the dimensions of the matrix to generate extracted features.
  - 21. The method according to claim 14 wherein step F comprises the step of:
    - F1. applying a k-means clustering algorithm to the extracted features to generate one or more clusters each containing documents from the cluster.
  - 22. The method according to claim 14 further comprising the step of:
    - H. calculating nearest neighbors of each document in a cluster and all other documents in the cluster.
  - 23. The method according to claim 14 further comprising the steps of:
    - I. selecting topic tokens for each of cluster from the tokens associated with the each cluster;
      
      J. collecting all topic tokens into a list and eliminating duplicates;
      
      K. modeling each cluster using a predetermined number of the remaining topic tokens; and
      
      L. using a distance measure to identify related clusters.
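The frequency bounds of claims 17 and 18 and the vector space model of claim 19 can be sketched in Python as follows. Two points are assumptions rather than statements from the claims: the "frequency count" is taken to be the number of documents in the cluster that contain the token, and the vector entries are raw term counts. All function names are hypothetical.

```python
from collections import Counter
from typing import List


def document_frequency(docs: List[List[str]]) -> Counter:
    """Count, for each token, how many documents contain it (assumed meaning of 'frequency count')."""
    df: Counter = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def bounded_vocabulary(docs: List[List[str]]) -> List[str]:
    """Keep tokens whose count is neither above N/10 nor below N/100."""
    n = len(docs)
    upper, lower = n / 10, n / 100
    df = document_frequency(docs)
    return sorted(t for t, count in df.items() if lower <= count <= upper)


def vectorize(docs: List[List[str]], vocab: List[str]) -> List[List[int]]:
    """Model each document as a vector of term counts over the remaining tokens."""
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for tokens in docs:
        vec = [0] * len(vocab)
        for t in tokens:
            if t in index:
                vec[index[t]] += 1
        vectors.append(vec)
    return vectors


# Synthetic cluster of 200 documents, so upper = 20 and lower = 2.
docs = []
for i in range(200):
    tokens = ["common", f"id{i}"]  # "common" is in every document, f"id{i}" in only one
    if i < 5:
        tokens.append("alpha")     # in 5 documents: kept
    if i < 20:
        tokens.append("beta")      # in 20 documents: kept (not higher than 20)
    if i < 21:
        tokens.append("gamma")     # in 21 documents: removed
    docs.append(tokens)

vocab = bounded_vocabulary(docs)
vectors = vectorize(docs, vocab)
print(vocab)  # ['alpha', 'beta']
```

Because both bounds scale with the number of documents in the cluster, the surviving vocabulary changes at every level of the hierarchy.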

24. Apparatus for automatically generating a cluster hierarchy from a large number of documents, the apparatus comprising:
- means for generating a set of unique tokens from the documents;
  
  means for modeling each document in a cluster with one or more of the tokens;
  
  means for extracting features from the modeled documents in the cluster;
  
  means for clustering the documents using the extracted features so that the documents in the cluster are subdivided into further clusters; and
  
  a mechanism for controlling the modeling means, the extracting means and the clustering means to process each cluster generated by the clustering means until a predetermined limit is reached.
- Dependent Claims (25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36)
- - 25. The apparatus according to claim 24 wherein the token generating means comprises:
    - means for separating each document into tokens with a predetermined set of delimiters to generate a pool of tokens;
      
      means for removing duplicates from the pool of tokens; and
      
      means for preprocessing the pool of tokens to eliminate selected tokens which do not represent meaningful data.
  - 26. The apparatus according to claim 24 wherein the modeling means comprises:
    - means for generating a token frequency count for tokens used in documents in the cluster;
      
      means for eliminating tokens whose frequency count falls above a predetermined upper limit and below a predetermined lower limit; and
      
      means for modeling the documents in the cluster using the remaining tokens.
  - 27. The apparatus according to claim 26 wherein the predetermined upper limit is a function of a number of documents in the cluster.
  - 28. The apparatus according to claim 26 wherein the predetermined lower limit is a function of a number of documents in the cluster.
  - 29. The apparatus according to claim 24 wherein the extracting means comprises:
    - means for performing a PCA analysis on the modeled documents.
  - 30. The apparatus according to claim 29 wherein the performing means comprises:
    - means for computing eigenvectors from sum-squared-of-products matrix, covariance matrix or correlation matrix of the modeled documents to form a matrix; and
      
      means for reducing the dimensions of the matrix.
  - 31. The apparatus according to claim 24 wherein the clustering means comprises:
    - means for applying a clustering algorithm to the extracted features to generate one or more clusters each containing documents from the cluster; and
      
      means for selecting topic tokens for each of the clusters determined by the applying means from the tokens associated with the each cluster.
  - 32. The apparatus according to claim 31 wherein the applying means comprises:
    - means for applying a k-means clustering algorithm to the extracted features to generate one or more clusters each containing documents from the cluster.
  - 33. The apparatus according to claim 24 wherein the controlling mechanism comprises:
    - means for controlling the modeling means, the extracting means, and the clustering means to process clusters until the number of documents in each cluster reaches or falls below a predetermined threshold.
  - 34. The apparatus according to claim 24 wherein the controlling mechanism comprises:
    - means for controlling the modeling means, the extracting means, and the clustering means to process clusters until the total number of clusters exceeds a predetermined threshold.
  - 35. The apparatus according to claim 24 further comprising:
    - means for calculating nearest neighbors of each document in a cluster.
  - 36. The apparatus according to claim 24 further comprising:
    - means for calculating distances between clusters.
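The eigenvector computation and dimension reduction named in claims 29 and 30 can be illustrated with a small pure-Python sketch. It uses the covariance matrix (one of the three matrices the claim lists), finds only the leading eigenvector by power iteration, and keeps a single component. Those choices, the function names, and the toy data are assumptions for illustration, not the method of the patent.

```python
import math
from typing import List

Matrix = List[List[float]]


def column_means(rows: Matrix) -> List[float]:
    return [sum(col) / len(rows) for col in zip(*rows)]


def covariance(rows: Matrix) -> Matrix:
    """Sample covariance matrix of the document vectors (requires at least 2 rows)."""
    n = len(rows)
    means = column_means(rows)
    centered = [[x - m for x, m in zip(r, means)] for r in rows]
    d = len(means)
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_eigenvector(m: Matrix, iterations: int = 100) -> List[float]:
    """Leading eigenvector by power iteration.

    The start vector is an arbitrary assumption; the iteration fails if it
    happens to be orthogonal to the leading eigenvector.
    """
    d = len(m)
    v = [float(i + 1) for i in range(d)]
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def project(rows: Matrix, axis: List[float]) -> List[float]:
    """Reduce each centered document vector to one extracted feature."""
    means = column_means(rows)
    return [sum((x - m) * a for x, m, a in zip(r, means, axis)) for r in rows]


# Toy data: two perfectly correlated token counts and one constant count.
rows = [[1.0, 1.0, 5.0], [2.0, 2.0, 5.0], [3.0, 3.0, 5.0], [4.0, 4.0, 5.0]]
axis = top_eigenvector(covariance(rows))
projected = project(rows, axis)
```

In this toy case the two correlated token counts collapse into a single feature and the constant count contributes nothing, which is the kind of reduction the clustering step then operates on.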

37. A computer program product for automatically generating a cluster hierarchy from a large number of documents, the computer program product comprising a computer usable medium having computer readable program code thereon including:
- program code for generating a set of unique tokens from the documents;
  
  program code for preprocessing the set of unique tokens to remove tokens according to predetermined rules;
  
  program code for forming a token frequency count for each token used in the documents in a cluster and removing tokens whose frequency count falls outside of upper and lower bounds that are functions of the number of documents in the cluster;
  
  program code for modeling the documents in the cluster with the remaining tokens;
  
  program code for using a PCA analysis to extract features from the modeled documents;
  
  program code for clustering the extracted features so that documents in the cluster are apportioned to additional clusters; and
  
  program code for controlling the forming program code, modeling program code, extraction program code, and clustering program code to process each cluster generated by the clustering program code until a predetermined limit is reached.
- Dependent Claims (38, 39, 40, 41, 42, 43, 44, 45, 46)
- - 38. The computer program product according to claim 37 wherein the generating program code comprises:
    - program code for separating each document into tokens using a predetermined set of delimiters.
  - 39. The computer program product according to claim 37 wherein the preprocessing program code comprises:
    - program code for removing tokens with numerical characters and tokens in a predefined list of terms.
  - 40. The computer program product according to claim 37 wherein the forming program code comprises:
    - program code for removing tokens whose frequency count is higher than an upper bound equal to the number of documents in the cluster divided by ten.
  - 41. The computer program product according to claim 40 wherein the forming program code comprises:
    - program code for removing tokens whose frequency count is lower than a lower bound equal to the number of documents in the cluster divided by one hundred.
  - 42. The computer program product according to claim 37 wherein the modeling program code comprises:
    - program code for forming a vector space model of each document in the cluster with the remaining tokens.
  - 43. The computer program product according to claim 42 wherein the using program code comprises:
    - program code for computing eigenvectors from the vector space models to form a matrix; and
      
      program code for reducing the dimensions of the matrix to generate extracted features.
  - 44. The computer program product according to claim 37 wherein the clustering program code comprises:
    - program code for applying a k-means clustering algorithm to the extracted features to generate one or more clusters each containing documents from the cluster.
  - 45. The computer program product according to claim 37 further comprising:
    - program code for calculating nearest neighbors between each document in a cluster and all other documents in the cluster.
  - 46. The computer program product according to claim 37 further comprising:
    - program code for selecting topic tokens for each of cluster from the tokens associated with the each cluster;
      
      program code for collecting all topic tokens into a list and eliminating duplicates;
      
      program code for modeling each cluster using a predetermined number of the remaining topic tokens; and
      
      program code for using a distance measure to identify related clusters.
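The four steps of claim 46 (select topic tokens, merge and de-duplicate them, model each cluster, compare clusters by distance) can be sketched in Python. The claim does not say how topic tokens are chosen or which distance is used, so this example assumes the most frequent tokens per cluster and cosine distance. It also models each cluster with all de-duplicated topic tokens rather than a fixed number. Every name and the sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Docs = List[List[str]]


def topic_tokens(cluster_docs: Docs, k: int = 2) -> List[str]:
    """Assumed rule: the k most frequent tokens, ties broken alphabetically."""
    counts = Counter(t for doc in cluster_docs for t in doc)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]


def cluster_vectors(clusters: Dict[str, Docs], k: int = 2) -> Tuple[List[str], Dict[str, List[int]]]:
    """Collect topic tokens, eliminate duplicates, and model each cluster over them."""
    topics = sorted({t for docs in clusters.values() for t in topic_tokens(docs, k)})
    vectors = {}
    for name, docs in clusters.items():
        counts = Counter(t for doc in docs for t in doc)
        vectors[name] = [counts[t] for t in topics]
    return topics, vectors


def cosine_distance(a: List[int], b: List[int]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def most_related(name: str, vectors: Dict[str, List[int]]) -> str:
    """Return the other cluster with the smallest distance to `name`."""
    others = (o for o in vectors if o != name)
    return min(others, key=lambda o: cosine_distance(vectors[name], vectors[o]))


clusters = {
    "pets": [["cat", "dog", "food"], ["cat", "dog"], ["dog", "vet"]],
    "vets": [["vet", "dog", "clinic"], ["vet", "clinic"], ["vet", "dog"]],
    "cars": [["engine", "wheel"], ["engine", "oil"], ["engine", "wheel"]],
}
topics, vectors = cluster_vectors(clusters)
print(most_related("pets", vectors))  # vets
```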

Current Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Original Assignee: Digital Equipment Corporation (HP Inc.)
Inventors: Travis, Robert; Prakash, Mayank; Vaithyanathan, Shivakumar
Primary Examiner(s): Black, Thomas G.
Assistant Examiner(s): ALAM, SHAHID AL
Application Number: US08/847,734
Time in Patent Office: 578 Days
Field of Search: 707/5, 707/7, 707/8, 707/2
US Class Current: 707/692
CPC Class Codes:
G06F 16/355   Class or cluster creation o...
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99937   Sorting
Y10S 707/99938   Concurrency, e.g. lock mana...
