Taxonomy generation for document collections

US 6,446,061 B1
Filed: 06/30/1999
Issued: 09/03/2002
Est. Priority Date: 07/31/1998
Status: Expired due to Term

Abstract

This invention relates to information mining within a multitude of documents stored on computer systems. More particularly, it relates to a computerized method of generating a content taxonomy of a multitude of electronic documents. The proposed technique improves the scalability, the coherence, and the selectivity of taxonomy generation at the same time. The fundamental approach comprises a subset selection step, in which a subset of the multitude of documents is selected. In a taxonomy generation step, a taxonomy is generated for the selected subset; the taxonomy is a tree-structured taxonomy hierarchy. The method further comprises a routing selection step that assigns each unprocessed document to the taxonomy hierarchy based on largest similarity.
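The subset selection and routing steps named in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word-count features, the cosine similarity measure, the summed-vector centroids, and all function names (`feature_vector`, `cosine`, `centroid`, `select_subset`, `route`) are hypothetical choices that the patent text shown here does not prescribe, and the clustering step that would produce the leaf clusters is left out.

```python
import math
import random
from collections import Counter


def feature_vector(text):
    """Represent a document by word counts (an assumed, simplistic feature choice)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Sum the member vectors; cosine similarity ignores the overall scale."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def select_subset(doc_ids, fraction, seed=0):
    """Subset selection step: draw a random sample of the collection."""
    rng = random.Random(seed)
    k = max(1, int(len(doc_ids) * fraction))
    return rng.sample(doc_ids, k)


def route(doc_vector, leaf_centroids):
    """Routing step: return the leaf whose centroid is most similar to the document."""
    return max(leaf_centroids, key=lambda leaf: cosine(doc_vector, leaf_centroids[leaf]))


# Toy leaf clusters built by hand; in the described method they would come
# from clustering the selected subset.
leaves = {
    "db": centroid([feature_vector("sql database index query"),
                    feature_vector("database transaction log")]),
    "net": centroid([feature_vector("tcp packet router"),
                     feature_vector("network router latency")]),
}
assigned = route(feature_vector("slow query on the database"), leaves)
```

Because only the subset is clustered, each remaining document costs one similarity computation per leaf cluster, which is where the scalability claim in the abstract would come from in a design like this.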

31 Claims

1. A computer-executable method of generating a content taxonomy of a multitude of documents (210) stored on a computer system, said method comprising:
- a subset-selection-step (201), for selecting a subset of said multitude of documents;
  
  a taxonomy-generation-step (202 to 205), for generating a taxonomy for said subset, wherein said taxonomy is a tree-structured taxonomy-hierarchy, and wherein said subset is divided into a set of clusters with largest intra-similarity, and wherein each of said clusters of largest intra-similarity is assigned to a leaf-node of said taxonomy-hierarchy as outer-clusters, and wherein inner-nodes of said taxonomy-hierarchy order said subset, starting with said outer-clusters, into inner-clusters with increasing cluster size and decreasing similarity, and wherein said taxonomy-generation-step further comprises a first-feature-extraction-step (202) for extracting for each document of said subset its features, and for computing its feature statistics in a feature-vector (212) as a representation of said document; and
  
  a routing-selection-step (206), for computing, for each unprocessed document of said multitude of documents not belonging to said subset, similarities with said outer-clusters, and for assigning said document to the leaf-node of said taxonomy-hierarchy being the outer-cluster with largest similarity.
- Dependent claims 2 to 15:
  - 2. The method of claim 1, wherein said taxonomy-generation-step further comprises a clustering-step (203) using a hierarchical clustering algorithm for generating said taxonomy-hierarchy, and using said feature-vectors for determining similarity.
  - 3. The method of claim 1, wherein said features are extracted based on lexical affinities within said documents.
  - 4. The method of claim 3, wherein said lexical affinities are extracted with a window of M words to identify co-occurring words.
  - 5. The method of claim 4, wherein M is a natural number with 1 < M ≤ 5.
  - 6. The method of claim 1, wherein said features are extracted based on linguistic features within said documents.
  - 7. The method of claim 1, wherein extracted features are selectively ignored based on statistical frequency extremes.
  - 8. The method of claim 1, wherein the depth of the taxonomy-hierarchy is limited to L levels by using a slicing technique to merge most similar clusters into one cluster until said taxonomy-hierarchy includes L levels.
  - 9. The method of claim 8, wherein L is a natural number from the range 1 ≤ L ≤ 12.
  - 10. The method of claim 1, wherein said taxonomy-generation-step further comprises a labeling-step (204) labeling each node in the taxonomy-hierarchy.
  - 11. The method of claim 10, wherein the N most frequent distinguishing features of a cluster of a node in the taxonomy-hierarchy are used as labels.
  - 12. The method of claim 11, wherein N is a natural number with 1 ≤ N ≤ 10.
  - 13. The method of claim 1, wherein said subset of said multitude of documents is determined by random selection.
  - 14. The method of claim 13, wherein the range of the document dates is divided into equally sized sub-ranges and said random selection is performed separately for documents with document dates from said sub-ranges.
  - 15. The method of claim 1, wherein said subset comprises up to 10% of said multitude of said documents.
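
Claims 3 to 5 and 7 describe feature extraction by lexical affinities inside a window of M words, followed by dropping features at the statistical frequency extremes. A minimal Python sketch follows. The unordered-pair representation, the document-frequency thresholds `min_docs` and `max_fraction`, and both function names are assumptions made for illustration; the claims do not fix them.

```python
from collections import Counter


def lexical_affinities(tokens, m=5):
    """Count unordered word pairs that co-occur within a window of m words."""
    if m < 2:
        raise ValueError("window size m must be at least 2")
    pairs = Counter()
    for i, left in enumerate(tokens):
        # The window starting at position i covers tokens i .. i+m-1.
        for right in tokens[i + 1 : i + m]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


def drop_frequency_extremes(doc_features, min_docs=2, max_fraction=0.5):
    """Ignore features found in too few or too many documents (assumed thresholds)."""
    doc_freq = Counter()
    for features in doc_features:
        doc_freq.update(set(features))
    limit = max_fraction * len(doc_features)
    keep = {f for f, n in doc_freq.items() if min_docs <= n <= limit}
    return [Counter({f: c for f, c in features.items() if f in keep})
            for features in doc_features]


pairs = lexical_affinities("data mining of text data".split(), m=3)
```

Sorting each pair makes the affinity order-independent, so "data ... of" and "of ... data" count as the same feature; an order-sensitive variant would simply skip the `sorted` call.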

16. A computer program product comprising a computer usable medium having computer readable program code means embodied in said medium for generating a content taxonomy of a multitude of documents stored on a computer system, said computer readable program code means comprising:
- a subset selector for selecting a subset of said multitude of documents;
  
  a taxonomy generator for generating a taxonomy for said subset, wherein said taxonomy generator further comprises a first-feature-extractor for extracting for each document of said subset its features, and for computing its feature statistics in a feature-vector as a representation of said document; and
  
  a routing selector for computing, for each unprocessed document of said multitude of documents not belonging to said subset, similarities with said outer-clusters and for assigning said document to the leaf-node of said taxonomy-hierarchy being the outer-cluster with largest similarity, wherein said taxonomy is a tree-structured taxonomy-hierarchy, and wherein said subset is divided into a set of clusters with largest intra-similarity, and wherein each of said clusters of largest intra-similarity is assigned to a leaf-node of said taxonomy-hierarchy as outer-clusters, and wherein inner-nodes of said taxonomy-hierarchy order said subset, starting with said outer-clusters, into inner-clusters with increasing cluster size and decreasing similarity.

17. A system for generating a content taxonomy of a multitude of documents stored on a computer system, said system comprising:
- means for selecting a subset of said multitude of documents;
  
  means for generating a taxonomy for said subset, wherein said taxonomy is a tree-structured taxonomy-hierarchy, and wherein said subset is divided into a set of clusters with largest intra-similarity, and wherein each of said clusters of largest intra-similarity is assigned to a leaf-node of said taxonomy-hierarchy as outer-clusters, and wherein inner-nodes of said taxonomy-hierarchy order said subset, starting with said outer-clusters, into inner-clusters with increasing cluster size and decreasing similarity, and wherein said means for generating further comprises a first-feature-extractor means for extracting for each document of said subset its features, and for computing its feature statistics in a feature-vector as a representation of said document; and
  
  means for computing, for each unprocessed document of said multitude of documents not belonging to said subset, similarities with said outer-clusters, and for assigning said document to the leaf-node of said taxonomy-hierarchy being the outer-cluster with largest similarity.
- Dependent claims 18 to 31:
  - 18. The system of claim 17, wherein said means for generating further comprises a clustering-tool using a hierarchical clustering algorithm for generating said taxonomy-hierarchy, and using said feature-vectors for determining similarity.
  - 19. The system of claim 17, wherein said features are extracted based on lexical affinities within said documents.
  - 20. The system of claim 19, wherein said lexical affinities are extracted with a window of M words to identify co-occurring words.
  - 21. The system of claim 20, wherein M is a natural number with 1 < M ≤ 5.
  - 22. The system of claim 17, wherein said features are extracted based on linguistic features within said documents.
  - 23. The system of claim 17, wherein extracted features are selectively ignored based on statistical frequency extremes.
  - 24. The system of claim 17, wherein the depth of the taxonomy-hierarchy is limited to L levels by using a slicing technique to merge most similar clusters into one cluster until said taxonomy-hierarchy includes L levels.
  - 25. The system of claim 24, wherein L is a natural number from the range 1 < L ≤ 12.
  - 26. The system of claim 17, wherein said taxonomy-generation-tool further comprises a labeling-tool labeling each node in the taxonomy-hierarchy.
  - 27. The system of claim 26, wherein the N most frequent distinguishing features of a cluster of a node in the taxonomy-hierarchy are used as labels.
  - 28. The system of claim 27, wherein N is a natural number with 1 ≤ N ≤ 10.
  - 29. The system of claim 17, wherein said subset of said multitude of documents is determined by random selection.
  - 30. The system of claim 29, wherein the range of the document dates is divided into equally sized sub-ranges and said random selection is performed separately for documents with document dates from said sub-ranges.
  - 31. The system of claim 17, wherein said subset comprises up to 10% of said multitude of said documents.
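
Claims 29 to 31 (like claims 13 to 15) describe random subset selection performed separately within equally sized sub-ranges of the document dates, with the subset limited to 10% of the collection. One way this could look in Python is sketched below. The function name `stratified_sample`, the `buckets` parameter, the `(doc_id, date)` input shape, and the choice to round each sub-range's quota down are assumptions, not details taken from the patent.

```python
import random
from datetime import date


def stratified_sample(docs, buckets=4, fraction=0.10, seed=0):
    """Sample doc ids separately from equally sized date sub-ranges.

    docs is a list of (doc_id, date) pairs. Each sub-range contributes at
    most `fraction` of its own documents, so the total stays within
    `fraction` of the whole collection.
    """
    rng = random.Random(seed)
    ordinals = [d.toordinal() for _, d in docs]
    start, end = min(ordinals), max(ordinals)
    width = (end - start + 1) / buckets  # length of one sub-range in days
    groups = [[] for _ in range(buckets)]
    for doc_id, d in docs:
        index = min(int((d.toordinal() - start) / width), buckets - 1)
        groups[index].append(doc_id)
    sample = []
    for group in groups:
        k = int(len(group) * fraction)  # rounds down, may be 0 for small groups
        sample.extend(rng.sample(group, k))
    return sample


# 400 synthetic documents, one per day starting 1998-01-01.
first_day = date(1998, 1, 1).toordinal()
docs = [(i, date.fromordinal(first_day + i)) for i in range(400)]
sample = stratified_sample(docs)
```

Sampling per sub-range keeps sparsely populated periods represented in proportion to their size, which a single draw over the whole collection would only achieve on average.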

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Gerstl, Peter; Doerre, Jochen; Seiffert, Roland; Mueller, Adrian; Goeser, Sebastian
Primary Examiner(s)
Mizrahi, Diane D.
Assistant Examiner(s)
Mofiz, Apu M.

Application Number

US09/345,260
Time in Patent Office

1,161 Days
Field of Search

707/101, 707/6, 707/3, 705/10, 706/50
US Class Current

707/738
CPC Class Codes

G06F 16/355   Class or cluster creation o...

Y10S 707/914   Video

Y10S 707/915   Image

Y10S 707/916   Audio

Y10S 707/917   Text

Y10S 707/99933   Query processing, i.e. sear...
