Method for categorizing documents by multilevel feature selection and hierarchical clustering based on parts of speech tagging

US 20030236659A1
Filed: 06/20/2002
Published: 12/25/2003
Est. Priority Date: 06/20/2002
Status: Active Grant

Abstract

A method for categorizing documents is disclosed. The words composing the documents are tagged according to their parts of speech. A first group of features is selected corresponding to one of the parts of speech. The documents are grouped into clusters according to their semantic affinity to the first set of features and to each other. The clusters are refined into a hierarchy of progressively refined clusters, the features of which are selected based on corresponding parts of speech.

21 Claims

1. A method for categorizing documents comprising:
- tagging parts of speech of words comprising said documents;
  
  selecting a first set of features based on a first one of said parts of speech;
  
  grouping said documents into clusters according to their semantic affinity to said first set of features and to each other; and
  
  refining said clusters into a hierarchy of progressively refined clusters wherein subsequent sets of features are selected based on corresponding said parts of speech.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9):
  - 2. The method as recited in claim 1 wherein said parts of speech are selected from the group consisting essentially of nouns, verbs, and adjectives.
  - 3. The method as recited in claim 1 wherein said refining comprises:
    - selecting a second group of features based on a second one of said parts of speech;
      
      grouping said clusters into refined clusters according to their semantic affinity to said second set of features and to each other; and
      
      determining a degree of semantic coherence of said hybrid clusters.
  - 4. The method as recited in claim 3 wherein said second one of said parts of speech comprises a part of speech that is different from said first one of said parts of speech.
  - 5. The method as recited in claim 3 further comprising further refining said refined clusters into a final set of clusters, wherein said final set of clusters comprise a defined category.
  - 6. The method as recited in claim 5 wherein said further refining comprises repeating said selecting, said grouping, and said determining recursively upon said refined clusters, wherein said selecting is based on progressively subsequent parts of speech in turn.
  - 7. The method as recited in claim 1 wherein said selecting comprises forming a feature space wherein said features comprise dimensions.
  - 8. The method as recited in claim 7 wherein said grouping further comprises:
    - transforming said documents into vectors according to the semantic weight of their vocabulary; and
      
      performing a clustering process upon said vectors.
  - 9. The method as recited in claim 8 wherein said clustering process is selected from the group consisting essentially of K-Means, K-Median, K-Harmonic Means, and Farthest Point.
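
As an illustration of claims 1 and 7 through 9, the following is a minimal sketch, not the patented implementation. It assumes the documents have already been tagged (the `DOCS` data and the `NOUN`/`VERB` tag names are hypothetical), treats the most frequent words of one part of speech as the dimensions of the feature space, uses raw counts as the vector components, and clusters with a simple farthest-point traversal, one of the processes named in claim 9. All function names are invented for this example.

```python
import math
from collections import Counter

# Hypothetical pre-tagged documents: each is a list of (word, tag) pairs.
DOCS = [
    [("printer", "NOUN"), ("jams", "VERB"), ("printer", "NOUN"), ("paper", "NOUN")],
    [("paper", "NOUN"), ("printer", "NOUN"), ("feeds", "VERB")],
    [("disk", "NOUN"), ("fails", "VERB"), ("disk", "NOUN"), ("server", "NOUN")],
    [("server", "NOUN"), ("disk", "NOUN"), ("reboots", "VERB")],
]


def select_features(docs, tag, top_n):
    """Pick the top_n most frequent words carrying `tag` as feature dimensions."""
    counts = Counter(w for doc in docs for w, t in doc if t == tag)
    # Sort by descending count, then alphabetically, so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


def vectorize(doc, features, tag):
    """Map a document to a vector of raw counts, one component per feature."""
    counts = Counter(w for w, t in doc if t == tag)
    return [float(counts[f]) for f in features]


def farthest_point_clusters(vectors, k):
    """Farthest-point traversal: seed with vector 0, then repeatedly add the
    vector farthest from its nearest existing center; label by nearest center."""
    centers = [0]
    while len(centers) < k:
        nxt = max(
            range(len(vectors)),
            key=lambda i: min(math.dist(vectors[i], vectors[c]) for c in centers),
        )
        centers.append(nxt)
    return [
        min(range(k), key=lambda j: math.dist(v, vectors[centers[j]]))
        for v in vectors
    ]


features = select_features(DOCS, "NOUN", top_n=4)
vectors = [vectorize(doc, features, "NOUN") for doc in DOCS]
labels = farthest_point_clusters(vectors, k=2)
print(labels)  # prints [0, 0, 1, 1]
```

With noun features only, the two printer documents and the two disk documents fall into separate first-level clusters. A later level could repeat the same steps with `"VERB"` as the tag, as claims 3 through 6 describe.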

10. A computer-implemented automated system for categorizing a collection of documents, wherein each document of said collection comprises a plurality of words, comprising:
- a preprocessor for tagging a word comprising a document according to its part of speech and producing a corresponding part of speech tagged document;
  
  a feature selector for generating a multidimensional feature space according to one of said parts of speech, forming a dimension of said feature space according to a semantic characteristic of said word tagged with said part of speech, and producing a corresponding part of speech specific feature;
  
  a vectorizer transforming each said document into a vector and populating said feature space with each said document according to a degree of semantic relation between each said document, and between each said document and said dimension; and
  
  a clusterizer for analyzing each said vector and generating a plurality of corresponding clusters wherein each cluster of said plurality categorizes said collection into a corresponding category, and determining the semantic sufficiency of each said corresponding category.
- Dependent claims (11, 12, 13, 14, 15):
  - 11. The system as recited in claim 10 wherein said preprocessor further comprises:
    - a document cleansing module for removing a part of said document that does not comprise a word;
      
      a part of speech tagging module for tagging a word comprising said document according to its part of speech; and
      
      a stop word removal module for filtering words having low semantic significance.
  - 12. The system as recited in claim 10 wherein said feature selector further comprises:
    - a queuing module for queuing each said document to be processed in said feature selector and queuing said cluster for refinement in said feature selector according to a part of speech selected in turn;
      
      a controlling module for keeping track of each said part of speech in turn;
      
      a token examination module for examining selectively each said document and each said cluster and identifying a corresponding part of speech thereof in turn to be processed accordingly, counting a frequency of occurrence of each said word, and promulgating a list of each said word tagged according to its part of speech and said frequency of occurrence thereof;
      
      a feature selection module for performing semantic analysis and choosing said feature according to a predetermined criterion and generating a corresponding part of speech specific feature; and
      
      a space generating module for forming said space specific to said part of speech selected in turn and defining a dimension of said space corresponding to said feature.
  - 13. The system as recited in claim 10 wherein said vectorizer generates said vector by assigning a weight to said document within said feature space wherein said weight is directly proportional to the frequency with which said word appears in said document and inversely proportional to the frequency with which said word appears in said collection of documents to which said document belongs.
  - 14. The system as recited in claim 10 wherein said clusterizer categorizes said collection by grouping each said document by a clustering process selected from the group consisting essentially of K-Means, K-Median, K-Harmonic Means, and Farthest Point.
  - 15. The system as recited in claim 10 wherein, upon determining that said category lacks semantic sufficiency, said feature selector, said vectorizer, and said clusterizer operate in concert to refine said cluster based upon a subsequent said part of speech.
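
The weighting recited in claim 13 resembles the familiar TF-IDF scheme. The sketch below is one possible reading, not the vectorizer the patent describes: it multiplies a word's count in a document by the logarithm of the inverse document frequency. The logarithm is a common convention and an assumption here; the claim only says "inversely proportional". The function name and sample data are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs, features):
    """Return one weight vector per document, one component per feature.

    weight = (count of word in document) * log(N / number of documents
    containing the word). The log damping is an assumption of this sketch.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document

    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append([
            term_freq[f] * math.log(n_docs / doc_freq[f]) if doc_freq[f] else 0.0
            for f in features
        ])
    return vectors


# Hypothetical documents, already reduced to words of one part of speech.
docs = [
    ["printer", "paper", "printer"],
    ["printer", "toner"],
    ["disk", "server"],
]
features = ["printer", "paper", "disk"]
vectors = tfidf_vectors(docs, features)
```

Here "printer" appears in two of three documents, so it receives a lower per-occurrence weight than "paper" or "disk", which each appear in only one.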

16. A computer-implemented method for categorizing a collection of documents comprised of words, comprising:
- cleansing each said document of said collection;
  
  tagging said words according to their parts of speech to transform each said document into a corresponding part of speech tagged document;
  
  removing stop words to transform each said part of speech tagged document into a part of speech tagged document that is free of stop words;
  
  selecting from each said part of speech tagged document a first plurality of features corresponding to a first said part of speech;
  
  forming a first feature space corresponding to said first plurality of features wherein each of said first plurality of features comprises a dimension;
  
  transforming each said document into a vector in said first feature space;
  
  clustering said first plurality of vectors into first level clusters;
  
  determining the sufficiency of semantic coherence of said first level clusters; and
  
  refining said first level clusters accordingly.
- Dependent claims (17, 18, 19, 20, 21):
  - 17. The method as recited in claim 16 wherein said refining comprises:
    - substituting each said first level cluster for said part of speech tagged documents corresponding to said first level clusters;
      
      selecting an Nth plurality of features according to an Nth part of speech wherein said Nth part of speech is different from said first part of speech;
      
      forming an Nth feature space corresponding to said Nth plurality of features, wherein each of said Nth plurality of features is a dimension;
      
      transforming said first cluster into an Nth plurality of vectors in said Nth feature space;
      
      clustering said Nth plurality of vectors into Nth clusters; and
      
      determining the sufficiency of semantic coherence of said Nth clusters.
  - 18. The method as recited in claim 17 further comprising recursively substituting subsequent clusters in the place of said part of speech tagged document and recursively repeating said selecting, said forming, said transforming, said clustering, said determining, and said refining until sufficient semantic coherence is achieved.
  - 19. The method as recited in claim 18 wherein said selecting further comprises:
    - performing semantic analysis to choose said feature according to a predetermined criterion; and
      
      generating a corresponding part of speech specific feature.
  - 20. The method as recited in claim 18 wherein said transforming further comprises assigning a weight to said document within said feature space wherein said weight is directly proportional to the frequency with which said word appears in said document and inversely proportional to the frequency with which said word appears in said collection of documents to which said document belongs.
  - 21. The method as recited in claim 18 wherein said clustering comprises a process selected from the group consisting essentially of K-Means, K-Median, K-Harmonic Means, and Farthest Point.
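
The recursive control flow of claims 16 through 18 can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed method: the grouping step is a stand-in that groups documents by their most frequent word of the current part of speech instead of clustering vectors, and the coherence test (`is_coherent`, a size threshold) is purely hypothetical, since the claims do not define how semantic coherence is measured. All names and data are invented for the example.

```python
from collections import Counter

# Hypothetical pre-tagged documents keyed by an identifier.
DOCS = {
    "a": [("printer", "NOUN"), ("jams", "VERB")],
    "b": [("printer", "NOUN"), ("jams", "VERB")],
    "c": [("printer", "NOUN"), ("prints", "VERB")],
    "d": [("disk", "NOUN"), ("fails", "VERB")],
}


def dominant_word(doc_id, tag):
    """Stand-in for vectorize-and-cluster: the document's most frequent
    word carrying `tag` (ties broken alphabetically), or None."""
    counts = Counter(w for w, t in DOCS[doc_id] if t == tag)
    if not counts:
        return None
    return min(counts, key=lambda w: (-counts[w], w))


def is_coherent(cluster, max_size=2):
    """Hypothetical coherence test: a small cluster counts as coherent."""
    return len(cluster) <= max_size


def refine(doc_ids, tags):
    """Split a cluster by the next part of speech until it is coherent
    or no parts of speech remain; returns a nested hierarchy."""
    if not tags or is_coherent(doc_ids):
        return doc_ids
    groups = {}
    for doc_id in doc_ids:
        groups.setdefault(dominant_word(doc_id, tags[0]), []).append(doc_id)
    # Each sub-cluster is refined with the remaining parts of speech.
    return {label: refine(members, tags[1:]) for label, members in groups.items()}


hierarchy = refine(list(DOCS), ["NOUN", "VERB"])
print(hierarchy)
```

In this toy run the noun level separates the printer documents from the disk document, and only the printer cluster, which fails the size-based test, is refined further at the verb level.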

Current Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Castellanos, Malu
Granted Patent: US 7,139,695 B2
US Class Current: 704/4
CPC Class Codes:
G06F 40/211 Syntactic parsing, e.g. bas...
Y10S 707/99942 Manipulating data structure...