Method for categorizing documents by multilevel feature selection and hierarchical clustering based on parts of speech tagging

US 7,139,695 B2
Filed: 06/20/2002
Issued: 11/21/2006
Est. Priority Date: 06/20/2002
Status: Expired due to Fees

First Claim

1. A method for categorizing documents comprising:

tagging words of said documents with tags indicating respective parts of speech of the words and producing corresponding part of speech tagged documents;

selecting, by a feature selector, a first set of features based on the tagged words for a first one of said parts of speech;

generating, by the feature selector, a multidimensional feature space that has plural dimensions corresponding to the first set of features;

transforming, by a vectorizer, each of the documents into a vector and populating the feature space with each of the documents according to a degree of semantic relation between each document and the features corresponding to the dimensions of the feature space;

grouping, by a clusterizer, said documents into clusters based on analyzing each of the vectors, wherein each of the clusters corresponds to a respective category; and

determining, by the clusterizer, a semantic sufficiency of each of the categories; and

in response to determining that the categories lack semantic sufficiency, refining, with the feature selector, vectorizer, and clusterizer, said clusters using at least another set of features for at least another one of said parts of speech different from the first one of the parts of speech.

3 Assignments

0 Petitions

Abstract

A method for categorizing documents is disclosed. The words composing the documents are tagged according to their parts of speech. A first group of features is selected corresponding to one of the parts of speech. The documents are grouped into clusters according to their semantic affinity to the first set of features and to each other. The clusters are refined into a hierarchy of progressively refined clusters, the features of which are selected based on corresponding parts of speech.

39 Citations

15 Claims

1. A method for categorizing documents comprising:
- tagging words of said documents with tags indicating respective parts of speech of the words and producing corresponding part of speech tagged documents;
  
  selecting, by a feature selector, a first set of features based on the tagged words for a first one of said parts of speech;
  
  generating, by the feature selector, a multidimensional feature space that has plural dimensions corresponding to the first set of features;
  
  transforming, by a vectorizer, each of the documents into a vector and populating the feature space with each of the documents according to a degree of semantic relation between each document and the features corresponding to the dimensions of the feature space;
  
  grouping, by a clusterizer, said documents into clusters based on analyzing each of the vectors, wherein each of the clusters corresponds to a respective category; and
  
  determining, by the clusterizer, a semantic sufficiency of each of the categories; and
  
  in response to determining that the categories lack semantic sufficiency, refining, with the feature selector, vectorizer, and clusterizer, said clusters using at least another set of features for at least another one of said parts of speech different from the first one of the parts of speech.
- Dependent claims (2, 3, 4, 5, 6)
  - 2. The method as recited in claim 1 wherein said parts of speech are selected from the group consisting of nouns, verbs, and adjectives.
  - 3. The method as recited in claim 1 wherein said refining comprises:
    - selecting the at least another set of features for the at least another one of said parts of speech;
      
      grouping said clusters into refined clusters according to their semantic affinity to said another set of features and to each other, and
      
      determining a degree of semantic coherence of said refined clusters.
  - 4. The method as recited in claim 3 further comprising further refining said refined clusters into a final set of clusters, wherein said final set of clusters comprise corresponding categories.
  - 5. The method as recited in claim 4 wherein said further refining comprises repeating said selecting, said grouping, and said determining recursively upon said refined clusters, wherein said selecting is based on progressively subsequent parts of speech in turn.
  - 6. The method as recited in claim 1 wherein grouping said documents into clusters uses a clustering process selected from the group consisting of K-Means, K-Median, K-Harmonic Means, and Farthest Point.
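
Claim 6 names "Farthest Point" as one of the permitted clustering processes without defining it. The following is a minimal sketch of one common reading, the farthest-first heuristic: choose an arbitrary first center, repeatedly promote the vector farthest from all existing centers, and assign every vector to its nearest center. The function name `farthest_point_clusters`, the use of Euclidean distance, and the choice of the first vector as the initial center are assumptions made for illustration and are not taken from the patent.

```python
import math


def farthest_point_clusters(vectors, k):
    """Cluster vectors with the farthest-point (farthest-first) heuristic.

    Returns (centers, assignment): the indices of the vectors chosen as
    centers, and for each vector the position of its nearest center.
    """
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")

    centers = [0]  # assumption: the first vector seeds the first cluster
    # Distance from every vector to its nearest center chosen so far.
    nearest = [math.dist(v, vectors[0]) for v in vectors]
    assignment = [0] * len(vectors)

    while len(centers) < k:
        # The next center is the vector farthest from all current centers.
        new_center = max(range(len(vectors)), key=nearest.__getitem__)
        centers.append(new_center)
        for i, v in enumerate(vectors):
            d = math.dist(v, vectors[new_center])
            if d < nearest[i]:
                nearest[i] = d
                assignment[i] = len(centers) - 1
    return centers, assignment


# Two well-separated pairs of document vectors.
centers, assignment = farthest_point_clusters(
    [(0, 0), (0, 1), (10, 10), (10, 11)], k=2
)
```

In this usage the two pairs of nearby vectors fall into separate clusters. Unlike K-Means, the heuristic makes a single pass per center and never moves a center once it is chosen.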

7. A computer-implemented automated system for categorizing a collection of documents, wherein each document of said collection comprises a plurality of words, comprising:
- a preprocessor for tagging words of each of the documents according to parts of speech of the words and producing corresponding part of speech tagged documents;
  
  a feature selector for generating a multidimensional feature space according to a first one of said parts of speech, wherein the feature space has plural dimensions corresponding to features selected from the words tagged according to the first part of speech;
  
  a vectorizer for transforming each said document into a vector and populating said feature space with each said document according to a degree of semantic relation between each said document and the features corresponding to the dimensions of the feature space; and
  
  a clusterizer for analyzing each said vector and generating a plurality of corresponding clusters, wherein each of the clusters corresponds to a respective category, and the clusterizer for determining a semantic sufficiency of each said corresponding category,
  
  wherein said feature selector further comprises:
  
  a queuing module for queuing each said document to be processed in said feature selector and queuing the clusters for refinement in said feature selector according to a part of speech selected in turn;
  
  a controlling module for keeping track of each said part of speech in turn;
  
  a token examination module for examining selectively each said document and each said cluster and identifying a corresponding part of speech thereof in turn to be processed accordingly, counting a frequency of occurrence of each said word, and promulgating a list of each said word tagged according to its part of speech and said frequency of occurrence thereof;
  
  a feature selection module for performing semantic analysis and choosing said feature according to a predetermined criterion and generating corresponding part of speech specific features; and
  
  a space generating module for forming said feature space specific to said part of speech selected in turn and defining dimensions of said space corresponding to said features.
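
The token examination and feature selection modules of claim 7 can be illustrated with a short sketch. It assumes documents arrive already tagged as lists of `(word, tag)` pairs, that tags are strings such as `"NOUN"` and `"VERB"`, and that the "predetermined criterion" is a simple minimum frequency. None of these details are specified in the claim; the function names `examine_tokens` and `select_features` are hypothetical.

```python
from collections import Counter


def examine_tokens(tagged_docs, pos):
    """List (word, frequency) pairs for one part of speech, most frequent first."""
    counts = Counter(
        word.lower()
        for doc in tagged_docs
        for word, tag in doc
        if tag == pos
    )
    # Sort by descending frequency, then alphabetically, for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def select_features(frequency_list, min_count=2):
    """Keep words that meet a frequency threshold.

    The threshold is a stand-in (assumption) for the claim's
    'predetermined criterion'.
    """
    return [word for word, count in frequency_list if count >= min_count]


tagged_docs = [
    [("Printer", "NOUN"), ("jams", "VERB"), ("paper", "NOUN")],
    [("printer", "NOUN"), ("prints", "VERB")],
]
noun_frequencies = examine_tokens(tagged_docs, "NOUN")
noun_features = select_features(noun_frequencies)
```

Each selected word would then become one dimension of the feature space for that part of speech.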

8. A computer-implemented automated system for categorizing a collection of documents, wherein each document of said collection comprises a plurality of words, comprising:
- a preprocessor for tagging words of each of the documents according to parts of speech of the words and producing corresponding part of speech tagged documents;
  
  a feature selector for generating a multidimensional feature space according to a first one of said parts of speech, wherein the feature space has plural dimensions corresponding to features selected from the words tagged according to the first part of speech;
  
  a vectorizer for transforming each said document into a vector and populating said feature space with each said document according to a degree of semantic relation between each said document and the features corresponding to the dimensions of the feature space; and
  
  a clusterizer for analyzing each said vector and generating a plurality of corresponding clusters, wherein each of the clusters corresponds to a respective category, and the clusterizer for determining a semantic sufficiency of each said corresponding category,
  
  upon determining that said categories lack semantic sufficiency, said feature selector, said vectorizer, and said clusterizer operate in concert to refine said clusters based upon a subsequent said part of speech.
- Dependent claims (9, 10, 11)
  - 9. The system as recited in claim 8, wherein said preprocessor further comprises:
    - a document cleansing module for removing a part of each document that does not comprise a word;
      
      a part of speech tagging module for tagging each word according to its part of speech; and
      
      a stop word removal module for filtering words having low semantic significance.
  - 10. The system as recited in claim 8, wherein said vectorizer further assigns weights to the vectors within said feature space, wherein said weights are directly proportional to frequencies with which words appear in corresponding documents and inversely proportional to frequencies with which words appear in said collection of documents to which said document belongs.
  - 11. The system as recited in claim 8, wherein said clusterizer categorizes said collection by grouping each said document by a clustering process selected from the group consisting of K-Means, K-Median, K-Harmonic Means, and Farthest Point.
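
The weighting described in claim 10 matches the general shape of TF-IDF weighting. The claim states only the two proportionalities, not a formula, so the sketch below assumes the widely used form `tf * log(N / df)`, where `tf` is the count of a word in a document, `df` is the number of documents containing it, and `N` is the collection size. The function name `tfidf_vectors` and the representation of documents as word lists are illustrative assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs, features):
    """Map each document (a list of words) to a TF-IDF vector over `features`."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each feature word.
    df = {f: sum(1 for doc in docs if f in doc) for f in features}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        vectors.append([
            tf[f] * math.log(n_docs / df[f]) if df[f] else 0.0
            for f in features
        ])
    return vectors


docs = [
    ["printer", "printer", "paper"],
    ["printer", "ink"],
    ["ink", "ink"],
]
vectors = tfidf_vectors(docs, ["printer", "paper", "ink"])
```

A word that appears in every document receives weight zero under this form, since `log(N / N)` is zero, which reflects the claim's intent that collection-wide words contribute little to distinguishing documents.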

12. A method provided by execution by a computer of instructions on a computer-readable medium, the method for categorizing a collection of documents containing words and comprising:
- tagging said words according to their parts of speech to transform each said document into a corresponding part of speech tagged document;
  
  removing stop words to transform each said part of speech tagged document into a part of speech tagged document that is free of stop words;
  
  selecting from said part of speech tagged documents a first plurality of features corresponding to a first said part of speech;
  
  forming a first feature space corresponding to said first plurality of features, wherein each of said first plurality of features corresponds to a dimension of the first feature space;
  
  transforming each said document into a vector in said first feature space;
  
  clustering said vectors in said first feature space into first level clusters;
  
  determining a sufficiency of semantic coherence of said first level clusters; and
  
  refining said first level clusters in response to determining the semantic coherence of the first level of clusters is insufficient,
  
  wherein said refining comprises:
  
  selecting an Nth plurality of features according to an Nth part of speech wherein said Nth part of speech is different from said first part of speech;
  
  forming an Nth feature space corresponding to said Nth plurality of features, wherein each of said Nth plurality of features is a dimension of the Nth feature space;
  
  transforming said first level of clusters into an Nth plurality of vectors in said Nth feature space;
  
  clustering said Nth plurality of vectors into Nth clusters; and
  
  determining the sufficiency of semantic coherence of said Nth clusters.
- Dependent claims (13, 14)
  - 13. The method as recited in claim 12 further comprising recursively repeating said selecting, said forming, said transforming, said clustering, said determining, and said refining until sufficient semantic coherence is achieved.
  - 14. The method as recited in claim 12, wherein said clustering comprises a process selected from the group consisting of K-Means, K-Median, K-Harmonic Means, and Farthest Point.
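
The recursive refinement of claims 12 and 13 can be sketched as a function that clusters on one part of speech, tests each resulting cluster for coherence, and re-clusters any insufficient cluster on the next part of speech. The sketch is a simplification under stated assumptions: the clustering step and the coherence test are passed in as functions, and the two toy stand-ins shown (`group_by_top_word`, `small_enough`) are hypothetical placeholders, not the feature selection, vectorization, or semantic coherence measures of the patent.

```python
from collections import Counter, defaultdict


def categorize(docs, pos_order, cluster_fn, is_coherent):
    """Cluster on the first part of speech; refine incoherent clusters on the next."""
    if not pos_order:
        raise ValueError("pos_order must name at least one part of speech")
    pos, remaining = pos_order[0], pos_order[1:]
    hierarchy = []
    for cluster in cluster_fn(docs, pos):
        if is_coherent(cluster) or not remaining:
            hierarchy.append(cluster)  # leaf: a final category
        else:
            hierarchy.append(categorize(cluster, remaining, cluster_fn, is_coherent))
    return {"pos": pos, "clusters": hierarchy}


# Toy stand-ins (assumptions for illustration only).
def group_by_top_word(docs, pos):
    """Group documents by their most frequent word carrying the tag `pos`."""
    groups = defaultdict(list)
    for doc in docs:
        words = [w for w, tag in doc["tagged"] if tag == pos]
        key = Counter(words).most_common(1)[0][0] if words else None
        groups[key].append(doc)
    return list(groups.values())


def small_enough(cluster, limit=2):
    """Treat a cluster as coherent once it holds at most `limit` documents."""
    return len(cluster) <= limit


docs = [
    {"id": "a", "tagged": [("printer", "NOUN"), ("jam", "VERB")]},
    {"id": "b", "tagged": [("printer", "NOUN"), ("jam", "VERB")]},
    {"id": "c", "tagged": [("printer", "NOUN"), ("install", "VERB")]},
    {"id": "d", "tagged": [("scanner", "NOUN"), ("install", "VERB")]},
]
tree = categorize(docs, ["NOUN", "VERB"], group_by_top_word, small_enough)
```

In this usage the noun level separates the printer documents from the scanner document, and only the oversized printer cluster is refined at the verb level, giving a two-level hierarchy.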

15. A method provided by execution by a computer of instructions on a computer-readable medium, the method for categorizing a collection of documents containing words and comprising:
- tagging said words according to their parts of speech to transform each said document into a corresponding part of speech tagged document;
  
  removing stop words to transform each said part of speech tagged document into a part of speech tagged document that is free of stop words;
  
  selecting from said part of speech tagged documents a first plurality of features corresponding to a first said part of speech;
  
  forming a first feature space corresponding to said first plurality of features, wherein each of said first plurality of features corresponds to a dimension of the first feature space;
  
  transforming each said document into a vector in said first feature space;
  
  clustering said vectors in said first feature space into first level clusters;
  
  determining a sufficiency of semantic coherence of said first level clusters; and
  
  refining said first level clusters in response to determining the semantic coherence of the first level of clusters is insufficient,
  
  wherein said selecting further comprises:
  
  performing semantic analysis to choose said first plurality of features according to a predetermined criterion; and
  
  assigning weights to said vectors within said feature space, wherein said weights are directly proportional to frequencies with which words appear in respective documents and inversely proportional to frequencies with which said words appear in said collection of documents to which said document belongs.
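
The stop word removal step recited in claims 12 and 15 operates on documents that are already tagged. A minimal sketch follows, assuming each document is a list of `(word, tag)` pairs. The stop word list shown is a small illustrative subset and the function name `remove_stop_words` is hypothetical; the claims do not specify which words count as stop words.

```python
# Illustrative subset only (assumption); a real list would be much longer.
STOP_WORDS = frozenset({"the", "a", "an", "of", "is", "and", "to", "in"})


def remove_stop_words(tagged_doc, stop_words=STOP_WORDS):
    """Drop (word, tag) pairs whose word has low semantic significance."""
    return [
        (word, tag)
        for word, tag in tagged_doc
        if word.lower() not in stop_words
    ]


tagged = [("The", "DET"), ("tray", "NOUN"), ("is", "VERB"), ("jammed", "VERB")]
filtered = remove_stop_words(tagged)
```

Because the tags are preserved on the remaining words, the filtered document can be passed directly to part-of-speech-specific feature selection.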

Current Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Castellanos, Malu
Primary Examiner(s): Dorvil, Richemond
Assistant Examiner(s): Shortledge, Thomas E.
Application Number: US10/177,892
Publication Number: US 20030236659A1
Time in Patent Office: 1,615 Days
Field of Search: 704/1-10, 704/249, 707/2, 707/5, 707/7, 707/101, 707/204, 707/205
US Class Current: 704/4
CPC Class Codes:
G06F 40/211 Syntactic parsing, e.g. bas...
Y10S 707/99942 Manipulating data structure...
