Method for categorizing documents by multilevel feature selection and hierarchical clustering based on parts of speech tagging
Abstract
A method for categorizing documents is disclosed. The words composing the documents are tagged according to their parts of speech. A first group of features is selected corresponding to one of the parts of speech. The documents are grouped into clusters according to their semantic affinity to the first set of features and to each other. The clusters are refined into a hierarchy of progressively refined clusters, the features of which are selected based on corresponding parts of speech.
21 Claims
1. A method for categorizing documents comprising:
tagging parts of speech of words comprising said documents;
selecting a first set of features based on a first one of said parts of speech;
grouping said documents into clusters according to their semantic affinity to said first set of features and to each other; and
refining said clusters into a hierarchy of progressively refined clusters wherein subsequent sets of features are selected based on corresponding said parts of speech.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
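The steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the lexicon `TOY_LEXICON`, the tag names `NOUN` and `VERB`, and every function name are hypothetical, and "semantic affinity" is approximated here by documents simply sharing their most frequent word of a given part of speech.

```python
from collections import Counter, defaultdict

# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
TOY_LEXICON = {
    "bank": "NOUN", "loan": "NOUN", "river": "NOUN", "fish": "NOUN",
    "approve": "VERB", "reject": "VERB", "swim": "VERB", "flood": "VERB",
}


def tag(document):
    """Return (word, tag) pairs; words missing from the lexicon get 'OTHER'."""
    return [(w, TOY_LEXICON.get(w, "OTHER")) for w in document.lower().split()]


def dominant_feature(tagged, pos):
    """Most frequent word carrying the given tag, or None if there is none."""
    counts = Counter(w for w, t in tagged if t == pos)
    return counts.most_common(1)[0][0] if counts else None


def cluster_by_pos(doc_ids, tagged_docs, pos):
    """Group documents that share the same dominant feature for one tag."""
    clusters = defaultdict(list)
    for doc_id in doc_ids:
        clusters[dominant_feature(tagged_docs[doc_id], pos)].append(doc_id)
    return dict(clusters)


def build_hierarchy(doc_ids, tagged_docs, pos_order):
    """Cluster on the first tag, then refine each cluster on the next tag."""
    if not pos_order:
        return sorted(doc_ids)
    level = cluster_by_pos(doc_ids, tagged_docs, pos_order[0])
    return {
        feature: build_hierarchy(members, tagged_docs, pos_order[1:])
        for feature, members in level.items()
    }


docs = {
    "d1": "bank approve loan bank",
    "d2": "bank reject loan bank",
    "d3": "river flood river",
    "d4": "fish swim river river",
}
tagged_docs = {doc_id: tag(text) for doc_id, text in docs.items()}
hierarchy = build_hierarchy(list(docs), tagged_docs, ["NOUN", "VERB"])
print(hierarchy)
```

In this toy run the noun level separates the banking documents from the river documents, and the verb level then splits each of those clusters further, which mirrors the "progressively refined" hierarchy in the claim.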
10. A computer-implemented automated system for categorizing a collection of documents, wherein each document of said collection comprises a plurality of words, comprising:
a preprocessor for tagging a word comprising a document according to its part of speech and producing a corresponding part of speech tagged document;
a feature selector for generating a multidimensional feature space according to one of said parts of speech, forming a dimension of said feature space according to a semantic characteristic of said word tagged with said part of speech, and producing a corresponding part of speech specific feature;
a vectorizer transforming each said document into a vector and populating said feature space with each said document according to a degree of semantic relation between each said document, and between each said document and said dimension; and
a clusterizer for analyzing each said vector and generating a plurality of corresponding clusters wherein each cluster of said plurality categorizes said collection into a corresponding category, and determining the semantic sufficiency of each said corresponding category.
Dependent claims: 11, 12, 13, 14, 15
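The feature selector, vectorizer and clusterizer of claim 10 might be sketched as follows. This is an illustration under assumptions, not the claimed system: documents arrive already tagged, each dimension is simply a distinct word with the chosen tag, the degree of relation is cosine similarity over term counts, and the clustering is a simple leader (single-pass threshold) scheme chosen for brevity. All names and the `threshold` value are hypothetical.

```python
import math


def feature_space(tagged_docs, pos):
    """One dimension per distinct word carrying the chosen tag."""
    return sorted({w for doc in tagged_docs for w, t in doc if t == pos})


def vectorize(tagged_doc, dims, pos):
    """Count how often each dimension's word occurs with the chosen tag."""
    return [sum(1 for w, t in tagged_doc if t == pos and w == d) for d in dims]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def leader_cluster(vectors, threshold=0.5):
    """Put each vector in the first cluster whose leader is similar enough."""
    clusters = []  # each cluster is a list of indices; index 0 is the leader
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vectors[members[0]], vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no leader was close enough: start a cluster
    return clusters


tagged = [
    [("bank", "NOUN"), ("approve", "VERB"), ("loan", "NOUN")],
    [("loan", "NOUN"), ("bank", "NOUN"), ("reject", "VERB")],
    [("river", "NOUN"), ("flood", "VERB")],
    [("fish", "NOUN"), ("swim", "VERB"), ("river", "NOUN")],
]
dims = feature_space(tagged, "NOUN")
vectors = [vectorize(doc, dims, "NOUN") for doc in tagged]
clusters = leader_cluster(vectors)
print(dims, clusters)
```

Leader clustering depends on the order of the documents and on the threshold, so a production system would more likely use an order-independent method; it is used here only because it fits in a few lines.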
16. A computer-implemented method for categorizing a collection of documents comprised of words, comprising:
cleansing each said document of said collection;
tagging said words according to their parts of speech to transform each said document into a corresponding part of speech tagged document;
removing stop words to transform each said part of speech tagged document into a part of speech tagged document that is free of stop words;
selecting from each said part of speech tagged document a first plurality of features corresponding to a first said part of speech;
forming a first feature space corresponding to said first plurality of features wherein each of said first plurality of features comprises a dimension;
transforming each said document into a vector in said first feature space;
clustering said first plurality of vectors into first level clusters;
determining the sufficiency of semantic coherence of said first level clusters; and
refining said first level clusters accordingly.
Dependent claims: 17, 18, 19, 20, 21
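Two steps of claim 16 that the claim text leaves abstract, cleansing with stop-word removal and the coherence check, could look like the sketch below. The claim does not define how "sufficiency of semantic coherence" is measured, so mean pairwise cosine similarity against a fixed threshold is used here purely as an assumption. The stop-word list, the `threshold` value and all function names are hypothetical, and the tagging step is omitted.

```python
import math
import re
from itertools import combinations

STOP_WORDS = {"the", "a", "of", "and", "to"}  # hypothetical, deliberately tiny


def cleanse(text):
    """Strip markup-like tags and punctuation, lowercase, split into words."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(words):
    """Drop every word that appears in the stop-word list."""
    return [w for w in words if w not in STOP_WORDS]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def coherence(vectors):
    """Mean pairwise cosine similarity; a one-document cluster scores 1.0."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def needs_refinement(vectors, threshold=0.6):
    """Flag a cluster whose coherence falls below the assumed threshold."""
    return coherence(vectors) < threshold


words = remove_stop_words(cleanse("<p>The Bank approved the loan.</p>"))
tight_cluster = [[1, 0], [1, 0]]
loose_cluster = [[1, 0], [0, 1]]
print(words, needs_refinement(tight_cluster), needs_refinement(loose_cluster))
```

A cluster flagged by `needs_refinement` would then be re-clustered in a feature space built from a different part of speech, which is the "refining accordingly" step of the claim.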
Specification