Method for categorizing documents by multilevel feature selection and hierarchical clustering based on parts of speech tagging
First Claim
1. A method for categorizing documents comprising:
tagging words of said documents with tags indicating respective parts of speech of the words and producing corresponding part of speech tagged documents;
selecting, by a feature selector, a first set of features based on the tagged words for a first one of said parts of speech;
generating, by the feature selector, a multidimensional feature space that has plural dimensions corresponding to the first set of features;
transforming, by a vectorizer, each of the documents into a vector and populating the feature space with each of the documents according to a degree of semantic relation between each document and the features corresponding to the dimensions of the feature space;
grouping, by a clusterizer, said documents into clusters based on analyzing each of the vectors, wherein each of the clusters corresponds to a respective category; and
determining, by the clusterizer, a semantic sufficiency of each of the categories; and
in response to determining that the categories lack semantic sufficiency, refining, with the feature selector, vectorizer, and clusterizer, said clusters using at least another set of features for at least another one of said parts of speech different from the first one of the parts of speech.
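The feature selection and vectorization steps recited above can be illustrated with a minimal sketch. It is not the patented implementation: the toy input, the tag names (`NOUN`, `VERB`, `ADV`), the helper names `select_features` and `vectorize`, and the `top_k` cutoff are all assumptions made for illustration, and raw word counts stand in for the "degree of semantic relation", which the claim does not define.

```python
from collections import Counter

# Hypothetical toy input: each document is a list of (word, tag) pairs,
# such as a part-of-speech tagger might produce. Tag names are assumptions.
tagged_docs = [
    [("stocks", "NOUN"), ("fell", "VERB"), ("sharply", "ADV"), ("stocks", "NOUN")],
    [("team", "NOUN"), ("won", "VERB"), ("match", "NOUN")],
    [("stocks", "NOUN"), ("rose", "VERB"), ("market", "NOUN")],
]

def select_features(docs, pos, top_k):
    """Pick the top_k most frequent words carrying the given tag."""
    counts = Counter(w for doc in docs for w, t in doc if t == pos)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top_k]]

def vectorize(doc, features, pos):
    """Map one document to a count vector with one dimension per feature."""
    counts = Counter(w for w, t in doc if t == pos)
    return [counts[f] for f in features]

# One dimension per selected noun; each document becomes a point in that space.
features = select_features(tagged_docs, "NOUN", top_k=3)
vectors = [vectorize(d, features, "NOUN") for d in tagged_docs]
```

Each vector has one coordinate per selected feature, so a clustering step could then compare documents by distance or angle in that space.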
Abstract
A method for categorizing documents is disclosed. The words composing the documents are tagged according to their parts of speech. A first group of features is selected corresponding to one of the parts of speech. The documents are grouped into clusters according to their semantic affinity to the first set of features and to each other. The clusters are refined into a hierarchy of progressively refined clusters, the features of which are selected based on corresponding parts of speech.
39 Citations
15 Claims
1. A method for categorizing documents comprising:
tagging words of said documents with tags indicating respective parts of speech of the words and producing corresponding part of speech tagged documents;
selecting, by a feature selector, a first set of features based on the tagged words for a first one of said parts of speech;
generating, by the feature selector, a multidimensional feature space that has plural dimensions corresponding to the first set of features;
transforming, by a vectorizer, each of the documents into a vector and populating the feature space with each of the documents according to a degree of semantic relation between each document and the features corresponding to the dimensions of the feature space;
grouping, by a clusterizer, said documents into clusters based on analyzing each of the vectors, wherein each of the clusters corresponds to a respective category; and
determining, by the clusterizer, a semantic sufficiency of each of the categories; and
in response to determining that the categories lack semantic sufficiency, refining, with the feature selector, vectorizer, and clusterizer, said clusters using at least another set of features for at least another one of said parts of speech different from the first one of the parts of speech.
Dependent claims: 2, 3, 4, 5, 6

7. A computer-implemented automated system for categorizing a collection of documents, wherein each document of said collection comprises a plurality of words, comprising:
a preprocessor for tagging words of each of the documents according to parts of speech of the words and producing corresponding part of speech tagged documents;
a feature selector for generating a multidimensional feature space according to a first one of said parts of speech, wherein the feature space has plural dimensions corresponding to features selected from the words tagged according to the first part of speech;
a vectorizer for transforming each said document into a vector and populating said feature space with each said document according to a degree of semantic relation between each said document and the features corresponding to the dimensions of the feature space; and
a clusterizer for analyzing each said vector and generating a plurality of corresponding clusters, wherein each of the clusters corresponds to a respective category, and the clusterizer for determining a semantic sufficiency of each said corresponding category,
wherein said feature selector further comprises:
    a queuing module for queuing each said document to be processed in said feature selector and queuing the clusters for refinement in said feature selector according to a part of speech selected in turn;
    a controlling module for keeping track of each said part of speech in turn;
    a token examination module for examining selectively each said document and each said cluster and identifying a corresponding part of speech thereof in turn to be processed accordingly, counting a frequency of occurrence of each said word, and promulgating a list of each said word tagged according to its part of speech and said frequency of occurrence thereof;
    a feature selection module for performing semantic analysis and choosing said feature according to a predetermined criterion and generating corresponding part of speech specific features; and
    a space generating module for forming said feature space specific to said part of speech selected in turn and defining dimensions of said space corresponding to said features.
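The token examination module described in this claim counts word frequencies and lists each word with its part of speech. A rough sketch of that bookkeeping follows; the function name `examine_tokens`, the lowercasing step, and the tag names are assumptions for illustration and do not come from the patent.

```python
from collections import Counter, defaultdict

def examine_tokens(tagged_doc):
    """Return {tag: Counter(word -> frequency)} for one tagged document."""
    by_pos = defaultdict(Counter)
    for word, tag in tagged_doc:
        # Lowercasing is an assumption, so "Banks" and "banks" count together.
        by_pos[tag][word.lower()] += 1
    return by_pos

# Hypothetical tagged document; the tags are illustrative only.
doc = [("Banks", "NOUN"), ("raise", "VERB"), ("rates", "NOUN"), ("banks", "NOUN")]
table = examine_tokens(doc)

# The list for the part of speech "selected in turn" can then be read off directly.
noun_list = table["NOUN"].most_common()
```

Keeping one frequency table per tag lets a controlling step move from one part of speech to the next without re-scanning the documents.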

8. A computer-implemented automated system for categorizing a collection of documents, wherein each document of said collection comprises a plurality of words, comprising:
a preprocessor for tagging words of each of the documents according to parts of speech of the words and producing corresponding part of speech tagged documents;
a feature selector for generating a multidimensional feature space according to a first one of said parts of speech, wherein the feature space has plural dimensions corresponding to features selected from the words tagged according to the first part of speech;
a vectorizer for transforming each said document into a vector and populating said feature space with each said document according to a degree of semantic relation between each said document and the features corresponding to the dimensions of the feature space; and
a clusterizer for analyzing each said vector and generating a plurality of corresponding clusters, wherein each of the clusters corresponds to a respective category, and the clusterizer for determining a semantic sufficiency of each said corresponding category,
upon determining that said categories lack semantic sufficiency, said feature selector, said vectorizer, and said clusterizer operate in concert to refine said clusters based upon a subsequent said part of speech.
Dependent claims: 9, 10, 11

12. A method provided by execution by a computer of instructions on a computer-readable medium, the method for categorizing a collection of documents containing words and comprising:
tagging said words according to their parts of speech to transform each said document into a corresponding part of speech tagged document;
removing stop words to transform each said part of speech tagged document into a part of speech tagged document that is free of stop words;
selecting from said part of speech tagged documents a first plurality of features corresponding to a first said part of speech;
forming a first feature space corresponding to said first plurality of features, wherein each of said first plurality of features corresponds to a dimension of the first feature space;
transforming each said document into a vector in said first feature space;
clustering said vectors in said first feature space into first level clusters;
determining a sufficiency of semantic coherence of said first level clusters; and
refining said first level clusters in response to determining the semantic coherence of the first level of clusters is insufficient, wherein said refining comprises:
    selecting an Nth plurality of features according to an Nth part of speech wherein said Nth part of speech is different from said first part of speech;
    forming an Nth feature space corresponding to said Nth plurality of features, wherein each of said Nth plurality of features is a dimension of the Nth feature space;
    transforming said first level of clusters into an Nth plurality of vectors in said Nth feature space;
    clustering said Nth plurality of vectors into Nth clusters; and
    determining the sufficiency of semantic coherence of said Nth clusters.
Dependent claims: 13, 14
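The level-by-level refinement recited in this claim can be sketched as a short recursion. This is one possible reading, not the patented method: the clusterer is a toy that assigns each document to its strongest feature, the sufficiency test is a caller-supplied placeholder (here, cluster size), and the names `top_features`, `cluster_by_pos`, and `refine` are hypothetical. The sketch re-clusters the documents inside each insufficient cluster, which is an assumption about how "transforming said first level of clusters" would be realized.

```python
from collections import Counter

def top_features(docs, pos, top_k=2):
    """Most frequent words with the given tag (ties broken alphabetically)."""
    counts = Counter(w for d in docs for w, t in d["tokens"] if t == pos)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top_k]]

def cluster_by_pos(docs, pos):
    """Toy clusterer: assign each document to its strongest feature for `pos`."""
    feats = top_features(docs, pos)
    groups = {}
    for d in docs:
        counts = Counter(w for w, t in d["tokens"] if t == pos)
        vec = [counts[f] for f in feats]
        # Documents with no selected feature fall into a catch-all group (None).
        key = feats[vec.index(max(vec))] if any(vec) else None
        groups.setdefault(key, []).append(d)
    return list(groups.values())

def refine(docs, pos_order, is_sufficient):
    """Cluster on pos_order[0]; re-cluster insufficient clusters on the next tag."""
    result = []
    for c in cluster_by_pos(docs, pos_order[0]):
        if len(pos_order) == 1 or is_sufficient(c):
            result.append([d["id"] for d in c])
        else:
            result.append(refine(c, pos_order[1:], is_sufficient))
    return result

# Hypothetical collection; ids, words, and tags are illustrative only.
docs = [
    {"id": "a", "tokens": [("stocks", "NOUN"), ("fell", "VERB")]},
    {"id": "b", "tokens": [("stocks", "NOUN"), ("rose", "VERB")]},
    {"id": "c", "tokens": [("stocks", "NOUN"), ("fell", "VERB")]},
    {"id": "d", "tokens": [("match", "NOUN"), ("ended", "VERB")]},
]
# Placeholder sufficiency test: a cluster of one document needs no refinement.
hierarchy = refine(docs, ["NOUN", "VERB"], is_sufficient=lambda c: len(c) <= 1)
```

With this toy data the noun level separates the "stocks" documents from the "match" document, and the verb level then splits the "stocks" cluster into "fell" and "rose" sub-clusters.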

15. A method provided by execution by a computer of instructions on a computer-readable medium, the method for categorizing a collection of documents containing words and comprising:
tagging said words according to their parts of speech to transform each said document into a corresponding part of speech tagged document;
removing stop words to transform each said part of speech tagged document into a part of speech tagged document that is free of stop words;
selecting from said part of speech tagged documents a first plurality of features corresponding to a first said part of speech;
forming a first feature space corresponding to said first plurality of features, wherein each of said first plurality of features corresponds to a dimension of the first feature space;
transforming each said document into a vector in said first feature space;
clustering said vectors in said first feature space into first level clusters;
determining a sufficiency of semantic coherence of said first level clusters; and
refining said first level clusters in response to determining the semantic coherence of the first level of clusters is insufficient, wherein said selecting further comprises:
    performing semantic analysis to choose said first plurality of features according to a predetermined criterion; and
    assigning weights to said vectors within said feature space, wherein said weights are directly proportional to frequencies with which words appear in respective documents and inversely proportional to frequencies with which said words appear in said collection of documents to which said document belongs.
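The weighting described in this claim resembles the familiar TF-IDF scheme. The sketch below uses the common `tf * log(N / df)` variant as an assumption; the claim itself does not specify a formula, and the logarithm makes the collection term decrease with frequency rather than being strictly inversely proportional. The function name `tfidf_vectors` and the sample data are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs, features):
    """Weight each feature by term frequency times log inverse document frequency."""
    n = len(docs)
    # Document frequency: number of documents in the collection containing the feature.
    df = {f: sum(1 for d in docs if f in d) for f in features}
    vectors = []
    for d in docs:
        tf = Counter(d)  # term frequency within this document
        vectors.append(
            [tf[f] * math.log(n / df[f]) if df[f] else 0.0 for f in features]
        )
    return vectors

# Hypothetical documents, already reduced to the words of one part of speech.
docs = [["stocks", "stocks", "market"], ["stocks", "match"]]
weights = tfidf_vectors(docs, ["stocks", "market"])
```

In this toy collection "stocks" appears in every document, so its weight is zero everywhere, while "market" keeps a positive weight in the one document that contains it.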
Specification