Method for categorizing documents by multilevel feature selection and hierarchical clustering based on parts of speech tagging

  • US 7,139,695 B2
  • Filed: 06/20/2002
  • Issued: 11/21/2006
  • Est. Priority Date: 06/20/2002
  • Status: Expired due to Fees
First Claim

1. A method for categorizing documents comprising:

    tagging words of said documents with tags indicating respective parts of speech of the words and producing corresponding part of speech tagged documents;

    selecting, by a feature selector, a first set of features based on the tagged words for a first one of said parts of speech;

    generating, by the feature selector, a multidimensional feature space that has plural dimensions corresponding to the first set of features;

    transforming, by a vectorizer, each of the documents into a vector and populating the feature space with each of the documents according to a degree of semantic relation between each document and the features corresponding to the dimensions of the feature space;

    grouping, by a clusterizer, said documents into clusters based on analyzing each of the vectors, wherein each of the clusters corresponds to a respective category; and

    determining, by the clusterizer, a semantic sufficiency of each of the categories; and

    in response to determining that the categories lack semantic sufficiency, refining, with the feature selector, vectorizer, and clusterizer, said clusters using at least another set of features for at least another one of said parts of speech different from the first one of the parts of speech.
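
The steps recited in claim 1 can be pictured with a small, purely illustrative Python sketch. It is not the patented implementation. Every name in it (`LEXICON`, `tag`, `select_features`, `vectorize`, `cluster`, `categorize`, `EXAMPLE_DOCS`) is hypothetical. The toy lexicon stands in for a real part-of-speech tagger, raw term counts stand in for the "degree of semantic relation", grouping by the strongest dimension stands in for a real clusterizer, and a maximum cluster size stands in for the "semantic sufficiency" test, which the claim itself does not define.

```python
from collections import Counter, defaultdict

# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
LEXICON = {
    "engine": "NOUN", "car": "NOUN", "bank": "NOUN", "loan": "NOUN",
    "repair": "VERB", "sell": "VERB", "approve": "VERB", "refuse": "VERB",
}

EXAMPLE_DOCS = [
    "repair the car engine",
    "sell the car",
    "repair car",
    "bank approve loan",
    "bank refuse loan",
]

def tag(doc):
    """Tag each known word with its part of speech; unknown words are dropped."""
    return [(w, LEXICON[w]) for w in doc.lower().split() if w in LEXICON]

def select_features(tagged_docs, pos, k=2):
    """Pick the k most frequent words carrying the given part of speech."""
    counts = Counter(w for d in tagged_docs for w, p in d if p == pos)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [w for w, _ in ranked[:k]]

def vectorize(tagged_doc, features):
    """One term count per feature dimension (assumed relatedness measure)."""
    counts = Counter(w for w, _ in tagged_doc)
    return [counts[f] for f in features]

def cluster(indices, tagged_docs, features):
    """Group documents by their strongest feature dimension (toy clusterizer)."""
    groups = defaultdict(list)
    for i in indices:
        vec = vectorize(tagged_docs[i], features)
        label = features[vec.index(max(vec))] if any(vec) else "other"
        groups[label].append(i)
    return dict(groups)

def categorize(docs, first_pos="NOUN", other_pos=("VERB",), max_size=2):
    """Cluster on one part of speech, then refine insufficient clusters on others."""
    tagged = [tag(d) for d in docs]
    feats = select_features(tagged, first_pos)
    first_pass = cluster(range(len(docs)), tagged, feats)
    clusters = {(label,): members for label, members in first_pass.items()}
    for pos in other_pos:
        refined = {}
        for path, members in clusters.items():
            # Assumed sufficiency test: a category is sufficient once it is small.
            if len(members) <= max_size:
                refined[path] = members
                continue
            sub_feats = select_features([tagged[i] for i in members], pos)
            for label, sub in cluster(members, tagged, sub_feats).items():
                refined[path + (label,)] = sub
        clusters = refined
    return clusters
```

Calling `categorize(EXAMPLE_DOCS)` first groups the five toy documents on the nouns "car" and "bank". Under the assumed size threshold, only the three-document "car" group is treated as insufficient, so it alone is split again using verb features, which mirrors the refinement step on a different part of speech.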
