
Method for improvement accuracy of decision tree based text categorization

  • US 6,253,169 B1
  • Filed: 05/28/1998
  • Issued: 06/26/2001
  • Est. Priority Date: 05/28/1998
  • Status: Expired due to Term
First Claim

1. A text categorization method comprising:

  • obtaining a collection of electronic documents;

    defining a sample set of documents from the collection;

    classifying the documents in the sample set in accordance with steps which include:

    (a) analyzing words in the documents of the sample set to identify a plurality of topics, (b) developing a plurality of local dictionaries, each containing words descriptive of a respective one of said plurality of topics, and (c) developing vectors for each of the documents in the sample set, with the vectors developed for each document in the sample set being indicative of words in a respective one of said plurality of local dictionaries developed for a respective one of said plurality of topics;

    forming a prediction model based on the classification of the documents in the sample set performed in said classifying step, said forming step including:

    (d) forming a plurality of decision trees for said plurality of topics, respectively, said decision trees each being formed based on the vectors developed for the documents in said sample for a respective one of said plurality of topics;

    classifying a new document based on the prediction model, wherein the step of classifying the documents in the sample set includes combining said plurality of local dictionaries into a single pooled dictionary, said single pooled dictionary containing sorted words with duplicate words removed, and wherein the step of classifying a new document based on the prediction model includes:

    identifying words in the new document which correspond to words in said single pooled dictionary;

    forming said words into groups belonging to respective ones of said plurality of topics;

    applying said plurality of decision trees to said groups to derive classification outcomes, each of said classification outcomes being generated by applying one of said plurality of decision trees to a respective one of said groups relative to one of said plurality of topics; and

    classifying the new document into at least one of said plurality of topics based on said classification outcomes.
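
The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not say how dictionary words are selected, how vectors are encoded, or how the trees are induced, so the scoring rule in `local_dictionary`, the binary word-presence vectors, the tiny depth-limited tree learner in `train_tree`, and the sample documents are all assumptions made for illustration. Every function name is hypothetical.

```python
from collections import Counter

def tokenize(text):
    return text.lower().split()

def local_dictionary(docs, labels, topic, size=3):
    # Assumed selection rule: document frequency inside the topic
    # minus document frequency outside it; ties broken alphabetically.
    score = Counter()
    for text, topics in zip(docs, labels):
        for word in set(tokenize(text)):
            score[word] += 1 if topic in topics else -1
    return sorted(score, key=lambda w: (-score[w], w))[:size]

def vectorize(words, dictionary):
    # Binary vector: 1 if the dictionary word is present, else 0 (assumed encoding).
    return [1 if w in words else 0 for w in dictionary]

def train_tree(vectors, targets, depth=2):
    # Tiny stand-in tree learner: a leaf is a bool, an inner node is
    # (feature_index, branch_if_absent, branch_if_present).
    majority = sum(targets) * 2 >= len(targets)
    if depth == 0 or len(set(targets)) <= 1 or not vectors[0]:
        return majority
    def errors(j):
        total = 0
        for side in (0, 1):
            t = [y for v, y in zip(vectors, targets) if v[j] == side]
            total += min(sum(t), len(t) - sum(t))
        return total
    best = min(range(len(vectors[0])), key=errors)
    parts = [[(v, y) for v, y in zip(vectors, targets) if v[best] == side]
             for side in (0, 1)]
    if not parts[0] or not parts[1]:
        return majority
    absent, present = (train_tree([v for v, _ in p], [y for _, y in p], depth - 1)
                       for p in parts)
    return (best, absent, present)

def apply_tree(tree, vector):
    while not isinstance(tree, bool):
        feature, absent, present = tree
        tree = present if vector[feature] else absent
    return tree

def train(docs, labels, topics):
    # Steps (b)-(d): local dictionaries, pooled dictionary, one tree per topic.
    dictionaries = {t: local_dictionary(docs, labels, t) for t in topics}
    pooled = sorted(set().union(*dictionaries.values()))  # sorted, duplicates removed
    trees = {}
    for t in topics:
        vectors = [vectorize(set(tokenize(d)), dictionaries[t]) for d in docs]
        trees[t] = train_tree(vectors, [t in l for l in labels])
    return dictionaries, pooled, trees

def classify(text, dictionaries, pooled, trees):
    # Match words against the pooled dictionary, group them per topic,
    # then apply each topic's tree to its own group.
    matched = {w for w in tokenize(text) if w in pooled}
    result = []
    for topic, dictionary in dictionaries.items():
        group = matched & set(dictionary)
        if apply_tree(trees[topic], vectorize(group, dictionary)):
            result.append(topic)
    return sorted(result)

# Made-up sample set with two topics.
DOCS = ["wheat harvest grain prices rise", "grain exports wheat shipment",
        "bank raises interest rates", "interest rates bank loans"]
LABELS = [{"grain"}, {"grain"}, {"money"}, {"money"}]
MODEL = train(DOCS, LABELS, ["grain", "money"])

print(classify("wheat and grain shipment delayed", *MODEL))  # ['grain']
print(classify("bank buys wheat and grain", *MODEL))         # ['grain', 'money']
```

In this sketch a new document is only ever compared against the pooled dictionary once, and each topic's tree sees just the words from its own local dictionary, which mirrors the grouping step recited in the claim. A document can be assigned to more than one topic because each tree makes an independent yes/no decision.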
