Method for improvement accuracy of decision tree based text categorization
First Claim
Patent Images
1. A text categorization method comprising:
- obtaining a collection of electronic documents;
defining a sample set of documents from the collection;
classifying the documents in the sample set in accordance with steps which include;
(a) analyzing words in the documents ofthe sample set to identify a plurality of topics, (b) developing a plurality of local dictionaries, each containing words descriptive of a respective one of said plurality of topics, and (c) developing vectors for each of the documents in the sample set, with the vectors developed for each document in the sample set being indicative of words in a respective one of said plurality of local dictionaries developed for a respective one of said plurality of topics;
forming a prediction model based on the classification of the documents in the sample set performed in said classifying step, said forming step including;
(d) forming a plurality of decision trees for said plurality of topics, respectively, said decision trees each being formed based on the vectors developed for the documents in said sample for a respective one of said plurality of topics;
classifying a new document based on the prediction model, wherein the step of classifing the documents in the sample set includes combining said plurality of local dictionaries into a single pooled dictionary, said single pooled dictionary containing sorted words with duplicate words removed, and wherein the step of classifying a new document based on the prediction model includes;
identifying words in the new document which correspond to words in said single pooled dictionary;
forming said words into groups belonging to respective ones of said plurality of topics;
applying said plurality of decision trees to said groups to derive classification outcomes, each of said classification outcomes being generated by applying one of said plurality of decision trees to a respective one of said groups relative to one of said plurality of topics; and
classifying the new document into at least one of said plurality of topics based on said classification outcomes.
2 Assignments
0 Petitions
Accused Products
Abstract
A text categorization method automatically classifies electronic documents by developing a single pooled dictionary of words for a sample set of documents, and then generating a decision tree model, based on the pooled dictionary, for classifying new documents. Adaptive resampling techniques are applied to improve the accuracy of the decision tree model.
-
Citations
18 Claims
-
1. A text categorization method comprising:
-
obtaining a collection of electronic documents;
defining a sample set of documents from the collection;
classifying the documents in the sample set in accordance with steps which include;
(a) analyzing words in the documents ofthe sample set to identify a plurality of topics, (b) developing a plurality of local dictionaries, each containing words descriptive of a respective one of said plurality of topics, and (c) developing vectors for each of the documents in the sample set, with the vectors developed for each document in the sample set being indicative of words in a respective one of said plurality of local dictionaries developed for a respective one of said plurality of topics;
forming a prediction model based on the classification of the documents in the sample set performed in said classifying step, said forming step including;
(d) forming a plurality of decision trees for said plurality of topics, respectively, said decision trees each being formed based on the vectors developed for the documents in said sample for a respective one of said plurality of topics;
classifying a new document based on the prediction model, wherein the step of classifing the documents in the sample set includes combining said plurality of local dictionaries into a single pooled dictionary, said single pooled dictionary containing sorted words with duplicate words removed, and wherein the step of classifying a new document based on the prediction model includes;
identifying words in the new document which correspond to words in said single pooled dictionary;
forming said words into groups belonging to respective ones of said plurality of topics;
applying said plurality of decision trees to said groups to derive classification outcomes, each of said classification outcomes being generated by applying one of said plurality of decision trees to a respective one of said groups relative to one of said plurality of topics; and
classifying the new document into at least one of said plurality of topics based on said classification outcomes. - View Dependent Claims (2, 3)
-
-
4. A text categorization method comprising:
-
obtaining a collection of electronic documents;
defining a sample set of documents from the collection;
classifying the documents in the sample set in accordance with steps which include;
(a) analyzing words in the documents of the sample set to identify a plurality of topics, (b) developing a plurality of local dictionaries, each containing words descriptive of a respective one of said plurality of topics, and (c) developing vectors for each of the documents in the sample set, with the vectors developed for each document in the sample set being indicative of words in a respective one of said plurality of local dictionaries developed for a respective one of said plurality of topics;
forming a prediction model based on the classification of the documents in the sample set performed in said classifying step, said forming step including;
(d) forming a plurality of decision trees for said plurality of topics, respectively, said decision trees each being formed based on the vectors developed for the documents in said sample for a respective one of said plurality of topics;
classifying a new document based on the prediction model, wherein the step of classifying the documents in the sample set includes combining said plurality of local dictionaries into a single pooled dictionary, said single pooled dictionary containing sorted words with duplicate words removed, and wherein said step of forming a plurality of decision trees is performed in accordance with an adaptive resampling technique which includes, for respective ones of said plurality of topics, the following steps;
(e) forming a first decision tree for a given topic based on each of the vectors developed for documents in a first subset of said sample set which correspond to the given topic;
(f) testing accuracy of said first decision tree by using said first decision tree to classify documents in said sample set;
(g) identifying, based on said testing step, documents in said first subset that were erroneously classified by said first decision tree; and
(h) forming a second decision tree for the given topic based on each of the vectors developed for documents in a second subset of said sample set which correspond to the given topic, said second decision tree being developed so as to properly classify the documents erroneously classified by said first decision tree. - View Dependent Claims (5, 6, 7, 8, 9, 10, 11, 12, 13)
performing steps (e) through (h) to analogously develop third through N decision trees for the given topic based on the vectors developed for documents in respective additional subsets of said sample set which correspond to the given topic, each of said third through N decision trees being developed so as to properly classify documents erroneously classified by a preceding decision tree.
-
-
6. The method of claim 5, wherein N=10.
-
7. The method of claim 5, wherein N lies in a range between 3 and 100, inclusive.
-
8. The method of claim 5, wherein the step of classifying a new document based on the prediction model includes:
-
identifying words in the new document which correspond to words in said single pooled dictionary;
forming said words into sets of words belonging to respective ones of said plurality of pre-selected topics;
applying said plurality of decision trees to respective ones of said sets relative to said plurality of pre-selected topics, said applying step including applying the first through N decision trees generated for each topic to a respective one of said sets, the first through N decision trees for each topic generating first through N classification outcomes which are then voted to determine a final classification outcome; and
classifying the new document into at least one of said plurality of pre-selected topics based on the final classification outcomes generated in said applying step.
-
-
9. The method of claim 5, wherein said first, second, and additional subsets are derived from a random selection of documents from said sample set.
-
10. The method of claim 9, wherein said first, second, and additional subsets each contain a same number of documents in said sample set, but not all the documents in said sample set are included in each of said first, second, and additional subsets.
-
11. The method of claim 10, wherein said first, second, and additional subsets each contain a same number of documents in said sample set by including redundant documents in each of said first, second, and additional subsets.
-
12. The method of claim 8, wherein said first through N classification outcomes are voted according to a simple majority vote.
-
13. The method of claim 5, wherein the first through N decision trees may be developed sequentially or simultaneously.
-
14. A text categorization method comprising:
-
obtaining a collection of electronic documents;
defining a sample set of documents from the collection;
classifying the documents in the sample set in accordance with steps which include;
(a) analyzing words in the documents of the sample set to identify a plurality of topics, (b) developing a plurality of local dictionaries, each containing words descriptive of a respective one of said plurality of topics, and (c) developing vectors for each of the documents in the sample set, with the vectors developed for each document in the sample set being indicative of words in a respective one of said plurality of local dictionaries developed for a respective one of said plurality of topics;
forming a prediction model based on the classification of the documents in the sample set performed in said classifying step, said forming step including;
(d) forming a plurality of decision trees for said plurality of topics, respectively, said decision trees each being formed based on the vectors developed for the documents in said sample for a respective one of said plurality of topics;
classifying a new document based on the prediction model, wherein said step of forming a plurality of decision trees is performed in accordance with an adaptive resampling technique which includes, for respective ones of said plurality of topics, the following steps;
forming multiple subsets of documents from said sample set for a given topic;
forming a group of decision trees each based on a different one of said multiple subsets;
testing accuracy of said group of decision trees, each decision tree in said group, except a last decision tree in said group, being tested for accuracy before another decision tree in said group is formed, said testing step further including maintaining a cumulative error table in memory, said table storing information indicative of which decision trees from said group erroneously classified documents in respective ones of said subsets, as well as information indicative of the erroneously classified documents themselves, said table being updated with said information after each decision tree in said group is tested for accuracy, wherein said step of forming a group of decision trees includes forming each decision tree in said group, except a first decision tree in said group, with an emphasis on properly classifying at least one document in said cumulative error table which has been erroneously classified by one or more previous decision trees in said group.
-
-
15. A text categorization method comprising:
-
obtaining collection of electronic documents;
defining a sample set of documents from the collection;
classifying the documents in the sample set in accordance with steps which include;
(a) analyzing words in the documents of the sample set to identify a plurality of topics, (b) developing a plurality of local dictionaries, each containing words descriptive of a respective one of said plurality of topics, and (c) developing vectors for each of the documents in the sample set, with the vectors developed for each document in the sample set being indicative of words in a respective one of said plurality of local dictionaries developed for a respective one of said plurality of topics;
forming a prediction model based on the classification of the documents in the sample set performed in said classifying step, said forming step including;
(d) forming a multi-class decision tree based on the vectors developed for the documents in said sample for said plurality of topics;
classifying a new document based on the prediction model. wherein the step of classifying the documents in the sample set includes;
combining said plurality oflocal dictionaries into a single pooled dictionary, said single pooled dictionary containing sorted words with duplicate words removed, wherein the step of classifying a new document based on the prediction model, includes;
identifying words in the new document which correspond to words in said single pooled dictionary;
forming said words into groups belonging to respective ones of said plurality of topics;
applying said multi-class decision tree to said groups to derive a classification outcome; and
classifying the new document into one of said plurality of topics based on said classification outcome.
-
-
16. A text categorization method comprising:
-
obtaining a collection of electronic documents;
defining a sample set of documents from the collection;
classifying the documents in the sample set in accordance with steps which include;
(a) analyzing words in the documents ofthe sample set to identify a plurality of topics, (b) developing a plurality of local dictionaries, each containing words descriptive of a respective one of said plurality of topics, and (c) developing vectors for each of the documents in the sample set, with the vectors developed for each document in the sample set being indicative of words in a respective one of said plurality of local dictionaries developed for a respective one of said plurality of topics;
forming a prediction model based on the classification of the documents in the sample set performed in said classifying step, said forming step including;
(d) forming a plurality of decision trees for said plurality of topics, respectively, said decision trees each being formed based on the vectors developed for the documents in said sample for a respective one of said plurality of topics;
classifying a new document based on the prediction model, wherein the step of classifying a new document based on the prediction model includes;
identifying words in the new document which correspond to words in said plurality of local dictionaries;
forming said words into groups belonging to respective ones of said plurality of topics;
applying said plurality of decision trees to said groups to derive classification outcomes each of said classification outcomes being generated by applying one of said plurality of decision trees to a respective one of said groups relative to one of said plurality of topics; and
classifying the new document into at least one of said plurality of topics based on said classification outcomes, wherein the step of classifying the documents in the sample set includes combining said plurality of local dictionaries into a single pooled dictionary, said single pooled dictionary containing sorted words with duplicate words removed, and wherein said step of identifying words in the new document includes identifying words which correspond to words in said single pooled dictionary.
-
-
17. A text categorization method comprising:
-
obtaining a collection of electronic documents;
defining a sample set of documents from the collection;
classifying the documents in the sample set in accordance with steps which include;
(a) analyzing words in the documents of the sample set to identify a plurality of topics, (b) developing a plurality of local dictionaries, each containing words descriptive of a respective one of said plurality of topics, and (c) developing vectors for each of the documents in the sample set, with the vectors developed for each document in the sample set being indicative of words in a respective one of said plurality of local dictionaries developed for a respective one of said plurality of topics;
forming a prediction model based on the classification of the documents in the sample set performed in said classifying step, said forming step including;
(d) forming a multi-class decision tree based on the vectors developed for the documents in said sample for said plurality of topics;
classifying a new documnet based on the prediction model, wherein the step of classifying a new document based on the prediction model includes;
identifying words in the new document which correspond to words in said plurality of local dictionaries;
forming said words into groups belonging to respective ones of said plurality of topics;
applying said multi-class desision tree to said groups to derive a classification outcome; and
classifying the new document into one of said plurality of topics based on said classification outcome, wherein the step of classifying the documents in the sample set includes combining said plurality of local dictionaries into a single pooled dictionary, said single pooled dictionary containing sorted words with duplicate words removed, and wherein said step of identifying words in the new document includes identifying words which correspond to words in said single pooled dictionary.
-
-
18. A text categorization method comprising:
-
obtaining a collection of electronic documents;
defining a sample set of documents from the collection;
classifying the documents in the sample set in accordance with steps that include;
(a) analyzing words in the documents of the sample set to identify a plurality of topics, (b) developing a plurality of local dictionaries, each containing words descriptive of a respective one of said plurality of topics, and (c) developing vectors for each of the documents in the sample set, with the vectors developed for each document in the sample set being indicative of words in a respective one of said plurality of local dictionaries developed for a respective one of said plurality of topics;
forming a prediction model based on the classification of the documents in the sample set performed in said classifying step, said forming step including;
(d) forming a plurality of decision tries for said plurality of topics, respectively, said decision tries each being formed based on the vectors developed for the documents in said sample for a respective one of said plurality of topics; and
classifing a new document based on the prediction model, wherein the step of classifying the documents in the sample set includes;
combining said plurality of local dictionaries into a single pooled dictionary, said single pooled dictionary containing sorted words with duplicate words removed.
-
Specification