Multilevel taxonomy based on features derived from training documents classification using Fisher values as discrimination values
Abstract
A system, process, and article of manufacture for organizing a large text database into a hierarchy of topics and for maintaining this organization as documents are added and deleted and as the topic hierarchy changes. Given sample documents belonging to various nodes in the topic hierarchy, the tokens (terms, phrases, dates, or other usable features in the document) that are most useful at each internal decision node for the purpose of routing new documents to the children of that node are automatically detected. Using feature terms, statistical models are constructed for each topic node. The models are used in an estimation technique to assign topic paths to new unlabeled documents. The hierarchical technique, in which feature terms can be very different at different nodes, leads to an efficient context-sensitive classification technique. The hierarchical technique can handle millions of documents and tens of thousands of topics. A resulting taxonomy and path enhanced retrieval system (TAPER) is used to generate context-dependent document indexing terms. The topic paths are used, in addition to keywords, for better focused searching and browsing of the text database.
32 Claims
-
1. A process for classifying new documents containing features under nodes defining a multilevel taxonomy, based on features derived from a training set of documents that have been classified under respective nodes of the taxonomy, the process comprising:
-
associating a respective set of features with each one of said plurality of nodes, each given set of features comprising a plurality of features that are in at least one training document classified under the associated node; and
classifying each new document under at least one node, based on the set of features associated with said at least one node, further comprising:
determining a discrimination value for each term in at least one training document which is classified under each one of a plurality of the nodes of the taxonomy, wherein the discrimination value comprises a Fisher value based on the equation;
where t represents a term, d represents a document, and c represents a class;
determining a minimum discrimination value for each of said plurality of nodes;
wherein the features in each given set of features have discrimination values equal to or above the minimum discrimination value determined for the node associated with the given set of features.
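The equation referred to in claim 1 is not reproduced in this text. For orientation only, one widely used form of the Fisher discriminant for term selection is sketched below. The symbols x(d,t) (the occurrence rate of term t in document d) and \mu(c,t) (the mean of x(d,t) over the documents of class c), and the sum over pairs of classes, are assumptions introduced here and may not match the claimed equation exactly.

```latex
\mu(c,t) = \frac{1}{|c|} \sum_{d \in c} x(d,t),
\qquad
\mathrm{Fisher}(t) =
\frac{\displaystyle \sum_{c_1, c_2} \bigl(\mu(c_1,t) - \mu(c_2,t)\bigr)^{2}}
     {\displaystyle \sum_{c} \frac{1}{|c|} \sum_{d \in c} \bigl(x(d,t) - \mu(c,t)\bigr)^{2}}
```

Under this form, a term scores highly when its average rate differs strongly between classes (numerator) while staying stable within each class (denominator).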
scanning each new document to determine the features in the document; and
defining, for each of said plurality of nodes and for each new document, the probability that the new document is classified under the node, based on the set of features associated with the node and the features in the document.
-
4. A process as recited in claim 3, wherein said step of defining the probability comprises the step of applying a statistical model to define said probability that features in each given new document would occur at the frequency at which they do occur in the given new document.
-
5. A process as recited in claim 4, wherein said statistical model comprises a Bernoulli model.
-
6. A process as recited in claim 4, wherein said statistical model comprises a Poisson model.
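Claims 4 to 6 score a new document by how likely its observed term frequencies are under a node's statistical model. The following is a minimal sketch under stated assumptions, not the patented implementation: the Bernoulli model is read here as independent draws of term occurrences with per-node probabilities `theta`, the Poisson model as per-term counts with mean `rate[t] * len(doc)`, and all function and parameter names are hypothetical.

```python
import math
from collections import Counter


def bernoulli_log_likelihood(doc_tokens, theta, floor=1e-9):
    """log P(doc | node) when each occurrence of a feature term t is drawn with P(t) = theta[t].

    Only the node's feature terms (the keys of theta) are scored. The multinomial
    coefficient is omitted because it is the same for every node being compared.
    """
    counts = Counter(t for t in doc_tokens if t in theta)
    return sum(n * math.log(max(theta[t], floor)) for t, n in counts.items())


def poisson_log_likelihood(doc_tokens, rate):
    """log P(doc | node) when the count of term t is Poisson with mean rate[t] * len(doc).

    Assumes a non-empty document and strictly positive rates.
    """
    counts = Counter(doc_tokens)
    length = len(doc_tokens)
    total = 0.0
    for t, r in rate.items():
        lam = r * length
        n = counts.get(t, 0)
        # log of the Poisson pmf: n*log(lam) - lam - log(n!)
        total += n * math.log(lam) - lam - math.lgamma(n + 1)
    return total
```

Working in log space avoids underflow on long documents, and the node with the larger log-likelihood is the better fit for the document under the chosen model.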
-
7. A process as recited in claim 4, wherein said step of classifying further comprises the step of assigning each given new document to at least one respective node in at least one level of the taxonomy, wherein the at least one node to which each given new document is assigned is the node for which the defined probability is above a predefined threshold among all of the nodes at the same level in the taxonomy.
-
8. A process as recited in claim 7, wherein the at least one node to which each given new document is assigned is the node for which the defined probability is maximum among all of the nodes at the same level in the taxonomy.
-
9. A process as recited in claim 8, wherein said step of assigning each given new document to at least one respective node in at least one level of the taxonomy comprises the step of assigning each given new document to at least one respective node in each of a plurality of levels of the taxonomy.
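Claims 7 to 9 describe assigning a document to nodes level by level, either to every node whose probability exceeds a threshold or to the single most probable node. A rough sketch of both follows. The `children` mapping, the `score(node, doc)` callback, and the function names are hypothetical stand-ins for whatever per-node model is used.

```python
def children_above(doc, children, score, node, threshold):
    """Children of `node` whose score for `doc` is above `threshold` (claim 7 style)."""
    return [c for c in children.get(node, []) if score(c, doc) > threshold]


def assign_path(doc, children, score, root="root"):
    """Follow the highest-scoring child at each level (claims 8 and 9 style).

    children: {node: [child nodes]}; leaves are absent from the mapping or map to [].
    score:    callable (node, doc) -> number, higher meaning more probable.
    Returns the topic path from the first level below `root` down to a leaf.
    """
    path, node = [], root
    while children.get(node):
        node = max(children[node], key=lambda child: score(child, doc))
        path.append(node)
    return path
```

Because only the children of the chosen node are scored at each step, the work per document grows with the depth of the taxonomy and the fan-out along one path, not with the total number of topics.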
-
10. A process as recited in claim 1, wherein said step of selecting a set of features comprises selecting features that are in all of the training documents classified under the associated node and that have discrimination values equal to or above the minimum discrimination value.
-
11. A process as recited in claim 1, wherein said step of determining a discrimination value comprises determining a discrimination value for each feature in a plurality of training documents which are classified under each one of a plurality of the nodes of the taxonomy.
-
12. A process as recited in claim 1, wherein said step of determining a discrimination value comprises determining a discrimination value for each feature in all of the training documents which are classified under each one of a plurality of the nodes of the taxonomy.
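Claims 11 and 12 compute a discrimination value for each feature over the training documents under a node, and claim 1 keeps the features at or above a per-node minimum. The sketch below illustrates that flow under an assumption: it uses a pairwise between-class over within-class variance ratio as the Fisher value, which may differ from the claimed equation (not reproduced in this text). All names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def term_rates(tokens):
    """x(d, t): occurrences of each term divided by the document length."""
    counts = Counter(tokens)
    return {t: n / len(tokens) for t, n in counts.items()}


def fisher_scores(docs_by_class):
    """Score every term seen under one node.

    docs_by_class maps each child class to a non-empty list of non-empty token lists.
    """
    rates = {c: [term_rates(d) for d in docs] for c, docs in docs_by_class.items()}
    vocabulary = {t for per_class in rates.values() for r in per_class for t in r}
    scores = {}
    for t in vocabulary:
        mean = {c: sum(r.get(t, 0.0) for r in rs) / len(rs) for c, rs in rates.items()}
        between = sum((mean[a] - mean[b]) ** 2 for a, b in combinations(sorted(mean), 2))
        within = sum(
            sum((r.get(t, 0.0) - mean[c]) ** 2 for r in rs) / len(rs)
            for c, rs in rates.items()
        )
        if within > 0:
            scores[t] = between / within
        else:
            # No within-class spread: separating terms get +inf, constant terms get 0.
            scores[t] = float("inf") if between > 0 else 0.0
    return scores


def select_features(scores, minimum):
    """Keep the terms whose discrimination value is equal to or above the node's minimum."""
    return {t for t, s in scores.items() if s >= minimum}
```

A term such as "the" that appears at the same rate in every class scores 0 and is dropped, while a term confined to one class scores highly and is kept.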
-
13. A process as recited in claim 1, wherein said step of associating a respective set of features with each node comprises the step of determining the number of features to associate with each respective node.
-
14. A process as recited in claim 13, wherein said step of associating a respective set of features with each given node comprises the steps of:
-
ranking, by discrimination power, each of a plurality of features that are in at least one training document classified under each given node;
providing an optimal number N of features for each given node; and
defining the set of features associated with a given node as the features ranked highest to the Nth highest in said step of ranking.
-
15. A process as recited in claim 14, wherein said step of providing an optimal number N comprises the step of determining the number N for each given node based on a test set of documents.
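Claims 14 and 15 rank features by discrimination power, keep the top N, and choose N per node using a test set. One way this could look is sketched below. The `evaluate` callback, which is assumed to return classification accuracy on held-out test documents for a given feature list, and the other names are hypothetical.

```python
def top_n_features(scores, n):
    """Return the n highest-scoring features, ties broken alphabetically."""
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:n]


def choose_n(scores, candidates, evaluate):
    """Pick the N in `candidates` whose top-N feature set scores best on the test set.

    evaluate: callable (list of features) -> accuracy on held-out documents.
    Ties go to the earliest candidate, so listing candidates in increasing
    order favors the smaller feature set.
    """
    best_n, best_accuracy = None, float("-inf")
    for n in candidates:
        accuracy = evaluate(top_n_features(scores, n))
        if accuracy > best_accuracy:
            best_n, best_accuracy = n, accuracy
    return best_n, best_accuracy
```

Running this once per internal node lets each node end up with its own N, which fits the abstract's point that feature terms can be very different at different nodes.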
-
16. A process as recited in claim 1, further comprising the step of displaying, for a given node of a plurality of nodes of the taxonomy, a signature comprising at least one feature associated with the documents classified under the given node.
-
17. A process as recited in claim 16, wherein said signature for each given node comprises a plurality of features associated with the documents classified under the given node.
-
18. A process as recited in claim 16, wherein said signature for each given node comprises a plurality of features that occur in the documents classified under the given node, but which are determined to have a relatively low frequency of occurrence among documents under the given node.
-
19. A classifier system for classifying new documents containing terms under nodes defining a multilevel taxonomy, based on feature terms derived from a training set of documents which are classified under respective nodes of the taxonomy, the system comprising:
-
means for determining a discrimination value for each term in at least one training document which is classified under each one of a plurality of the nodes of the taxonomy, wherein the discrimination value comprises a Fisher value based on the equation;
where t represents a term, d represents a document, and c represents a class;
means for determining a minimum discrimination value for each of said plurality of nodes;
means for selecting a set of feature terms associated with each one of said plurality of nodes, said feature terms comprising terms that are in at least one training document classified under the associated node and that have discrimination values equal to or above the minimum discrimination value; and
means for classifying each new document under at least one node, based on the feature terms associated with said at least one node.
means for scanning each new document to determine the terms in the document; and
means for defining, for each of said plurality of nodes and for each new document, the probability that the new document is classified under the node, based on the feature terms associated with the node and the terms in the document.
-
21. A system as recited in claim 20, wherein said means for defining the probability comprises means for applying a Bernoulli model to define said probability for each of said plurality of nodes.
-
22. A system as recited in claim 19, wherein said means for selecting a set of feature terms comprises means for selecting terms that are in a plurality of training documents classified under the associated node and that have discrimination values equal to or above the minimum discrimination value.
-
23. A system as recited in claim 19, wherein said means for selecting a set of feature terms comprises means for selecting terms that are in all of the training documents classified under the associated node and that have discrimination values equal to or above the minimum discrimination value.
-
24. A system as recited in claim 19, wherein said means for determining a discrimination value comprises means for determining a discrimination value for each term in a plurality of training documents which are classified under each one of a plurality of the nodes of the taxonomy.
-
25. A system as recited in claim 19, wherein said means for determining a discrimination value comprises means for determining a discrimination value for each term in all of the training documents which are classified under each one of a plurality of the nodes of the taxonomy.
-
26. An article of manufacture comprising a computer program carrier readable by a computer and embodying one or more instructions executable by the computer to perform a process for classifying new documents containing terms under nodes defining a multilevel taxonomy, based on feature terms derived from a training set of documents which are classified under respective nodes of the taxonomy, the process comprising:
-
determining a discrimination value for each term in at least one training document which is classified under each one of a plurality of the nodes of the taxonomy, wherein the discrimination value comprises a Fisher value based on the equation;
where t represents a term, d represents a document, and c represents a class;
determining a minimum discrimination value for each of said plurality of nodes;
selecting a set of feature terms associated with each one of said plurality of nodes, said feature terms comprising terms that are in at least one training document classified under the associated node and that have discrimination values equal to or above the minimum discrimination value; and
classifying each new document under at least one node, based on the feature terms associated with said at least one node.
scanning each new document to determine the terms in the document; and
defining, for each of said plurality of nodes and for each new document, the probability that the new document is classified under the node, based on the feature terms associated with the node and the terms in the document.
-
28. An article as recited in claim 27, wherein said step of defining the probability comprises the step of applying a Bernoulli model to define said probability for each of said plurality of nodes.
-
29. An article as recited in claim 26, wherein said step of selecting a set of feature terms comprises selecting terms that are in a plurality of training documents classified under the associated node and that have discrimination values equal to or above the minimum discrimination value.
-
30. An article as recited in claim 26, wherein said step of selecting a set of feature terms comprises selecting terms that are in all of the training documents classified under the associated node and that have discrimination values equal to or above the minimum discrimination value.
-
31. An article as recited in claim 26, wherein said step of determining a discrimination value comprises determining a discrimination value for each term in a plurality of training documents which are classified under each one of a plurality of the nodes of the taxonomy.
-
32. An article as recited in claim 26, wherein said step of determining a discrimination value comprises determining a discrimination value for each term in all of the training documents which are classified under each one of a plurality of the nodes of the taxonomy.
Specification