Multilevel taxonomy based on features derived from training documents classification using Fisher values as discrimination values

US 6,233,575 B1
Filed: 06/23/1998
Issued: 05/15/2001
Est. Priority Date: 06/24/1997
Status: Expired due to Fees

Abstract

A system, process, and article of manufacture for organizing a large text database into a hierarchy of topics and for maintaining this organization as documents are added and deleted and as the topic hierarchy changes. Given sample documents belonging to various nodes in the topic hierarchy, the tokens (terms, phrases, dates, or other usable feature in the document) that are most useful at each internal decision node for the purpose of routing new documents to the children of that node are automatically detected. Using feature terms, statistical models are constructed for each topic node. The models are used in an estimation technique to assign topic paths to new unlabeled documents. The hierarchical technique, in which feature terms can be very different at different nodes, leads to an efficient context-sensitive classification technique. The hierarchical technique can handle millions of documents and tens of thousands of topics. A resulting taxonomy and path enhanced retrieval system (TAPER) is used to generate context-dependent document indexing terms. The topic paths are used, in addition to keywords, for better focused searching and browsing of the text database.
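
The routing idea in the abstract, where each internal node uses its own feature terms to send a document to one of its children, can be sketched as a greedy top-down walk. This is an illustrative Python sketch under stated assumptions, not the patented implementation: the `Node` class, the `Scorer` callback type, and the `topic_path` function are hypothetical names, and the scoring rule is left to the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Node:
    """A topic node; `features` is the term set used to route among its children."""
    name: str
    features: Set[str] = field(default_factory=set)
    children: List["Node"] = field(default_factory=list)


# Hypothetical scoring signature: (child node, filtered term counts) -> score.
Scorer = Callable[[Node, Dict[str, int]], float]


def topic_path(root: Node, doc: Dict[str, int], score: Scorer) -> List[str]:
    """Greedily descend from the root, choosing the best-scoring child at each level."""
    path = [root.name]
    node = root
    while node.children:
        # Context-sensitive step: only the current node's feature terms are visible.
        visible = {t: n for t, n in doc.items() if t in node.features}
        node = max(node.children, key=lambda child: score(child, visible))
        path.append(node.name)
    return path
```

Because terms outside a node's feature set are dropped before scoring, a frequent but uninformative term has no influence on the decision at that node, while the same term may still matter deeper in the hierarchy.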

32 Claims

1. A process for classifying new documents containing features under nodes defining a multilevel taxonomy, based on features derived from a training set of documents that have been classified under respective nodes of the taxonomy, the process comprising:
- associating a respective set of features with each one of said plurality of nodes, each given set of features comprising a plurality of features that are in at least one training document classified under the associated node; and
  
  classifying each new document under at least one node, based on the set of features associated with said at least one node, further comprising:

  determining a discrimination value for each term in at least one training document which is classified under each one of a plurality of the nodes of the taxonomy, wherein the discrimination value comprises a Fisher value based on the equation:

  $\mathrm{Fisher}(t) = \frac{\sum_{c_{1}, c_{2}} (\mu(c_{1}, t) - \mu(c_{2}, t))^{2}}{\sum_{c} \frac{1}{\langle c \rangle} \sum_{d \in c} (x(d, t) - \mu(c, t))^{2}}$

  where t represents a term, d represents a document, c represents a class, $\mu(c, t) = \frac{1}{\langle c \rangle} \sum_{d \in c} x(d, t)$, and $x(d, t)$ is an occurrence rate of t in d;

  determining a minimum discrimination value for each of said plurality of nodes;
  
  wherein the features in each given set of features have discrimination values equal to or above the minimum discrimination value determined for the node associated with the given set of features.
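
As an illustration only, the Fisher value and the minimum-value cutoff recited in claim 1 could be computed along the following lines. This is a minimal Python sketch, not the patented implementation. The names `class_mean`, `fisher`, and `select_features` are hypothetical; the normalizer written with angle brackets around c is assumed to be the number of training documents in class c; the sum over c1, c2 is assumed to run over unordered pairs of distinct classes; and the handling of zero within-class variance is an added convention.

```python
from itertools import combinations
from typing import Dict, List, Set

Doc = Dict[str, float]  # term -> occurrence rate x(d, t)


def class_mean(docs: List[Doc], term: str) -> float:
    """mu(c, t): mean occurrence rate of `term` over the documents of one class."""
    return sum(d.get(term, 0.0) for d in docs) / len(docs)


def fisher(term: str, classes: Dict[str, List[Doc]]) -> float:
    """Spread of the class means divided by the summed within-class variance."""
    mu = {c: class_mean(docs, term) for c, docs in classes.items()}
    # Assumption: unordered pairs; ordered pairs would only double the numerator.
    between = sum((mu[a] - mu[b]) ** 2 for a, b in combinations(mu, 2))
    within = sum(
        sum((d.get(term, 0.0) - mu[c]) ** 2 for d in docs) / len(docs)
        for c, docs in classes.items()
    )
    # Added convention: zero variance gives +inf if the means differ, else 0.
    if within == 0.0:
        return float("inf") if between > 0.0 else 0.0
    return between / within


def select_features(classes: Dict[str, List[Doc]], min_value: float) -> Set[str]:
    """Keep the terms whose Fisher value is equal to or above the node's minimum."""
    terms = {t for docs in classes.values() for d in docs for t in d}
    return {t for t in terms if fisher(t, classes) >= min_value}
```

A term scores high when its mean occurrence rate differs strongly between the classes under a node while staying stable within each class, which is what makes it useful for routing at that node.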
2. A process as recited in claim 1, wherein said step of selecting a set of features comprises selecting features that are in a plurality of training documents classified under the associated node and that have discrimination values equal to or above the minimum discrimination value.
3. A process as recited in claim 1, wherein said step of classifying comprises:
4. A process as recited in claim 3, wherein said step of defining the probability comprises the step of applying a statistical model to define said probability that features in each given new document would occur at the frequency at which they do occur in the given new document.
5. A process as recited in claim 4, wherein said statistical model comprises a Bernoulli model.
6. A process as recited in claim 4, wherein said statistical model comprises a Poisson model.
7. A process as recited in claim 4, wherein said step of classifying further comprises the step of assigning each given new document to at least one respective node in at least one level of the taxonomy, wherein the at least one node to which each given new document is assigned is the node for which the defined probability is above a predefined threshold among all of the nodes at the same level in the taxonomy.
8. A process as recited in claim 7, wherein the at least one node to which each given new document is assigned is the node for which the defined probability is maximum among all of the nodes at the same level in the taxonomy.
9. A process as recited in claim 8, wherein said step of assigning each given new document to at least one respective node in at least one level of the taxonomy comprises the step of assigning each given new document to at least one respective node in each of a plurality of levels of the taxonomy.
10. A process as recited in claim 1, wherein said step of selecting a set of features comprises selecting features that are in all of the training documents classified under the associated node and that have discrimination values equal to or above the minimum discrimination value.
11. A process as recited in claim 1, wherein said step of determining a discrimination value comprises determining a discrimination value for each feature in a plurality of training documents which are classified under each one of a plurality of the nodes of the taxonomy.
12. A process as recited in claim 1, wherein said step of determining a discrimination value comprises determining a discrimination value for each feature in all of the training documents which are classified under each one of a plurality of the nodes of the taxonomy.
13. A process as recited in claim 1, wherein said step of associating a respective set of features with each node comprises the step of determining the number of features to associate with each respective node.
14. A process as recited in claim 13, wherein said step of associating a respective set of features with each given node comprises the steps of:
- ranking, by discrimination power, each of a plurality of features that are in at least one training document classified under each given node;
  
  providing an optimal number N of features for each given node; and
  
  defining the set of features associated with a given node as the features ranked highest to the Nth highest in said step of ranking.
15. A process as recited in claim 14, wherein said step of providing an optimal number N comprises the step of determining the number N for each given node based on a test set of documents.
16. A process as recited in claim 1, further comprising the step of displaying, for a given node of a plurality of nodes of the taxonomy, a signature comprising at least one feature associated with the documents classified under the given node.
17. A process as recited in claim 16, wherein said signature for each given node comprises a plurality of features associated with the documents classified under the given node.
18. A process as recited in claim 16, wherein said signature for each given node comprises a plurality of features that occur in the documents classified under the given node, but which are determined to have a relatively low frequency of occurrence among documents under the given node.
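
The ranking and cutoff described in claims 13 to 15 (rank features by discrimination power, then keep the top N, with N determined from a test set) might look like the sketch below. It is a minimal illustration under assumptions: `rank_features`, `choose_n`, and the `evaluate` callback are hypothetical names, and `evaluate` stands in for whatever quality measure on the test documents is used, for example classification accuracy.

```python
from typing import Callable, Dict, List, Sequence


def rank_features(scores: Dict[str, float]) -> List[str]:
    """Order terms from highest to lowest discrimination value (ties broken by term)."""
    return sorted(scores, key=lambda t: (-scores[t], t))


def choose_n(
    ranked: List[str],
    evaluate: Callable[[List[str]], float],
    candidates: Sequence[int],
) -> int:
    """Return the candidate N whose top-N feature set scores best on the test set.

    `evaluate` is a hypothetical callback: it takes a feature list and returns a
    quality measure computed on held-out test documents.
    """
    # max() keeps the first maximum, so ascending candidates prefer the smaller N on ties.
    return max(candidates, key=lambda n: evaluate(ranked[:n]))
```

Once N is chosen for a node, its feature set is simply `ranked[:N]`, which matches the "ranked highest to the Nth highest" wording of claim 14.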

19. A classifier system for classifying new documents containing terms under nodes defining a multilevel taxonomy, based on feature terms derived from a training set of documents which are classified under respective nodes on the taxonomy, the system comprising:
- means for determining a discrimination value for each term in at least one training document which is classified under each one of a plurality of the nodes of the taxonomy, wherein the discrimination value comprises a Fisher value based on the equation:

  $\mathrm{Fisher}(t) = \frac{\sum_{c_{1}, c_{2}} (\mu(c_{1}, t) - \mu(c_{2}, t))^{2}}{\sum_{c} \frac{1}{\langle c \rangle} \sum_{d \in c} (x(d, t) - \mu(c, t))^{2}}$

  where t represents a term, d represents a document, c represents a class, $\mu(c, t) = \frac{1}{\langle c \rangle} \sum_{d \in c} x(d, t)$, and $x(d, t)$ is an occurrence rate of t in d;

  means for determining a minimum discrimination value for each of said plurality of nodes;
  
  means for selecting a set of feature terms associated with each one of said plurality of nodes, said feature terms comprising terms that are in at least one training document classified under the associated node and that have discrimination values equal to or above the minimum discrimination value;
  
  means for classifying each new document under at least one node, based on the feature terms associated with said at least one node.
20. A system as recited in claim 19, wherein said means for classifying comprises:
21. A system as recited in claim 20, wherein said means for defining the probability comprises means for applying a Bernoulli model to define said probability for each of said plurality of nodes.
22. A system as recited in claim 19, wherein said means for selecting a set of feature terms comprises means for selecting terms that are in a plurality of training documents classified under the associated node and that have discrimination values equal to or above the minimum discrimination value.
23. A system as recited in claim 19, wherein said means for selecting a set of feature terms comprises means for selecting terms that are in all of the training documents classified under the associated node and that have discrimination values equal to or above the minimum discrimination value.
24. A system as recited in claim 19, wherein said means for determining a discrimination value comprises means for determining a discrimination value for each term in a plurality of training documents which are classified under each one of a plurality of the nodes of the taxonomy.
25. A system as recited in claim 19, wherein said means for determining a discrimination value comprises means for determining a discrimination value for each term in all of the training documents which are classified under each one of a plurality of the nodes of the taxonomy.
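
Claims 5 and 21 name a Bernoulli model without giving its form in this excerpt. One possible reading, assumed here, treats every occurrence of a feature term as an independent draw with a class-specific probability, so that a document's likelihood depends on how often each feature term occurs in it. The Python sketch below follows that reading and applies the maximum rule of claim 8 among sibling nodes. The names `train_node`, `log_likelihood`, and `best_node` are hypothetical, and the Laplace smoothing is an added assumption.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Set


def train_node(docs: Iterable[Dict[str, int]], features: Set[str]) -> Dict[str, float]:
    """Estimate smoothed per-term probabilities over one node's feature terms."""
    counts: Counter = Counter()
    for d in docs:
        for t, n in d.items():
            if t in features:
                counts[t] += n
    total = sum(counts.values())
    # Laplace smoothing (an assumption) keeps unseen feature terms from zeroing the product.
    return {t: (counts[t] + 1) / (total + len(features)) for t in features}


def log_likelihood(doc: Dict[str, int], theta: Dict[str, float]) -> float:
    """Log-probability of the observed feature-term counts, up to a class-independent constant."""
    return sum(n * math.log(theta[t]) for t, n in doc.items() if t in theta)


def best_node(doc: Dict[str, int], models: Dict[str, Dict[str, float]]) -> str:
    """Pick the sibling node under which the document's term counts are most probable."""
    return max(models, key=lambda c: log_likelihood(doc, models[c]))
```

The threshold variant of claim 7 would replace the `max` call with a filter that keeps every node whose probability exceeds a predefined cutoff.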

26. An article of manufacture comprising a computer program carrier readable by a computer and embodying one or more instructions executable by the computer to perform a process for classifying new documents containing terms under nodes defining a multilevel taxonomy, based on feature terms derived from a training set of documents which are classified under respective nodes of the taxonomy, the process comprising:
- determining a discrimination value for each term in at least one training document which is classified under each one of a plurality of the nodes of the taxonomy, wherein the discrimination value comprises a Fisher value based on the equation:

  $\mathrm{Fisher}(t) = \frac{\sum_{c_{1}, c_{2}} (\mu(c_{1}, t) - \mu(c_{2}, t))^{2}}{\sum_{c} \frac{1}{\langle c \rangle} \sum_{d \in c} (x(d, t) - \mu(c, t))^{2}}$

  where t represents a term, d represents a document, c represents a class, $\mu(c, t) = \frac{1}{\langle c \rangle} \sum_{d \in c} x(d, t)$, and $x(d, t)$ is an occurrence rate of t in d;

  determining a minimum discrimination value for each of said plurality of nodes;
  
  selecting a set of feature terms associated with each one of said plurality of nodes;
  
  said feature terms comprising terms that are in at least one training document classified under the associated node and that have discrimination values equal to or above the minimum discrimination value; and
  
  classifying each new document under at least one node, based on the feature terms associated with said at least one node.
27. An article as recited in claim 26, wherein said step of classifying comprises:
28. An article as recited in claim 27, wherein said step of defining the probability comprises the step of applying a Bernoulli model to define said probability for each of said plurality of nodes.
29. An article as recited in claim 26, wherein said step of selecting a set of feature terms comprises selecting terms that are in a plurality of training documents classified under the associated node and that have discrimination values equal to or above the minimum discrimination value.
30. An article as recited in claim 26, wherein said step of selecting a set of feature terms comprises selecting terms that are in all of the training documents classified under the associated node and that have discrimination values equal to or above the minimum discrimination value.
31. An article as recited in claim 26, wherein said step of determining a discrimination value comprises determining a discrimination value for each term in a plurality of training documents which are classified under each one of a plurality of the nodes of the taxonomy.
32. An article as recited in claim 26, wherein said step of determining a discrimination value comprises determining a discrimination value for each term in all of the training documents which are classified under each one of a plurality of the nodes of the taxonomy.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Agrawal, Rakesh; Dom, Byron Edward; Chakrabarti, Soumen; Raghavan, Prabhakar
Primary Examiner(s): Breene, John
Assistant Examiner(s): Channavajjala, Srirama T
Application Number: US09/102,861
Time in Patent Office: 1,057 Days
Field of Search: 707/1-10, 707/100-104, 707/200-206, 707/500-503, 707/511-516, 707/531-536, 707/907, 706/12-21, 706/25-28, 706/45-55, 706/60-61, 706/934, 382/156-157
US Class Current: 1/1
CPC Class Codes:
G06F 16/355   Class or cluster creation o...
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99936   Pattern matching access
