Automated taxonomy generation
Abstract
In a hierarchical taxonomy of documents, the categories of information may be structured as a binary tree whose nodes contain information relevant to the search. The binary tree may be 'trained', or formed, by examining a training set of documents and separating those documents into two child nodes. Each of those document sets may then be further split into two nodes to create the binary tree data structure. The nodes may be generated to maximize the likelihood that each training document falls into either or both of the two child nodes. In one example, each node of the binary tree is associated with a list of terms, and each term in each list of terms is associated with a probability of that term appearing in a document given that node. New documents may be categorized by the nodes of the tree; for example, a new document may be assigned to a particular node based upon the statistical similarity between that document and the associated node.
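Purely as an illustration of the idea in the abstract (the patent does not prescribe any particular model or code), a node's "list of terms with probabilities" can be read as a Bernoulli term model, and "statistical similarity" as likelihood under that model. The node contents (`node_sports`, `node_finance`) and probability values below are invented for the example:

```python
import math

# Hypothetical nodes: each maps a training term to the probability of
# that term appearing in a document associated with the node.
node_sports = {"game": 0.9, "score": 0.8, "market": 0.1}
node_finance = {"game": 0.1, "score": 0.2, "market": 0.9}

def log_likelihood(doc_terms, node):
    """Log-probability of a document's terms under a node's term model."""
    total = 0.0
    for term, p in node.items():
        # Bernoulli model: the term either appears in the document or not.
        total += math.log(p if term in doc_terms else 1.0 - p)
    return total

def assign(doc_terms, left, right):
    """Assign a document to whichever sibling node explains it better."""
    if log_likelihood(doc_terms, left) >= log_likelihood(doc_terms, right):
        return "left"
    return "right"

print(assign({"game", "score"}, node_sports, node_finance))  # left
print(assign({"market"}, node_sports, node_finance))         # right
```

A document could equally be assigned to both nodes (or neither) by thresholding each likelihood instead of comparing them, which matches the abstract's "either or both" language.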
30 Claims
1. A computer readable medium having computer-executable components comprising:

(a) a node generator constructed to receive a list of training terms based on a set of training documents, and to generate a first sibling node comprising a first set of probabilities, and to generate a second sibling node comprising a second set of probabilities, the first set of probabilities comprising, for each term in the list of training terms, a probability of the term appearing in a document, and the second set of probabilities comprising, for each term in the list of training terms, a probability of the term appearing in a document, wherein the first sibling node and the second sibling node are generated by dividing from a parent node;

(b) a document assigner constructed to associate, based on the first and second set of probabilities, each document of the set of training documents to at least one of a group consisting of the first sibling node, the second sibling node, and a null set, the documents associated with the first sibling node forming a first document set and the documents associated with the second sibling node forming a second document set; and

(c) a tree manager constructed to communicate at least one of the first document set and the second document set to the node generator to create a binary tree data structure comprising a hierarchy of a plurality of sibling nodes based on recursive performance of the node generator and the document assigner, and

(d) a document sorter constructed to associate a new document to at least one node of the plurality of sibling nodes based on respective sets of probabilities associated with the nodes, and

wherein the tree manager stores the binary tree data structure for access by the document sorter.

Dependent claims: 2-10.
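The three training components of claim 1 (node generator, document assigner, tree manager) imply a recursive structure that can be sketched as follows. This is a deliberately naive illustration: the seed-overlap splitter stands in for the claimed probabilistic division, and all names (`estimate_probs`, `split`, `build_tree`, `min_size`) are invented for the example:

```python
def estimate_probs(docs, terms):
    """Node generator (simplified): fraction of documents containing each
    term, serving as the node's set of per-term probabilities."""
    n = max(len(docs), 1)
    return {t: sum(t in d for d in docs) / n for t in terms}

def split(docs, terms):
    """Document assigner (simplified): seed two sibling nodes with the
    first and last documents, then attach each document to the more
    similar seed by term overlap."""
    left_seed, right_seed = docs[0], docs[-1]
    left, right = [], []
    for d in docs:
        (left if len(d & left_seed) >= len(d & right_seed) else right).append(d)
    return left, right

def build_tree(docs, terms, min_size=2):
    """Tree manager: recursively feed each document set back to the node
    generator until nodes are small, yielding a binary tree."""
    node = {"probs": estimate_probs(docs, terms), "children": None}
    if len(docs) > min_size:
        left, right = split(docs, terms)
        if left and right and len(left) < len(docs) and len(right) < len(docs):
            node["children"] = (build_tree(left, terms, min_size),
                                build_tree(right, terms, min_size))
    return node

docs = [{"game", "score"}, {"game", "team"}, {"market", "stock"}, {"market", "bond"}]
terms = {"game", "score", "team", "market", "stock", "bond"}
tree = build_tree(docs, terms)  # root with two leaf children
```

The stored `tree` is what the claimed document sorter would later consult when associating a new document with a node.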
11. A computer implemented method comprising the steps of:

(a) creating a binary taxonomy tree based upon a set of training documents, such that each node of the binary taxonomy tree is associated with a list of terms, and each term in each list of terms is associated with a probability of that term appearing in a document given that node, wherein a root node is first created and is then used to create child nodes of the binary taxonomy tree, wherein the child nodes are created by division from their respective parent nodes, and wherein the binary taxonomy tree is stored for associating new documents with nodes of the binary taxonomy tree; and

(b) associating a new document with at least one node of the binary tree based upon a distance value between that document and the node.

Dependent claims: 12-18.
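Claim 11 leaves the "distance value" of step (b) open. One plausible, purely illustrative choice is the negative average log-likelihood of the document under a node's per-term probabilities (lower means closer); the node contents below are invented for the example:

```python
import math

def distance(doc_terms, node_probs):
    """Illustrative distance value: negative average log-likelihood of
    the document under the node's Bernoulli term model."""
    eps = 1e-9  # guard against log(0) for degenerate probabilities
    nll = 0.0
    for term, p in node_probs.items():
        p = min(max(p, eps), 1.0 - eps)
        nll -= math.log(p if term in doc_terms else 1.0 - p)
    return nll / len(node_probs)

sports = {"game": 0.9, "score": 0.8, "market": 0.1}
finance = {"game": 0.1, "score": 0.2, "market": 0.9}
doc = {"game", "score"}
# The document is closer (smaller distance) to the sports-like node.
print(distance(doc, sports) < distance(doc, finance))  # True
```

Any other statistical dissimilarity (e.g. a divergence between term distributions) would satisfy the same step; the claim does not fix a formula.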
19. A computer readable medium having computer-executable instructions for performing steps comprising:

(a) accessing a document;

(b) based upon a first probability of a set of training terms appearing in the document, determining a first distance value between the document and a first of two sibling nodes;

(c) based upon a second probability of the set of training terms appearing in the document, determining a second distance value between the document and a second of two sibling nodes, wherein the two sibling nodes are created by division from a parent node;

(d) if the first distance value is below a distance threshold, determining if two children nodes are associated with the first of two sibling nodes;

(e) if two children nodes are associated with the first of two sibling nodes, then determining a third distance value between the document and the first of the two children nodes, and determining a fourth distance value between the document and the second of the two children nodes; and

(f) if two children nodes are associated with the first of two sibling nodes, associating the document with at least one of the first and the second children nodes based upon the third distance value and the fourth distance value, wherein a tree representation of the parent node, the two sibling nodes, and the two children nodes are stored in memory for use in classifying; and

(g) classifying a new document into one of the two sibling nodes or one of the two children nodes based on the terms in the new document and the set of probabilities associated with each node in the tree representation.

Dependent claims: 20-27.
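Steps (b)-(g) describe a top-down walk of the stored tree: compare the document against two sibling nodes, and if the closer one is within a distance threshold and has children, repeat the comparison one level down. A minimal sketch under assumed choices (the per-term negative log-likelihood distance, the hand-built tree, and the `label` fields are all invented for the example):

```python
import math

def distance(doc_terms, probs):
    """Illustrative per-term negative log-likelihood (lower = closer)."""
    nll = 0.0
    for term, p in probs.items():
        p = min(max(p, 1e-9), 1.0 - 1e-9)
        nll -= math.log(p if term in doc_terms else 1.0 - p)
    return nll / len(probs)

def classify(doc_terms, node, threshold):
    """Steps (b)-(g), paraphrased: pick the closer of two sibling nodes;
    descend into its children only while the distance stays below the
    threshold, then return the node the document is associated with."""
    if node["children"] is None:
        return node
    left, right = node["children"]
    d_l = distance(doc_terms, left["probs"])
    d_r = distance(doc_terms, right["probs"])
    best, d = (left, d_l) if d_l <= d_r else (right, d_r)
    if d < threshold and best["children"] is not None:
        return classify(doc_terms, best, threshold)
    return best

# Hypothetical two-level tree: a sports-like node (split again) and a
# finance-like node under a common parent.
tree = {"probs": {}, "children": (
    {"probs": {"game": 0.9, "market": 0.1},
     "children": (
         {"probs": {"game": 0.9, "score": 0.9}, "children": None, "label": "results"},
         {"probs": {"game": 0.9, "trade": 0.9}, "children": None, "label": "transfers"})},
    {"probs": {"game": 0.1, "market": 0.9}, "children": None, "label": "finance"},
)}

leaf = classify({"game", "score"}, tree, threshold=1.0)
print(leaf["label"])  # results
```

The threshold test in step (d) is what stops the descent early, so a document that matches a broad node only weakly stays associated with that broader category.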
28. A computer implemented method comprising:

(a) receiving a training set of documents, each document comprising a list of terms;

(b) selecting a first set of training terms from at least a portion of the terms listed in the lists of terms;

(c) for each of the training terms, generating a first probability of the training term appearing in any document and associating that probability with a first node;

(d) for each of the training terms, generating a second probability of the training term appearing in any document and associating that probability with a second node;

(e) based on the first and second probabilities for each training term, associating each list of terms to at least one of the group consisting of the first node, the second node, and a null set;

(f) forming a second set of training terms from at least a portion of the terms listed in the lists of terms associated with the first node;

(g) for each of the training terms in the second set of training terms, generating a third probability of the training term appearing in any document and associating that probability with a third node, wherein the third node is generated by dividing the first node;

(h) for each of the training terms in the second set of training terms, generating a fourth probability of the training term appearing in any document and associating that probability with a fourth node, wherein the fourth node is generated by dividing the second node;

(i) based on the third and fourth probabilities for each training term, associating each list of terms to at least one of the group consisting of the third node, the fourth node, and the null set, storing in a memory, a tree representation of the first node, the second node, the third node, and the fourth node, and

(j) classifying a new document into one of the first node, second node, third node, or fourth node based on the terms in the new document and the set of probabilities associated with each node in the tree representation.

Dependent claims: 29, 30.
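Steps (c)-(e) amount to fitting two per-term probability profiles and then reassigning each document to the better-fitting node. One illustrative way to realize that (the claim does not prescribe an algorithm) is a hard-EM-style loop that alternates the two steps; the function name, iteration count, and Laplace smoothing are all assumptions of this sketch:

```python
import math

def fit_two_nodes(docs, terms, iters=5):
    """Alternate between steps (c)/(d) -- estimating each node's per-term
    probabilities from its current members -- and step (e) -- reassigning
    each document to the better-fitting node. Illustrative only."""
    labels = [i % 2 for i in range(len(docs))]  # arbitrary initial split
    probs = []
    for _ in range(iters):
        probs = []
        for node in (0, 1):
            members = [d for d, a in zip(docs, labels) if a == node] or docs
            # Laplace smoothing keeps every probability strictly in (0, 1).
            probs.append({t: (sum(t in d for d in members) + 1) /
                             (len(members) + 2) for t in terms})

        def ll(d, p):
            return sum(math.log(p[t]) if t in d else math.log(1.0 - p[t])
                       for t in terms)

        labels = [0 if ll(d, probs[0]) >= ll(d, probs[1]) else 1 for d in docs]
    return labels, probs

docs = [{"game", "score"}, {"market", "stock"}, {"game", "team"}, {"market", "bond"}]
terms = {"game", "score", "team", "market", "stock", "bond"}
labels, probs = fit_two_nodes(docs, terms)
# Documents sharing "game" settle in one node, "market" in the other.
print(labels)
```

Steps (f)-(i) would then repeat the same fit on the documents of one node with a fresh term set, producing the third and fourth nodes of the tree.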
Specification