Automated taxonomy generation
Abstract
In a hierarchical taxonomy of documents, the categories of information may be structured as a binary tree whose nodes contain information relevant to the search. The binary tree may be 'trained', or formed, by examining a training set of documents and separating those documents into two child nodes. Each of those sets of documents may then be further split into two nodes to create the binary tree data structure. The nodes may be generated so as to maximize the likelihood that each of the training documents falls into one or both of the two child nodes. In one example, each node of the binary tree is associated with a list of terms, and each term in each list of terms is associated with a probability of that term appearing in a document given that node. New documents may then be categorized by the nodes of the tree; for example, a new document may be assigned to a particular node based upon the statistical similarity between that document and the node.
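The per-node term lists described in the abstract can be sketched as a simple document-frequency estimate. This is a minimal illustration, assuming document frequency as the estimator; the patent does not specify how the probabilities are computed.

```python
from collections import Counter

def term_probabilities(documents):
    """For each term, estimate the probability that a document in this
    node contains the term (illustrative document-frequency estimate)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))   # count each term once per document
    return {term: count / n_docs for term, count in doc_freq.items()}

docs = [["apple", "fruit"], ["apple", "pie"], ["car", "engine"]]
probs = term_probabilities(docs)    # e.g. "apple" appears in 2 of 3 documents
```

A real system would smooth these estimates so that unseen terms do not get probability zero.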
36 Claims
1. A computer readable medium having computer-executable components comprising:
(a) a node generator constructed to receive a list of training terms based on a set of training documents, and to generate a first sibling node comprising a first set of probabilities, and to generate a second sibling node comprising a second set of probabilities, the first set of probabilities comprising, for each term in the list of training terms, a probability of the term appearing in a document, and the second set of probabilities comprising, for each term in the list of training terms, a probability of the term appearing in a document;
(b) a document assigner constructed to associate, based on the first and second set of probabilities, each document of the set of training documents to at least one of a group consisting of the first sibling node, the second sibling node, and a null set, the documents associated with the first sibling node forming a first document set and the documents associated with the second sibling node forming a second document set; and
(c) a tree manager constructed to communicate at least one of the first document set and the second document set to the node generator to create a binary tree data structure comprising a hierarchy of a plurality of sibling nodes based on recursive performance of the node generator and the document assigner. (Dependent claims 2-11 not shown.)
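The three components of claim 1 can be sketched as plain functions: a node generator that produces two sibling term-probability models, a document assigner that routes each training document to the likelier sibling, and a tree manager that recurses. Everything below is an illustrative assumption: the half-and-half initialization, the log-likelihood rule, and the depth cutoff are not fixed by the claim (which also permits a null-set assignment, unused here).

```python
import math
from collections import Counter

def estimate(docs):
    """Term-probability model for one node (document frequency)."""
    df = Counter()
    for d in docs:
        df.update(set(d))
    n = max(1, len(docs))
    return {t: c / n for t, c in df.items()}

def node_generator(docs):
    """Generate two sibling nodes from a crude initial split:
    first half vs. second half of the training set."""
    half = max(1, len(docs) // 2)
    return estimate(docs[:half]), estimate(docs[half:])

def log_likelihood(doc, probs, floor=1e-6):
    return sum(math.log(probs.get(t, floor)) for t in set(doc))

def document_assigner(docs, left, right):
    """Associate each document with the likelier sibling node."""
    left_set, right_set = [], []
    for d in docs:
        (left_set if log_likelihood(d, left) >= log_likelihood(d, right)
         else right_set).append(d)
    return left_set, right_set

def tree_manager(docs, depth=0, max_depth=2):
    """Recursively apply node_generator and document_assigner
    to build the binary tree data structure."""
    if depth == max_depth or len(docs) < 2:
        return {"docs": docs, "children": None}
    left, right = node_generator(docs)
    ldocs, rdocs = document_assigner(docs, left, right)
    if not ldocs or not rdocs:              # degenerate split: stop here
        return {"docs": docs, "children": None}
    return {"docs": docs,
            "children": (tree_manager(ldocs, depth + 1, max_depth),
                         tree_manager(rdocs, depth + 1, max_depth))}

docs = [["apple", "fruit"], ["apple", "pie"], ["car", "engine"], ["car", "wheel"]]
tree = tree_manager(docs)
```

A production system would iterate the generate/assign steps to convergence (EM-style) rather than splitting once.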
12. A computer readable medium having stored thereon a binary tree data structure comprising:
(a) a root node stored in at least one region of the computer readable medium having associated therewith a first list of probabilities assigned to individual terms found in a set of training documents;
(b) a first child node stored in at least one region of the computer readable medium and associated with the root node in a parent-child relationship, the first child node having associated therewith a second list of probabilities assigned to individual terms found in a set of training documents; and
(c) a second child node stored in at least one region of the computer readable medium and associated with the root node in a parent-child relationship, the second child node having associated therewith a third list of probabilities assigned to individual terms found in a set of training documents.
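The data structure of claim 12 (a root node and two children, each carrying term probabilities) can be sketched as a small class. Field names and the sample probabilities are illustrative, not from the patent.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class TaxonomyNode:
    """One node of the binary tree: a mapping from terms found in the
    training documents to probabilities, plus optional child nodes."""
    term_probs: Dict[str, float]
    left: Optional["TaxonomyNode"] = None
    right: Optional["TaxonomyNode"] = None

root = TaxonomyNode(
    term_probs={"apple": 0.5, "car": 0.5},
    left=TaxonomyNode({"apple": 0.9, "fruit": 0.6}),
    right=TaxonomyNode({"car": 0.9, "engine": 0.6}),
)
```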
13. A computer readable medium having stored thereon a document comprising:
(a) a plurality of terms appearing in the document;
(b) metadata including a node indicator which indicates which node of a binary taxonomy tree is associated with the document, wherein each node of the binary taxonomy tree is associated with a term list and a term probability list. (Dependent claims 14 and 15 not shown.)
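Claim 13 describes a stored document carrying its terms plus a node indicator as metadata. A minimal sketch, with hypothetical field names (the patent does not prescribe a representation for the indicator):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TaggedDocument:
    """A document plus claim-13 metadata: its terms and a node indicator
    naming the associated node of the binary taxonomy tree."""
    terms: List[str]
    node_indicator: int   # illustrative: an id of the associated tree node

doc = TaggedDocument(terms=["apple", "pie"], node_indicator=7)
```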
16. A method comprising the steps of:
(a) creating a binary taxonomy tree based upon a set of training documents, such that each node of the binary taxonomy tree is associated with a list of terms, and each term in each list of terms is associated with a probability of that term appearing in a document given that node; and
(b) associating a new document with at least one node of the binary tree based upon a distance value between that document and the node. (Dependent claims 17-23 not shown.)
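The "distance value" of claim 16(b) can be sketched as the average negative log probability of the document's terms under a node's model, so that a smaller distance means a statistically more similar node. The metric is an assumption; the claim does not fix one.

```python
import math

def distance(doc_terms, term_probs, floor=1e-6):
    """Distance between a document and a node: average negative log
    probability of the document's terms under the node's term model."""
    terms = set(doc_terms)
    return -sum(math.log(term_probs.get(t, floor)) for t in terms) / len(terms)

node_a = {"apple": 0.9, "fruit": 0.8}   # illustrative node models
node_b = {"car": 0.9, "engine": 0.8}
doc = ["apple", "fruit"]
best = min([node_a, node_b], key=lambda n: distance(doc, n))
# best is node_a: the document's terms are far likelier under that node
```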
24. A computer readable medium having computer-executable instructions for performing steps comprising:
(a) accessing a document;
(b) based upon a first probability of a set of training terms appearing in the document, determining a first distance value between the document and a first of two sibling nodes;
(c) based upon a second probability of the set of training terms appearing in the document, determining a second distance value between the document and a second of two sibling nodes;
(d) if the first distance value is below a distance threshold, determining if two children nodes are associated with the first of two sibling nodes;
(e) if two children nodes are associated with the first of two sibling nodes, then determining a third distance value between the document and the first of the two children nodes, and determining a fourth distance value between the document and the second of the two children nodes; and
(f) if two children nodes are associated with the first of two sibling nodes, associating the document with at least one of the first and the second children nodes based upon the third distance value and the fourth distance value. - (Dependent claims 25-32 not shown.)
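Steps (a)-(f) amount to a top-down descent: measure the distance to each of the two siblings, and while the closer one is under the threshold and has children of its own, recurse into it. The sketch below assumes the same average-negative-log-probability distance as above and an illustrative threshold; node names and the dictionary layout are hypothetical.

```python
import math

def dist(doc, probs, floor=1e-6):
    terms = set(doc)
    return -sum(math.log(probs.get(t, floor)) for t in terms) / len(terms)

def classify(doc, node, threshold=5.0):
    """Steps (b)-(f): compare distances to the two sibling children,
    follow the closer one while it is under threshold and itself has
    children, and associate the document with the final node."""
    if node["children"] is None:
        return node
    first, second = node["children"]
    d1, d2 = dist(doc, first["probs"]), dist(doc, second["probs"])
    closer, d = (first, d1) if d1 <= d2 else (second, d2)
    if d < threshold and closer["children"] is not None:
        return classify(doc, closer, threshold)
    return closer

def leaf(name, probs):
    return {"name": name, "probs": probs, "children": None}

tree = {"name": "root", "probs": {}, "children": (
    {"name": "fruit", "probs": {"apple": 0.9, "pie": 0.7},
     "children": (leaf("apples", {"apple": 0.95, "pie": 0.5}),
                  leaf("pears", {"pear": 0.95}))},
    leaf("cars", {"car": 0.9, "engine": 0.8}),
)}

node = classify(["apple", "pie"], tree)   # descends root -> fruit -> apples
```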
33. A method comprising:
(a) receiving a training set of documents, each document comprising a list of terms;
(b) selecting a first set of training terms from at least a portion of the terms listed in the lists of terms;
(c) for each of the training terms, generating a first probability of the training term appearing in any document and associating that probability with a first node;
(d) for each of the training terms, generating a second probability of the training term appearing in any document and associating that probability with a second node;
(e) based on the first and second probabilities for each training term, associating each list of terms to at least one of the group consisting of the first node, the second node, and a null set;
(f) forming a second set of training terms from at least a portion of the terms listed in the lists of terms associated with the first node;
(g) for each of the training terms in the second set of training terms, generating a third probability of the training term appearing in any document and associating that probability with a third node;
(h) for each of the training terms in the second set of training terms, generating a fourth probability of the training term appearing in any document and associating that probability with a fourth node; and
(i) based on the third and fourth probabilities for each training term, associating each list of terms to at least one of the group consisting of the third node, the fourth node, and the null set. (Dependent claims 34-36 not shown.)
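Claim 33 unrolls two levels of the generate-then-assign procedure and, unlike the sketch for claim 1, exercises the null set: a term list that fits neither node well goes unassigned. In this sketch the log-likelihood cutoff for the null set is an invented illustration; the claim specifies no particular rule.

```python
import math
from collections import Counter

def estimate(docs):
    """Per-node term probabilities from document frequency (steps (c), (d))."""
    df = Counter()
    for d in docs:
        df.update(set(d))
    n = len(docs) or 1
    return {t: c / n for t, c in df.items()}

def assign(docs, node1, node2, floor=1e-6, null_cut=-10.0):
    """Step (e): associate each term list with node 1, node 2, or the
    null set; a document whose best log-likelihood falls below
    `null_cut` (an assumed cutoff) goes to the null set."""
    out = {1: [], 2: [], None: []}
    for d in docs:
        lls = {k: sum(math.log(p.get(t, floor)) for t in set(d))
               for k, p in ((1, node1), (2, node2))}
        best = max(lls, key=lls.get)
        out[best if lls[best] >= null_cut else None].append(d)
    return out

docs = [["apple", "fruit"], ["apple", "pie"], ["car", "engine"], ["car", "wheel"]]
node1, node2 = estimate(docs[:2]), estimate(docs[2:])        # steps (b)-(d)
groups = assign(docs + [["xyzzy", "plugh"]], node1, node2)   # step (e)
# steps (f)-(i) repeat the same two functions on groups[1]
```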
Specification