Automated taxonomy generation
Abstract
In a hierarchical taxonomy of documents, the categories of information may be structured as a binary tree whose nodes contain information relevant to the search. The binary tree may be 'trained', or formed, by examining a training set of documents and separating those documents into two child nodes. Each of those sets of documents may then be further split into two nodes to create the binary tree data structure. The nodes may be generated so as to maximize the likelihood that each of the training documents falls into one or both of the two child nodes. In one example, each node of the binary tree is associated with a list of terms, and each term in each list of terms is associated with a probability of that term appearing in a document given that node. New documents may then be categorized by the nodes of the tree; for example, a new document may be assigned to a particular node based upon the statistical similarity between that document and the node.
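The per-node term lists described in the abstract can be sketched as a simple document-frequency estimate. This is a minimal illustration, assuming document frequency as the estimator; the patent does not specify how the probabilities are computed.

```python
from collections import Counter

def term_probabilities(documents):
    """For each term, estimate the probability that a document in this
    node contains the term (illustrative document-frequency estimate)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))   # count each term once per document
    return {term: count / n_docs for term, count in doc_freq.items()}

docs = [["apple", "fruit"], ["apple", "pie"], ["car", "engine"]]
probs = term_probabilities(docs)    # e.g. "apple" appears in 2 of 3 documents
```

A real system would smooth these estimates so that unseen terms do not get probability zero.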
36 Claims
1. A computer readable medium having computer-executable components comprising:
(a) a node generator constructed to receive a list of training terms based on a set of training documents, and to generate a first sibling node comprising a first set of probabilities, and to generate a second sibling node comprising a second set of probabilities, the first set of probabilities comprising, for each term in the list of training terms, a probability of the term appearing in a document, and the second set of probabilities comprising, for each term in the list of training terms, a probability of the term appearing in a document;
(b) a document assigner constructed to associate, based on the first and second set of probabilities, each document of the set of training documents to at least one of a group consisting of the first sibling node, the second sibling node, and a null set, the documents associated with the first sibling node forming a first document set and the documents associated with the second sibling node forming a second document set; and
(c) a tree manager constructed to communicate at least one of the first document set and the second document set to the node generator to create a binary tree data structure comprising a hierarchy of a plurality of sibling nodes based on recursive performance of the node generator and the document assigner. (Dependent claims 2-11 not shown.)
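The three components of claim 1 can be sketched as plain functions: a node generator that produces two sibling term-probability models, a document assigner that routes each training document to the likelier sibling, and a tree manager that recurses. Everything below is an illustrative assumption: the half-and-half initialization, the log-likelihood rule, and the depth cutoff are not fixed by the claim (which also permits a null-set assignment, unused here).

```python
import math
from collections import Counter

def estimate(docs):
    """Term-probability model for one node (document frequency)."""
    df = Counter()
    for d in docs:
        df.update(set(d))
    n = max(1, len(docs))
    return {t: c / n for t, c in df.items()}

def node_generator(docs):
    """Generate two sibling nodes from a crude initial split:
    first half vs. second half of the training set."""
    half = max(1, len(docs) // 2)
    return estimate(docs[:half]), estimate(docs[half:])

def log_likelihood(doc, probs, floor=1e-6):
    return sum(math.log(probs.get(t, floor)) for t in set(doc))

def document_assigner(docs, left, right):
    """Associate each document with the likelier sibling node."""
    left_set, right_set = [], []
    for d in docs:
        (left_set if log_likelihood(d, left) >= log_likelihood(d, right)
         else right_set).append(d)
    return left_set, right_set

def tree_manager(docs, depth=0, max_depth=2):
    """Recursively apply node_generator and document_assigner
    to build the binary tree data structure."""
    if depth == max_depth or len(docs) < 2:
        return {"docs": docs, "children": None}
    left, right = node_generator(docs)
    ldocs, rdocs = document_assigner(docs, left, right)
    if not ldocs or not rdocs:              # degenerate split: stop here
        return {"docs": docs, "children": None}
    return {"docs": docs,
            "children": (tree_manager(ldocs, depth + 1, max_depth),
                         tree_manager(rdocs, depth + 1, max_depth))}

docs = [["apple", "fruit"], ["apple", "pie"], ["car", "engine"], ["car", "wheel"]]
tree = tree_manager(docs)
```

A production system would iterate the generate/assign steps to convergence (EM-style) rather than splitting once.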
12. A computer readable medium having stored thereon a binary tree data structure comprising:
(a) a root node stored in at least one region of the computer readable medium having associated therewith a first list of probabilities assigned to individual terms found in a set of training documents;
(b) a first child node stored in at least one region of the computer readable medium and associated with the root node in a parent-child relationship, the first child node having associated therewith a second list of probabilities assigned to individual terms found in a set of training documents; and
(c) a second child node stored in at least one region of the computer readable medium and associated with the root node in a parent-child relationship, the second child node having associated therewith a third list of probabilities assigned to individual terms found in a set of training documents.
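The data structure of claim 12 (a root node and two children, each carrying term probabilities) can be sketched as a small class. Field names and the sample probabilities are illustrative, not from the patent.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class TaxonomyNode:
    """One node of the binary tree: a mapping from terms found in the
    training documents to probabilities, plus optional child nodes."""
    term_probs: Dict[str, float]
    left: Optional["TaxonomyNode"] = None
    right: Optional["TaxonomyNode"] = None

root = TaxonomyNode(
    term_probs={"apple": 0.5, "car": 0.5},
    left=TaxonomyNode({"apple": 0.9, "fruit": 0.6}),
    right=TaxonomyNode({"car": 0.9, "engine": 0.6}),
)
```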
13. A computer readable medium having stored thereon a document comprising:
(a) a plurality of terms appearing in the document;
(b) metadata including a node indicator which indicates which node of a binary taxonomy tree is associated with the document, wherein each node of the binary taxonomy tree is associated with a term list and a term probability list. (Dependent claims 14 and 15 not shown.)
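Claim 13 describes a stored document carrying its terms plus a node indicator as metadata. A minimal sketch, with hypothetical field names (the patent does not prescribe a representation for the indicator):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TaggedDocument:
    """A document plus claim-13 metadata: its terms and a node indicator
    naming the associated node of the binary taxonomy tree."""
    terms: List[str]
    node_indicator: int   # illustrative: an id of the associated tree node

doc = TaggedDocument(terms=["apple", "pie"], node_indicator=7)
```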
16. A method comprising the steps of:
(a) creating a binary taxonomy tree based upon a set of training documents, such that each node of the binary taxonomy tree is associated with a list of terms, and each term in each list of terms is associated with a probability of that term appearing in a document given that node; and
(b) associating a new document with at least one node of the binary tree based upon a distance value between that document and the node. (Dependent claims 17-23 not shown.)
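The "distance value" of claim 16(b) can be sketched as the average negative log probability of the document's terms under a node's model, so that a smaller distance means a statistically more similar node. The metric is an assumption; the claim does not fix one.

```python
import math

def distance(doc_terms, term_probs, floor=1e-6):
    """Distance between a document and a node: average negative log
    probability of the document's terms under the node's term model."""
    terms = set(doc_terms)
    return -sum(math.log(term_probs.get(t, floor)) for t in terms) / len(terms)

node_a = {"apple": 0.9, "fruit": 0.8}   # illustrative node models
node_b = {"car": 0.9, "engine": 0.8}
doc = ["apple", "fruit"]
best = min([node_a, node_b], key=lambda n: distance(doc, n))
# best is node_a: the document's terms are far likelier under that node
```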
24. A computer readable medium having computer-executable instructions for performing steps comprising:
(a) accessing a document;
(b) based upon a first probability of a set of training terms appearing in the document, determining a first distance value between the document and a first of two sibling nodes;
(c) based upon a second probability of the set of training terms appearing in the document, determining a second distance value between the document and a second of two sibling nodes;
(d) if the first distance value is below a distance threshold, determining if two children nodes are associated with the first of two sibling nodes;
(e) if two children nodes are associated with the first of two sibling nodes, then determining a third distance value between the document and the first of the two children nodes, and determining a fourth distance value between the document and the second of the two children nodes; and
(f) if two children nodes are associated with the first of two sibling nodes, associating the document with at least one of the first and the second children nodes based upon the third distance value and the fourth distance value. - (Dependent claims 25-32 not shown.)
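Steps (a)-(f) amount to a top-down descent: measure the distance to each of the two siblings, and while the closer one is under the threshold and has children of its own, recurse into it. The sketch below assumes the same average-negative-log-probability distance as above and an illustrative threshold; node names and the dictionary layout are hypothetical.

```python
import math

def dist(doc, probs, floor=1e-6):
    terms = set(doc)
    return -sum(math.log(probs.get(t, floor)) for t in terms) / len(terms)

def classify(doc, node, threshold=5.0):
    """Steps (b)-(f): compare distances to the two sibling children,
    follow the closer one while it is under threshold and itself has
    children, and associate the document with the final node."""
    if node["children"] is None:
        return node
    first, second = node["children"]
    d1, d2 = dist(doc, first["probs"]), dist(doc, second["probs"])
    closer, d = (first, d1) if d1 <= d2 else (second, d2)
    if d < threshold and closer["children"] is not None:
        return classify(doc, closer, threshold)
    return closer

def leaf(name, probs):
    return {"name": name, "probs": probs, "children": None}

tree = {"name": "root", "probs": {}, "children": (
    {"name": "fruit", "probs": {"apple": 0.9, "pie": 0.7},
     "children": (leaf("apples", {"apple": 0.95, "pie": 0.5}),
                  leaf("pears", {"pear": 0.95}))},
    leaf("cars", {"car": 0.9, "engine": 0.8}),
)}

node = classify(["apple", "pie"], tree)   # descends root -> fruit -> apples
```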
33. A method comprising:
(a) receiving a training set of documents, each document comprising a list of terms;
(b) selecting a first set of training terms from at least a portion of the terms listed in the lists of terms;
(c) for each of the training terms, generating a first probability of the training term appearing in any document and associating that probability with a first node;
(d) for each of the training terms, generating a second probability of the training term appearing in any document and associating that probability with a second node;
(e) based on the first and second probabilities for each training term, associating each list of terms to at least one of the group consisting of the first node, the second node, and a null set;
(f) forming a second set of training terms from at least a portion of the terms listed in the lists of terms associated with the first node;
(g) for each of the training terms in the second set of training terms, generating a third probability of the training term appearing in any document and associating that probability with a third node;
(h) for each of the training terms in the second set of training terms, generating a fourth probability of the training term appearing in any document and associating that probability with a fourth node; and
(i) based on the third and fourth probabilities for each training term, associating each list of terms to at least one of the group consisting of the third node, the fourth node, and the null set. (Dependent claims 34-36 not shown.)
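Claim 33 unrolls two levels of the generate-then-assign procedure and, unlike the sketch for claim 1, exercises the null set: a term list that fits neither node well goes unassigned. In this sketch the log-likelihood cutoff for the null set is an invented illustration; the claim specifies no particular rule.

```python
import math
from collections import Counter

def estimate(docs):
    """Per-node term probabilities from document frequency (steps (c), (d))."""
    df = Counter()
    for d in docs:
        df.update(set(d))
    n = len(docs) or 1
    return {t: c / n for t, c in df.items()}

def assign(docs, node1, node2, floor=1e-6, null_cut=-10.0):
    """Step (e): associate each term list with node 1, node 2, or the
    null set; a document whose best log-likelihood falls below
    `null_cut` (an assumed cutoff) goes to the null set."""
    out = {1: [], 2: [], None: []}
    for d in docs:
        lls = {k: sum(math.log(p.get(t, floor)) for t in set(d))
               for k, p in ((1, node1), (2, node2))}
        best = max(lls, key=lls.get)
        out[best if lls[best] >= null_cut else None].append(d)
    return out

docs = [["apple", "fruit"], ["apple", "pie"], ["car", "engine"], ["car", "wheel"]]
node1, node2 = estimate(docs[:2]), estimate(docs[2:])        # steps (b)-(d)
groups = assign(docs + [["xyzzy", "plugh"]], node1, node2)   # step (e)
# steps (f)-(i) repeat the same two functions on groups[1]
```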
Specification