Automated taxonomy generation
Abstract
In a hierarchical taxonomy of documents, the categories of information may be structured as a binary tree whose nodes contain information relevant to the search. The binary tree may be 'trained', or formed, by examining a training set of documents and separating those documents into two child nodes. Each of those document sets may then be further split into two nodes to create the binary tree data structure. The nodes may be generated to maximize the likelihood that each training document falls into either or both of the two child nodes. In one example, each node of the binary tree is associated with a list of terms, and each term in each list of terms is associated with a probability of that term appearing in a document given that node. New documents may be categorized by the nodes of the tree; for example, a new document may be assigned to a particular node based upon the statistical similarity between that document and the associated node.
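Purely as an illustration of the idea in the abstract (the patent does not prescribe any particular model or code), a node's "list of terms with probabilities" can be read as a Bernoulli term model, and "statistical similarity" as likelihood under that model. The node contents (`node_sports`, `node_finance`) and probability values below are invented for the example:

```python
import math

# Hypothetical nodes: each maps a training term to the probability of
# that term appearing in a document associated with the node.
node_sports = {"game": 0.9, "score": 0.8, "market": 0.1}
node_finance = {"game": 0.1, "score": 0.2, "market": 0.9}

def log_likelihood(doc_terms, node):
    """Log-probability of a document's terms under a node's term model."""
    total = 0.0
    for term, p in node.items():
        # Bernoulli model: the term either appears in the document or not.
        total += math.log(p if term in doc_terms else 1.0 - p)
    return total

def assign(doc_terms, left, right):
    """Assign a document to whichever sibling node explains it better."""
    if log_likelihood(doc_terms, left) >= log_likelihood(doc_terms, right):
        return "left"
    return "right"

print(assign({"game", "score"}, node_sports, node_finance))  # left
print(assign({"market"}, node_sports, node_finance))         # right
```

A document could equally be assigned to both nodes (or neither) by thresholding each likelihood instead of comparing them, which matches the abstract's "either or both" language.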
30 Claims
1. A computer readable medium having computer-executable components comprising:

(a) a node generator constructed to receive a list of training terms based on a set of training documents, and to generate a first sibling node comprising a first set of probabilities, and to generate a second sibling node comprising a second set of probabilities, the first set of probabilities comprising, for each term in the list of training terms, a probability of the term appearing in a document, and the second set of probabilities comprising, for each term in the list of training terms, a probability of the term appearing in a document, wherein the first sibling node and the second sibling node are generated by dividing from a parent node;

(b) a document assigner constructed to associate, based on the first and second set of probabilities, each document of the set of training documents to at least one of a group consisting of the first sibling node, the second sibling node, and a null set, the documents associated with the first sibling node forming a first document set and the documents associated with the second sibling node forming a second document set; and

(c) a tree manager constructed to communicate at least one of the first document set and the second document set to the node generator to create a binary tree data structure comprising a hierarchy of a plurality of sibling nodes based on recursive performance of the node generator and the document assigner, and

(d) a document sorter constructed to associate a new document to at least one node of the plurality of sibling nodes based on respective sets of probabilities associated with the nodes, and

wherein the tree manager stores the binary tree data structure for access by the document sorter.

Dependent claims: 2-10.
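The three training components of claim 1 (node generator, document assigner, tree manager) imply a recursive structure that can be sketched as follows. This is a deliberately naive illustration: the seed-overlap splitter stands in for the claimed probabilistic division, and all names (`estimate_probs`, `split`, `build_tree`, `min_size`) are invented for the example:

```python
def estimate_probs(docs, terms):
    """Node generator (simplified): fraction of documents containing each
    term, serving as the node's set of per-term probabilities."""
    n = max(len(docs), 1)
    return {t: sum(t in d for d in docs) / n for t in terms}

def split(docs, terms):
    """Document assigner (simplified): seed two sibling nodes with the
    first and last documents, then attach each document to the more
    similar seed by term overlap."""
    left_seed, right_seed = docs[0], docs[-1]
    left, right = [], []
    for d in docs:
        (left if len(d & left_seed) >= len(d & right_seed) else right).append(d)
    return left, right

def build_tree(docs, terms, min_size=2):
    """Tree manager: recursively feed each document set back to the node
    generator until nodes are small, yielding a binary tree."""
    node = {"probs": estimate_probs(docs, terms), "children": None}
    if len(docs) > min_size:
        left, right = split(docs, terms)
        if left and right and len(left) < len(docs) and len(right) < len(docs):
            node["children"] = (build_tree(left, terms, min_size),
                                build_tree(right, terms, min_size))
    return node

docs = [{"game", "score"}, {"game", "team"}, {"market", "stock"}, {"market", "bond"}]
terms = {"game", "score", "team", "market", "stock", "bond"}
tree = build_tree(docs, terms)  # root with two leaf children
```

The stored `tree` is what the claimed document sorter would later consult when associating a new document with a node.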
11. A computer implemented method comprising the steps of:

(a) creating a binary taxonomy tree based upon a set of training documents, such that each node of the binary taxonomy tree is associated with a list of terms, and each term in each list of terms is associated with a probability of that term appearing in a document given that node, wherein a root node is first created and is then used to create child nodes of the binary taxonomy tree, wherein the child nodes are created by division from their respective parent nodes, and wherein the binary taxonomy tree is stored for associating new documents with nodes of the binary taxonomy tree; and

(b) associating a new document with at least one node of the binary tree based upon a distance value between that document and the node.

Dependent claims: 12-18.
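Claim 11 leaves the "distance value" of step (b) open. One plausible, purely illustrative choice is the negative average log-likelihood of the document under a node's per-term probabilities (lower means closer); the node contents below are invented for the example:

```python
import math

def distance(doc_terms, node_probs):
    """Illustrative distance value: negative average log-likelihood of
    the document under the node's Bernoulli term model."""
    eps = 1e-9  # guard against log(0) for degenerate probabilities
    nll = 0.0
    for term, p in node_probs.items():
        p = min(max(p, eps), 1.0 - eps)
        nll -= math.log(p if term in doc_terms else 1.0 - p)
    return nll / len(node_probs)

sports = {"game": 0.9, "score": 0.8, "market": 0.1}
finance = {"game": 0.1, "score": 0.2, "market": 0.9}
doc = {"game", "score"}
# The document is closer (smaller distance) to the sports-like node.
print(distance(doc, sports) < distance(doc, finance))  # True
```

Any other statistical dissimilarity (e.g. a divergence between term distributions) would satisfy the same step; the claim does not fix a formula.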
19. A computer readable medium having computer-executable instructions for performing steps comprising:

(a) accessing a document;

(b) based upon a first probability of a set of training terms appearing in the document, determining a first distance value between the document and a first of two sibling nodes;

(c) based upon a second probability of the set of training terms appearing in the document, determining a second distance value between the document and a second of two sibling nodes, wherein the two sibling nodes are created by division from a parent node;

(d) if the first distance value is below a distance threshold, determining if two children nodes are associated with the first of two sibling nodes;

(e) if two children nodes are associated with the first of two sibling nodes, then determining a third distance value between the document and the first of the two children nodes, and determining a fourth distance value between the document and the second of the two children nodes; and

(f) if two children nodes are associated with the first of two sibling nodes, associating the document with at least one of the first and the second children nodes based upon the third distance value and the fourth distance value, wherein a tree representation of the parent node, the two sibling nodes, and the two children nodes are stored in memory for use in classifying; and

(g) classifying a new document into one of the two sibling nodes or one of the two children nodes based on the terms in the new document and the set of probabilities associated with each node in the tree representation.

Dependent claims: 20-27.
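Steps (b)-(g) describe a top-down walk of the stored tree: compare the document against two sibling nodes, and if the closer one is within a distance threshold and has children, repeat the comparison one level down. A minimal sketch under assumed choices (the per-term negative log-likelihood distance, the hand-built tree, and the `label` fields are all invented for the example):

```python
import math

def distance(doc_terms, probs):
    """Illustrative per-term negative log-likelihood (lower = closer)."""
    nll = 0.0
    for term, p in probs.items():
        p = min(max(p, 1e-9), 1.0 - 1e-9)
        nll -= math.log(p if term in doc_terms else 1.0 - p)
    return nll / len(probs)

def classify(doc_terms, node, threshold):
    """Steps (b)-(g), paraphrased: pick the closer of two sibling nodes;
    descend into its children only while the distance stays below the
    threshold, then return the node the document is associated with."""
    if node["children"] is None:
        return node
    left, right = node["children"]
    d_l = distance(doc_terms, left["probs"])
    d_r = distance(doc_terms, right["probs"])
    best, d = (left, d_l) if d_l <= d_r else (right, d_r)
    if d < threshold and best["children"] is not None:
        return classify(doc_terms, best, threshold)
    return best

# Hypothetical two-level tree: a sports-like node (split again) and a
# finance-like node under a common parent.
tree = {"probs": {}, "children": (
    {"probs": {"game": 0.9, "market": 0.1},
     "children": (
         {"probs": {"game": 0.9, "score": 0.9}, "children": None, "label": "results"},
         {"probs": {"game": 0.9, "trade": 0.9}, "children": None, "label": "transfers"})},
    {"probs": {"game": 0.1, "market": 0.9}, "children": None, "label": "finance"},
)}

leaf = classify({"game", "score"}, tree, threshold=1.0)
print(leaf["label"])  # results
```

The threshold test in step (d) is what stops the descent early, so a document that matches a broad node only weakly stays associated with that broader category.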
28. A computer implemented method comprising:

(a) receiving a training set of documents, each document comprising a list of terms;

(b) selecting a first set of training terms from at least a portion of the terms listed in the lists of terms;

(c) for each of the training terms, generating a first probability of the training term appearing in any document and associating that probability with a first node;

(d) for each of the training terms, generating a second probability of the training term appearing in any document and associating that probability with a second node;

(e) based on the first and second probabilities for each training term, associating each list of terms to at least one of the group consisting of the first node, the second node, and a null set;

(f) forming a second set of training terms from at least a portion of the terms listed in the lists of terms associated with the first node;

(g) for each of the training terms in the second set of training terms, generating a third probability of the training term appearing in any document and associating that probability with a third node, wherein the third node is generated by dividing the first node;

(h) for each of the training terms in the second set of training terms, generating a fourth probability of the training term appearing in any document and associating that probability with a fourth node, wherein the fourth node is generated by dividing the second node;

(i) based on the third and fourth probabilities for each training term, associating each list of terms to at least one of the group consisting of the third node, the fourth node, and the null set, storing in a memory, a tree representation of the first node, the second node, the third node, and the fourth node, and

(j) classifying a new document into one of the first node, second node, third node, or fourth node based on the terms in the new document and the set of probabilities associated with each node in the tree representation.

Dependent claims: 29, 30.
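Steps (c)-(e) amount to fitting two per-term probability profiles and then reassigning each document to the better-fitting node. One illustrative way to realize that (the claim does not prescribe an algorithm) is a hard-EM-style loop that alternates the two steps; the function name, iteration count, and Laplace smoothing are all assumptions of this sketch:

```python
import math

def fit_two_nodes(docs, terms, iters=5):
    """Alternate between steps (c)/(d) -- estimating each node's per-term
    probabilities from its current members -- and step (e) -- reassigning
    each document to the better-fitting node. Illustrative only."""
    labels = [i % 2 for i in range(len(docs))]  # arbitrary initial split
    probs = []
    for _ in range(iters):
        probs = []
        for node in (0, 1):
            members = [d for d, a in zip(docs, labels) if a == node] or docs
            # Laplace smoothing keeps every probability strictly in (0, 1).
            probs.append({t: (sum(t in d for d in members) + 1) /
                             (len(members) + 2) for t in terms})

        def ll(d, p):
            return sum(math.log(p[t]) if t in d else math.log(1.0 - p[t])
                       for t in terms)

        labels = [0 if ll(d, probs[0]) >= ll(d, probs[1]) else 1 for d in docs]
    return labels, probs

docs = [{"game", "score"}, {"market", "stock"}, {"game", "team"}, {"market", "bond"}]
terms = {"game", "score", "team", "market", "stock", "bond"}
labels, probs = fit_two_nodes(docs, terms)
# Documents sharing "game" settle in one node, "market" in the other.
print(labels)
```

Steps (f)-(i) would then repeat the same fit on the documents of one node with a fresh term set, producing the third and fourth nodes of the tree.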
Specification