Method and apparatus for populating a predefined concept hierarchy or other hierarchical set of classified data items by minimizing system entropy
First Claim
1. A computer-implemented method for automating classification of a new data item when adding the new data item to an hierarchically organized hierarchy set of classified data items, wherein nodes of the hierarchy set correspond to classes of data items, said method comprising:
inputting into a computer system classes of documents comprising (i) a collection of classified documents within a hierarchy of classes and (ii) class labels associated with said collection of classified documents;
training said computer system using said collection of classified documents, wherein the training process comprises;
selecting, from an input set of said collection of classified documents, a set of tokens for use as classification attributes; and
building a dictionary comprising said classification attributes;
modeling a distribution of said classification attributes across said classes of documents by using a set of random variables and values associated with said set of random variables to represent said classification attributes in said dictionary;
calculating an initial entropy value at every node of each level of said hierarchy of classes using said set of random variables;
inputting into said computer system a new data item to be classified into one of said classes in said hierarchy of classes;
calculating entropy values for each of a plurality of possible classes into which said new data item could be classified;
comparing the calculated entropy values with said initial entropy value at every node of each level of said hierarchy of classes in order to create a plurality of conditional entropy values;
selecting a class having a lowest conditional entropy value; and
classifying said new data item in the selected class.
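The training steps of the claim (selecting tokens, building a dictionary, modeling their distribution across classes, and calculating an initial entropy value at every node) can be illustrated with a minimal Python sketch. The claim does not fix a tokenizer, a token-selection rule, or an entropy formula, so everything here is an assumption made for illustration: whitespace tokenization, a hypothetical `min_count` filter, token counts rolled up from each class to its ancestor nodes, and Shannon entropy of the resulting token distribution. All function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for token selection.
    return text.lower().split()


def build_dictionary(classified_docs, min_count=1):
    """Collect tokens seen at least min_count times across all classified documents."""
    counts = Counter(tok for _, text in classified_docs for tok in tokenize(text))
    return sorted(tok for tok, n in counts.items() if n >= min_count)


def node_token_counts(classified_docs, dictionary):
    """Aggregate dictionary-token counts at every node of the hierarchy.

    A class is a tuple path such as ("science", "physics"); a document
    contributes to its own class and to every ancestor node on that path.
    """
    vocab = set(dictionary)
    counts = {}
    for path, text in classified_docs:
        tokens = [tok for tok in tokenize(text) if tok in vocab]
        for depth in range(1, len(path) + 1):
            counts.setdefault(path[:depth], Counter()).update(tokens)
    return counts


def shannon_entropy(counter):
    """Shannon entropy (in bits) of the token distribution held in a Counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counter.values() if n > 0)


# Toy training collection: (class path, document text) pairs.
docs = [
    (("science", "physics"), "quantum energy particle"),
    (("science", "biology"), "cell gene protein"),
    (("sports",), "goal match team"),
]

dictionary = build_dictionary(docs)
counts = node_token_counts(docs, dictionary)
initial_entropy = {node: shannon_entropy(c) for node, c in counts.items()}
```

In this toy collection the parent node `("science",)` has a broader token distribution than either of its children, so its initial entropy is higher than theirs.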
Abstract
A system and method for automated populating of an existing concept hierarchy of items with new items, using entropy as a measure of the correctness of a potential classification. User-defined concept hierarchies include, for example, document hierarchies such as directories for the Internet, library catalogues, patent databases and journals, and product hierarchies. These concept hierarchies can be huge and are usually maintained manually. An Internet directory may have, for example, millions of Web sites, thousands of editors and hundreds of thousands of different categories. The method for populating a concept hierarchy includes calculating conditional ‘entropy’ values representing the randomness of distribution of classification attributes for the hierarchical set of classes if a new item is added to specific classes of the hierarchy and then selecting whichever class has the minimum randomness of distribution when calculated as a condition of insertion of the new data item.
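The selection rule described in the abstract can be sketched as follows: tentatively insert the new item into each candidate class, measure the randomness of the attribute distribution that results, and keep the class that yields the minimum. The abstract does not give a formula for the conditional entropy, so this sketch assumes one: the entropy of each class's token distribution, weighted by that class's share of all tokens, i.e. H(token | class). The function names and the flat two-class example are hypothetical.

```python
import math
from collections import Counter


def entropy(counter):
    """Shannon entropy (in bits) of a token-count distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counter.values() if n > 0)


def system_entropy(class_counts):
    """Assumed system measure: H(token | class), each class weighted by its token share."""
    grand_total = sum(sum(c.values()) for c in class_counts.values())
    if grand_total == 0:
        return 0.0
    return sum(
        (sum(c.values()) / grand_total) * entropy(c) for c in class_counts.values()
    )


def classify_by_min_entropy(class_counts, new_tokens):
    """Try the new item in every class and return the class with the lowest resulting entropy."""
    scores = {}
    for label in class_counts:
        # Copy the counts so each trial insertion leaves the trained model untouched.
        trial = {k: Counter(v) for k, v in class_counts.items()}
        trial[label].update(new_tokens)
        scores[label] = system_entropy(trial)
    best = min(scores, key=scores.get)
    return best, scores


# Toy token counts for two already-populated classes.
class_counts = {
    "physics": Counter({"quantum": 2, "energy": 2}),
    "biology": Counter({"cell": 2, "gene": 2}),
}

best, scores = classify_by_min_entropy(class_counts, ["quantum", "energy"])
print(best, scores)
```

Placing the item in `physics` reinforces tokens that class already contains, while placing it in `biology` spreads that class's distribution over new tokens, so the `physics` insertion gives the lower score. Subtracting a fixed pre-insertion baseline from every score would not change which class is selected.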
23 Claims
9. A computer program product comprising program code recorded on a machine-readable recording medium, for controlling the operation of a data processing apparatus on which the program code executes to perform a method for automating classification of a new data item when adding the new data item to an hierarchically organized hierarchy set of classified data items, wherein nodes of the hierarchy set correspond to classes of data items, said method comprising:
inputting into a computer system classes of documents comprising (i) a collection of classified documents within a hierarchy of classes and (ii) class labels associated with said collection of classified documents;
training said computer system using said collection of classified documents, wherein the training process comprises;
selecting, from an input set of said collection of classified documents, a set of tokens for use as classification attributes; and
building a dictionary comprising said classification attributes;
modeling a distribution of said classification attributes across said classes of documents by using a set of random variables and values associated with said set of random variables to represent said classification attributes in said dictionary;
calculating an initial entropy value at every node of each level of said hierarchy of classes using said set of random variables;
inputting into said computer system a new data item to be classified into one of said classes in said hierarchy of classes;
calculating entropy values for each of a plurality of possible classes into which said new data item could be classified;
comparing the calculated entropy values with said initial entropy value at every node of each level of said hierarchy of classes in order to create a plurality of conditional entropy values;
selecting a class having a lowest conditional entropy value; and
classifying said new data item in the selected class.
17. A data processing apparatus for automating classification of a new data item when adding the new data item to an hierarchically organized hierarchy set of classified data items, wherein nodes of the hierarchy set correspond to classes of data items, said data processing apparatus comprising a microprocessor adapted to:
input into a computer system classes of documents comprising (i) a collection of classified documents within a hierarchy of classes and (ii) class labels associated with said collection of classified documents;
train said computer system using said collection of classified documents, wherein the training of said computer system comprises;
selecting, from an input set of said collection of classified documents, a set of tokens for use as classification attributes; and
building a dictionary comprising said classification attributes;
model a distribution of said classification attributes across said classes of documents by using a set of random variables and values associated with said set of random variables to represent said classification attributes in said dictionary;
calculate an initial entropy value at every node of each level of said hierarchy of classes using said set of random variables;
input into said computer system a new data item to be classified into one of said classes in said hierarchy of classes;
calculate entropy values for each of a plurality of possible classes into which said new data item could be classified;
compare the calculated entropy values with said initial entropy value at every node of each level of said hierarchy of classes in order to create a plurality of conditional entropy values;
select a class having a lowest conditional entropy value; and
classify said new data item in the selected class.
Specification