Method and apparatus for populating a predefined concept hierarchy or other hierarchical set of classified data items by minimizing system entropy
First Claim
1. A method for automating classification of a new data item when adding the new data item to an hierarchically organized set of classified data items, wherein nodes of the hierarchy correspond to classes of data items, the method comprising the steps of:
for a new data item requiring classification within the set of classified data items, identifying classification attributes of the new data item by reference to a set of classification attributes for the set of classified data items;
calculating a conditional value representative of the randomness of distribution of classification attributes for data items within the set of classified data items, which value is conditional on the new data item being added to a first class at a particular level of the hierarchy, and repeating the step of calculating a conditional value for each sibling class of the first class at said particular level of the hierarchy wherein said conditional values are each conditional on adding the new data item to a different respective one of said sibling classes at said particular level of the hierarchy;
comparing the conditional values to identify the lowest conditional value; and
selecting the class having the lowest conditional value for classifying the new data item.
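The claim does not fix a formula for the "conditional value". One plausible reading, suggested by the title's reference to entropy, is Shannon entropy over the attribute distribution of each class. The symbols below (A for the attribute set, C for the sibling classes at one level, S for the items under those classes, x for the new item) are illustrative assumptions, not notation taken from the patent.

```latex
% Entropy of the attribute distribution inside one class c
H(c) = -\sum_{a \in A} p(a \mid c)\,\log_2 p(a \mid c)

% System entropy: size-weighted sum over the sibling classes
H(S) = \sum_{c \in C} \frac{|c|}{|S|}\, H(c)

% Conditional value for tentatively adding x to sibling class c_i,
% and the class selected by the final two steps of the claim
H(S \mid x \to c_i) = H\bigl(S \text{ with } x \text{ added to } c_i\bigr),
\qquad
c^{*} = \arg\min_{c_i \in C} H(S \mid x \to c_i)
```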
Abstract
Disclosed are a system and method for automatically populating an existing concept hierarchy of items with new items, using entropy as a measure of the correctness of a potential classification. User-defined concept hierarchies include, for example, document hierarchies such as directories for the Internet (such as Yahoo), library catalogues, patent databases and journals, and product hierarchies. These concept hierarchies can be huge and are usually maintained manually. An Internet directory may have, for example, millions of Web sites, thousands of editors and hundreds of thousands of different categories. The method for populating a concept hierarchy includes calculating conditional "entropy" values, each representing the randomness of distribution of classification attributes for the hierarchical set of classes if a new item is added to a specific class of the hierarchy, and then selecting whichever class gives the minimum randomness of distribution when calculated as a condition of insertion of the new data item.
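The selection rule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: items are modelled as lists of attribute strings, the "randomness" measure is taken to be Shannon entropy of the pooled attribute counts in each class, and the system value is a size-weighted sum over sibling classes. The function names `class_entropy`, `system_entropy` and `classify` are hypothetical.

```python
import math
from collections import Counter


def class_entropy(items):
    """Shannon entropy (bits) of the attribute counts pooled over a class's items."""
    counts = Counter(attr for item in items for attr in item)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def system_entropy(classes):
    """Size-weighted sum of per-class entropies (an assumed aggregation rule)."""
    n_items = sum(len(items) for items in classes.values())
    if n_items == 0:
        return 0.0
    return sum(
        len(items) / n_items * class_entropy(items) for items in classes.values()
    )


def classify(new_item, classes):
    """Try the new item in each sibling class and keep the lowest conditional value."""
    conditional = {}
    for name in classes:
        # Build a trial copy in which only the candidate class receives the item.
        trial = {
            k: (v + [new_item] if k == name else v) for k, v in classes.items()
        }
        conditional[name] = system_entropy(trial)
    best = min(conditional, key=conditional.get)
    return best, conditional


if __name__ == "__main__":
    siblings = {
        "sports": [["ball", "goal"], ["ball", "team"]],
        "finance": [["stock", "bond"], ["stock", "market"]],
    }
    best, values = classify(["ball", "team"], siblings)
    print(best, values)
```

Recomputing the full system value for every candidate keeps the sketch short; a practical system would more likely update only the one class whose contents change.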
108 Citations
23 Claims
-
1. A method for automating classification of a new data item when adding the new data item to an hierarchically organized set of classified data items, wherein nodes of the hierarchy correspond to classes of data items, the method comprising the steps of:
for a new data item requiring classification within the set of classified data items, identifying classification attributes of the new data item by reference to a set of classification attributes for the set of classified data items;
calculating a conditional value representative of the randomness of distribution of classification attributes for data items within the set of classified data items, which value is conditional on the new data item being added to a first class at a particular level of the hierarchy, and repeating the step of calculating a conditional value for each sibling class of the first class at said particular level of the hierarchy wherein said conditional values are each conditional on adding the new data item to a different respective one of said sibling classes at said particular level of the hierarchy;
comparing the conditional values to identify the lowest conditional value; and
selecting the class having the lowest conditional value for classifying the new data item.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
-
18. A method for automating classification of a new data item when adding the new data item to an hierarchical set of classified data items in which each data item of the hierarchical set is associated with a specific class of data items, the class corresponding to a specific node of the hierarchy, the method comprising the steps of:
-
creating a set of classification attributes for the set of classified data items, such that each data item in the set may be classified by reference to classification attributes of the data item and classification attributes of classes of data items in the hierarchy;
for a new data item requiring classification within the set of classified data items, identifying classification attributes of the new data item by reference to the set of classification attributes;
calculating a conditional value representative of the randomness of distribution of classification attributes for data items within the set of classified data items, which value is conditional on the new data item being added to a first class at a particular level of the hierarchy, and repeating the step of calculating a conditional value for each sibling class of the first class at said particular level of the hierarchy wherein said conditional values are each conditional on adding the new data item to a different respective one of said sibling classes at said particular level of the hierarchy;
comparing the conditional values to identify the lowest conditional value; and
selecting the class having the lowest conditional value for classifying the new data item.
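The first two steps of claim 18 (creating a set of classification attributes, then identifying the attributes of a new item by reference to that set) might look like the sketch below. It assumes, purely for illustration, that data items are short texts and that attributes are lowercased words; the claim itself does not say what an attribute is. The names `build_attribute_set`, `identify_attributes` and the `min_count` parameter are hypothetical.

```python
from collections import Counter


def build_attribute_set(classes, min_count=1):
    """Collect the words seen at least min_count times in already classified items."""
    counts = Counter(
        word
        for items in classes.values()
        for text in items
        for word in text.lower().split()
    )
    return {word for word, n in counts.items() if n >= min_count}


def identify_attributes(text, attribute_set):
    """Keep only the words of the new item that belong to the attribute set."""
    return [word for word in text.lower().split() if word in attribute_set]


if __name__ == "__main__":
    classified = {
        "sports": ["Ball team goal"],
        "finance": ["stock market bond"],
    }
    attributes = build_attribute_set(classified)
    print(identify_attributes("The TEAM lost the ball", attributes))
```

Words outside the attribute set are dropped here, so the new item is described only in terms the existing hierarchy already uses.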
-
-
19. A method for verifying the classification of a current data item within an hierarchical set of classified data items, wherein nodes of the hierarchy correspond to classes of data items and data items within the classes each have one or more classification attributes, and wherein the current item has been classified within a first class corresponding to a node within the hierarchical set of classified data items, the method comprising the steps of:
-
calculating an initial value representative of the randomness of distribution of classification attributes for the set of classified data items;
calculating a conditional value representative of the randomness of distribution of classification attributes for data items within the set of classified data items, which value is conditional on the current data item being classified within a respective one of the set of sibling classes of the first class as an alternative to the first class, wherein sibling classes comprise the set of immediate descendant classes of a common ancestor class in the hierarchy;
repeating the step of calculating a conditional value respectively for any other ones of said sibling classes, wherein said conditional values are each conditional on classifying the current data item within a different respective one of said sibling classes; and
checking whether the initial value representative of the randomness of distribution of classification attributes is a lower value than each of the conditional values.
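The verification steps of claim 19 can be sketched as follows. As before, this is an assumption-laden illustration rather than the patented method: Shannon entropy stands in for the "value representative of the randomness of distribution", items are lists of attribute strings, and the function name `verify_classification` is hypothetical.

```python
import math
from collections import Counter


def class_entropy(items):
    """Shannon entropy (bits) of the attribute counts pooled over a class's items."""
    counts = Counter(attr for item in items for attr in item)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def system_entropy(classes):
    """Size-weighted sum of per-class entropies (an assumed aggregation rule)."""
    n_items = sum(len(items) for items in classes.values())
    if n_items == 0:
        return 0.0
    return sum(
        len(items) / n_items * class_entropy(items) for items in classes.values()
    )


def verify_classification(item, current, classes):
    """Check that the current placement beats every alternative sibling class."""
    initial = system_entropy(classes)
    conditional = {}
    for sibling in classes:
        if sibling == current:
            continue
        # Copy the member lists, then move the item to the alternative sibling.
        trial = {name: list(members) for name, members in classes.items()}
        trial[current].remove(item)
        trial[sibling].append(item)
        conditional[sibling] = system_entropy(trial)
    ok = all(initial < value for value in conditional.values())
    return ok, initial, conditional


if __name__ == "__main__":
    siblings = {
        "sports": [["ball", "goal"], ["ball", "team"], ["stock", "bond"]],
        "finance": [["stock", "market"], ["stock", "bond"]],
    }
    print(verify_classification(["stock", "bond"], "sports", siblings))
```

A `False` result flags the item as a candidate for reclassification; the claim itself only requires the comparison, not any particular follow-up action.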
-
-
20. A computer program product comprising program code recorded on a machine-readable recording medium, for controlling the operation of a data processing apparatus on which the program code executes to perform a method for automated classification of a new data item within an hierarchically organized set of classified data items in which nodes of the hierarchy correspond to classes of data items, wherein the program code comprises:
-
program code for identifying classification attributes of a new data item which requires classification within the set of data items, by reference to a set of classification attributes for the set of classified data items;
program code for calculating a conditional value representative of the randomness of distribution of classification attributes for data items within the set of classified data items, which value is conditional on the new data item being added to a first class at a particular level of the hierarchy, and repeating the step of calculating a conditional value for each sibling class of the first class at said particular level of the hierarchy wherein said conditional values are each conditional on adding the new data item to a different respective one of said sibling classes at said particular level of the hierarchy;
program code for comparing the conditional values to identify the lowest conditional value; and
program code for selecting the class having the lowest conditional value for classifying the new data item.
-
-
21. A data processing apparatus comprising:
-
storage means for storing a set of data items and information defining an hierarchical organization of the set of data items within classes, wherein nodes of the hierarchy correspond to classes of data items;
storage means for storing a set of classification attributes for the set of data items;
means for identifying classification attributes of a new data item which requires classification within the set of data items, by reference to a set of classification attributes for the set of data items;
means for calculating a conditional value representative of the randomness of distribution of classification attributes for data items within the set of data items, which value is conditional on the new data item being added to a first class at a particular level of the hierarchy, and repeating the step of calculating a conditional value for each sibling class of the first class at said particular level of the hierarchy wherein said conditional values are each conditional on adding the new data item to a different respective one of said sibling classes at said particular level of the hierarchy;
means for comparing the conditional values to identify the lowest conditional value; and
means for selecting the class having the lowest conditional value for classifying the new data item.
-
-
22. A directory server computer comprising:
-
storage means for storing a set of data items and information defining an hierarchical organization of the set of data items within classes, wherein nodes of the hierarchy correspond to classes of data items;
storage means for storing a set of classification attributes for the set of data items;
means for identifying classification attributes of a new data item which requires classification within the set of data items, by reference to a set of classification attributes for the set of data items;
means for calculating a conditional value representative of the randomness of distribution of classification attributes for data items within the set of data items, which value is conditional on the new data item being added to a first class at a particular level of the hierarchy, and repeating the step of calculating a conditional value for each sibling class of the first class at said particular level of the hierarchy wherein said conditional values are each conditional on adding the new data item to a different respective one of said sibling classes at said particular level of the hierarchy;
means for comparing the conditional values to identify the lowest conditional value; and
means for selecting the class having the lowest conditional value for classifying the new data item.
-
-
23. An automated classifier for classifying a new data item when adding the new data item to an hierarchically organized set of classified data items in which nodes of the hierarchy correspond to classes of data items, the classifier comprising:
-
means for identifying classification attributes of a new data item which requires classification within the set of data items, by reference to a set of classification attributes for the set of classified data items;
means for calculating a conditional value representative of the randomness of distribution of classification attributes for data items within the set of classified data items, which value is conditional on the new data item being added to a first class at a particular level of the hierarchy, and repeating the step of calculating a conditional value for each sibling class of the first class at said particular level of the hierarchy wherein said conditional values are each conditional on adding the new data item to a different respective one of said sibling classes at said particular level of the hierarchy;
means for comparing the conditional values to identify the lowest conditional value; and
means for selecting the class having the lowest conditional value for classifying the new data item.
-
Specification