Very-large-scale automatic categorizer for web content
Abstract
A method and apparatus for efficiently classifying and categorizing data objects such as electronic text, graphics, and audio based documents within very-large-scale hierarchical classification trees is provided. In accordance with one embodiment of the invention, a first node of a plurality of nodes of a subject hierarchy is selected. Previously classified data objects corresponding to a selected first node of a subject hierarchy as well as any associated sub-nodes of the selected node are aggregated to form a content class of data objects. Similarly, data objects corresponding to sibling nodes of the selected node and any associated sub-nodes of the sibling nodes are then aggregated to form an anti-content class of data objects. Features are then extracted from each of the content class of data objects and the anti-content class of data objects to facilitate characterization of said previously classified data objects.
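The aggregation described in the abstract can be illustrated with a minimal Python sketch. The `Node` class, its field names, and the sample hierarchy are hypothetical and are not taken from the patent. The sketch only shows one way the content class (the selected node plus its sub-nodes) and the anti-content class (the sibling nodes plus their sub-nodes) might be assembled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a subject hierarchy (hypothetical structure, for illustration)."""
    name: str
    documents: list = field(default_factory=list)
    children: list = field(default_factory=list)
    # Back-reference to the parent; excluded from repr/eq to avoid recursion.
    parent: "Node | None" = field(default=None, repr=False, compare=False)

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child


def subtree_documents(node):
    """Collect the documents at a node and at all of its sub-nodes."""
    docs = list(node.documents)
    for child in node.children:
        docs.extend(subtree_documents(child))
    return docs


def content_class(node):
    """Documents of the selected node and any associated sub-nodes."""
    return subtree_documents(node)


def anti_content_class(node):
    """Documents of the selected node's siblings and their sub-nodes."""
    if node.parent is None:
        return []  # the root has no siblings
    docs = []
    for sibling in node.parent.children:
        if sibling is not node:
            docs.extend(subtree_documents(sibling))
    return docs


# Hypothetical two-branch hierarchy used only to exercise the functions.
root = Node("Top")
sports = root.add_child(Node("Sports", documents=["sports overview"]))
arts = root.add_child(Node("Arts", documents=["arts overview"]))
sports.add_child(Node("Soccer", documents=["soccer rules"]))
arts.add_child(Node("Music", documents=["jazz history"]))

print(content_class(sports))       # ['sports overview', 'soccer rules']
print(anti_content_class(sports))  # ['arts overview', 'jazz history']
```

Features would then be extracted from one or both of the two lists. The extraction itself is outside the scope of this sketch.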
22 Claims
1. A method of training a classifier system by utilizing previously classified data objects comprising one or more electronic documents including at least one of a text document, an image file, an audio sequence, a video sequence, and a hybrid document including a combination of text and images, said previously classified data objects being organized into a subject hierarchy of a plurality of nodes, the method comprising:
selecting one node of the plurality of nodes;
aggregating those of the previously classified data objects corresponding to the selected node and any associated sub-nodes of the selected node, to form a content class of data objects, said content class of data objects comprising a content class of the one or more electronic documents;
aggregating those of the previously classified data objects corresponding to any associated sibling nodes of the selected node and any associated sub-nodes of the sibling nodes to form an anti-content class of data objects, said anti-content class of data objects comprising an anti-content class of the one or more electronic documents; and
extracting features from at least one of the content class of data objects and the anti-content class of data objects to facilitate characterization of said previously classified data objects.
ranking said extracted features based upon a frequency of occurrence for each extracted feature;
identifying a corner feature of said extracted features such that the frequency of occurrence of said corner feature is equal to or immediately greater than its corresponding rank, and wherein said corner feature defines a first group of features having respective frequencies of occurrence greater than the corner feature, and a second group of features having respective frequencies of occurrence less than the corner feature; and
accepting a first set of features from said first group of features and a second set of features from said second group of features, wherein the cumulative frequencies of occurrence of said first set of features is a fractional percentage of the cumulative frequencies of occurrence of said second set of features.
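The ranking and corner-feature steps recited above can be illustrated with a minimal Python sketch. It assumes one particular reading of the claim language: features are ranked by descending frequency starting at rank 1, and the corner feature is the last-ranked feature whose frequency is still greater than or equal to its rank. The function names and the sample token counts are hypothetical. The final acceptance step (choosing the first and second sets by cumulative frequency) is not modeled here.

```python
from collections import Counter


def rank_features(tokens):
    """Return (feature, frequency) pairs, most frequent first.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def find_corner(ranked):
    """Return the 1-based rank of the corner feature, or None if empty.

    Assumed reading: the corner is the last rank r with frequency >= r.
    """
    corner = None
    for rank, (_, freq) in enumerate(ranked, start=1):
        if freq >= rank:
            corner = rank
        else:
            break
    return corner


def split_at_corner(ranked):
    """Split ranked features into groups above and below the corner frequency."""
    corner = find_corner(ranked)
    if corner is None:
        return None, [], []
    corner_feature, corner_freq = ranked[corner - 1]
    first_group = [(f, c) for f, c in ranked if c > corner_freq]
    second_group = [(f, c) for f, c in ranked if c < corner_freq]
    return corner_feature, first_group, second_group


# Hypothetical token stream: frequencies 10, 7, 5, 4, 2, 1.
tokens = (["the"] * 10 + ["of"] * 7 + ["goal"] * 5
          + ["team"] * 4 + ["offside"] * 2 + ["referee"] * 1)
ranked = rank_features(tokens)
corner_feature, first_group, second_group = split_at_corner(ranked)

print(corner_feature)  # team (rank 4, frequency 4)
print(first_group)     # [('the', 10), ('of', 7), ('goal', 5)]
print(second_group)    # [('offside', 2), ('referee', 1)]
```

In this example "team" sits at rank 4 with frequency 4, so it is the corner under the assumed reading. The next feature, "offside", has frequency 2 at rank 5 and falls into the second group.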
8. The method of claim 7, wherein the cumulative frequencies of occurrence of said first set of features is approximately 20 percent of the cumulative frequencies of occurrence of said second set of features, and the cumulative frequencies of occurrence of said second set of features is approximately 80 percent of the cumulative frequencies of occurrence of said first set of features.
9. The method of claim 6, wherein said salient features are n-gram salient features.
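As an illustration of n-gram features of the kind referred to in claim 9, the following Python sketch extracts word-level n-grams and counts them across a set of documents. The claim does not state whether the n-grams are word-based or character-based, so the word-based choice, the whitespace tokenization, and the function names are all assumptions made for this example.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a text, lowercased and split on whitespace."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_counts(documents, max_n=2):
    """Count all word n-grams of length 1..max_n over a list of documents."""
    counts = Counter()
    for doc in documents:
        for n in range(1, max_n + 1):
            counts.update(word_ngrams(doc, n))
    return counts


print(word_ngrams("Free kick rules", 2))  # ['free kick', 'kick rules']

counts = ngram_counts(["free kick", "free kick rules"])
print(counts["free kick"])  # 2
```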
10. The method of claim 6, wherein said extracted features are determined to be salient based upon mutual information techniques.
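One common mutual information technique, shown here only as an assumed example of what claim 10 may cover, scores a feature by the mutual information between its presence in a document and the document's membership in the content class versus the anti-content class. The function name, the representation of each document as a set of features, and the sample data are hypothetical.

```python
import math


def mutual_information(feature, content_docs, anti_docs):
    """Mutual information (in bits) between feature presence and class.

    Each document is a set of features. Class 1 is the content class,
    class 0 is the anti-content class.
    """
    n = len(content_docs) + len(anti_docs)
    if n == 0:
        return 0.0
    # Joint counts keyed by (feature_present, in_content_class).
    joint = {
        (1, 1): sum(feature in d for d in content_docs),
        (1, 0): sum(feature in d for d in anti_docs),
    }
    joint[(0, 1)] = len(content_docs) - joint[(1, 1)]
    joint[(0, 0)] = len(anti_docs) - joint[(1, 0)]

    mi = 0.0
    for (f, c), count in joint.items():
        if count == 0:
            continue  # a zero-probability cell contributes nothing
        p_fc = count / n
        p_f = (joint[(f, 0)] + joint[(f, 1)]) / n
        p_c = (joint[(0, c)] + joint[(1, c)]) / n
        mi += p_fc * math.log2(p_fc / (p_f * p_c))
    return mi


# Hypothetical content and anti-content classes.
content = [{"goal", "team"}, {"goal"}]
anti = [{"paint"}, {"team"}]

print(mutual_information("goal", content, anti))  # 1.0
print(mutual_information("team", content, anti))  # 0.0
```

Here "goal" appears in every content document and in no anti-content document, so it carries a full bit of information about the class. "team" appears equally often in both classes and scores zero.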
11. An apparatus comprising:
a storage medium having stored therein a plurality of programming instructions designed to implement a plurality of functions of a category name service for providing a category name to a data object, including first one or more functions to select a first node of a hierarchically organized classifier having a plurality of nodes and one or more previously classified data objects associated with each of said plurality of nodes, aggregate those of the previously classified data objects corresponding to the selected node and any associated sub-nodes of the selected node to form a content class of data objects, aggregate those of the previously classified data objects corresponding to any associated sibling nodes of the selected node and any associated sub-nodes of the sibling nodes to form an anti-content class of data objects, extract features from at least one of the content class of data objects and the anti-content class of data objects to facilitate characterization of said previously classified data objects; and
a processor coupled to the storage medium to execute the programming instructions.
rank said extracted features based upon a frequency of occurrence for each extracted feature;
identify a corner feature of said extracted features such that the frequency of occurrence of said corner feature is equal to or immediately greater than its corresponding rank, and wherein said corner feature defines a first group of features having respective frequencies of occurrence greater than the corner feature, and a second group of features having respective frequencies of occurrence less than the corner feature; and
accept a first set of features from said first group of features and a second set of features from said second group of features, wherein the cumulative frequencies of occurrence of said first set of features is a fractional percentage of the cumulative frequencies of occurrence of said second set of features.
18. The apparatus of claim 17, wherein the cumulative frequencies of occurrence of said first set of features is approximately 20 percent of the cumulative frequencies of occurrence of said second set of features, and the cumulative frequencies of occurrence of said second set of features is approximately 80 percent of the cumulative frequencies of occurrence of said first set of features.
19. The apparatus of claim 16, wherein said salient features are n-gram salient features.
20. The method of claim 16, wherein said salient features are determined based upon mutual information techniques.
21. The apparatus of claim 11, wherein said data object comprises an electronic document.
22. The apparatus of claim 21, wherein said electronic document comprises at least one of a text document, an image file, an audio sequence, a video sequence, and a hybrid document including a combination of text and images.
Specification