LARGE SCALE UNSUPERVISED HIERARCHICAL DOCUMENT CATEGORIZATION USING ONTOLOGICAL GUIDANCE
Abstract
A classification method includes constructing queries from category descriptors representing categories of a taxonomy of hierarchically organized categories. The query constructed for a category c includes a query component based on descriptors of the category c and at least one query component based on descriptors of an ancestor or descendant category of the category c. A documents database is queried using the constructed queries to retrieve pseudo-relevant documents. Language models for the categories of the taxonomy are extracted from the pseudo-relevant documents by inferring a hierarchical topic model representing the taxonomy. An input document is classified by optimizing mixture weights of a weighted combination of categories of the hierarchical topic model respective to the input document.
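The classification step described above, optimizing mixture weights over category language models for an input document, can be illustrated with a short sketch. It assumes the category language models are already available as fixed word-probability tables, treats the categories as a flat mixture (the hierarchy is ignored here), and uses expectation-maximization as the optimizer. The function name `fit_mixture_weights`, the `floor` parameter, and the toy models are hypothetical and do not come from the patent text.

```python
from collections import Counter


def fit_mixture_weights(doc_tokens, category_models, iterations=50, floor=1e-9):
    """Estimate mixture weights over fixed category language models with EM.

    category_models maps a category name to a dict of word -> probability.
    Words missing from a model receive a small floor probability; this is an
    assumption of the sketch, standing in for proper smoothing.
    """
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("document has no tokens")
    categories = list(category_models)
    # Start from uniform weights over the categories.
    weights = {c: 1.0 / len(categories) for c in categories}
    for _ in range(iterations):
        expected = {c: 0.0 for c in categories}
        for word, n in counts.items():
            # E-step: share of this word attributed to each category.
            joint = {c: weights[c] * category_models[c].get(word, floor)
                     for c in categories}
            norm = sum(joint.values())
            for c in categories:
                expected[c] += n * joint[c] / norm
        # M-step: renormalize expected counts into mixture weights.
        weights = {c: expected[c] / total for c in categories}
    return weights


# Toy language models (hypothetical values, for illustration only).
models = {
    "sports": {"game": 0.5, "team": 0.4, "market": 0.1},
    "finance": {"market": 0.5, "stock": 0.4, "game": 0.1},
}
weights = fit_mixture_weights(["game", "team", "game", "team"], models)
best_category = max(weights, key=weights.get)
```

In this sketch the document would be assigned to the category, or categories, carrying the largest weight; how the weights are turned into a final label is a design choice the abstract leaves open.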
25 Claims
1. A storage medium storing instructions executable by a processing device to perform a method comprising:
- generating a hierarchical classifier for a taxonomy of hierarchically organized categories wherein each category is represented by one or more textual category descriptors, the hierarchical classifier being generated by a method including (i) constructing queries from the textual category descriptors representing the categories and querying a documents database using the constructed queries to retrieve pseudo-relevant documents and (ii) extracting language models comprising multinomial distributions over the words of a textual vocabulary for the categories of the taxonomy by inferring a hierarchical topic model representing the taxonomy from at least the pseudo-relevant documents; and
- classifying an input document using the generated hierarchical classifier.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A method comprising:
- constructing queries from category descriptors representing categories of a taxonomy of hierarchically organized categories;
- querying a documents database using the constructed queries to retrieve pseudo-relevant documents;
- extracting category profiles for the categories of the taxonomy from at least the pseudo-relevant documents by inferring a hierarchical topic model representing the taxonomy; and
- classifying an input document by optimizing mixture weights of a weighted combination of categories of the hierarchical topic model respective to the input document;
- wherein at least the constructing, extracting, and classifying operations are performed by a digital processing device.
Dependent claims: 9, 10, 11, 12, 13, 14, 15, 16
17. An apparatus comprising:
- a digital processing device configured to generate a hierarchical classifier for a taxonomy of hierarchically organized categories wherein each category is represented by one or more category descriptors, the digital processing device generating the hierarchical classifier by a method including:
  - constructing queries for categories of the taxonomy of hierarchically organized categories, the query constructed for a category c of the taxonomy including a query component constructed from one or more textual descriptors of the category c and at least one of a query component constructed from one or more textual descriptors of an ancestor category of the category c and a query component constructed from textual descriptors of one or more descendant categories of the category c;
  - querying a documents database using the constructed queries to retrieve pseudo-relevant documents; and
  - extracting category profiles for the categories of the taxonomy from at least the pseudo-relevant documents by inferring a hierarchical topic model representing the taxonomy.
Dependent claims: 18, 19, 20, 21, 22, 23, 24, 25
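The query construction recited in claim 17 can be sketched as follows: the query for a category c combines a component built from c's own descriptors with components built from ancestor and descendant descriptors. This is a minimal illustration under stated assumptions, not the claimed implementation. The `Category` class, the `build_query` function, and the component weights are hypothetical, and the query is returned as a plain list of weighted term lists rather than in any particular search engine's syntax.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Category:
    """A taxonomy node with textual descriptors (hypothetical structure)."""
    name: str
    descriptors: list
    parent: "Category | None" = None
    children: list = field(default_factory=list)

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child


def ancestors(cat):
    """Yield the ancestors of cat, nearest first."""
    node = cat.parent
    while node is not None:
        yield node
        node = node.parent


def descendants(cat):
    """Yield all descendants of cat, depth first."""
    for child in cat.children:
        yield child
        yield from descendants(child)


def build_query(cat, ancestor_weight=0.5, descendant_weight=0.5):
    """Return a list of (weight, terms) query components for cat.

    The weights are assumptions of this sketch; the claim only requires that
    the components exist, not how they are weighted.
    """
    components = [(1.0, list(cat.descriptors))]
    ancestor_terms = [t for a in ancestors(cat) for t in a.descriptors]
    if ancestor_terms:
        components.append((ancestor_weight, ancestor_terms))
    descendant_terms = [t for d in descendants(cat) for t in d.descriptors]
    if descendant_terms:
        components.append((descendant_weight, descendant_terms))
    return components


# Toy three-level taxonomy: science > physics > optics.
science = Category("science", ["science"])
physics = science.add_child(Category("physics", ["physics"]))
optics = physics.add_child(Category("optics", ["optics"]))
query = build_query(physics)
```

The resulting components would then be translated into whatever query language the documents database accepts, and the top-ranked results kept as pseudo-relevant documents for the category.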
Specification