Classifying text into hierarchical categories
First Claim
1. A computer-implemented method comprising:
classifying a text into first subject matter categories;
identifying one or more second subject matter categories in a plurality of second subject matter categories, each of the second subject matter categories being a hierarchical classification of a plurality of confirmed valid search results for queries, and wherein at least one query for each identified second subject matter category comprises a term in the text;
filtering the identified second subject matter categories by excluding identified second subject matter categories whose ancestors are not among the first subject matter categories;
for each second subject matter category in the filtered second subject matter categories:
  extracting one or more constituent terms from the queries of whose confirmed valid search results the second subject matter category is the hierarchical classification, where the constituent terms appear in the text;
  calculating an initial weight of the second subject matter category, the calculating comprising determining a sum of term frequency-inverse document frequency (tf-idf) values of each extracted constituent term in relation to a corpus of documents; and
  selecting the second subject matter category based on the initial weight and based on a threshold, where the threshold specifies a degree of relatedness between a selected subject matter category and the text; and
where the selected second subject matter categories are a sufficient basis for recommending to a user content associated with one or more of the selected second subject matter categories.
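The per-category steps of claim 1 (extracting constituent terms, summing their tf-idf values, and applying a threshold) can be illustrated with a minimal Python sketch. This is not the patented implementation. The function names, whitespace tokenization, the smoothed idf formula, and the `>=` threshold comparison are all assumptions made for illustration; the claim itself only requires a sum of tf-idf values and a threshold.

```python
import math
from collections import Counter


def tfidf(term, text_tokens, corpus):
    """tf-idf of `term` in the text, relative to `corpus` (a list of token lists).

    Assumption: a smoothed idf, log((1 + N) / (1 + df)) + 1. The claim does not
    fix a particular tf-idf variant.
    """
    if not text_tokens:
        return 0.0
    tf = Counter(text_tokens)[term] / len(text_tokens)
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0
    return tf * idf


def initial_weight(category_queries, text_tokens, corpus):
    """Sum the tf-idf values of query terms that also appear in the text."""
    text_vocab = set(text_tokens)
    # Constituent terms: words of the category's queries that occur in the text.
    terms = {t for q in category_queries for t in q.lower().split() if t in text_vocab}
    return sum(tfidf(t, text_tokens, corpus) for t in terms)


def select_categories(queries_by_category, text, corpus, threshold):
    """Keep categories whose initial weight meets the (hypothetical) threshold."""
    tokens = text.lower().split()
    weights = {
        cat: initial_weight(queries, tokens, corpus)
        for cat, queries in queries_by_category.items()
    }
    return {cat: w for cat, w in weights.items() if w >= threshold}


# Hypothetical data: category paths, queries, and corpus are invented for the example.
corpus = [["tennis", "racket"], ["stock", "market"], ["tennis", "ball"]]
queries_by_category = {
    "/Sports/Tennis": ["tennis racket", "buy tennis shoes"],
    "/Finance/Stocks": ["stock market"],
}
selected = select_categories(
    queries_by_category, "tennis racket review tennis", corpus, threshold=0.5
)
```

In this toy run only the terms "tennis" and "racket" contribute to the weight of "/Sports/Tennis"; "buy" and "shoes" are ignored because they do not appear in the text, and "/Finance/Stocks" receives a weight of zero.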
Abstract
Systems, methods and program products for classifying text. A system classifies text into first subject matter categories. The system identifies one or more second subject matter categories in a collection of second subject matter categories, each of the second categories being a hierarchical classification of a collection of confirmed valid search results for queries, in which at least one query for each identified second category includes a term in the text. The system filters the identified categories by excluding identified categories whose ancestors are not among the first categories. The system selects categories from the filtered categories based on one or more thresholds, in which a threshold specifies a degree of relatedness between a selected category and the text. The selected categories are a sufficient basis for recommending content to a user, the content being associated with one or more of the selected categories.
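The identification and filtering steps summarized in the abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the system's actual method: categories are represented as slash-separated paths, an ancestor is any proper path prefix, matching uses lowercase whitespace tokens, and a category is kept when at least one of its ancestors is among the first categories. None of these representations come from the source, and all function names are hypothetical.

```python
def ancestors(category):
    """Proper ancestors of a path-style category: '/A/B/C' -> ['/A', '/A/B']."""
    parts = [p for p in category.split("/") if p]
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]


def identify_candidates(text, queries_by_category):
    """Categories for which at least one query contains a term from the text."""
    vocab = set(text.lower().split())
    return [
        cat
        for cat, queries in queries_by_category.items()
        if any(term in vocab for q in queries for term in q.lower().split())
    ]


def filter_by_ancestors(candidates, first_categories):
    """Drop candidates that have no ancestor among the first categories.

    Assumption: one matching ancestor is enough to keep a candidate.
    """
    first = set(first_categories)
    return [c for c in candidates if any(a in first for a in ancestors(c))]


# Hypothetical category paths and queries, invented for the example.
queries_by_category = {
    "/Sports/Tennis/Rackets": ["tennis racket"],
    "/Shopping/Sporting Goods/Rackets": ["cheap racket"],
    "/Finance/Stocks": ["stock market"],
}
candidates = identify_candidates("Choosing a tennis racket", queries_by_category)
kept = filter_by_ancestors(candidates, first_categories=["/Sports"])
```

Here both "Rackets" categories are candidates because their queries share the term "racket" with the text, but only the one under "/Sports" survives the ancestor filter, which is how the coarse first-level classification disambiguates the finer one.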
24 Claims
9. A non-transitory computer-readable medium having instructions stored thereon, the instructions, when executed by one or more processors, cause the processors to perform operations comprising:
classifying a text into first subject matter categories;
identifying one or more second subject matter categories in a plurality of second subject matter categories, each of the second subject matter categories being a hierarchical classification of a plurality of confirmed valid search results for queries, and wherein at least one query for each identified second subject matter category comprises a term in the text;
filtering the identified second subject matter categories by excluding identified second subject matter categories whose ancestors are not among the first subject matter categories;
for each second subject matter category in the filtered second subject matter categories:
  extracting one or more constituent terms from the queries of whose confirmed valid search results the second subject matter category is the hierarchical classification, where the constituent terms appear in the text;
  calculating an initial weight of the second subject matter category, the calculating comprising determining a sum of term frequency-inverse document frequency (tf-idf) values of each extracted constituent term in relation to a corpus of documents; and
  selecting the second subject matter category based on the initial weight and based on a threshold, where the threshold specifies a degree of relatedness between a selected subject matter category and the text; and
where the selected second subject matter categories are a sufficient basis for recommending to a user content associated with one or more of the selected second subject matter categories.
17. A system comprising:
one or more processors; and
memory having instructions stored thereon, the instructions, when executed by the one or more processors, cause the processors to perform operations comprising:
classifying a text into first subject matter categories;
identifying one or more second subject matter categories in a plurality of second subject matter categories, each of the second subject matter categories being a hierarchical classification of a plurality of confirmed valid search results for queries, and wherein at least one query for each identified second subject matter category comprises a term in the text;
filtering the identified second subject matter categories by excluding identified second subject matter categories whose ancestors are not among the first subject matter categories;
for each second subject matter category in the filtered second subject matter categories:
  extracting one or more constituent terms from the queries of whose confirmed valid search results the second subject matter category is the hierarchical classification, where the constituent terms appear in the text;
  calculating an initial weight of the second subject matter category, the calculating comprising determining a sum of term frequency-inverse document frequency (tf-idf) values of each extracted constituent term in relation to a corpus of documents; and
  selecting the second subject matter category based on the initial weight and based on a threshold, where the threshold specifies a degree of relatedness between a selected subject matter category and the text; and
where the selected second subject matter categories are a sufficient basis for recommending to a user content associated with one or more of the selected second subject matter categories.
Specification