Classifying text into hierarchical categories

US 8,145,636 B1
Filed: 03/13/2009
Issued: 03/27/2012
Est. Priority Date: 03/13/2009
Status: Active Grant

Abstract

Systems, methods and program products for classifying text. A system classifies text into first subject matter categories. The system identifies one or more second subject matter categories in a collection of second subject matter categories, where each of the second categories is a hierarchical classification of a collection of confirmed valid search results for queries and at least one query for each identified second category includes a term in the text. The system filters the identified categories by excluding identified categories whose ancestors are not among the first categories. The system selects categories from the filtered categories based on one or more thresholds, each of which specifies a degree of relatedness between a selected category and the text. The selected categories are a sufficient basis for recommending content to a user, the content being associated with one or more of the selected categories.
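
The ancestor filter described in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions that are not in the patent: categories are represented as slash-separated paths, a category's ancestors are its proper path prefixes, and a category is kept when at least one ancestor is among the first categories. The function names and sample categories are hypothetical.

```python
def ancestors(category: str) -> list[str]:
    """Return the proper ancestors of a slash-separated category path, nearest first."""
    parts = category.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def filter_by_ancestors(identified: list[str], first_categories: set[str]) -> list[str]:
    """Keep identified categories that have at least one ancestor among the first categories.

    Assumption: one matching ancestor is enough; the abstract does not say
    whether any or all ancestors must match.
    """
    return [c for c in identified if any(a in first_categories for a in ancestors(c))]


# Hypothetical data: coarse (first) categories and fine-grained (second) candidates.
first = {"/Sports", "/Sports/Cycling"}
identified = ["/Sports/Cycling/Road", "/Finance/Banking/Loans", "/Sports/Tennis"]
kept = filter_by_ancestors(identified, first)
print(kept)
```

Under these assumptions the finance category is dropped because none of its ancestors was assigned to the text by the first classification.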

24 Claims

1. A computer-implemented method comprising:
- classifying a text into first subject matter categories;
  
  identifying one or more second subject matter categories in a plurality of second subject matter categories, each of the second subject matter categories being a hierarchical classification of a plurality of confirmed valid search results for queries, and wherein at least one query for each identified second subject matter category comprises a term in the text;
  
  filtering the identified second subject matter categories by excluding identified second subject matter categories whose ancestors are not among the first subject matter categories;
  
  for each second subject matter category in the filtered second subject matter categories:
  
  extracting one or more constituent terms from the queries of whose confirmed valid search results the second subject matter category is the hierarchical classification, where the constituent terms appear in the text;
  
  calculating an initial weight of the second subject matter category, the calculating comprising determining a sum of term frequency-inverse document frequency (tf-idf) values of each extracted constituent term in relation to a corpus of documents; and
  
  selecting the second subject matter category based on the initial weight and based on a threshold where the threshold specifies a degree of relatedness between a selected subject matter category and the text; and
  
  where the selected second subject matter categories are a sufficient basis for recommending to a user content associated with one or more of the selected second subject matter categories.
  - 2. The method of claim 1 in which calculating a tf-idf value of each extracted constituent term further comprises:
    - calculating an inverse document frequency (idf) value of the constituent term in relation to the corpus of documents;
      
      calculating a term frequency (tf) value of the constituent term; and
      
      determining the tf-idf value of the extracted constituent term based on the idf value of the constituent term and the tf value of the constituent term.
  - 3. The method of claim 2 in which calculating the idf value of the constituent term further comprises calculating an idf quotient, the calculating including dividing a total number of documents in the corpus by a number of documents in which the constituent term appears.
  - 4. The method of claim 2 in which calculating the tf value of the constituent term further comprises dividing a number of times the constituent term appears in the text by a length of the text.
  - 5. The method of claim 2 in which calculating the tf value of the constituent term further comprises:
    - for each confirmed valid search result for the queries from which the constituent term is extracted:
      
      dividing a number of times the constituent term appears in the search result by a length of the search result to obtain a relative term frequency; and
      
      applying the relative term frequency to the tf value of the constituent term.
  - 6. The method of claim 1, in which selecting the second subject matter category further comprises:
    - calculating a number of distinct constituent terms in the extracted constituent terms; and
      
      selecting the second subject matter category as a first selected subject matter category if the number of distinct constituent terms satisfies a first threshold.
  - 7. The method of claim 6 in which selecting the second subject matter category further comprises:
    - identifying one or more constituent terms from the extracted constituent terms, the identified constituent terms matching a refinement in a hierarchy in the first selected subject matter category, the refinement having a level in the hierarchy;
      
      boosting the initial weight of the first selected subject matter category by a first boost value to acquire a first boosted weight, the first boost value commensurate with the level of the refinement;
      
      boosting the first boosted weight by a second boost value to acquire a second boosted weight, the second boost value commensurate with a total number of constituent terms in the extracted constituent terms; and
      
      selecting the first selected subject matter category if the second boosted weight of the first selected subject matter category satisfies a second threshold.
  - 8. The method of claim 1, further comprising determining a confirmed valid search result by:
    - receiving a search query;
      
      presenting one or more search results responsive to the search query; and
      
      receiving a selection of at least one search result from the one or more search results, the selection designating the confirmed valid search result.
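
The weight calculation recited in claims 1 through 4 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the text is a whitespace-tokenized list, each corpus document is a set of terms, the tf-idf value is the product of tf and idf, and the idf is the logarithm of the quotient from claim 3 (the claim recites only the division; the logarithm is a conventional choice). All names and sample data are hypothetical.

```python
import math


def tf(term: str, text_tokens: list[str]) -> float:
    """Claim 4: occurrences of the term in the text divided by the text length."""
    return text_tokens.count(term) / len(text_tokens) if text_tokens else 0.0


def idf(term: str, corpus: list[set[str]]) -> float:
    """Claim 3: quotient of corpus size over the number of documents containing the term.

    The logarithm is an assumption; the claim only recites the quotient.
    """
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: terms absent from the corpus contribute nothing
    return math.log(len(corpus) / containing)


def initial_weight(constituent_terms: set[str], text_tokens: list[str],
                   corpus: list[set[str]]) -> float:
    """Claim 1: sum of tf-idf values over the extracted constituent terms."""
    return sum(tf(t, text_tokens) * idf(t, corpus) for t in constituent_terms)


# Hypothetical text, corpus, and constituent terms.
text = "road bike tires for a road race".split()
corpus = [{"road", "bike"}, {"bank", "loan"}, {"tennis", "racket"}, {"road", "trip"}]
weight = initial_weight({"road", "bike"}, text, corpus)
print(round(weight, 3))
```

Here "road" appears twice in seven tokens and in two of four documents, while "bike" appears once and in one document, so the weight works out to (4/7)·ln 2.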

9. A non-transitory computer-readable medium having instructions stored thereon, the instructions, when executed by one or more processors, cause the processors to perform operations comprising:
- classifying a text into first subject matter categories;
  
  identifying one or more second subject matter categories in a plurality of second subject matter categories, each of the second subject matter categories being a hierarchical classification of a plurality of confirmed valid search results for queries, and wherein at least one query for each identified second subject matter category comprises a term in the text;
  
  filtering the identified second subject matter categories by excluding identified second subject matter categories whose ancestors are not among the first subject matter categories;
  
  for each second subject matter category in the filtered second subject matter categories:
  
  extracting one or more constituent terms from the queries of whose confirmed valid search results the second subject matter category is the hierarchical classification, where the constituent terms appear in the text;
  
  calculating an initial weight of the second subject matter category, the calculating comprising determining a sum of term frequency-inverse document frequency (tf-idf) values of each extracted constituent term in relation to a corpus of documents; and
  
  selecting the second subject matter category based on the initial weight and based on a threshold where the threshold specifies a degree of relatedness between a selected subject matter category and the text; and
  
  where the selected second subject matter categories are a sufficient basis for recommending to a user content associated with one or more of the selected second subject matter categories.
  - 10. The computer-readable medium of claim 9 in which calculating a tf-idf value of each extracted constituent term further comprises:
    - calculating an inverse document frequency (idf) value of the constituent term in relation to the corpus of documents;
      
      calculating a term frequency (tf) value of the constituent term; and
      
      determining the tf-idf value of the extracted constituent term based on the idf value of the constituent term and the tf value of the constituent term.
  - 11. The computer-readable medium of claim 10 in which calculating the idf value of the constituent term further comprises calculating an idf quotient, the calculating including dividing a total number of documents in the corpus by a number of documents in which the constituent term appears.
  - 12. The computer-readable medium of claim 10 in which calculating the tf value of the constituent term further comprises dividing a number of times the constituent term appears in the text by a length of the text.
  - 13. The computer-readable medium of claim 10 in which calculating the tf value of the constituent term further comprises:
    - for each confirmed valid search result for the queries from which the constituent term is extracted:
      
      dividing a number of times the constituent term appears in the search result by a length of the search result to obtain a relative term frequency; and
      
      applying the relative term frequency to the tf value of the constituent term.
  - 14. The computer-readable medium of claim 9 in which selecting the second subject matter category further comprises:
    - calculating a number of distinct constituent terms in the extracted constituent terms; and
      
      selecting the second subject matter category as a first selected subject matter category if the number of distinct constituent terms satisfies a first threshold.
  - 15. The computer-readable medium of claim 14 in which selecting the second subject matter category further comprises:
    - identifying one or more constituent terms from the extracted constituent terms, the identified constituent terms matching a refinement in a hierarchy in the first selected subject matter category, the refinement having a level in the hierarchy;
      
      boosting the initial weight of the first selected subject matter category by a first boost value to acquire a first boosted weight, the first boost value commensurate with the level of the refinement;
      
      boosting the first boosted weight by a second boost value to acquire a second boosted weight, the second boost value commensurate with a total number of constituent terms in the extracted constituent terms; and
      
      selecting the first selected subject matter category if the second boosted weight of the first selected subject matter category satisfies a second threshold.
  - 16. The computer-readable medium of claim 9, the operations further comprising determining a confirmed valid search result by:
    - receiving a search query;
      
      presenting one or more search results responsive to the search query; and
      
      receiving a selection of at least one search result from the one or more search results, the selection designating the confirmed valid search result.
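
The two-stage selection of claims 14 and 15 (which mirror claims 6 and 7) can be sketched as below. The claims say only that the boosts are "commensurate with" the refinement level and the total term count; the multiplicative, linear boosts used here, the reading of "satisfies" as greater-than-or-equal, and every parameter name and default value are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    initial_weight: float
    extracted_terms: list[str]  # constituent terms found in the text, repeats included
    refinement_level: int       # level in the hierarchy of the matched refinement


def select(c: Candidate, min_distinct: int = 2, min_weight: float = 0.5,
           level_boost: float = 0.1, count_boost: float = 0.05) -> bool:
    """Return True if the candidate passes both thresholds (hypothetical defaults)."""
    # First threshold: number of distinct constituent terms.
    if len(set(c.extracted_terms)) < min_distinct:
        return False
    # First boost: grows with the level of the matched refinement (assumed linear).
    first_boosted = c.initial_weight * (1 + level_boost * c.refinement_level)
    # Second boost: grows with the total number of extracted terms (assumed linear).
    second_boosted = first_boosted * (1 + count_boost * len(c.extracted_terms))
    # Second threshold: applied to the twice-boosted weight.
    return second_boosted >= min_weight


deep = Candidate("/Sports/Cycling/Road", 0.4, ["road", "bike", "road"], 2)
repeat_only = Candidate("/Travel/Road Trips", 0.4, ["road", "road"], 2)
shallow = Candidate("/Sports", 0.3, ["road", "bike"], 0)
print(select(deep), select(repeat_only), select(shallow))
```

In this sketch the first candidate is selected, the second fails the distinct-term threshold, and the third fails the weight threshold because it receives no level boost.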

17. A system comprising:
- one or more processors; and
  
  memory having instructions stored thereon, the instructions, when executed by the one or more processors, cause the processors to perform operations comprising:
  
  classifying a text into first subject matter categories;
  
  identifying one or more second subject matter categories in a plurality of second subject matter categories, each of the second subject matter categories being a hierarchical classification of a plurality of confirmed valid search results for queries, and wherein at least one query for each identified second subject matter category comprises a term in the text;
  
  filtering the identified second subject matter categories by excluding identified second subject matter categories whose ancestors are not among the first subject matter categories;
  
  for each second subject matter category in the filtered second subject matter categories:
  
  extracting one or more constituent terms from the queries of whose confirmed valid search results the second subject matter category is the hierarchical classification, where the constituent terms appear in the text;
  
  calculating an initial weight of the second subject matter category, the calculating comprising determining a sum of term frequency-inverse document frequency (tf-idf) values of each extracted constituent term in relation to a corpus of documents; and
  
  selecting the second subject matter category based on the initial weight and based on a threshold where the threshold specifies a degree of relatedness between a selected subject matter category and the text; and
  
  where the selected second subject matter categories are a sufficient basis for recommending to a user content associated with one or more of the selected second subject matter categories.
  - 18. The system of claim 17 in which calculating a tf-idf value of each extracted constituent term further comprises:
    - calculating an inverse document frequency (idf) value of the constituent term in relation to the corpus of documents;
      
      calculating a term frequency (tf) value of the constituent term; and
      
      determining the tf-idf value of the extracted constituent term based on the idf value of the constituent term and the tf value of the constituent term.
  - 19. The system of claim 18 in which calculating the idf value of the constituent term further comprises calculating an idf quotient, the calculating including dividing a total number of documents in the corpus by a number of documents in which the constituent term appears.
  - 20. The system of claim 18 in which calculating the tf value of the constituent term further comprises dividing a number of times the constituent term appears in the text by a length of the text.
  - 21. The system of claim 18 in which calculating the tf value of the constituent term further comprises:
    - for each confirmed valid search result for the queries from which the constituent term is extracted:
      
      dividing a number of times the constituent term appears in the search result by a length of the search result to obtain a relative term frequency; and
      
      applying the relative term frequency to the tf value of the constituent term.
  - 22. The system of claim 17 in which selecting the second subject matter category further comprises:
    - calculating a number of distinct constituent terms in the extracted constituent terms; and
      
      selecting the second subject matter category as a first selected subject matter category if the number of distinct constituent terms satisfies a first threshold.
  - 23. The system of claim 22 in which selecting the second subject matter category further comprises:
    - identifying one or more constituent terms from the extracted constituent terms, the identified constituent terms matching a refinement in a hierarchy in the first selected subject matter category, the refinement having a level in the hierarchy;
      
      boosting the initial weight of the first selected subject matter category by a first boost value to acquire a first boosted weight, the first boost value commensurate with the level of the refinement;
      
      boosting the first boosted weight by a second boost value to acquire a second boosted weight, the second boost value commensurate with a total number of constituent terms in the extracted constituent terms; and
      
      selecting the first selected subject matter category if the second boosted weight of the first selected subject matter category satisfies a second threshold.
  - 24. The system of claim 17, the operations further comprising determining a confirmed valid search result by:
    - receiving a search query;
      
      presenting one or more search results responsive to the search query; and
      
      receiving a selection of at least one search result from the one or more search results, the selection designating the confirmed valid search result.
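
The confirmation step of claim 24 (also recited in claims 8 and 16) amounts to recording which presented result a user selected for a query. A minimal Python sketch follows, assuming results are identified by strings and selections are held in memory; the class, its methods, and the sample identifiers are hypothetical and are not taken from the patent.

```python
from collections import defaultdict


class ConfirmedResults:
    """Records which presented search results users selected for each query."""

    def __init__(self) -> None:
        self._confirmed: dict[str, set[str]] = defaultdict(set)

    def record_selection(self, query: str, presented: list[str], selected: str) -> None:
        # Only a result that was actually presented for the query can be confirmed.
        if selected not in presented:
            raise ValueError("selected result was not presented for this query")
        self._confirmed[query].add(selected)

    def confirmed_for(self, query: str) -> set[str]:
        # Return a copy so callers cannot mutate the stored selections.
        return set(self._confirmed.get(query, set()))


log = ConfirmedResults()
log.record_selection("road bike", ["shop.example/1", "blog.example/2"], "blog.example/2")
print(log.confirmed_for("road bike"))
```

A store like this could supply the query-to-result pairs that the second subject matter categories classify, although the claims do not specify how those pairs are stored.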

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Jeh, Glen M.; Yang, Beverly
Primary Examiner(s): Thai, Hanh
Application Number: US12/404,232
Time in Patent Office: 1,110 Days
Field of Search: 707/706, 707/750, 707/754, 707/736, 707/748-749
US Class Current: 707/736
CPC Class Codes: G06F 16/353 into predefined classes
