Method and apparatus for classifying documents based on user inputs

US 7,769,751 B1
Filed: 01/17/2006
Issued: 08/03/2010
Est. Priority Date: 01/17/2006
Status: Active Grant

Abstract

One embodiment of the present invention provides a system that automatically classifies documents (such as web pages) based on user inputs. During operation, the system obtains a “classified” set of documents which are classified as relating to a specific topic. The system also obtains queries related to the specific topic. These queries produce “query results” which enable the user to access documents related to the query. The queries also include “click information” which specifies how one or more users have accessed the query results. The system uses this click information to identify documents in the classified set of documents which are not related to the specific topic or are off-topic. If such documents are identified, the system shifts the identified documents so that they are regarded as off-topic and/or spam, and removes the identified documents from the classified set of documents.
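
The click-based filtering described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `ClickInfo` fields, the `MIN_CTR` and `MIN_DURATION_SECONDS` cut-offs, and the rule that a low click-through rate or short visit signals an off-topic document are all assumptions made for the example, since the abstract gives no numeric criteria.

```python
from dataclasses import dataclass


@dataclass
class ClickInfo:
    """Hypothetical per-document click statistics gathered from query results."""
    impressions: int      # times the document was presented as a query result
    clicks: int           # times users selected it
    total_seconds: float  # summed time users spent on the document

    @property
    def click_through_rate(self) -> float:
        return self.clicks / self.impressions if self.impressions else 0.0

    @property
    def mean_duration(self) -> float:
        return self.total_seconds / self.clicks if self.clicks else 0.0


# Assumed cut-offs for illustration only; the source states no values.
MIN_CTR = 0.05
MIN_DURATION_SECONDS = 10.0


def is_off_topic(info: ClickInfo) -> bool:
    """Flag a document when either click signal falls below its cut-off."""
    return (info.click_through_rate < MIN_CTR
            or info.mean_duration < MIN_DURATION_SECONDS)


def split_by_clicks(classified, click_log):
    """Separate a classified set into kept and off-topic document ids."""
    kept, off_topic = [], []
    for doc_id in classified:
        info = click_log.get(doc_id)
        # Documents with no click data are kept: there is no evidence against them.
        if info is not None and is_off_topic(info):
            off_topic.append(doc_id)
        else:
            kept.append(doc_id)
    return kept, off_topic
```

Keeping documents that have no click data is a design choice of this sketch; a stricter system could hold them for review instead.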

21 Claims

1. A method executed on one or more processors for automatically classifying documents based on topics and user inputs, comprising:
- receiving a set of documents which are classified as relating to the specific topic;
  
  producing an initial feature vector that corresponds to frequency of a term's occurrence in the set of documents;
  
  using the initial feature vector to classify another set of documents to produce an initial classified set of documents;
  
  receiving click information associated with a set of queries related to the specific topic, wherein the click information includes a click-through rate at which a query result is selected after being presented and a click duration indicating an amount of time during which the query result is accessed;
  
  using the click information to remove off-topic documents in the set of documents to obtain an updated set of documents, wherein a document is off-topic if the click-through rate or click duration associated with the document indicates the document is off-topic;
  
  determining an updated feature vector using the updated set of documents; and
  
  re-classifying the classified set of documents using the updated feature vector when the percentage of documents identified as off-topic exceeds a threshold which is greater than 0, otherwise retaining the initial classified set of documents.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9):
  - 2. The method of claim 1, wherein a document in the set of documents includes an annotation to indicate whether the document is related to the specific topic.
  - 3. The method of claim 2, wherein an annotation in a given document indicates whether the given document:
    - is related to the specific topic;
      
      is a spam document;
      
      or is not related to the specific topic or is off-topic.
  - 4. The method of claim 1, wherein the query results are generated by identifying queries that match documents in the set of documents.
  - 5. The method of claim 1, wherein the method further comprises:
    - receiving a new query;
      
      determining whether the new query is related to the specific topic; and
      
      processing the new query to produce query results, wherein if the new query is related to the specific topic, processing the new query involves adjusting relevancy scores for documents based on annotations associated with the documents.
  - 6. The method of claim 5, wherein determining whether the new query is related to the specific topic involves applying a query detector that uses Bloom filters to terms in the new query.
  - 7. The method of claim 6, wherein prior to receiving the new query, the method further comprises constructing the Bloom filter by:
    - identifying queries which trigger documents in the set of documents;
      
      identifying common n-grams in the identified queries;
      
      excluding commonly occurring n-grams from the identified n-grams; and
      
      building a Bloom filter based on the remaining identified n-grams.
  - 8. The method of claim 5, wherein adjusting relevancy scores involves:
    - boosting relevancy scores for documents which are annotated as being related to the specific topic;
      
      reducing relevancy scores for documents which are annotated as being spam documents; and
      
      changing the rankings of search results based on the adjusted relevancy scores.
  - 9. The method of claim 1, wherein the documents are web pages.
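
The feature-vector and threshold steps of claim 1 can be sketched as below. This is an illustration under stated assumptions, not the claimed implementation: whitespace tokenization, cosine similarity as the classifier, and the `min_similarity` and `threshold` values are all hypothetical choices, and the sketch reads "re-classifying the classified set" as re-scoring the initial classified set against the updated vector.

```python
import math
from collections import Counter


def feature_vector(docs):
    """Relative term frequencies over a list of document strings."""
    counts = Counter(term for doc in docs for term in doc.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def classify(vector, candidates, min_similarity=0.3):
    """Keep candidates whose own term vector is close enough to `vector`."""
    return [doc for doc in candidates
            if cosine(vector, feature_vector([doc])) >= min_similarity]


def reclassify_if_needed(seed_docs, off_topic_docs, initial_classified,
                         threshold=0.1):
    """Re-classify only when the off-topic share of the seed set exceeds
    `threshold` (assumed value); otherwise retain the initial result."""
    if not seed_docs:
        return initial_classified
    off_fraction = len(off_topic_docs) / len(seed_docs)
    if off_fraction <= threshold:
        return initial_classified
    updated_docs = [d for d in seed_docs if d not in off_topic_docs]
    return classify(feature_vector(updated_docs), initial_classified)
```

In this sketch a seed set polluted by one spam document initially admits a spam-like candidate; once click information marks the seed document off-topic, the updated vector no longer matches that candidate.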

10. A computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for automatically classifying documents based on topics and user inputs, the method comprising:
- receiving a set of documents which are classified as relating to a specific topic;
  
  producing an initial feature vector that corresponds to frequency of a term's occurrence in the set of documents;
  
  using the initial feature vector to classify another set of documents to produce an initial classified set of documents;
  
  receiving click information associated with a set of queries related to the specific topic, wherein the click information includes a click-through rate at which a query result is selected after being presented and a click duration indicating an amount of time during which the query result is accessed;
  
  using the click information to remove off-topic documents in the set of documents to obtain an updated set of documents, wherein a document is off-topic if the click-through rate or click duration associated with the document indicates the document is off-topic;
  
  determining an updated feature vector using the updated set of documents; and
  
  re-classifying the classified set of documents using the updated feature vector when the percentage of documents identified as off-topic exceeds a threshold which is greater than 0, otherwise retaining the initial classified set of documents.
- Dependent claims (11, 12, 13, 14, 15, 16, 17, 18):
  - 11. The computer-readable storage medium of claim 10, wherein a document in the set of documents includes an annotation to indicate whether the document is related to the specific topic.
  - 12. The computer-readable storage medium of claim 11, wherein an annotation in a given document indicates whether the given document:
    - is related to the specific topic;
      
      is a spam document;
      
      or is not related to the specific topic or is off-topic.
  - 13. The computer-readable storage medium of claim 10, wherein the query results are generated by identifying queries that match documents in the set of documents.
  - 14. The computer-readable storage medium of claim 10, wherein the method further comprises:
    - receiving a new query;
      
      determining whether the new query is related to the specific topic; and
      
      processing the new query to produce query results, wherein if the new query is related to the specific topic, processing the new query involves adjusting relevancy scores for documents based on annotations associated with the documents.
  - 15. The computer-readable storage medium of claim 14, wherein determining whether the new query is related to the specific topic involves applying a query detector that uses Bloom filters to terms in the new query.
  - 16. The computer-readable storage medium of claim 15, wherein prior to receiving the new query, the method further comprises constructing the Bloom filter by:
    - identifying queries which trigger documents in the set of documents;
      
      identifying common n-grams in the identified queries;
      
      excluding commonly occurring n-grams from the identified n-grams; and
      
      building a Bloom filter based on the remaining identified n-grams.
  - 17. The computer-readable storage medium of claim 14, wherein adjusting relevancy scores involves:
    - boosting relevancy scores for documents which are annotated as being related to the specific topic;
      
      reducing relevancy scores for documents which are annotated as being spam documents; and
      
      changing the rankings of search results based on the adjusted relevancy scores.
  - 18. The computer-readable storage medium of claim 10, wherein the documents are web pages.
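
The Bloom-filter query detector of claims 15 and 16 can be sketched as follows. The filter size, the number of hashes, the use of SHA-256, the bigram length, the `min_count` cut-off for a "common" n-gram, and the `background_ngrams` exclusion list are all assumptions for illustration; the claims do not specify them.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Small Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def ngrams(query, n=2):
    """Word n-grams of a query string (bigrams by default)."""
    words = query.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_query_filter(topic_queries, background_ngrams, min_count=2):
    """Keep n-grams shared by several topic queries, drop generally
    common ones, and store the remainder in a Bloom filter."""
    counts = Counter(g for q in topic_queries for g in set(ngrams(q)))
    bloom = BloomFilter()
    for gram, count in counts.items():
        if count >= min_count and gram not in background_ngrams:
            bloom.add(gram)
    return bloom


def is_topic_query(query, bloom):
    """Treat a new query as topic-related if any of its n-grams is stored."""
    return any(gram in bloom for gram in ngrams(query))
```

Because a Bloom filter can report false positives, a detector like this may occasionally route an unrelated query to topic-specific processing; it never misses an n-gram that was actually stored.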

19. A computer system that automatically classifies documents based on user inputs, comprising:
- a processor;
  
  a memory;
  
  a document-receiving mechanism configured to receive a set of documents which are classified as relating to a specific topic;
  
  a feature-vector producing mechanism configured to produce an initial feature vector that corresponds to frequency of a term's occurrence in the set of documents;
  
  a classifying mechanism configured to use the initial feature vector to classify another set of documents to produce an initial classified set of documents;
  
  a query-receiving mechanism configured to receive click information associated with a set of queries related to the specific topic, wherein the click information includes a click-through rate at which a query result is selected after being presented and a click duration indicating an amount of time during which the query result is accessed;
  
  a removing mechanism configured to use the click information to remove off-topic documents in the set of documents to obtain an updated set of documents, wherein a document is off-topic if the click-through rate or the click duration associated with the document indicates that the document is off-topic;
  
  a determination mechanism configured to determine an updated feature vector using the updated set of documents; and
  
  a re-classification mechanism configured to re-classify the classified set of documents using the updated feature vector when the percentage of documents identified as off-topic exceeds a threshold which is greater than 0, otherwise retaining the initial classified set of documents.
- Dependent claims (20, 21):
  - 20. The computer system of claim 19, wherein the apparatus further comprises a query-processing mechanism, wherein the query-processing mechanism is configured to:
    - receive a new query;
      
      determine whether the new query is related to the specific topic; and
      
      to process the new query to produce query results, wherein if the new query is related to the specific topic, processing the new query involves adjusting relevancy scores for documents based on annotations associated with the documents, wherein the annotations indicate whether the documents are related to the specific topic.
  - 21. The computer system of claim 20, wherein while adjusting the relevancy scores for documents, the query-processing mechanism is configured to:
    - boost relevancy scores for documents which are annotated as being related to the specific topic;
      
      reduce relevancy scores for documents which are annotated as being spam documents; and
      
      to change the rankings of search results based on the adjusted relevancy scores.
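
The annotation-based score adjustment of claims 20 and 21 can be sketched as below. The `Annotation` labels mirror the three categories named in the claims, but the multiplicative `BOOST` and `SPAM_PENALTY` factors and the `(doc_id, score)` result format are hypothetical; the claims say only that scores are boosted or reduced and the ranking changed.

```python
from enum import Enum


class Annotation(Enum):
    ON_TOPIC = "on_topic"
    SPAM = "spam"
    OFF_TOPIC = "off_topic"


# Assumed adjustment factors for illustration only.
BOOST = 1.5
SPAM_PENALTY = 0.25


def rerank(results, annotations):
    """Adjust relevancy scores by annotation and re-sort the results.

    `results` is a list of (doc_id, score) pairs; `annotations` maps
    doc_id to an Annotation. Unannotated and off-topic documents keep
    their original score in this sketch.
    """
    adjusted = []
    for doc_id, score in results:
        label = annotations.get(doc_id)
        if label is Annotation.ON_TOPIC:
            score *= BOOST
        elif label is Annotation.SPAM:
            score *= SPAM_PENALTY
        adjusted.append((doc_id, score))
    # Highest adjusted score first.
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)
```

Per claim 20, an adjustment like this would apply only when the incoming query has been judged related to the specific topic; other queries would keep their original ranking.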

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Qian, Zhe; Feng, Zhengzhu; Wu, Jun; Guo, Quji
Primary Examiner(s): Trujillo; James
Assistant Examiner(s): BURKE, JEFF A
Application Number: US11/334,157
Time in Patent Office: 1,659 Days
Field of Search: 707/6, 707/722, 707/723, 707/731, 707/736, 707/748, 707/749, 707/750
US Class Current: 707/728
CPC Class Codes: G06F 16/335 Filtering based on addition...
