Method and apparatus for classifying documents based on user inputs
First Claim
1. A method executed on one or more processors for automatically classifying documents based on topics and user inputs, comprising:
receiving a set of documents which are classified as relating to the specific topic;
producing an initial feature vector that corresponds to frequency of a term's occurrence in the set of documents;
using the initial feature vector to classify another set of documents to produce an initial classified set of documents;
receiving click information associated with a set of queries related to the specific topic, wherein the click information includes a click-through rate at which a query result is selected after being presented and a click duration indicating an amount of time during which the query result is accessed;
using the click information to remove off-topic documents in the set of documents to obtain an updated set of documents, wherein a document is off-topic if the click-through rate or click duration associated with the document indicates the document is off-topic;
determining an updated feature vector using the updated set of documents; and
re-classifying the classified set of documents using the updated feature vector when the percentage of documents identified as off-topic exceeds a threshold which is greater than 0, otherwise retaining the initial classified set of documents.
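The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the whitespace tokenizer, the cosine-similarity classifier, the `min_similarity` cutoff, the default `off_topic_threshold`, and all function names are assumptions chosen for illustration. The set of off-topic document ids is taken as an input here rather than derived from click information, and re-classification is read as re-scoring the initially classified documents against the updated feature vector.

```python
import math
from collections import Counter


def term_frequency_vector(documents):
    """Relative frequency of each term across a collection of document texts."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())  # assumption: whitespace tokenizer
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse term -> weight dictionaries."""
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(candidates, feature_vector, min_similarity=0.2):
    """Return the ids of candidates similar enough to the feature vector.

    min_similarity is a hypothetical cutoff, not a value from the claims.
    """
    return {
        doc_id
        for doc_id, text in candidates.items()
        if cosine_similarity(term_frequency_vector([text]), feature_vector)
        >= min_similarity
    }


def classify_with_feedback(seed_docs, candidates, off_topic_ids,
                           off_topic_threshold=0.1):
    """Classify candidates, then re-classify only if enough seeds were off-topic."""
    # Initial feature vector and initial classified set.
    initial_vector = term_frequency_vector(seed_docs.values())
    initial_classified = classify(candidates, initial_vector)

    # Remove off-topic documents from the seed set to get the updated set.
    updated_docs = {d: t for d, t in seed_docs.items() if d not in off_topic_ids}
    removed = len(seed_docs) - len(updated_docs)
    removed_fraction = removed / len(seed_docs) if seed_docs else 0.0

    # Re-classify only when the off-topic share exceeds the threshold (> 0);
    # otherwise retain the initial classified set.
    if removed_fraction > off_topic_threshold:
        updated_vector = term_frequency_vector(updated_docs.values())
        to_recheck = {d: candidates[d] for d in initial_classified}
        return classify(to_recheck, updated_vector)
    return initial_classified
```

Under these assumptions, a seed set that contains one spam page can initially pull a spam candidate into the classified set; once that seed is flagged off-topic and its share exceeds the threshold, the updated feature vector no longer matches the spam candidate.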
Abstract
One embodiment of the present invention provides a system that automatically classifies documents (such as web pages) based on user inputs. During operation, the system obtains a “classified” set of documents which are classified as relating to a specific topic. The system also obtains queries related to the specific topic. These queries produce “query results” which enable the user to access documents related to the query. The queries also include “click information” which specifies how one or more users have accessed the query results. The system uses this click information to identify documents in the classified set of documents which are not related to the specific topic or are off-topic. If such documents are identified, the system shifts the identified documents so that they are regarded as off-topic and/or spam, and removes the identified documents from the classified set of documents.
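As a rough illustration of how click information might flag off-topic documents, the following Python sketch computes a click-through rate and a mean click duration per document and applies an "either signal" rule, mirroring the "click-through rate or click duration" wording of the claims. The `ClickStats` record, the `find_off_topic` function, and the `min_ctr` and `min_duration` cutoffs are all hypothetical; the source does not specify how the signals are stored or what values indicate an off-topic document.

```python
from dataclasses import dataclass


@dataclass
class ClickStats:
    """Hypothetical per-document aggregate of click information."""
    impressions: int      # times the document was presented as a query result
    clicks: int           # times the presented result was selected
    total_seconds: float  # total time the document was accessed after clicks

    @property
    def click_through_rate(self):
        return self.clicks / self.impressions if self.impressions else 0.0

    @property
    def mean_click_duration(self):
        return self.total_seconds / self.clicks if self.clicks else 0.0


def find_off_topic(stats_by_doc, min_ctr=0.05, min_duration=10.0):
    """Return ids of documents whose click signals suggest they are off-topic.

    min_ctr and min_duration are assumed cutoffs chosen for illustration.
    """
    off_topic = set()
    for doc_id, stats in stats_by_doc.items():
        if stats.impressions == 0:
            continue  # never presented: no evidence either way
        # Either signal alone is enough to flag the document.
        if (stats.click_through_rate < min_ctr
                or stats.mean_click_duration < min_duration):
            off_topic.add(doc_id)
    return off_topic


def remove_off_topic(classified_ids, stats_by_doc):
    """Drop flagged documents from the classified set."""
    return set(classified_ids) - find_off_topic(stats_by_doc)
```

A document that is rarely selected, or that users leave within a few seconds, is flagged and removed; documents with no impressions are left in place because the sketch treats missing click information as no evidence.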
21 Claims
1. A method executed on one or more processors for automatically classifying documents based on topics and user inputs, comprising:
receiving a set of documents which are classified as relating to the specific topic;
producing an initial feature vector that corresponds to frequency of a term's occurrence in the set of documents;
using the initial feature vector to classify another set of documents to produce an initial classified set of documents;
receiving click information associated with a set of queries related to the specific topic, wherein the click information includes a click-through rate at which a query result is selected after being presented and a click duration indicating an amount of time during which the query result is accessed;
using the click information to remove off-topic documents in the set of documents to obtain an updated set of documents, wherein a document is off-topic if the click-through rate or click duration associated with the document indicates the document is off-topic;
determining an updated feature vector using the updated set of documents; and
re-classifying the classified set of documents using the updated feature vector when the percentage of documents identified as off-topic exceeds a threshold which is greater than 0, otherwise retaining the initial classified set of documents.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for automatically classifying documents based on topics and user inputs, the method comprising:
receiving a set of documents which are classified as relating to a specific topic;
producing an initial feature vector that corresponds to frequency of a term's occurrence in the set of documents;
using the initial feature vector to classify another set of documents to produce an initial classified set of documents;
receiving click information associated with a set of queries related to the specific topic, wherein the click information includes a click-through rate at which a query result is selected after being presented and a click duration indicating an amount of time during which the query result is accessed;
using the click information to remove off-topic documents in the set of documents to obtain an updated set of documents, wherein a document is off-topic if the click-through rate or click duration associated with the document indicates the document is off-topic;
determining an updated feature vector using the updated set of documents; and
re-classifying the classified set of documents using the updated feature vector when the percentage of documents identified as off-topic exceeds a threshold which is greater than 0, otherwise retaining the initial classified set of documents.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
19. A computer system that automatically classifies documents based on user inputs, comprising:
a processor;
a memory;
a document-receiving mechanism configured to receive a set of documents which are classified as relating to a specific topic;
a feature-vector producing mechanism configured to produce an initial feature vector that corresponds to frequency of a term's occurrence in the set of documents;
a classifying mechanism configured to use the initial feature vector to classify another set of documents to produce an initial classified set of documents;
a query-receiving mechanism configured to receive click information associated with a set of queries related to the specific topic, wherein the click information includes a click-through rate at which a query result is selected after being presented and a click duration indicating an amount of time during which the query result is accessed;
a removing mechanism configured to use the click information to remove off-topic documents in the set of documents to obtain an updated set of documents, wherein a document is off-topic if the click-through rate or the click duration associated with the document indicates that the document is off-topic;
a determination mechanism configured to determine an updated feature vector using the updated set of documents; and
a re-classification mechanism configured to re-classify the classified set of documents using the updated feature vector when the percentage of documents identified as off-topic exceeds a threshold which is greater than 0, otherwise retaining the initial classified set of documents.
Dependent claims: 20, 21
Specification