Self-improving system and method for classifying pages on the world wide web
Abstract
A self-improving system and method for classifying a plurality of digital documents such as web pages into one or more categories. Textual features and contextual features are extracted from a digital document and submitted to a committee machine. The committee machine assigns a rating to the digital document as a function of the extracted features and provides the location such as a URL for the digital document and its rating to an output data store. The output data store stores a list of locations for the plurality of digital documents. The output data store further segregates the locations of the digital document into categories based on the content of each document as indicated by the assigned rating.
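The rating and segregation steps described in the abstract can be illustrated with a minimal sketch. It assumes a committee whose members each score a feature dictionary in the range 0 to 1 and whose scores are combined by a plain average. The feature names (`keyword_density`, `on_topic_link_ratio`), the threshold, and the category labels are hypothetical, chosen only for illustration; the abstract does not specify how the committee combines its members.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = Dict[str, float]
Expert = Callable[[Features], float]  # each committee member returns a score in [0, 1]


@dataclass
class RatedDocument:
    url: str
    rating: float


def committee_rating(features: Features, experts: List[Expert]) -> float:
    """Combine the members' scores by simple averaging (an assumed rule)."""
    if not experts:
        raise ValueError("committee needs at least one expert")
    return sum(expert(features) for expert in experts) / len(experts)


def categorize(docs: Dict[str, Features], experts: List[Expert],
               threshold: float = 0.5) -> Dict[str, List[RatedDocument]]:
    """Store each document's location and rating under a category."""
    store: Dict[str, List[RatedDocument]] = {"on_topic": [], "off_topic": []}
    for url, features in docs.items():
        rating = committee_rating(features, experts)
        key = "on_topic" if rating >= threshold else "off_topic"
        store[key].append(RatedDocument(url, rating))
    return store


# Hypothetical committee members: one reads a textual feature, one a contextual feature.
def text_expert(features: Features) -> float:
    return features.get("keyword_density", 0.0)


def link_expert(features: Features) -> float:
    return features.get("on_topic_link_ratio", 0.0)


docs = {
    "http://a.example/": {"keyword_density": 0.9, "on_topic_link_ratio": 0.7},
    "http://b.example/": {"keyword_density": 0.1, "on_topic_link_ratio": 0.2},
}
store = categorize(docs, [text_expert, link_expert])
```

Here `store` plays the role of the output data store: it keeps each document's location together with its rating, grouped by category.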
201 Citations
26 Claims
1. A method of categorizing documents comprising:
locating a plurality of documents to be categorized;
extracting textual and contextual features from within each of the documents;
identifying untrustworthy documents from the extracted features, said untrustworthy documents being eliminated from the plurality of documents to be categorized;
evaluating each of the documents according to one or more of the extracted textual and contextual features;
identifying lists of documents from the evaluated documents relating to a topic in response to a user query relating to the topic; and
identifying documents within the identified lists relating to the topic.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9.
10. A method of categorizing documents comprising:
locating a plurality of documents to be categorized;
evaluating each of the located plurality of documents according to one or more of the following:
eliminating pathological pages;
rating connected documents;
analyzing links within each of the documents;
analyzing a file name (e.g., URL) of each of the documents; and
analyzing names of images within each of the documents;
indexing the evaluated documents into a plurality of lists in response to a user query relating to a topic; and
identifying lists relating to the topic and identifying documents within the identified lists relating to the topic.
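As an illustration of the link, file-name, and image-name analyses recited in claim 10, the following sketch collects those contextual signals from a single page using only the Python standard library. The function and key names (`contextual_features`, `url_tokens`, `image_tokens`) and the tokenization rule are assumptions made for this example, not taken from the patent.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkAndImageCollector(HTMLParser):
    """Collect hyperlink targets and image file names from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.image_names = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("href"):
            self.links.append(attr["href"])
        elif tag == "img" and attr.get("src"):
            # Keep only the file name, dropping any directory path.
            path = urlparse(attr["src"]).path
            self.image_names.append(path.rsplit("/", 1)[-1])


def tokenize(text):
    """Split on anything that is not a lowercase letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def contextual_features(url, html):
    """Return URL tokens, outgoing links, and image-name tokens for one page."""
    collector = LinkAndImageCollector()
    collector.feed(html)
    parsed = urlparse(url)
    return {
        "url_tokens": tokenize(parsed.netloc + parsed.path),
        "links": collector.links,
        "image_tokens": [t for name in collector.image_names for t in tokenize(name)],
    }


html = '<a href="http://x.example/page">x</a><img src="/img/Red-Car_01.jpg">'
features = contextual_features("http://www.example.com/cars/index.html", html)
```

Tokens such as `cars` in the URL or `car` in an image name could then be handed to whatever evaluator rates the page. How those signals are weighted is left open here.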
11. A method of categorizing documents comprising:
locating a plurality of documents to be categorized according to one or more of the following:
considering documents identified by a user which have not been previously evaluated;
considering links within documents which links have not been previously evaluated; and
considering links within aggregated documents which links have not been previously evaluated;
evaluating each of the located plurality of documents;
indexing the evaluated documents into a plurality of lists in response to a user query relating to a topic; and
identifying lists relating to the topic and identifying documents within the identified lists relating to the topic.
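The locating step of claim 11 considers only documents and links that have not been previously evaluated. One simple way to sketch that bookkeeping is a queue paired with a set of already-seen URLs. The `Frontier` class and its method names are hypothetical, and the sketch treats all three sources (user-identified documents, links within documents, links within aggregated documents) identically.

```python
from collections import deque


class Frontier:
    """Queue of document locations that have not been previously evaluated."""

    def __init__(self):
        self._seen = set()
        self._queue = deque()

    def add(self, url):
        """Queue url only if it has not been seen before; return True if queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def add_many(self, urls):
        """Queue every unseen URL and return how many were newly queued."""
        return sum(self.add(u) for u in urls)

    def next(self):
        """Return the next location to evaluate, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


frontier = Frontier()
first = frontier.add("http://a.example/")  # a document identified by a user
# Links found within a document; the first one was already queued.
added = frontier.add_many(["http://a.example/", "http://b.example/", "http://c.example/"])
order = [frontier.next(), frontier.next(), frontier.next(), frontier.next()]
```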
12. A system of categorizing documents comprising:
an input data store identifying documents to be evaluated;
a feature extraction tool extracting page-level information and features from the documents to be evaluated;
a committee machine:
for consolidating extracted page-level information and features to decide whether the extracted page-level information and features are trustworthy content;
for categorizing the documents based on whether the extracted page-level information and features are trustworthy content;
an output data store for storing an identification of each of the categorized documents according to their categories.
Dependent claims: 13, 14, 15, 16, 17.
18. A computer readable medium having computer executable instructions for categorizing a plurality of documents, comprising:
locating instructions for locating the plurality of documents to be evaluated;
extracting instructions for extracting page-level information and/or features from the documents to be evaluated;
examining instructions for examining the extracted page-level information and/or features to determine whether the extracted page-level information and/or features are trustworthy content;
categorizing instructions for categorizing documents according to extracted identified page-level information and/or features determined to be trustworthy content; and
storing instructions for storing locations of categorized documents according to their categories.
Dependent claims: 19, 20, 21, 22, 23, 24, 25, 26.
Specification