Self-improving system and method for classifying pages on the world wide web

US 20030225763A1
Filed: 04/14/2003
Published: 12/04/2003
Est. Priority Date: 04/15/2002
Status: Abandoned Application

Abstract

A self-improving system and method for classifying a plurality of digital documents, such as web pages, into one or more categories. Textual features and contextual features are extracted from a digital document and submitted to a committee machine. The committee machine assigns a rating to the digital document as a function of the extracted features and provides the location of the digital document, such as a URL, and its rating to an output data store. The output data store stores a list of locations for the plurality of digital documents. The output data store further segregates the locations of the digital documents into categories based on the content of each document as indicated by the assigned rating.
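
The rating step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicants' implementation: the abstract says only that a committee machine assigns a rating as a function of the extracted features, so the averaging rule, the two member functions, and the feature names `keyword_density` and `flagged_link_ratio` are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]
Member = Callable[[Features], float]  # each member returns a score in [0, 1]


def committee_rating(members: List[Member], features: Features) -> float:
    """Combine member scores by simple averaging (an assumed combination rule)."""
    return mean(member(features) for member in members)


# Two toy committee members, each reading one hypothetical feature.
def text_member(features: Features) -> float:
    return features.get("keyword_density", 0.0)


def link_member(features: Features) -> float:
    return features.get("flagged_link_ratio", 0.0)


# Stand-in for the output data store: (location, rating) records.
output_store: List[Tuple[str, float]] = []

url = "http://example.com/page"
rating = committee_rating(
    [text_member, link_member],
    {"keyword_density": 0.8, "flagged_link_ratio": 0.6},
)
output_store.append((url, rating))
```

Averaging is only one way to combine committee members; a weighted vote or a trained combiner would fit the same interface.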

201 Citations

26 Claims

1. A method of categorizing documents comprising:
- locating a plurality of documents to be categorized;
  
  extracting textual and contextual features from within each of the documents;
  
  identifying untrustworthy documents from the extracted features, said untrustworthy documents being eliminated from the plurality of documents to be categorized;
  
  evaluating each of the documents according to one or more of the extracted textual and contextual features;
  
  identifying lists of documents from the evaluated documents relating to a topic in response to a user query relating to the topic; and
  
  identifying documents within the identified lists relating to the topic.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9):
  - 2. The method of claim 1, wherein the plurality of documents are located by one or more of the following techniques:
    - considering documents identified by a user which have not been previously evaluated;
      
      considering links within documents which links have not been previously evaluated;
      
      or considering links within aggregated documents which links have not been previously evaluated.
  - 3. The method of claim 1, wherein the evaluating each of the documents includes determining a rating for each of the documents as a function of the extracted textual and/or contextual features, wherein the identifying lists relative to the topic includes comparing the rating of each of the documents to a threshold value associated with the topic, said threshold value being predetermined by the user or a third party.
  - 4. The method of claim 3, wherein a first list of documents includes documents having a determined rating less than or equal to the threshold value, and wherein a second list of documents includes documents having a determined rating greater than the threshold value.
  - 5. The method of claim 3, wherein the extracting textual features from within each of the documents includes extracting textual components including words, letters, and internal punctuation marks, and wherein the evaluating each of the documents includes determining a rating for each of the documents as a function of the extracted textual components.
  - 6. The method of claim 3, wherein the extracting contextual features from within each of the documents includes extracting text associated with an image within the document, and wherein the evaluating each of the documents includes determining the rating for each of the documents as a function of the extracted text associated with the image.
  - 7. The method of claim 3, wherein the extracting contextual features from within each of the documents includes extracting text associated with a link within the document, and wherein the evaluating each of the documents includes determining the rating for each of the documents as a function of the extracted text associated with the link.
  - 8. The method of claim 1, wherein the extracting contextual features from within each of the documents includes extracting links from within each of the documents, wherein the evaluating each of the documents includes comparing target locations of extracted links to locations of the identified list of documents to identify unknown links, and wherein target documents of one or more of said unknown links are automatically located to be categorized.
  - 9. The method of claim 1, wherein the extracting contextual features from within each of the documents includes extracting a file name (e.g., URL) of each of the documents, and wherein the evaluating each of the documents includes comparing the extracted file name for each of the documents to file names of the identified list of documents to determine whether a particular document has been previously evaluated.
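
The feature extraction recited in claims 5 to 7 might look roughly like the sketch below, which uses Python's standard `html.parser` module. It is an illustration only: the class name `FeatureExtractor`, the returned dictionary keys, and the word pattern are assumptions, and the claims do not say how words, letters, or internal punctuation marks are tokenized. Here "internal punctuation" is read as an apostrophe, period, or hyphen between alphanumeric runs.

```python
import re
from html.parser import HTMLParser
from typing import Dict, List


class FeatureExtractor(HTMLParser):
    """Collects visible text plus link targets, link text, and image alt text."""

    def __init__(self) -> None:
        super().__init__()
        self.text_parts: List[str] = []
        self.links: List[str] = []
        self.link_text: List[str] = []
        self.image_text: List[str] = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("href"):
            self.links.append(attr["href"])
            self._in_link = True
        elif tag == "img" and attr.get("alt"):
            self.image_text.append(attr["alt"])

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)
            if self._in_link:
                self.link_text.append(stripped)


def extract_features(html: str) -> Dict[str, List[str]]:
    parser = FeatureExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    return {
        # Textual components: words, keeping internal punctuation (e.g. "e-mail").
        "words": re.findall(r"[A-Za-z0-9]+(?:['.-][A-Za-z0-9]+)*", text),
        # Contextual components: links, link text, and text attached to images.
        "links": parser.links,
        "link_text": parser.link_text,
        "image_text": parser.image_text,
    }


page = (
    "<p>Don't miss e-mail news.</p>"
    '<a href="/a.html">Read more</a>'
    '<img src="x.png" alt="site logo">'
)
features = extract_features(page)
```

The resulting dictionary separates textual from contextual components, so a rating function can weigh them independently.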

10. A method of categorizing documents comprising:
- locating a plurality of documents to be categorized;
  
  evaluating each of the located plurality of documents according to one or more of the following:
  
  eliminating pathological pages;
  
  rating connected documents;
  
  analyzing links within each of the documents;
  
  analyzing a file name (e.g., URL) of each of the documents; and
  
  analyzing names of images within each of the documents;
  
  indexing the evaluated documents into a plurality of lists in response to a user query relating to a topic; and
  
  identifying lists relating to the topic and identifying documents within the identified lists relating to the topic.

11. A method of categorizing documents comprising:
- locating a plurality of documents to be categorized according to one or more of the following:
  
  considering documents identified by a user which have not been previously evaluated;
  
  considering links within documents which links have not been previously evaluated; and
  
  considering links within aggregated documents which links have not been previously evaluated;
  
  evaluating each of the located plurality of documents;
  
  indexing the evaluated documents into a plurality of lists in response to a user query relating to a topic; and
  
  identifying lists relating to the topic and identifying documents within the identified lists relating to the topic.

12. A system of categorizing documents comprising:
- an input data store identifying documents to be evaluated;
  
  a feature extraction tool extracting page-level information and features from the documents to be evaluated;
  
  a committee machine;
  
  for consolidating extracted page-level information and features to decide whether the extracted page-level information and features are trustworthy content;
  
  for categorizing the documents based on whether the extracted page-level information and features are trustworthy content;
  
  an output data store for storing an identification of each of the categorized documents according to their categories.
- Dependent claims (13, 14, 15, 16, 17):
  - 13. The system of claim 12, wherein the committee machine is a learning-based classifier, and wherein the learning-based classifier determines a rating of each of the documents according to extracted page-level information and features.
  - 14. The system of claim 13, wherein the committee machine categorizes documents into a first list of documents and a second list of documents by comparing the determined rating of each document to a threshold value, said threshold value being defined by a user or a third party, and wherein the first list of documents includes documents having a determined rating less than or equal to the threshold value, and wherein the second list of documents includes documents having a determined rating greater than the threshold value.
  - 15. The system of claim 14, wherein the output data store is a master database storing the identification of the first list of documents and the identification of the second list of documents.
  - 16. The system of claim 15, wherein the output data store further stores the rating of each of the categorized documents and the threshold value.
  - 17. The system of claim 15 further including a training data store for storing training documents, wherein said training documents are used to train the committee machine.
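
The threshold comparison of claims 14 to 16 can be sketched as below. This is a hedged illustration: the class name `MasterDatabase`, its field names, and the sample threshold of 0.5 are assumptions; the claims specify only that ratings less than or equal to the threshold go to a first list, ratings greater than it go to a second list, and that the store keeps the ratings and the threshold.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MasterDatabase:
    """Hypothetical output data store holding both lists, ratings, and threshold."""

    threshold: float  # defined by a user or a third party
    ratings: Dict[str, float] = field(default_factory=dict)
    first_list: List[str] = field(default_factory=list)   # rating <= threshold
    second_list: List[str] = field(default_factory=list)  # rating > threshold

    def categorize(self, doc_id: str, rating: float) -> None:
        """Record the rating and file the document in the matching list."""
        self.ratings[doc_id] = rating
        target = self.first_list if rating <= self.threshold else self.second_list
        target.append(doc_id)


db = MasterDatabase(threshold=0.5)
for doc_id, rating in [("a.html", 0.2), ("b.html", 0.5), ("c.html", 0.9)]:
    db.categorize(doc_id, rating)
```

Note that a rating exactly equal to the threshold lands in the first list, matching the "less than or equal to" wording of claim 14.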

18. A computer readable medium having computer executable instructions for categorizing a plurality of documents, comprising:
- locating instructions for locating the plurality of documents to be evaluated;
  
  extracting instructions for extracting page-level information and/or features from the documents to be evaluated;
  
  examining instructions for examining the extracted page-level information and/or features to determine whether the extracted page-level information and/or features are trustworthy content;
  
  categorizing instructions for categorizing documents according to extracted identified page-level information and/or features determined to be trustworthy content; and
  
  storing instructions for storing locations of categorized documents according to their categories.
- Dependent claims (19, 20, 21, 22, 23, 24, 25, 26):
  - 19. The computer readable medium of claim 18, wherein the locating instructions includes instruction for locating one or more documents in response to a request received from a user.
  - 20. The computer readable medium of claim 19, wherein the categorizing instructions includes instructions for determining a rating for each of the located documents as a function of the extracted features.
  - 21. The computer readable medium of claim 20, wherein the examining instructions includes instruction for examining textual components from within each of the located documents, said textual components include words, letters, and internal punctuation marks, and wherein the categorizing instructions includes instructions for determining the rating for each of the located documents as a function of the extracted textual components.
  - 22. The computer readable medium of claim 21, wherein the examining instructions includes instruction for examining contextual components from within each of the located documents, said contextual components include links, text associated with links, text associated with images, and URLs, and wherein the categorizing instructions includes instructions for determining the rating for each of the documents as a function of the examined contextual components.
  - 23. The computer readable medium of claim 22, wherein the storing instructions includes instructions for storing documents having a determined rating less than or equal to a threshold value in a first list, and wherein the storing instructions includes instructions for storing documents having a determined score greater than the predetermined threshold value in a second list, said threshold value being predetermined by a user or third party.
  - 24. The computer readable medium of claim 18, wherein the examining instructions includes instructions for identifying untrustworthy documents as a function of the extracted features, and wherein the examining instructions includes instruction for eliminating identified untrustworthy documents from categorization.
  - 25. The computer readable medium of claim 18, wherein the extracting instructions includes instruction for extracting links from within each of the documents, wherein the examining instructions includes instruction for determining a location of a target document of the link, and wherein the examining instructions includes instructions for comparing the determined location of the target document to stored locations of categorized documents to identify unknown links.
  - 26. The computer readable medium of claim 25, wherein the locating instructions further includes instruction for automatically locating one or more documents identified by unknown links.
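
The unknown-link check of claims 25 and 26 could be sketched as follows, using `urljoin` and `urldefrag` from Python's standard `urllib.parse`. The function name `find_unknown_links`, the choice to strip fragments, and the sample URLs are assumptions made for illustration; the claims say only that link targets are compared with the stored locations of categorized documents.

```python
from typing import Iterable, List, Set
from urllib.parse import urldefrag, urljoin


def find_unknown_links(page_url: str, hrefs: Iterable[str], known: Set[str]) -> List[str]:
    """Resolve each extracted link and return targets not yet in the store."""
    unknown: List[str] = []
    for href in hrefs:
        # Resolve relative links against the page, then drop any "#fragment".
        target, _fragment = urldefrag(urljoin(page_url, href))
        if target not in known and target not in unknown:
            unknown.append(target)
    return unknown


known_locations = {"http://example.com/a.html"}
queue = find_unknown_links(
    "http://example.com/index.html",
    ["a.html", "b.html#top", "http://other.example/c.html", "b.html"],
    known_locations,
)
```

The returned targets are the candidates that the locating step of claim 26 would fetch next for categorization.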

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Lulich, Daniel P.; Rehfuss, Paul Stephen; Guilak, Farzin G.
Application Number: US10/413,441
Publication Number: US 20030225763A1
US Class Current: 707/7
CPC Class Codes: G06F 16/353 into predefined classes
