System and method for focussed web crawling
Abstract
A focussed Web crawler learns to recognize Web pages that are relevant to the interest of one or more users, from a set of examples provided by the users. It then explores the Web starting from the example set, using the statistics collected from the examples and other analysis on the link graph of the growing crawl database, to guide itself towards relevant, valuable resources and away from irrelevant and/or low quality material on the Web. Thereby, the Web crawler builds a comprehensive topic-specific library for the benefit of specific users.
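The crawl loop the abstract describes — start from a seed set, evaluate each fetched page for relevance, and follow outlinks only of pages judged relevant — can be sketched as below. This is a minimal illustration only; the helper names `is_relevant` and `get_outlinks` stand in for the patent's relevance classifier and link extractor and are assumptions, not the patented implementation.

```python
from collections import deque

def focussed_crawl(seed_pages, is_relevant, get_outlinks, max_pages=100):
    """Sketch of a focussed-crawl loop (illustrative, not the patented method).

    is_relevant and get_outlinks are assumed callables standing in for
    the relevance classifier and link extractor described in the abstract.
    """
    crawl_db = {}                 # page -> relevance verdict
    frontier = deque(seed_pages)  # pages waiting to be considered
    while frontier and len(crawl_db) < max_pages:
        page = frontier.popleft()
        if page in crawl_db:
            continue
        relevant = is_relevant(page)
        crawl_db[page] = relevant
        if relevant:              # outlinks stored only for relevant pages
            frontier.extend(get_outlinks(page))
    return crawl_db
```

On a toy link graph, outlinks of an irrelevant page are never followed, so pages reachable only through it never enter the crawl database.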
428 Citations
32 Claims
1. A general purpose computer including a data storage device including a computer usable medium having computer readable code means for focussed crawling of a wide area computer network such as the World Wide Web, comprising:
computer readable code means for receiving a seed set of Web pages in a crawl database, the seed set being selected based at least on relevance to at least one topic;
computer readable code means for identifying outlink Web pages from one or more Web pages in the crawl database;
computer readable code means for evaluating one or more outlink Web pages for relevance to the topic; and
computer readable code means for causing outlinks only of Web pages evaluated as being relevant to the topic to be stored in the crawl database.

2. The computer of claim 1, further comprising:
computer readable code means for assigning a revisitation priority to a Web page, based on the means for evaluating.
3. The computer of claim 1, further comprising:
at least one watchdog module periodically determining new and old pages to consider, the new pages being selected from the outlink Web pages, the old pages being selected from pages in the crawl database; and
one or more worker modules responsive to the watchdog module for accessing the new and old pages to consider, each Web page being associated with a respective Num_Tries field representing the number of times the respective page has been accessed, the Num_Tries field of a Web page being incremented each time the Web page is considered.
4. The computer of claim 1, further comprising one or more worker modules for accessing new and old pages to consider, each Web page being associated with a respective field representing the number of times the respective page has been accessed, the field of a Web page being incremented each time the Web page is considered, wherein outlinks of a Web page considered by the worker module are entered into a link table in the crawl database only when the means for evaluating determines that the Web page is relevant.
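Claims 3 and 4 describe worker modules that increment a per-page Num_Tries field each time a page is considered and enter a page's outlinks into a link table only when the page is relevant. A minimal sketch of that bookkeeping, with the class and attribute names assumed for illustration:

```python
class CrawlDatabase:
    """Illustrative stand-in for the crawl database of claims 3 and 4."""

    def __init__(self):
        self.num_tries = {}   # per-page Num_Tries field
        self.link_table = {}  # page -> outlinks, relevant pages only

    def consider(self, page, relevant, outlinks):
        # Num_Tries is incremented every time the page is considered.
        self.num_tries[page] = self.num_tries.get(page, 0) + 1
        # Outlinks enter the link table only when the page is relevant.
        if relevant:
            self.link_table[page] = list(outlinks)
```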
5. The computer of claim 4, wherein the worker module includes means for determining whether a gathering rate of pages being considered is below a panic threshold.
6. The computer of claim 5, wherein the computer includes computer readable code means for considering all outlinks and inlinks of at least one Web page in the crawl database when the gathering rate is at or below the panic threshold.
7. The computer of claim 5, wherein the topic defines a scope, and the scope is increased to an expanded scope when the gathering rate is at or below the threshold.
8. The computer of claim 7, further comprising computer readable code means for estimating a size of a prospective Web page consideration step based on the expanded scope, and permitting the prospective consideration step only when the size is at or below a size threshold.
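Claims 5 through 8 describe a recovery behaviour: when the gathering rate falls to or below a panic threshold, the crawler widens its scope to consider inlinks as well as outlinks, but first estimates the size of the expanded consideration step and proceeds with it only if that size stays at or below a size threshold. A sketch under those assumptions; the function name and threshold values are chosen arbitrarily for illustration:

```python
def plan_next_step(gathering_rate, outlinks, inlinks,
                   panic_threshold=0.1, size_threshold=1000):
    """Illustrative sketch of the panic behaviour of claims 5-8."""
    if gathering_rate > panic_threshold:
        return outlinks                  # normal, focussed step
    expanded = outlinks + inlinks        # panic: widen the scope to inlinks
    if len(expanded) <= size_threshold:  # claim 8's size estimate check
        return expanded
    return outlinks                      # expansion too large; stay focussed
```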
9. A computer system for focussed searching of the World Wide Web, comprising:
a computer including at least one worker module for undertaking work, the work including adding outlinks of a Web page to a crawl database only when the Web page is relevant to a predefined topic, the system including the crawl database, the crawl database being accessible by the computer, the crawl database being focussed only on the topic such that the system includes no comprehensive database of the World Wide Web.

10. The computer system of claim 9, further comprising:
means executable by the computer for receiving into the crawl database a seed set of Web pages, the seed set being representative of at least the topic;
means in the worker module for identifying outlink Web pages from one or more Web pages in the crawl database;
means in the worker module for evaluating one or more outlink Web pages for relevance to the topic; and
means in the worker module for causing outlinks of Web pages to be stored in the crawl database in response to the means for evaluating.
11. The computer system of claim 10, further comprising:
means for assigning a revisitation priority to a Web page, based on the means for evaluating.
12. The computer of claim 10, further comprising a watchdog module for scheduling worker thread work, wherein the watchdog module periodically determines new and old pages to consider, the new pages being selected from the outlink Web pages, the old pages being selected from pages in the crawl database, the worker module being responsive to the watchdog module for accessing the new and old pages to consider, each Web page being associated with a respective field representing the number of times the respective page has been accessed, the field of a Web page being incremented each time the Web page is considered.
13. The computer of claim 12, wherein outlinks of a Web page considered by the worker module are entered into a link table in the crawl database only when the means for evaluating determines that the Web page is relevant.
14. The computer of claim 12, wherein the worker module includes means for determining whether a gathering rate of pages being considered is below a panic threshold.
15. The computer of claim 14, wherein the worker module includes means for considering all outlinks and inlinks of at least one Web page in the crawl database when the gathering rate is at or below the panic threshold.
16. The computer of claim 14, wherein the topic defines a scope, and the scope is increased to an expanded scope when the gathering rate is at or below the threshold.
17. The computer of claim 16, wherein the worker module further comprises means for estimating a size of a prospective Web page consideration step based on the expanded scope, and permitting the prospective consideration step only when the size is at or below a size threshold.
18. A computer-implemented method for building a focussed database of Web pages, comprising the acts of:
receiving a search query from a user;
in response to the search query, accessing a crawl database containing only information pertaining to Web pages related to a predefined range of topics;
periodically determining new and old pages to consider, the new pages being selected from outlink Web pages, the old pages being selected from pages in the crawl database; and
accessing the new and old pages to consider, each Web page being associated with a respective field representing the number of times the respective page has been accessed, the field of a Web page being incremented each time the Web page is considered.

19. The method of claim 18, further comprising:
initially receiving a seed set of Web pages in the crawl database, the seed set defining the range of topics;
evaluating one or more Web pages in the database for relevance to the range of topics;
identifying outlink Web pages from one or more Web pages in the crawl database in response to the evaluating step;
evaluating one or more outlink Web pages for relevance to the range of topics; and
causing outlinks only of Web pages evaluated as being relevant to the topics to be stored in the crawl database.
20. The method of claim 18, wherein outlinks of a Web page are entered into a link table in the crawl database only when the Web page is relevant.
21. The method of claim 18, further including determining whether a gathering rate of pages being considered is below a panic threshold.
22. The method of claim 21, further comprising considering all outlinks and inlinks of at least one Web page in the crawl database when the gathering rate is at or below the panic threshold.
23. The method of claim 21, wherein the topic defines a scope, and the scope is increased to an expanded scope when the gathering rate is at or below the threshold.
24. The method of claim 23, further comprising estimating a size of a prospective Web page consideration step based on the expanded scope, and permitting the prospective consideration step only when the size is at or below a size threshold.
25. A computer program device comprising:
a computer program storage device readable by a digital processing apparatus; and
a program means on the program storage device and including instructions executable by the digital processing apparatus for performing method steps for focussed Web crawling, the method steps comprising the acts of:
receiving a search query from a user;
in response to the search query, accessing a crawl database containing only information pertaining to Web pages related to a predefined topic or topics; and
determining whether a gathering rate of pages being considered is below a panic threshold.

26. The computer program device of claim 25, wherein the acts further comprise:
initially receiving a seed set of Web pages in the crawl database, the seed set defining the topic;
identifying outlink Web pages from one or more Web pages in the crawl database;
evaluating one or more outlink Web pages for relevance to the topic; and
causing outlinks only of Web pages evaluated as being relevant to the topic to be stored in the crawl database.
27. The computer program device of claim 26, wherein the acts further comprise:
periodically determining new and old pages to consider, the new pages being selected from the outlink Web pages, the old pages being selected from pages in the crawl database; and
accessing the new and old pages to consider, each Web page being associated with a respective field representing the number of times the respective page has been accessed, the field of a Web page being incremented each time the Web page is considered.
28. The computer program device of claim 26, wherein outlinks of a Web page are entered into a link table in the crawl database only when the Web page is relevant.
29. The computer program device of claim 25, wherein the acts further comprise considering all outlinks and inlinks of at least one Web page in the crawl database when the gathering rate is at or below the panic threshold.
30. The computer program device of claim 25, wherein the topic defines a scope, and the scope is increased to an expanded scope when the gathering rate is at or below the threshold.
31. The computer program device of claim 30, wherein the acts further comprise estimating a size of a prospective Web page consideration step based on the expanded scope, and permitting the prospective consideration step only when the size is at or below a size threshold.
32. A computer-implemented method for establishing a database of Web pages based on the relevance of the pages to at least one predefined topic, comprising the acts of:
providing at least one example Web page;
determining the relevance of at least one Web page based at least partially on the example Web page; and
at least in part using the relevance of the Web page, determining whether to insert outlinks of the page into the database, wherein the act of determining whether to insert outlinks of the page into the database is undertaken using either a “soft” policy or a “hard” policy.
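One plausible reading of claim 32's two policies — and it is only a reading, since the claim itself does not define the terms — is that a “hard” policy inserts outlinks only when the page's relevance clears a threshold, while a “soft” policy inserts every outlink but carries the page's relevance along as a crawl priority, so links from less relevant pages sink in the queue. All names and thresholds below are assumptions for illustration:

```python
def outlinks_to_insert(relevance, outlinks, policy="hard", threshold=0.5):
    """Illustrative sketch of one reading of claim 32's two policies.

    hard: insert outlinks only when relevance clears the threshold.
    soft: insert every outlink, tagged with the parent page's relevance
          as a priority for later scheduling.
    """
    if policy == "hard":
        return [(link, 1.0) for link in outlinks] if relevance >= threshold else []
    # soft policy: keep everything, prioritised by the parent's relevance
    return [(link, relevance) for link in outlinks]
```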
Specification