System and method for focussed web crawling
Abstract
A focussed Web crawler learns to recognize Web pages that are relevant to the interest of one or more users, from a set of examples provided by the users. It then explores the Web starting from the example set, using the statistics collected from the examples and other analysis on the link graph of the growing crawl database, to guide itself towards relevant, valuable resources and away from irrelevant and/or low quality material on the Web. Thereby, the Web crawler builds a comprehensive topic-specific library for the benefit of specific users.
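The crawl loop the abstract describes — start from a seed set, evaluate each fetched page for relevance, and follow outlinks only of pages judged relevant — can be sketched as below. This is a minimal illustration only; the helper names `is_relevant` and `get_outlinks` stand in for the patent's relevance classifier and link extractor and are assumptions, not the patented implementation.

```python
from collections import deque

def focussed_crawl(seed_pages, is_relevant, get_outlinks, max_pages=100):
    """Sketch of a focussed-crawl loop (illustrative, not the patented method).

    is_relevant and get_outlinks are assumed callables standing in for
    the relevance classifier and link extractor described in the abstract.
    """
    crawl_db = {}                 # page -> relevance verdict
    frontier = deque(seed_pages)  # pages waiting to be considered
    while frontier and len(crawl_db) < max_pages:
        page = frontier.popleft()
        if page in crawl_db:
            continue
        relevant = is_relevant(page)
        crawl_db[page] = relevant
        if relevant:              # outlinks stored only for relevant pages
            frontier.extend(get_outlinks(page))
    return crawl_db
```

On a toy link graph, outlinks of an irrelevant page are never followed, so pages reachable only through it never enter the crawl database.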
428 Citations
32 Claims
1. A general purpose computer including a data storage device including a computer usable medium having computer readable code means for focussed crawling of a wide area computer network such as the World Wide Web, comprising:
computer readable code means for receiving a seed set of Web pages in a crawl database, the seed set being selected based at least on relevance to at least one topic;
computer readable code means for identifying outlink Web pages from one or more Web pages in the crawl database;
computer readable code means for evaluating one or more outlink Web pages for relevance to the topic; and
computer readable code means for causing outlinks only of Web pages evaluated as being relevant to the topic to be stored in the crawl database.

2. The computer of claim 1, further comprising:
computer readable code means for assigning a revisitation priority to a Web page, based on the means for evaluating.
3. The computer of claim 1, further comprising:
at least one watchdog module periodically determining new and old pages to consider, the new pages being selected from the outlink Web pages, the old pages being selected from pages in the crawl database; and
one or more worker modules responsive to the watchdog module for accessing the new and old pages to consider, each Web page being associated with a respective Num_Tries field representing the number of times the respective page has been accessed, the Num_Tries field of a Web page being incremented each time the Web page is considered.
4. The computer of claim 1, further comprising one or more worker modules for accessing new and old pages to consider, each Web page being associated with a respective field representing the number of times the respective page has been accessed, the field of a Web page being incremented each time the Web page is considered, wherein outlinks of a Web page considered by the worker module are entered into a link table in the crawl database only when the means for evaluating determines that the Web page is relevant.
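Claims 3 and 4 describe worker modules that increment a per-page Num_Tries field each time a page is considered and enter a page's outlinks into a link table only when the page is relevant. A minimal sketch of that bookkeeping, with the class and attribute names assumed for illustration:

```python
class CrawlDatabase:
    """Illustrative stand-in for the crawl database of claims 3 and 4."""

    def __init__(self):
        self.num_tries = {}   # per-page Num_Tries field
        self.link_table = {}  # page -> outlinks, relevant pages only

    def consider(self, page, relevant, outlinks):
        # Num_Tries is incremented every time the page is considered.
        self.num_tries[page] = self.num_tries.get(page, 0) + 1
        # Outlinks enter the link table only when the page is relevant.
        if relevant:
            self.link_table[page] = list(outlinks)
```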
5. The computer of claim 4, wherein the worker module includes means for determining whether a gathering rate of pages being considered is below a panic threshold.
6. The computer of claim 5, wherein the computer includes computer readable code means for considering all outlinks and inlinks of at least one Web page in the crawl database when the gathering rate is at or below the panic threshold.
7. The computer of claim 5, wherein the topic defines a scope, and the scope is increased to an expanded scope when the gathering rate is at or below the threshold.
8. The computer of claim 7, further comprising computer readable code means for estimating a size of a prospective Web page consideration step based on the expanded scope, and permitting the prospective consideration step only when the size is at or below a size threshold.
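Claims 5 through 8 describe a recovery behaviour: when the gathering rate falls to or below a panic threshold, the crawler widens its scope to consider inlinks as well as outlinks, but first estimates the size of the expanded consideration step and proceeds with it only if that size stays at or below a size threshold. A sketch under those assumptions; the function name and threshold values are chosen arbitrarily for illustration:

```python
def plan_next_step(gathering_rate, outlinks, inlinks,
                   panic_threshold=0.1, size_threshold=1000):
    """Illustrative sketch of the panic behaviour of claims 5-8."""
    if gathering_rate > panic_threshold:
        return outlinks                  # normal, focussed step
    expanded = outlinks + inlinks        # panic: widen the scope to inlinks
    if len(expanded) <= size_threshold:  # claim 8's size estimate check
        return expanded
    return outlinks                      # expansion too large; stay focussed
```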
9. A computer system for focussed searching of the World Wide Web, comprising:
a computer including at least one worker module for undertaking work, the work including adding outlinks of a Web page to a crawl database only when the Web page is relevant to a predefined topic, the system including the crawl database, the crawl database being accessible by the computer, the crawl database being focussed only on the topic such that the system includes no comprehensive database of the World Wide Web.

10. The computer system of claim 9, further comprising:
means executable by the computer for receiving into the crawl database a seed set of Web pages, the seed set being representative of at least the topic;
means in the worker module for identifying outlink Web pages from one or more Web pages in the crawl database;
means in the worker module for evaluating one or more outlink Web pages for relevance to the topic; and
means in the worker module for causing outlinks of Web pages to be stored in the crawl database in response to the means for evaluating.
11. The computer system of claim 10, further comprising:
means for assigning a revisitation priority to a Web page, based on the means for evaluating.
12. The computer of claim 10, further comprising a watchdog module for scheduling worker thread work, wherein the watchdog module periodically determines new and old pages to consider, the new pages being selected from the outlink Web pages, the old pages being selected from pages in the crawl database, the worker module being responsive to the watchdog module for accessing the new and old pages to consider, each Web page being associated with a respective field representing the number of times the respective page has been accessed, the field of a Web page being incremented each time the Web page is considered.
13. The computer of claim 12, wherein outlinks of a Web page considered by the worker module are entered into a link table in the crawl database only when the means for evaluating determines that the Web page is relevant.
14. The computer of claim 12, wherein the worker module includes means for determining whether a gathering rate of pages being considered is below a panic threshold.
15. The computer of claim 14, wherein the worker module includes means for considering all outlinks and inlinks of at least one Web page in the crawl database when the gathering rate is at or below the panic threshold.
16. The computer of claim 14, wherein the topic defines a scope, and the scope is increased to an expanded scope when the gathering rate is at or below the threshold.
17. The computer of claim 16, wherein the worker module further comprises means for estimating a size of a prospective Web page consideration step based on the expanded scope, and permitting the prospective consideration step only when the size is at or below a size threshold.
18. A computer-implemented method for building a focussed database of Web pages, comprising the acts of:
receiving a search query from a user;
in response to the search query, accessing a crawl database containing only information pertaining to Web pages related to a predefined range of topics;
periodically determining new and old pages to consider, the new pages being selected from outlink Web pages, the old pages being selected from pages in the crawl database; and
accessing the new and old pages to consider, each Web page being associated with a respective field representing the number of times the respective page has been accessed, the field of a Web page being incremented each time the Web page is considered.

19. The method of claim 18, further comprising:
initially receiving a seed set of Web pages in the crawl database, the seed set defining the range of topics;
evaluating one or more Web pages in the database for relevance to the range of topics;
identifying outlink Web pages from one or more Web pages in the crawl database in response to the evaluating step;
evaluating one or more outlink Web pages for relevance to the range of topics; and
causing outlinks only of Web pages evaluated as being relevant to the topics to be stored in the crawl database.
20. The method of claim 18, wherein outlinks of a Web page are entered into a link table in the crawl database only when the Web page is relevant.
21. The method of claim 18, further including determining whether a gathering rate of pages being considered is below a panic threshold.
22. The method of claim 21, further comprising considering all outlinks and inlinks of at least one Web page in the crawl database when the gathering rate is at or below the panic threshold.
23. The method of claim 21, wherein the topic defines a scope, and the scope is increased to an expanded scope when the gathering rate is at or below the threshold.
24. The method of claim 23, further comprising estimating a size of a prospective Web page consideration step based on the expanded scope, and permitting the prospective consideration step only when the size is at or below a size threshold.
25. A computer program device comprising:
a computer program storage device readable by a digital processing apparatus; and
a program means on the program storage device and including instructions executable by the digital processing apparatus for performing method steps for focussed Web crawling, the method steps comprising the acts of:
receiving a search query from a user;
in response to the search query, accessing a crawl database containing only information pertaining to Web pages related to a predefined topic or topics; and
determining whether a gathering rate of pages being considered is below a panic threshold.

26. The computer program device of claim 25, wherein the acts further comprise:
initially receiving a seed set of Web pages in the crawl database, the seed set defining the topic;
identifying outlink Web pages from one or more Web pages in the crawl database;
evaluating one or more outlink Web pages for relevance to the topic; and
causing outlinks only of Web pages evaluated as being relevant to the topic to be stored in the crawl database.
27. The computer program device of claim 26, wherein the acts further comprise:
periodically determining new and old pages to consider, the new pages being selected from the outlink Web pages, the old pages being selected from pages in the crawl database; and
accessing the new and old pages to consider, each Web page being associated with a respective field representing the number of times the respective page has been accessed, the field of a Web page being incremented each time the Web page is considered.
28. The computer program device of claim 26, wherein outlinks of a Web page are entered into a link table in the crawl database only when the Web page is relevant.
29. The computer program device of claim 25, wherein the acts further comprise considering all outlinks and inlinks of at least one Web page in the crawl database when the gathering rate is at or below the panic threshold.
30. The computer program device of claim 25, wherein the topic defines a scope, and the scope is increased to an expanded scope when the gathering rate is at or below the threshold.
31. The computer program device of claim 30, wherein the acts further comprise estimating a size of a prospective Web page consideration step based on the expanded scope, and permitting the prospective consideration step only when the size is at or below a size threshold.
32. A computer-implemented method for establishing a database of Web pages based on the relevance of the pages to at least one predefined topic, comprising the acts of:
providing at least one example Web page;
determining the relevance of at least one Web page based at least partially on the example Web page; and
at least in part using the relevance of the Web page, determining whether to insert outlinks of the page into the database, wherein the act of determining whether to insert outlinks of the page into the database is undertaken using either a “soft” policy or a “hard” policy.
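One plausible reading of claim 32's two policies — and it is only a reading, since the claim itself does not define the terms — is that a “hard” policy inserts outlinks only when the page's relevance clears a threshold, while a “soft” policy inserts every outlink but carries the page's relevance along as a crawl priority, so links from less relevant pages sink in the queue. All names and thresholds below are assumptions for illustration:

```python
def outlinks_to_insert(relevance, outlinks, policy="hard", threshold=0.5):
    """Illustrative sketch of one reading of claim 32's two policies.

    hard: insert outlinks only when relevance clears the threshold.
    soft: insert every outlink, tagged with the parent page's relevance
          as a priority for later scheduling.
    """
    if policy == "hard":
        return [(link, 1.0) for link in outlinks] if relevance >= threshold else []
    # soft policy: keep everything, prioritised by the parent's relevance
    return [(link, relevance) for link in outlinks]
```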
Specification