System, method, and service for collaborative focused crawling of documents on a network
Abstract
A collaborative focused crawler crawls documents on a network, locating documents that match any of multiple focus topics. The collaborative crawler comprises a fetcher and a focus engine. The fetcher prioritizes which documents to crawl based on a set of rules, obtains documents from the network, and outputs crawled documents to the focus engine. The focus engine determines whether a fetched document is relevant to any of the multiple focus topics. The focus engine also determines whether fetched documents are disallowed. If a fetched document is disallowed, the present system may place the URL for that web document in a blacklist, a list of URLs that may not be crawled. URLs may be disallowed if they match a disallowed topic or if they fail a set of rules designed for a web space focus, for example, domain rules, IP address rules, and prefix rules.
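The disallow check described in the abstract can be illustrated with a minimal sketch. It assumes that a URL "fails" the web space focus when it violates any one of a domain rule, an IP address rule, or a prefix rule; the abstract does not say how the rules combine. All names and rule values here (`ALLOWED_DOMAINS`, `ALLOWED_NETWORKS`, `ALLOWED_PREFIXES`, `screen`) are hypothetical and are not taken from the patent, and the disallowed-topic test is left out.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical web space focus; none of these values come from the patent.
ALLOWED_DOMAINS = ("example.org",)
ALLOWED_NETWORKS = (ipaddress.ip_network("192.0.2.0/24"),)
ALLOWED_PREFIXES = ("http://", "https://")

# URLs that may not be crawled.
blacklist = set()


def passes_rules(url, resolved_ip):
    """Return True if the URL satisfies every web space rule (assumed AND)."""
    host = (urlsplit(url).hostname or "").lower()
    # Domain rule: host is an allowed domain or one of its subdomains.
    domain_ok = any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)
    # IP address rule: the resolved address lies in an allowed network.
    addr = ipaddress.ip_address(resolved_ip)
    ip_ok = any(addr in net for net in ALLOWED_NETWORKS)
    # Prefix rule: the URL starts with an allowed prefix.
    prefix_ok = url.startswith(ALLOWED_PREFIXES)
    return domain_ok and ip_ok and prefix_ok


def screen(url, resolved_ip):
    """Return True if the URL may be crawled; otherwise blacklist it."""
    if url in blacklist:
        return False
    if not passes_rules(url, resolved_ip):
        blacklist.add(url)
        return False
    return True
```

In this sketch the caller supplies the resolved IP address, so the example stays free of network access; a real crawler would obtain it from its own DNS resolution step.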
212 Citations
20 Claims
1. A method of collaborative focused crawling of documents related to multiple focus topics on a network, the method comprising:
selectively prioritizing the documents to crawl based on a set of rules;
fetching prioritized documents from the network;
for each fetched document, determining whether the fetched document is relevant to any of the multiple focus topics;
crawling the fetched document that matches any of the multiple focus topics; and
further crawling out-links on the fetched document based on an assumption that if the fetched document is of interest, the out-links are also of interest.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
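The steps of claim 1 can be sketched as a small priority-driven loop. This is only an illustration under stated assumptions: the "network" is an in-memory dictionary (`WEB`), the prioritization rule is crawl depth, and relevance is a keyword overlap with each topic. None of these choices, nor the names `WEB`, `TOPICS`, `priority`, `matching_topics`, and `focused_crawl`, come from the patent, and the collaborative aspect (several crawlers sharing work) is not modeled.

```python
import heapq

# Toy stand-in for the network: URL -> (text, out-links). Hypothetical data.
WEB = {
    "a": ("python crawler tutorial", ["b", "c"]),
    "b": ("gardening tips", ["d"]),
    "c": ("web crawler frontier design", ["d"]),
    "d": ("cooking recipes", []),
}

# Hypothetical focus topics, each described by a set of keywords.
TOPICS = {"crawling": {"crawler", "frontier"}, "cooking": {"recipes"}}


def priority(url, depth):
    """Assumed prioritization rule: shallower URLs first (lower sorts sooner)."""
    return depth


def matching_topics(text):
    """Return the names of all focus topics whose keywords occur in the text."""
    words = set(text.split())
    return [name for name, keywords in TOPICS.items() if words & keywords]


def focused_crawl(seeds):
    """Crawl from the seeds, following out-links of relevant documents only."""
    frontier = [(priority(url, 0), url, 0) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = {}
    while frontier:
        _, url, depth = heapq.heappop(frontier)  # take the highest-priority URL
        page = WEB.get(url)                      # "fetch" the document
        if page is None:
            continue
        text, out_links = page
        topics = matching_topics(text)
        if not topics:
            continue  # off-topic document: its out-links are not followed
        relevant[url] = topics
        # A relevant document's out-links are assumed to be of interest too.
        for link in out_links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (priority(link, depth + 1), link, depth + 1))
    return relevant
```

Starting from seed `"a"`, the loop keeps `"a"`, `"c"`, and `"d"`. Page `"b"` is fetched but judged off-topic, so its links are not followed; `"d"` is still reached through the relevant page `"c"`.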
14. A computer program product having instruction codes for implementing a collaborative focused crawling of documents related to multiple focus topics on a network, the computer program product comprising:
a first set of instruction codes for selectively prioritizing the documents to crawl based on a set of rules;
a second set of instruction codes for fetching prioritized documents from the network;
for each fetched document, a third set of instruction codes determines whether the fetched document is relevant to any of the multiple focus topics;
a fourth set of instruction codes for crawling the fetched document that matches any of the multiple focus topics; and
wherein the fourth set of instruction codes further crawls out-links on the fetched document based on an assumption that if the fetched document is of interest, the out-links are also of interest.
Dependent claims: 15, 16, 17, 19, 20
18. A system for implementing a collaborative focused crawling of documents related to multiple focus topics on a network, the system comprising:
an evaluator that selectively prioritizes the documents to crawl based on a set of rules;
a fetcher that fetches prioritized documents from the network;
for each fetched document, a focus engine determines whether the fetched document is relevant to any of the multiple focus topics;
a crawler for crawling the fetched document that matches any of the multiple focus topics; and
wherein the crawler further crawls out-links on the fetched document based on an assumption that if the fetched document is of interest, the out-links are also of interest.
Specification