SYSTEM AND METHOD FOR FOCUSED RE-CRAWLING OF WEB SITES
Abstract
A method (100) of crawling the Web (620) is disclosed. The method (100) crawls (120) Web pages on the Web starting from a given (110) set of seed Universal Resource Locators (URLs). Crawled Web pages are partitioned (140) into sets of relevant and irrelevant pages. A set of exclusion and/or inclusion patterns is discovered (150) from the sets of relevant and irrelevant pages, and subsequent crawling of the Web is restricted through the set of exclusion and/or inclusion patterns.
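The pattern-discovery step (150) can be illustrated with a minimal Python sketch. The text shown here does not say what form the patterns take or how they are derived, so everything in the sketch is an assumption made for illustration: patterns are taken to be URL path prefixes, and the helper names `path_prefix` and `discover_patterns` are hypothetical. A prefix seen only among relevant pages is treated as an inclusion pattern, and a prefix seen only among irrelevant pages as an exclusion pattern.

```python
from urllib.parse import urlparse


def path_prefix(url, depth=1):
    """Return scheme://host/ plus the first `depth` path segments.

    Assumption of this sketch: a "pattern" is a URL path prefix.
    """
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s][:depth]
    return f"{parts.scheme}://{parts.netloc}/" + "/".join(segments)


def discover_patterns(relevant, irrelevant, depth=1):
    """Derive (inclusion, exclusion) prefix patterns from two URL sets.

    Prefixes that occur in both sets are left unclassified, which is a
    simplification chosen for this sketch.
    """
    rel = {path_prefix(u, depth) for u in relevant}
    irr = {path_prefix(u, depth) for u in irrelevant}
    return sorted(rel - irr), sorted(irr - rel)


# Hypothetical partition (140) of already crawled pages.
relevant = [
    "http://example.com/products/a.html",
    "http://example.com/news/1.html",
]
irrelevant = [
    "http://example.com/forum/t1",
    "http://example.com/news/2.html",
]

include, exclude = discover_patterns(relevant, irrelevant)
print(include)  # ['http://example.com/products']
print(exclude)  # ['http://example.com/forum']
```

The `/news` prefix appears in both sets, so the sketch assigns it to neither pattern set; a real system would need some rule for such mixed prefixes.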
20 Claims
1. A method of crawling the Web, said method comprising:
crawling Web pages on the Web starting from a given set of seed Universal Resource Locators (URLs);
partitioning crawled Web pages into sets of relevant and irrelevant pages;
discovering from said sets of relevant and irrelevant pages a set of exclusion and inclusion patterns; and
restricting subsequent crawling of the Web through said set of exclusion and inclusion patterns.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A method of crawling a network, said method comprising:
crawling network pages on the network starting from a given set of seed Universal Resource Locators (URLs);
partitioning crawled network pages into sets of relevant and irrelevant pages;
discovering from said sets of relevant and irrelevant pages a set of exclusion and inclusion patterns; and
restricting subsequent crawling of the network through said set of exclusion and inclusion patterns.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18, 19
20. A computer program product stored on an electronic storage device useable with a computerized device tangibly embodying a program of instructions executable by said computerized device to perform a method comprising:
crawling Web pages on the Web starting from a given set of seed Universal Resource Locators (URLs);
partitioning crawled Web pages into sets of relevant and irrelevant pages;
discovering from said sets of relevant and irrelevant pages a set of exclusion and inclusion patterns; and
restricting subsequent crawling of the Web through said set of exclusion and inclusion patterns.
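The crawling and restricting steps recited in the independent claims can be sketched in Python as a breadth-first crawl over a small in-memory link graph. This is an illustration under stated assumptions, not the claimed implementation: the `LINKS` graph, the functions `allowed` and `crawl`, and the prefix-matching rule are all hypothetical, and the rule that an inclusion pattern overrides an exclusion pattern is a choice made for this sketch rather than something stated in the claims.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: URL -> outgoing links.
LINKS = {
    "http://example.com/": [
        "http://example.com/products/a",
        "http://example.com/forum/t1",
    ],
    "http://example.com/products/a": ["http://example.com/products/b"],
    "http://example.com/products/b": [],
    "http://example.com/forum/t1": ["http://example.com/forum/t2"],
    "http://example.com/forum/t2": [],
}


def allowed(url, include=(), exclude=()):
    """Decide whether a discovered URL may be crawled.

    Assumed semantics: an inclusion match always wins; otherwise the
    URL is allowed unless it matches an exclusion prefix.
    """
    if any(url.startswith(p) for p in include):
        return True
    return not any(url.startswith(p) for p in exclude)


def crawl(seeds, links, include=(), exclude=()):
    """Breadth-first crawl from `seeds`, skipping disallowed URLs."""
    seen, queue, visited = set(seeds), deque(seeds), []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen and allowed(nxt, include, exclude):
                seen.add(nxt)
                queue.append(nxt)
    return visited


seeds = ["http://example.com/"]
first_pass = crawl(seeds, LINKS)  # unrestricted: visits all 5 pages
second_pass = crawl(seeds, LINKS, exclude=["http://example.com/forum"])
print(second_pass)
# ['http://example.com/', 'http://example.com/products/a',
#  'http://example.com/products/b']
```

In this toy graph the restricted second pass never enters the `/forum` subtree, so two of the five pages are not fetched again.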
Specification