SYSTEM, METHOD AND COMPUTER READABLE MEDIUM FOR WEB CRAWLING
Abstract
In a web crawler, a URL selection module selects URLs for pages to be downloaded. The URL selection module accesses an interaction data store that stores interaction data for web pages, including interaction data that indicates human interactions with the pages. To reduce the effects of link farms, the URL selection module filters the URLs to select only those URLs that have human interaction histories and provides the selected URLs to a download module for web page downloading.
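The filtering step described in the abstract can be illustrated with a minimal Python sketch. This is not the patent's implementation: the function name `select_urls`, the dictionary-based interaction store, and the example URLs are assumptions made for illustration, and "has a human interaction history" is modeled simply as an interaction count greater than zero.

```python
from typing import Dict, Iterable, List

# Hypothetical interaction data store: maps a URL to the number of
# recorded human interactions with the corresponding page.
InteractionStore = Dict[str, int]


def select_urls(candidates: Iterable[str], interactions: InteractionStore) -> List[str]:
    """Keep only candidate URLs that have a recorded human interaction history."""
    return [url for url in candidates if interactions.get(url, 0) > 0]


# Example data (made up): the link-farm page has no interaction history.
candidates = [
    "http://example.com/news",
    "http://linkfarm.example/page1",
    "http://example.org/docs",
]
interactions = {"http://example.com/news": 12, "http://example.org/docs": 3}

# Only URLs with interaction data would be handed to the download module.
selected = select_urls(candidates, interactions)
print(selected)
```

Under these assumptions, a page that no human has interacted with is never queued for download, however many links point to it.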
20 Claims
1. A method for web crawling comprising:
determining a plurality of Uniform Resource Locators (URLs);
determining a subset of the plurality of URLs that have associated interaction data;
selecting at least one URL of the subset; and
downloading a web page corresponding to the at least one selected URL.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A web crawler comprising:
at least one Uniform Resource Locator (URL) data store that stores a plurality of URLs;
at least one interaction data store that stores interaction data for a plurality of web pages, the interaction data indicating an interaction between a human and a web page corresponding to a URL;
at least one download module that downloads web page content corresponding to a URL; and
at least one URL selection module in communication with the at least one URL data store and the at least one interaction data store;
wherein the at least one URL selection module selects at least one URL from the at least one URL data store that has interaction data in the at least one interaction data store; and
wherein the at least one URL selection module provides the at least one selected URL to the at least one download module.
Dependent claims: 14, 15, 16, 17, 18
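One way to picture how the four components of claim 13 fit together is the following Python sketch. All class and method names are hypothetical. The interaction record format (a list of strings per URL) is an assumption, and the download module takes an injected `fetch` function so that the example runs without network access.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class UrlDataStore:
    """Stores the plurality of URLs known to the crawler."""
    urls: List[str] = field(default_factory=list)


@dataclass
class InteractionDataStore:
    """Maps a URL to its human interaction records (record format is assumed)."""
    records: Dict[str, List[str]] = field(default_factory=dict)

    def has_interaction(self, url: str) -> bool:
        return bool(self.records.get(url))


@dataclass
class DownloadModule:
    """Downloads page content via an injected fetch function (a stub here)."""
    fetch: Callable[[str], str]
    pages: Dict[str, str] = field(default_factory=dict)

    def download(self, url: str) -> None:
        self.pages[url] = self.fetch(url)


@dataclass
class UrlSelectionModule:
    """Selects URLs that have interaction data and passes them to the downloader."""
    url_store: UrlDataStore
    interaction_store: InteractionDataStore
    downloader: DownloadModule

    def run(self) -> List[str]:
        selected = [
            url for url in self.url_store.urls
            if self.interaction_store.has_interaction(url)
        ]
        for url in selected:
            self.downloader.download(url)
        return selected


# Usage with made-up data and a stub fetcher in place of a real HTTP client.
url_store = UrlDataStore(["http://example.com/a", "http://linkfarm.example/b"])
interaction_store = InteractionDataStore({"http://example.com/a": ["click"]})
downloader = DownloadModule(fetch=lambda url: "<html>stub for " + url + "</html>")
selector = UrlSelectionModule(url_store, interaction_store, downloader)
downloaded = selector.run()
print(downloaded)
```

Keeping the selection module separate from the download module mirrors the claim's structure: the downloader never sees a URL that lacks interaction data.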
19. A computer-readable medium comprising computer-executable instructions for execution by a processor, that, when executed, cause the processor to:
select a Uniform Resource Locator (URL) from a URL data store;
look up the selected URL in an interaction data store to determine if interaction data exists for the selected URL in the interaction data store; and
if interaction data exists for the selected URL, provide the selected URL to a download module.
Dependent claims: 20
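The per-URL sequence of claim 19 (select, look up, conditionally forward) might be sketched as a single step function. This is an illustrative sketch only: the name `process_next_url`, the use of a `deque` as the URL data store, and a second `deque` standing in for the download module's input are all assumptions.

```python
from collections import deque
from typing import Deque, Dict, Optional


def process_next_url(
    url_store: Deque[str],
    interaction_store: Dict[str, dict],
    download_queue: Deque[str],
) -> Optional[str]:
    """Examine one URL; forward it for download only if interaction data exists.

    Returns the URL that was examined, or None if the URL store is empty.
    """
    if not url_store:
        return None
    url = url_store.popleft()          # select a URL from the URL data store
    if url in interaction_store:       # look up the URL in the interaction data store
        download_queue.append(url)     # provide the URL to the download module
    return url


# Usage with made-up data: drain the URL store one URL at a time.
url_store = deque(["http://example.com/a", "http://linkfarm.example/b"])
interaction_store = {"http://example.com/a": {"clicks": 4}}
download_queue: Deque[str] = deque()

while process_next_url(url_store, interaction_store, download_queue) is not None:
    pass

print(list(download_queue))
```

URLs without interaction data are simply dropped in this sketch; the claim does not say what happens to them, so a real crawler could equally defer or log them.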
Specification