System, method and computer readable medium for web crawling
Abstract
In a web crawler, a URL selection module selects URLs for pages to be downloaded. The URL selection module accesses an interaction data store that stores interaction data for web pages, including interaction data that indicates human interactions with the pages. To reduce the effects of link farms, the URL selection module filters the URLs to select only those URLs that have human interaction histories and provides the selected URLs to a download module for web page downloading.
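The filtering idea in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `filter_by_interaction`, the dictionaries standing in for the URL and interaction data stores, and the rule that any nonzero interaction count on a linking page is enough are all assumptions made for the example.

```python
from typing import Dict, List, Set


def filter_by_interaction(
    candidates: List[str],
    inbound_links: Dict[str, Set[str]],
    interaction_counts: Dict[str, int],
) -> List[str]:
    """Keep candidates linked from at least one page with recorded human interaction.

    Assumption: a candidate "has an interaction history" when any page that
    links to it has a nonzero interaction count.
    """
    selected = []
    for url in candidates:
        sources = inbound_links.get(url, set())
        if any(interaction_counts.get(src, 0) > 0 for src in sources):
            selected.append(url)
    return selected


# Hypothetical data: one candidate is linked from a page people actually use,
# the other only from a link-farm page with no recorded interactions.
candidates = ["http://example.com/article", "http://example.com/spam"]
inbound_links = {
    "http://example.com/article": {"http://example.com/home"},
    "http://example.com/spam": {"http://farm.example/links"},
}
interaction_counts = {"http://example.com/home": 12, "http://farm.example/links": 0}

selected = filter_by_interaction(candidates, inbound_links, interaction_counts)
```

With this sample data only the article URL survives the filter, because the page linking to the other candidate has no interaction history.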
20 Claims
1. A method of selecting a uniform resource locator (URL) in a web crawling procedure, the method comprising:
querying a URL data store for URLs corresponding to webpages which have not been previously downloaded by a web crawler computing device during a web crawling procedure;
retrieving from the URL data store, in response to the querying, a plurality of URLs corresponding to webpages which have not been previously downloaded by the web crawler computing device during the web crawling procedure;
for each particular URL in the plurality of retrieved URLs:
(a) identifying one or more additional URLs within the URL data store corresponding to webpages that have been downloaded during the web crawling procedure, and that contain links to the particular URL; and
(b) retrieving, from an interaction data store, user interaction event data associated with the identified one or more additional URLs that include links to the particular URL;
selecting a highest ranked URL from the plurality of retrieved URLs, based on the user interaction event data retrieved from the interaction data store associated with the plurality of retrieved URLs, wherein the selection of a particular URL as the highest ranked URL comprises performing a source element ranking and attention shift analysis for one or more web pages that include links to the particular URL, wherein performing the source element ranking and attention shift analysis comprises:
(a) detecting user interactions with content areas of each of the one or more web pages that include links to the particular URL, wherein the user interactions detected comprise one or more of clicks, hints, or lingers on the content areas of the web pages; and
(b) determining, based on the detected user interactions with the content areas, that the one or more web pages that include links to the particular URL are valid web pages;
submitting the selected URL to a download module;
downloading, by the download module, a webpage corresponding to the selected URL;
processing the downloaded webpage to extract any URLs within the downloaded webpage; and
adding the extracted URLs to the URL data store.
Dependent claims: 2, 3, 4, 5, 6, 7.
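One pass through the steps of claim 1 might look like the sketch below. It is an illustration under stated assumptions, not the claimed implementation: `UrlStore`, `crawl_step`, and the `download` callback are hypothetical names, the data stores are in-memory structures, and the score (a count of interaction events on the linking pages) is only a stand-in for the source element ranking and attention shift analysis, which the claim does not spell out.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional, Set


class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


@dataclass
class UrlStore:
    """Hypothetical in-memory stand-in for the URL data store."""
    downloaded: Set[str] = field(default_factory=set)
    pending: Set[str] = field(default_factory=set)
    links_to: Dict[str, Set[str]] = field(default_factory=dict)  # target -> source pages

    def not_downloaded(self) -> List[str]:
        return sorted(self.pending - self.downloaded)

    def sources_of(self, url: str) -> Set[str]:
        # Only pages already downloaded in this crawl count as sources.
        return {s for s in self.links_to.get(url, set()) if s in self.downloaded}


def crawl_step(
    store: UrlStore,
    events: Dict[str, List[str]],
    download: Callable[[str], str],
) -> Optional[str]:
    """Select, download, and process one URL; return it, or None if nothing is pending."""
    candidates = store.not_downloaded()
    if not candidates:
        return None

    def score(url: str) -> int:
        # Assumed ranking: total interaction events on pages linking to the URL.
        return sum(len(events.get(src, [])) for src in store.sources_of(url))

    selected = max(candidates, key=score)
    html = download(selected)
    store.downloaded.add(selected)
    store.pending.discard(selected)

    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    for link in parser.links:
        store.links_to.setdefault(link, set()).add(selected)
        if link not in store.downloaded:
            store.pending.add(link)
    return selected


# Hypothetical usage with a fake downloader backed by a dictionary.
store = UrlStore(
    downloaded={"http://a.example/"},
    pending={"http://a.example/x", "http://b.example/y"},
    links_to={
        "http://a.example/x": {"http://a.example/"},
        "http://b.example/y": {"http://farm.example/"},
    },
)
events = {"http://a.example/": ["click", "linger"]}
pages = {"http://a.example/x": '<a href="http://a.example/z">z</a>'}
chosen = crawl_step(store, events, lambda url: pages.get(url, ""))
```

Passing the downloader in as a callback keeps the selection logic testable without network access; a real download module would replace the lambda.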
8. An apparatus configured to select a uniform resource locator (URL) in a web crawling procedure, the apparatus comprising:
one or more processors; and
memory storing computer-readable instructions that, when executed by the one or more processors, cause the apparatus to:
query a URL data store for URLs corresponding to webpages which have not been previously downloaded during a web crawling procedure;
retrieve from the URL data store, in response to the querying, a plurality of URLs corresponding to webpages which have not been previously downloaded during the web crawling procedure;
for each particular URL in the plurality of retrieved URLs:
(a) identify one or more additional URLs within the URL data store corresponding to webpages that have been downloaded during the web crawling procedure, and that contain links to the particular URL; and
(b) retrieve, from an interaction data store, user interaction event data associated with the identified one or more additional URLs that include links to the particular URL;
select a highest ranked URL from the plurality of retrieved URLs based on the user interaction event data retrieved from the interaction data store associated with the plurality of retrieved URLs, wherein the selection of a particular URL as the highest ranked URL comprises performing a source element ranking and attention shift analysis for one or more web pages that include links to the particular URL, wherein performing the source element ranking and attention shift analysis comprises:
(a) detecting user interactions with content areas of each of the one or more web pages that include links to the particular URL, wherein the user interactions detected comprise one or more of clicks, hints, or lingers on the content areas of the web pages; and
(b) determining, based on the detected user interactions with the content areas, that the one or more web pages that include links to the particular URL are valid web pages;
submit the selected URL to a download module;
receive the selected URL;
download a webpage corresponding to the selected URL;
process the downloaded webpage to extract URLs within the downloaded webpage; and
add the extracted URLs to the URL data store.
Dependent claims: 9, 10, 11, 12, 13, 14.
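The validity determination in steps (a) and (b) of the analysis could be sketched as below. The claims name the interaction types (clicks, hints, lingers) but give no decision rule, so the `(interaction_type, content_area_id)` event shape, the function names, and the simple threshold are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple

# Interaction types named in the claims.
RECOGNIZED_TYPES = {"click", "hint", "linger"}


def interactions_by_area(events: Iterable[Tuple[str, str]]) -> Counter:
    """Count recognized interactions per content area.

    Each event is a hypothetical (interaction_type, content_area_id) pair.
    """
    return Counter(area for kind, area in events if kind in RECOGNIZED_TYPES)


def is_valid_page(events: Iterable[Tuple[str, str]], min_interactions: int = 1) -> bool:
    """Assumed rule: a page is valid once recognized interactions reach a threshold."""
    return sum(interactions_by_area(events).values()) >= min_interactions


# Hypothetical event log for one linking page; "scroll" is not a recognized type.
events = [
    ("click", "nav"),
    ("linger", "article"),
    ("scroll", "article"),
    ("hint", "article"),
]
per_area = interactions_by_area(events)
valid = is_valid_page(events)
```

Grouping by content area keeps the per-area counts available in case a ranking step wants to weight some areas more heavily than others.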
15. A non-transitory computer readable storage medium configured to store instructions that when executed cause a processor to perform selecting a uniform resource locator (URL) in a web crawling procedure, the processor being further configured to perform:
querying a URL data store for URLs corresponding to webpages which have not been previously downloaded by a web crawler computing device during a web crawling procedure;
retrieving from the URL data store, in response to the querying, a plurality of URLs corresponding to webpages which have not been previously downloaded by the web crawler computing device during the web crawling procedure;
for each particular URL in the plurality of retrieved URLs:
(a) identifying one or more additional URLs within the URL data store corresponding to webpages that have been downloaded during the web crawling procedure, and that contain links to the particular URL; and
(b) retrieving, from an interaction data store, user interaction event data associated with the identified one or more additional URLs that include links to the particular URL;
selecting a highest ranked URL from the plurality of retrieved URLs based on the user interaction event data records retrieved from the interaction data store associated with the plurality of retrieved URLs, wherein the selection of a particular URL as the highest ranked URL comprises performing a source element ranking and attention shift analysis for one or more web pages that include links to the particular URL, wherein performing the source element ranking and attention shift analysis comprises:
(a) detecting user interactions with content areas of each of the one or more web pages that include links to the particular URL, wherein the user interactions detected comprise one or more of clicks, hints, or lingers on the content areas of the web pages; and
(b) determining, based on the detected user interactions with the content areas, that the one or more web pages that include links to the particular URL are valid web pages;
submitting the selected URL to a download module;
downloading, by the download module, a webpage corresponding to the selected URL;
processing the downloaded webpage to extract any URLs within the downloaded webpage; and
adding the extracted URLs to the URL data store.
Dependent claims: 16, 17, 18, 19, 20.
Specification