
System, method and computer readable medium for web crawling

  • US 9,940,391 B2
  • Filed: 11/02/2011
  • Issued: 04/10/2018
  • Est. Priority Date: 05/05/2009
  • Status: Active Grant
First Claim

1. A method of selecting a uniform resource locator (URL) in a web crawling procedure, the method comprising:

    querying a URL data store for URLs corresponding to webpages which have not been previously downloaded by a web crawler computing device during a web crawling procedure;

    retrieving from the URL data store, in response to the querying, a plurality of URLs corresponding to webpages which have not been previously downloaded by the web crawler computing device during the web crawling procedure;

    for each particular URL in the plurality of retrieved URLs:

    (a) identifying one or more additional URLs within the URL data store corresponding to webpages that have been downloaded during the web crawling procedure, and that contain links to the particular URL; and

    (b) retrieving, from an interaction data store, user interaction event data associated with the identified one or more additional URLs that include links to the particular URL;

    selecting a highest ranked URL from the plurality of retrieved URLs, based on the user interaction event data retrieved from the interaction data store associated with the plurality of retrieved URLs, wherein the selection of a particular URL as the highest ranked URL comprises performing a source element ranking and attention shift analysis for one or more web pages that include links to the particular URL, wherein performing the source element ranking and attention shift analysis comprises:

    (a) detecting user interactions with content areas of each of the one or more web pages that include links to the particular URL, wherein the user interactions detected comprise one or more of clicks, hints, or lingers on the content areas of the web pages; and

    (b) determining, based on the detected user interactions with the content areas, that the one or more web pages that include links to the particular URL are valid web pages;

    submitting the selected URL to a download module;

    downloading, by the download module, a webpage corresponding to the selected URL;

    processing the downloaded webpage to extract any URLs within the downloaded webpage; and

    adding the extracted URLs to the URL data store.
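
The steps recited above can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not define how the "source element ranking and attention shift analysis" is computed, so the sketch substitutes a plain sum of click, hint, and linger counts recorded on already-downloaded pages that link to each candidate URL. The names `UrlStore`, `score`, `select_next_url`, `LinkExtractor`, and `crawl_step`, the shape of the `interactions` dictionary, and the `download` callable are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class UrlStore:
    """Hypothetical in-memory stand-in for the URL data store."""
    downloaded: set = field(default_factory=set)   # URLs already fetched
    links: dict = field(default_factory=dict)      # source URL -> set of linked URLs
    known: set = field(default_factory=set)        # every URL seen so far

    def undownloaded(self):
        # Query step: URLs whose pages have not been downloaded yet.
        return sorted(self.known - self.downloaded)

    def downloaded_sources_of(self, url):
        # Step (a): downloaded pages that contain a link to `url`.
        return sorted(s for s, targets in self.links.items()
                      if s in self.downloaded and url in targets)


def score(url, store, interactions):
    """Assumed scoring: total interactions on the pages linking to `url`.

    A source page with no recorded interactions contributes nothing,
    which loosely mirrors treating it as not validated.
    """
    total = 0
    for src in store.downloaded_sources_of(url):
        events = interactions.get(src, {})  # step (b): look up interaction data
        total += sum(events.get(k, 0) for k in ("clicks", "hints", "lingers"))
    return total


def select_next_url(store, interactions):
    # Selection step: pick the highest-scoring undownloaded URL, if any.
    candidates = store.undownloaded()
    if not candidates:
        return None
    return max(candidates, key=lambda u: score(u, store, interactions))


class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)


def crawl_step(store, interactions, download):
    """Select, download, extract links, and add them to the store."""
    url = select_next_url(store, interactions)
    if url is None:
        return None
    html = download(url)            # `download` is a caller-supplied callable
    store.downloaded.add(url)
    parser = LinkExtractor()
    parser.feed(html)
    store.links.setdefault(url, set()).update(parser.urls)
    store.known.update(parser.urls)
    return url
```

Passing `download` in as a callable keeps the sketch free of network access; a real crawler would plug in an HTTP client there and would likely resolve relative links before storing them.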
