Unsupervised, automated web host dynamicity detection, dead link detection and prerequisite page discovery for search indexed web pages
Abstract
Automated crawling of page links associated with a site domain that was previously crawled involves computing the dynamicity of a site based on totals of continuous dead links, live links and/or prerequisite pages encountered while crawling page links corresponding to the site. The degree to which links are crawled is optimized based on the dynamicity of the site. Some pages require that another particular page (i.e., a prerequisite page) is retrieved from the host prior to retrieving a given page, e.g., so that the prerequisite page can set a cookie. Prerequisite pages are determined based on stored information about pages that were retrieved, during a previous crawl, prior to retrieving a page. Prerequisite pages are identified to a search system so that when a user clicks on the URL for the page, the request is redirected to the prerequisite page to set the cookie appropriately.
26 Claims
1. A computer-implemented method comprising:
retrieving, from a site host, pages associated with the site, wherein the pages contain content;
determining how dynamic the content of the site is, based on the degree to which the content of the retrieved pages changed since a previous crawl of the site;
if the content of the site is determined dynamic, in relation to a corresponding threshold, then continuing retrieving, from the site host, pages associated with the site; and
if the content of the site is determined not dynamic, in relation to the corresponding threshold, then not retrieving, from the site host, a subset of pages associated with the domain.
Dependent claims: 2-19
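The decision described in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not say how the "degree" of change is measured, so the sketch assumes it is the fraction of re-retrieved pages whose content hash differs from the previous crawl. The function names and the 0.2 default threshold are hypothetical.

```python
import hashlib


def content_fingerprint(content: str) -> str:
    """Hash page content so two crawls can be compared cheaply."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def site_dynamicity(previous: dict[str, str], current: dict[str, str]) -> float:
    """Fraction of re-retrieved pages whose content changed since the previous crawl.

    Both arguments map URL -> content fingerprint. Pages that were not seen
    in the previous crawl are ignored (an assumption of this sketch).
    """
    shared = [url for url in current if url in previous]
    if not shared:
        return 0.0
    changed = sum(1 for url in shared if current[url] != previous[url])
    return changed / len(shared)


def should_continue_crawl(dynamicity: float, threshold: float = 0.2) -> bool:
    """Keep retrieving pages only when the site meets the (hypothetical) threshold."""
    return dynamicity >= threshold


# Usage: one of four sampled pages changed, so dynamicity is 0.25.
previous_crawl = {u: content_fingerprint(c) for u, c in
                  {"/a": "x", "/b": "y", "/c": "z", "/d": "w"}.items()}
current_sample = {u: content_fingerprint(c) for u, c in
                  {"/a": "x", "/b": "y (edited)", "/c": "z", "/d": "w"}.items()}
score = site_dynamicity(previous_crawl, current_sample)
keep_crawling = should_continue_crawl(score)
```

With the default threshold the sample site counts as dynamic and the crawl continues; a site whose sampled pages are unchanged would fall below the threshold, and the remaining subset of its pages would be skipped.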
20. A computer-implemented method for determining a prerequisite page for a page while checking, in an automated manner, page links associated with a site domain that was previously crawled, the method comprising:
retrieving, from crawler storage, information that indicates whether or not retrieving the page requires setting request information by retrieving a prerequisite page, wherein the information identifies one or more ancestor pages that were retrieved prior to retrieving the page during a previous crawl and the order in which the one or more ancestor pages were retrieved;
if the page requires setting request information by retrieving a prerequisite page, then (a) retrieving a first ancestor page, wherein the first ancestor page is the ancestor page that was retrieved immediately prior to retrieving the page during the previous crawl;
(b) after performing step (a), retrieving the page;
(c) determining whether the page is dead or alive;
(d) if the page is alive, then determining that the first ancestor page is a prerequisite page for the page; and
(e) if the page is dead, then retrieving a second ancestor page, wherein the second ancestor page is the ancestor page that was retrieved immediately prior to retrieving the first ancestor page during the previous crawl, and repeating steps (b), (c) and (d) for the second ancestor page and, if necessary, repeat step (e) for the next ancestor page in the order.
Dependent claims: 21-25
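Steps (a) through (e) of claim 20 amount to walking the stored ancestor list backward until the page comes alive. The Python sketch below shows one way that loop might look. It assumes the crawler has already decided that the page requires request information, and it takes the fetch and liveness checks as caller-supplied callables. `find_prerequisite`, `FakeHost`, and the example URLs are hypothetical names used only for illustration.

```python
from typing import Callable, Optional, Sequence


def find_prerequisite(
    page: str,
    ancestors: Sequence[str],
    fetch: Callable[[str], None],
    is_alive: Callable[[str], bool],
) -> Optional[str]:
    """Return the nearest ancestor whose retrieval makes `page` alive, else None.

    `ancestors` is in the order retrieved during the previous crawl
    (oldest first), so the most recent ancestor is tried first.
    """
    for ancestor in reversed(ancestors):
        fetch(ancestor)          # steps (a) and (e): retrieve the ancestor
        if is_alive(page):       # steps (b) and (c): retrieve the page, check it
            return ancestor      # step (d): this ancestor is the prerequisite
    return None


class FakeHost:
    """Stand-in for a site where only "/login" sets the required cookie."""

    def __init__(self) -> None:
        self.cookie_set = False
        self.requests: list[str] = []

    def fetch(self, url: str) -> None:
        self.requests.append(url)
        if url == "/login":
            self.cookie_set = True

    def is_alive(self, url: str) -> bool:
        self.requests.append(url)
        return self.cookie_set


host = FakeHost()
prerequisite = find_prerequisite(
    "/results", ["/home", "/login", "/search"], host.fetch, host.is_alive
)
```

In this simulated run the crawler first tries "/search", finds "/results" still dead, then tries "/login", which sets the cookie, so "/login" is reported as the prerequisite page. A real crawler would issue HTTP requests and would likely apply the dead link test of claim 26 inside `is_alive`.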
26. A computer software system for checking the validity of links to pages of content for indexing for subsequent search, the system comprising:
a dead link detector for checking the validity of links by calculating, based at least in part on stored first information about the content of each page that corresponds to a link, the amount of difference between a current version of the page of content and a version of the page that was previously crawled and that corresponds to the stored information;
a prerequisite page detector for determining, based at least in part on stored second information about pages that were retrieved prior to retrieving respective pages during a previous crawl, which particular one or more pages, if any, are required to be retrieved in order to retrieve a respective page;
a site domain dynamicity detector for determining, based at least in part on running totals of respective consecutive dead links, live links and prerequisite pages encountered while checking links associated with a previous crawl of a site domain, how dynamic the content of the site domain is; and
wherein the degree to which the validity of links associated with the site domain are checked by the dead link detector is based at least in part on how dynamic the content of the site domain is.
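The components of claim 26 can be sketched loosely in Python. The claim does not fix a difference metric or a rule for turning running totals into a dynamicity decision, so everything here is an assumption: the difference is taken as one minus the `difflib.SequenceMatcher` similarity ratio, a long run of consecutive dead links (or several prerequisite pages) is read as a sign of a dynamic site, and a long run of live links as a sign of a static one. The class name, limits, and check fractions are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


def page_difference(stored: str, current: str) -> float:
    """0.0 for identical text, 1.0 for text with nothing in common."""
    return 1.0 - SequenceMatcher(None, stored, current).ratio()


@dataclass
class DynamicityTracker:
    """Running totals kept while re-checking links from a previous crawl."""

    consecutive_dead: int = 0
    consecutive_live: int = 0
    prerequisite_pages: int = 0

    def record(self, alive: bool, needed_prerequisite: bool = False) -> None:
        # A live link ends the current run of dead links, and vice versa.
        if alive:
            self.consecutive_live += 1
            self.consecutive_dead = 0
        else:
            self.consecutive_dead += 1
            self.consecutive_live = 0
        if needed_prerequisite:
            self.prerequisite_pages += 1

    def is_dynamic(self, dead_limit: int = 3, prereq_limit: int = 2) -> bool:
        # Hypothetical rule: either signal alone marks the site as dynamic.
        return (self.consecutive_dead >= dead_limit
                or self.prerequisite_pages >= prereq_limit)

    def check_fraction(self, live_limit: int = 5) -> float:
        """Share of remaining links to check (values are illustrative only)."""
        if self.is_dynamic():
            return 1.0
        if self.consecutive_live >= live_limit:
            return 0.1
        return 0.5


tracker = DynamicityTracker()
for _ in range(5):
    tracker.record(alive=True)
static_fraction = tracker.check_fraction()

for _ in range(3):
    tracker.record(alive=False)
dynamic_fraction = tracker.check_fraction()
```

After five consecutive live links the tracker recommends checking only a small share of the remaining links; three consecutive dead links flip the site to dynamic and restore full checking, which mirrors the final "wherein" clause of the claim in spirit only.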
Specification