Unsupervised, automated web host dynamicity detection, dead link detection and prerequisite page discovery for search indexed web pages

US 20060294052A1
Filed: 08/13/2005
Published: 12/28/2006
Est. Priority Date: 06/28/2005
Status: Active Grant


Abstract

Automated crawling of page links associated with a site domain that was previously crawled involves computing the dynamicity of a site based on totals of continuous dead links, live links and/or prerequisite pages encountered while crawling page links corresponding to the site. The degree to which links are crawled is optimized based on the dynamicity of the site. Some pages require that another particular page (i.e., a prerequisite page) is retrieved from the host prior to retrieving a given page, e.g., so that the prerequisite page can set a cookie. Prerequisite pages are determined based on stored information about pages that were retrieved, during a previous crawl, prior to retrieving a page. Prerequisite pages are identified to a search system so that when a user clicks on the URL for the page, the request is redirected to the prerequisite page to set the cookie appropriately.
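
The run-based skipping described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `check_links`, the `is_alive` callback, and the `run_threshold` parameter are all hypothetical, and the sketch assumes that once a run of identical statuses reaches the threshold, every remaining link is labeled with that status without being fetched.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def check_links(
    links: Iterable[str],
    is_alive: Callable[[str], bool],
    run_threshold: int = 3,
) -> Tuple[List[Tuple[str, bool, bool]], int]:
    """Check links in order and stop fetching once the site looks static.

    Returns (results, fetch_count), where each result is
    (url, alive, was_fetched). `run_threshold` is an assumed tuning knob:
    the number of consecutive links with the same status after which the
    site is treated as not dynamic.
    """
    results: List[Tuple[str, bool, bool]] = []
    fetch_count = 0
    run_status: Optional[bool] = None  # status shared by the current run
    run_length = 0                     # consecutive links with that status

    for url in links:
        if run_status is not None and run_length >= run_threshold:
            # Site judged not dynamic: infer the status, do not fetch.
            results.append((url, run_status, False))
            continue

        alive = is_alive(url)  # actually retrieves and checks the page
        fetch_count += 1
        if alive == run_status:
            run_length += 1
        else:
            run_status, run_length = alive, 1
        results.append((url, alive, True))

    return results, fetch_count
```

With five links where only the first is dead and a threshold of 3, the first four links are fetched and the fifth is labeled alive without a request. A production crawler would presumably also bound how many links may be skipped before sampling again; that policy is not specified here.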

26 Claims

1. A computer-implemented method comprising:
- retrieving, from a site host, pages associated with the site, wherein the pages contain content;
  
  determining how dynamic the content of the site is, based on the degree to which the content of the retrieved pages changed since a previous crawl of the site;
  
  if the content of the site is determined dynamic, in relation to a corresponding threshold, then continuing retrieving, from the site host, pages associated with the site; and
  
  if the content of the site is determined not dynamic, in relation to the corresponding threshold, then not retrieving, from the site host, a subset of pages associated with the domain.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
- - 2. The method of claim 1, wherein determining how dynamic the content of the site is comprises maintaining a count of the number of continuous dead links encountered;
    - and wherein not retrieving a subset of pages comprises indicating that each link corresponding to a page from the subset of pages is a dead link, without actually retrieving the page from the site host.
  - 3. The method of claim 1, wherein determining how dynamic the content of the site is comprises maintaining a count of the number of continuous alive links encountered;
    - and wherein not retrieving a subset of pages comprises indicating that each link corresponding to a page from the subset of pages is an alive link, without actually retrieving the page from the site host.
  - 4. The method of claim 1, wherein determining how dynamic the content of the site is comprises maintaining a count of the number of continuous pages for which a same particular prerequisite page must be retrieved to retrieve the continuous pages;
    - and wherein not retrieving a subset of pages comprises indicating that each page from the subset of pages requires the particular prerequisite page to be retrieved to retrieve the page, without actually retrieving the page from the site host.
  - 5. The method of claim 1, further comprising:
    - determining whether a link to a page is dead or alive by retrieving, from crawler storage, indexable words that were found in the page corresponding to the link during a previous crawl of the page;
      
      retrieving the current version of the page;
      
      determining how many of the indexable words match words in the current version of the page; and
      
      if a ratio of matched words in the current version of the page over the number of indexable words exceeds a certain value, then mark the link corresponding to the page as an alive link, else mark the link corresponding to the page as a dead link.
  - 6. The method of claim 5, wherein retrieving the current version of the page includes:
    - retrieving, from crawler storage, information that indicates whether retrieving the page requires use of an HTTP POST method;
      
      if the HTTP POST method is required, then retrieving, from crawler storage, post data corresponding to the page which was used to retrieve the page during a previous crawl; and
      
      retrieving the page by transmitting the post data via an HTTP POST request.
  - 7. The method of claim 1, further comprising:
    - in response to determining that a link to a page is dead, retrieving, from crawler storage, information that indicates whether or not retrieving the page requires setting request information by retrieving a prerequisite page, wherein the information identifies one or more ancestor pages that were retrieved prior to retrieving the page during a previous crawl and the order in which the one or more ancestor pages were retrieved;
      
      if the page requires setting request information by retrieving a prerequisite page, then (a) retrieving a first ancestor page, wherein the first ancestor page is the ancestor page that was retrieved immediately prior to retrieving the page during the previous crawl;
      
      (b) after performing step (a), retrieving the page;
      
      (c) determining whether the page is dead or alive;
      
      (d) if the page is alive, then determining that the first ancestor page is a prerequisite page for the page; and
      
      (e) if the page is dead, then retrieving a second ancestor page, wherein the second ancestor page is the ancestor page that was retrieved immediately prior to retrieving the first ancestor page during the previous crawl, and repeating steps (b), (c) and (d) for the second ancestor page and, if necessary, repeat step (e) for the next ancestor page in the order.
  - 8. The method of claim 7, further comprising:
    - storing, in a search system, an identifier of the prerequisite page for the page; and
      
      in response to a user request for the page, wherein the request is made by using a link to the page from search results from the search system, directing a request to the prerequisite page based on the stored identifier of the prerequisite page; and
      
      then directing a request to the page.
  - 9. The method of claim 7, further comprising:
    - determining whether a link to a page is dead or alive by retrieving, from crawler storage, indexable words that were found in the page corresponding to the link during a previous crawl of the page;
      
      retrieving the current version of the page;
      
      determining how many of the indexable words match words in the current version of the page; and
      
      if a ratio of matched words in the current version of the page over the number of indexable words exceeds a certain value, then mark the link corresponding to the page as an alive link, else mark the link corresponding to the page as a dead link.
  - 10. The method of claim 1, wherein the step of retrieving pages comprises concurrently retrieving pages associated with multiple site domains, and wherein retrieving pages for each site domain is performed by a single respective processing thread that does not retrieve pages for any other site domain.
  - 11. A machine-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 1.
  - 12. A machine-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 2.
  - 13. A machine-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 3.
  - 14. A machine-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 4.
  - 15. A machine-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 5.
  - 16. A machine-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 6.
  - 17. A machine-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 7.
  - 18. A machine-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 8.
  - 19. A machine-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 9.
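
The word-matching test of claims 5 and 9 can be illustrated with a short sketch. The tokenizer, the function names, and the default `min_ratio` of 0.5 are assumptions made for illustration; the claims only say that the ratio must exceed "a certain value". Treating a page with no stored words as dead is likewise an assumption.

```python
import re
from typing import Iterable, Set


def indexable_words(text: str) -> Set[str]:
    """Hypothetical tokenizer: lowercase alphanumeric runs."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_link_alive(
    stored_words: Iterable[str],
    current_text: str,
    min_ratio: float = 0.5,
) -> bool:
    """Compare words stored from a previous crawl with the current page.

    The link is marked alive only if the share of stored words that still
    appear in the current page strictly exceeds `min_ratio`.
    """
    stored = {w.lower() for w in stored_words}
    if not stored:
        return False  # assumption: nothing to compare against means dead
    matched = len(stored & indexable_words(current_text))
    return matched / len(stored) > min_ratio
```

For example, if the stored words are `cheap`, `flights`, `paris` and `book`, a current page reading "Book cheap flights to Paris today" matches all four and is marked alive, while a "Page not found" body matches none and is marked dead.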

20. A computer-implemented method for determining a prerequisite page for a page while checking, in an automated manner, page links associated with a site domain that was previously crawled, the method comprising:
- retrieving, from crawler storage, information that indicates whether or not retrieving the page requires setting request information by retrieving a prerequisite page, wherein the information identifies one or more ancestor pages that were retrieved prior to retrieving the page during a previous crawl and the order in which the one or more ancestor pages were retrieved;
  
  if the page requires setting request information by retrieving a prerequisite page, then (a) retrieving a first ancestor page, wherein the first ancestor page is the ancestor page that was retrieved immediately prior to retrieving the page during the previous crawl;
  
  (b) after performing step (a), retrieving the page;
  
  (c) determining whether the page is dead or alive;
  
  (d) if the page is alive, then determining that the first ancestor page is a prerequisite page for the page; and
  
  (e) if the page is dead, then retrieving a second ancestor page, wherein the second ancestor page is the ancestor page that was retrieved immediately prior to retrieving the first ancestor page during the previous crawl, and repeating steps (b), (c) and (d) for the second ancestor page and, if necessary, repeat step (e) for the next ancestor page in the order.
- View Dependent Claims (21, 22, 23, 24, 25)
- - 21. The method of claim 20, further comprising:
    - storing, in a search system, an identifier of the prerequisite page for the page; and
      
      in response to a user request for the page, wherein the request is made by using a link to the page from search results from the search system, directing a request to the prerequisite page based on the stored identifier of the prerequisite page.
  - 22. The method of claim 21, further comprising:
    - after directing the request to the prerequisite page, directing a request to the page.
  - 23. A machine-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 20.
  - 24. A machine-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 21.
  - 25. A machine-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 22.
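
Steps (a) through (e) of claim 20 amount to walking back through the stored ancestor pages until the target page loads. The sketch below is one hedged reading of that loop. `find_prerequisite`, `new_session`, and `is_alive` are hypothetical names; `new_session` is assumed to return a fetch function with fresh request state (for example, an empty cookie jar), which is a design choice of this sketch and not something the claim specifies.

```python
from typing import Callable, Optional, Sequence

Fetch = Callable[[str], str]  # fetches a URL and returns the page body


def find_prerequisite(
    page_url: str,
    ancestors: Sequence[str],
    new_session: Callable[[], Fetch],
    is_alive: Callable[[str], bool],
) -> Optional[str]:
    """Return the first ancestor whose retrieval makes `page_url` alive.

    `ancestors` must be ordered as in the previous crawl, starting with
    the page retrieved immediately before `page_url`.
    """
    for ancestor in ancestors:
        fetch = new_session()     # assumption: fresh state per attempt
        fetch(ancestor)           # steps (a)/(e): let the host set request info
        body = fetch(page_url)    # step (b): retry the target page
        if is_alive(body):        # step (c): dead or alive?
            return ancestor       # step (d): this ancestor is the prerequisite
    return None                   # no stored ancestor made the page retrievable
```

A search system following claims 21 and 22 would then store the returned identifier and, on a user click, direct the request to that prerequisite page before the target page.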

26. A computer software system for checking the validity of links to pages of content for indexing for subsequent search, the system comprising:
- a dead link detector for checking the validity of links by calculating, based at least in part on stored first information about the content of each page that corresponds to a link, the amount of difference between a current version of the page of content and a version of the page that was previously crawled and that corresponds to the stored information;
  
  a prerequisite page detector for determining, based at least in part on stored second information about pages that were retrieved prior to retrieving respective pages during a previous crawl, which particular one or more pages, if any, are required to be retrieved in order to retrieve a respective page;
  
  a site domain dynamicity detector for determining, based at least in part on running totals of respective consecutive dead links, live links and prerequisite pages encountered while checking links associated with a previous crawl of a site domain, how dynamic the content of the site domain is; and
  
  wherein the degree to which the validity of links associated with the site domain are checked by the dead link detector is based at least in part on how dynamic the content of the site domain is.

Current Assignee
R2 Solutions LLC (Acacia Research Corporation)
Original Assignee
Yahoo! Inc. (Apollo Global Management, Inc.)
Inventors
Kulkarni, Parashuram; Raj, Binu; Nair, Thejas Madhavan

Granted Patent

US 7,610,267 B2
US Class Current

1/1
CPC Class Codes

G06F 16/951   Indexing; Web crawling tech...

G06F 16/958   Organisation or management ...

Y10S 707/99932   Access augmentation or opti...

Y10S 707/99933   Query processing, i.e. sear...
