System and method for enabling website owners to manage crawl rate in a website indexing system

US 7,599,920 B1
Filed: 10/12/2006
Issued: 10/06/2009
Est. Priority Date: 10/12/2006
Status: Active Grant

Abstract

Web crawlers crawl websites to access documents of the website for purposes of indexing the documents for search engines. The web crawlers crawl a specified website at a crawl rate that is based on multiple factors. One of the factors is a pre-set crawl rate limit. According to certain embodiments, an owner for a specified website is enabled to modify the crawl rate limit for the specified website when one or more pre-set criteria are met.
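The abstract says the crawl rate depends on multiple factors, one of which is the pre-set crawl rate limit, but it does not say how the factors are combined. The following is a minimal sketch that assumes the most restrictive factor wins. The names `CrawlFactors`, `server_capacity_rate`, `demand_rate`, and `effective_crawl_rate` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass
class CrawlFactors:
    """Hypothetical per-site inputs, all in requests per second."""
    crawl_rate_limit: float      # the pre-set limit named in the abstract
    server_capacity_rate: float  # assumed factor: rate the host seems able to serve
    demand_rate: float           # assumed factor: rate needed to keep the index fresh


def effective_crawl_rate(factors: CrawlFactors) -> float:
    """Combine the factors by taking the most restrictive one.

    Taking the minimum is an assumption made for illustration; the abstract
    only says the rate is "based on multiple factors".
    """
    return min(
        factors.crawl_rate_limit,
        factors.server_capacity_rate,
        factors.demand_rate,
    )


site = CrawlFactors(crawl_rate_limit=2.0, server_capacity_rate=5.0, demand_rate=0.5)
rate = effective_crawl_rate(site)  # the limit of 2.0 is not the binding factor here
```

Under this assumption, raising the limit for `site` would not speed up crawling, because another factor is already lower than the limit.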

67 Citations

36 Claims

1. A computer-implemented method of indexing documents in websites, the method comprising:
- on a server system having one or more processors and memory storing programs to be executed by the one or more processors:
  
  for each website of a multiplicity of websites, each website having a corresponding current crawl rate limit:
  
  crawling the respective website, in accordance with the current crawl rate limit corresponding to the respective website, to download documents from the respective website for inclusion in a database;
  
  storing crawl data associated with the crawling of the respective website;
  
  providing, for display, a crawl rate control mechanism to a respective owner of the respective website, including providing for display to the respective owner at least a portion of the crawl data, and enabling selection of a new crawl rate limit corresponding to the respective website by the respective owner;
  
  comparing a maximum crawl rate for the respective website over a defined period of time with the current crawl rate limit for crawling the respective website to determine if the current crawl rate limit is a limiting factor in crawling the respective website; and
  
  in response to a request to increase a current crawl rate for crawling the respective website, increasing the current crawl rate limit only when the current crawl rate limit is a limiting factor in crawling the respective website.
  - 2. The computer-implemented method of claim 1, further including crawling the respective website at a rate no greater than the current crawl rate limit.
  - 3. The computer-implemented method of claim 1, further comprising:
    - when the current crawl rate limit is not a limiting factor in crawling the respective website, informing the respective owner that a request for a faster crawl rate may not change a current crawl rate for crawling the respective website.
  - 4. The computer-implemented method of claim 1, further comprising:
    - when the current crawl rate limit is not a limiting factor in crawling the respective website, informing the respective owner that a faster crawl rate may not be selected.
  - 5. The computer-implemented method of claim 1, wherein the current crawl rate limit is a limiting factor only when a difference between the current crawl rate limit and the maximum crawl rate for the respective website over the defined period of time is less than a predefined quantity.
  - 6. The computer-implemented method of claim 1, further comprising:
    - in response to a request to decrease the current crawl rate for crawling the respective website, decreasing the current crawl rate.
  - 7. The computer-implemented method of claim 1, wherein storing crawl data further comprises determining a number of documents of the respective website that are accessed during one or more crawl sessions.
  - 8. The computer-implemented method of claim 7, wherein storing crawl data further comprises determining an average quantity of time expended to access the documents from the respective website during the one or more crawl sessions.
  - 9. The computer-implemented method of claim 1, wherein storing crawl data further comprises determining a number of bytes downloaded from the respective website during one or more crawl sessions.
  - 10. The computer-implemented method of claim 1, including providing, for display, resource usage statistics corresponding to resources of the respective website used during a plurality of prior crawl visits of the respective website.
  - 11. The method of claim 1, wherein the providing includes providing, for concurrent display:
    - the current crawl rate limit associated with the crawling of the respective website;
      
      crawl data, including statistical information associated with crawling the respective website; and
      
      an interface for enabling the respective owner to select a new crawl rate limit.
  - 12. The method of claim 11, wherein the crawl data provided for display further includes resource usage statistics corresponding to resources of the respective website used during a plurality of prior crawl sessions of the respective website.
  - 13. The method of claim 11, wherein the providing for display further includes providing for display recommendations for selecting the new crawl rate limit, wherein the recommendations are based, at least in part, on whether the current crawl rate limit is a limiting factor in crawling the respective website.
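The comparison and gating steps of claim 1, together with the threshold test of claim 5 and the owner notice of claim 3, can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the names `SiteCrawlState`, `is_limiting_factor`, `request_increase`, and `request_decrease` are hypothetical, rates are assumed to be in requests per second, and the default `margin` value is arbitrary.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class SiteCrawlState:
    current_limit: float             # current crawl rate limit for the site
    observed_rates: Sequence[float]  # crawl rates sampled over the defined period


def is_limiting_factor(state: SiteCrawlState, margin: float) -> bool:
    """Claim 5 reading: the limit is limiting only when the gap between the
    limit and the maximum observed crawl rate is less than `margin`."""
    max_rate = max(state.observed_rates, default=0.0)
    return (state.current_limit - max_rate) < margin


def request_increase(
    state: SiteCrawlState, new_limit: float, margin: float = 0.1
) -> Tuple[bool, str]:
    """Raise the limit only when it is currently a limiting factor (claim 1)."""
    if new_limit <= state.current_limit:
        return False, "Requested limit is not higher than the current limit."
    if not is_limiting_factor(state, margin):
        # Claim 3: tell the owner a faster rate may not change actual crawling.
        return False, "A faster crawl rate may not change the current crawl rate."
    state.current_limit = new_limit
    return True, "Crawl rate limit increased."


def request_decrease(state: SiteCrawlState, new_limit: float) -> bool:
    """Claim 6 reading: a decrease request is honored without the gating test."""
    if new_limit >= state.current_limit:
        return False
    state.current_limit = new_limit
    return True


busy = SiteCrawlState(current_limit=2.0, observed_rates=[1.2, 1.95, 1.8])
granted, message = request_increase(busy, new_limit=3.0)  # peak is near the limit
```

A site whose peak observed rate sits far below its limit would instead receive the notice and keep its current limit.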

14. A computer system comprising:
- memory;
  
  one or more processors; and
  
  at least one program stored in the memory and executed by the one or more processors, the at least one program including:
  
  web crawl control instructions for controlling crawling of each website of a multiplicity of websites, each website having a corresponding current crawl rate limit, the web crawl control instructions including:
  
  instructions for crawling a respective website of the multiplicity of websites, in accordance with the current crawl rate limit corresponding to the respective website, to download documents from the respective website for inclusion in a database;
  
  instructions for storing crawl data associated with the crawling of the respective website;
  
  instructions for providing, for display, a crawl rate control mechanism to a respective owner of the respective website, including providing, for display to the respective owner, at least a portion of the crawl data, and enabling selection, by the respective owner, of a new crawl rate limit corresponding to the respective website;
  
  instructions for comparing a maximum crawl rate for the respective website over a defined period of time with the current crawl rate limit for crawling the respective website to determine if the current crawl rate limit is a limiting factor in crawling the respective website; and
  
  instructions for responding to a request to increase the current crawl rate for crawling the respective website by increasing the current crawl rate limit only when the current crawl rate limit is a limiting factor in crawling the respective website.
  - 15. The computer system of claim 14, the web crawl control instructions further including instructions for crawling the respective website at a rate no greater than the current crawl rate limit.
  - 16. The computer system of claim 14, the web crawl control instructions further comprising:
    - instructions for informing the respective owner that a request for a faster crawl rate may not change a current crawl rate for crawling the respective website when the current crawl rate limit is not a limiting factor in crawling the respective website.
  - 17. The computer system of claim 14, wherein the current crawl rate limit is a limiting factor only when a difference between the current crawl rate limit and the maximum crawl rate for the respective website over the defined period of time is less than a predefined quantity.
  - 18. The computer system of claim 14, the web crawl control instructions further comprising:
    - instructions for decreasing the current crawl rate in response to a request to decrease the current crawl rate for crawling the respective website.
  - 19. The computer system of claim 14, wherein the instructions for storing crawl data further comprises instructions for determining a number of documents of the respective website that are accessed during one or more crawl sessions.
  - 20. The computer system of claim 19, wherein the instructions for storing crawl data further comprises instructions for determining an average quantity of time expended to access the documents from the respective website during the one or more crawl sessions.
  - 21. The computer system of claim 14, wherein the instructions for storing crawl data further comprises instructions for determining a number of bytes downloaded from the respective website during one or more crawl sessions.
  - 22. The computer system of claim 14, including instructions for providing, for display, resource usage statistics corresponding to resources of the respective website used during a plurality of prior crawl visits of the respective website.
  - 23. The computer system of claim 14, wherein the instructions for providing further includes instructions providing, for concurrent display:
    - the current crawl rate associated with the crawling of the respective website;
      
      crawl data, including statistical information associated with crawling the respective website; and
      
      an interface for enabling the respective owner to select a new crawl rate limit.
  - 24. The computer system of claim 23, wherein the crawl data provided for display further includes a usage display of resource usage statistics corresponding to resources of the website used during a plurality of prior crawl sessions of the website.
  - 25. The computer system of claim 23, wherein the instructions for providing further includes instructions for providing an information display having recommendations for selecting the new crawl rate limit, wherein the recommendations are based, at least in part, on whether the current crawl rate is a limiting factor in crawling the website.
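Claims 19 to 21 name three pieces of stored crawl data: the number of documents accessed, the average time spent accessing them, and the number of bytes downloaded over one or more crawl sessions. A minimal sketch of computing those figures follows. The record layout (`Fetch`, `CrawlStats`) and the function name `summarize_sessions` are assumptions for illustration; the claims do not specify a data model.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Fetch:
    """One downloaded document (hypothetical record layout)."""
    url: str
    seconds: float  # time spent accessing the document
    num_bytes: int  # size of the downloaded response


@dataclass
class CrawlStats:
    documents: int      # number of documents accessed (claim 19)
    avg_seconds: float  # average access time per document (claim 20)
    total_bytes: int    # bytes downloaded (claim 21)


def summarize_sessions(sessions: Iterable[List[Fetch]]) -> CrawlStats:
    """Aggregate per-fetch records from one or more crawl sessions."""
    fetches = [fetch for session in sessions for fetch in session]
    count = len(fetches)
    total_seconds = sum(fetch.seconds for fetch in fetches)
    return CrawlStats(
        documents=count,
        avg_seconds=total_seconds / count if count else 0.0,
        total_bytes=sum(fetch.num_bytes for fetch in fetches),
    )


sessions = [
    [Fetch("https://example.com/a", 0.5, 1000), Fetch("https://example.com/b", 1.5, 3000)],
    [Fetch("https://example.com/c", 1.0, 2000)],
]
stats = summarize_sessions(sessions)
```

Keeping the raw per-fetch records and aggregating on demand lets the same data feed both the owner-facing display and the limiting-factor comparison.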

26. A computer readable storage medium storing one or more programs for execution by one or more processors of a computer system, the one or more programs comprising:
- web crawl control instructions for controlling crawling of each website of a multiplicity of websites, each website having a corresponding current crawl rate limit, the web crawl control instructions including:
  
  instructions for crawling a respective website of the multiplicity of websites, in accordance with the current crawl rate limit corresponding to the respective website, to download documents from the respective website for inclusion in a database;
  
  instructions for storing crawl data associated with the crawling of the respective website;
  
  instructions for providing, for display, a crawl rate control mechanism to a respective owner of the respective website, including providing, for display to the respective owner, at least a portion of the crawl data, and enabling selection, by the respective owner, of a new crawl rate limit corresponding to the respective website;
  
  instructions for comparing a maximum crawl rate for the respective website over a defined period of time with the current crawl rate limit for crawling the respective website to determine if the current crawl rate limit is a limiting factor in crawling the respective website; and
  
  instructions for responding to a request to increase the current crawl rate for crawling the respective website by increasing the current crawl rate limit only when the current crawl rate limit is a limiting factor in crawling the respective website.
  - 27. The computer readable storage medium of claim 26, the web crawl control instructions further including instructions for crawling the respective website at a rate no greater than the current crawl rate limit.
  - 28. The computer readable storage medium of claim 26, the web crawl control instructions further comprising:
    - instructions for informing the respective owner that a request for a faster crawl rate may not change the current crawl rate for crawling the respective website when the current crawl rate limit is not a limiting factor in crawling the respective website.
  - 29. The computer readable storage medium of claim 26, wherein the current crawl rate limit is a limiting factor only when a difference between the current crawl rate limit and the maximum crawl rate for the respective website over the defined period of time is less than a predefined quantity.
  - 30. The computer readable storage medium of claim 26, the web crawl control instructions further comprising:
    - instructions for decreasing the current crawl rate in response to a request to decrease the current crawl rate for crawling the respective website.
  - 31. The computer readable storage medium of claim 26, wherein the instructions for storing crawl data further comprises instructions for determining a number of documents of the respective website that are accessed during one or more crawl sessions.
  - 32. The computer readable storage medium of claim 26, wherein the instructions for storing crawl data further comprises instructions for determining a number of bytes downloaded from the respective website during one or more crawl sessions.
  - 33. The computer readable storage medium of claim 26, including instructions for providing, for display, resource usage statistics corresponding to resources of the respective website used during a plurality of prior crawl visits of the website.
  - 34. The computer readable storage medium of claim 26, wherein the instructions for providing further includes instructions providing, for concurrent display:
    - the current crawl rate associated with the crawling of the respective website;
      
      crawl data, including statistical information associated with crawling the respective website; and
      
      an interface for enabling the respective owner to select a new crawl rate limit.
  - 35. The computer readable storage medium of claim 34, wherein the crawl data provided for display further includes a usage display of resource usage statistics corresponding to resources of the website used during a plurality of prior crawl sessions of the website.
  - 36. The computer readable storage medium of claim 34, wherein the instructions for providing further includes instructions for providing, for display, recommendations for selecting the new crawl rate limit, wherein the recommendations are based, at least in part, on whether the current crawl rate is a limiting factor in crawling the website.
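Claims 34 to 36 describe a concurrent display that combines the current crawl rate, crawl data, an interface for selecting a new limit, and a recommendation that depends on whether the limit is a limiting factor. One way to assemble such a view model is sketched below. The function name `build_control_panel`, the dictionary keys, and the recommendation wording are all hypothetical; the claims do not prescribe any of them.

```python
from typing import Any, Dict, List


def build_control_panel(
    current_limit: float,
    crawl_data: Dict[str, Any],
    limit_is_limiting: bool,
    selectable_limits: List[float],
) -> Dict[str, Any]:
    """Bundle everything the owner sees at once into a single payload."""
    if limit_is_limiting:
        recommendation = (
            "The current limit is restricting crawling; "
            "a higher limit may increase the crawl rate."
        )
    else:
        recommendation = (
            "The current limit is not restricting crawling; "
            "a higher limit may not change the crawl rate."
        )
    return {
        "current_limit": current_limit,          # shown alongside the statistics
        "crawl_data": crawl_data,                # e.g. documents, bytes, timing
        "selectable_limits": selectable_limits,  # choices offered by the interface
        "recommendation": recommendation,        # depends on the limiting-factor test
    }


panel = build_control_panel(
    current_limit=2.0,
    crawl_data={"documents": 3, "total_bytes": 6000},
    limit_is_limiting=False,
    selectable_limits=[1.0, 2.0, 3.0],
)
```

The boolean `limit_is_limiting` is taken as an input here so the display logic stays separate from however the limiting-factor determination is made.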

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Ibel, Maximilian; Fox, Vanessa; Lai, Katherine Jane; Cardwell, Neal Douglas; Bonkenburg, Ted J.; Reali, Patrik Rene Celeste; Camp, Amanda Ann; Lilley, Jeremy J.
Primary Examiner(s): Ehichioya, Fred I
Application Number: US11/549,075
Time in Patent Office: 1,090 Days
Field of Search: 707/2, 707/3, 707/5, 707/10, 709/218, 709/223, 709/224
US Class Current: 1/1
CPC Class Codes:
G06F 16/951   Indexing; Web crawling tech...
G06F 16/9566   URL specific, e.g. using al...
G06F 16/958   Organisation or management ...
Y10S 707/99933   Query processing, i.e. sear...
