Web crawler scheduler that utilizes sitemaps from websites

US 9,002,819 B2
Filed: 04/08/2013
Issued: 04/07/2015
Est. Priority Date: 05/31/2005
Status: Active Grant

Abstract

Systems and methods for scheduling documents for crawling are disclosed. In some implementations, a method includes obtaining sitemap information for a plurality of websites and analyzing the sitemap information to identify a website, in the plurality of websites, whose sitemap information is at least potentially out of date. The method also includes updating the sitemap information for the identified website by downloading updated sitemap information for that website, and scheduling documents for crawling in accordance with the updated sitemap information.
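
The four steps named in the abstract (obtain, identify potentially stale, update, schedule) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: `SitemapRecord`, the `expected_update_interval` field, the age-based staleness test, and the injected `download` callable are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, Dict, List


@dataclass
class SitemapRecord:
    """Locally stored sitemap information for one website (hypothetical shape)."""
    site: str
    fetched_at: datetime
    expected_update_interval: timedelta  # assumed hint; the source does not define it
    urls: List[str] = field(default_factory=list)


def potentially_out_of_date(record: SitemapRecord, now: datetime) -> bool:
    # Assumed heuristic: the stored copy is older than the expected update interval.
    return now - record.fetched_at >= record.expected_update_interval


def schedule_crawl(
    records: Dict[str, SitemapRecord],
    download: Callable[[str], List[str]],
    now: datetime,
) -> List[str]:
    """Refresh potentially stale sitemaps and return the URLs scheduled for crawling."""
    scheduled: List[str] = []
    for site, record in records.items():
        if not potentially_out_of_date(record, now):
            continue  # stored sitemap information is considered fresh
        record.urls = download(site)   # update by downloading new sitemap information
        record.fetched_at = now
        scheduled.extend(record.urls)  # schedule according to the updated information
    return scheduled
```

Passing `download` as a parameter keeps the sketch free of network code; a real crawler would fetch the sitemap over HTTP at that point.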

58 Citations

33 Claims

1. A method of scheduling documents for crawling, performed on a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors, the method comprising:
- obtaining sitemap information for a plurality of websites;
  
  analyzing the sitemap information to identify a website, in the plurality of websites, having sitemap information that is at least potentially out of date;
  
  updating the sitemap information for the identified website by downloading updated sitemap information for the identified website; and
  
  scheduling documents for crawling in accordance with the updated sitemap information for the identified website;
  
  wherein the sitemap information includes one or more sitemap indexes, wherein each sitemap index in the one or more sitemap indices includes a list of URLs corresponding to documents stored at a website in the plurality of websites, and each sitemap index in the one or more sitemap indexes includes information identifying one or more of:
  
  a last modification date of a URL in the list of URLs, a change frequency of a document specified by the URL, a title of the document, an authority of the document, and a priority of the document.
- Dependent claims 2 to 31:
  - 2. The method of claim 1, wherein a respective website in the plurality of websites includes one or more documents, and wherein the one or more documents are stored using a hierarchical structure.
  - 3. The method of claim 2, wherein the hierarchical structure is a tree structure.
  - 4. The method of claim 1, wherein the sitemap information includes one or more sitemaps, wherein each sitemap in the one or more sitemaps is generated using (i) an access log, (ii) a pre-made URL list, and (iii) information obtained from a content management system associated with the corresponding website.
  - 5. The method of claim 4, wherein a sitemap in the one or more sitemaps is in a format selected from the group consisting of an XML file, a plain-text file, a comma-separated value file, and a semicolon-separated file.
  - 6. The method of claim 1, wherein each sitemap index in the one or more sitemap indexes includes site-specific information for the website in the plurality of websites.
  - 7. The method of claim 6, wherein the site-specific information includes one or more of:
    - a list of crawl intervals, a crawl rate, and information identifying geographic location of the website.
  - 8. The method of claim 1, wherein scheduling documents for crawling uses a document's change frequency.
  - 9. The method of claim 1, wherein scheduling documents for crawling is independent of a document's change frequency.
  - 10. The method of claim 1, wherein scheduling documents for crawling uses a document's priority.
  - 11. The method of claim 1, wherein scheduling documents for crawling is independent of a document's priority.
  - 12. The method of claim 1, wherein the steps recited in claim 1 are executed (i) in response to obtaining a notification of a change to the sitemap information or (ii) in accordance with a predefined schedule.
  - 13. The method of claim 12, wherein the notification of a change to the sitemap information includes one or more of:
    - information identifying a sitemap or a sitemap index, or a network location of the sitemap of the sitemap index.
  - 14. The method of claim 1, further comprising:
    - generating a sitemap for a website in the plurality of websites.
  - 15. The method of claim 14, wherein generating the sitemap for the website includes:
    - accessing an access log for the website, the access log having one or more URLs;
      
      applying one or more filters to the one or more URLs, thereby producing one or more filtered URLs; and
      
      generating the sitemap using the one or more filtered URLs.
  - 16. The method of claim 15, wherein the access log includes an error URL and a non-error URL.
  - 17. The method of claim 14, wherein generating the sitemap for the website includes:
    - performing a database scan or a directory crawl to obtain one or more URLs associated with the website;
      
      applying one or more filters to the one or more URLs, thereby producing one or more filtered URLs; and
      
      generating the sitemap using the one or more filtered URLs.
  - 18. The method of claim 1, wherein sitemap information includes a differential sitemap identifying, for a website in the plurality of websites, a difference between a first sitemap for the website and a second sitemap of the website.
  - 19. The method of claim 18, wherein the first sitemap is a previous sitemap and the second sitemap is a current sitemap.
  - 20. The method of claim 18, wherein the differential sitemap includes URLs included in the first sitemap but not included in the second sitemap.
  - 21. The method of claim 18, wherein scheduling documents for crawling in accordance with the updated sitemap information uses the differential sitemap.
  - 22. The method of claim 1, wherein the identified website includes a plurality of URLs, and scheduling documents for crawling in accordance with the updated sitemap information for the identified website includes scheduling a first URL in the plurality of URLs to be crawled during a first crawling session and scheduling a second URL in the plurality of URLs to be crawled during a second crawling session.
  - 23. The method of claim 1, wherein the identified website includes a plurality of URLs, wherein the plurality of URLs is distributed into a plurality of URL segments, and scheduling documents for crawling in accordance with the updated sitemap information for the identified website includes, at each crawling session, selecting a URL from a different URL segment for crawl.
  - 24. The method of claim 1, further comprising generating a sitemap for a website in the plurality of websites by a method comprising:
    - obtaining one or more URLs associated with the website;
      
      indexing content of a webpage at a respective URL in the one or more URLs;
      
      generating records of out-bound links included in the web page;
      
      detecting duplicate pages;
      
      creating one or more log records for the web page; and
      
      generating the sitemap for the website using the one or more log records.
  - 25. The method of claim 24, further comprising generating an anchor map for the website using anchor text included in a respective URL in the one or more URLs.
  - 26. The method of claim 25, wherein the anchor map includes records keyed by a fingerprint of an out-bound link included in the webpage.
  - 27. The method of claim 25, further comprising generating search indices for the website, using the anchor map.
  - 28. The method of claim 1, wherein the sitemap information includes one or more sitemaps, and further comprising:
    - selecting a sitemap in the one or more sitemaps for crawling in accordance with one of:
      
      a last modification date associated with the sitemap and an update rate associated with the sitemap.
  - 29. The method of claim 1, wherein scheduling documents for crawling in accordance with the updated sitemap information for the identified website includes:
    - obtaining a sitemap for the website, the sitemap having a plurality of URLs;
      
      selecting one or more candidate URLs from the plurality of URLs;
      
      assigning a score to each candidate URL in the one or more candidate URLs;
      
      applying one or more filtering criteria to each candidate URL; and
      
      scheduling for crawling filtered candidate URLs by a first crawler.
  - 30. The method of claim 29, further comprising distributing a filtered candidate URL or a non-candidate URL in the plurality of URLs, to a second crawler distinct from the first crawler.
  - 31. The method of claim 29, wherein the one or more filtering criteria include:
    - a crawling budget for the first crawler and one or more site constraints for the identified website.
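
Claims 29 to 31 describe selecting candidate URLs from a sitemap, scoring them, filtering them against a crawling budget and site constraints, and handing the survivors to a first crawler. A rough Python sketch of that flow is below. The claims do not specify a scoring rule or how constraints are expressed, so the `score` formula, the `disallowed_prefixes` parameter, and the choice to pass over-budget URLs to a second crawler are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Candidate:
    url: str
    priority: float          # e.g. a priority value taken from sitemap metadata
    change_frequency: float  # assumed unit: expected changes per day
    score: float = 0.0


def score(candidate: Candidate) -> float:
    # Hypothetical scoring rule; the claims only say that a score is assigned.
    return candidate.priority * (1.0 + candidate.change_frequency)


def schedule_from_sitemap(
    candidates: Sequence[Candidate],
    crawl_budget: int,
    disallowed_prefixes: Sequence[str],
) -> Tuple[List[str], List[str]]:
    """Return (URLs for the first crawler, URLs left for a second crawler)."""
    for c in candidates:
        c.score = score(c)
    # Site constraint (assumed form): drop URLs under disallowed path prefixes.
    allowed = [
        c for c in candidates
        if not any(c.url.startswith(p) for p in disallowed_prefixes)
    ]
    allowed.sort(key=lambda c: c.score, reverse=True)
    # Crawling budget: the first crawler takes at most `crawl_budget` URLs.
    first = [c.url for c in allowed[:crawl_budget]]
    second = [c.url for c in allowed[crawl_budget:]]
    return first, second
```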

32. A computer system comprising:
- one or more processors;
  
  a memory storing one or more programs for execution by the one or more processors, wherein the one or more programs comprising instructions for:
  
  obtaining sitemap information for a plurality of websites;
  
  analyzing the sitemap information to identify a website, in the plurality of websites, having sitemap information that is at least potentially out of date;
  
  updating the sitemap information for the identified website by downloading updated sitemap information for the identified website; and
  
  scheduling documents for crawling in accordance with the updated sitemap information for the identified website;
  
  wherein the sitemap information includes one or more sitemap indexes, wherein each sitemap index in the one or more sitemap indices includes a list of URLs corresponding to documents stored at a website in the plurality of websites, and each sitemap index in the one or more sitemap indexes includes information identifying one or more of:
  
  a last modification date of a URL in the list of URLs, a change frequency of a document specified by the URL, a title of the document, an authority of the document, and a priority of the document.
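
The per-URL metadata listed in the independent claims (last modification date, change frequency, priority) can be read from an XML sitemap, one of the formats named in claim 5. The sketch below assumes the publicly documented sitemaps.org 0.9 schema with `loc`, `lastmod`, `changefreq`, and `priority` elements; the record does not state that the claimed sitemap information uses exactly this layout, and the title and authority fields are left out because that schema has no elements for them. `parse_sitemap` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Union

# Namespace of the public sitemaps.org 0.9 schema (an assumption about the format).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

Entry = Dict[str, Union[str, float, None]]


def _text(parent: ET.Element, tag: str) -> Optional[str]:
    value = parent.findtext(f"sm:{tag}", default=None, namespaces=NS)
    return value.strip() if value else None


def parse_sitemap(xml_text: str) -> List[Entry]:
    """Extract URL entries and their optional metadata from a sitemap document."""
    root = ET.fromstring(xml_text)
    entries: List[Entry] = []
    for url in root.findall("sm:url", NS):
        priority = _text(url, "priority")
        entries.append({
            "loc": _text(url, "loc"),
            "lastmod": _text(url, "lastmod"),
            "changefreq": _text(url, "changefreq"),
            "priority": float(priority) if priority is not None else None,
        })
    return entries


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/about</loc>
  </url>
</urlset>"""
```

Calling `parse_sitemap(SAMPLE)` yields two entries; the second has `None` for every optional field, which a scheduler would need to handle with defaults.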

33. A non-transitory computer readable storage medium storing one or more programs to be executed by a computer system, the one or more programs comprising instructions for:
- obtaining sitemap information for a plurality of websites;
  
  analyzing the sitemap information to identify a website, in the plurality of websites, having sitemap information that is at least potentially out of date;
  
  updating the sitemap information for the identified website by downloading updated sitemap information for the identified website; and
  
  scheduling documents for crawling in accordance with the updated sitemap information for the identified website;
  
  wherein the sitemap information includes one or more sitemap indexes, wherein each sitemap index in the one or more sitemap indices includes a list of URLs corresponding to documents stored at a website in the plurality of websites, and each sitemap index in the one or more sitemap indexes includes information identifying one or more of:
  
  a last modification date of a URL in the list of URLs, a change frequency of a document specified by the URL, a title of the document, an authority of the document, and a priority of the document.
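
Claims 18 to 21 refer to a differential sitemap that captures the difference between a previous and a current sitemap for a website. A minimal Python sketch using set differences follows. Claim 20 only mentions URLs present in the first sitemap but absent from the second; the `added` list and the function name `differential_sitemap` are assumptions added for illustration.

```python
from typing import Dict, Iterable, List


def differential_sitemap(
    previous: Iterable[str], current: Iterable[str]
) -> Dict[str, List[str]]:
    """Compare two URL lists for the same website."""
    prev, curr = set(previous), set(current)
    return {
        # In the previous sitemap but not in the current one (as in claim 20).
        "removed": sorted(prev - curr),
        # In the current sitemap but not in the previous one (assumed extension).
        "added": sorted(curr - prev),
    }
```

A scheduler could then queue only the `added` URLs for crawling and recheck the `removed` ones, instead of walking the full sitemap again.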

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Brawer, Sascha B.; Ibel, Maximilian; Keller, Ralph Michael; Shivakumar, Narayanan
Primary Examiner(s): Beausoliel, Robert W.
Assistant Examiner(s): Liu, Hexing
Application Number: US13/858,872
Publication Number: US 20130226898A1
Time in Patent Office: 729 Days
Field of Search: None
US Class Current: 707/709
CPC Class Codes: G06F 16/951 Indexing; Web crawling tech...
