Web crawler scheduler that utilizes sitemaps from websites

  • US 9,002,819 B2
  • Filed: 04/08/2013
  • Issued: 04/07/2015
  • Est. Priority Date: 05/31/2005
  • Status: Active Grant
First Claim

1. A method of scheduling documents for crawling, performed on a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors, the method comprising:

    obtaining sitemap information for a plurality of websites;

    analyzing the sitemap information to identify a website, in the plurality of websites, having sitemap information that is at least potentially out of date;

    updating the sitemap information for the identified website by downloading updated sitemap information for the identified website; and

    scheduling documents for crawling in accordance with the updated sitemap information for the identified website;

    wherein the sitemap information includes one or more sitemap indexes, wherein each sitemap index in the one or more sitemap indices includes a list of URLs corresponding to documents stored at a website in the plurality of websites, and each sitemap index in the one or more sitemap indexes includes information identifying one or more of:

    a last modification date of a URL in the list of URLs, a change frequency of a document specified by the URL, a title of the document, an authority of the document, and a priority of the document.
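
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `UrlEntry`, `SitemapIndex`, `is_potentially_stale`, and `schedule_crawl`, the age-based staleness test, the one-day `max_age` default, and the priority-then-recency ordering are all assumptions chosen for illustration. The claim itself does not specify how staleness is detected or how documents are ordered.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Optional


@dataclass
class UrlEntry:
    """One URL in a sitemap index, with optional per-URL metadata."""
    url: str
    lastmod: Optional[datetime] = None   # last modification date
    changefreq: Optional[str] = None     # e.g. "daily" (unused in this sketch)
    priority: float = 0.5                # assumed default priority


@dataclass
class SitemapIndex:
    """Sitemap information held for one website (hypothetical structure)."""
    site: str
    fetched_at: datetime
    entries: List[UrlEntry] = field(default_factory=list)


def is_potentially_stale(index: SitemapIndex, now: datetime,
                         max_age: timedelta) -> bool:
    # Assumption: a sitemap is "potentially out of date" once it is
    # older than max_age. Other staleness tests are equally possible.
    return now - index.fetched_at > max_age


def schedule_crawl(sitemaps: Dict[str, SitemapIndex],
                   download: Callable[[str], SitemapIndex],
                   now: datetime,
                   max_age: timedelta = timedelta(days=1)) -> List[str]:
    """Refresh stale sitemaps and return URLs in crawl order."""
    to_crawl: List[UrlEntry] = []
    for site, index in list(sitemaps.items()):
        if is_potentially_stale(index, now, max_age):
            updated = download(site)       # download updated sitemap info
            sitemaps[site] = updated       # replace the stored copy
            to_crawl.extend(updated.entries)
    # Assumed ordering: higher priority first, then most recently modified.
    to_crawl.sort(key=lambda e: (e.priority, e.lastmod or datetime.min),
                  reverse=True)
    return [e.url for e in to_crawl]
```

Here `download` is a caller-supplied function that fetches and parses a site's sitemap, so the sketch performs no network access itself. Only sites whose stored sitemap fails the staleness test are refreshed and scheduled, which mirrors the claim's "identified website" wording.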
