
Web crawler scheduler that utilizes sitemaps from websites

  • US 7,769,742 B1
  • Filed: 06/30/2005
  • Issued: 08/03/2010
  • Est. Priority Date: 05/31/2005
  • Status: Expired due to Fees
First Claim

1. A method of scheduling documents for crawling, performed on a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors, the method comprising:

  • receiving from a website a notification that includes a sitemap URL corresponding to a sitemap for the website;

    in response to the notification:

    accessing the sitemap at the sitemap URL; and

    retrieving from the sitemap document location information and metadata for a plurality of documents associated with the website, wherein the metadata comprises, for at least a plurality of respective documents associated with the website, document update rate information indicating update frequencies associated with the respective documents;

    scheduling for downloading documents, from among the plurality of documents, based at least in part on the metadata retrieved from the sitemap, wherein the scheduling includes scheduling a respective document for downloading when one of a date or date and time at which the respective document was last downloaded differs from a current date or a current date and time by an amount that is greater than a duration corresponding to the document update rate information for the respective document; and

    downloading at least a subset of the documents scheduled for downloading.
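
The scheduling condition in the claim can be illustrated with a minimal sketch. It is not the patented implementation: the names `SitemapEntry`, `UPDATE_INTERVALS`, `is_due`, and `schedule` are hypothetical, the mapping from update-rate labels to durations is an assumption (the claim only requires "a duration corresponding to the document update rate information"), and the handling of never-downloaded documents and unknown labels is likewise assumed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Assumed mapping from an update-rate label to a duration.
# The claim does not specify labels or values; these are illustrative.
UPDATE_INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


@dataclass
class SitemapEntry:
    """Hypothetical record for one document listed in a sitemap."""
    url: str                                    # document location information
    update_rate: str                            # document update rate information
    last_downloaded: Optional[datetime] = None  # None if never downloaded


def is_due(entry: SitemapEntry, now: datetime) -> bool:
    """Return True when the time since the last download exceeds the
    duration corresponding to the entry's update rate."""
    if entry.last_downloaded is None:
        # Assumption: a document that was never downloaded is due.
        return True
    interval = UPDATE_INTERVALS.get(entry.update_rate)
    if interval is None:
        # Assumption: an unrecognized label is not scheduled by this rule.
        return False
    # Strictly greater than, matching "an amount that is greater than".
    return now - entry.last_downloaded > interval


def schedule(entries: List[SitemapEntry], now: datetime) -> List[str]:
    """Return the URLs of the documents scheduled for downloading."""
    return [e.url for e in entries if is_due(e, now)]
```

Under these assumptions, a document marked "daily" and last downloaded two days ago is scheduled, while one marked "weekly" with the same last-download time is not. A real scheduler would typically combine this rule with other signals, since the claim says scheduling is based only "at least in part" on the sitemap metadata.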
