Web crawler scheduler that utilizes sitemaps from websites
Abstract
Methods and systems for a web crawler scheduler that utilizes sitemaps from websites are described. A web crawler scheduling system receives a notification from a website or web server. In response to the notification, the system accesses one or more sitemap(s) for documents associated with the website or web server. The system schedules crawls of the documents based on information identified from the sitemaps. The system crawls at least a subset of the documents scheduled for crawling.
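The abstract does not say what form the notification takes. As a minimal sketch, assume the website notifies the crawler with an HTTP GET request whose query string carries the sitemap URL in a parameter named `sitemap`. The parameter name, the function name, and the example host are all hypothetical and are not taken from the patent.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def sitemap_url_from_notification(request_url: str) -> Optional[str]:
    """Return the sitemap URL carried by a notification request, or None.

    Assumes (hypothetically) that the URL is passed as ?sitemap=<url>.
    """
    query = parse_qs(urlsplit(request_url).query)  # also percent-decodes values
    values = query.get("sitemap")
    if not values:
        return None  # the notification did not include a sitemap URL
    candidate = values[0]
    parts = urlsplit(candidate)
    # Accept only absolute http(s) URLs so the crawler has something to fetch.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    return candidate
```

A request such as `https://crawler.example/ping?sitemap=https%3A%2F%2Fwww.example.com%2Fsitemap.xml` would yield `https://www.example.com/sitemap.xml`, which the system could then access in response to the notification.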
15 Claims
1. A method of scheduling documents for crawling, performed on a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors, the method comprising:
receiving from a website a notification that includes a sitemap URL corresponding to a sitemap for the website;
in response to the notification:
accessing the sitemap at the sitemap URL; and
retrieving from the sitemap document location information and metadata for a plurality of documents associated with the website;
scheduling for downloading documents, from among the plurality of documents, based at least in part on the metadata retrieved from the sitemap; and
downloading at least a subset of the documents scheduled for downloading.
Dependent claims: 2, 3, 4, 5
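The retrieving and scheduling steps of claim 1 can be sketched as follows. The claim does not specify which metadata is used, so this sketch assumes the `<lastmod>` and `<priority>` fields of the public Sitemaps XML protocol. The ordering rule (higher priority first, more recent `lastmod` breaking ties) and the names `SitemapEntry`, `parse_sitemap`, and `schedule_downloads` are illustrative assumptions, not the patented method.

```python
import heapq
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

# Namespace used by the public Sitemaps protocol (sitemaps.org).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


@dataclass
class SitemapEntry:
    loc: str                # document location information
    lastmod: Optional[str]  # metadata: last modification date, if given
    priority: float         # metadata: relative priority, 0.0 to 1.0


def parse_sitemap(xml_text: str) -> List[SitemapEntry]:
    """Retrieve location and metadata for each <url> element in a sitemap."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = (url.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        if not loc:
            continue  # skip entries without a location
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        try:
            priority = float(url.findtext("sm:priority", default="0.5", namespaces=NS))
        except ValueError:
            priority = 0.5  # fall back to the protocol's default priority
        entries.append(SitemapEntry(loc, lastmod, priority))
    return entries


def schedule_downloads(entries: List[SitemapEntry], limit: int) -> List[str]:
    """Pick up to `limit` locations, ordered by (priority, lastmod) descending.

    Comparing lastmod values as strings is a simplification that only works
    when all dates share one format, such as YYYY-MM-DD.
    """
    chosen = heapq.nlargest(limit, entries, key=lambda e: (e.priority, e.lastmod or ""))
    return [e.loc for e in chosen]
```

Passing a `limit` smaller than the number of entries corresponds to downloading only a subset of the scheduled documents.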
6. A system for scheduling documents for crawling, comprising:
one or more processors; and
memory storing one or more modules;
the one or more modules including instructions to:
receive from a website a notification that includes a sitemap URL corresponding to a sitemap for the website;
in response to the notification:
access the sitemap at the sitemap URL; and
retrieve from the sitemap document location information and metadata for a plurality of documents associated with the website;
schedule for downloading documents, from among the plurality of documents, based at least in part on the metadata retrieved from the sitemap; and
download at least a subset of the documents scheduled for downloading.
Dependent claims: 7, 8, 9, 10
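One way to picture the modules of claim 6 working together is a small class that takes its fetching, parsing, and ranking components as parameters and runs them in the claimed order. This is only a sketch: the class name `CrawlSystem`, the `budget` parameter, and the idea of ranking by a single numeric score are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A document is represented as (location, metadata) for this sketch.
Document = Tuple[str, Dict[str, str]]


@dataclass
class CrawlSystem:
    fetch: Callable[[str], str]              # downloads a URL and returns its body
    parse: Callable[[str], List[Document]]   # extracts documents from a sitemap body
    rank: Callable[[Dict[str, str]], float]  # scores a document from its metadata
    budget: int                              # maximum number of documents to download

    def on_notification(self, sitemap_url: str) -> Dict[str, str]:
        """Handle one notification and return {location: downloaded body}."""
        sitemap_body = self.fetch(sitemap_url)  # access the sitemap
        documents = self.parse(sitemap_body)    # retrieve locations and metadata
        # Schedule: highest-scoring documents first, based on metadata only.
        scheduled = sorted(documents, key=lambda d: self.rank(d[1]), reverse=True)
        # Download at least a subset: here, the first `budget` scheduled documents.
        return {loc: self.fetch(loc) for loc, _ in scheduled[: self.budget]}
```

Injecting `fetch` keeps the sketch testable without network access: a dictionary lookup can stand in for an HTTP client.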
11. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising instructions for:
receiving from a website a notification that includes a sitemap URL corresponding to a sitemap for the website;
in response to the notification:
accessing the sitemap at the sitemap URL; and
retrieving from the sitemap document location information and metadata for a plurality of documents associated with the website;
scheduling for downloading documents, from among the plurality of documents, based at least in part on the metadata retrieved from the sitemap; and
downloading at least a subset of the documents scheduled for downloading.
Dependent claims: 12, 13, 14, 15
Specification