Web crawler scheduler that utilizes sitemaps from websites
First Claim
1. A method of scheduling documents for crawling, performed on a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors, the method comprising:
storing sitemap information for a plurality of websites, wherein the information includes a predicted update period for at least a plurality of documents identified by the sitemap information;
analyzing the stored sitemap information to identify a respective website having sitemap information that is at least potentially out of date;
updating the stored sitemap information for the identified respective website by downloading updated sitemap information for the identified respective website; and
scheduling documents for crawling in accordance with the updated stored sitemap information for the identified respective website.
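The four steps of the claim (store, analyze, update, schedule) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the patent: the `SitemapRecord` layout, the staleness test (a sitemap is treated as potentially out of date once it is older than the shortest predicted update period it lists), the `download` callback, and the rule that each document is scheduled one predicted update period after the refresh.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SitemapRecord:
    """Stored sitemap information for one website (hypothetical layout)."""
    fetched_at: float  # time the sitemap was last downloaded, in seconds
    # document URL -> predicted update period, in seconds
    update_periods: Dict[str, float] = field(default_factory=dict)


def potentially_out_of_date(record: SitemapRecord, now: float) -> bool:
    """Assumed test: stale once older than the shortest predicted update period."""
    if not record.update_periods:
        return True
    return now - record.fetched_at >= min(record.update_periods.values())


def schedule_crawls(
    store: Dict[str, SitemapRecord],
    download: Callable[[str], Dict[str, float]],
    now: float,
) -> List[Tuple[float, str]]:
    """Refresh stale sitemaps and return (crawl_time, url) pairs, earliest first."""
    schedule: List[Tuple[float, str]] = []
    for site, record in list(store.items()):
        # Analyze the stored information to find sites whose sitemap may be stale.
        if not potentially_out_of_date(record, now):
            continue
        # Update the stored information by downloading the sitemap again.
        record = SitemapRecord(fetched_at=now, update_periods=download(site))
        store[site] = record
        # Schedule each document according to the updated information.
        for url, period in record.update_periods.items():
            schedule.append((now + period, url))
    return sorted(schedule)
```

In this sketch, `download` stands in for whatever fetches and parses a site's sitemap; a real scheduler would also need error handling and politeness limits, which are omitted here.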
Abstract
Methods and systems for a web crawler scheduler that utilizes sitemaps from websites are described. A web crawler scheduling system receives a notification from a website or web server. In response to the notification, the system accesses one or more sitemap(s) for documents associated with the website or web server. The system schedules crawls of the documents based on information identified from the sitemaps. The system crawls at least a subset of the documents scheduled for crawling.
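The notification-driven flow in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: it assumes the sitemap uses the public Sitemaps XML format (the `urlset`, `url`, `loc`, and `priority` elements in the sitemaps.org namespace), and the function name `on_notification` and the `crawl_budget` parameter are hypothetical. Ordering by `priority` is one possible way to pick the subset of documents that actually gets crawled.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace used by the public Sitemaps XML format.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def on_notification(sitemap_xml: str, crawl_budget: int) -> List[str]:
    """Parse a sitemap received after a notification and pick URLs to crawl.

    Returns at most `crawl_budget` URLs, highest priority first.
    """
    root = ET.fromstring(sitemap_xml)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        # The Sitemaps format treats a missing priority as 0.5.
        priority = float(url.findtext("sm:priority", default="0.5", namespaces=NS))
        if loc:
            entries.append((priority, loc))
    # Stable sort: documents with equal priority keep their sitemap order.
    entries.sort(key=lambda entry: -entry[0])
    # Crawl only a subset of the scheduled documents.
    return [loc for _, loc in entries[:crawl_budget]]
```

A caller would fetch the sitemap named in the notification, pass its text to `on_notification`, and hand the returned URLs to the crawler.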
21 Claims
1. A method of scheduling documents for crawling, performed on a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors, the method comprising:
storing sitemap information for a plurality of websites, wherein the information includes a predicted update period for at least a plurality of documents identified by the sitemap information;
analyzing the stored sitemap information to identify a respective website having sitemap information that is at least potentially out of date;
updating the stored sitemap information for the identified respective website by downloading updated sitemap information for the identified respective website; and
scheduling documents for crawling in accordance with the updated stored sitemap information for the identified respective website.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system for scheduling documents for crawling, comprising:
one or more processors; and
memory storing one or more modules;
the one or more modules including instructions for:
storing sitemap information for a plurality of websites, wherein the information includes a predicted update period for at least a plurality of documents identified by the sitemap information;
analyzing the stored sitemap information to identify a respective website having sitemap information that is at least potentially out of date;
updating the stored sitemap information for the identified respective website by downloading updated sitemap information for the identified respective website; and
scheduling documents for crawling in accordance with the updated stored sitemap information for the identified respective website.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A non-transitory computer readable storage medium storing one or more programs configured for execution by a computer, the one or more programs comprising instructions for:
storing sitemap information for a plurality of websites, wherein the information includes a predicted update period for at least a plurality of documents identified by the sitemap information;
analyzing the stored sitemap information to identify a respective website having sitemap information that is at least potentially out of date;
updating the stored sitemap information for the identified respective website by downloading updated sitemap information for the identified respective website; and
scheduling documents for crawling in accordance with the updated stored sitemap information for the identified respective website.
Dependent claims: 16, 17, 18, 19, 20, 21
Specification