Web crawler scheduler that utilizes sitemaps from websites
First Claim
1. A method of scheduling documents for crawling, performed on a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors, the method comprising:
obtaining sitemap information for a plurality of websites;
analyzing the sitemap information to identify a website, in the plurality of websites, having sitemap information that is at least potentially out of date;
updating the sitemap information for the identified website by downloading updated sitemap information for the identified website; and
scheduling documents for crawling in accordance with the updated sitemap information for the identified website;
wherein the sitemap information includes one or more sitemap indexes, wherein each sitemap index in the one or more sitemap indexes includes a list of URLs corresponding to documents stored at a website in the plurality of websites, and each sitemap index in the one or more sitemap indexes includes information identifying one or more of:
a last modification date of a URL in the list of URLs, a change frequency of a document specified by the URL, a title of the document, an authority of the document, and a priority of the document.
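The per-URL information recited in the claim can be illustrated with a short sketch. It assumes the sitemap is an XML file in the public sitemaps.org format, which the claim itself does not require; the names `SitemapEntry` and `parse_sitemap` are hypothetical. The `title` and `authority` fields are carried in the record only because the claim lists them; the assumed XML format has no standard tags for them, so the parser leaves them unset.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

# Assumption: sitemaps use the sitemaps.org XML namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


@dataclass
class SitemapEntry:
    """One URL from a sitemap plus the optional metadata named in the claim."""
    url: str
    lastmod: Optional[str] = None      # last modification date
    changefreq: Optional[str] = None   # change frequency
    priority: Optional[float] = None   # priority of the document
    title: Optional[str] = None        # not parsed here (no standard tag assumed)
    authority: Optional[float] = None  # not parsed here (no standard tag assumed)


def parse_sitemap(xml_text: str) -> List[SitemapEntry]:
    """Return one SitemapEntry per <url> element that has a non-empty <loc>."""
    root = ET.fromstring(xml_text)
    entries: List[SitemapEntry] = []
    for node in root.findall("sm:url", NS):
        loc = node.findtext("sm:loc", default="", namespaces=NS).strip()
        if not loc:
            continue  # skip malformed entries without a URL
        priority = node.findtext("sm:priority", namespaces=NS)
        entries.append(
            SitemapEntry(
                url=loc,
                lastmod=node.findtext("sm:lastmod", namespaces=NS),
                changefreq=node.findtext("sm:changefreq", namespaces=NS),
                priority=float(priority) if priority else None,
            )
        )
    return entries
```

Every metadata field is optional in this sketch, matching the claim language "one or more of".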
Abstract
Systems and methods for scheduling documents for crawling are disclosed. In some implementations, a method includes obtaining sitemap information for a plurality of websites, and analyzing the sitemap information to identify a website, in the plurality of websites, whose sitemap information is at least potentially out of date. The method also includes updating the sitemap information for the identified website by downloading updated sitemap information for the identified website, and scheduling documents for crawling in accordance with the updated sitemap information for the identified website.
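The four steps in the abstract (obtain, analyze, update, schedule) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: staleness is approximated by a simple age threshold, whereas the abstract only says "at least potentially out of date", and the names `SiteSitemap`, `find_stale_sites`, `refresh_and_schedule`, and the `download` callback are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, Dict, List


@dataclass
class SiteSitemap:
    """Stored sitemap information for one website."""
    site: str
    fetched_at: datetime                 # when the sitemap was last downloaded
    urls: List[str] = field(default_factory=list)


def find_stale_sites(
    sitemaps: Dict[str, SiteSitemap], now: datetime, max_age: timedelta
) -> List[str]:
    """Analyze step. Assumption: a sitemap older than max_age may be out of date."""
    return [s for s, sm in sitemaps.items() if now - sm.fetched_at > max_age]


def refresh_and_schedule(
    sitemaps: Dict[str, SiteSitemap],
    download: Callable[[str], List[str]],  # hypothetical fetcher: site -> URL list
    now: datetime,
    max_age: timedelta = timedelta(days=1),
) -> List[str]:
    """Update stale sitemaps, then schedule their URLs for crawling."""
    crawl_queue: List[str] = []
    for site in find_stale_sites(sitemaps, now, max_age):
        urls = download(site)                           # update step
        sitemaps[site] = SiteSitemap(site, now, urls)
        crawl_queue.extend(urls)                        # schedule step
    return crawl_queue
```

Passing `download` as a parameter keeps the sketch free of network code; a real scheduler would presumably also order the queue using per-URL metadata such as priority or change frequency.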
58 Citations
33 Claims
1. A method of scheduling documents for crawling, performed on a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors, the method comprising:
obtaining sitemap information for a plurality of websites;
analyzing the sitemap information to identify a website, in the plurality of websites, having sitemap information that is at least potentially out of date;
updating the sitemap information for the identified website by downloading updated sitemap information for the identified website; and
scheduling documents for crawling in accordance with the updated sitemap information for the identified website;
wherein the sitemap information includes one or more sitemap indexes, wherein each sitemap index in the one or more sitemap indexes includes a list of URLs corresponding to documents stored at a website in the plurality of websites, and each sitemap index in the one or more sitemap indexes includes information identifying one or more of:
a last modification date of a URL in the list of URLs, a change frequency of a document specified by the URL, a title of the document, an authority of the document, and a priority of the document.
Dependent claims: 2-31.
32. A computer system comprising:
one or more processors;
a memory storing one or more programs for execution by the one or more processors, wherein the one or more programs comprise instructions for:
obtaining sitemap information for a plurality of websites;
analyzing the sitemap information to identify a website, in the plurality of websites, having sitemap information that is at least potentially out of date;
updating the sitemap information for the identified website by downloading updated sitemap information for the identified website; and
scheduling documents for crawling in accordance with the updated sitemap information for the identified website;
wherein the sitemap information includes one or more sitemap indexes, wherein each sitemap index in the one or more sitemap indexes includes a list of URLs corresponding to documents stored at a website in the plurality of websites, and each sitemap index in the one or more sitemap indexes includes information identifying one or more of:
a last modification date of a URL in the list of URLs, a change frequency of a document specified by the URL, a title of the document, an authority of the document, and a priority of the document.
33. A non-transitory computer readable storage medium storing one or more programs to be executed by a computer system, the one or more programs comprising instructions for:
obtaining sitemap information for a plurality of websites;
analyzing the sitemap information to identify a website, in the plurality of websites, having sitemap information that is at least potentially out of date;
updating the sitemap information for the identified website by downloading updated sitemap information for the identified website; and
scheduling documents for crawling in accordance with the updated sitemap information for the identified website;
wherein the sitemap information includes one or more sitemap indexes, wherein each sitemap index in the one or more sitemap indexes includes a list of URLs corresponding to documents stored at a website in the plurality of websites, and each sitemap index in the one or more sitemap indexes includes information identifying one or more of:
a last modification date of a URL in the list of URLs, a change frequency of a document specified by the URL, a title of the document, an authority of the document, and a priority of the document.
Specification