Sitemap generating client for web crawler
Abstract
Methods and systems for a sitemap generating client for web crawlers are described. The client accesses one or more sources of document information about the documents available on a website, such as the file system, access logs, or pre-made URL lists. Document information is extracted from the sources, and one or more sitemaps are generated based on the extracted document information. A notification is transmitted to a remote computer, indicating that the sitemap(s) are available for access and have likely been updated. If the remote computer is associated with a web crawler, the remote computer may access the sitemap(s) and use them to schedule a crawl of documents included in or available on the website.
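The generation step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes the file system is the only source of document information, that the modification date is the only metadata extracted, and that the output follows the public Sitemaps XML format (sitemaps.org), which the text above does not name. The function names `collect_documents` and `build_sitemap` are hypothetical.

```python
import os
from datetime import datetime, timezone
from xml.etree import ElementTree as ET

# Assumption: output uses the public Sitemaps XML namespace.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def collect_documents(root_dir, base_url):
    """Walk root_dir and return (url, last_modified) pairs sorted by URL."""
    docs = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Map the file path to a URL path relative to the site root.
            rel = os.path.relpath(path, root_dir).replace(os.sep, "/")
            mtime = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
            docs.append((base_url.rstrip("/") + "/" + rel, mtime))
    return sorted(docs)


def build_sitemap(docs, priority=None):
    """Serialize (url, last_modified) pairs into a sitemap XML string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, mtime in docs:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = mtime.strftime("%Y-%m-%d")
        if priority is not None:
            # Optional crawl-priority hint, applied uniformly in this sketch.
            ET.SubElement(entry, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="unicode")
```

Other sources mentioned in the abstract, such as access logs or pre-made URL lists, would feed the same `build_sitemap` step with different extraction code.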
22 Claims
1. A method of listing documents performed by a website server system having one or more processors and memory storing one or more programs for execution by the one or more processors, comprising:
accessing one or more sources of document information, wherein the one or more sources of document information are associated with a website server;
extracting the document information including metadata from the sources;
generating a sitemap of a website at the website server, the sitemap including a list of documents and corresponding metadata for each of a plurality of documents in the list of documents based on the document information, wherein the metadata comprises at least one of the group consisting of:
document modification date information associated with the plurality of documents, document access frequency information associated with the plurality of documents, document priority information associated with the plurality of documents, and document update rate information associated with the plurality of documents, wherein the document priority information indicates a crawling priority; and
storing the sitemap at a location; and
transmitting a notification from the website server to a remote computer associated with a web crawler system, the notification including information that identifies the location of the sitemap, the notification functioning as an indication that the sitemap is available for access.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9

10. A system for listing documents, comprising:
one or more processors and memory, the memory comprising one or more sources of document information; and
one or more modules including instructions to:
access the sources of document information, wherein the sources are associated with a website server;
extract the document information including metadata from the sources;
generate a sitemap of a website at the website server, the sitemap including a list of documents and corresponding metadata for each of a plurality of documents in the list of documents based on the document information, wherein the metadata comprises at least one of the group consisting of:
document modification date information associated with the plurality of documents, document access frequency information associated with the plurality of documents, document priority information associated with the plurality of documents, and document update rate information associated with the plurality of documents, wherein the document priority information indicates a crawling priority; and
store the sitemap at a location; and
transmit a notification from the website server to a remote computer associated with a web crawler system, the notification including information that identifies the location of the sitemap, the notification functioning as an indication that the sitemap is available for access.
Dependent claims: 11, 12, 13, 14, 15

16. A computer readable storage medium and one or more computer programs embedded therein, the computer programs comprising instructions, which when executed by a computer system, cause the computer system to:
access one or more sources of document information, wherein the sources are associated with a website server;
extract the document information including metadata from the sources;
generate a sitemap of a website at the website server, the sitemap including a list of documents and corresponding metadata for each of a plurality of documents in the list of documents based on the document information, wherein the metadata comprises at least one of the group consisting of:
document modification date information associated with the plurality of documents, document access frequency information associated with the plurality of documents, document priority information associated with the plurality of documents, and document update rate information associated with the plurality of documents, wherein the document priority information indicates a crawling priority; and
store the sitemap at a location; and
transmit a notification from the website server to a remote computer associated with a web crawler system, the notification including information that identifies the location of the sitemap, the notification functioning as an indication that the sitemap is available for access.
Dependent claims: 17, 18, 19, 20, 21

22. A system for listing documents, comprising:
one or more processors and memory, the memory comprising one or more sources of document information;
means for accessing the sources of document information, wherein the sources are associated with a website server;
means for extracting the document information including metadata from the sources;
means for generating a sitemap of a website at the website server, the sitemap including a list of documents and corresponding metadata for each of a plurality of documents in the list of documents based on the document information, wherein the metadata comprises at least one of the group consisting of:
document modification date information associated with the plurality of documents, document access frequency information associated with the plurality of documents, document priority information associated with the plurality of documents, and document update rate information associated with the plurality of documents, wherein the document priority information indicates a crawling priority; and
means for storing the sitemap at a location; and
means for transmitting a notification from the website server to a remote computer associated with a web crawler system, the notification including information that identifies the location of the sitemap, the notification functioning as an indication that the sitemap is available for access.

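The notification recited in the independent claims above carries information that identifies where the sitemap is stored. One way this could look, purely as an illustration, is an HTTP request URL with the sitemap location passed as a query parameter. The endpoint `https://crawler.example/ping`, the parameter name `sitemap`, and the function name `build_notification_url` are assumptions made for this sketch; the claims do not specify a transport or message format. The sketch only builds the URL and does not send anything.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_notification_url(ping_endpoint, sitemap_url):
    """Return a URL that tells a crawler-side endpoint where the sitemap is.

    Both the endpoint and the 'sitemap' parameter name are hypothetical.
    """
    # urlencode percent-escapes the sitemap location so it survives as one value.
    return ping_endpoint + "?" + urlencode({"sitemap": sitemap_url})


def sitemap_location_from_notification(notification_url):
    """Recover the sitemap location on the receiving side (illustrative only)."""
    query = parse_qs(urlsplit(notification_url).query)
    return query["sitemap"][0]
```

On the receiving side, a crawler that decodes the notification would then fetch the sitemap from the recovered location and use it to schedule its crawl.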
Specification