Sitemap generating client for web crawler

US 7,801,881 B1
Filed: 06/30/2005
Issued: 09/21/2010
Est. Priority Date: 05/31/2005
Status: Expired due to Fees

Abstract

Methods and systems for a sitemap generating client for web crawlers are described. The client accesses one or more sources of document information about the documents available on a website, such as the file system, access logs, or pre-made URL lists. Document information is extracted from the sources and one or more sitemaps are generated based on the extracted document information. A notification is transmitted to a remote computer, informing that the sitemap(s) are available for access and likely have been updated. If the remote computer is associated with a web crawler, the remote computer may access the sitemap(s) and use the sitemaps to schedule a crawl of documents included or available on the website.
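
As a rough illustration of the flow the abstract describes, the sketch below treats the file system as the source of document information, extracts each file's modification date, and produces a sitemap as XML. The function name `build_sitemap`, the `base_url` parameter, and the use of the public sitemaps.org `urlset` layout are assumptions made for this example; the text here does not prescribe a file format.

```python
import os
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from urllib.parse import quote

# Assumption: the public sitemaps.org schema, used only as a plausible output format.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(doc_root: str, base_url: str) -> str:
    """Walk doc_root and return sitemap XML listing each file with its modification date."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for dirpath, dirnames, filenames in os.walk(doc_root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Path relative to the document root, with URL-style separators.
            rel = os.path.relpath(path, doc_root).replace(os.sep, "/")
            mtime = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = base_url.rstrip("/") + "/" + quote(rel)
            ET.SubElement(url, "lastmod").text = mtime.strftime("%Y-%m-%d")
    return ET.tostring(urlset, encoding="unicode")
```

Other sources named in the abstract, such as access logs or pre-made URL lists, could feed the same `url` entries; only the extraction step would differ.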

22 Claims

1. A method of listing documents performed by a website server system having one or more processors and memory storing one or more programs for execution by the one or more processors, comprising:
- accessing one or more sources of document information, wherein the one or more sources of document information are associated with a website server;
  
  extracting the document information including metadata from the sources;
  
  generating a sitemap of a website at the website server, the sitemap including a list of documents and corresponding metadata for each of a plurality of documents in the list of documents based on the document information, wherein the metadata comprises at least one of the group consisting of: document modification date information associated with the plurality of documents, document access frequency information associated with the plurality of documents, document priority information associated with the plurality of documents, and document update rate information associated with the plurality of documents, wherein the document priority information indicates a crawling priority; and
  
  storing the sitemap at a location; and
  
  transmitting a notification from the website server to a remote computer associated with a web crawler system, the notification including information that identifies the location of the sitemap, the notification functioning as an indication that the sitemap is available for access.
  - 2. The method of claim 1, wherein the sitemap metadata provides information for at least one of:
    - prioritizing documents for crawling by a web crawler, and selecting documents for inclusion in a crawl by the web crawler.
  - 3. The method of claim 1, wherein the sources of document information comprise at least one of the group consisting of:
    - a file system, one or more access logs, and one or more document location lists.
  - 4. The method of claim 1, wherein the document information comprises document location information and the plurality of documents are accessible to other computers via a network.
  - 5. The method of claim 1, wherein the document metadata information comprises at least document update rate information associated with the plurality of documents.
  - 6. The method of claim 1, wherein the document metadata information comprises at least relative priority information associated with the plurality of documents.
  - 7. The method of claim 1, wherein generating the sitemap comprises generating a list of documents modified after a particular time.
  - 8. The method of claim 1, further comprising generating a plurality of sitemaps, and generating an index referencing the plurality of sitemaps;
    - wherein the notification identifies the index.
  - 9. The method of claim 1, wherein the sitemap comprises a current sitemap, the method further comprising:
    - determining a difference between the current sitemap and a prior sitemap; and
      
      generating a differential sitemap based on the difference.
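
The differential sitemap of claim 9 can be sketched as a comparison between two snapshots. In this minimal example each sitemap is modeled as a dictionary mapping a document URL to its modification date string; that representation and the name `differential_sitemap` are assumptions for illustration, not something the claims specify.

```python
from typing import Dict


def differential_sitemap(current: Dict[str, str], prior: Dict[str, str]) -> Dict[str, str]:
    """Return the entries of `current` that are new or whose modification date changed."""
    return {
        url: lastmod
        for url, lastmod in current.items()
        if prior.get(url) != lastmod  # missing in prior, or present with a different date
    }


prior = {
    "https://example.com/a.html": "2005-05-01",
    "https://example.com/b.html": "2005-05-02",
}
current = {
    "https://example.com/a.html": "2005-05-01",  # unchanged
    "https://example.com/b.html": "2005-06-15",  # modified
    "https://example.com/c.html": "2005-06-20",  # new
}
changed = differential_sitemap(current, prior)
```

This sketch reports only added or modified documents. Documents present in the prior sitemap but absent from the current one are not represented; whether and how to list removals is a design choice the claim language leaves open.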

10. A system for listing documents, comprising:
- one or more processors and memory, the memory comprising one or more sources of document information; and
  
  one or more modules including instructions to:
  
  access the sources of document information, wherein the sources are associated with a website server;
  
  extract the document information including metadata from the sources;
  
  generate a sitemap of a website at the website server, the sitemap including a list of documents and corresponding metadata for each of a plurality of documents in the list of documents based on the document information, wherein the metadata comprises at least one of the group consisting of: document modification date information associated with the plurality of documents, document access frequency information associated with the plurality of documents, document priority information associated with the plurality of documents, and document update rate information associated with the plurality of documents, wherein the document priority information indicates a crawling priority; and
  
  store the sitemap at a location; and
  
  transmit a notification from the website server to a remote computer associated with a web crawler system, the notification including information that identifies the location of the sitemap, the notification functioning as an indication that the sitemap is available for access.
  - 11. The system of claim 10, wherein the document information comprises document location information, and the plurality of documents are accessible to other computers via a network.
  - 12. The system of claim 10, wherein the document metadata information comprises at least relative priority information associated with the plurality of documents, wherein the relative priority information indicates a crawling priority.
  - 13. The system of claim 10, wherein the instructions to generate the sitemap include instructions to generate a list of documents modified after a particular time.
  - 14. The system of claim 10, wherein the one or more modules further include instructions to generate a plurality of sitemaps, and to generate an index referencing the plurality of sitemaps;
    - wherein the notification identifies the index.
  - 15. The system of claim 10, wherein the sitemap comprises a current sitemap, the one or more modules further including instructions to:
    - determine a difference between the current sitemap and a prior sitemap; and
      
      generate a differential sitemap based on the difference.
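
Claim 14 describes generating several sitemaps plus an index that references them. The sketch below shows one way that could look: a URL list is split into fixed-size groups, and an index document lists the location of each resulting sitemap. The helper names, the `max_per_sitemap` parameter, and the sitemaps.org `sitemapindex` layout are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumption: the public sitemaps.org schema, used only as a plausible output format.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk_urls(urls: List[str], max_per_sitemap: int) -> List[List[str]]:
    """Split a URL list into consecutive groups, one group per sitemap file."""
    return [urls[i:i + max_per_sitemap] for i in range(0, len(urls), max_per_sitemap)]


def build_sitemap_index(sitemap_locations: List[str]) -> str:
    """Return index XML that references each sitemap by its location."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_locations:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = loc
    return ET.tostring(index, encoding="unicode")
```

With an index in place, the notification only needs to identify the index location, and the crawler can discover the individual sitemaps from it.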

16. A computer readable storage medium and one or more computer programs embedded therein, the computer programs comprising instructions, which when executed by a computer system, cause the computer system to:
- access one or more sources of document information, wherein the sources are associated with a website server;
  
  extract the document information including metadata from the sources;
  
  generate a sitemap of a website at the website server, the sitemap including a list of documents and corresponding metadata for each of a plurality of documents in the list of documents based on the document information, wherein the metadata comprises at least one of the group consisting of: document modification date information associated with the plurality of documents, document access frequency information associated with the plurality of documents, document priority information associated with the plurality of documents, and document update rate information associated with the plurality of documents, wherein the document priority information indicates a crawling priority; and
  
  store the sitemap at a location; and
  
  transmit a notification from the website server to a remote computer associated with a web crawler system, the notification including information that identifies the location of the sitemap, the notification functioning as an indication that the sitemap is available for access.
  - 17. The computer readable storage medium of claim 16, wherein the document information comprises document location information and the plurality of documents are accessible to other computers via a network.
  - 18. The computer readable storage medium of claim 16, wherein the document metadata information comprises at least relative priority information associated with the plurality of documents, wherein the relative priority information indicates a crawling priority.
  - 19. The computer readable storage medium of claim 16, wherein the instructions, which when executed by a computer system, cause the computer system to generate the sitemap comprise instructions for generating a list of documents modified after a particular time.
  - 20. The computer readable storage medium of claim 16, further comprising instructions, which when executed by a computer system, cause the computer system to generate a plurality of sitemaps, and generating an index referencing the plurality of sitemaps;
    - wherein the notification identifies the index.
  - 21. The computer readable storage medium of claim 16, wherein the sitemap comprises a current sitemap, the computer programs further comprising instructions, which when executed by a computer system, cause the computer system to:
    - determine a difference between the current sitemap and a prior sitemap; and
      
      generate a differential sitemap based on the difference.
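
Claim 19 covers generating a list of documents modified after a particular time. A minimal sketch of that filter follows, assuming the extracted document information is available as a mapping from URL to a timezone-aware modification timestamp; the function name `modified_after` and that data shape are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, List


def modified_after(docs: Dict[str, datetime], cutoff: datetime) -> List[str]:
    """Return, in sorted order, the URLs whose modification time is later than cutoff."""
    return sorted(url for url, modified in docs.items() if modified > cutoff)


docs = {
    "https://example.com/old.html": datetime(2005, 5, 1, tzinfo=timezone.utc),
    "https://example.com/new.html": datetime(2005, 6, 20, tzinfo=timezone.utc),
}
recent = modified_after(docs, datetime(2005, 6, 1, tzinfo=timezone.utc))
```

The comparison is strict, so a document modified exactly at the cutoff is excluded; an inclusive comparison would be an equally reasonable reading.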

22. A system for listing documents, comprising:
- one or more processors and memory, the memory comprising one or more sources of document information;
  
  means for accessing the sources of document information, wherein the sources are associated with a website server;
  
  means for extracting the document information including metadata from the sources;
  
  means for generating a sitemap of a website at the website server, the sitemap including a list of documents and corresponding metadata for each of a plurality of documents in the list of documents based on the document information, wherein the metadata comprises at least one of the group consisting of: document modification date information associated with the plurality of documents, document access frequency information associated with the plurality of documents, document priority information associated with the plurality of documents, and document update rate information associated with the plurality of documents, wherein the document priority information indicates a crawling priority; and
  
  means for storing the sitemap at a location; and
  
  means for transmitting a notification from the website server to a remote computer associated with a web crawler system, the notification including information that identifies the location of the sitemap, the notification functioning as an indication that the sitemap is available for access.
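
The notification step shared by the independent claims carries the location of the stored sitemap to the crawler's side. One plausible shape for it is an HTTP request whose query string holds the sitemap URL, sketched below. The endpoint `https://crawler.example/ping` and the `sitemap` parameter name are hypothetical placeholders, not an interface named in the claims.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_notification_url(ping_endpoint: str, sitemap_location: str) -> str:
    """Return a URL that identifies the sitemap location to a crawler-side endpoint."""
    # urlencode percent-encodes the sitemap URL so it survives as a single parameter.
    return ping_endpoint + "?" + urlencode({"sitemap": sitemap_location})


# Hypothetical endpoint and sitemap location, for illustration only.
notification = build_notification_url(
    "https://crawler.example/ping",
    "https://example.com/sitemap.xml",
)
# The receiving side can recover the location from the query string.
received = parse_qs(urlsplit(notification).query)["sitemap"][0]
```

The sketch stops at building the request URL. Actually transmitting it, for example with `urllib.request.urlopen`, would require a real endpoint.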

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Shivakumar, Narayanan; Ibel, Maximilian; Brawer, Sascha B.; Keller, Ralph Michael
Primary Examiner(s): Cottingham; John R.
Assistant Examiner(s): Liu; Hexing
Application Number: US11/172,692
Time in Patent Office: 1,909 Days
Field of Search: None
US Class Current: 707/709
CPC Class Codes: G06F 16/951 Indexing; Web crawling tech...
