Web crawler scheduler that utilizes sitemaps from websites

US 7,769,742 B1
Filed: 06/30/2005
Issued: 08/03/2010
Est. Priority Date: 05/31/2005
Status: Expired due to Fees

Abstract

Methods and systems for a web crawler scheduler that utilizes sitemaps from websites are described. A web crawler scheduling system receives a notification from a website or web server. In response to the notification, the system accesses one or more sitemap(s) for documents associated with the website or web server. The system schedules crawls of the documents based on information identified from the sitemaps. The system crawls at least a subset of the documents scheduled for crawling.
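
The retrieval step described in the abstract can be sketched as follows. This is a minimal illustration only: it assumes the sitemap uses the public sitemaps.org XML format (`<loc>`, `<lastmod>`, `<changefreq>`, `<priority>`), which the text above does not specify, and the names `SitemapEntry` and `parse_sitemap` are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

# Namespace of the public sitemaps.org schema (an assumption; the text above
# does not fix a concrete sitemap format).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


@dataclass
class SitemapEntry:
    loc: str                    # document location
    lastmod: Optional[str]      # last-modification date, if given
    changefreq: Optional[str]   # update-rate hint, if given
    priority: Optional[float]   # relative importance, if given


def parse_sitemap(xml_text: str) -> List[SitemapEntry]:
    """Extract location and metadata for each <url> element."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        if not loc:
            continue  # skip entries without a location
        priority = url.findtext("sm:priority", namespaces=NS)
        entries.append(SitemapEntry(
            loc=loc,
            lastmod=url.findtext("sm:lastmod", namespaces=NS),
            changefreq=url.findtext("sm:changefreq", namespaces=NS),
            priority=float(priority) if priority else None,
        ))
    return entries


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2005-06-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>"""

entries = parse_sitemap(SAMPLE)
```

Entries that omit optional metadata are kept with `None` fields, so a scheduler can fall back to its own defaults for them.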

84 Citations

31 Claims

1. A method of scheduling documents for crawling, performed on a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors, the method comprising:
- receiving from a website a notification that includes a sitemap URL corresponding to a sitemap for the website;
  
  in response to the notification:
  
  accessing the sitemap at the sitemap URL; and
  
  retrieving from the sitemap document location information and metadata for a plurality of documents associated with the website, wherein the metadata comprises, for at least a plurality of respective documents associated with the website, document update rate information indicating update frequencies associated with the respective documents;
  
  scheduling for downloading documents, from among the plurality of documents, based at least in part on the metadata retrieved from the sitemap, wherein the scheduling includes scheduling a respective document for downloading when one of a date or date and time at which the respective document was last downloaded differs from a current date or a current date and time by an amount that is greater than a duration corresponding to the document update rate information for the respective document; and
  
  downloading at least a subset of the documents scheduled for downloading.
- - 2. The method of claim 1, wherein the metadata retrieved from the sitemap comprises, for at least a plurality of respective documents associated with the website, document modification date information indicating when the respective documents were last modified; and
    - wherein the scheduling is performed in accordance with the document modification date information for the respective documents.
  - 3. The method of claim 2, wherein the scheduling comprises deferring scheduling a respective document for downloading when the document modification date information for the respective document corresponds to a download date or a download date and time that is no later than a respective download date or a download date and time at which the respective document was last downloaded.
  - 4. The method of claim 1, the metadata providing information related to at least one of prioritizing documents for crawling by a web crawler and selecting documents for inclusion in a crawl;
    - the scheduling including at least one of prioritizing documents for crawling and selecting documents for crawling in accordance with the metadata.
  - 5. The method of claim 1, wherein scheduling documents for downloading comprises generating a list of document identifiers that identify the scheduled documents.
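
The scheduling rule of claim 1, together with the deferral of claim 3 and the identifier list of claim 5, can be sketched as follows. This is a hedged illustration, not the patented implementation: the label-to-duration table `UPDATE_INTERVALS`, the `Doc` record, and the choice to skip documents with an unknown update rate are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical mapping from an update-rate label to a duration.
UPDATE_INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


@dataclass
class Doc:
    url: str
    update_rate: str                          # update-rate label from the sitemap
    last_downloaded: datetime                 # when the crawler last fetched it
    last_modified: Optional[datetime] = None  # modification date, if provided


def is_due(doc: Doc, now: datetime) -> bool:
    # Claim 3: defer when the document has not been modified since the
    # last download.
    if doc.last_modified is not None and doc.last_modified <= doc.last_downloaded:
        return False
    interval = UPDATE_INTERVALS.get(doc.update_rate)
    if interval is None:
        return False  # assumption: unknown rates are left to other policies
    # Claim 1: schedule when the time since the last download exceeds the
    # duration corresponding to the update rate.
    return now - doc.last_downloaded > interval


def schedule(docs: List[Doc], now: datetime) -> List[str]:
    # Claim 5: the schedule is a list of document identifiers (URLs here).
    return [d.url for d in docs if is_due(d, now)]


now = datetime(2005, 6, 30, 12, 0)
docs = [
    Doc("https://example.com/a", "daily", datetime(2005, 6, 28)),
    Doc("https://example.com/b", "weekly", datetime(2005, 6, 28)),
    Doc("https://example.com/c", "daily", datetime(2005, 6, 28),
        last_modified=datetime(2005, 6, 20)),
]
scheduled = schedule(docs, now)
```

Here only the first document is scheduled: the second is still inside its weekly interval, and the third is deferred because its modification date precedes the last download.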

6. A method of scheduling documents for crawling, performed on a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors, the method comprising:
- receiving from a website a notification that includes a sitemap URL corresponding to a sitemap for the website;
  
  in response to the notification:
  
  accessing the sitemap at the sitemap URL; and
  
  retrieving from the sitemap document location information and metadata for a plurality of documents associated with the website, wherein the metadata retrieved from the sitemap includes, for at least a plurality of respective documents associated with the website, document importance information indicating relative importance values associated with the respective documents;
  
  scheduling for downloading documents, from among the plurality of documents, based at least in part on the metadata retrieved from the sitemap, wherein the scheduling is performed in accordance with a score assigned to each document deemed eligible for downloading, and wherein the score assigned to a respective document is adjusted by a boost factor corresponding to the relative importance value indicated by the document importance information for the respective document; and
  
  downloading at least a subset of the documents scheduled for downloading.
- - 7. The method of claim 6, wherein the metadata retrieved from the sitemap comprises, for at least a plurality of respective documents associated with the website, document modification date information indicating when the respective documents were last modified; and
    - wherein the scheduling is performed in accordance with the document modification date information for the respective documents.
  - 8. The method of claim 7, wherein the scheduling comprises deferring scheduling a respective document for downloading when the document modification date information for the respective document corresponds to a download date or a download date and time that is no later than a respective download date or a download date and time at which the respective document was last downloaded.
  - 9. The method of claim 6, the metadata providing information related to at least one of prioritizing documents for crawling by a web crawler and selecting documents for inclusion in a crawl;
    - the scheduling including at least one of prioritizing documents for crawling and selecting documents for crawling in accordance with the metadata.
  - 10. The method of claim 6, wherein scheduling documents for downloading comprises generating a list of document identifiers that identify the scheduled documents.
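
The score adjustment of claim 6 can be illustrated with a short sketch. The claim does not say how the boost factor is derived from the importance value, so the multiplicative form, the linear mapping in `boost_factor`, the assumed importance range of 0.0 to 1.0, and the default of 0.5 for documents without importance information are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def boost_factor(importance: float) -> float:
    """Hypothetical mapping: importance in [0, 1] becomes a factor in [0.5, 1.5]."""
    clamped = min(max(importance, 0.0), 1.0)
    return 0.5 + clamped


def rank(base_scores: Dict[str, float],
         importance: Dict[str, float],
         default_importance: float = 0.5) -> List[Tuple[str, float]]:
    """Adjust each eligible document's score by its boost factor and sort
    from highest to lowest adjusted score."""
    boosted = {
        url: score * boost_factor(importance.get(url, default_importance))
        for url, score in base_scores.items()
    }
    return sorted(boosted.items(), key=lambda kv: kv[1], reverse=True)


base_scores = {"/a": 10.0, "/b": 8.0, "/c": 9.0}
importance = {"/a": 0.0, "/b": 1.0}   # "/c" has no importance information
ranked = rank(base_scores, importance)
```

With these inputs the document marked most important overtakes one with a higher base score, which is the effect the boost factor is meant to have on crawl order.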

11. A system for scheduling documents for crawling, comprising:
- one or more processors; and
  
  memory storing one or more modules;
  
  the one or more modules including instructions to:
  
  receive from a website a notification that includes a sitemap URL corresponding to a sitemap for the website;
  
  in response to the notification:
  
  access the sitemap at the sitemap URL; and
  
  retrieve from the sitemap document location information and metadata for a plurality of documents associated with the website, wherein the metadata comprises, for at least a plurality of respective documents associated with the website, document update rate information indicating update frequencies associated with the respective documents;
  
  schedule for downloading documents, from among the plurality of documents, based at least in part on the metadata retrieved from the sitemap, wherein the scheduling includes scheduling a respective document for downloading when one of a date or date and time at which the respective document was last downloaded differs from a current date or a current date and time by an amount that is greater than a duration corresponding to the document update rate information for the respective document; and
  
  download at least a subset of the documents scheduled for downloading.
- - 12. The system of claim 11, wherein the metadata retrieved from the sitemap comprises, for at least a plurality of respective documents associated with the website, document modification date information indicating when the respective documents were last modified; and
    - wherein the scheduling is performed in accordance with the document modification date information for the respective documents.
  - 13. The system of claim 12, wherein the scheduling comprises deferring scheduling a respective document for downloading when the document modification date information for the respective document corresponds to a download date or a download date and time that is no later than a respective download date or a download date and time at which the respective document was last downloaded.
  - 14. The system of claim 11, the metadata providing information related to at least one of prioritizing documents for crawling by a web crawler and selecting documents for inclusion in a crawl;
    - the scheduling including at least one of prioritizing documents for crawling and selecting documents for crawling in accordance with the metadata.
  - 15. The system of claim 11, wherein the instructions for scheduling documents for downloading comprise instructions for generating a list of document identifiers that identify the scheduled documents.

16. A system for scheduling documents for crawling, comprising:
- one or more processors; and
  
  memory storing one or more modules;
  
  the one or more modules including instructions to:
  
  receive from a website a notification that includes a sitemap URL corresponding to a sitemap for the website;
  
  in response to the notification:
  
  access the sitemap at the sitemap URL; and
  
  retrieve from the sitemap document location information and metadata for a plurality of documents associated with the website, wherein the metadata retrieved from the sitemap includes, for at least a plurality of respective documents associated with the website, document importance information indicating relative importance values associated with the respective documents;
  
  schedule for downloading documents, from among the plurality of documents, based at least in part on the metadata retrieved from the sitemap, wherein the scheduling is performed in accordance with a score assigned to each document deemed eligible for downloading, and wherein the score assigned to a respective document is adjusted by a boost factor corresponding to the relative importance value indicated by the document importance information for the respective document; and
  
  download at least a subset of the documents scheduled for downloading.
- - 17. The system of claim 16, wherein the metadata retrieved from the sitemap comprises, for at least a plurality of respective documents associated with the website, document modification date information indicating when the respective documents were last modified; and
    - wherein the scheduling is performed in accordance with the document modification date information for the respective documents.
  - 18. The system of claim 17, wherein the scheduling comprises deferring scheduling a respective document for downloading when the document modification date information for the respective document corresponds to a download date or a download date and time that is no later than a respective download date or a download date and time at which the respective document was last downloaded.
  - 19. The system of claim 16, the metadata providing information related to at least one of prioritizing documents for crawling by a web crawler and selecting documents for inclusion in a crawl;
    - the scheduling including at least one of prioritizing documents for crawling and selecting documents for crawling in accordance with the metadata.
  - 20. The system of claim 16, wherein scheduling documents for downloading comprises generating a list of document identifiers that identify the scheduled documents.

21. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising instructions for:
- receiving from a website a notification that includes a sitemap URL corresponding to a sitemap for the website;
  
  in response to the notification:
  
  accessing the sitemap at the sitemap URL; and
  
  retrieving from the sitemap document location information and metadata for a plurality of documents associated with the website, wherein the metadata comprises, for at least a plurality of respective documents associated with the website, document update rate information indicating update frequencies associated with the respective documents;
  
  scheduling for downloading documents, from among the plurality of documents, based at least in part on the metadata retrieved from the sitemap, wherein the scheduling includes scheduling a respective document for downloading when one of a date or date and time at which the respective document was last downloaded differs from a current date or a current date and time by an amount that is greater than a duration corresponding to the document update rate information for the respective document; and
  
  downloading at least a subset of the documents scheduled for downloading.
- - 22. The computer program product of claim 21, wherein the metadata retrieved from the sitemap comprises, for at least a plurality of respective documents associated with the website, document modification date information indicating when the respective documents were last modified; and
    - wherein the scheduling is performed in accordance with the document modification date information for the respective documents.
  - 23. The computer program product of claim 22, wherein the scheduling comprises deferring scheduling a respective document for downloading when the document modification date information for the respective document corresponds to a download date or a download date and time that is no later than a respective download date or a download date and time at which the respective document was last downloaded.
  - 24. The computer program product of claim 21, the metadata providing information related to at least one of prioritizing documents for crawling by a web crawler and selecting documents for inclusion in a crawl;
    - the scheduling including at least one of prioritizing documents for crawling and selecting documents for crawling in accordance with the metadata.
  - 25. The computer program product of claim 21, wherein the instructions for scheduling documents for downloading comprise instructions for generating a list of document identifiers that identify the scheduled documents.

26. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising instructions for:
- receiving from a website a notification that includes a sitemap URL corresponding to a sitemap for the website;
  
  in response to the notification:
  
  accessing the sitemap at the sitemap URL; and
  
  retrieving from the sitemap document location information and metadata for a plurality of documents associated with the website, wherein the metadata retrieved from the sitemap includes, for at least a plurality of respective documents associated with the website, document importance information indicating relative importance values associated with the respective documents;
  
  scheduling for downloading documents, from among the plurality of documents, based at least in part on the metadata retrieved from the sitemap, wherein the scheduling is performed in accordance with a score assigned to each document deemed eligible for downloading, and wherein the score assigned to a respective document is adjusted by a boost factor corresponding to the relative importance value indicated by the document importance information for the respective document; and
  
  downloading at least a subset of the documents scheduled for downloading.
- - 27. The computer program product of claim 26, wherein the metadata retrieved from the sitemap comprises, for at least a plurality of respective documents associated with the website, document modification date information indicating when the respective documents were last modified; and
    - wherein the scheduling is performed in accordance with the document modification date information for the respective documents.
  - 28. The computer program product of claim 27, wherein the scheduling comprises deferring scheduling a respective document for downloading when the document modification date information for the respective document corresponds to a download date or a download date and time that is no later than a respective download date or a download date and time at which the respective document was last downloaded.
  - 29. The computer program product of claim 26, the metadata providing information related to at least one of prioritizing documents for crawling by a web crawler and selecting documents for inclusion in a crawl;
    - the scheduling including at least one of prioritizing documents for crawling and selecting documents for crawling in accordance with the metadata.
  - 30. The computer program product of claim 26, wherein scheduling documents for downloading comprises generating a list of document identifiers that identify the scheduled documents.

31. A system for scheduling documents for crawling, comprising:
- one or more processors; and
  
  memory storing one or more programs to be executed by the one or more processors;
  
  the system including:
  
  means for receiving from a website a notification that includes a sitemap URL corresponding to a sitemap for the website;
  
  means for, in response to the notification:
  
  accessing the sitemap at the sitemap URL; and
  
  retrieving from the sitemap document location information and metadata for a plurality of documents associated with the website, wherein the metadata comprises, for at least a plurality of respective documents associated with the website, document update rate information indicating update frequencies associated with the respective documents;
  
  means for scheduling for downloading documents, from among the plurality of documents, based at least in part on the metadata retrieved from the sitemap, wherein the scheduling includes scheduling a respective document for downloading when one of a date or date and time at which the respective document was last downloaded differs from a current date or a current date and time by an amount that is greater than a duration corresponding to the document update rate information for the respective document; and
  
  means for downloading at least a subset of the documents scheduled for downloading.
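
The first step shared by all of the independent claims, receiving a notification that includes a sitemap URL, can be sketched as follows. The claims do not define the notification's transport or format; this sketch assumes, purely for illustration, an HTTP request whose query string carries the sitemap URL in a `sitemap` parameter. The `/ping` path, the parameter name, and the function name are hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def sitemap_url_from_notification(request_target: str) -> Optional[str]:
    """Return the sitemap URL carried by a notification request, or None
    if the parameter is missing or is not an absolute http(s) URL."""
    query = parse_qs(urlsplit(request_target).query)
    values = query.get("sitemap")  # hypothetical parameter name
    if not values:
        return None
    candidate = values[0]
    parts = urlsplit(candidate)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    return candidate


url = sitemap_url_from_notification(
    "/ping?sitemap=https%3A%2F%2Fexample.com%2Fsitemap.xml"
)
```

Validating the scheme and host before fetching is a design choice of this sketch, so that a malformed notification does not trigger a sitemap access.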

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Shivakumar, Narayanan; Ibel, Maximilian; Brawer, Sascha B.; Keller, Ralph Michael
Primary Examiner(s): Cottingham; John R.
Assistant Examiner(s): Liu; Hexing
Application Number: US11/172,764
Time in Patent Office: 1,860 Days
Field of Search: None
US Class Current: 707/709
CPC Class Codes: G06F 16/951 Indexing; Web crawling tech...