Minimizing visibility of stale content in web searching including revising web crawl intervals of documents

  • US 7,987,172 B1
  • Filed: 08/30/2004
  • Issued: 07/26/2011
  • Est. Priority Date: 08/30/2004
  • Status: Active Grant
First Claim

1. A method for scheduling documents to be crawled by a search engine in an appropriate order to reduce visibility of stale content in web searching, comprising:

  • on a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors;

    associating with each of a plurality of documents a respective initial web crawl interval, wherein the initial web crawl interval is determined based at least in part on a document's importance metric;

    retrieving each of the plurality of documents from a respective host at one or more times associated with each document's respective initial web crawl interval;

    associating a revised web crawl interval with a respective document of the plurality of documents based on the document's respective initial web crawl interval and any changes to content of the document;

    partitioning the plurality of documents into multiple tiers according to their respective web crawl intervals, each tier having a distinct associated range of web crawl intervals, including storing data for each tier identifying documents in the plurality of documents assigned to that tier in accordance with the documents' respective web crawl intervals;

    for each tier of the multiple tiers, determining a web crawl order for the documents assigned to the tier in accordance with their respective revised web crawl interval; and

    moving a respective document between tiers of the multiple tiers when the respective revised web crawl interval of the respective document is associated with a different tier of the multiple tiers than a previous web crawl interval of the respective document.
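
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not specify formulas, so the tier boundaries (`TIER_RANGES`), the importance-to-interval rule (`initial_interval`), the halve-on-change / double-otherwise rule (`revise_interval`), the use of a content hash to detect changes, and ordering by next due time are all assumptions chosen for illustration, as are the names `Doc` and `Scheduler`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

DAY = 86_400.0  # seconds
MIN_INTERVAL, MAX_INTERVAL = DAY / 24, 30 * DAY  # assumed clamp bounds
# Hypothetical tiers: each covers a distinct [low, high) range of intervals.
TIER_RANGES = [(0.0, DAY), (DAY, 7 * DAY), (7 * DAY, float("inf"))]


@dataclass
class Doc:
    url: str
    interval: float                     # current web crawl interval, seconds
    last_crawl: float = 0.0
    content_hash: Optional[str] = None  # None until the first crawl


def initial_interval(importance: float) -> float:
    """Assumed rule: importance in [0, 1]; higher importance, shorter interval."""
    return max(MIN_INTERVAL, 7 * DAY * (1.0 - importance))


def revise_interval(interval: float, changed: bool) -> float:
    """Assumed rule: halve on a content change, double otherwise, then clamp."""
    revised = interval / 2 if changed else interval * 2
    return min(MAX_INTERVAL, max(MIN_INTERVAL, revised))


def tier_for(interval: float) -> int:
    """Return the index of the tier whose range contains the interval."""
    for i, (low, high) in enumerate(TIER_RANGES):
        if low <= interval < high:
            return i
    raise ValueError(f"no tier covers interval {interval}")


class Scheduler:
    def __init__(self) -> None:
        # Per-tier data identifying the documents assigned to that tier.
        self.tiers: List[Dict[str, Doc]] = [{} for _ in TIER_RANGES]

    def add(self, url: str, importance: float) -> Doc:
        doc = Doc(url, initial_interval(importance))
        self.tiers[tier_for(doc.interval)][url] = doc
        return doc

    def tier_of(self, url: str) -> int:
        for i, tier in enumerate(self.tiers):
            if url in tier:
                return i
        raise KeyError(url)

    def crawl_order(self, tier: int) -> List[str]:
        """Order a tier's documents by next due time (an assumed ordering)."""
        docs = sorted(self.tiers[tier].values(),
                      key=lambda d: d.last_crawl + d.interval)
        return [d.url for d in docs]

    def record_crawl(self, url: str, content_hash: str, now: float) -> None:
        """Revise the interval after a crawl and move tiers if needed."""
        old_tier = self.tier_of(url)
        doc = self.tiers[old_tier][url]
        if doc.content_hash is not None:  # the first crawl only sets a baseline
            changed = content_hash != doc.content_hash
            doc.interval = revise_interval(doc.interval, changed)
        doc.content_hash, doc.last_crawl = content_hash, now
        new_tier = tier_for(doc.interval)
        if new_tier != old_tier:
            del self.tiers[old_tier][url]
            self.tiers[new_tier][url] = doc
```

In this sketch a document that keeps changing has its interval halved on each crawl and eventually drops into a tier with shorter intervals, while a document that stays the same drifts toward the longer-interval tiers. A production crawler would more likely keep a priority queue per tier rather than re-sorting on each call.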
