High refresh-rate retrieval of freshly published content using distributed crawling

  • US 7,299,219 B2
  • Filed: 05/08/2001
  • Issued: 11/20/2007
  • Est. Priority Date: 05/08/2001
  • Status: Expired due to Term
First Claim

1. A method for data delivery, comprising:

    (A) distributing a plurality of crawlers directed by at least one coordinating Link Server through low bandwidth commands, said plurality of crawlers being deployed on a plurality of contributor computers throughout a network;

    (B) sending at least two links to one of said plurality of crawlers instructed by the at least one coordinating Link Server to check the pages corresponding to the at least two links, wherein each link includes URL name, last time checked, and a last crawl date page digest;

    (C) connecting the instructed crawler to the first link of the at least two links and commanding the instructed crawler to read a header of the to-be-checked page corresponding to the first link, and

    (1) commanding the instructed crawler that if the to-be-checked page header returns a last modified date, the crawler check the last modified date against the last time checked,

    (i) if the to-be-checked page is found to be unchanged, the instructed crawler bypasses and does not process the to-be-checked page and proceeds to the second link;

    (ii) if the to-be-checked page is found to have changed since the last checked time, the instructed crawler sends the to-be-checked page to the at least one coordinating Link Server;

    (2) commanding the instructed crawler that if no last modification date is found in the to-be-checked page header, the instructed crawler downloads the to-be-checked page, and then runs the downloaded page through a function at the instructed crawler to obtain a new page digest for matching against the last crawl page digest,

    (i) if the new page digest is matched with the last crawl page digest, the crawler proceeds to the second link,

    (ii) if no match is found, the instructed crawler transmits the new page digest to the at least one coordinating Link Server with a crawl time for updating;

    (D) extracting content of the sent and downloaded to-be-checked page for inclusion in a stream of events, said content included in said stream of events being only new or changed pages; and

    (E) delivering the extracted content from the at least one coordinating Link Server to a plurality of web-mining applications via said stream of events, said plurality of web-mining applications being event driven by said stream.
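
The per-link decision in step (C) can be illustrated with a minimal Python sketch. This is not the patented implementation. The names `Link`, `Report`, `page_digest`, and `check_link` are hypothetical, the header read and the page download are passed in as plain callables so the logic can be followed without a network, and SHA-256 stands in for the unspecified "function" that produces the page digest.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional, Union


@dataclass
class Link:
    """One link as sent in step (B): URL, last time checked, last digest."""
    url: str
    last_checked: datetime
    last_digest: str


@dataclass
class Report:
    """What the crawler sends back to the Link Server (hypothetical shape)."""
    url: str
    kind: str                   # "page" for (C)(1)(ii), "digest" for (C)(2)(ii)
    payload: Union[bytes, str]  # page body, or new digest
    crawl_time: datetime


def page_digest(body: bytes) -> str:
    # Assumption: the claim only says "a function"; SHA-256 is used here.
    return hashlib.sha256(body).hexdigest()


def check_link(
    link: Link,
    read_last_modified: Callable[[str], Optional[datetime]],
    download: Callable[[str], bytes],
    now: datetime,
) -> Optional[Report]:
    """Return a Report if the page changed, or None to move to the next link."""
    last_modified = read_last_modified(link.url)

    if last_modified is not None:
        # (C)(1): the header carries a last-modified date.
        if last_modified <= link.last_checked:
            return None  # (i) unchanged: bypass the page
        # (ii) changed: send the page itself to the Link Server
        return Report(link.url, "page", download(link.url), now)

    # (C)(2): no last-modified date, so download and compare digests.
    new_digest = page_digest(download(link.url))
    if new_digest == link.last_digest:
        return None  # (i) digest matches: proceed to the next link
    # (ii) no match: send the new digest together with the crawl time
    return Report(link.url, "digest", new_digest, now)
```

A crawler loop under these assumptions would call `check_link` once per received link and forward every non-`None` result to the coordinating Link Server, which is what keeps the resulting event stream limited to new or changed pages.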
