High refresh-rate retrieval of freshly published content using distributed crawling

US 7,299,219 B2
Filed: 05/08/2001
Issued: 11/20/2007
Est. Priority Date: 05/08/2001
Status: Expired due to Term

Abstract

A system for maximal gathering of fresh information added to a network such as the Internet and for processing the gathered fresh information. A link server (2) sends a batch of links to check (3) to a crawler (1B). Crawler (1B) then executes its crawling assignment by filtering the encountered content and extracting only that which is new or changed (4). Crawler (1B) then returns this content (4) to at least one data center and any interested web mining application (5). By using the crawlers (1A-E) to filter the data and only return, or notify regarding, the fresh content, less bandwidth is needed to get the information to the web mining application (5).
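
The batch of links that the link server sends can be pictured as a small serialized message. The claims describe each link as carrying a URL name, a last time checked, and a last-crawl page digest; the sketch below assumes a JSON wire format and hypothetical names (`LinkRecord`, `encode_batch`, `decode_batch`), none of which come from the patent.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class LinkRecord:
    """One link to check. Field names are illustrative assumptions."""
    url: str           # URL name
    last_checked: str  # last time checked, as an ISO 8601 string
    last_digest: str   # page digest recorded at the last crawl


def encode_batch(links: List[LinkRecord]) -> bytes:
    """Serialize a batch compactly so the command stays small."""
    payload = [asdict(link) for link in links]
    return json.dumps(payload, separators=(",", ":")).encode("utf-8")


def decode_batch(payload: bytes) -> List[LinkRecord]:
    """Rebuild the link records on the crawler side."""
    return [LinkRecord(**item) for item in json.loads(payload)]


batch = [
    LinkRecord("http://example.com/a", "2001-05-08T00:00:00+00:00", "abc123"),
    LinkRecord("http://example.com/b", "2001-05-08T00:05:00+00:00", "def456"),
]
wire = encode_batch(batch)
received = decode_batch(wire)
```

The message carries only metadata, never page bodies, which is one plausible reading of how the commands stay low in bandwidth.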

19 Claims

1. A method for data delivery, comprising:
- (A) distributing a plurality of crawlers directed by at least one coordinating Link Server through low bandwidth commands, said plurality of crawlers being deployed on a plurality of contributor computers throughout a network;
  
  (B) sending at least two links to one of said plurality of crawlers instructed by the at least one coordinating Link Server to check the pages corresponding to the at least two links, wherein each link includes URL name, last time checked, and a last crawl date page digest;
  
  (C) connecting the instructed crawler to the first link of the at least two links and commanding the instructed crawler to read a header of the to-be-checked page corresponding to the first link, and
  - (1) commanding the instructed crawler that if the to-be-checked page header returns a last modified date, the crawler check the last modified date against the last time checked,
    - (i) if the to-be-checked page is found to be unchanged, the instructed crawler bypasses and does not process the to-be-checked page and proceeds to the second link;
    - (ii) if the to-be-checked page is found to have changed since the last checked time, the instructed crawler sends the to-be-checked page to the at least one coordinating Link Server;
  - (2) commanding the instructed crawler that if no last modification date is found in the to-be-checked page header, the instructed crawler downloads the to-be-checked page, and then runs the downloaded page through a function at the instructed crawler to obtain a new page digest for matching against the last crawl page digest,
    - (i) if the new page digest is matched with the last crawl page digest, the crawler proceeds to the second link,
    - (ii) if no match is found, the instructed crawler transmits the new page digest to the at least one coordinating Link Server with a crawl time for updating;
  
  (D) extracting content of the sent and downloaded to-be-checked page for inclusion in a stream of events, said content included in said stream of events being only new or changed pages; and
  
  (E) delivering the extracted content from the at least one coordinating Link Server to a plurality of web-mining applications via said stream of events, said plurality of web-mining applications being event driven by said stream.
- 2. The method of claim 1, wherein after the downloaded to-be-checked page is determined to be new or changed, the crawler optionally extracts links on the downloaded to-be-checked page and reports the extracted links to the at least one coordinating Link Server.
- 3. The method of claim 2, including identifying if extracted links are valid by commanding the instructed crawlers to attempt to connect to the extracted links from the downloaded to-be-checked page.
- 4. The method of claim 2, including commanding the instructed crawler, once connected, to also filter the links and only extract and return HTML/TEXT links.
- 5. The method of claim 2, including information processing by the crawlers on the downloaded pages.
- 6. The method of claim 5, wherein the information processing is selected from the group consisting of:
  - stripping out HTML tags and using information retrieval and/or natural language processing techniques to characterize the downloaded to be checked page.
- 7. The method of claim 1, including updating Link Server records on the at least two links and scheduling them for later crawling or re-crawling.
- 8. The method of claim 7, including management by the at least one coordinating Link Server of link assignments for crawling.
- 9. The method of claim 8, wherein the management by the at least one coordinating Link Server comprises assigning network-wise close links to a crawler and/or arranging for relatively more frequent crawling of links from domains with track records of frequent change.
- 10. The method of claim 1, wherein the web mining application upon receiving the extracted content conducts at least one of the following:
  - (a) storage of the new or changed pages,
  - (b) storage of only delta changes of a page,
  - (c) data mining;
  - (d) data processing;
  - (e) application of data to at least one search engine,
  - (f) intelligent caching.
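
The header-then-digest check in step (C) of claim 1 can be sketched as a single decision function. This is a minimal illustration, not the patented implementation: SHA-256 stands in for the unspecified digest "function", the header lookup assumes a plain dictionary keyed by `Last-Modified`, and the names `check_link`, `CheckResult`, `fetch_headers`, and `fetch_body` are hypothetical. The two fetchers are passed in so the logic can be exercised without network access.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Callable, Mapping, Optional


@dataclass
class CheckResult:
    changed: bool           # True if the page is new or changed
    digest: Optional[str]   # new digest, computed only on the no-header path
    body: Optional[bytes]   # page content to pass on when changed


def check_link(
    url: str,
    last_checked: datetime,  # must be timezone-aware
    last_digest: str,
    fetch_headers: Callable[[str], Mapping[str, str]],
    fetch_body: Callable[[str], bytes],
) -> CheckResult:
    headers = fetch_headers(url)
    last_modified = headers.get("Last-Modified")

    if last_modified is not None:
        # Path (1): compare the last modified date with the last time checked.
        modified_at = parsedate_to_datetime(last_modified)
        if modified_at <= last_checked:
            # (1)(i): unchanged, so skip the page without downloading it.
            return CheckResult(False, None, None)
        # (1)(ii): changed, so fetch the page to send it on.
        return CheckResult(True, None, fetch_body(url))

    # Path (2): no last modified date, so download and compare digests.
    body = fetch_body(url)
    digest = hashlib.sha256(body).hexdigest()  # assumption: SHA-256
    if digest == last_digest:
        # (2)(i): digest matches, nothing to report.
        return CheckResult(False, digest, None)
    # (2)(ii): no match, so report the new digest (and the content).
    return CheckResult(True, digest, body)
```

A caller would loop over the batch of links, call `check_link` for each, and forward only the results with `changed` set, together with a crawl time.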

11. A computer storage medium including instructions for delivering an event stream of new or changed web pages on a network, said instructions being executed by one or more processors to perform the steps of:
- (A) distributing a plurality of crawlers directed by at least one coordinating Link Server through low bandwidth commands, said plurality of crawlers being deployed on a plurality of contributor computers throughout a network;
  
  (B) sending at least two links to one of said plurality of crawlers instructed by the at least one coordinating Link Server to check the pages corresponding to the at least two links, wherein each link includes URL name, last time checked, and a last crawl date page digest;
  
  (C) connecting the instructed crawler to the first link of the at least two links and commanding the instructed crawler to read a header of the to-be-checked page corresponding to the first link, and
  - (1) commanding the instructed crawler that if the to-be-checked page header returns a last modified date, the crawler check the last modified date against the last time checked,
    - (i) if the to-be-checked page is found to be unchanged, the instructed crawler bypasses and does not process the to-be-checked page and proceeds to the second link;
    - (ii) if the to-be-checked page is found to have changed since the last checked time, the instructed crawler sends the to-be-checked page to the at least one coordinating Link Server;
  - (2) commanding the instructed crawler that if no last modification date is found in the to-be-checked page header, the instructed crawler downloads the to-be-checked page, and then runs the downloaded page through a function at the instructed crawler to obtain a new page digest for matching against the last crawl page digest,
    - (i) if the new page digest is matched with the last crawl page digest, the crawler proceeds to the second link,
    - (ii) if no match is found, the instructed crawler transmits the new page digest to the at least one coordinating Link Server with a crawl time for updating;
  
  (D) extracting content of the sent and downloaded to-be-checked page for inclusion in said stream of events, said content included in said stream of events being only new or changed pages; and
  
  (E) delivering the extracted content from the at least one coordinating Link Server to a plurality of web-mining applications via said stream of events, said plurality of web-mining applications being event driven by said stream.
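
Steps (D) and (E) describe a stream of events that carries only new or changed pages to event-driven web-mining applications. One way to picture this is a simple publish/subscribe loop. The sketch below is an in-process illustration under that assumption; `PageEvent`, `EventStream`, and `deliver` are hypothetical names, and a real deployment would presumably deliver events over the network.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class PageEvent:
    url: str
    crawl_time: str  # when the crawler saw the page
    content: str     # extracted content of the new or changed page


class EventStream:
    """Fans each event out to every subscribed web-mining application."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[PageEvent], None]] = []

    def subscribe(self, handler: Callable[[PageEvent], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: PageEvent) -> None:
        for handler in self._subscribers:
            handler(event)


def deliver(stream: EventStream,
            checked: Iterable[Tuple[PageEvent, bool]]) -> int:
    """Publish only pages flagged as new or changed; return the count."""
    published = 0
    for event, changed in checked:
        if changed:
            stream.publish(event)
            published += 1
    return published


# Two illustrative consumers: one stores pages, one records URLs to index.
stored: List[PageEvent] = []
indexed_urls: List[str] = []

stream = EventStream()
stream.subscribe(stored.append)
stream.subscribe(lambda e: indexed_urls.append(e.url))

results = [
    (PageEvent("http://example.com/a", "2001-05-08T00:00:00Z", "fresh"), True),
    (PageEvent("http://example.com/b", "2001-05-08T00:01:00Z", "stale"), False),
]
count = deliver(stream, results)
```

Each subscriber reacts only when an event arrives, so unchanged pages cost the applications nothing.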

12. A metacomputer system for making available to a web mining application freshly published content on a computer network, comprising:
- a plurality of participating computers on the computer network, each said participating computer constituting a node of the metacomputer system; and
  
  a distributed crawling system configured for;
  
  (A) distributing a plurality of crawlers directed by at least one coordinating Link Server through low bandwidth commands, said plurality of crawlers being deployed on a plurality of contributor computers throughout a network;
  
  (B) sending at least two links to one of said plurality of crawlers instructed by the at least one coordinating Link Server to check the pages corresponding to the at least two links, wherein each link includes URL name, last time checked, and a last crawl date page digest;
  
  (C) connecting the instructed crawler to the first link of the at least two links and commanding the instructed crawler to read a header of the to-be-checked page corresponding to the first link, and
  - (1) commanding the instructed crawler that if the to-be-checked page header returns a last modified date, the crawler check the last modified date against the last time checked,
    - (i) if the to-be-checked page is found to be unchanged, the instructed crawler bypasses and does not process the to-be-checked page and proceeds to the second link;
    - (ii) if the to-be-checked page is found to have changed since the last checked time, the instructed crawler sends the to-be-checked page to the at least one coordinating Link Server;
  - (2) commanding the instructed crawler that if no last modification date is found in the to-be-checked page header, the instructed crawler downloads the to-be-checked page, and then runs the downloaded page through a function at the instructed crawler to obtain a new page digest for matching against the last crawl page digest,
    - (i) if the new page digest is matched with the last crawl page digest, the crawler proceeds to the second link,
    - (ii) if no match is found, the instructed crawler transmits the new page digest to the at least one coordinating Link Server with a crawl time for updating;
  
  (D) extracting content of the sent and downloaded to-be-checked page for inclusion in a stream of events, said content included in said stream of events being only new or changed pages; and
  
  (E) delivering the extracted content from the at least one coordinating Link Server to a plurality of web-mining applications via said stream of events, said plurality of web-mining applications being event driven by said stream.
- 13. The metacomputer system of claim 12, wherein the plurality of participating computers on the computer network run a software application which constitutes a contributor environment (CE), the computer network further comprising an application server which deploys the plurality of web crawlers and coordinates the nodes of the metacomputer system by allocating jobs to the CE.
- 14. The metacomputer system of claim 12, wherein the computer network is the Internet or an intranet.
- 15. The metacomputer system of claim 14, wherein the plurality of web mining applications running on the network, and the extracted content encountered on the network being transmitted by the CE as a stream of events available to the plurality of web mining applications.
- 16. The metacomputer system of claim 12, wherein the at least one coordinating link server receives content from the web crawlers.
- 17. The metacomputer system of claim 16, wherein the instructed crawler is commanded by an allocation server (AS) to return only fresh encountered content to the at least one coordinating link server.
- 18. The metacomputer system of claim 12, wherein data of the extracted content is compressed before being transmitted.
- 19. The metacomputer system of claim 12, wherein the extracted content is rated before transmitting.
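
Claim 18 has the extracted content compressed before it is transmitted. The claim does not name a compression scheme; the sketch below assumes zlib (DEFLATE) from the Python standard library, and `pack_content` and `unpack_content` are hypothetical helper names.

```python
import zlib


def pack_content(text: str, level: int = 6) -> bytes:
    """Compress extracted page content before sending it on."""
    return zlib.compress(text.encode("utf-8"), level)


def unpack_content(payload: bytes) -> str:
    """Restore the original text on the receiving side."""
    return zlib.decompress(payload).decode("utf-8")


page_text = "<p>freshly published content</p>\n" * 200
payload = pack_content(page_text)
restored = unpack_content(payload)
```

Markup-heavy pages tend to be repetitive, so a general-purpose compressor usually shrinks them well, which fits the system's stated goal of using less bandwidth.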

Current Assignee: Johns Hopkins University
Original Assignee: Johns Hopkins University
Inventors: Amir, Yair; Goodrich, Michael Truman; Green, Jacob William; Schultz, John Lane
Primary Examiner(s): Pham; Hung Q
Application Number: US10/257,255
Publication Number: US 20040044962A1
Time in Patent Office: 2,387 Days
Field of Search: 707/2, 707/3, 707/7, 707/10, 707/1, 707/200
US Class Current: 1/1
CPC Class Codes:
G06F 16/951   Indexing; Web crawling tech...
G06F 2216/03   Data mining
Y10S 707/99931   Database or file accessing
Y10S 707/99932   Access augmentation or opti...
