High refresh-rate retrieval of freshly published content using distributed crawling
First Claim
1. A method for data delivery, comprising:
(A) distributing a plurality of crawlers directed by at least one coordinating Link Server through low bandwidth commands, said plurality of crawlers being deployed on a plurality of contributor computers throughout a network;
(B) sending at least two links to one of said plurality of crawlers instructed by the at least one coordinating Link Server to check the pages corresponding to the at least two links, wherein each link includes URL name, last time checked, and a last crawl date page digest;
(C) connecting the instructed crawler to the first link of the at least two links and commanding the instructed crawler to read a header of the to-be-checked page corresponding to the first link, and
  (1) commanding the instructed crawler that if the to-be-checked page header returns a last modified date, the crawler check the last modified date against the last time checked,
    (i) if the to-be-checked page is found to be unchanged, the instructed crawler bypasses and does not process the to-be-checked page and proceeds to the second link;
    (ii) if the to-be-checked page is found to have changed since the last checked time, the instructed crawler sends the to-be-checked page to the at least one coordinating Link Server;
  (2) commanding the instructed crawler that if no last modification date is found in the to-be-checked page header, the instructed crawler downloads the to-be-checked page, and then runs the downloaded page through a function at the instructed crawler to obtain a new page digest for matching against the last crawl page digest,
    (i) if the new page digest is matched with the last crawl page digest, the crawler proceeds to the second link,
    (ii) if no match is found, the instructed crawler transmits the new page digest to the at least one coordinating Link Server with a crawl time for updating;
(D) extracting content of the sent and downloaded to-be-checked page for inclusion in a stream of events, said content included in said stream of events being only new or changed pages; and
(E) delivering the extracted content from the at least one coordinating Link Server to a plurality of web-mining applications via said stream of events, said plurality of web-mining applications being event driven by said stream.
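Step (C) of the claim amounts to a per-link decision procedure, which can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Link`, `Result`, and `check_link` are hypothetical, the header is assumed to be an HTTP `Last-Modified` value, and SHA-256 stands in for the digest function, which the claim leaves unspecified.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Callable, Mapping, Optional


@dataclass
class Link:
    """Hypothetical link record, mirroring the three fields named in step (B)."""
    url: str
    last_checked: datetime  # last time checked (timezone-aware)
    last_digest: str        # last crawl date page digest


@dataclass
class Result:
    """What the crawler does with a link: 'skip', 'send_page', or 'send_digest'."""
    action: str
    body: Optional[bytes] = None
    digest: Optional[str] = None
    crawl_time: Optional[datetime] = None


def check_link(link: Link,
               headers: Mapping[str, str],
               download: Callable[[], bytes],
               now: datetime) -> Result:
    # Header names are case-insensitive in HTTP, so normalize before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    last_modified = lowered.get("last-modified")

    if last_modified is not None:
        # Branch (1): compare the last modified date with the last time checked.
        modified_at = parsedate_to_datetime(last_modified)
        if modified_at <= link.last_checked:
            return Result("skip")                      # (1)(i) unchanged
        return Result("send_page", body=download())    # (1)(ii) changed

    # Branch (2): no date in the header, so download and compare digests.
    body = download()
    digest = hashlib.sha256(body).hexdigest()  # assumed digest function
    if digest == link.last_digest:
        return Result("skip")                          # (2)(i) digests match
    # (2)(ii) no match: report the new digest together with the crawl time.
    return Result("send_digest", body=body, digest=digest, crawl_time=now)
```

In this sketch the page is downloaded only when the header indicates a change or carries no date at all, which is where the bandwidth saving described in the abstract would come from.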
Abstract
A system for maximal gathering of fresh information added to a network such as the Internet and for processing the gathered fresh information. A link server (2) sends a batch of links to check (3) to a crawler (1B). Crawler (1B) then executes its crawling assignment by filtering the encountered content and extracting only that which is new or changed (4). Crawler (1B) then returns this content (4) to at least one data center and any interested web mining application (5). By using the crawlers (1A-E) to filter the data and return, or notify regarding, only the fresh content, less bandwidth is needed to get the information to the web mining application (5).
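The link server's side of this exchange (handing out batches of links and recording what the crawlers report back) might look like the Python sketch below. Everything here is an assumption made for illustration: the class and method names (`LinkRecord`, `LinkServer`, `next_batch`, `report_change`) are hypothetical, and the round-robin ordering of batches is a placeholder, since the abstract does not say how links are scheduled.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Iterable, List


@dataclass
class LinkRecord:
    """Hypothetical per-link state kept by the link server."""
    url: str
    last_checked: float  # seconds since the epoch
    last_digest: str


class LinkServer:
    """Toy link server: hands out batches and records reported changes."""

    def __init__(self, records: Iterable[LinkRecord]) -> None:
        self._records: Dict[str, LinkRecord] = {}
        self._queue: Deque[str] = deque()
        for record in records:
            self._records[record.url] = record
            self._queue.append(record.url)

    def next_batch(self, size: int) -> List[LinkRecord]:
        """Return up to `size` links to check, cycling round-robin (assumed policy)."""
        batch: List[LinkRecord] = []
        for _ in range(min(size, len(self._queue))):
            url = self._queue.popleft()
            self._queue.append(url)  # re-queue so the link is checked again later
            batch.append(self._records[url])
        return batch

    def report_change(self, url: str, new_digest: str, crawl_time: float) -> None:
        """Update a link's record when a crawler reports new or changed content."""
        record = self._records[url]
        record.last_digest = new_digest
        record.last_checked = crawl_time
```

Only small records (URL, timestamp, digest) travel from the server to the crawler in this sketch, in keeping with the low bandwidth commands the claims describe.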
80 Citations
19 Claims
1. A method for data delivery, as recited in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A computer storage medium including instructions for delivering an event stream of new or changed web pages on a network, said instructions being executed by one or more processors to perform the steps of:
(A) distributing a plurality of crawlers directed by at least one coordinating Link Server through low bandwidth commands, said plurality of crawlers being deployed on a plurality of contributor computers throughout a network;
(B) sending at least two links to one of said plurality of crawlers instructed by the at least one coordinating Link Server to check the pages corresponding to the at least two links, wherein each link includes URL name, last time checked, and a last crawl date page digest;
(C) connecting the instructed crawler to the first link of the at least two links and commanding the instructed crawler to read a header of the to-be-checked page corresponding to the first link, and
  (1) commanding the instructed crawler that if the to-be-checked page header returns a last modified date, the crawler check the last modified date against the last time checked,
    (i) if the to-be-checked page is found to be unchanged, the instructed crawler bypasses and does not process the to-be-checked page and proceeds to the second link;
    (ii) if the to-be-checked page is found to have changed since the last checked time, the instructed crawler sends the to-be-checked page to the at least one coordinating Link Server;
  (2) commanding the instructed crawler that if no last modification date is found in the to-be-checked page header, the instructed crawler downloads the to-be-checked page, and then runs the downloaded page through a function at the instructed crawler to obtain a new page digest for matching against the last crawl page digest,
    (i) if the new page digest is matched with the last crawl page digest, the crawler proceeds to the second link,
    (ii) if no match is found, the instructed crawler transmits the new page digest to the at least one coordinating Link Server with a crawl time for updating;
(D) extracting content of the sent and downloaded to-be-checked page for inclusion in said stream of events, said content included in said stream of events being only new or changed pages; and
(E) delivering the extracted content from the at least one coordinating Link Server to a plurality of web-mining applications via said stream of events, said plurality of web-mining applications being event driven by said stream.
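Steps (D) and (E) describe an event stream that drives several web-mining applications. One way to picture this is a simple publish/subscribe loop, sketched below in Python. The names `PageEvent`, `EventStream`, `subscribe`, and `publish` are hypothetical, and the in-process callback design is an assumption chosen for brevity; the claim does not specify a transport or an event format.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class PageEvent:
    """Hypothetical event carrying the extracted content of one new or changed page."""
    url: str
    crawl_time: float  # seconds since the epoch
    content: str


class EventStream:
    """Toy event stream: every published event is pushed to every subscriber."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[PageEvent], None]] = []

    def subscribe(self, handler: Callable[[PageEvent], None]) -> None:
        """Register a web-mining application's handler."""
        self._subscribers.append(handler)

    def publish(self, event: PageEvent) -> None:
        """Deliver one event to all registered handlers, in subscription order."""
        for handler in self._subscribers:
            handler(event)


# Usage: two hypothetical applications react to the same event.
stream = EventStream()
indexed: List[str] = []
alerts: List[str] = []
stream.subscribe(lambda e: indexed.append(e.url))
stream.subscribe(lambda e: alerts.append(e.content))
stream.publish(PageEvent("http://example.com/news", 1.0, "fresh story"))
```

Because the applications only run when an event arrives, they never poll for changes themselves; the filtering has already happened at the crawlers.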
12. A metacomputer system for making available to a web mining application freshly published content on a computer network, comprising:
a plurality of participating computers on the computer network, each said participating computer constituting a node of the metacomputer system; and
a distributed crawling system configured for;
(A) distributing a plurality of crawlers directed by at least one coordinating Link Server through low bandwidth commands, said plurality of crawlers being deployed on a plurality of contributor computers throughout a network;
(B) sending at least two links to one of said plurality of crawlers instructed by the at least one coordinating Link Server to check the pages corresponding to the at least two links, wherein each link includes URL name, last time checked, and a last crawl date page digest;
(C) connecting the instructed crawler to the first link of the at least two links and commanding the instructed crawler to read a header of the to-be-checked page corresponding to the first link, and
  (1) commanding the instructed crawler that if the to-be-checked page header returns a last modified date, the crawler check the last modified date against the last time checked,
    (i) if the to-be-checked page is found to be unchanged, the instructed crawler bypasses and does not process the to-be-checked page and proceeds to the second link;
    (ii) if the to-be-checked page is found to have changed since the last checked time, the instructed crawler sends the to-be-checked page to the at least one coordinating Link Server;
  (2) commanding the instructed crawler that if no last modification date is found in the to-be-checked page header, the instructed crawler downloads the to-be-checked page, and then runs the downloaded page through a function at the instructed crawler to obtain a new page digest for matching against the last crawl page digest,
    (i) if the new page digest is matched with the last crawl page digest, the crawler proceeds to the second link,
    (ii) if no match is found, the instructed crawler transmits the new page digest to the at least one coordinating Link Server with a crawl time for updating;
(D) extracting content of the sent and downloaded to-be-checked page for inclusion in a stream of events, said content included in said stream of events being only new or changed pages; and
(E) delivering the extracted content from the at least one coordinating Link Server to a plurality of web-mining applications via said stream of events, said plurality of web-mining applications being event driven by said stream.
Dependent claims: 13, 14, 15, 16, 17, 18, 19
Specification