Relevant search rankings using high refresh-rate distributed crawling
Abstract
A system for maximal gathering of fresh information added to a network such as the Internet and for processing the gathered fresh information. A link server (2) sends a batch of links to check (3) to a crawler (1B). Crawler (1B) then executes its crawling assignment by filtering the encountered content and extracting only that which is new or changed (4). Crawler (1B) then returns this content (4) to at least one data center and any interested web mining application (5). By using the crawlers (1A-E) to filter the data and only return, or notify regarding, the fresh content, less bandwidth is needed to get the information to the web mining application (5).
45 Claims
1. A system for processing fresh information added to a network, comprising:
for a network, identifying fresh information added to the network; and
presenting the fresh information as a stream of events.
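One way to picture "presenting the fresh information as a stream of events" is the minimal Python sketch below. It is an illustration only, not the claimed system: the function name `fresh_events`, the `"added"`/`"changed"` labels, and the idea of comparing two URL-to-digest snapshots are all assumptions introduced here.

```python
from typing import Dict, Iterator, Tuple


def fresh_events(
    previous: Dict[str, str], current: Dict[str, str]
) -> Iterator[Tuple[str, str]]:
    """Yield (kind, url) events for pages that are new or changed.

    Hypothetical helper: `previous` and `current` map a URL to a content
    digest taken at two different crawl times.
    """
    for url, digest in current.items():
        if url not in previous:
            # The URL was not known at the earlier crawl.
            yield ("added", url)
        elif previous[url] != digest:
            # The URL was known, but its content digest differs.
            yield ("changed", url)
        # Unchanged pages produce no event at all.


if __name__ == "__main__":
    old = {"http://example.test/a": "1", "http://example.test/b": "2"}
    new = {"http://example.test/a": "1", "http://example.test/b": "3",
           "http://example.test/c": "4"}
    for event in fresh_events(old, new):
        print(event)
```

Because the function is a generator, a consumer can process each event as it is produced rather than waiting for a full batch.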
6. A method of gathering information freshly available on a network, comprising:
deploying a metacomputer to gather information freshly available on the network, wherein the metacomputer comprises information-gathering crawlers instructed to filter old or unchanged information.
16. A method of processing new information on a network, comprising:
(A) for information encountered on the network that is new relative to a data base of existing content, identifying at least one existing document within a predetermined distance from the newly encountered information; and
(B) identifying an already-established weight of the at least one existing nearby document identified according to step (A).
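Steps (A) and (B) can be sketched as a bounded graph search. The claim does not say how "distance" is measured, so the sketch below assumes it means the number of link hops; the function name `nearby_weights`, the adjacency-dict link graph, and the `weights` table are hypothetical names introduced for illustration.

```python
from collections import deque
from typing import Dict, List


def nearby_weights(
    new_url: str,
    links: Dict[str, List[str]],
    weights: Dict[str, float],
    max_distance: int,
) -> Dict[str, float]:
    """Return {existing_url: weight} for known documents near `new_url`.

    Assumption: distance is counted in link hops, and `weights` holds the
    already-established weight of every document in the existing data base.
    """
    seen = {new_url}
    queue = deque([(new_url, 0)])
    found: Dict[str, float] = {}
    while queue:
        url, dist = queue.popleft()
        # Step (A): an existing document within the allowed distance.
        if url != new_url and url in weights:
            # Step (B): look up its already-established weight.
            found[url] = weights[url]
        if dist == max_distance:
            continue  # do not expand beyond the predetermined distance
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return found


if __name__ == "__main__":
    graph = {"new": ["a"], "a": ["b"], "b": ["c"]}
    known = {"a": 0.5, "b": 0.2, "c": 0.9}
    print(nearby_weights("new", graph, known, max_distance=2))
```

With `max_distance=2` the search reaches `a` and `b` but stops before `c`, which is three hops away.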
20. A high scan rate, decreased bandwidth method for data delivery, comprising:
(A) providing at least one coordinating Link Server to direct a plurality of crawlers through low bandwidth commands;
(B) providing that when a crawler is instructed by the Link Server to check a page link, for the to-be-checked page link the crawler also is told information including URL name, last time checked, and a last crawl date page digest from when the link was last checked;
(C) connecting a crawler to the to-be-checked page and commanding the crawler to read a header of the to-be-checked page, and
(1) commanding the crawler that if the to-be-checked page header returns a last modified date, the crawler check the page against the last crawl date associated with the to-be-checked page; further provided that:
(i) for a to-be-checked page found to be unchanged, the crawler bypasses and does not download/process the to-be-checked page;
but (ii) if the to-be-checked page is found to have changed since the last checked time, the crawler notifies the Data Center that the to-be-checked page has been changed, downloads, processes, compresses and sends the to-be-checked page content to the Data Center;
(2) commanding the crawler that if no last modification date is found in the to-be-checked page header, the crawler downloads the page, and then runs the downloaded page through a function at the crawler to obtain a new page digest for matching against a last crawl page digest, if any, provided that:
(i) if and only if the new page digest can be matched to a last crawl page digest, the crawler proceeds to the next link to be checked;
but (ii) if for the new page digest no matching last crawl page digest is found, the crawler then notifies the Data Center and/or transmits the new page digest to the Data Center, further provided that the crawler returns the links originally received from the Link Server with updated digests and crawl times.
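The decision logic of step (C) can be illustrated with the following Python sketch. It is a reading of the claim under stated assumptions, not the patented implementation: `LinkRecord`, `FetchResult`, `check_link`, and the event labels are hypothetical names, SHA-256 stands in for the unspecified digest function, zlib stands in for the unspecified compression, and the network access is abstracted behind a caller-supplied `fetch` callable.

```python
import hashlib
import zlib
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional, Tuple


@dataclass
class LinkRecord:
    """What the Link Server tells the crawler about a link (step B)."""
    url: str
    last_checked: datetime
    last_digest: Optional[str]


@dataclass
class FetchResult:
    """Header information plus a lazy body download (hypothetical shape)."""
    last_modified: Optional[datetime]  # None if the header has no date
    body: Callable[[], bytes]          # downloads the page only when called


def page_digest(content: bytes) -> str:
    # Assumption: SHA-256; the claim only says "a function at the crawler".
    return hashlib.sha256(content).hexdigest()


def check_link(
    link: LinkRecord,
    fetch: Callable[[str], FetchResult],
    now: datetime,
) -> Tuple[LinkRecord, Optional[Tuple[str, object]]]:
    """Return the updated link record and an event for the Data Center.

    The event is None when the page is unchanged.
    """
    page = fetch(link.url)
    if page.last_modified is not None:
        if page.last_modified <= link.last_checked:
            # (C)(1)(i): unchanged, so the body is never downloaded.
            return LinkRecord(link.url, now, link.last_digest), None
        # (C)(1)(ii): changed, so download, compress, and send the content.
        content = page.body()
        updated = LinkRecord(link.url, now, page_digest(content))
        return updated, ("changed", zlib.compress(content))
    # (C)(2): no last-modified date, so download and compare digests.
    digest = page_digest(page.body())
    updated = LinkRecord(link.url, now, digest)
    if digest == link.last_digest:
        # (C)(2)(i): digest matches, proceed to the next link.
        return updated, None
    # (C)(2)(ii): no match, so report the new digest.
    return updated, ("new_digest", digest)
```

In every branch the function returns a link record with an updated crawl time and digest, mirroring the final proviso that the crawler returns the links it received with updated digests and crawl times. The bandwidth saving comes from branch (C)(1)(i), where only the header is read.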
30. A ranking method for new or changed content on a network, comprising partially ranking the new or changed content based on at least one neighboring page.
34. Computer-readable information produced
(A) from a stream of events comprising fresh information identified for a network; or
(B) by deploying a metacomputer to gather information freshly available on the network, wherein the metacomputer comprises information-gathering crawlers instructed to filter old or unchanged information.
35. An index prepared from a computer data base of computer-readable information produced
(A) from a stream of events comprising fresh information identified for a network; or
(B) by deploying a metacomputer to gather information freshly available on the network, wherein the metacomputer comprises information-gathering crawlers instructed to filter old or unchanged information.
36. An electronic library wherein the library consists essentially of an index prepared from a computer data base of computer-readable information produced
(A) from a stream of events comprising fresh information identified for a network; or
(B) by deploying a metacomputer to gather information freshly available on the network, wherein the metacomputer comprises information-gathering crawlers instructed to filter old or unchanged information.
37. A computerized search engine wherein the search engine queries an index prepared from computer-readable information produced
(A) from a stream of events comprising fresh information identified for a network, or (B) by deploying a metacomputer to gather information freshly available on the network, wherein the metacomputer comprises information-gathering crawlers instructed to filter old or unchanged information.
38. A distributed system of crawlers returning content from a network to a link server, wherein each crawler:
(1) minimizes time spent on old and unchanged content;
(2) filters and excludes from returning old or unchanged content to the link server; and
(3) gathers and returns fresh content to the link server.
39. A monitoring method for at least one web mining application, comprising screening web documents for changed content, wherein the screening occurs in a system external to the web mining application.
Specification