MULTI-TIERED CASCADING CRAWLING SYSTEM
Abstract
Provided is a multi-tiered cascading crawling system for finding, on a network, information related to one or more predetermined topics or subtopics of interest. In general, embodiments of the present invention provide a system that operates in multiple “tiers,” where at least some of the output of one tier serves as the input of the next tier. Each tier generally analyzes collections of documents on the network using successively more restrictive criteria about the subject matter of each collection and/or about which collections may be related to the one or more topics or subtopics. In general, only the final tier performs an exhaustive crawl of all of the documents in the collections that the system identifies as relevant to the topic or subtopic of interest.
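The cascade described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the in-memory `NETWORK` dictionary, the keyword "job" as the topic, and the two tier functions with their thresholds. The sketch only shows the general shape in which each tier filters the collections that survived the previous one.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical in-memory "network": collection name -> text of its documents.
NETWORK: Dict[str, List[str]] = {
    "jobs.example.com": ["engineer job opening", "apply for this job", "job benefits"],
    "news.example.org": ["job market report", "weather today", "sports scores"],
    "pets.example.net": ["cat photos", "dog training tips"],
}

# A tier is modeled as a predicate over a collection name.
Tier = Callable[[str], bool]


def tier_one(name: str) -> bool:
    """Loose criterion (assumed): the first document mentions the topic at all."""
    return "job" in NETWORK[name][0]


def tier_two(name: str) -> bool:
    """Stricter criterion (assumed): at least half of the documents mention the topic."""
    docs = NETWORK[name]
    hits = sum("job" in doc for doc in docs)
    return hits / len(docs) >= 0.5


def run_cascade(names: Iterable[str], tiers: List[Tier]) -> List[str]:
    """Apply the tiers in order; the survivors of one tier are the input of the next."""
    survivors = list(names)
    for tier in tiers:
        survivors = [name for name in survivors if tier(name)]
    return survivors


relevant = run_cascade(list(NETWORK), [tier_one, tier_two])
```

In this toy setup only the collections left in `relevant` would be handed to an exhaustive crawl, which is the expensive step the earlier tiers are meant to avoid for unrelated collections.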
80 Claims
1. A method of searching a network for information related to a topic of interest, wherein the network comprises a plurality of documents containing information, and wherein one or more of the documents are grouped together into a collection of documents so that the network comprises a plurality of collections of documents, the method comprising:
exploring the contents of one or more individual documents of a collection of documents;
making a determination of the relevancy of the one or more individual documents of the collection to the topic of interest; and
making a determination of the relevancy of the collection based at least partially on the relevancy of the one or more individual documents in the collection.
Dependent claims: 2-15.
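One way to read the three steps of claim 1 is sketched below in Python. The scoring rule (share of topic terms present in a document), the `doc_threshold` parameter, and the function names are assumptions chosen for illustration; the claim itself does not prescribe how relevancy is computed or aggregated.

```python
from typing import Iterable, Set


def document_relevancy(text: str, topic_terms: Set[str]) -> float:
    """Fraction of the topic terms that appear in the document (0.0 to 1.0)."""
    if not topic_terms:
        return 0.0
    words = set(text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def collection_relevancy(
    sampled_docs: Iterable[str],
    topic_terms: Set[str],
    doc_threshold: float = 0.5,
) -> float:
    """Share of sampled documents whose own relevancy meets doc_threshold."""
    docs = list(sampled_docs)
    if not docs:
        return 0.0
    relevant = sum(
        document_relevancy(doc, topic_terms) >= doc_threshold for doc in docs
    )
    return relevant / len(docs)


terms = {"job", "salary"}
sample = ["Job listing with salary range", "contact us", "salary survey"]
score = collection_relevancy(sample, terms)
```

The collection score depends only on the sampled documents, which matches the "based at least partially on" wording: a caller could combine it with other signals before deciding whether the collection advances.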
16. A system for gathering information related to at least one topic of interest, the system comprising:
a multi-tiered system configured for searching a network for information related to at least one topic of interest, wherein each tier comprises more restrictive criteria than the previous tier for locating the at least one topic of interest.
Dependent claims: 17-30.
31. A method for requesting web pages from a plurality of web hosts, each web host supporting a finite number of web pages, the method comprising:
grouping the plurality of web hosts into one or more arrays of web hosts, each array comprising a finite number of web hosts; and
submitting a first web page request to each web host in an array before submitting a second web page request to any web host in the array.
Dependent claims: 32-36.
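The request ordering in claim 31 resembles a round-robin schedule over the hosts of one array. The Python sketch below only computes that order and issues no network requests; `group_hosts`, `request_order`, the `array_size` parameter, and the sample host names are hypothetical names introduced for illustration.

```python
from typing import Dict, List, Tuple


def group_hosts(hosts: List[str], array_size: int) -> List[List[str]]:
    """Split the host list into consecutive arrays of at most array_size hosts."""
    return [hosts[i:i + array_size] for i in range(0, len(hosts), array_size)]


def request_order(
    array: List[str], pages: Dict[str, List[str]]
) -> List[Tuple[str, str]]:
    """Return (host, page) pairs round by round.

    Every host in the array receives its first request before any host
    receives its second, and so on for later rounds.
    """
    order: List[Tuple[str, str]] = []
    rounds = max((len(pages.get(host, [])) for host in array), default=0)
    for round_index in range(rounds):
        for host in array:
            host_pages = pages.get(host, [])
            if round_index < len(host_pages):
                order.append((host, host_pages[round_index]))
    return order


arrays = group_hosts(["a.example", "b.example", "c.example"], array_size=2)
pending = {"a.example": ["/1", "/2"], "b.example": ["/x"]}
schedule = request_order(arrays[0], pending)
```

Spreading consecutive requests across hosts in this way leaves a gap between two requests to the same host, which is a common politeness consideration in crawlers.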
37. A method of ranking hyperlinks found on the Internet during a web crawling scheme, the method comprising:
analyzing text in the immediate vicinity of a hyperlink as the hyperlink is found;
computing a weight for that hyperlink based on the relevancy of the text to a set of interest; and
storing the hyperlink in a datastore where hyperlinks stored therein are ranked based on the relative computed weights of each hyperlink.
Dependent claims: 38-40.
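The link weighting in claim 37 can be sketched with Python's standard `html.parser` and `heapq` modules. Several choices here are assumptions made for illustration: "immediate vicinity" is taken to mean a fixed window of words around the start of the anchor text, the weight is a simple count of interest terms in that window, and the ranked datastore is an in-memory heap. Unlike the claim, this sketch computes weights after the whole document has been parsed rather than as each link is found.

```python
import heapq
import re
from html.parser import HTMLParser
from typing import List, Set, Tuple


class LinkContextParser(HTMLParser):
    """Collects visible words in order and the word position of each link."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[str] = []
        self.links: List[Tuple[str, int]] = []  # (href, index of first anchor word)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append((href, len(self.words)))

    def handle_data(self, data):
        self.words.extend(re.findall(r"[a-z]+", data.lower()))


class RankedLinkStore:
    """In-memory store that returns the highest-weight link first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._count = 0  # insertion counter used to break ties

    def add(self, href: str, weight: int) -> None:
        heapq.heappush(self._heap, (-weight, self._count, href))
        self._count += 1

    def pop_best(self) -> str:
        return heapq.heappop(self._heap)[2]


def rank_links(html: str, interest_terms: Set[str], window: int = 5) -> RankedLinkStore:
    """Weigh each link by the interest terms within `window` words on either side."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    store = RankedLinkStore()
    for href, pos in parser.links:
        nearby = parser.words[max(0, pos - window):pos + window]
        store.add(href, sum(1 for word in nearby if word in interest_terms))
    return store


PAGE = (
    '<p>Weather report for today <a href="/weather">forecast</a> '
    "and nothing else to see here at all.</p>"
    '<p>Open engineering jobs with salary details: '
    '<a href="/careers">apply for jobs</a></p>'
)
store = rank_links(PAGE, {"jobs", "salary", "careers"})
first = store.pop_best()
second = store.pop_best()
```

A crawler built on such a store would pop the best-weighted link next, so links surrounded by on-topic text are followed before links surrounded by unrelated text.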
41. A system for ranking hyperlinks found in a document on the Internet, the system comprising:
a link weighting system for analyzing text in the immediate vicinity of a hyperlink and for computing a weight for the hyperlink based on the relevancy of the text to a set of interest; and
a datastore for storing the hyperlink with other hyperlinks in a ranked list based on the relative computed weights of each hyperlink.
Dependent claims: 42-48.
49. A method of determining whether a collection of web pages relates to a topic of interest, the method comprising:
exploring a web page of the collection of web pages;
determining the relevancy of the web page to the particular topic of interest; and
making a determination of the relevancy of the entire collection of web pages to the topic of interest based at least partially on the determined relevancy of the web page.
Dependent claims: 50-51.
52. A system of using a web crawler to classify a collection of web pages, the system comprising:
a web crawler module configured to search a web page from the collection for links to other web pages in the collection, and further configured to request web pages corresponding to the link that the web crawler finds and to examine such web pages for more links to other web pages in the collection;
a classifier module configured to make a determination of the relevancy of each web page that the web crawler examines to a set of interest; and
a collection classifying system for making a determination of the relevancy of the collection to the set of interest based on the determined relevancy of the web pages that the web crawler examines.
Dependent claims: 53.
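The three modules of claim 52 can be sketched together in Python over an in-memory site. The `SITE` dictionary, the breadth-first traversal, the `max_pages` limit, and the majority-style `threshold` are all assumptions for illustration; the claim does not specify a traversal order or an aggregation rule.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

# Hypothetical collection: URL -> (page text, outgoing links).
SITE: Dict[str, Tuple[str, List[str]]] = {
    "/": ("job board home", ["/a", "/b", "http://other.example/x"]),
    "/a": ("job listing", ["/"]),
    "/b": ("privacy policy", []),
}


def crawl_and_classify(
    site: Dict[str, Tuple[str, List[str]]],
    start: str,
    is_relevant: Callable[[str], bool],
    max_pages: int = 50,
    threshold: float = 0.5,
) -> Tuple[bool, Dict[str, bool]]:
    """Crawl pages inside the collection, classify each, then classify the collection."""
    seen = {start}
    queue = deque([start])
    verdicts: Dict[str, bool] = {}
    while queue and len(verdicts) < max_pages:
        url = queue.popleft()
        text, links = site[url]
        verdicts[url] = is_relevant(text)  # per-page classifier
        for link in links:
            # Follow only links that stay inside the collection.
            if link in site and link not in seen:
                seen.add(link)
                queue.append(link)
    share = sum(verdicts.values()) / len(verdicts)
    return share >= threshold, verdicts


collection_ok, page_verdicts = crawl_and_classify(SITE, "/", lambda text: "job" in text)
```

Lowering `max_pages` turns the same routine into a cheap sampling pass, while a large value approximates an exhaustive crawl of the collection.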
54. A method of gathering information related to a particular topic of interest, the method comprising:
searching a network for information related to at least one topic of interest, wherein searching the network comprises searching the network in a multi-tiered format such that each tier comprises more restrictive criteria for locating the at least one topic of interest than a previous tier; and
extracting data from the searched information relating to the at least one topic of interest searched on the network.
Dependent claims: 55-67.
68. A system for gathering information related to a particular topic of interest, the system comprising:
a multi-tiered system configured for searching a network for information related to at least one topic of interest, wherein each tier comprises more restrictive criteria for locating the at least one topic of interest than a previous tier; and
an information extraction engine configured for extracting data from the searched information relating to the at least one topic of interest searched on the network.
Dependent claims: 69-80.
Specification