Methods and apparatus for intelligent crawling on the world wide web
Abstract
Methods and apparatus for performing intelligent crawling are provided. In particular, the intelligent crawling techniques of the invention provide a crawler mechanism that learns as it crawls in order to focus the search for documents on the information network being explored, e.g., the World Wide Web. The crawler mechanism stores information about the crawled documents as it retrieves them, and then uses that information to focus its subsequent search. As a result, the inventive techniques need to crawl only a small percentage of the documents on the World Wide Web.
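The learn-as-you-crawl loop described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the in-memory `web` dictionary standing in for network fetches, the use of URL path tokens as the stored information, the smoothed hit-rate scoring, and the names `crawl` and `priority`.

```python
import heapq
from collections import defaultdict

def crawl(web, seeds, predicate, budget):
    """Crawl a toy in-memory web, steering toward pages that satisfy predicate.

    web maps url -> (text, outlinks); it is a hypothetical stand-in for fetches.
    """
    seen_by_token = defaultdict(int)  # crawled pages whose URL contains the token
    hits_by_token = defaultdict(int)  # of those, pages that satisfied the predicate

    def priority(url):
        # Mean smoothed hit rate over the URL's tokens; unseen tokens score 0.5.
        tokens = url.split("/")
        rates = [(hits_by_token[t] + 1) / (seen_by_token[t] + 2) for t in tokens]
        return sum(rates) / len(rates)

    # heapq is a min-heap, so priorities are negated to pop the best candidate.
    frontier = [(-0.5, url) for url in seeds]
    heapq.heapify(frontier)
    queued, matches, fetched = set(seeds), [], 0
    while frontier and fetched < budget:
        _, url = heapq.heappop(frontier)
        text, outlinks = web[url]
        fetched += 1
        ok = predicate(text)
        if ok:
            matches.append(url)
        # Store what was learned from this page before scoring its outlinks.
        for token in url.split("/"):
            seen_by_token[token] += 1
            hits_by_token[token] += int(ok)
        for link in outlinks:
            if link in web and link not in queued:
                queued.add(link)
                heapq.heappush(frontier, (-priority(link), link))
    return matches, fetched

# Toy usage: six pages, two of which mention "football".
web = {
    "a/home": ("welcome", ["a/sports/1", "a/news/1"]),
    "a/sports/1": ("football score", ["a/sports/2", "a/news/2"]),
    "a/news/1": ("election", ["a/news/3"]),
    "a/sports/2": ("football final", []),
    "a/news/2": ("weather", []),
    "a/news/3": ("budget", []),
}
matches, fetched = crawl(web, ["a/home"], lambda text: "football" in text, budget=4)
print(matches, fetched)
```

In this sketch a link's priority is fixed when it is queued; a fuller version would presumably re-score queued links as the statistics change.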
27 Claims
1. A computer-based method of performing document retrieval in accordance with an information network, the method comprising the steps of:
initially retrieving one or more documents from the information network that satisfy a user-defined predicate, wherein the initial document retrieval operation is performed without assuming a specific model of a linkage structure such that the initial document retrieval operation retrieves the one or more documents without assuming that a relationship exists between a feature of a first one of the one or more documents and a feature of at least another one of the one or more documents that links to the first one;
collecting at least a set of aggregate statistical information and a set of predicate-specific statistical information about the one or more retrieved documents as the one or more retrieved documents are analyzed; and
using the collected statistical information to automatically determine further document retrieval operations to be performed in accordance with the information network, wherein the statistical information using step further comprises learning a linkage structure from at least a portion of the collected statistical information with each successive document retrieval operation such that the learned linkage structure is available for use in performing subsequent document retrieval operations requested by a user.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
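The distinction claim 1 draws between aggregate statistical information and predicate-specific statistical information can be made concrete with a small sketch. This is only one plausible reading, offered under stated assumptions: the class `CrawlStats`, its methods, the notion of a string "feature" (for example a URL token or a word on a linking page), and the ratio P(satisfies predicate | feature) / P(satisfies predicate) are all hypothetical illustrations and are not recited in the claim.

```python
from collections import Counter

class CrawlStats:
    """Hypothetical counters kept while retrieved documents are analyzed."""

    def __init__(self):
        self.pages = 0                  # aggregate: documents analyzed
        self.feature_pages = Counter()  # aggregate: documents carrying each feature
        self.hits = 0                   # predicate-specific: documents satisfying the predicate
        self.feature_hits = Counter()   # predicate-specific: satisfying documents carrying each feature

    def record(self, features, satisfied):
        """Update both sets of statistics for one analyzed document."""
        self.pages += 1
        self.hits += int(satisfied)
        for feature in set(features):
            self.feature_pages[feature] += 1
            if satisfied:
                self.feature_hits[feature] += 1

    def interest_ratio(self, feature):
        """Estimate P(hit | feature) / P(hit); 1.0 means no evidence either way."""
        if self.hits == 0 or self.feature_pages[feature] == 0:
            return 1.0
        p_hit = self.hits / self.pages
        p_hit_given_feature = self.feature_hits[feature] / self.feature_pages[feature]
        return p_hit_given_feature / p_hit

# Toy usage: "sports" co-occurs with predicate hits, "news" does not.
stats = CrawlStats()
stats.record(["sports", "a"], True)
stats.record(["news", "a"], False)
stats.record(["sports"], True)
stats.record(["news"], False)
print(stats.interest_ratio("sports"), stats.interest_ratio("news"))
```

A ratio above 1.0 suggests that candidates sharing the feature are more promising than average, so further retrieval operations could favor them; because the counters start empty, no linkage pattern is assumed up front and any pattern is learned from the documents retrieved so far.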
10. Apparatus for performing document retrieval in accordance with an information network, the apparatus comprising:
at least one processor operative to:
(i) initially retrieve one or more documents from the information network that satisfy a user-defined predicate, wherein the initial document retrieval operation is performed without assuming a specific model of a linkage structure such that the initial document retrieval operation retrieves the one or more documents without assuming that a relationship exists between a feature of a first one of the one or more documents and a feature of at least another one of the one or more documents that links to the first one;
(ii) collect at least a set of aggregate statistical information and a set of predicate-specific statistical information about the one or more retrieved documents as the one or more retrieved documents are analyzed; and
(iii) use the collected statistical information to automatically determine further document retrieval operations to be performed in accordance with the information network, wherein the statistical information using operation further comprises learning a linkage structure from at least a portion of the collected statistical information with each successive document retrieval operation such that the learned linkage structure is available for use in performing subsequent document retrieval operations requested by a user.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
19. An article of manufacture for performing document retrieval in accordance with an information network, comprising a machine readable medium containing one or more programs which when executed implement the steps of:
initially retrieving one or more documents from the information network that satisfy a user-defined predicate, wherein the initial document retrieval operation is performed without assuming a specific model of a linkage structure such that the initial document retrieval operation retrieves the one or more documents without assuming that a relationship exists between a feature of a first one of the one or more documents and a feature of at least another one of the one or more documents that links to the first one;
collecting at least a set of aggregate statistical information and a set of predicate-specific statistical information about the one or more retrieved documents as the one or more retrieved documents are analyzed; and
using the collected statistical information to automatically determine further document retrieval operations to be performed in accordance with the information network, wherein the statistical information using step further comprises learning a linkage structure from at least a portion of the collected statistical information with each successive document retrieval operation such that the learned linkage structure is available for use in performing subsequent document retrieval operations requested by a user.
Dependent claims: 20, 21, 22, 23, 24, 25, 26, 27
Specification