Adaptive Web crawling using a statistical model
Abstract
A computer-based system and method for retrieving information pertaining to documents on a computer network is disclosed. The method includes selecting a set of documents to be accessed during a Web crawl by using a statistical model to determine which previously retrieved documents are most likely to have changed since they were last accessed. The statistical model continuously improves its accuracy by training internal probability distributions to reflect the actual experience with change rate patterns of the documents accessed. The decision whether to access a document is based on the probability of change compared against a desired synchronization level, random selections, maximum limits on the amount of time since the document was last accessed, and other criteria. Once the decision to access is made, the document is checked for changes, and this information is used to train the statistical model.
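The decision flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the Poisson change model, the moving-average training rule, the reading of "synchronization level" as a threshold of 1 - sync_level, and every name and default value (DocState, should_access, train, sync_level, max_days, random_rate) are assumptions made only for illustration.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class DocState:
    """Per-document crawl state (hypothetical structure, not from the patent)."""
    change_rate: float        # estimated changes per day
    days_since_access: float  # time elapsed since the document was last accessed


def probability_changed(doc: DocState) -> float:
    # Assumption: changes arrive as a Poisson process, so the chance of at
    # least one change in t days is 1 - exp(-rate * t).
    return 1.0 - math.exp(-doc.change_rate * doc.days_since_access)


def should_access(doc: DocState, sync_level: float = 0.9, max_days: float = 30.0,
                  random_rate: float = 0.05, rng=random.random) -> bool:
    # Maximum limit on the time since the document was last accessed.
    if doc.days_since_access >= max_days:
        return True
    # Random selection: occasionally access a document regardless of the model.
    if rng() < random_rate:
        return True
    # Assumed interpretation: tolerate at most (1 - sync_level) probability
    # that the stored copy is stale.
    return probability_changed(doc) > 1.0 - sync_level


def train(doc: DocState, changed: bool, learning_rate: float = 0.2) -> None:
    # Assumed training rule: blend the old rate with the rate observed over
    # the last interval (clamped to at least one day to avoid huge rates).
    interval = max(doc.days_since_access, 1.0)
    observed = (1.0 if changed else 0.0) / interval
    doc.change_rate = (1.0 - learning_rate) * doc.change_rate + learning_rate * observed
    doc.days_since_access = 0.0
```

In this sketch a crawler would call should_access for each previously retrieved document, fetch the ones selected, compare each against its stored copy, and pass the result to train so that later decisions reflect the observed change pattern.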
20 Claims
1. A computer-implemented method for selectively accessing a document during a current crawl of a server computer, the document being identified by a document address specification, the document having been retrieved during a previous crawl, the method comprising:
determining whether to access the document during the current crawl with the aid of a probabilistic model that is based on the probability that the document has changed since the previous crawl; and
accessing the document if the determination produces an instruction indicative that the document at the document address specification should be accessed during the current crawl.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A computer-readable medium having computer-executable instructions for retrieving one document in a plurality of documents from a remote server, which when executed comprise:
maintaining historical information associated with changes to the one document;
initiating a crawl procedure for retrieving particular documents in the plurality of documents; and
determining whether to access the one document from the remote server based on a probabilistic analysis of the historical information associated with the changes to the one document, said probabilistic analysis of the historical information being based on the probability that the one document has changed since a previous crawl.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
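As a rough illustration of the steps recited in claim 10, the following Python sketch keeps a per-document change history and derives a change probability from it. The claim does not specify the form of the probabilistic analysis; the smoothed-frequency estimate (equivalent to a Beta prior on the per-crawl change probability), the threshold, and all names (ChangeHistory, record, change_probability, should_access_from_history) are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class ChangeHistory:
    """Historical change information for one document (hypothetical layout)."""
    checks: int = 0   # number of crawls in which the document was checked
    changes: int = 0  # number of those checks that found a change

    def record(self, changed: bool) -> None:
        # Maintain the history after each access to the document.
        self.checks += 1
        if changed:
            self.changes += 1

    def change_probability(self, prior_changes: float = 1.0,
                           prior_checks: float = 2.0) -> float:
        # Assumed estimator: smoothed fraction of past checks that found a
        # change. With no history it returns prior_changes / prior_checks.
        return (self.changes + prior_changes) / (self.checks + prior_checks)


def should_access_from_history(history: ChangeHistory,
                               threshold: float = 0.3) -> bool:
    # Access the document when its estimated probability of having changed
    # since the previous crawl reaches the (assumed) threshold.
    return history.change_probability() >= threshold
```

With these assumed defaults, a document with no history is accessed (estimate 0.5), while one found unchanged on three consecutive checks drops to an estimate of 0.2 and is skipped until other criteria select it.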
Specification