Adaptive web crawling using a statistical model

  • US 7,328,401 B2
  • Filed: 12/22/2004
  • Issued: 02/05/2008
  • Est. Priority Date: 01/28/2000
  • Status: Expired due to Fees
First Claim

1. A computer-implemented method for selectively accessing a document during a current crawl of a server computer, the document being identified by a document address specification, the document having been retrieved during a previous crawl, the method comprising:

  • (a) determining whether to access the document during the current crawl with the aid of a probabilistic model that is based on the probability that the document has changed since the previous crawl, wherein determining whether to access the document with the aid of a probabilistic model comprises computing a probability that the document has changed since the document was retrieved during the previous crawl, and wherein computing the probability that the document has changed comprises:

    (i) calculating, based on the experience with the document during a plurality of previous crawls, a discrete random variable distribution that includes a plurality of training probabilities, wherein the training probabilities are calculated using a Poisson process, the Poisson process including a Poisson equation (e^(−r*dt)) and a complementary Poisson equation (1−e^(r*dt));

    (ii) selecting an active probability indicative of a proportion of documents in a plurality of documents that are changing at various change rates, the plurality of documents including the document;

    (iii) training the active probability to reflect experience with the document during the plurality of previous crawls; and

    (iv) using the trained active probability to compute the probability that the document has changed; and

  • (b) accessing the document if the determination produces an instruction indicative that the document at the document address specification should be accessed during the current crawl.
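
Steps (i) through (iv) and the access decision can be illustrated with a minimal Python sketch. It is not the patented implementation, only one possible reading of the claim language, and it rests on several assumptions: the "active probability" is modeled as a set of weights over a few candidate change rates; "training" is modeled as a Bayesian update from previous crawl observations; the complementary Poisson term is taken as 1 − e^(−r*dt), so that it and e^(−r*dt) sum to one; and the final decision is a simple threshold. All function names, the candidate rates, the prior weights, and the threshold value are hypothetical.

```python
import math


def p_no_change(rate, dt):
    """Poisson probability of zero changes over an interval dt at the given rate."""
    return math.exp(-rate * dt)


def p_change(rate, dt):
    """Complement: probability of at least one change over the interval dt."""
    return 1.0 - math.exp(-rate * dt)


def train(prior, rates, history):
    """Update the weights over candidate change rates from crawl history.

    history is a list of (dt, changed) pairs, one per previous crawl:
    dt is the time since the crawl before it, changed is True if the
    document differed from the stored copy. (Assumed representation.)
    """
    weights = list(prior)
    for dt, changed in history:
        likelihoods = [
            p_change(r, dt) if changed else p_no_change(r, dt) for r in rates
        ]
        weights = [w * l for w, l in zip(weights, likelihoods)]
        total = sum(weights)
        if total == 0.0:
            # Observation impossible under every candidate rate; keep the prior.
            return list(prior)
        weights = [w / total for w in weights]
    return weights


def probability_changed(weights, rates, dt):
    """Probability that the document changed in dt, averaged over the rates."""
    return sum(w * p_change(r, dt) for w, r in zip(weights, rates))


def should_access(weights, rates, dt, threshold=0.5):
    """Assumed decision rule: access when the change probability is high enough."""
    return probability_changed(weights, rates, dt) >= threshold


# Hypothetical example: three candidate rates in changes per day, with a
# prior that says most documents change slowly.
rates = [0.01, 0.1, 1.0]
prior = [0.6, 0.3, 0.1]

# The document was found changed on each of its last three daily crawls.
history = [(1.0, True), (1.0, True), (1.0, True)]
trained = train(prior, rates, history)

print(should_access(prior, rates, dt=1.0))
print(should_access(trained, rates, dt=1.0))
```

With the untrained prior, the sketch declines to access the document after one day; after training on three observed changes, most of the weight moves to the fastest candidate rate and the decision flips to access.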
