Adaptive Web crawling using a statistical model

US 20050165778A1
Filed: 12/22/2004
Published: 07/28/2005
Est. Priority Date: 01/28/2000
Status: Active Grant

Abstract

A computer-based system and method of retrieving information pertaining to documents on a computer network is disclosed. The method includes selecting a set of documents to be accessed during a Web crawl by utilizing a statistical model to determine which previously retrieved documents are most likely to have changed since they were last accessed. The statistical model continuously improves its accuracy by training internal probability distributions to reflect the actual experience with change rate patterns of the documents accessed. The decision whether to access a document is based on the probability of change compared against a desired synchronization level, random selections, maximum limits on the amount of time since the document was last accessed, and other criteria. Once the decision to access is made, the document is checked for changes and this information is used to train the statistical model.
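The access decision summarized in the abstract can be illustrated with a short Python sketch. It is not the patented implementation: the function name `should_access`, the parameters `sync_level`, `max_days`, and `explore_rate`, their default values, and the rule "fetch when the change probability is at least 1 - sync_level" are all assumptions made for illustration. The abstract names the criteria but does not say how they are combined.

```python
import random


def should_access(p_changed, days_since_access, sync_level=0.8,
                  max_days=30.0, explore_rate=0.05, rng=random):
    """Return True if the document should be fetched in the current crawl.

    All parameter names and defaults are hypothetical; the abstract only
    lists the criteria (change probability vs. synchronization level,
    random selections, a maximum time since last access).
    """
    # Maximum limit: a document that has gone unvisited too long is fetched
    # regardless of its estimated change probability.
    if days_since_access >= max_days:
        return True
    # Random selection: occasionally fetch anyway, so the model keeps
    # receiving fresh observations for documents it believes are static.
    if rng.random() < explore_rate:
        return True
    # Assumed comparison: a higher synchronization level tolerates less
    # staleness, so it lowers the probability needed to trigger a fetch.
    return p_changed >= 1.0 - sync_level
```

With the defaults above, a document with an estimated change probability of 0.9 is fetched, while one at 0.1 is usually skipped unless it hits the random selection or the 30-day limit.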

20 Claims

1. A computer-implemented method for selectively accessing a document during a current crawl of a server computer, the document being identified by a document address specification, the document having been retrieved during a previous crawl, the method comprising:
- determining whether to access the document during the current crawl with the aid of a probabilistic model that is based on the probability that the document has changed since the previous crawl; and
  
  accessing the document if the determination produces an instruction indicative that the document at the document address specification should be accessed during the current crawl.
- - 2. The method of claim 1, wherein determining whether to access the document with the aid of a probabilistic model comprises computing a probability that the document has changed since the document was retrieved during the previous crawl.
  - 3. The method of claim 2, wherein computing the probability that a document has changed comprises:
    - selecting an active probability indicative of a proportion of documents in a plurality of documents that are changing at various change rates, the plurality of documents including the document;
      
      training the active probability to reflect experience with the document during a plurality of previous crawls; and
      
      using the trained active probability to compute the probability that the document has changed.
  - 4. The method of claim 3, further comprising:
    - selecting the probability that the document has changed from the previous crawl as the active probability in the current crawl; and
      
      repeating the method of claim 3 for the current crawl.
  - 5. The method of claim 3, wherein training the active probability includes multiplying the active probability indicative of a change in the document by a training probability calculated using a probabilistic model.
  - 6. The method of claim 1, wherein the probabilistic model further comprises:
    - training a document probability distribution corresponding to the document address specification to reflect experience with the document during a plurality of previous crawls, the document probability distribution including a plurality of probabilities;
      
      determining from the document probability distribution a probability that the document has changed; and
      
      making a determination of whether to access the document in a current crawl based on the probability that the document has changed.
  - 7. The method of claim 6, further comprising:
    - calculating, based on the experience with the document during a plurality of previous crawls, a discrete random variable distribution that includes a plurality of training probabilities; and
      
      multiplying each probability in the document probability distribution by a corresponding training probability from the discrete random variable distribution.
  - 8. The method of claim 7, wherein the training probabilities are calculated using a Poisson process, the Poisson process including a Poisson equation ($e^{-r \cdot dt}$) and a complementary Poisson equation ($1 - e^{-r \cdot dt}$).
  - 9. The method of claim 8, wherein the experience with the document during the plurality of previous crawls is derived from historical information associated with the document address specification.
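The training step recited in claims 3 through 8 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method of the patent: the candidate change rates in `RATES`, the uniform starting distribution, the function names, and the renormalization step are all hypothetical. The claims only state that each probability is multiplied by a training probability derived from $e^{-r \cdot dt}$ or $1 - e^{-r \cdot dt}$.

```python
import math

# Hypothetical candidate change rates r (changes per day); not from the patent.
RATES = [0.01, 0.1, 1.0]


def train(distribution, rates, dt, changed):
    """Multiply each probability by its training probability.

    dt is the time since the previous crawl. exp(-r*dt) is the probability
    of no change at rate r; 1 - exp(-r*dt) is the probability of at least
    one change. Renormalizing so the result sums to 1 is an assumption.
    """
    updated = []
    for p, r in zip(distribution, rates):
        no_change = math.exp(-r * dt)
        training_probability = (1.0 - no_change) if changed else no_change
        updated.append(p * training_probability)
    total = sum(updated)
    if total == 0.0:
        # Degenerate observation (e.g. dt == 0 with a change): keep the prior.
        return list(distribution)
    return [u / total for u in updated]


def probability_changed(distribution, rates, dt):
    """Probability that the document changed within dt, averaged over rates."""
    return sum(p * (1.0 - math.exp(-r * dt))
               for p, r in zip(distribution, rates))
```

Starting from a uniform distribution over `RATES`, an observed change shifts weight toward the higher rates and an observed non-change shifts it toward the lower ones, which in turn raises or lowers the value returned by `probability_changed` for the next crawl.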

10. A computer-readable medium having computer-executable instructions for retrieving one document in a plurality of documents from a remote server, which when executed comprise:
- maintaining historical information associated with changes to the one document;
  
  initiating a crawl procedure for retrieving particular documents in the plurality of documents; and
  
  determining whether to access the one document from the remote server based on a probabilistic analysis of the historical information associated with the changes to the one document, said probabilistic analysis of the historical information being based on the probability that the one document has changed since a previous crawl.
- - 11. The computer-readable medium of claim 10, further comprising:
    - if the determination to access the one document is positive, identifying the one document for retrieval during the crawl procedure; and
      
      attempting to retrieve all documents identified for retrieval during the crawl procedure.
  - 12. The computer-readable medium of claim 10, wherein the probabilistic analysis comprises:
    - computing a probability that the one document has changed since the one document was last retrieved from the remote server.
  - 13. The computer-readable medium of claim 12, wherein computing the probability that the one document has changed further comprises:
    - beginning with a probability that a pre-defined proportion of documents in the plurality of documents has changed, training the probability that the pre-defined proportion of documents has changed using the historical information associated with the one document to achieve the probability that the one document has changed.
  - 14. The computer-readable medium of claim 12, further comprising making a random decision to retrieve the one document wherein the random decision is biased by the probability that the one document has changed.
  - 15. The computer-readable medium of claim 14, wherein the random decision is further biased by a synchronization level configured to influence the random decision based on a predetermined degree of tolerance for not retrieving the one document if the document is likely to have changed.
  - 16. The computer-readable medium of claim 14, wherein the random decision is made by a software routine adapted to simulate a flip of a coin.
  - 17. The computer-readable medium of claim 10, wherein:
    - the historical information associated with changes to the one document includes a time stamp for the one document, the time stamp being indicative of the time that the one document was last modified when the one document was last retrieved from the remote server; and
      
      the probabilistic analysis includes a comparison of the time stamp included in the historical information with another time stamp associated with the one document stored on the remote server.
  - 18. The computer-readable medium of claim 17, further comprising:
    - if the time stamp included in the historical information does not match the other time stamp associated with the one document stored on the remote server, identifying the one document for retrieval during the crawl procedure.
  - 19. The computer-readable medium of claim 10, wherein:
    - the historical information associated with changes to the one document includes a hash value associated with the one document, the hash value being a representation of the one document; and
      
      the probabilistic analysis includes a comparison of the hash value included in the historical information with another hash value calculated from information retrieved from the one document stored on the remote server.
  - 20. The computer-readable medium of claim 19, if the hash value included in the historical information does not match the other hash value associated with the one document stored on the remote server, identifying the one document for retrieval during the crawl procedure.
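The time stamp and hash comparisons in claims 17 through 20 can be illustrated with the Python sketch below. The `History` record, the function names, and the choice of SHA-256 are assumptions for illustration only. The claims do not name a hash function or a storage format.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class History:
    """Hypothetical per-document record kept between crawls."""
    last_modified: str  # time stamp reported when the document was last retrieved
    content_hash: str   # hash of the content seen at that retrieval


def content_digest(body: bytes) -> str:
    # SHA-256 is an assumed choice; the claims only require "a hash value".
    return hashlib.sha256(body).hexdigest()


def has_changed(history: History,
                server_last_modified: Optional[str] = None,
                body: Optional[bytes] = None) -> bool:
    """Return True if either available comparison indicates a change."""
    # Claims 17-18: a mismatched time stamp marks the document for retrieval.
    if server_last_modified is not None and server_last_modified != history.last_modified:
        return True
    # Claims 19-20: a mismatched hash marks the document for retrieval.
    if body is not None and content_digest(body) != history.content_hash:
        return True
    return False
```

The result of such a check is the kind of observation that can be fed back into the statistical model as evidence that the document did or did not change since the previous crawl.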

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Meyerzon, Dmitriy; Obata, Kenji C.

Granted Patent

US 7,328,401 B2
US Class Current

1/1
CPC Class Codes

G06F 16/951   Indexing; Web crawling tech...

Y10S 707/99931   Database or file accessing

Y10S 707/99933   Query processing, i.e. sear...
