Method and apparatus for managing a backlog of pending URL crawls

US 8,676,783 B1
Filed: 06/28/2011
Issued: 03/18/2014
Est. Priority Date: 06/28/2011
Status: Active Grant

Abstract

The technology described relates to reducing a backlog of pending URL crawls in view of a limited URL crawl capacity. This technology is useful for crawling URLs with low latency. Because of the limited crawl capacity, uncrawled URLs from crawl requests are entered into a backlog data structure of pending crawl requests. Various criteria are applied to the URLs that are requested to be crawled, so that less important URL crawls are rejected early from the backlog data structure. This early rejection tends to limit the backlog data structure to the more important pending URL crawls, and tends to keep the average latency low by quickly failing the less important requested URL crawls.
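The early rejection described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the names `CrawlRequest`, `Backlog`, and `submit` are hypothetical, the threshold is fixed, and the sketch assumes that a larger priority value means a more important request.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlRequest:
    url: str
    priority: float  # assumption: larger means more important


@dataclass
class Backlog:
    priority_threshold: float
    pending: list = field(default_factory=list)
    rejected: list = field(default_factory=list)

    def submit(self, request: CrawlRequest) -> bool:
        """Admit the request, or fail it at once if its priority is too low."""
        if request.priority < self.priority_threshold:
            # Early rejection: the request is never crawled and never waits.
            self.rejected.append(request)
            return False
        self.pending.append(request)
        return True


backlog = Backlog(priority_threshold=0.5)
results = [
    backlog.submit(CrawlRequest("https://example.com/news", 0.9)),
    backlog.submit(CrawlRequest("https://example.com/archive", 0.2)),
]
```

Here the low-priority request fails immediately, so the caller learns the outcome without waiting for a timeout and the backlog holds only the more important request.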

21 Claims

1. A method of reducing a URL crawl backlog in view of a limited URL crawl capacity, for use with a URL crawler executed by a computing device, comprising:
- receiving a set of pending URL crawl requests at the URL crawler, each URL crawl request arriving with an assigned priority;
  
  placing, into a backlog data structure, a first sub-set of the set of pending URL crawl requests, the backlog data structure having an associated maximum wait time;
  
  rejecting from the backlog data structure a second sub-set of the set of pending URL crawl requests having priorities failing a priority threshold, such that said rejecting happens without the pending URL crawl requests in the second sub-set being performed, and such that said rejecting happens without the pending URL crawl requests in the second sub-set waiting in the backlog data structure until the maximum wait time; and
  
  adjusting the priority threshold based on an estimate of a probability that newly requested URL crawl requests will be satisfied.
  - 2. The method of claim 1, further comprising:
    - identifying the priority threshold based on a changing plurality of priorities of pending URL crawl requests.
  - 3. The method of claim 1, further comprising:
    - identifying the priority threshold based on at least one priority of historical pending URL crawl requests rejected from the backlog data structure without being performed and without waiting until the maximum wait time.
  - 4. The method of claim 1, further comprising:
    - identifying the priority threshold based on at least one priority of historical pending URL crawl requests that were performed and are no longer in the backlog data structure.
  - 5. The method of claim 1, further comprising:
    - storing the priorities of the pending URL crawl requests in a record of the priorities of the pending URL crawl requests, to be used to determine the priority threshold, regardless of whether the pending URL crawl requests are in the backlog data structure.
  - 6. The method of claim 1, further comprising:
    - sorting the URL crawl requests in the backlog data structure according to priority to determine a particular one of the pending URL crawl requests to be performed next.
  - 7. The method of claim 1, wherein the probability increases with a time interval between new URL crawl requests, and wherein the probability increases with a throughput of performed URL crawls.
  - 8. The method of claim 1, wherein responsive to the estimate exceeding a threshold probability that newly requested URL crawls will be satisfied, the priority threshold is sufficiently relaxed such that no newly requested URL crawls are rejected from the backlog data structure due to the priority threshold.
  - 9. The method of claim 1, further comprising:
    - the computing device rejecting from the backlog data structure, a third sub-set of the set of pending URL crawl requests having priorities at the priority threshold, at a rate that increases with a number of the priorities failing the priority threshold.
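One way to read the threshold adjustment of claims 1, 7, and 8 is sketched below. The claims do not give a formula, so everything here is an assumption: the estimate is taken as crawl throughput multiplied by the mean time between new requests (capped at 1), the threshold is chosen so that roughly that fraction of recently seen priorities passes, and the names `satisfaction_probability`, `adjust_threshold`, and `relax_above` are hypothetical. Larger priority values are assumed to be more important.

```python
def satisfaction_probability(mean_interarrival_s: float, throughput_per_s: float) -> float:
    """Estimate the chance that a newly requested crawl will be performed.

    Assumption: the ratio of crawl capacity to request rate, clamped to [0, 1].
    It grows with the time between requests and with throughput, the two
    monotonic properties recited in claim 7.
    """
    if mean_interarrival_s <= 0:
        raise ValueError("mean_interarrival_s must be positive")
    return min(1.0, max(0.0, throughput_per_s * mean_interarrival_s))


def adjust_threshold(recent_priorities, probability, relax_above=0.95):
    """Pick a threshold so that roughly `probability` of recent priorities pass."""
    if probability > relax_above or not recent_priorities:
        # Relaxed: no request is rejected because of the threshold (claim 8).
        return float("-inf")
    ranked = sorted(recent_priorities, reverse=True)
    keep = int(probability * len(ranked))
    if keep == 0:
        return float("inf")  # no estimated capacity: every request fails
    return ranked[keep - 1]


priorities = [0.1, 0.4, 0.5, 0.7, 0.9, 0.3, 0.8, 0.2, 0.6, 1.0]

# Busy period: two requests per second against one crawl per second.
busy_threshold = adjust_threshold(priorities, satisfaction_probability(0.5, 1.0))

# Quiet period: one request every two seconds against one crawl per second.
idle_threshold = adjust_threshold(priorities, satisfaction_probability(2.0, 1.0))
```

In the busy case the threshold lands at 0.6, so about half of the recent priorities would pass; in the quiet case the estimate reaches 1.0 and the threshold is fully relaxed.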

10. An apparatus to reduce a URL crawl backlog in view of a limited URL crawl capacity, comprising:
- a processor and memory, having instructions executable by the processor to perform:
  
  receiving a set of pending URL crawls at a URL crawler, each URL crawl request arriving with an assigned priority;
  
  placing, into a backlog data structure, a first sub-set of the set of pending URL crawl requests, the backlog data structure having an associated maximum wait time;
  
  rejecting from the backlog data structure a second sub-set of the set of pending URL crawl requests having priorities failing a priority threshold, such that said rejecting happens without the pending URL crawl requests in the second sub-set being performed, and such that said rejecting happens without the pending URL crawl requests in the second sub-set waiting in the backlog data structure until the maximum wait time; and
  
  adjusting the priority threshold based on an estimate of a probability that newly requested URL crawl requests will be satisfied.
  - 11. The apparatus of claim 10, the processor and memory further including instructions executable by the processor to perform:
    - identifying the priority threshold based on a changing plurality of priorities of pending URL crawl requests.
  - 12. The apparatus of claim 10, the processor and memory further including instructions executable by the processor to perform:
    - identifying the priority threshold based on at least one priority of historical pending URL crawl requests rejected from the backlog data structure without being performed and without waiting until the maximum wait time.
  - 13. The apparatus of claim 10, the processor and memory further including instructions executable by the processor to perform:
    - identifying the priority threshold based on at least one priority of historical pending URL crawl requests that were performed and are no longer in the backlog data structure.
  - 14. The apparatus of claim 10, the processor and memory further including instructions executable by the processor to perform:
    - storing priorities of the pending URL crawl requests in a record of the priorities of the pending URL crawl requests to be used to determine the priority threshold, regardless of whether the pending URL crawl requests are in the backlog data structure.
  - 15. The apparatus of claim 10, the processor and memory further including instructions executable by the processor to perform:
    - sorting the URL crawl requests in the backlog data structure according to priority to determine a particular one of the pending URL crawl requests to be performed next.
  - 16. The apparatus of claim 10, wherein the probability increases with a time interval between new URL crawl requests, and wherein the probability increases with a throughput of performed URL crawls.
  - 17. The apparatus of claim 10, wherein responsive to the estimate exceeding a threshold probability that newly requested URL crawls will be satisfied, the priority threshold is sufficiently relaxed such that no newly requested URL crawls are rejected from the backlog data structure due to the priority threshold.
  - 18. The apparatus of claim 10, wherein the apparatus is a computing device and the processor and memory further including instructions executable by the processor to perform:
    - the computing device rejecting from the backlog data structure, the pending URL crawl requests having priorities at the priority threshold, at a rate that increases with a number of the priorities failing the priority threshold.
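The backlog data structure recited above has two properties that a short sketch can make concrete: it is ordered by priority to choose the next crawl (claims 6 and 15), and it has an associated maximum wait time. The sketch below uses a binary heap, which is an assumption, since the claims do not name a data structure. `PriorityBacklog`, `put`, and `pop_next` are hypothetical names, and timestamps are passed in explicitly so the behavior is deterministic.

```python
import heapq
import itertools


class PriorityBacklog:
    def __init__(self, max_wait_s: float):
        self.max_wait_s = max_wait_s
        self._heap = []
        self._order = itertools.count()  # tie-breaker: earlier arrivals first

    def put(self, url: str, priority: float, now: float) -> None:
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._order), now, url))

    def pop_next(self, now: float):
        """Return the highest-priority URL that has not timed out, or None."""
        while self._heap:
            _, _, enqueued_at, url = heapq.heappop(self._heap)
            if now - enqueued_at <= self.max_wait_s:
                return url
            # Waited longer than the maximum wait time: dropped without a crawl.
        return None


backlog = PriorityBacklog(max_wait_s=10.0)
backlog.put("https://example.com/a", priority=0.4, now=0.0)
backlog.put("https://example.com/b", priority=0.9, now=0.0)
backlog.put("https://example.com/c", priority=0.7, now=8.0)

first = backlog.pop_next(now=5.0)    # highest priority: .../b
second = backlog.pop_next(now=12.0)  # next highest, still fresh: .../c
third = backlog.pop_next(now=12.0)   # .../a has waited 12 s, so nothing is left
```

The request for `.../a` is the case that early rejection is meant to avoid: it occupies the backlog for the full wait time and is never crawled.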

19. A non-transitory computer-readable medium storing instructions reducing a URL crawl backlog in view of a limited URL crawl capacity, executable by a computing device to perform:
- receiving a set of pending URL crawl requests at a URL crawler, each URL crawl request arriving with an assigned priority;
  
  placing, into a backlog data structure, a first sub-set of the set of pending URL crawl requests, the backlog data structure having an associated maximum wait time;
  
  rejecting from the backlog data structure a second sub-set of the set of pending URL crawl requests, the second sub-set having priorities failing a priority threshold, such that said rejecting happens without the pending URL crawl requests in the second sub-set being performed, and such that said rejecting happens without the pending URL crawl requests in the second sub-set waiting in the backlog data structure until the maximum wait time; and
  
  adjusting the priority threshold based on an estimate of a probability that newly requested URL crawl requests will be satisfied.
  - 20. The computer-readable medium of claim 19, wherein the instructions are executable by the computing device to further perform:
    - rejecting from the backlog data structure the pending URL crawl requests having priorities at the priority threshold at a rate that increases with a number of the priorities failing the priority threshold.
  - 21. The computer-readable medium of claim 19, wherein the instructions are executable by the computing device to further perform:
    - storing priorities of the pending URL crawl requests in a record of the priorities of the pending URL crawl requests to be used to determine the priority threshold, regardless of whether the pending URL crawl requests are in the backlog data structure.
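Claims 5, 14, and 21 keep a record of request priorities that is independent of backlog membership. A minimal sketch of such a record follows, assuming a fixed-size sliding window (the claims do not specify how the record is bounded). `PriorityRecord`, `observe`, `snapshot`, and `count_failing` are hypothetical names.

```python
from collections import deque


class PriorityRecord:
    """Sliding window of the priorities of recently received crawl requests."""

    def __init__(self, window: int = 1000):
        self._priorities = deque(maxlen=window)  # oldest entries fall off

    def observe(self, priority: float) -> None:
        # Called for every arriving request, whether it is later admitted to
        # the backlog, rejected early, or already crawled.
        self._priorities.append(priority)

    def snapshot(self) -> list:
        """Recorded priorities, oldest first, for use by a threshold rule."""
        return list(self._priorities)

    def count_failing(self, threshold: float) -> int:
        """Number of recorded priorities that would fail the given threshold."""
        return sum(1 for p in self._priorities if p < threshold)


record = PriorityRecord(window=3)
for priority in (0.2, 0.9, 0.5, 0.7):
    record.observe(priority)

recent = record.snapshot()
failing = record.count_failing(0.6)
```

Because rejected requests are recorded too, the threshold can track the full stream of incoming priorities rather than only the requests that survived admission.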

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Fedorynski, Pawel Aleksander; Samaddar, Sumitro
Primary Examiner(s): WILSON, KIMBERLY LOVEL
Application Number: US13/170,890
Time in Patent Office: 994 Days
Field of Search: 707/709
US Class Current: 707/709
CPC Class Codes:
G06F 16/951 Indexing; Web crawling tech...
G06F 16/9566 URL specific, e.g. using al...
