Method and apparatus for managing a backlog of pending URL crawls
Abstract
The technology described relates to reducing a backlog of pending URL crawls in view of a limited URL crawl capacity. This technology is useful for crawling URLs with low latency. Because of the limited crawl capacity, uncrawled URLs from crawl requests are entered into a backlog data structure of pending crawl requests. Various criteria are applied to the URLs that are requested to be crawled, so that less important URL crawls are rejected early from the backlog data structure. This early rejection tends to limit the backlog data structure to the more important pending URL crawls, and tends to keep the average latency low by quickly failing the less important requested URL crawls.
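The early-rejection idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The class name `CrawlBacklog`, its methods, the numeric priority scale (higher means more important), and the reading of "failing a priority threshold" as "below the threshold" are all assumptions made for illustration.

```python
import heapq
import itertools


class CrawlBacklog:
    """Hypothetical backlog that rejects low-priority crawl requests on arrival."""

    def __init__(self, priority_threshold, max_wait_seconds):
        # Assumption: priorities are numbers and higher means more important.
        self.priority_threshold = priority_threshold
        self.max_wait_seconds = max_wait_seconds
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-breaker for equal priorities

    def __len__(self):
        return len(self._heap)

    def submit(self, url, priority, now):
        """Queue the request, or reject it immediately if it fails the threshold.

        A rejected request is never crawled and never waits in the backlog.
        """
        if priority < self.priority_threshold:
            return False
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url, now))
        return True

    def next_url(self, now):
        """Return the most important request still within the maximum wait time."""
        while self._heap:
            _, _, url, enqueued_at = heapq.heappop(self._heap)
            if now - enqueued_at <= self.max_wait_seconds:
                return url
            # The request exceeded the maximum wait time and is dropped.
        return None
```

In this sketch a low-priority request fails at `submit` time, so the caller learns of the failure right away instead of waiting for the maximum wait time to elapse.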
21 Claims
1. A method of reducing a URL crawl backlog in view of a limited URL crawl capacity, for use with a URL crawler executed by a computing device, comprising:
receiving a set of pending URL crawl requests at the URL crawler, each URL crawl request arriving with an assigned priority;
placing, into a backlog data structure, a first sub-set of the set of pending URL crawl requests, the backlog data structure having an associated maximum wait time;
rejecting from the backlog data structure a second sub-set of the set of pending URL crawl requests having priorities failing a priority threshold, such that said rejecting happens without the pending URL crawl requests in the second sub-set being performed, and such that said rejecting happens without the pending URL crawl requests in the second sub-set waiting in the backlog data structure until the maximum wait time; and
adjusting the priority threshold based on an estimate of a probability that newly requested URL crawl requests will be satisfied.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. An apparatus to reduce a URL crawl backlog in view of a limited URL crawl capacity, comprising:
a processor and memory, having instructions executable by the processor to perform:
receiving a set of pending URL crawls at a URL crawler, each URL crawl request arriving with an assigned priority;
placing, into a backlog data structure, a first sub-set of the set of pending URL crawl requests, the backlog data structure having an associated maximum wait time;
rejecting from the backlog data structure a second sub-set of the set of pending URL crawl requests having priorities failing a priority threshold, such that said rejecting happens without the pending URL crawl requests in the second sub-set being performed, and such that said rejecting happens without the pending URL crawl requests in the second sub-set waiting in the backlog data structure until the maximum wait time; and
adjusting the priority threshold based on an estimate of a probability that newly requested URL crawl requests will be satisfied.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
19. A non-transitory computer-readable medium storing instructions reducing a URL crawl backlog in view of a limited URL crawl capacity, executable by a computing device to perform:
receiving a set of pending URL crawl requests at a URL crawler, each URL crawl request arriving with an assigned priority;
placing, into a backlog data structure, a first sub-set of the set of pending URL crawl requests, the backlog data structure having an associated maximum wait time;
rejecting from the backlog data structure a second sub-set of the set of pending URL crawl requests, the second sub-set having priorities failing a priority threshold, such that said rejecting happens without the pending URL crawl requests in the second sub-set being performed, and such that said rejecting happens without the pending URL crawl requests in the second sub-set waiting in the backlog data structure until the maximum wait time; and
adjusting the priority threshold based on an estimate of a probability that newly requested URL crawl requests will be satisfied.
Dependent claims: 20, 21
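Each independent claim ends with adjusting the priority threshold based on an estimate of the probability that newly requested URL crawl requests will be satisfied. The claims do not say how that estimate is formed, so the following Python sketch assumes one possible approach: an exponentially weighted moving average of recent outcomes, with the threshold nudged up when the estimate falls below a target and down otherwise. The class `ThresholdController` and its parameters (`target`, `step`, `alpha`) are hypothetical, as is the assumption that thresholds lie between 0 and 1.

```python
class ThresholdController:
    """Hypothetical controller that tunes a priority threshold from observed outcomes."""

    def __init__(self, threshold=0.5, target=0.9, step=0.05, alpha=0.2):
        self.threshold = threshold    # current priority threshold, assumed in [0, 1]
        self.target = target          # desired probability that a request is satisfied
        self.step = step              # how far the threshold moves per adjustment
        self.alpha = alpha            # smoothing weight given to the newest outcome
        self.success_estimate = 1.0   # start optimistic: assume requests are satisfied

    def record_outcome(self, satisfied):
        """Fold one outcome (crawled in time or not) into the running estimate."""
        sample = 1.0 if satisfied else 0.0
        self.success_estimate += self.alpha * (sample - self.success_estimate)

    def adjust(self):
        """Raise the threshold when too few requests are satisfied, else lower it."""
        if self.success_estimate < self.target:
            self.threshold = min(1.0, self.threshold + self.step)
        else:
            self.threshold = max(0.0, self.threshold - self.step)
        return self.threshold
```

Under these assumptions, a run of requests that time out pushes the estimate down and the threshold up, so fewer low-priority requests are admitted until the backlog drains.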
Specification