System and method for near-uniform sampling of web page addresses
First Claim
Patent Images
1. A method of generating a list of near-uniform samples of data sets from among a plurality of host computers, comprising the steps of:
- (a) generating a set of randomly selected addresses, wherein each address in the set of randomly selected addresses corresponds to a data set, and two or more addresses in the set are distinct addresses;
(b) for each distinct address in the set of randomly selected addresses, computing a reachability measure; and
(c) selecting samples from the set of randomly selected addresses, such that the probability of selecting a given distinct address is inversely proportional to the reachability measure for the distinct address, the selected samples comprising the list of near-uniform samples.
4 Assignments
0 Petitions
Accused Products
Abstract
A system generates a list of near-uniform samples of data sets (e.g., web pages) from among a plurality of host computers. The system performs a random walk so as to generate a set of visited addresses. For each address in the set, a reachability measure is computed. Then, samples are selected from the set, such that the probability of selecting a given address is inversely proportional to the reachability measure for the address. The selected samples form the list of near-uniform samples.
-
Citations
36 Claims
-
1. A method of generating a list of near-uniform samples of data sets from among a plurality of host computers, comprising the steps of:
-
(a) generating a set of randomly selected addresses, wherein each address in the set of randomly selected addresses corresponds to a data set, and two or more addresses in the set are distinct addresses;
(b) for each distinct address in the set of randomly selected addresses, computing a reachability measure; and
(c) selecting samples from the set of randomly selected addresses, such that the probability of selecting a given distinct address is inversely proportional to the reachability measure for the distinct address, the selected samples comprising the list of near-uniform samples. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
selecting a current address;
downloading a data set corresponding to the current address;
adding the current address to the set of randomly selected addresses; and
when the current address is not in a seed set, adding the current address to the seed set.
-
-
3. The method of claim 2, wherein the selecting step comprises the following steps:
-
(a1) when selecting an initial current address, selecting as the current address an address uniformly at random from the seed set;
(a2) otherwise, selecting as the current address a new address, wherein the selection comprises;
(a2-1) when there are no referred data sets in the data set corresponding to the current address, selecting a new address uniformly at random from the seed set; and
(a2-2) when there is at least one referred data set in the data set corresponding to the current address, selecting a uniform real-valued random value r, and where r is less than a predetermined value, selecting an address uniformly at random from the seed set to be the new address, and where r is greater than or equal to the predetermined value, selecting uniformly at random an address of one of the referred data sets to be the new address.
-
-
4. The method of claim 3, wherein the computing step (b) comprises the following steps:
for each respective distinct address in the set, setting the reachability measure equal to a visited ratio comprising a ratio of visits to the respective distinct address to a total number of a addresses visited during performance of step (a) the method.
-
5. The method of claim 1, wherein the computing step (b) comprises the following steps:
for each respective distinct address in the set, setting the reachability measure equal to a visited ratio comprising a ratio of visits to the respective distinct address to visits to all addresses in the set during performance of step (a).
-
6. The method of claim 3, wherein the computing step (b) comprises the following steps:
for each respective distinct address in the set, setting the reachability measure equal to a page rank comprising an estimate, computed using a predefined page rank function, of what fraction of visits in an infinitely long random walk of reachable data sets at the plurality of hosts would be to a data set at the respective distinct address.
-
7. The method of claim 1, wherein the computing step (b) comprises the following steps:
for each respective distinct address in the set, setting the reachability measure equal to a page rank comprising an estimate, computed using a predefined page rank function, of what fraction of visits in an infinitely long random walk of reachable data sets at the plurality of hosts would be to a data set at the respective distinct address.
-
8. The method of claim 1, wherein the sampling step (c) comprises the following steps:
-
(c1) for each distinct address in the set, computing a cumulative probability density function using the reachability measure; and
(c2) for each sample desired in a list of near-uniform samples (c2-1) selecting a random value of the cumulative probability density function; and
(c2-2) adding to the list a distinct address corresponding to the random value of the cumulative probability density function.
-
-
9. The method of claim 4, wherein the sampling step (c) comprises the following steps:
-
(c1) for each distinct address in the set, computing a cumulative probability density function using the reachability measure; and
(c2) for each sample desired in a list of near-uniform samples (c2-1) selecting a random value of the cumulative probability density function; and
(c2-2) adding to the list a distinct address corresponding to the random value of the cumulative probability density function.
-
-
10. The method of claim 5, wherein the sampling step (c) comprises the following steps:
-
(c1) for each distinct address in the set, computing a cumulative probability density function using the reachability measure; and
(c2) for each sample desired in a list of near-uniform samples (c2-1) selecting a random value of the cumulative probability density function; and
(c2-2) adding to the list a distinct address corresponding to the random value of the cumulative probability density function.
-
-
11. The method of claim 6, wherein the sampling step (c) comprises the following steps:
-
(c1) for each distinct address in the set, computing a cumulative probability density function using the reachability measure; and
(c2) for each sample desired in a list of near-uniform samples (c2-1) selecting a random value of the cumulative probability density function; and
(c2-2) adding to the list a distinct address corresponding to the random value of the cumulative probability density function.
-
-
12. The method of claim 7, wherein the sampling step (c) comprises the following steps:
-
(c1) for each distinct address in the set, computing a cumulative probability density function using the reachability measure; and
(c2) for each sample desired in a list of near-uniform samples (c2-1) selecting a random value of the cumulative probability density function; and
(c2-2) adding to the list a distinct address corresponding to the random value of the cumulative probability density function.
-
-
13. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising:
-
a random walk module for accessing data sets at addresses, the addresses including a seed set of addresses, and generating a set of randomly selected addresses comprising addresses of referred data sets in the accessed data sets;
the set of randomly selected addresses including a plurality of distinct addresses; and
a sampling module for computing a reachability measure for each distinct address in the set of randomly selected addresses, and selecting samples from the set of randomly selected addresses, such that the probability of selecting a given distinct address is inversely proportional to the reachability measure for the distinct address, the selected samples comprising a list of near-uniform samples. - View Dependent Claims (14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)
(a) selecting a current address;
(b) downloading a data set corresponding to the current address;
(c) adding the current address to the set of randomly selected addresses; and
(d) when the current address is not in a seed set, adding the current address to the seed set.
-
-
15. The computer program product of claim 14, wherein the random walk module further includes instructions, to e executed for each item desired in the set of randomly selected addresses, for:
-
(a1) when there is no existing current address, selecting as the current address an address uniformly at random from the seed set;
(a2) when there is an existing current address, selecting as the current address a new address, wherein the selection comprises;
(a2-1) when there are no more referred data sets in the data set corresponding to the current address, selecting a new address uniformly at random from the seed set; and
(a2-2) when there are referred data s et s in the data set corresponding to the current address, selecting a uniform real-valued random value r, and where r is less than a predetermined value, selecting an address uniformly at random from the seed set to be the new address, and where r is greater than or equal to the predetermined value, selecting uniformly at random an address of one of the referred data sets to be the new address.
-
-
16. The computer program product of claim 15, wherein the sampling module includes instructions, to be executed for each respective distinct address in the set, for setting the reachability measure equal to a visit ed ratio comprising a ratio of visits to the respective distinct address to visits to all addresses in the set during performance of step (a).
-
17. The computer program product of claim 13, wherein the sampling module includes instructions, to be executed for each respective distinct address in the set, for setting the reachability measure equal to a visited ratio comprising a ratio of visits to the respective distinct address to visits to all addresses in the set during performance of step (a).
-
18. The computer program product of claim 15, wherein the sampling module includes instructions, to be executed for each respective distinct address in the set, for setting the reachability measure equal to a page rank comprising an estimate, computed using a predefined page rank function, of what fraction of visits in an infinitely long random walk of reachable data sets at the plurality of hosts would be to a data set at the respective distinct address.
-
19. The computer program product of claim 13, wherein the sampling module includes instructions, to be executed for each respective distinct address in the set, for setting the reachability measure equal to a page rank comprising an estimate, computed using a predefined page rank function, of what fraction of visits in an infinitely long random walk of reachable data sets at the plurality of hosts would be to a data set at the respective distinct address.
-
20. The computer program product of claim 13, wherein the sampling module includes instructions:
-
(c1) to be executed for each distinct address in the set, for computing a cumulative probability density function using the reachability measure; and
(c2) to be executed for each sample desired in a list of near-uniform samples, for;
(c2-1) selecting a random value of the cumulative probability density function; and
(c2-2) adding to the list a distinct address corresponding to the random value of the cumulative probability density function.
-
-
21. The computer program product of claim 16, wherein the sampling module includes instructions:
-
(c1) to be executed for each distinct address in the set, computing a cumulative probability density function using the reachability measure; and
(c2) to be executed for each sample desired in a list of near-uniform samples (c2-1) selecting a random value of the cumulative probability density function; and
(c2-2) adding to the list a distinct address corresponding to the random value of the cumulative probability density function.
-
-
22. The computer program product of claim 17, wherein the sampling module includes instructions:
-
(c1) to be executed for each distinct address in the set, computing a cumulative probability density function using the reachability measure; and
(c2) to be executed for each sample desired in a list of near-uniform samples (c2-1) selecting a random value of the cumulative probability density function; and
(c2-2) adding to the list a distinct address corresponding to the random value of the cumulative probability density function.
-
-
23. The computer program product of claim 18, wherein the sampling module includes instructions:
-
(c1) to be executed for each distinct address in the set, computing a cumulative probability density function using the reachability measure; and
(c2) to be executed for each sample desired in a list of near-uniform samples (c2-1) selecting a random value of the cumulative probability density function; and
(c2-2) adding to the list a distinct address corresponding to the random value of the cumulative probability density function.
-
-
24. The computer program product of claim 19, wherein the sampling module includes instructions:
-
(c1) to be executed for each distinct address in the set, computing a cumulative probability density function using the reachability measure; and
(c2) to be executed for each sample desired in a list of near-uniform samples (c2-1) selecting a random value of the cumulative probability density function; and
(c2-2) adding to the list a distinct address corresponding to the random value of the cumulative probability density function.
-
-
25. A web crawler for downloading data sets from among a plurality of host computers, comprising:
-
one or more central processing units;
a communications interface for communicating with the host computers;
memory for storing addresses of data sets and downloaded data sets;
a random walk module, executed by the one or more central processing units, for accessing data sets at addresses, the addresses including a seed set of addresses, and generating a set of randomly selected addresses comprising addresses of referred data sets in the accessed data sets;
the set of randomly selected addresses including a plurality of distinct addresses; and
a sampling module, executed by the one or more central processing units, for computing a reachability measure for each distinct address in the set of randomly selected addresses, and selecting samples from the set of randomly selected addresses, such that the probability of selecting a given distinct address is inversely proportional to the reachability measure for the distinct address, the selected samples comprising a list of near-uniform samples. - View Dependent Claims (26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36)
(a) selecting a current address;
(b) downloading a data set corresponding to the current address;
(c) adding the current address to the set of randomly selected addresses; and
(d) when the current address is not in a seed set, adding the current address to the seed set.
-
-
27. The web crawler of claim 26, wherein the random walk module further includes instructions, to e executed for each item desired in the set of randomly selected addresses, for:
-
(a1) when there is no existing current address, selecting as the current address an address uniformly at random from the seed set;
(a2) when there is an existing current address, selecting as the current address a new address, wherein the selection comprises;
(a2-1) when there are no more referred data sets in the data set corresponding to the current address, selecting a new address uniformly at random from the seed set; and
(a2-2) when there are referred data sets in the data set corresponding to the current address, selecting a uniform real-valued random value r, and where r is less than a predetermined value, selecting an address uniformly at random from the seed set to be the new address, and where r is greater than or equal to the predetermined value, selecting uniformly at random an address of one of the referred data sets to be the new address.
-
-
28. The web crawler of claim 27, wherein the sampling module includes instructions, to be executed for each respective distinct address in the set, for setting the reachability measure equal to a visited ratio comprising a ratio of visits to the respective distinct address to visits to all addresses in the set during performance of step (a).
-
29. The web crawler of claim 25, wherein the sampling module includes instructions, to be executed for each respective distinct address in the set, for setting the reachability measure equal to a visited;
- ratio comprising a ratio of visits to the respective distinct address to visits to all addresses in the set during performance of step (a).
-
30. The web crawler of claim 27, wherein the sampling module includes instructions, to be executed for each respective distinct address in the set, for setting the reachability measure equal to a page rank comprising an estimate, computed using a predefined page rank function, of what fraction of visits in an infinitely long random walk of reachable data sets at the plurality of hosts would be to a data set at the respective distinct address.
-
31. The web crawler of claim 25, wherein the sampling module includes instructions, to be executed for each respective distinct address in the set, for setting the reachability measure equal to a page rank comprising an estimate, computed using a predefined page rank function, of what fraction of visits in an infinitely long random walk of reachable data sets at the plurality of hosts would be to a data set at the respective distinct address.
-
32. The web crawler of claim 25, wherein the sampling module includes instructions:
-
(c1) to be executed for each distinct address in the set, for computing a cumulative probability density function using the reachability measure; and
(c2) to be executed for each sample desired in a list of near-uniform samples, for;
(c2-1) selecting a random value of the cumulative probability density function; and
(c2-2) adding to the list a distinct address corresponding to the random value of the cumulative probability density function.
-
-
33. The web crawler of claim 28, wherein the sampling module includes instructions:
-
(c1) to be executed for each distinct address in the set, computing a cumulative probability density function using the reachability measure; and
(c2) to be executed for each sample desired in a list of near-uniform samples (c2-1) selecting a random value of the cumulative probability density function; and
(c2-2) adding to the list a distinct address corresponding to the random value of the cumulative probability density function.
-
-
34. The web crawler of claim 29, wherein the sampling module includes instructions:
-
(c1) to be executed for each distinct address in the set, computing a cumulative probability density function using the reachability measure; and
(c2) to be executed for each sample desired in a list of near-uniform samples (c2-1) selecting a random value of the cumulative probability density function; and
(c2-2) adding to the list a distinct address corresponding to the random value of the cumulative probability density function.
-
-
35. The web crawler of claim 30, wherein the sampling module includes instructions:
-
(c1) to be executed for each distinct address in the set, computing a cumulative probability density function using the reachability measure; and
(c2) to be executed for each sample desired in a list of near-uniform samples (c2-1) selecting a random value of the cumulative probability density function; and
(c2-2) adding to the list a distinct address corresponding to the random value of the cumulative probability density function.
-
-
36. The web crawler of claim 31, wherein the sampling module includes instructions:
-
(c1) to be executed for each distinct address in the set, computing a cumulative probability density function using the reachability measure; and
(c2) to be executed for each sample desired in a list of near-uniform samples (c2-1) selecting a random value of the cumulative probability density function; and
(c2-2) adding to the list a distinct address corresponding to the random value of the cumulative probability density function.
-
Specification