Method for estimating coverage of web search engines
Abstract
A computerized method is used to estimate the relative coverage of Web search engines. Each search engine maintains an index of words of pages located at specific URL addresses in a network. The method generates a random query. The random query is a logical combination of words found in a subset of the pages. The random query is submitted to a first search engine. In response, a set of URLs of pages matching the query is received. Each URL identifies a page indexed by the first search engine that satisfies the random query. A particular URL identifying a sample page is randomly selected. A strong query corresponding to the sample page is generated, and the strong query is submitted to a second search engine. Result information received in response to the strong query is compared to determine if the second search engine has indexed the sample page, or a page substantially similar to the sample page. This procedure is repeated to gather statistical data, which is used to estimate the relative sizes and amount of overlap of search engines.
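The repeated sampling procedure summarized in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `first_engine`, `second_engine`, `make_random_query`, and `make_strong_query` are hypothetical callables standing in for the real search engines and query generators, and the exact-URL membership test is a simplification of the comparison step (the abstract also allows matching a substantially similar page).

```python
import random

def containment_fraction(first_engine, second_engine,
                         make_random_query, make_strong_query,
                         trials, rng):
    """Estimate the fraction of pages sampled from the first engine
    that the second engine has also indexed.

    All four callables are hypothetical stand-ins:
      first_engine(query)   -> list of URLs matching the random query
      second_engine(query)  -> list of URLs matching the strong query
      make_random_query(rng) -> query string
      make_strong_query(url) -> query string for the sample page
    """
    sampled = 0
    found = 0
    for _ in range(trials):
        urls = first_engine(make_random_query(rng))
        if not urls:
            continue  # the random query matched nothing; try again
        sample_url = rng.choice(urls)  # randomly select one sample page
        results = second_engine(make_strong_query(sample_url))
        sampled += 1
        if sample_url in results:  # simplified comparison step
            found += 1
    # None signals that no sample page could be drawn at all.
    return found / sampled if sampled else None
```

Running the same loop with the roles of the two engines swapped would give the second fraction needed for a size comparison.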
45 Claims
-
1. A computerized method for comparing search engine indices and estimating coverage of at least one search engine, each search engine maintaining an index of words of pages located at specific addresses in a network, wherein the estimate of coverage indicates the relative sizes of the indices of the first and second search engine, and the relative amount of overlap between the first and second search engine, comprising:
generating a random query, the random query being a logical combination of words found in a lexicon of words;
submitting the random query to the first search engine;
receiving a set of URLs in response to the random query;
randomly selecting a particular URL identifying a sample page;
generating a strong query for the sample page;
submitting the strong query to a second search engine;
comparing result information received in response to the strong query to determine if the second search engine has indexed the sample page; and
estimating the relative sizes of the indices of the first and second search engines by dividing a fraction of a first set of pages sampled from the second search engine that are contained in the first search engine by a fraction of a second set of pages sampled from the first search engine that are contained in the second search engine.
Dependent claims: 2 through 22.
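The estimating step of claim 1 can be illustrated with a short calculation. If pages are sampled roughly uniformly, the fraction of pages sampled from the second engine that are found in the first approximates |A ∩ B| / |B|, and the fraction of pages sampled from the first engine that are found in the second approximates |A ∩ B| / |A|, so their quotient approximates |A| / |B|. The sketch below assumes that reading; the function name, parameter names, and the sample counts are invented for illustration only.

```python
from fractions import Fraction

def relative_size(b_sampled, b_found_in_a, a_sampled, a_found_in_b):
    """Estimate size(first index A) / size(second index B).

    b_sampled / b_found_in_a: pages sampled from B, and how many of
    them A was found to contain. a_sampled / a_found_in_b: the same
    with the roles of the engines swapped.
    """
    if b_sampled <= 0 or a_sampled <= 0:
        raise ValueError("each sample set must be non-empty")
    frac_b_in_a = Fraction(b_found_in_a, b_sampled)
    frac_a_in_b = Fraction(a_found_in_b, a_sampled)
    if frac_a_in_b == 0:
        raise ValueError("no overlap observed; the ratio is undefined")
    return frac_b_in_a / frac_a_in_b

# Invented counts: 120 of 200 B-samples are in A, 60 of 200 A-samples are in B.
ratio = relative_size(b_sampled=200, b_found_in_a=120,
                      a_sampled=200, a_found_in_b=60)
print(ratio)  # 2
```

With these made-up counts the first index would be estimated at about twice the size of the second.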
sorting the words in the lexicon according to the frequencies of the words; and
establishing an upper frequency threshold and a lower frequency threshold so that when words equidistant from the upper and lower thresholds are combined in the conjunctive query, the set of addresses is less than or equal to a predetermined maximum number of members.
-
-
13. The computer program product of claim 6 wherein the random query is a conjunctive query combining a pair of words and an AND operator.
-
14. The computer program product of claim 13 wherein the process further includes:
sorting the words in the lexicon according to the frequencies of the words; and
establishing an upper frequency threshold and a lower frequency threshold so that when words equidistant from the upper and lower thresholds are combined in the conjunctive query, the set of addresses is less than or equal to a predetermined maximum number of members.
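One possible reading of the thresholding in claims 13 and 14 is sketched below: the lexicon is sorted by word frequency, a band of ranks is bounded by a lower and an upper threshold, and the word a given distance above the lower threshold is paired with the word the same distance below the upper threshold. The rank-based thresholds, the helper names, and the `AND` query syntax are assumptions made for illustration; the claims do not fix them, and choosing thresholds that keep the result set under the maximum size would still require tuning against a real engine.

```python
import random

def sort_by_frequency(lexicon):
    """Return the lexicon's words ordered from rarest to most frequent.
    `lexicon` maps each word to its frequency (an assumed layout)."""
    return sorted(lexicon, key=lambda word: (lexicon[word], word))

def equidistant_pair(sorted_words, lower, upper, offset):
    """Pair the word `offset` ranks above the lower threshold with the
    word `offset` ranks below the upper threshold."""
    if not 0 <= lower < upper < len(sorted_words):
        raise ValueError("thresholds must satisfy 0 <= lower < upper < len")
    if not 0 <= offset <= (upper - lower - 1) // 2:
        raise ValueError("offset would pair a word with itself or cross over")
    return sorted_words[lower + offset], sorted_words[upper - offset]

def random_conjunctive_query(lexicon, lower, upper, rng):
    """Build a two-word conjunctive query from a random equidistant pair."""
    words = sort_by_frequency(lexicon)
    offset = rng.randint(0, (upper - lower - 1) // 2)
    first, second = equidistant_pair(words, lower, upper, offset)
    return f"{first} AND {second}"
```

Pairing from opposite ends of the band means a rarer word is always combined with a more common one, which is one plausible way to keep the size of the result set bounded.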
-
-
15. The method of claim 1 wherein the network is the World Wide Web and further including:
fetching the URL from the first search engine;
fetching a corresponding page from the World Wide Web; and
constructing the strong query to be representative of the sample page.
-
-
16. The method of claim 15 wherein the strong query is a disjunction of a first and second conjunctive query.
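A strong query of the form described in claim 16, a disjunction of two conjunctive queries, might be assembled as in the following sketch. The choice of the rarest words on the page, the default of four terms per conjunct, the tokenization, and the `AND`/`OR` syntax are all assumptions made for illustration; the claims only require that the query be representative of the sample page.

```python
import re

def strong_query(page_text, lexicon_freq, terms_per_conjunct=4):
    """Build "(w1 AND w3 ...) OR (w2 AND w4 ...)" from the rarest
    lexicon words that appear on the page.

    `lexicon_freq` maps words to frequencies (an assumed layout).
    """
    words_on_page = set(re.findall(r"[a-z]+", page_text.lower()))
    known = sorted((w for w in words_on_page if w in lexicon_freq),
                   key=lambda w: (lexicon_freq[w], w))
    rare = known[: 2 * terms_per_conjunct]
    if len(rare) < 2:
        raise ValueError("page has too few lexicon words for a strong query")
    # Alternate the rare words between the two conjuncts.
    first, second = rare[0::2], rare[1::2]
    return "(" + " AND ".join(first) + ") OR (" + " AND ".join(second) + ")"
```

Using two conjuncts gives the query a second chance to match if the second engine indexed a slightly different version of the page.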
-
17. The method of claim 1 wherein the result information includes URLs of pages indexed by the second search engine.
-
18. The method of claim 17 wherein the URLs of the pages indexed and the particular address identifying the sample pages are normalized before the comparing.
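The normalization mentioned in claim 18 is not spelled out in the claims. A minimal sketch of what such a step could look like, using Python's standard `urllib.parse`, is shown below; the specific rules (lowercasing the host, dropping default ports and fragments, stripping a trailing `index.html`) are assumptions chosen for illustration, and the sketch assumes every URL carries a scheme.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize_url(url):
    """Reduce equivalent spellings of a URL to one canonical form."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the scheme's default.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # Treat ".../index.html" and ".../" as the same page (an assumption).
    for index_name in ("index.html", "index.htm"):
        if path.endswith("/" + index_name):
            path = path[: -len(index_name)]
    # Drop the fragment; it never reaches the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))

def same_page(sample_url, result_url):
    """Compare two URLs after normalization."""
    return normalize_url(sample_url) == normalize_url(result_url)
```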
-
19. The method of claim 17 wherein the result information being compared is the content of the sample page, and the content of the pages indexed by the second search engine.
-
20. The method of claim 17 wherein the result information includes host names.
-
21. The method of claim 1 wherein dynamic and impoverished pages are discarded before the comparing.
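Claim 21 does not define "dynamic" or "impoverished" pages. The filter below is a rough sketch under assumed definitions: a page is treated as dynamic if its URL has a query string or a `/cgi-bin/` path component, and as impoverished if it contains fewer than a hypothetical minimum number of words. Both heuristics and the cutoff value are illustrative guesses, not taken from the claims.

```python
import re
from urllib.parse import urlsplit

MIN_WORDS = 20  # hypothetical cutoff for an "impoverished" page

def is_dynamic(url):
    """Assumed heuristic: query strings and cgi-bin paths mark dynamic pages."""
    parts = urlsplit(url)
    return bool(parts.query) or "/cgi-bin/" in parts.path

def is_impoverished(page_text, min_words=MIN_WORDS):
    """Assumed heuristic: too few words to build a useful strong query."""
    return len(re.findall(r"\w+", page_text)) < min_words

def keep_sample(url, page_text):
    """Return True if the sample page should proceed to the comparing step."""
    return not is_dynamic(url) and not is_impoverished(page_text)
```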
-
22. The method of claim 1 wherein privileged access is provided to the first search engine.
-
23. A computerized method for comparing search engine indices and estimating coverage of at least one search engine, each search engine maintaining an index of words of pages located at specific addresses in a network, comprising:
generating a random query, the random query being a logical combination of words found in a lexicon of words;
submitting the random query to the first search engine;
receiving a set of URLs in response to the random query;
randomly selecting a particular URL identifying a sample page;
generating a strong query for the sample page;
submitting the strong query to a second search engine; and
comparing result information received in response to the strong query to determine if the second search engine has indexed the sample page;
wherein ranking bias and query bias are compensated, the ranking bias being compensated by comparing result information for any page indexed by the second search engine that is responsive to the strong query, the query bias being compensated by probabilistically selecting the particular address identifying the sample page.
-
-
24. A computer program product readable by a computing system and encoding a computer program of instructions for executing a computer process for comparing search engine indices and estimating coverage of at least one search engine, each search engine maintaining an index of words of pages located at specific addresses in a network, wherein the estimate of coverage indicates the relative sizes of the indices of the first and second search engine, and the relative amount of overlap between the first and second search engine, said computer process comprising:
generating a random query, the random query being a logical combination of words found in a lexicon of words;
submitting the random query to the first search engine;
receiving a set of URLs in response to the random query;
randomly selecting a particular URL identifying a sample page;
generating a strong query for the sample page;
submitting the strong query to a second search engine;
comparing result information received in response to the strong query to determine if the second search engine has indexed the sample page; and
estimating the relative sizes of the indices of the first and second search engines by dividing a fraction of a first set of pages sampled from the second search engine that are contained in the first search engine by a fraction of a second set of pages sampled from the first search engine that are contained in the second search engine.
Dependent claims: 25 through 41.
fetching the URL from the first search engine;
fetching a corresponding page from the World Wide Web; and
constructing the strong query to be representative of the sample page.
-
-
35. The computer program product of claim 34 wherein the strong query is a disjunction of a first and second conjunctive query.
-
36. The computer program product of claim 24 wherein the result information includes URLs of pages indexed by the second search engine.
-
37. The computer program product of claim 36 wherein the URLs of the pages indexed and the particular address identifying the sample pages are normalized before the comparing.
-
38. The computer program product of claim 36 wherein the result information being compared by the process is the content of the sample page, and the content of the pages indexed by the second search engine.
-
39. The computer program product of claim 36 wherein the result information includes host names.
-
40. The computer program product of claim 24 wherein the process discards dynamic and impoverished pages before the comparing.
-
41. The computer program product of claim 24 wherein privileged access is provided to the first search engine during the process.
-
42. A computer program product readable by a computing system and encoding a computer program of instructions for executing a computer process for comparing search engine indices and estimating coverage of at least one search engine, each search engine associated with an index of words of pages located at specific addresses in a network, said computer process comprising:
generating a random query, the random query being a logical combination of words found in a lexicon of words;
submitting the random query to the first search engine;
receiving a set of URLs in response to the random query;
randomly selecting a particular URL identifying a sample page;
generating a strong query for the sample page;
submitting the strong query to a second search engine; and
comparing result information received in response to the strong query to determine if the second search engine has indexed the sample page;
wherein ranking bias and query bias are compensated, the ranking bias being compensated by comparing result information for any page indexed by the second search engine that is responsive to the strong query, the query bias being compensated by probabilistically selecting the particular address identifying the sample page.
-
-
43. A system for comparing search engine indices and estimating coverage of at least one search engine, each search engine associated with an index of words of pages located at specific addresses in a network, wherein the estimate of coverage indicates the relative sizes of the indices of the first and second search engine, and the relative amount of overlap between the first and second search engine, comprising:
a processor for generating a random query, the random query being a logical combination of words found in a lexicon of words;
a communications device for submitting the random query to the first search engine;
the communications device further for receiving a set of URLs in response to the random query;
the processor further for randomly selecting a particular URL identifying a sample page;
the processor further for generating a strong query for the sample page;
the communications device further for submitting the strong query to a second search engine;
the processor further for comparing result information received in response to the strong query to determine if the second search engine has indexed the sample page; and
the processor further for estimating the relative sizes of the indices of the first and second search engines by dividing a fraction of a first set of pages sampled from the second search engine that are contained in the first search engine by a fraction of a second set of pages sampled from the first search engine that are contained in the second search engine.
-
-
44. A method for comparing search engine indices, comprising:
generating an initial query;
submitting the initial query to a first search engine;
receiving a set of URLs in response to the initial query;
selecting a particular URL identifying a sample page;
generating a strong query for the sample page;
submitting the strong query to a second search engine;
comparing result information received in response to the strong query to determine if the second search engine has indexed the sample page; and
estimating the relative sizes of the indices of the first and second search engines by dividing a fraction of a first set of pages sampled from the second search engine that are contained in the first search engine by a fraction of a second set of pages sampled from the first search engine that are contained in the second search engine.
-
-
45. A method for comparing search engine indices, comprising:
generating an initial query;
submitting the initial query to a first search engine;
receiving a set of URLs in response to the initial query;
selecting a particular URL identifying a sample page;
generating a strong query for the sample page;
submitting the strong query to a second search engine; and
comparing result information received in response to the strong query to determine if the second search engine has indexed the sample page;
wherein ranking bias and query bias are compensated, the ranking bias being compensated by comparing result information for any page indexed by the second search engine that is responsive to the strong query, the query bias being compensated by probabilistically selecting the particular address identifying the sample page.
-
Specification