Method for estimating coverage of web search engines

US 6,711,568 B1
Filed: 11/08/2000
Issued: 03/23/2004
Est. Priority Date: 11/25/1997
Status: Expired due to Term

Abstract

A computerized method is used to estimate the relative coverage of Web search engines. Each search engine maintains an index of words of pages located at specific URL addresses in a network. The method generates a random query. The random query is a logical combination of words found in a subset of the pages. The random query is submitted to a first search engine. In response, a set of URLs of pages matching the query is received. Each URL identifies a page indexed by the first search engine that satisfies the random query. A particular URL identifying a sample page is randomly selected. A strong query corresponding to the sample page is generated, and the strong query is submitted to a second search engine. Result information received in response to the strong query is compared to determine if the second search engine has indexed the sample page, or a page substantially similar to the sample page. This procedure is repeated to gather statistical data, which is used to estimate the relative sizes and amount of overlap of search engines.
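
The sampling loop described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical: `ToyEngine` is an in-memory stand-in for a search engine, page content is read from the toy index instead of being fetched from the Web, and the strong query is assumed to be the few rarest words of the page. It is not the patented implementation, only one way the steps could fit together.

```python
import random


class ToyEngine:
    """Hypothetical in-memory stand-in for a search engine: URL -> set of words."""

    def __init__(self, pages):
        self.pages = pages

    def search_any(self, words):
        # Disjunctive (OR) query: pages containing at least one of the words.
        return [u for u, ws in self.pages.items() if ws & set(words)]

    def search_all(self, words):
        # Conjunctive (AND) query: pages containing every one of the words.
        return [u for u, ws in self.pages.items() if set(words) <= ws]


def strong_query(page_words, word_freq, k=3):
    # Assumption: the k rarest words on the page identify it well enough.
    return sorted(page_words, key=lambda w: (word_freq.get(w, 0), w))[:k]


def containment_fraction(first, second, lexicon, word_freq, trials, rng):
    """Estimate the fraction of pages sampled from `first` that `second` indexes."""
    found = sampled = 0
    for _ in range(trials):
        query = rng.sample(lexicon, 2)        # random query built from lexicon words
        urls = first.search_any(query)        # URLs returned by the first engine
        if not urls:
            continue                          # nothing to sample for this query
        url = rng.choice(urls)                # randomly selected sample page
        sq = strong_query(first.pages[url], word_freq)
        sampled += 1
        if url in second.search_all(sq):      # has the second engine indexed it?
            found += 1
    return found / sampled if sampled else 0.0
```

Repeating the loop in both directions yields the two fractions that the claims combine into a relative size estimate.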

45 Claims

1. A computerized method for comparing search engine indices and estimating coverage of at least one search engine, each search engine maintaining an index of words of pages located at specific addresses in a network, wherein the estimate of coverage indicates the relative sizes of the indices of the first and second search engine, and the relative amount of overlap between the first and second search engine, comprising:
- generating a random query, the random query being a logical combination of words found in a lexicon of words;
  
  submitting the random query to the first search engine;
  
  receiving a set of URLs in response to the random query;
  
  randomly selecting a particular URL identifying a sample page;
  
  generating a strong query for the sample page;
  
  submitting the strong query to a second search engine;
  
  comparing result information received in response to the strong query to determine if the second search engine has indexed the sample page; and
  
  estimating the relative sizes of the indices of the first and second search engines by dividing a fraction of a first set of pages sampled from the second search engine that are contained in the first search engine by a fraction of a second set of pages sampled from the first search engine that are contained in the second search engine.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22)
- - 2. The method of claim 1 wherein the relative amount of overlap of the indices of the first and second search engine is estimated by computing a fraction of a set of pages sampled from the second search engine that are contained in the first search engine.
  - 3. The method of claim 1 wherein the content of the lexicon of words is expressed in a particular language.
  - 4. The method of claim 1 wherein the content of the training set of pages relates to a particular context domain.
  - 5. The method of claim 1 wherein the lexicon of words is constructed from a training set of pages, and wherein the training set of pages represents pages of interest for which coverage is being estimated.
  - 6. The method of claim 1 wherein the lexicon of words is constructed from a training set of pages, and the method further comprises determining the frequencies of unique words in the lexicon.
  - 7. The method of claim 6 wherein the random query combines random words selected from the lexicon with a logical operator.
  - 8. The method of claim 6 wherein the random query is a disjunctive query.
  - 9. The method of claim 8 wherein the disjunctive query combines a set of words using OR operators, the set of words having a predetermined size.
  - 10. The method of claim 9 wherein the words of the set have relative frequencies that are substantially similar.
  - 11. The method of claim 6 wherein the random query is a conjunctive query combining a pair of words and an AND operator.
  - 12. The method of claim 11 further including:
  - 13. The computer program product of claim 6 wherein the random query is a conjunctive query combining a pair of words and an AND operator.
  - 14. The computer program product of claim 13 wherein the process further includes:
    - sorting the words in the lexicon according to the frequencies of the words; and
      
      establishing an upper frequency threshold and a lower frequency threshold so that when words equidistant from the upper and lower thresholds are combined in the conjunctive query, the set of addresses is less than or equal to a predetermined maximum number of members.
  - 15. The method of claim 1 wherein the network is the World Wide Web and further including:
    - fetching the URL from the first search engine;
      
      fetching a corresponding page from the World Wide Web; and
      
      constructing the strong query to be representative of the sample page.
  - 16. The method of claim 15 wherein the strong query is a disjunction of a first and second conjunctive query.
  - 17. The method of claim 1 wherein the result information includes URLs of pages indexed by the second search engine.
  - 18. The method of claim 17 wherein the URLs of the pages indexed and the particular address identifying the sample pages are normalized before the comparing.
  - 19. The method of claim 17 wherein the result information being compared is the content of the sample page, and the content of the pages indexed by the second search engine.
  - 20. The method of claim 17 wherein the result information includes host names.
  - 21. The method of claim 1 wherein dynamic and impoverished pages are discarded before the comparing.
  - 22. The method of claim 1 wherein privileged access is provided to the first search engine.
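
The random queries of claims 6 through 11 can be illustrated with a short sketch. The function names, the `OR`/`AND` string syntax, and the choice of a window of neighbours in frequency-sorted order as the meaning of "substantially similar" frequencies are all assumptions made for illustration. The frequency thresholds of the conjunctive case are not modelled.

```python
import random


def disjunctive_query(freq, size, rng):
    """OR together `size` lexicon words whose frequencies are close to each other."""
    ranked = sorted(freq, key=lambda w: (freq[w], w))   # sort words by frequency
    start = rng.randrange(len(ranked) - size + 1)       # random window of neighbours
    return " OR ".join(ranked[start:start + size])


def conjunctive_query(freq, rng):
    """AND together a random pair of distinct lexicon words."""
    w1, w2 = rng.sample(sorted(freq), 2)
    return f"{w1} AND {w2}"
```

Here `freq` is a hypothetical mapping from each unique lexicon word to its frequency in the training set of pages.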

23. A computerized method for comparing search engine indices and estimating coverage of at least one search engine, each search engine maintaining an index of words of pages located at specific addresses in a network, comprising:
- generating a random query, the random query being a logical combination of words found in a lexicon of words;
  
  submitting the random query to the first search engine;
  
  receiving a set of URLs in response to the random query;
  
  randomly selecting a particular URL identifying a sample page;
  
  generating a strong query for the sample page;
  
  submitting the strong query to a second search engine; and
  
  comparing result information received in response to the strong query to determine if the second search engine has indexed the sample page;
  
  wherein ranking bias and query bias are compensated, the ranking bias being compensated by comparing result information for any page indexed by the second search engine that is responsive to the strong query, the query bias being compensated by probabilistically selecting the particular address identifying the sample page.
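
The ranking-bias compensation recited above compares against any page returned for the strong query, not only the top-ranked one. Combined with the URL normalization of claim 18, the check might look like the sketch below. The normalization rules (lowercasing the host, dropping a leading `www.`, the default port, a trailing `index.html`, and any query string or fragment) are assumptions for illustration; the claims do not enumerate them.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to a comparable form (assumed rules, for illustration only)."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]                       # treat www.example.com as example.com
    if parts.port and parts.port != 80:
        host = f"{host}:{parts.port}"         # keep only non-default ports
    path = parts.path or "/"
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]     # /dir/index.html -> /dir/
    return host + path                        # query string and fragment are dropped


def is_indexed(sample_url, result_urls):
    """True if any returned URL, at any rank, matches the sample page's URL."""
    target = normalize_url(sample_url)
    return any(normalize_url(u) == target for u in result_urls)
```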

24. A computer program product readable by a computing system and encoding a computer program of instructions for executing a computer process for comparing search engine indices and estimating coverage of at least one search engine, each search engine maintaining an index of words of pages located at specific addresses in a network, wherein the estimate of coverage indicates the relative sizes of the indices of the first and second search engine, and the relative amount of overlap between the first and second search engine, said computer process comprising:
- generating a random query, the random query being a logical combination of words found in a lexicon of words;
  
  submitting the random query to the first search engine;
  
  receiving a set of URLs in response to the random query;
  
  randomly selecting a particular URL identifying a sample page;
  
  generating a strong query for the sample page;
  
  submitting the strong query to a second search engine;
  
  comparing result information received in response to the strong query to determine if the second search engine has indexed the sample page; and
  
  estimating the relative sizes of the indices of the first and second search engines by dividing a fraction of a first set of pages sampled from the second search engine that are contained in the first search engine by a fraction of a second set of pages sampled from the first search engine that are contained in the second search engine.
- View Dependent Claims (25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41)
- - 25. The computer program product of claim 24 wherein the relative amount of overlap of the indices of the first and second search engine is estimated by the computer process by computing a fraction of a set of pages sampled from the second search engine that are contained in the first search engine.
  - 26. The computer program product of claim 24 wherein the content of the lexicon of words is expressed in a particular language.
  - 27. The computer program product of claim 24 wherein the content of the training set of pages relates to a particular context domain.
  - 28. The computer program product of claim 24 wherein the lexicon of words is based on a training set of pages, and wherein the training set of pages represents pages of interest for which coverage is being estimated.
  - 29. The computer program product of claim 24 wherein the lexicon of words is based on a training set of pages, and the process further comprises determining the frequencies of unique words in the lexicon.
  - 30. The computer program product of claim 29 wherein the random query combines random words selected from the lexicon with a logical operator.
  - 31. The computer program product of claim 29 wherein the random query is a disjunctive query.
  - 32. The computer program product of claim 31 wherein the disjunctive query combines a set of words using OR operators, the set of words having a predetermined size.
  - 33. The computer program product of claim 32 wherein the words of the set have relative frequencies that are substantially similar.
  - 34. The computer program product of claim 24 wherein the network is the World Wide Web and the process further including:
  - 35. The computer program product of claim 34 wherein the strong query is a disjunction of a first and second conjunctive query.
  - 36. The computer program product of claim 24 wherein the result information includes URLs of pages indexed by the second search engine.
  - 37. The computer program product of claim 36 wherein the URLs of the pages indexed and the particular address identifying the sample pages are normalized before the comparing.
  - 38. The computer program product of claim 36 wherein the result information being compared by the process is the content of the sample page, and the content of the pages indexed by the second search engine.
  - 39. The computer program product of claim 36 wherein the result information includes host names.
  - 40. The computer program product of claim 24 wherein the process discards dynamic and impoverished pages before the comparing.
  - 41. The computer program product of claim 24 wherein privileged access is provided to the first search engine during the process.
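
Claims 16 and 35 describe the strong query as a disjunction of two conjunctive queries. One possible construction is sketched below. The claims do not say which words go into each conjunction, so picking the rarest words of the page by lexicon frequency and dealing them alternately into two groups is purely an assumption, as are the function name and the query syntax.

```python
def disjunctive_strong_query(page_words, freq, k=2):
    """Build '(w1 AND w3) OR (w2 AND w4)' from the 2*k rarest words on the page."""
    # Words missing from the lexicon get frequency 0, so they rank as rarest.
    rare = sorted(set(page_words), key=lambda w: (freq.get(w, 0), w))[: 2 * k]
    first, second = rare[0::2], rare[1::2]    # deal words alternately into two groups
    groups = [g for g in (first, second) if g]
    return " OR ".join("(" + " AND ".join(g) + ")" for g in groups)
```

Using two conjunctions gives the second search engine two chances to match the sample page if one of the chosen words is missing from its copy of the page.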

42. A computer program product readable by a computing system and encoding a computer program of instructions for executing a computer process for comparing search engine indices and estimating coverage of at least one search engine, each search engine associated with an index of words of pages located at specific addresses in a network, said computer process comprising:
- generating a random query, the random query being a logical combination of words found in a lexicon of words;
  
  submitting the random query to the first search engine;
  
  receiving a set of URLs in response to the random query;
  
  randomly selecting a particular URL identifying a sample page;
  
  generating a strong query for the sample page;
  
  submitting the strong query to a second search engine; and
  
  comparing result information received in response to the strong query to determine if the second search engine has indexed the sample page;
  
  wherein ranking bias and query bias are compensated, the ranking bias being compensated by comparing result information for any page indexed by the second search engine that is responsive to the strong query, the query bias being compensated by probabilistically selecting the particular address identifying the sample page.

43. A system for comparing search engine indices and estimating coverage of at least one search engine, each search engine associated with an index of words of pages located at specific addresses in a network, wherein the estimate of coverage indicates the relative sizes of the indices of the first and second search engine, and the relative amount of overlap between the first and second search engine, comprising:
- a processor for generating a random query, the random query being a logical combination of words found in a lexicon of words;
  
  a communications device for submitting the random query to the first search engine;
  
  the communications device further for receiving a set of URLs in response to the random query;
  
  the processor further for randomly selecting a particular URL identifying a sample page;
  
  the processor further for generating a strong query for the sample page;
  
  the communications device further for submitting the strong query to a second search engine;
  
  the processor further for comparing result information received in response to the strong query to determine if the second search engine has indexed the sample page; and
  
  the processor further for estimating the relative sizes of the indices of the first and second search engines by dividing a fraction of a first set of pages sampled from the second search engine that are contained in the first search engine by a fraction of a second set of pages sampled from the first search engine that are contained in the second search engine.
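
A short derivation shows why the quotient in the estimating step yields a relative size. The symbols are introduced here for illustration: let $A$ and $B$ be the sets of pages indexed by the first and second search engines. Assuming pages are sampled roughly uniformly from each index and that the indices overlap, the fraction of pages sampled from the second engine that are contained in the first approximates $|A \cap B| / |B|$, and the fraction sampled from the first that are contained in the second approximates $|A \cap B| / |A|$.

```latex
\frac{\;|A \cap B| \,/\, |B|\;}{\;|A \cap B| \,/\, |A|\;}
  \;=\; \frac{|A \cap B|}{|B|} \cdot \frac{|A|}{|A \cap B|}
  \;=\; \frac{|A|}{|B|}
```

The unknown size of the overlap cancels, so the ratio of the two measured fractions estimates the size of the first index relative to the second without knowing either size directly.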

44. A method for comparing search engine indices, comprising:
- generating an initial query;
  
  submitting the initial query to a first search engine;
  
  receiving a set of URLs in response to the initial query;
  
  selecting a particular URL identifying a sample page;
  
  generating a strong query for the sample page;
  
  submitting the strong query to a second search engine;
  
  comparing result information received in response to the strong query to determine if the second search engine has indexed the sample page; and
  
  estimating the relative sizes of the indices of the first and second search engines by dividing a fraction of a first set of pages sampled from the second search engine that are contained in the first search engine by a fraction of a second set of pages sampled from the first search engine that are contained in the second search engine.
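
The estimating step can be computed directly from the per-sample outcomes. In this sketch, each list holds one boolean per sampled page recording whether the other engine was found to index it. The function and parameter names are hypothetical.

```python
def fraction_contained(outcomes):
    """Fraction of sampled pages that the other engine was found to index."""
    if not outcomes:
        raise ValueError("no samples collected")
    return sum(outcomes) / len(outcomes)


def relative_size(second_in_first, first_in_second):
    """Estimate size(first index) / size(second index) from two sets of samples."""
    numerator = fraction_contained(second_in_first)    # sampled from second, found in first
    denominator = fraction_contained(first_in_second)  # sampled from first, found in second
    if denominator == 0:
        raise ValueError("no sampled page from the first engine was found in the second")
    return numerator / denominator
```

For example, if 8 of 10 pages sampled from the second engine are found in the first, and 4 of 10 pages sampled from the first are found in the second, the first index is estimated to be about twice the size of the second.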

45. A method for comparing search engine indices, comprising:
- generating an initial query;
  
  submitting the initial query to a first search engine;
  
  receiving a set of URLs in response to the initial query;
  
  selecting a particular URL identifying a sample page;
  
  generating a strong query for the sample page;
  
  submitting the strong query to a second search engine; and
  
  comparing result information received in response to the strong query to determine if the second search engine has indexed the sample page;
  
  wherein ranking bias and query bias are compensated, the ranking bias being compensated by comparing result information for any page indexed by the second search engine that is responsive to the strong query, the query bias being compensated by probabilistically selecting the particular address identifying the sample page.

Current Assignee: R2 Solutions LLC (Acacia Research Corporation)
Original Assignee: Overture Services Incorporated (Apollo Global Management, Inc.)
Inventors: Broder, Andrei Zary; Bharat, Krishna Asur
Primary Examiner(s): Pardo, Thuy N.
Application Number: US09/709,003
Time in Patent Office: 1,231 Days
Field of Search: 707/3, 707/4, 707/5, 707/10, 707/103, 707/6
US Class Current: 1/1
CPC Class Codes:
G06F 16/951   Indexing; Web crawling tech...
G06F 16/9532   Query formulation
G06F 16/9538   Presentation of query results
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99936   Pattern matching access