Method and apparatus for retrieving and indexing hidden pages

US 7,685,112 B2
Filed: 05/27/2005
Issued: 03/23/2010
Est. Priority Date: 06/17/2004
Status: Expired due to Fees

Abstract

A method and system for autonomously downloading and indexing Hidden Web pages from Websites includes the steps of selecting a query term and issuing a query to a site-specific search interface containing Hidden Web pages. A results index is then acquired and the Hidden Web pages are downloaded from the results index. A plurality of potential query terms are then identified from the downloaded Hidden Web pages. The efficiency of each potential query term is then estimated and a next query term is selected from the plurality of potential query terms, wherein the next selected query term has the greatest efficiency. The next selected query term is then issued to the site-specific search interface using the next query term. The process is repeated until all or most of the Hidden Web pages are discovered.
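The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the in-memory `SITE`, the `search` function, and the `crawl` function are hypothetical stand-ins for a real site-specific search interface, and the scoring step uses a plain document-frequency placeholder rather than the cost-based efficiency estimate recited in the claims.

```python
from collections import Counter

# Hypothetical in-memory "site": page id -> page text. A real crawler would
# submit requests to the site's search form instead.
SITE = {
    1: "apple banana",
    2: "banana cherry",
    3: "cherry date",
    4: "date elderberry",
    5: "fig",
}


def search(term):
    """Stand-in for the site-specific search interface: ids of matching pages."""
    return [pid for pid, text in SITE.items() if term in text.split()]


def crawl(seed_term, max_queries=10):
    """Issue queries one at a time, picking each next term from downloaded pages."""
    downloaded = {}   # page id -> text of every page fetched so far
    issued = []       # query terms already used
    term = seed_term  # selecting the first query term
    while term is not None and len(issued) < max_queries:
        issued.append(term)
        results = search(term)      # issue the query, acquire the results index
        for pid in results:         # download pages listed in the results index
            downloaded.setdefault(pid, SITE[pid])
        # Candidate terms: words seen in downloaded pages and not yet issued.
        counts = Counter(
            word for text in downloaded.values() for word in set(text.split())
        )
        candidates = [word for word in counts if word not in issued]
        # Placeholder score (an assumption): number of downloaded pages
        # containing the term, with ties broken alphabetically.
        term = max(candidates, key=lambda w: (counts[w], w), default=None)
    return downloaded, issued


downloaded, issued = crawl("apple")
```

On this toy site, starting from "apple" reaches pages 1 through 4 but never page 5, whose only term appears on no other page. That matches the abstract's wording that the process discovers "all or most" of the hidden pages.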

18 Claims

1. A computer-implemented method of downloading Hidden Web pages comprising:
- a) selecting a query term;
  
  b) issuing a query to a site-specific search interface containing Hidden Web pages;
  
  c) acquiring a results index;
  
  d) downloading the Hidden Web pages from the results index;
  
  e) identifying a plurality of potential query terms from the downloaded Hidden Web pages;
  
  f) estimating the efficiency of each potential query term based on a ratio of the number of new pages returned for a particular query to the cost of issuing the particular query wherein the cost of issuing the particular query is equal to c_q + c_r P(q_i) + c_d P_new(q_i) where P(q_i) represents the fraction of pages returned for a particular query (q_i) and P_new(q_i) represents the fraction of new pages returned for a particular query (q_i), and where c_q represents the cost of submitting the particular query, c_r represents the cost of retrieving a results index page, and c_d represents the cost for downloading a matching document;
  
  g) selecting a next query term from the plurality of potential query terms, wherein the next selected query term has the greatest efficiency; and
  
  h) issuing a next query to the site-specific search interface using the next query term.
  - 2. The method of claim 1, wherein steps (c) through (h) are repeated a plurality of times.
  - 3. The method of claim 1, wherein steps (c) through (h) are repeated a plurality of times until all the Hidden Web pages are downloaded.
  - 4. The method of claim 1, wherein steps (c) through (h) are repeated a plurality of times until the number of new documents returned for one or more queries falls below a pre-set threshold.
  - 5. The method of claim 1, wherein the query term selected in step (a) is selected from a plurality of terms located on the Web page containing the site-specific search interface.
  - 6. The method of claim 1, further comprising the steps of creating an index of each downloaded Hidden Web page.
  - 7. The method of claim 1, wherein the efficiency is proportional to the number of new pages returned for a particular query.
  - 8. The method of claim 1, wherein the number of new pages returned (P_new(q_i)) for a particular query (q_i) is equal to P(q_i) − P(q_1 ∪ . . . ∪ q_{i−1}) P(q_i | q_1 ∪ . . . ∪ q_{i−1}) where P(q_i) represents the fraction of pages returned for a particular query (q_i).
  - 9. The method of claim 1, wherein the site-specific search interface is a single-attribute search interface.
  - 10. The method of claim 1, wherein the site-specific search interface is a multi-attribute search interface.
  - 11. The method of claim 10, wherein for each attribute of the multi-attribute search interface, a plurality of potential query terms are identified from the downloaded Hidden Web pages.
  - 12. The method of claim 1, wherein in step (d), the Hidden Web pages are downloaded from a plurality of partial results indexes.
  - 13. The method of claim 1, wherein Hidden Web pages are obtained from a plurality of different Websites having Hidden Web pages.
  - 14. The method of claim 1, wherein the method is implemented using a crawler software program.
  - 15. The method of claim 1, wherein step (f) comprises updating a query statistics table with a number representative of how many times a query term q_i appears within Web pages downloaded from q_1, . . . , q_{i−1}.
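
The efficiency estimate in claims 1, 8, and 15 can be illustrated with a small sketch. Several things here are assumptions rather than anything the claims specify: the function names, the use of a per-page (document-frequency) count as the query statistics table, a known `site_size` used to derive P(q_1 ∪ . . . ∪ q_{i−1}) from the number of pages already downloaded, caller-supplied guesses for P(q_i), and the example cost weights.

```python
from collections import Counter


def update_stats(stats, page_text):
    """Query statistics table: count, per term, the downloaded pages containing it."""
    stats.update(set(page_text.lower().split()))


def estimate_p_new(p_q, p_prev, p_q_given_prev):
    """P_new(q_i) = P(q_i) - P(q_1 u ... u q_{i-1}) * P(q_i | q_1 u ... u q_{i-1})."""
    return p_q - p_prev * p_q_given_prev


def query_cost(p_q, p_new, c_q, c_r, c_d):
    """Cost(q_i) = c_q + c_r * P(q_i) + c_d * P_new(q_i)."""
    return c_q + c_r * p_q + c_d * p_new


def efficiency(term, stats, n_downloaded, site_size, p_q, costs):
    """Ratio of the estimated new-page fraction to the query cost.

    Assumes n_downloaded > 0 and that site_size is known.
    """
    p_prev = n_downloaded / site_size       # fraction of the site already downloaded
    p_given = stats[term] / n_downloaded    # read from the statistics table
    p_new = estimate_p_new(p_q, p_prev, p_given)
    return p_new / query_cost(p_q, p_new, *costs)


# Four pages downloaded so far from a site assumed to hold 20 pages.
pages = ["data index", "data query", "data crawler", "index page"]
stats = Counter()
for text in pages:
    update_stats(stats, text)

p_q_guess = {"data": 0.5, "crawler": 0.3}   # assumed values of P(q_i)
costs = (1.0, 10.0, 100.0)                  # assumed (c_q, c_r, c_d)

best = max(
    p_q_guess,
    key=lambda t: efficiency(t, stats, len(pages), 20, p_q_guess[t], costs),
)
```

With these made-up numbers, "data" has the larger estimated fraction of new pages (0.35 against 0.25), yet "crawler" is selected because its lower cost gives it the higher ratio. This is the trade-off the efficiency measure is meant to capture.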

16. A system for downloading Hidden Web pages comprising:
- a web crawler for issuing a plurality of queries to a site-specific search interface containing Hidden Web pages and downloading the Hidden Web pages, the Hidden Web pages containing a plurality of potential query terms; and
  
  wherein the web crawler is configured to;
  
  select a query term;
  
  issue a query to a site-specific search interface containing Hidden Web pages;
  
  acquire a results index;
  
  download the Hidden Web pages from the results index;
  
  identify a plurality of potential query terms from the downloaded Hidden Web pages;
  
  a computer configured to apply an algorithm to estimate the efficiency of each potential query term, wherein for each query, the most efficient query term is issued to the site-specific search interface by the web crawler;
  
  the algorithm comprising;
  
  estimating the efficiency of each potential query term based on a ratio of the number of new pages returned for a particular query to the cost of issuing the particular query wherein the cost of issuing the particular query is equal to c_q + c_r P(q_i) + c_d P_new(q_i) where P(q_i) represents the fraction of pages returned for a particular query (q_i) and P_new(q_i) represents the fraction of new pages returned for a particular query (q_i), and where c_q represents the cost of submitting the particular query, c_r represents the cost of retrieving a results index page, and c_d represents the cost for downloading a matching document;
  
  selecting a next query term from the plurality of potential query terms, wherein the next selected query term has the greatest efficiency; and
  
  issuing a next query to the site-specific search interface using the next query term.
  - 17. The system of claim 16, wherein the system stores an index of each downloaded Hidden Web page.
  - 18. The system of claim 17, further comprising an Internet search engine having associated therewith an index of Web pages, wherein at least some of the indexed Web pages are Hidden Web pages.

Current Assignee
Regents of the University of California (University of California)
Original Assignee
Regents of the University of California (University of California)
Inventors
Petros Zerfos, Junghoo Cho, Alexandros Ntoulas
Primary Examiner(s)
Vo; Tim T.
Assistant Examiner(s)
Tran; Bao G

Application Number

US11/570,330
Publication Number

US 20080097958A1
Time in Patent Office

1,761 Days
Field of Search

707/3, 707/5, 708/200, 708/422-446, 708/490, 715/255, 709/238
US Class Current

707/715
CPC Class Codes

G06F 16/951   Indexing; Web crawling tech...

G06F 16/9532   Query formulation

G06F 16/9538   Presentation of query results
