Method and apparatus for retrieving and indexing hidden pages
First Claim
1. A computer-implemented method of downloading Hidden Web pages comprising:
a) selecting a query term;
b) issuing a query to a site-specific search interface containing Hidden Web pages;
c) acquiring a results index;
d) downloading the Hidden Web pages from the results index;
e) identifying a plurality of potential query terms from the downloaded Hidden Web pages;
f) estimating the efficiency of each potential query term based on a ratio of the number of new pages returned for a particular query to the cost of issuing the particular query, wherein the cost of issuing the particular query is equal to $c_q + c_r P(q_i) + c_d P_{new}(q_i)$, where $P(q_i)$ represents the fraction of pages returned for a particular query ($q_i$) and $P_{new}(q_i)$ represents the fraction of new pages returned for a particular query ($q_i$), and where $c_q$ represents the cost of submitting the particular query, $c_r$ represents the cost of retrieving a results index page, and $c_d$ represents the cost for downloading a matching document;
g) selecting a next query term from the plurality of potential query terms, wherein the next selected query term has the greatest efficiency; and
h) issuing a next query to the site-specific search interface using the next query term.
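Steps f) and g) can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `CostModel`, `query_cost`, `efficiency`, and `select_next_term` are hypothetical, the estimates of P(qi) and Pnew(qi) are assumed to be supplied by some other component, and the numerator uses the fraction of new pages rather than their count, which ranks terms identically for a site of fixed size.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class CostModel:
    """Per-unit costs from the claim (illustrative container, not from the source)."""
    c_q: float  # cost of submitting one query
    c_r: float  # cost of retrieving a results index page
    c_d: float  # cost of downloading one matching document


def query_cost(p: float, p_new: float, costs: CostModel) -> float:
    """Cost of issuing a query: c_q + c_r * P(q_i) + c_d * P_new(q_i)."""
    return costs.c_q + costs.c_r * p + costs.c_d * p_new


def efficiency(p: float, p_new: float, costs: CostModel) -> float:
    """Ratio of new pages returned to the cost of the query.

    Assumes c_q > 0 so the cost is never zero.
    """
    return p_new / query_cost(p, p_new, costs)


def select_next_term(
    estimates: Dict[str, Tuple[float, float]], costs: CostModel
) -> str:
    """Return the candidate term with the greatest estimated efficiency.

    `estimates` maps each candidate term to its (P, P_new) estimate.
    """
    return max(estimates, key=lambda term: efficiency(*estimates[term], costs))


# Made-up numbers, purely for illustration.
costs = CostModel(c_q=1.0, c_r=10.0, c_d=10.0)
estimates = {
    "web": (0.5, 0.10),
    "crawler": (0.2, 0.15),
    "index": (0.05, 0.05),
}
next_term = select_next_term(estimates, costs)
```

With these made-up numbers, a term that returns many pages overall but few new ones ("web") scores lower than a term with a smaller but mostly new result set ("crawler"), because retrieving index pages for already-downloaded documents adds cost without adding coverage.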
Abstract
A method and system for autonomously downloading and indexing Hidden Web pages from Websites includes the steps of selecting a query term and issuing a query to a site-specific search interface containing Hidden Web pages. A results index is then acquired and the Hidden Web pages are downloaded from the results index. A plurality of potential query terms are then identified from the downloaded Hidden Web pages. The efficiency of each potential query term is then estimated and a next query term is selected from the plurality of potential query terms, wherein the next selected query term has the greatest efficiency. A next query is then issued to the site-specific search interface using the next selected query term. The process is repeated until all or most of the Hidden Web pages are discovered.
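The iterative process described in the abstract can be sketched in Python as follows. This is a hedged illustration only: `SITE`, `search`, `fetch`, `doc_frequency`, and `crawl` are hypothetical names, the in-memory `SITE` stands in for a real Hidden Web site, and `doc_frequency` is a deliberately naive placeholder for the efficiency estimate. Because a crawler cannot observe when all pages have been discovered, the sketch assumes a query budget and stops early when no unused candidate terms remain.

```python
from typing import Callable, Dict, List, Set

# Toy stand-in for a Hidden Web site: page id -> page text (hypothetical data).
SITE: Dict[str, str] = {
    "p1": "hidden web crawler",
    "p2": "web search index",
    "p3": "crawler index cost",
    "p4": "query cost model",
}


def search(term: str) -> List[str]:
    """Stand-in for the site-specific search interface; returns a results index."""
    return [pid for pid, text in SITE.items() if term in text.split()]


def fetch(page_id: str) -> str:
    """Stand-in for downloading one page listed in the results index."""
    return SITE[page_id]


def doc_frequency(term: str, downloaded: Dict[str, str]) -> float:
    """Placeholder efficiency estimate (an assumption, not from the source):
    the fraction of already-downloaded pages that contain the term."""
    if not downloaded:
        return 0.0
    hits = sum(term in text.split() for text in downloaded.values())
    return hits / len(downloaded)


def crawl(
    seed_term: str,
    search: Callable[[str], List[str]],
    fetch: Callable[[str], str],
    estimate_efficiency: Callable[[str, Dict[str, str]], float],
    max_queries: int = 20,
) -> Dict[str, str]:
    """Issue queries greedily, choosing each next term from downloaded pages."""
    downloaded: Dict[str, str] = {}
    issued: Set[str] = set()
    term = seed_term
    for _ in range(max_queries):
        issued.add(term)
        for page_id in search(term):  # acquire the results index
            if page_id not in downloaded:  # download only new pages
                downloaded[page_id] = fetch(page_id)
        # Potential query terms: every word seen so far that was not yet issued.
        candidates = {w for text in downloaded.values() for w in text.split()}
        candidates -= issued
        if not candidates:
            break
        # sorted() makes tie-breaking deterministic (alphabetical order).
        term = max(
            sorted(candidates),
            key=lambda w: estimate_efficiency(w, downloaded),
        )
    return downloaded


pages = crawl("web", search, fetch, doc_frequency)
```

Passing the estimator as a parameter keeps the loop independent of how efficiency is computed, so a cost-based estimate could be substituted for the placeholder without changing `crawl`.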
18 Claims
1. A computer-implemented method of downloading Hidden Web pages comprising:
a) selecting a query term;
b) issuing a query to a site-specific search interface containing Hidden Web pages;
c) acquiring a results index;
d) downloading the Hidden Web pages from the results index;
e) identifying a plurality of potential query terms from the downloaded Hidden Web pages;
f) estimating the efficiency of each potential query term based on a ratio of the number of new pages returned for a particular query to the cost of issuing the particular query, wherein the cost of issuing the particular query is equal to $c_q + c_r P(q_i) + c_d P_{new}(q_i)$, where $P(q_i)$ represents the fraction of pages returned for a particular query ($q_i$) and $P_{new}(q_i)$ represents the fraction of new pages returned for a particular query ($q_i$), and where $c_q$ represents the cost of submitting the particular query, $c_r$ represents the cost of retrieving a results index page, and $c_d$ represents the cost for downloading a matching document;
g) selecting a next query term from the plurality of potential query terms, wherein the next selected query term has the greatest efficiency; and
h) issuing a next query to the site-specific search interface using the next query term.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
16. A system for downloading Hidden Web pages comprising:
a web crawler for issuing a plurality of queries to a site-specific search interface containing Hidden Web pages and downloading the Hidden Web pages, the Hidden Web pages containing a plurality of potential query terms; and wherein the web crawler is configured to;
select a query term;
issue a query to a site-specific search interface containing Hidden Web pages;
acquire a results index;
download the Hidden Web pages from the results index;
identify a plurality of potential query terms from the downloaded Hidden Web pages;
a computer configured to apply an algorithm to estimate the efficiency of each potential query term, wherein for each query, the most efficient query term is issued to the site-specific search interface by the web crawler;
the algorithm comprising;
estimating the efficiency of each potential query term based on a ratio of the number of new pages returned for a particular query to the cost of issuing the particular query, wherein the cost of issuing the particular query is equal to $c_q + c_r P(q_i) + c_d P_{new}(q_i)$, where $P(q_i)$ represents the fraction of pages returned for a particular query ($q_i$) and $P_{new}(q_i)$ represents the fraction of new pages returned for a particular query ($q_i$), and where $c_q$ represents the cost of submitting the particular query, $c_r$ represents the cost of retrieving a results index page, and $c_d$ represents the cost for downloading a matching document;
selecting a next query term from the plurality of potential query terms, wherein the next selected query term has the greatest efficiency; and
issuing a next query to the site-specific search interface using the next query term.
Dependent claims: 17, 18
Specification